Challenges in funding and developing genomic software: roots and remedies

The computer software used for genomic analysis has become a crucial component of the infrastructure for life sciences. However, genomic software is still typically developed in an ad hoc manner, with inadequate funding, and by academic researchers not trained in software development, at substantial costs to the research community. I examine the roots of the incongruity between the importance of and the degree of investment in genomic software, and I suggest several potential remedies for current problems. As genomics continues to grow, new strategies for funding and developing the software that powers the field will become increasingly essential.

Genomic software is no less important to the infrastructure of biological research than large shared data sets or public databases, yet the model for funding and developing computer software differs substantially. Most widely used genomic software is developed by independent investigators working in academic or not-for-profit institutions with support from government grants. This software is generally freely available to the community, typically with no subscription or licensing fees and nonrestrictive terms of use. At the same time, it is often meagerly funded, unreliable, hard to use, poorly documented, and/or poorly supported. How did we, as a community, arrive at this odd situation? Why is scientific software supported differently from other forms of scientific infrastructure? Why are adequate funds not set aside for this important work?
In this article, I offer my perspective on the unique problem of funding and developing software for genomics, based on my 25 years in the field as a developer and user of software, a professional programmer and principal investigator, an applicant for and reviewer of grant proposals, and an employee of government, university, and private research institutions. I first examine what makes genomic software development unusual and how the field has come to be the way it is. Overall, I argue that, despite some important strengths of our current model for software development, we as a community have "painted ourselves into a corner" in terms of developing robust, well-engineered software and are paying for it; we are, in a sense, addicted to free software. Finally, I suggest some possible remedies that attempt to strike a balance between addressing important deficiencies in the current model and maintaining its core strengths. My discussion of these topics necessarily has a US bias, but I believe that many of my points are internationally valid. Also, although this article focuses on genomics, similar trends occur in other areas of computational biology, such as structural biology and proteomics, as well as in some other areas of scientific software development.
Fig. 1 (caption, partially recovered) [...] Siepel), as one of the millions of children who were introduced to home computers during the 1980s and 1990s, some of whom would go on to write much of the software that powers genomics today; floppy disk for PAUP version 3.1.1, ©1993; Sun Microsystems SPARCstation 1 with Mosaic web browser faintly visible on screen, 1994 [3]; screen shot from the MASE alignment program [4]. d Prof. David Haussler of UC Santa Cruz with the original Dell computer cluster that his team used to assemble the human genome, 2000. Photo (c) UC Santa Cruz, used with permission

Software for genomics is critical to the research infrastructure for the life sciences
During the past 25 years, genomic software development has grown from an obscure cottage industry to an essential part of the infrastructure of biological research. Researchers across the globe rely on computational tools for read mapping, genome assembly, multiple alignment, phylogenetics, population genomics, and visualization of genomic data, among many other applications. Importantly, these tools are no longer used only by genomic specialists, but across all the life sciences, including disparate fields such as ecology and evolution, molecular and cell biology, clinical genetics, plant breeding, biophysics, and bioengineering. To take one measure of impact, the papers describing popular genomic software tools are among the most highly cited publications in the scientific literature [13,14]. For example, Table 1 lists 66 well-known genomic software tools, from various application areas, each of which has been cited at least 2000 times and, in some cases, many tens of thousands of times. (For reference, only about 1 in 100,000 scientific papers is cited more than 2000 times, based on estimates from Open Academic Graph [152], a bibliographic database of ~700 million publications; the analysis was restricted to biology-related publications.) Indeed, nowadays it is rare to encounter a scientific publication that makes use of DNA, RNA, or protein sequences but does not reference one or more tools of this type.
Because the reach of genomic software is so vast, it is difficult to measure its economic importance. Nevertheless, the US government spends at least ~$16 billion per year on basic research in the life sciences. (Spending on research and development by the US Federal government was estimated at $118 billion in 2017, of which $32 billion was dedicated to basic research. The life sciences account for approximately half of all spending, suggesting that approximately $16 billion is spent on basic life sciences research [153].) If even 10% of these funds are devoted to projects that rely in part on genomics and genomic software, which seems plausible, then this software would be instrumental in supporting more than a billion dollars per year in research. Furthermore, total R&D expenditures in the US are estimated at about four times those of the federal government, and scaling up to worldwide R&D expenditures requires about another factor of three. (Total R&D expenditures in the US, including those in the private sector and at other governmental levels, are estimated at about $500 billion annually. The US leads the world in spending on science, but China is not far behind, and several other countries, including Japan, Germany, South Korea, France, and the UK, also account for substantial amounts. Together, the top ten countries spend about $1.5 trillion per year on R&D [154].) Therefore, a rough calculation suggests that the worldwide research that depends, at least in part, on genomic software is likely to cost tens of billions of dollars annually.
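The rough calculation above can be written out step by step. The following is only a sketch of that arithmetic: every input is a rounded figure quoted in the text, the 10% share is the same plausibility assumption made above, and the constant and function names are illustrative rather than drawn from any source.

```python
# Back-of-envelope reproduction of the estimate in the text.
# All inputs are the rounded figures quoted above; the names are
# illustrative and the 10% share is an assumption, not a measurement.

US_FEDERAL_BASIC_RESEARCH_USD = 32e9  # 2017 federal basic research [153]
LIFE_SCIENCES_FRACTION = 0.5          # "approximately half of all spending"
GENOMICS_DEPENDENT_FRACTION = 0.10    # the "even 10%" assumption
US_TOTAL_OVER_FEDERAL = 4             # total US R&D relative to federal R&D
WORLD_OVER_US = 3                     # worldwide R&D relative to US R&D


def genomics_dependent_spending():
    """Return (US federal, US total, worldwide) estimates in USD per year."""
    federal_life_sciences = US_FEDERAL_BASIC_RESEARCH_USD * LIFE_SCIENCES_FRACTION
    federal = federal_life_sciences * GENOMICS_DEPENDENT_FRACTION
    us_total = federal * US_TOTAL_OVER_FEDERAL
    worldwide = us_total * WORLD_OVER_US
    return federal, us_total, worldwide


if __name__ == "__main__":
    labels = ("US federal", "US total", "Worldwide")
    for label, usd in zip(labels, genomics_dependent_spending()):
        print(f"{label}: ${usd / 1e9:.1f} billion per year")
```

With these inputs the chain gives roughly $1.6 billion, $6.4 billion, and $19 billion per year, respectively, which is the basis for the "more than a billion dollars" and "tens of billions of dollars" figures. The result scales linearly with the assumed 10% share, so it should be read as an order-of-magnitude estimate only.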

Software for genomics lacks a sustainable model for development and maintenance
Despite the overwhelming importance of genomic software, there is broad agreement among practitioners that the current model for its development has serious flaws. As noted above, most genomic software is developed by academic groups and funded by government grants, yet there are relatively few dedicated granting opportunities for genomic software development, and those that exist have relatively low levels of funding (see Table 2 for examples of recent and current funding programs). More typically, software development efforts in genomics have to be cloaked as research, for example, by describing the development of a software tool as a single aim or subaim of a research grant that is ostensibly focused on biological discovery. Additional funding for computational genomics has been made available through consortium projects, community databases, and browsers (for example, through U24, U41, and U54 opportunities at the US National Institutes of Health (NIH)), but the scope of this work is often quite constrained. Even though the most widely used tools have been developed by individual laboratories pursuing investigator-initiated work (Table 1), the funding for projects of this kind remains limited.
It is particularly difficult for academic researchers to obtain funding to extend, refine, or support software tools that have already proven to be widely useful to the community (for example, to improve performance, usability, robustness, or documentation, or to provide support for bug fixes and user questions). Except in a few special cases (for example, the Continued Development and Maintenance of Software opportunity previously offered by the NIH; Table 2), grant review panels tend to consider projects of this kind to be insufficiently novel to be supported either by dedicated research grants or as components of grants focused on biological discovery. One might expect that this type of engineering-focused work would more naturally be provided by the private sector, as with laboratory equipment or reagents, but despite decades of anticipation, there is still no thriving commercial market for genomics software. It is true that biotech and pharmaceutical companies often have their own in-house software development groups, but there seems to be, at best, weak demand for these products in the larger research community. Moreover, current trends point in the wrong direction, with several relevant grant opportunities having recently been discontinued (Table 2) and little indication of the emergence of a robust commercial market.
In part owing to these financial limitations, it is difficult to recruit and retain professional software developers in academic settings. Perhaps the most severe challenge is that the salary structures and budget models for academic institutions are generally not set up to accommodate six-figure salaries for workers who are not principal investigators or high-level administrators. As a result, software engineers typically accept a substantial salary reduction, sometimes of 50% or more, for the "privilege" of working in scientific research, as opposed to working for an established or start-up high-tech company. (The average salary for an entry-level software engineer in San Francisco, CA is about $110,000 [155].) Furthermore, academic research institutions often do not provide attractive career paths for software developers, offering them, for example, limited options for career advancement, few awards or accolades, and at most small communities of career-matched peers.
Instead, software development is often done by graduate students and postdoctoral researchers who have other priorities and, in many cases, no direct training in the area. Some principal investigators also devote considerable amounts of their own time to software development, but these activities must be balanced against many other responsibilities, including teaching, mentoring, writing scientific papers, and raising funds. Therefore, genomic software development tends to be done on a low budget, with many shortcuts around software engineering best practices.
Software packages developed in this way tend to be sparsely documented, difficult to install and use, restricted to specific platforms, and unreliable. In addition, the support and maintenance of released packages tend to be inconsistent, typically relying on email contact with busy and distracted principal investigators or trainees, and often effectively ending when a key student or postdoctoral researcher changes jobs. All these factors combine to produce a great deal of wasted time and frustration for the users of genomic software. They also contribute to severe challenges in reproducibility in genomic analysis. Indeed, a recent review of nearly 25,000 "omics" software resources published from 2000 to 2017 found that 26% were no longer accessible through URLs published in the corresponding papers [156]. Among accessible tools, 28% could not be installed, and another 21% were deemed "difficult to install." Taken together, these observations suggest that, as a field, we are on an unsustainable path for genomic software development. We do not set aside adequate funding for it, we fail to encourage and enforce good engineering practices, we have inadequate structures for recruiting and retaining the workers we need, and we continually pay a high price in reliability, usability, and performance.
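To see what the accessibility and installation percentages cited above imply in combination, they can be chained as follows. This is a sketch under a strong assumption: that the installation percentages apply uniformly to all accessible tools (the text does not say whether installation was tested on every tool or only on a sample). The names below are illustrative.

```python
# Combine the percentages quoted in the text [156].
# Assumption: the installation outcomes reported "among accessible tools"
# hold for the whole accessible population, not just a tested subset.

INACCESSIBLE = 0.26          # share of all tools whose published URL is dead
NOT_INSTALLABLE = 0.28       # share of accessible tools
DIFFICULT_TO_INSTALL = 0.21  # share of accessible tools


def tool_outcomes():
    """Return each outcome as a share of all published tools."""
    accessible = 1.0 - INACCESSIBLE
    easy = 1.0 - NOT_INSTALLABLE - DIFFICULT_TO_INSTALL
    return {
        "inaccessible": INACCESSIBLE,
        "not installable": accessible * NOT_INSTALLABLE,
        "difficult to install": accessible * DIFFICULT_TO_INSTALL,
        "installable without difficulty": accessible * easy,
    }


if __name__ == "__main__":
    for outcome, share in tool_outcomes().items():
        print(f"{outcome}: {share:.1%}")
```

Under that assumption, only about 38% of published tools would be both reachable and installable without difficulty, with the remainder split among dead links (26%), failed installations (about 21%), and difficult installations (about 16%).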

Other aspects of the infrastructure for genomics have alternative funding models
Interestingly, other aspects of the infrastructure for genomics have followed rather different models. DNA sequencing instruments, for example, have for decades been primarily developed and marketed by companies such as Applied Biosystems (now part of Thermo Fisher Scientific), Illumina, Oxford Nanopore Technologies and, until it was recently absorbed by Illumina, Pacific Biosciences of California. The microarray market was (and remains) similarly commercial, at least following an initial experimental phase, with companies such as Affymetrix (also now part of Thermo Fisher Scientific) and Agilent Technologies dominant. Laboratory equipment is provided by companies such as PerkinElmer, Bio-Rad Laboratories, and Becton Dickinson (BD), and computer hardware is provided by Intel, AMD, Apple, Microsoft, Dell, Samsung, Acer, Hewlett-Packard, and many others. These are areas of technology development with substantial "bricks and mortar" needs, including major manufacturing operations, and they address sufficiently large markets with sufficiently high profit margins such that free enterprise is able to meet the needs of scientific research. Despite the general feeling of corporate skepticism among academic scientists, these companies are viewed, by and large, as positive forces for innovation that are complementary to academic science. By contrast, large, widely used public databases, such as GenBank, EMBL-Bank, and PDB, tend to be directly supported by government agencies or by long-standing government grants. Even smaller database projects located at universities or private research institutes, such as FlyBase, the Saccharomyces Genome Database (SGD), or the Mouse Genome Database (MGD), tend to have substantial, repeatedly renewed government grants.
Thus, it seems that there is an implicit understanding in genomics that the management of large public data sets should be centralized and government-supported, while the hardware and instruments used for generating and analyzing data should be provided by the free market. Why is software different from both?

Roots: dawn of the modern era for computational genomics
When I started working in computational genomics in 1994, as a research assistant at Los Alamos National Laboratory (LANL), the software landscape in the field had a distinctly different feel. Free software was much less plentiful and co-existed symbiotically with widely used commercial products. In the HIV Sequence Database group in which I worked, we had access to purchased copies of MacClade [35], PAUP [4], and the Genetics Computer Group's (GCG) Wisconsin Package, alongside free software such as MASE [4], BLAST [21], and PHYLIP [34] (Fig. 1c). In addition, "serious" computational scientists at the time generally used expensive proprietary Unix systems rather than commodity hardware. Linux was still a hobbyist's operating system and largely invisible in research settings. Similarly, computer clusters were not yet in wide use; instead, universities and research institutes made heavy use of standalone supercomputers for demanding computations. The World Wide Web was in its infancy and had not yet become essential for day-to-day research.
The field would soon change dramatically. In the mid- and late 1990s, the Internet revolutionized software development and, along with it, computational genomics. The rapid growth of the Internet catalyzed the Open Source Software (OSS) and Free Software movements [157], and the widespread adoption of GNU/Linux operating systems. These platforms, in turn, led to a major shift in research computing away from proprietary Unix systems and toward low-cost Linux systems running on commodity hardware. Computer clusters built from inexpensive components rapidly replaced high-end supercomputers (Fig. 1d). At the same time, the Internet made it much easier, cheaper, and faster to ship software: download buttons replaced telephone orders of floppy disks or CDs. This easy and prolific dissemination of code on the Internet fit well with the ethos of scientific research, which tends to favor openness and shared resources and to view profit-making with suspicion. Soon, there was an explosion of free and open-source software for genomics.
In my view, these trends were intensified by a generational shift in the research science community. By the mid-1990s, the ranks of PhD students and young scientists were swelling with a new cohort that had learned to program computers as children, during the PC boom of the 1980s. These young, computer-savvy researchers saw little point in paying for software that they could write themselves. In addition, many found a subversive excitement in producing their own software and releasing the code, free to anyone, on the emerging Internet. In this brave new world, smart kids could go from an idea to a working implementation to worldwide distribution within days, with no need for investors, marketing teams, or salespeople. Young scientists programmed madly in research laboratories and coffee shops, often at odd hours, communicating by email in a new ultranetworked world, while some of their bosses still occupied a musty world of paper journals, written letters, and landline phones. This generational shift occurred across all of science and engineering, but it was perhaps especially pronounced in biology, where the previous generation, except for a few influential pioneers, had been generally slow to embrace computing technologies.
Whatever its cause, this creative and entrepreneurial spirit helped to generate the rich landscape of free, academic software that we now enjoy in genomics. The "artisanal" model of software development in genomics also has had the benefit of enabling rapid development of new methods, a close coupling of software development and research science, and a kind of esprit de corps among bioinformatic tool developers around the world. Nevertheless, some of the same features that have made the field vibrant and productive have contributed to the difficulty of progressing to a more rigorous and professional model of software development. In particular, the surge of development over the past two decades, done in large part by underpaid workers motivated by pure enthusiasm for their craft, has allowed the field to benefit from a great deal of new software without being forced to reckon with its true costs. Institutions have not been forced to pay professional programmers competitive salaries; grant agencies have not been compelled to set aside appropriate funds for a software infrastructure; and the line items for professional software engineering have not made it into budget models. Thus, genomics has become accustomed to, even addicted to, abundant free software. In a sense, in our idealistic, antiestablishment zeal, we free software warriors have locked computational genomics into an unsustainable financial model.

Remedies: general principles
What, then, can be done to improve the financial and development landscape for genomic software? I address this question by first advancing some general principles, and then putting forward some more specific implementation strategies.
First, a clearer recognition is needed, at all levels (ranging from research institutions to granting agencies to private companies), that software for genomic analysis is a fundamental component of the infrastructure of genomics and requires a substantial commitment of resources. Software development is no less essential to the progress of the field, and no less complex and expensive to carry out, than development of new genomic technologies or large-scale databases.
Second, commitments to the development of new software must be accompanied by ongoing commitments to the maintenance, refinement, and support of widely used tools. Because some tools inevitably remain relevant and widely used for longer than others, mechanisms will be needed for determining which previously funded projects do and do not deserve ongoing support.
Third, grant proposal formats and review criteria must be adapted to accommodate fundamental differences between software development projects and genomic research projects. In particular, proposals for software development projects should be evaluated in a way that gives less weight to innovation and more weight to software engineering practices, as well as to distribution, maintenance, support, documentation, and usability.
Fourth, improved career paths are needed for software developers working in academic research settings. Institutions and grant mechanisms must allow for salaries that are competitive with industry, and better opportunities for career advancement and continuing education.
Fifth, academic researchers and funding agencies must remain open to the possibility that some aspects of software development might be better done by private companies and should consider ways to nurture the development of sustainable business models based on genomic software development.
Sixth, it would be a mistake to abandon the current bottom-up model (with investigator-initiated software development closely integrated with genomic research) in favor of a top-down model, dominated by large, centrally organized projects. Rather, a strategy is needed that embraces the strengths of our research-coupled model but promotes software quality and financial sustainability.

Remedies: specific strategies
In keeping with the broad principles outlined above, I propose specific strategies in three major areas: grant funding, career development, and commercial development.

Grant funding
There is clearly a need for continuing support for genomic software development from government grants, but the field would benefit substantially from improved grant opportunities, review criteria, and budget models. Some specific possibilities include:
▪ Changes to proposal formats and review criteria to focus attention on the engineering aspects of software projects that currently tend to be hidden in research proposals. For example, proposals with substantial software development components should be required to address in detail how software will be tested, distributed, and maintained, what user interfaces and documentation will be provided, how version control and bug-tracking will be managed, and how ongoing support will be offered to users. Explicit review criteria should be used to evaluate these features, and at least one suitably trained reviewer should examine each proposal with these criteria in mind.
▪ More government grant opportunities specifically focused on software development, with review criteria as described above. Review of these proposals should also allow for a reduced emphasis on novelty or innovation, as well as for the possibility that innovation might occur at the software design or implementation levels. A substantial fraction of these proposals should be awarded to individual investigator-initiated software projects, rather than being earmarked for large projects or consortia. Perhaps the best example of this type of funding in the US, at present, is the US National Science Foundation (NSF) Infrastructure Capacity for Biology program (formerly, Advances in Bioinformatics), but the funds devoted to this program are modest ( [...] Data Science program appears to be intended to replace them, in part, but it has a broader scope, and it is not clear how many awards will be funded through it. An important issue to address here is how to measure the impact and importance of existing software tools: through citations, downloads, expert opinion, or some other measure?
▪ Budget models that allow professional software developers to be paid competitive salaries from government grants. Current budgetary limits, such as the typical $250,000 per year in direct costs for a "modular" NIH grant, make it nearly impossible to pay these workers appropriately and still have funds for other necessities such as students, postdoctoral researchers, supplies, and portions of principal investigator salaries.
▪ Grant opportunities specifically designed to support computational scientists who wish to continue developing genomic software in a research setting, but who do not wish to serve as independent investigators. The US National Cancer Institute (NCI) Research Specialist (R50) award could serve as a model for such a program.
▪ More grant opportunities from private foundations and companies to support genomic software development. Private foundations, such as the W. M. Keck, Alfred P. Sloan, and Simons Foundations and the Wellcome Trust, have emerged as important auxiliary sources of scientific funding, but their support for projects in software development has so far been limited. Notable exceptions include the Data-Driven Discovery program from the Gordon & Betty Moore Foundation, the Collaborative Computational Tools for the Human Cell Atlas program from the Chan Zuckerberg Initiative, and the Innovation in Cancer Informatics fund (Table 2).
▪ More grant opportunities to support community development for the kinds of distributed, open-source projects that have been so successful in computational genomics. For example, these grants could support workshops, "hackathons", competitions, and challenges (such as CASP [158,159] or DREAM [160]), creation of standardized benchmarks for testing, and public repositories for code and data.

Career development
As noted above, a crucial barrier to genomic software development is the absence of stable and rewarding career paths for software developers working in academic research settings. Some institutions have been more effective than others at promoting the careers of these individuals (notable examples include the European Bioinformatics Institute, the Broad Institute, the UC Santa Cruz Genomics Institute, and the New York Genome Center), but improvements are needed broadly across the field. Aside from improved funding for salaries (above), the following ideas could be considered:
▪ Improved job descriptions, salary scales, and paths for career advancement, to allow recruitment and retention of first-rate software developers despite competition from industry. Software developers must be provided with clear paths from entry-level positions to jobs with increased pay, professional status, and/or leadership potential. In addition, academic job categories and descriptions should avoid blurring the distinctions among support roles; a software developer is not the same as a laboratory technician, a data analyst, or a systems administrator.
▪ Opportunities for continuing education. Software developers work in a fast-moving field, with new technologies continually emerging. They need to be able to attend their own conferences, workshops, and courses, just as researchers do. These activities would improve their productivity, generate and maintain excitement about their work, and help to create a sense of parity with workers on the research track.
▪ Institutional recognition of the accomplishments of software developers and other support staff. Some academic institutions bestow a seemingly limitless supply of awards and accolades on their faculty and students, but the critically important efforts of programmers, analysts, and technicians are too often overlooked. Recognizing these individuals is a natural way to help them feel valued.
▪ Encouragement for the development of forums for intellectual exchange among software developers and other staff members across an institution. For example, in-house seminars could be organized to focus on new programming languages, hardware resources, or other technologies, or to showcase the technical underpinnings of a new software release or data analysis.

Commercial development
A third major area concerns the development of a sustainable commercial model to support aspects of software development that may be more efficiently, and more naturally, carried out in private companies than in academic research environments. Ideas to consider include:
▪ Grant mechanisms that make it easier to outsource software development, maintenance, and support to private companies, through contracts, consulting or service fees, or other arrangements, instead of implicitly encouraging academics to do this work for themselves (often poorly). For grant proposals that have a substantial software development component, investigators should perhaps be explicitly asked to present a rationale for their decision either to outsource the work or do it in-house. Institutions and granting agencies could facilitate outsourcing by providing lists of companies with various types of expertise.
▪ More proactive efforts by research institutions to spin off companies that develop genomic software. Many institutions have become much more active in encouraging start-ups in recent years, but development has been slow in the area of genomic software owing to uncertainty about business models. Nevertheless, if these efforts were paired with a push to outsource some grant-funded activities, perhaps the business models would begin to coalesce.
▪ More grants to support emerging genomic software companies, through mechanisms such as the Small Business Innovation Research (SBIR) program in the US (which does indeed fund some current software development activities).
▪ More efforts to expose graduate students and other trainees to commercial opportunities, including guidance on how to start their own companies and how to benefit from institutional incubators and small business grants.

Conclusions
Genomic software is now a fundamental component of the infrastructure for biological research. It is central to many thousands of research projects, costing many billions of dollars per year. Despite its crucial importance, genomic software development is generally funded at modest levels, primarily through a diffuse collection of government grants to individual researchers in academic research environments. This model is quite different from those adopted for other aspects of the infrastructure for life sciences research, such as public databases, which tend to be publicly funded but centrally organized, and laboratory equipment, which tends to be developed and marketed by private companies. The roots of these differences lie in the rapid growth of genomic software together with the emergence of the Internet, a generational change in the adoption of computers in biological research, and an affinity for the Open Source movement of the 1990s. Despite important strengths, the limitations of the current model are becoming increasingly apparent, with unreliable and hard-to-use software and inadequate maintenance and support, resulting in wasted time and money.
I have argued here that we need major changes in the way that we fund and carry out software development for genomics. In general, I propose measures intended to maintain the fundamental strengths of our current investigator-driven, research-coupled model of software development, but this model should be augmented with improved engineering practices, funding opportunities, career development, and commercial opportunities. These proposed measures would require action at multiple levels, including in individual research groups, in institutions, and at funding agencies. They would clearly be costly. However, I believe that these costs are small in comparison to the many hidden costs of failing to offer a robust, reliable, efficient, and conveniently usable software infrastructure for genomics, and those hidden costs will only increase as the field grows in size and influence.