By the late 1990s, avalanches of data were pouring into GenBank—the genetic sequence database operated by the U.S. National Institutes of Health. In 1999 GenBank contained about two billion base pairs. That number jumped to 11 billion in 2000, nearly 45 billion in 2004 and nearly 86 billion by 2008. As that database and many others grow, it becomes increasingly valuable to share knowledge, but doing so becomes ever more difficult. Speed makes up part of the problem, but policy hurdles also block the way.
Around 2003 the speed of sharing data concerned Wu Feng, then at Los Alamos National Laboratory and now an associate professor in the departments of computer science and electrical & computer engineering at Virginia Tech in Blacksburg. Feng knew that the number of bases in GenBank was growing faster than the ability to search them, especially with the popular basic local alignment search tool, better known simply as BLAST. So Feng and his colleagues created mpiBLAST, which lets multiple computer processors tackle the same sequence query as a team. This new software made sequence searches faster, often several orders of magnitude faster, but Feng and his colleagues would soon be searching for more ways to increase the speed of data sharing.
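To make the "team" idea concrete, here is a minimal sketch of one common way to divide a single query among several workers: split the database into fragments, let each worker search its own fragment, then merge the hits. It is not mpiBLAST's code. The scoring function is a toy stand-in for a real alignment score, the thread pool merely stands in for separate processors, and the names (score, team_search, toy_database) are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def score(query: str, subject: str) -> int:
    """Toy similarity score: length of the longest stretch of `query` found in `subject`."""
    best = 0
    for start in range(len(query)):
        # Only try stretches longer than the best one found so far.
        for end in range(start + best + 1, len(query) + 1):
            if query[start:end] in subject:
                best = end - start
            else:
                break
    return best


def search_fragment(query, fragment):
    """Score the query against every sequence in one fragment of the database."""
    return [(score(query, seq), seq_id) for seq_id, seq in fragment]


def team_search(query, database, workers=4):
    """Split the database into fragments, search them concurrently, merge the hits."""
    items = sorted(database.items())
    fragments = [items[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_hits = list(
            pool.map(lambda frag: search_fragment(query, frag), fragments)
        )
    merged = [hit for hits in partial_hits for hit in hits]
    # Best score first; ties broken by identifier so the order is reproducible.
    return sorted(merged, key=lambda hit: (-hit[0], hit[1]))


# Hypothetical three-sequence database, for illustration only.
toy_database = {"seq1": "ACGTACGT", "seq2": "TTTTGGGG", "seq3": "ACGTTTTT"}
ranked = team_search("ACGTAC", toy_database, workers=2)
print(ranked)
```

Because each fragment is searched independently, the merged ranking is the same no matter how many workers share the job; only the elapsed time changes.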
Meanwhile many other researchers pursued different approaches to connecting biological and biomedical information around the world. Although making these data connections demands solving technological and sociological challenges, the results can change the approach to basic research and even the business of biotechnology.
"Today more than ever, researchers recognize the impact of sharing," says Henry Rodriguez, director of clinical proteomic technologies for cancer (CPTC) at the U.S. National Cancer Institute (NCI). He adds, "Advances in science and healthcare are made possible through widespread and barrier-free access to research and the data produced by that research."
In fact Rodriguez sees at least five ways that data sharing benefits research. First, data sharing encourages open scientific enquiry. "This lets conclusions from research be validated or refuted by peers, and that adds more strength to the results," Rodriguez explains. Second, sharing data from past experiments triggers new ones. As Rodriguez says, "Existing data can lead to new insights that the first investigator might not have recognized." He adds, "Programs in genomics and all of the ‘omics are producing vast amounts of data, but connecting the data and extracting knowledge from the data are critical." Third, Rodriguez points out that making data openly available creates huge test sets that can be used to assess the quality of new informatics software. Fourth, combining information creates data sets that cannot be generated by any individual. "Putting it all together is the key," Rodriguez says. Last, he believes that sharing data openly reduces unnecessary duplication. "Some duplication provides rigor," Rodriguez says, "but accessing data from others can also push science further."
In part the very nature of biotechnology demands data sharing. "Almost by definition," says Kenneth H. Buetow, associate director for bioinformatics and information technology at NCI, "biotechnology and biomedicine are international enterprises." He adds, "There are immediate challenges from that globalization, especially how to get continuity of information and connecting information."
Nonetheless, some information in biotechnology, such as proprietary data generated inside biotechnology and pharmaceutical companies, will never be readily released—at least not right after it’s collected. But that is not necessarily the bulk of biotech information. "There are tons of resources nowadays," says Buetow, "that are precompetitive. So while this information is not necessarily proprietary, it is a disadvantage if a company cannot access the information and has to generate it." For example he points out that genome-wide association studies could be helpful to many researchers—in basic research and business—even though the commercial value is largely limited. "That information should be shared on a broad scale," Buetow says. "It is invaluable."
Still, Buetow knows that some information will not be made open to everyone. For example a pharmaceutical company is not going to openly share information about binding between a candidate drug and a disease target. But even that information will need to be shared inside the company. Likewise in a multinational clinical trial, data might be shared between a pharmaceutical company and a contract research organization or local physicians collecting data. As Buetow says, "Some data need to be bound for intellectual property reasons or through licensing, but I would argue that even that needs to be shared. The issue there is: What is the legal framework under which you negotiate to get access?"
In the late summer of 2008, NCI’s CPTC convened an international summit in Amsterdam to discuss data-sharing challenges and solutions. (See sidebar "Outcomes from Amsterdam.") Although that group focused on proteins, the challenges apply to sharing almost any sort of biotechnology data. According to Rodriguez, data sharing faces three categories of challenges: technology, infrastructure, and policy. "Moreover, each of those impacts the other two," says Rodriguez. So the challenges can be described individually, but they interact in practice.
The technology challenge in molecular biotechnology consists of several pieces. First, in genomics, proteomics and other fields, researchers use a range of technologies, such as mass spectrometry, tandem mass spectrometry, liquid chromatography and so on. That produces a variety of data that must somehow be compared. Worse still, the same instrument used in two labs can create different results just because the instrument gets calibrated in different ways. "So data from the same kind of instrument used with the same reagents but in two different labs can pump out data that are not comparable," Rodriguez says. The next technological challenge comes from the "flavor" of data being used: raw or processed. Raw data are just what they sound like: uncooked, processed as little as possible or not at all. If the data are processed, different data sets can be compared only when the exact processing is taken into account and the data are adjusted accordingly. Even if a researcher can get raw data from an instrument, that device could put the information in a proprietary format that is incomprehensible to other devices or analysis packages. And that analysis makes up the last technological challenge in data sharing. "Researchers use multiple computational tools, the algorithms that extract knowledge," says Rodriguez. Those algorithms pull out relationships that might be missed otherwise, but it proves difficult to compare data that were analyzed in different ways.
To get at the infrastructure behind data sharing, imagine a transportation analogy: Cars, trucks, trains, jets, ships and so on make up the data; and garages, roadways, waterways, skies and such make up the infrastructure. So the infrastructure determines where the data can be stored and the paths that data can take from one spot to another. "No international or centralized network has emerged," Rodriguez says. "Since the ones available use their own fixed formats, researchers cannot gather information from all of the sites." He adds, "Today’s repositories are a benefit, but it will remain problematic if they are not interoperable."
Policy makes up the last category of data-sharing challenges. "In terms of proteomics," Rodriguez says, "the challenge here is really fundamental. It will be responsible for establishing and ultimately enforcing the guidelines for the proteomics community, including the requirements for submitting data and the metrics that will be used to determine the quality of the data." For example, standards should require researchers to provide the metadata that explain the details behind the experiments that produced the actual data.
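As a rough illustration of what a metadata requirement might look like in practice, the sketch below checks a submission record before it is accepted. It does not represent any actual repository's rules. The field names are assumptions drawn from the experimental details mentioned in this article (instrument, calibration, reagents, raw versus processed data, file format).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubmissionMetadata:
    """Hypothetical metadata record; the field names are illustrative assumptions."""
    instrument: str
    calibration: str
    reagents: List[str]
    data_state: str  # "raw" or "processed"
    file_format: str
    processing_steps: List[str] = field(default_factory=list)


def find_problems(meta: SubmissionMetadata) -> List[str]:
    """Return a list of reasons the record is incomplete (empty if none)."""
    problems = []
    for name in ("instrument", "calibration", "file_format"):
        if not getattr(meta, name).strip():
            problems.append(f"{name} is missing")
    if not meta.reagents:
        problems.append("reagents are missing")
    if meta.data_state not in ("raw", "processed"):
        problems.append("data_state must be 'raw' or 'processed'")
    elif meta.data_state == "processed" and not meta.processing_steps:
        problems.append("processed data must list its processing steps")
    return problems


complete = SubmissionMetadata(
    instrument="tandem mass spectrometer",
    calibration="calibrated before each run",
    reagents=["trypsin"],
    data_state="raw",
    file_format="open format",
)
incomplete = SubmissionMetadata(
    instrument="tandem mass spectrometer",
    calibration="",
    reagents=[],
    data_state="processed",
    file_format="vendor format",
)
print(find_problems(complete))
print(find_problems(incomplete))
```

A check like this only confirms that the details were supplied. Judging the quality of the data themselves would need the community-agreed metrics that Rodriguez describes.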
For data related specifically to healthcare, other policy considerations also arise. Patrick L. Taylor—deputy general counsel and chief counsel for research affairs at Children’s Hospital Boston and assistant clinical professor at Harvard Medical School—writes often about data sharing, and he sees several obstacles, such as avoiding misuse of the data and creating a level playing field that takes into account the goals of commercial interests and patients. He says, "Managing access to data and its uses in ways that respect people’s privacy but meets everyone’s goals in public health is a real challenge."
So far, though, Taylor thinks that companies could do a better job of sharing data. "Huge amounts of tissue and data get collected in clinical trials," he says, "but that just gets banked away, used in for-profit directives, even though the research subjects just volunteered." Despite that company–patient imbalance, Taylor adds, "I don’t want to demonize companies. They operate in their own environment."
Moreover, Taylor does not encourage a forced approach to data sharing. Instead he wants to find ways that encourage data sharing and help everyone along the way. "We could create data pools and give companies some level of access in exchange for sharing some of their own data." With such data pools, Taylor says, multiple companies might not need to reproduce the same data, which they do today.
In some areas, technology already makes data sharing possible. (See sidebar "Putting Patients Together.") One example is the cancer biomedical informatics grid, or caBIG, which was started by NCI and is still run by it. Buetow describes caBIG as the "information technology framework that supports 21st-century biomedicine." He adds, "It’s a way that we can interconnect the entire biomedical enterprise using current-art information technology." So caBIG takes available technology and uses it to connect basic researchers, biomedical scientists, physicians and anyone else interested in cancer and, actually, in healthcare in general.
Just as others have to meet the challenges of sharing proteomics data, caBIG developers needed to make it possible for users to access a range of data types and to make sense of how they interact. Much of the problem revolves around translation—finding ways that software can unravel all of the medical community’s vocabularies. To do this, caBIG provides a range of web services that are designed to work with anything that connects to caBIG. "A key component of caBIG is interoperability," Buetow says. "We are technologically neutral. Information can come from Oracle, a MySQL database and others." For example, caAdapter can be mounted on top of a data resource to make that information available on the caBIG framework.
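The translation problem can be pictured with a small adapter that sits on top of a local data source and rewrites its records into a shared vocabulary. The sketch below is a generic illustration of that pattern, not the caAdapter or caBIG interface. The class name, field names and terms are all made up for the example.

```python
from typing import Dict


class VocabularyAdapter:
    """Hypothetical adapter: maps one site's field names and terms to shared ones."""

    def __init__(self, field_map: Dict[str, str], term_map: Dict[str, str]):
        self.field_map = field_map  # local field name -> shared field name
        self.term_map = term_map    # local term -> shared term

    def translate(self, record: Dict[str, object]) -> Dict[str, object]:
        shared = {}
        for local_name, value in record.items():
            shared_name = self.field_map.get(local_name)
            if shared_name is None:
                continue  # field has no shared equivalent, so it is not exposed
            if isinstance(value, str):
                value = self.term_map.get(value, value)
            shared[shared_name] = value
        return shared


# Two imaginary sites that describe the same patient in different vocabularies.
site_a = VocabularyAdapter(
    {"dx": "diagnosis", "age_yrs": "age"},
    {"NSCLC": "non-small cell lung carcinoma"},
)
site_b = VocabularyAdapter(
    {"Diagnosis": "diagnosis", "PatientAge": "age"},
    {"lung cancer, non-small cell": "non-small cell lung carcinoma"},
)

record_a = site_a.translate({"dx": "NSCLC", "age_yrs": 61, "bed": "4B"})
record_b = site_b.translate({"Diagnosis": "lung cancer, non-small cell", "PatientAge": 61})
print(record_a == record_b)
```

Once both sites speak the shared vocabulary, software elsewhere on the network can compare their records without knowing how either site stores them.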
Virtually anyone around the world can use the caBIG technology. Some international biotechnology operations are already underway. For instance NCI formed a partnership with Duke University related to international clinical trials. "So Duke established a partnership with the Beijing Cancer Hospital to get participation with Chinese colleagues," Buetow says. "They are using caBIG so that a trial being run in Durham, North Carolina, can recruit participants in Beijing, China."
In addition NCI developed a partnership with the Institute of Cancer Research in London. "They are installing a framework called Onyx that will be interoperable with caBIG. So we can interconnect between the U.K. and the U.S."
Despite being called a cancer grid, caBIG goes beyond cancer. "There is nothing cancer-specific about it," Buetow says. Instead NCI scientists hope that this system can draw together a range of health professionals around the world. "In developing countries in particular," Buetow says, "this technology could help scientists become part of a bigger framework. These scientists could contribute their expertise to the field without building all of the components required in biotechnology or biomedicine."
Many of the desired applications of data sharing, though, still hit information bottlenecks. As Feng and his colleagues found with BLAST, sequence searches could run faster by adding the parallel-computing capabilities of mpiBLAST. But even mpiBLAST is not always enough.
Sequence searches, even when done fast, still produce large amounts of data, which are not easy to move. So Feng and Pavan Balaji of the Argonne National Laboratory worked with some colleagues to develop ParaMEDIC, which stands for parallel metadata environment for distributed I/O and computing. In fact, I/O (input/output, or simply getting information into and out of computing resources) can really slow down data sharing.
To get around that, Feng and Balaji use ParaMEDIC to turn the original data into a code. With sequences, for example, ParaMEDIC uses GenBank Identifiers, which represent sequence strings. So instead of needing to grab a long length of bases—cytosine, guanine, thymine, cytosine and so on—ParaMEDIC just uses an identifier.
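The idea of shipping identifiers instead of sequences can be sketched in a few lines. This is an illustration of the general principle only, not ParaMEDIC's code. It assumes that the receiving site holds its own copy of the sequence database, and the numeric identifiers and sequences are invented.

```python
def encode_results(hits):
    """Keep only the identifier of each hit, dropping the sequence itself."""
    return [seq_id for seq_id, _sequence in hits]


def decode_results(identifiers, local_database):
    """Rebuild the full hits at the destination from its own database copy."""
    return [(seq_id, local_database[seq_id]) for seq_id in identifiers]


# Hypothetical database that both sites are assumed to hold.
shared_database = {
    101: "CGTC" * 250,
    102: "GATTACA" * 150,
    103: "TTAGGC" * 200,
}

# Search results at the computing site: (identifier, sequence) pairs.
hits = [(103, shared_database[103]), (101, shared_database[101])]

encoded = encode_results(hits)                       # what actually travels
rebuilt = decode_results(encoded, shared_database)   # done at the destination
bases_not_sent = sum(len(sequence) for _id, sequence in hits)

print(encoded, bases_not_sent, rebuilt == hits)
```

Two short identifiers stand in for 2,200 bases here. The saving grows with the length and number of sequences, at the cost of extra work at the destination to turn the identifiers back into full results.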
To see how well ParaMEDIC could really work, Feng and Balaji took on a tough problem. Scientists at the Virginia Bioinformatics Institute at Virginia Tech wanted to find the missing genes in 567 genomes from microbes, which required 2.63 × 10^14 sequence searches. To do those searches, Feng and Balaji created a team of researchers, plus eight supercomputers scattered across the United States. The results consisted of 0.97 petabytes of data, almost a quadrillion bytes. To add the I/O side, they planned to send the results, by Ethernet, to Tokyo. Sending the data in the conventional way would have taken about three years. With ParaMEDIC, the supercomputers cranked out the sequence searches, then crunched the results into a GenBank Identifier code. That crunching step turned the 0.97 petabytes into about four gigabytes, a reduction of roughly 250,000-fold. As a result, the Feng and Balaji team computed the missing genes, sent the information from the United States to Japan, and had computers in Japan turn the code back into the original data, all in just 10 days.
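The figures in this example can be checked with a little arithmetic. The sketch below assumes decimal units (one petabyte as 10^15 bytes, one gigabyte as 10^9 bytes) and a constant transfer rate. The sustained rate it derives is only what the "three years" estimate implies, not a number reported by the researchers.

```python
PETABYTE = 10**15
GIGABYTE = 10**9
SECONDS_PER_YEAR = 365 * 24 * 3600

result_bytes = 97 * PETABYTE // 100  # 0.97 petabytes of search results
encoded_bytes = 4 * GIGABYTE         # about four gigabytes after encoding

# How many times smaller the encoded results are.
reduction_factor = result_bytes / encoded_bytes

# Sustained rate implied by moving the full results in about three years.
implied_bits_per_second = result_bytes * 8 / (3 * SECONDS_PER_YEAR)

# Time to move the encoded results at that same rate.
encoded_transfer_seconds = encoded_bytes * 8 / implied_bits_per_second

print(f"reduction factor: {reduction_factor:,.0f}")
print(f"implied rate: {implied_bits_per_second / 1e6:.0f} megabits per second")
print(f"encoded transfer time: {encoded_transfer_seconds / 60:.1f} minutes")
```

Under these assumptions the reduction works out to 242,500-fold, consistent with the rounded figure of 250,000, and moving the encoded results takes minutes rather than years. That leaves nearly all of the 10 days for the searches themselves and for rebuilding the data in Japan.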
This application could find lots of uses in biotechnology. "Say that you are a pharmaceutical company that has petabytes of sequence-search data stored around the world and you need to bring it to one place for some reason—back-up store or large-scale experiment," Feng says. "ParaMEDIC will enable the information to be shipped and reconstituted in a fraction of the time that it would take to recompute all the information locally."
In general, sharing data will remain under development, probably indefinitely. New research tools and growing data pools will require ongoing technological advances to keep the sharing doable. With every advance, though, sharing data will increase around the world.
Sidebar: Outcomes from Amsterdam
When proteomic experts gathered in Amsterdam on August 14, 2008, to attend the International Summit on Proteomics Data Release and Sharing Policy, they focused on ways to get proteomic data into the public domain. "Our primary focus was on policy," says Henry Rodriguez, director of clinical proteomic technologies for cancer at the U.S. National Cancer Institute.
One policy decision involved when data should be released. In Amsterdam, Rodriguez and his colleagues concluded that it depends on the source of the data. If the data come from an individual researcher’s lab, the data should be released when the work gets published. For large-scale community projects designed to advance science in general, however, the data should be released as they are generated, provided that appropriate procedures exist to control the data quality.
In Amsterdam, the experts also considered what kind of data should be made available. "Raw data are the data that should go into the public domain," Rodriguez says. "Even if you agree to release raw data, though, they must be extremely well annotated with metadata. That defines the quality of the data itself."
Although many details must still be resolved, the intent is certain. "It is clear to me and others," Rodriguez says, "that data sharing expands and expedites research findings, especially where they are applicable to disease."
Sidebar: Putting Patients Together
In 2004 a trio of M.I.T. engineers, brothers Ben and Jamie Heywood and long-time friend Jeff Cole, founded PatientsLikeMe. In fact, this project really started in 1998, when another Heywood brother, Stephen, was diagnosed with amyotrophic lateral sclerosis (ALS), often called Lou Gehrig’s disease. Although ALS is always fatal, slowly destroying the central nervous system, the Heywoods started looking for ways to give Stephen the best life that he could have. In 1999 Jamie founded the ALS Therapy Development Institute to speed up the generation of new treatments. Beyond finding new molecules, though, the Heywoods and Cole wanted to do even more. As described on the PatientsLikeMe website: "Our goal is to enable people to share information that can improve the lives of patients diagnosed with life-changing diseases. To make this happen, we’ve created a platform for collecting and sharing real world, outcome-based patient data (patientslikeme.com) and are establishing data-sharing partnerships with doctors, pharmaceutical and medical device companies, research organizations and non-profits."
This work goes beyond ALS. In fact, PatientsLikeMe plans to soon cover more than 50 diseases. It already provides communities for people with depression, HIV/AIDS, multiple sclerosis, Parkinson’s disease and other afflictions. Perhaps most important of all, PatientsLikeMe reveals some of the breadth behind the ways that people can collect and distribute information. "PatientsLikeMe consists of groups of people coming together to share data in dramatically new ways," says Patrick L. Taylor, deputy general counsel and chief counsel for research affairs at Children’s Hospital Boston and assistant clinical professor at Harvard Medical School, and a well-known expert on data sharing. "These data become a source of further data sharing. It’s patient-specific, phenotypically interesting, longitudinal data shared by patients themselves."