A few years ago, Erich Jarvis decided it was time to sequence the genome of his parrots. Jarvis is a neuroscientist at Duke University who studies why songbirds and humans can learn vocal patterns, while most animals cannot. Jarvis hoped to compare the genetic code of vocal learners and non-learners to understand whether the genes and gene expression patterns that allow us to talk are the same as the ones that allow Polly to ask for a cracker.
Two full genome sequences later, something wasn’t right. Jarvis was fishing for some highly specific sequence areas, called promoter regions, which determine when and where cells switch genes on to make proteins. He suspected that vocal learning was not so much a question of evolving entirely new genes but of tweaking the way old ones were activated in certain regions of the brain.
But he couldn’t find the parts of the genome he was looking for in the final sequence. At first he thought that he was looking in the wrong place, but it soon became apparent that they simply weren’t there. He had sequenced an entire genome looking for needles in a haystack, and it turned out that these needles, due to systematic errors in assembly of the genome, had been left out, misplaced in some remote computer file for junk sequence. This was Jarvis’s introduction to what computer scientists call “the assembly problem.”
The assembly problem has been around as long as DNA sequencing, says Michael Schatz, professor of quantitative biology at the Cold Spring Harbor Laboratory. Sequencing machines don’t produce one long, complete read of each of our chromosomes. Rather, scientists use enzymes to cut up the DNA from many cells into pieces short enough for the sequencer to handle. Imagine cutting up many copies of a long poem into strips only a few words long, mixing them up, and trying to put the poem back together based on the strips’ overlapping ends. Final assemblies can leave regions out, put in too many copies of a repeating sequence, assemble the pieces in the wrong order or put them in backwards.
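To make the poem analogy concrete, here is a minimal Python sketch of the idea of stitching reads back together by their overlapping ends. It is a toy greedy approach, not the method used by any of the assemblers named in this article; the function names, the `min_len` threshold and the example reads are all invented for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Toy illustration only: real assemblers must also cope with sequencing
    errors, reverse strands and repeated regions.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to join; return the separate pieces
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three shuffled "strips" cut from the made-up sequence ATGGCGTACGTTAGC.
strips = ["CGTTAGC", "ATGGCGTA", "CGTACGTT"]
print(greedy_assemble(strips))  # prints ['ATGGCGTACGTTAGC']
```

With clean, unique overlaps the strips snap back into the original sequence. The failure modes listed above appear as soon as two different strips share the same ending, which is exactly what repeated sequences cause.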
Even the competition to sequence the human genome hinged on a difference in assembly strategies. While the government-sponsored Human Genome Project team methodically sequenced regions of DNA in order according to a map, Craig Venter sequenced high volumes of DNA in no particular order very quickly and assembled them later.
Today’s “short-read” sequencing technologies are cheap, but they cut DNA into even smaller pieces than scientists 10 years ago had to work with, magnifying the difficulty of assembly. While traditional Sanger sequencing produced readings 500 to 1,000 base pairs long, the Illumina next-generation sequencers, currently the most widely used, can only read 35 to 150 base pairs in a row. Some computer scientists worry that by sequencing vast quantities of DNA using these technologies, we are dooming ourselves to a substantial amount of genetic gibberish. At the very least, we are overwhelming ourselves with quantities of information that can take energy-sapping months to process.
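One way to see why shorter reads make assembly harder: a read that is shorter than a repeated stretch of DNA cannot be assigned to one copy of the repeat or the other. The Python sketch below counts how many reads of a given length occur more than once in a toy sequence. The sequence, the repeat "GATTACA" and the function name are made up for illustration, and real genomes have repeats thousands of bases long.

```python
from collections import Counter


def ambiguous_fraction(genome: str, read_len: int) -> float:
    """Fraction of all possible reads of length `read_len` whose sequence
    appears at more than one position in `genome`."""
    reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(reads)
    ambiguous = sum(1 for r in reads if counts[r] > 1)
    return ambiguous / len(reads)


# Hypothetical toy sequence: the 7-letter repeat GATTACA appears twice,
# separated and flanked by unique filler.
TOY_GENOME = "CCTG" + "GATTACA" + "TTGGC" + "GATTACA" + "AAGC"

for k in (4, 7, 8):
    print(k, round(ambiguous_fraction(TOY_GENOME, k), 3))
```

In this toy case a third of the 4-letter reads are ambiguous, while every 8-letter read is unique because it reaches past the repeat into distinct neighboring sequence.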
“There’s an enormous disconnect in the genomics field,” said Schatz. Sequencing has gone from an average of $100 million per genome in 2001 to $10,000 per genome as of last fall, but that doesn’t take into account the computational difficulty or cost of processing the deluges of data the sequencing machines produce. “Those costs are the costs just for acquiring the raw sequences,” Schatz said, “not for assembling them.”
Until recently, scientists didn’t even have a good measure of how accurate or inaccurate sequence assembly was. Sequencing a genome just once is so costly and time-consuming that scientists rarely sequence the exact same individual organism even twice. In other words, they had little way of checking their work.
“Sort of like Consumer Reports will evaluate every different brand of an appliance, there had never really been a serious attempt to compare all these software packages,” said Schatz.
In 2010, scientists at the University of California Santa Cruz decided they had to do something about this knowledge gap. They were recruiting biologists to sequence 10,000 vertebrate genomes in a project called Genome 10K. They wondered if next-generation sequencing was up to the job and which assembly programs to recommend.
The U.C. Santa Cruz scientists sought out teams of computational biologists to put their assembly programs head-to-head, all working with one computer-generated sequence. Whichever group submitted the sequence closest to the original would win glory and professional reputation. They called the competition the Assemblathon.
In the end, 17 teams submitted using 18 assembly programs. Some competing programs, such as Phusion, phrap and Celera, had long histories. Celera, for instance, was first developed by Craig Venter’s group during the race to sequence the human genome. Others, such as the Department of Energy’s Meraculous, were relatively new. The groups produced many different solutions, each one flawed in its own way. “It was surprising that the differences were as great as they were,” said Benedict Paten, a postdoctoral student at U.C. Santa Cruz who helped organize the competition. “It was also surprising that no team was the best.”
The Broad Institute came away with the best overall score by a small margin, for instance. Despite its slim victory, the Broad was only the 11th-best team at including the right number of copies of repeating sequences. While some scientists emphasize the inaccuracies the Assemblathon revealed, Paten says he was relieved to find that the new assemblies were almost as accurate as the first-generation sequencing assemblies had been.
Even if next-generation sequencing isn’t disastrously inaccurate, it still has the potential to waste scientists’ time and money. With cheap sequencing, unprecedented numbers of labs now have a complete genome sequence of the organism they study on their wish list. Some quickly find the information they need in their sequences, while others, like Jarvis, end up on long chases that are more expensive than they expected.
Schatz now uses the Assemblathon I paper, published in September 2011, to help biologists decide which sequencing strategies and assembly programs best match their goals. He also recommends a December 2011 paper by University of Maryland professor Steven Salzberg, who used eight computer programs to assemble bacterial and bumblebee genomes, as well as a human chromosome sequence. Schatz hopes that the papers will help scientists evaluate for themselves which assembly programs and which sequencing types best meet their needs.
Kim Worley of Baylor University, a computer scientist who participated in the second Assemblathon, hopes the Assemblathons will teach biologists to be more realistic in their expectations. She said that whenever her lab finishes a genome, “everyone says there’s a problem. My gene doesn’t look good. My region is messed up. People don’t ever say, oh you did a great job.” She says that biologists often don’t understand the subtleties of sequencing and assembly, while Jarvis says that computer scientists need to listen more to what kind of sequence information biologists actually need.
Perhaps the problem is that traditional biologists and computational scientists have fundamentally different drives. “I’m not really fascinated by genome assembly,” said Jarvis. “My real fascination is how the brain generates complex behaviors.”
But many agree the divide is getting smaller, as labs increasingly look to hire biologists with computational skill and biologists seek out that training. Jarvis now collaborates closely with the computer scientists behind the Assemblathon.
For the second Assemblathon, which was organized out of the University of California, Davis, and finished accepting submissions in fall 2011, Jarvis achieved the ultimate coup: he provided his parrot sequence, and 21 teams of computer scientists from 14 countries worked to assemble it. The competitors also all worked with snake and cichlid data. The Assemblathon organizers are now busily analyzing the results, to be published later this year.
Schatz’s task is now to pick out the best elements of each submitted assembly for the parrot genome and combine them to form one extra-accurate genome sequence. “The final product I hope will be a super-assembled genome maybe even just as good as the human or the mouse,” said Jarvis.
Schatz is just as pleased with this type of partnership. “I can make substantial contributions on many different projects and different systems,” he said. “I study everything from birds to fish to microbes to fungi to humans to cancer.”
Even after the Assemblathon results are published, there will be no final word. New sequencing technology develops at such a rapid pace that even as the Assemblathon scientists take months analyzing their results, the state of the art is changing. The Assemblathon scientists have day jobs, Jarvis points out. If we were to have truly accurate and up-to-date assembly recommendations, we would need to have many people working on Assemblathons all the time.
We may evolve past the assembly problem before we have time to fully solve it. On February 17, the British company Oxford Nanopore Technologies Ltd. announced its newest product, a sequencer the size of a thumb drive that can reportedly read up to 10,000 base pairs in a row, over 30 times as many as current short-read sequencers. This is like going from one of those impossible coffee table puzzles to one for toddlers. The company has yet to publish on its new technology and will have to prove it is accurate as well as efficient, but scientists are cautiously excited. The California company Pacific Biosciences can already produce equally long reads, albeit not as quickly and cheaply.
Paten’s advice to biologists: “If what you want to do is not feasible, just wait a few years.”