COLD SPRING HARBOR, NY (GenomeWeb) – By relying on the Hi-C approach to map interactions between chromatin regions, researchers have been able to assemble de novo mammalian genomes for less than $1,000.
Olga Dudchenko, a postdoc in Erez Lieberman Aiden’s lab at the Baylor College of Medicine, said during her talk at the Biology of Genomes meeting here that other approaches have been used to align a genome for the same cost. However, she noted that those methods depend upon reference genomes.
“We are trying to do that, but without the reference to align to,” she said.
Short reads are typically assembled into contigs by matching overlapping ends. But oftentimes, Dudchenko said, this results in thousands of contigs rather than a handful of chromosomes because of repetitive regions and ambiguous overlaps. Folding in other data, she said, can help bring that number of contigs down, but that requires money and expertise. And in the end, researchers “often still don’t get full chromosomes,” she said.
Dudchenko noted that some of the “kitchen sink” assembly approaches use Hi-C data, but she said Hi-C data hold the potential to help assemble genomes on their own.
Hi-C generates data on the 3D structure of the genome and identifies chromatin regions that are in contact with one another. As regions that are physically closer to one another are more likely to be in contact with each other than with more distant regions, this enables researchers to estimate distances between various sequences.
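To make the idea concrete, here is a minimal sketch of how contact counts might be used to order sequences. It is a toy greedy heuristic, not the method Dudchenko described, and the contig names and counts are invented for illustration.

```python
# Toy Hi-C contact counts between pairs of contigs.
# All names and numbers are hypothetical, chosen only for illustration.
contacts = {
    ("A", "B"): 120,
    ("A", "C"): 15,
    ("A", "D"): 4,
    ("B", "C"): 95,
    ("B", "D"): 18,
    ("C", "D"): 110,
}


def contact_count(x, y):
    """Look up the contact count for a pair, regardless of key order."""
    return contacts.get((x, y), contacts.get((y, x), 0))


def greedy_order(start, names):
    """Order contigs by repeatedly appending the unplaced contig that has
    the most contacts with the contig placed last. More contacts are
    treated as a proxy for closer position along the chromosome."""
    order = [start]
    remaining = set(names) - {start}
    while remaining:
        last = order[-1]
        # sorted() makes tie-breaking deterministic
        nearest = max(sorted(remaining), key=lambda n: contact_count(last, n))
        order.append(nearest)
        remaining.remove(nearest)
    return order


order = greedy_order("A", ["A", "B", "C", "D"])
print(order)
```

A real pipeline would have to handle noise, contig length differences, and the choice of starting point, all of which this sketch ignores.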
Hi-C heatmaps can help identify parts of an assembly that are misjoined by highlighting contigs that seem to interact with more distant regions and thus may belong closer to those regions, or by uncovering contigs that need to be flipped so the end that associates with a nearby contig is next to it, she said. She and her colleagues have validated assemblies generated this way by comparing them to linkage map data.
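The two checks described above can be sketched in a few lines. This is an illustrative simplification under stated assumptions, not the logic of any published tool: the bin counts, the 25 percent threshold, and the function names are all hypothetical.

```python
import statistics


def find_candidate_misjoins(adjacent_contacts, fraction=0.25):
    """Flag junctions between neighboring bins whose contact count falls
    well below the scaffold's typical (median) adjacent-bin count.
    The threshold fraction is an arbitrary value picked for illustration."""
    typical = statistics.median(adjacent_contacts)
    cutoff = fraction * typical
    return [i for i, count in enumerate(adjacent_contacts) if count < cutoff]


def should_flip(head_to_next, tail_to_next):
    """Decide whether a contig should be reversed. If its head end has
    more contacts with the following contig than its tail end does, the
    head probably belongs next to that contig, so the contig is flipped."""
    return head_to_next > tail_to_next


# Hypothetical contact counts between consecutive bins along one scaffold.
# The sharp drop at junction 3 suggests the two sides are not truly adjacent.
adjacent = [98, 105, 101, 7, 96, 102]
misjoins = find_candidate_misjoins(adjacent)
print(misjoins)

# Hypothetical counts between each end of a contig and the next contig.
flip = should_flip(head_to_next=80, tail_to_next=12)
print(flip)
```

In practice such signals are read off a two-dimensional heatmap rather than a single list of counts, so a flagged junction would be a candidate for review rather than a confirmed error.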
Dudchenko and her colleagues developed an automated pipeline called 3D-DNA and built the Juicebox Assembly Tool interface to identify and fix genome assembly errors uncovered through Hi-C heatmaps.
The need for better assemblies came to a head, Dudchenko said, during the Zika epidemic, as there was no high-quality assembly for the main vector of the virus, the yellow fever mosquito, Aedes aegypti.
As they reported in Science last year, she and her colleagues used their Hi-C assembly approach to whittle an Ae. aegypti genome assembly of 4,756 scaffolds down to three large scaffolds that corresponded to the three Ae. aegypti chromosomes.
Additionally, in a paper posted to preprint server BioRxiv, she and her colleagues reported using Pacific Biosciences reads in conjunction with their Hi-C assembly method to generate a new Ae. aegypti reference genome. The Hi-C data identified 258 misjoins in the initial assembly, and their approach placed 94 percent of the sequenced bases onto three scaffolds that also correspond to the three Ae. aegypti chromosomes.
It’s also an inexpensive approach. Dudchenko said she and her colleagues generated de novo assemblies with chromosome-length scaffolds for three mammals for less than $1,000, as they also have reported in a separate BioRxiv preprint.
Each assembly used only Illumina reads from a short-insert DNA-seq library and an in situ Hi-C library, which she said cost less than $1,000. After generating a draft assembly with the software package w2rap, they validated and refined it using the Hi-C data. They generated 3.3-gigabase genomes for both the wombat, Vombatus ursinus, and the Virginia opossum, Didelphis virginiana, and a 2.5-gigabase genome for the raccoon, Procyon lotor. For each, they reported chromosome-length scaffolds.
Dudchenko added that she was excited to see this assembly method work so well. Because of that, she and her colleagues have started an effort to use their Hi-C assembly approach to improve assemblies within the National Center for Biotechnology Information database to get them to the chromosome level.