Toward a complete human genome sequence

DNA sequencing is the process of determining the exact order of the nucleotide bases adenine (A), cytosine (C), guanine (G), and thymine (T) in a DNA molecule. It turns genomic data into a digital form that can be analyzed computationally. The primary goal is to map the genes of different organisms, particularly humans, and to advance biological knowledge by studying the correlation between genotype and phenotype.

Fully sequencing the human genome, however, has been one of the most significant scientific challenges of recent decades, and its completion has opened the door to an era of unprecedented breakthroughs in biology and medicine.

The Human Genome Project

The Human Genome Project (HGP), launched in 1990 with the ambition of sequencing the entire human genome, was a milestone in genomic analysis. The result of an international collaboration, the HGP aimed to map all of the approximately 20,000 to 25,000 human genes and to determine the sequence of the roughly 3 billion DNA bases of the human genome. Sanger sequencing, the predominant technology during the HGP years, made a historic result possible in 2003: a high-quality draft of the human genome that covered more than 92% of the entire genome with an accuracy of more than 99.99%. This achievement not only completed one of the most ambitious scientific projects of the 20th century but also laid the foundation for decades of research, paving the way for significant advances in understanding genetic diseases and developing new therapies.

Despite its enormous success, the draft human genome published by the HGP lacked the most complex regions, including telomeres, the highly repetitive regions at the ends of chromosomes, and centromeres, the complex, repeat-rich regions between the two arms of each chromosome. This shortcoming stems mainly from the limitations of the Sanger technology of the time: although accurate base by base, its short reads could not be assembled unambiguously across long repetitive regions.

New Sequencing Technologies

Since the HGP, DNA sequencing has seen rapid technological advances. First, second-generation technologies, also known as next-generation sequencing (NGS), significantly reduced the time and cost of producing genomic data by sequencing many DNA fragments in parallel. More recently, single-molecule or third-generation sequencing, with platforms from PacBio and Oxford Nanopore, has allowed the DNA sample to be sequenced directly, eliminating much of the preprocessing. These advances have greatly improved the ability to sequence previously inaccessible genome regions.

From Telomere to Telomere

These technological advances spurred the establishment of the Telomere-to-Telomere (T2T) Consortium in 2019, which set out to fill the gaps left by the HGP and complete the human genome. In 2022, the T2T Consortium published the first complete human genome sequence, literally from telomere to telomere. Specifically, this work targeted the missing 8% and proposed a new reference sequence with nearly 200 million new bases containing 1,956 gene predictions. This milestone is the culmination of the effort that began in 1990. It should speed up the discovery of new genotype-phenotype correlations and improve the accuracy of diagnoses and the effectiveness of therapies. It also opens the door to a range of studies and applications that were previously unthinkable.

Welcome to the Pangenomic Era!

Technological advances in sequencing have lowered the cost and time of producing genomic data and have made the resulting sequences more accurate and complete. Combined with progress in computational analysis techniques, this set the stage for what has been called the beginning of the Pangenomic Era (Nature 2023). Given the quantity and quality of the data now produced, it is possible to analyze multiple genomes simultaneously, looking for correlations between individuals and investigating the genetic variability of entire populations. However, analyzing even a single genome is considerably complex, so analyzing a population of genomes clearly calls for new paradigms.

This is the goal of GenoGra, which proposes the first platform for pangenomic analysis. GenoGra enables population studies through a graph-based technology that collects a set of genomic sequences into a single network of interconnected information, highlighting genetic variability. This approach lets genomic analysis scale with the massive data throughput generated by new sequencing technologies and improves the accuracy of the analysis, bringing us ever closer to truly personalized medicine.
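To make the graph idea concrete, here is a minimal Python sketch of a toy variation graph. It is an illustration only and not GenoGra's actual implementation: the function names, the sample names, and the short sequences are all hypothetical, and the sketch assumes the genomes are already aligned and of equal length, which real pangenome tools do not require. Each node is a (position, base) pair shared by every genome that carries that base at that position, and each genome is stored as a path through the graph.

```python
from collections import defaultdict


def build_variation_graph(genomes):
    """Build a toy variation graph from equal-length, pre-aligned sequences.

    Nodes are (position, base) pairs. An edge links two nodes that are
    adjacent in at least one genome. Each genome becomes a path of nodes.
    """
    lengths = {len(seq) for seq in genomes.values()}
    if len(lengths) != 1:
        raise ValueError("this sketch assumes sequences of equal length")

    edges = defaultdict(set)  # node -> set of successor nodes
    paths = {}                # genome name -> list of nodes
    for name, seq in genomes.items():
        path = [(pos, base) for pos, base in enumerate(seq)]
        for src, dst in zip(path, path[1:]):
            edges[src].add(dst)
        paths[name] = path
    return edges, paths


def variable_sites(paths):
    """Return {position: sorted bases} for positions where genomes differ."""
    bases_at = defaultdict(set)
    for path in paths.values():
        for pos, base in path:
            bases_at[pos].add(base)
    return {pos: sorted(b) for pos, b in bases_at.items() if len(b) > 1}


# Hypothetical example data: three short, pre-aligned sequences.
genomes = {
    "sample_a": "ACGTAC",
    "sample_b": "ACCTAC",
    "sample_c": "ACGTTC",
}
edges, paths = build_variation_graph(genomes)
sites = variable_sites(paths)
node_count = len({node for path in paths.values() for node in path})

print(sites)       # positions where the samples disagree
print(node_count)  # distinct nodes needed to store all three genomes
```

In this toy input the three genomes hold 18 bases in total, but the graph needs only 8 nodes, because identical stretches are stored once and only the two variable positions branch. That sharing is the property that lets a graph representation highlight variability across a population instead of repeating what the genomes have in common.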