In the early 1990s, scientists set out to map the entire DNA sequence of the human genome.
The Human Genome Project aimed to find genetic links to diseases and to understand the function and structure of various elements of the genome, such as which genes encode proteins and what factors regulate gene expression.
The initial results of the Human Genome Project predicted 40,000 genes capable of encoding proteins, the large molecules that are vital for the proper functioning of the body’s tissues and organs.
However, as that project drew to a close in 2003, estimates for that number fell to around 20,000–25,000 protein-encoding genes.
Since then, scientists have been working to pin down the final proteome (the full set of proteins that genes can express) and to understand how the expression of these proteins is altered in various diseases.
To this end, an international team of researchers led by Michael Tress, from the Spanish National Cancer Research Centre Bioinformatics Unit in Madrid, Spain, has now examined the genes considered protein-coding by the main proteome databases available.
Tress and colleagues published the results of their research in the journal Nucleic Acids Research. Federico Abascal, of the Wellcome Trust Sanger Institute in Hinxton, United Kingdom, is the first author of the paper.
At least 2,000 genes are ‘pseudogenes’
The researchers compared the proteomes from three collections of protein sequences and genetic annotations: GENCODE/Ensembl, RefSeq, and UniProtKB.
Tress and team found that, of the 22,210 genes listed as protein-encoding, only 19,446 appeared in all three collections, leaving 2,764 that at least one collection did not list.