Hidden within us: The dark matter of the human genome

The human genome is a vast library of over 3 billion base pairs, yet whole-genome sequencing has revealed that only about 2% of them are protein-coding. This startling finding has prompted one of the most pressing missions of contemporary biology: to fully understand the role of the remaining 98% of the genome. Accomplishing this mission requires next-generation sequencing and transcription profiling to analyze the historically overlooked non-coding stretches between genes, termed genomic “dark matter.”

Traditionally, molecular biologists theorized that genomic regions lacking codons (the stepwise instructions for assembling proteins from their amino acid subunits) were functionless evolutionary artifacts, which they called “junk” DNA. However, late 20th-century advancements in molecular genomics challenged this viewpoint, as researchers demonstrated that select “junk” sequences contained transcription sites for non-coding RNAs and abundant transposable elements. The traditional protein-centric view of the genome became less accurate, and an increasing number of researchers deemed it plausible that more non-coding sequences possessed functionality. Despite these advancements, the majority of non-coding regions remained poorly characterized because neither whole-genome sequences nor profiles of the human “transcriptome” (the complete set of RNA present in cells) were available. That changed with the Human Genome Project, launched in 1990, and the subsequent Encyclopedia of DNA Elements (ENCODE) project, launched in 2003, which provided biologists with the first high-definition genomic maps and catalogs of RNA transcripts. Combined with newly available assays for gene expression, these projects illuminated dark matter sequences and revealed new insights into genomic activity.

In the past decade, studies using data from the Human Genome Project and ENCODE have demonstrated non-coding regions to be more active than initially theorized. Specifically, 80% of non-coding DNA has displayed biological activity, with evidence for at least 75% of the entire genome being capable of transcription. Further transcriptome profiling has supported this, revealing that 98.8% of transcribed RNA is non-coding. From an evolutionary perspective, such high activity in non-coding regions would amount to a tremendous waste of cellular resources if it conferred no fitness advantage. Since a cell’s resources are finite, geneticists argue that spending so much on functionless DNA would be disadvantageous and would likely have been eliminated by natural selection. This challenges traditional assumptions and suggests that many non-coding transcripts have roles in maintaining cellular function.

Evidence supporting this claim has surfaced in recent association studies linking mutations at non-coding loci to many diseases, including cancer and muscular dystrophy. Further investigation into the mechanisms behind these associations reveals that RNA transcripts from these loci are involved in regulating coding genes. These non-coding RNAs can assume multiple forms: microRNA (miRNA), piwi-interacting RNA (piRNA), small interfering RNA (siRNA), and long non-coding RNA (lncRNA) are among the most notable examples featured in recent studies. Interestingly, miRNA, piRNA, and siRNA possess remarkably similar three-dimensional structures and are all very small, generally averaging under 40 nucleotides long. Their relative simplicity allowed researchers to quickly determine that they serve critical roles throughout the cell in supporting protein translation, regulating genes through RNA interference, and performing other structural functions. Understanding the mechanisms behind these RNA molecules was a milestone in deciphering genomic dark matter, as it provided concrete examples of functional non-coding regions.

However, these small nucleic acid chains represent only a minority of the transcriptome. With lengths generally ranging between 200 and 10,000 nucleotides, lncRNA is the most variable and genomically prominent type of non-coding RNA. Due to the complexity of these molecules, research on lncRNA remains a frontier of modern genomics. So far, studies have suggested that many lncRNAs are essential to targeted epigenetic modifications of genes, serving as flags for chromatin remodeling agents. For example, RNA sequencing studies indicate that loci critical to development, such as the Hox gene clusters and Xist, rely on lncRNA to recruit chromatin-modifying proteins. Yet such studies have only begun to characterize the diverse lncRNAs in the human transcriptome, and thousands more have yet to be studied. Furthermore, understanding how these molecules connect proteins to genes with high specificity remains a challenge. Next-generation epigenetic and transcription profiling technologies will hopefully help overcome these obstacles and further demystify the non-coding regions.

The investigation into genomic dark matter has been pivotal to the advancement of modern genomics. Not only did it shift viewpoints away from a simplified protein-centric model, but it also illuminated the complexities of the genome. While scientists are far from characterizing the entirety of non-coding sequences, many important discoveries have been made. Among them is that RNA plays critical regulatory roles beyond serving as a template for protein synthesis. Further studies of non-coding RNA may yield new life-saving therapies and genetic engineering methods, and may clarify the least understood regions of the human genetic code.