What are the genomic mechanisms responsible for the creation of evolutionary novelty? The view that has emerged from the findings of molecular and developmental biology in recent decades is that the primary substrate for evolutionary change is the regulatory network between genes, rather than the emergence of new genes. For instance, animal development repeatedly utilises a relatively limited ‘toolkit’ of highly conserved families of transcription factors, signalling molecules and signal transduction components. Most morphological novelty seems to emerge via new deployments and combinations of this repertoire of pre-existing genes. No one doubts that new genes arise during evolution, but in line with the ‘predominantly regulatory’ model, most gene birth has been considered to involve duplication of existing genes, followed by their reorganisation and diversification. However, whole genome sequencing has found that every evolutionary lineage contains protein-coding genes that lack homologues in other lineages: orphan genes. In fact, as many as one third of all annotated genes are orphans. Although orphan genes can arise by duplication followed by (rapid) diversification, it appears likely that the primary mechanism for their emergence is de novo evolution from previously non-coding sequence.
De novo gene birth requires non-genic sequences first to be transcribed and to acquire open reading frames (ORFs), and these ORFs then to be translated. Any model that explains how this sequence of events could occur to an important extent has to overcome some obvious conceptual hurdles: how does non-genic sequence become translated, and wouldn’t any polypeptide translation products be insignificant? Carvunis et al. have tried to surmount these problems by postulating a model of de novo gene birth that proceeds through intermediate, reversible ‘proto-gene’ stages between the emergence of an ORF and a bona fide functional protein-coding gene.
A major problem in understanding the genome is that ORFs in genomic sequence are classified as genic or non-genic using a minimal length threshold. Although ORFs encoding functional polypeptides as small as 9 amino acids long have been discovered, the standard length threshold used to delineate genic ORFs is 300nt (equating to a 100aa protein). In the budding yeast Saccharomyces cerevisiae, ~6,000 ORFs are annotated as genes, whilst ~261,000 unannotated ORFs longer than 3 codons are considered non-genic ORFs. The majority of the S. cerevisiae genome is transcribed, and a number of putatively non-coding transcripts have been shown to associate with ribosomes. Carvunis et al. postulated that a certain amount of translation of non-genic ORFs may go on, and that although the polypeptides produced may not be functional, provided they were not toxic (and hence selected against), the proto-genes encoding them could be maintained in the genome. Proto-genes would provide adaptive potential, and a subset could be retained over time if they provided a selective advantage. New genes originating de novo would be expected to be initially shorter, less expressed and more rapidly evolving than established genes.
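To make the effect of a length cut-off concrete, here is a minimal Python sketch of an ORF scan followed by a threshold test. It is an illustration only, not the pipeline used by Carvunis et al.: the function names (`find_orfs`, `length_class`) are hypothetical, an ORF is assumed to run from an ATG to the first in-frame stop codon, only the forward strand is scanned, and length is assumed to include the stop codon.

```python
# Standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    Assumption for this sketch: an ORF is an ATG followed by the first
    in-frame stop codon, and it is kept only if it is longer than
    `min_codons` codons (stop codon included in the count).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3  # end is exclusive and includes the stop codon
                if (end - start) // 3 > min_codons:
                    orfs.append((start, end, frame))
                start = None  # look for the next ORF in this frame
    return orfs

def length_class(orf, threshold_nt=300):
    """Apply the conventional 300nt cut-off to one (start, end, frame) ORF."""
    start, end, _ = orf
    return "meets threshold" if end - start >= threshold_nt else "below threshold"

short_seq = "ATG" + "GCT" * 5 + "TAA"     # 21nt ORF
long_seq = "ATG" + "GCT" * 100 + "TGA"    # 306nt ORF
for s in (short_seq, long_seq):
    for orf in find_orfs(s):
        print(orf, length_class(orf))
```

Under these assumptions the 21nt ORF is detected but falls below the 300nt threshold, so a length-based annotation would set it aside as non-genic whether or not it is translated.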
Carvunis et al.'s model for de novo gene birth leads to a number of predictions. Firstly, there should be an evolutionary continuum between non-genic ORFs and bona fide genes with respect to such characteristics as length, expression level and sequence composition. Secondly, many non-genic ORFs should be translated. Thirdly, some recently emerged ORFs should be adaptively advantageous and hence retained by natural selection.
To test these predictions, Carvunis et al. estimated the order of emergence of S. cerevisiae ORFs, based on their level of conservation amongst ascomycete fungi. Annotated ORFs were classified into 10 groups. For instance, those ORFs found only in S. cerevisiae constituted ORFs1, which accounted for ~2% of the total. ~12% were conserved only within the four closely related Saccharomyces species (ORFs1-4). The weak conservation and poor characterisation of ORFs1-4 mean that their annotation as genes is debatable, whereas the ~88% of annotated S. cerevisiae ORFs that had homologues in more distant species (ORFs5-10) can be more confidently considered genes. The authors also classified ~108,000 unannotated ORFs longer than 30nt as having a conservation level of 0 (ORFs0). Hence, ORFs0 and ORFs1-4 were considered candidate proto-genes, whilst ORFs5-10 were classed as bona fide genes.
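The grouping described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' method: the functions `conservation_level` and `orf_class` are hypothetical, and homology detection is reduced to a ready-made set of levels (2 to 10) at which a homologue was found, with higher numbers standing for more distant species.

```python
def conservation_level(annotated, homologue_levels):
    """Assign a conservation level from 0 to 10.

    Assumptions for this sketch: unannotated ORFs get level 0 by
    definition; an annotated ORF with no homologue outside
    S. cerevisiae gets level 1; otherwise the level is that of the
    most distant group in which a homologue was detected.
    """
    if not annotated:
        return 0
    return max(homologue_levels, default=1)

def orf_class(level):
    """Map a conservation level to the two categories used in the text."""
    if not 0 <= level <= 10:
        raise ValueError("conservation level must be between 0 and 10")
    if level <= 4:
        return "candidate proto-gene"  # ORFs0 and ORFs1-4
    return "gene"                      # ORFs5-10

examples = {
    "unannotated ORF": (False, set()),
    "S. cerevisiae-specific ORF": (True, set()),
    "ORF conserved in distant species": (True, {2, 3, 7}),
}
for name, (annotated, levels) in examples.items():
    level = conservation_level(annotated, levels)
    print(name, level, orf_class(level))
```

The point of the sketch is that the category boundary sits between levels 4 and 5, so the split between candidate proto-genes and genes depends entirely on how far homologues can be traced.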
In agreement with the postulated continuum between non-genic ORFs and genes, Carvunis et al. found a positive correlation between the level of conservation and both gene length and expression level. A spectrum of sequence composition was also observed: the relative abundances of the amino acids encoded by ORFs1-4 were intermediate between those of ORFs5-10 and those of the (hypothetical) translation products of ORFs0.
To test the second prediction, Carvunis et al. used data on ribosomal occupancy to search for signatures of translation amongst ORFs0. Of these ~108,000 short, unannotated ORFs, 1,139 showed evidence of translation (termed ORFs0+).
The authors went on to measure the extent of selection operating on their classes of ORFs by comparing the genome sequences of eight different S. cerevisiae strains. ~3% of ORFs0+ and ~9-25% of ORFs1-4 were found to be under purifying selection.
Carvunis et al. therefore classify as proto-genes the set of ORFs0+ (those that showed translational activity), together with those ORFs1-4 that do not necessarily show evidence of being under purifying selection. This set amounted to 1,891 ORFs displaying characteristics intermediate between non-genic ORFs and genes.
Although Carvunis et al. found evidence to support all three of their predictions, I found the most persuasive evidence for the importance of de novo gene birth to be the following comparison: since the divergence of S. cerevisiae and S. paradoxus, between 1 and 5 novel genes have arisen by gene duplication mechanisms, whilst 19 of the 143 ORFs1 (arising de novo in the same period) were found to be under purifying selection.
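The comparison above comes down to simple arithmetic, spelled out here as a short Python sketch. The counts are those quoted in the text; the variable names are illustrative only, and treating the two counts as directly comparable is an assumption of this sketch.

```python
# Counts quoted in the text for the period since the split
# between S. cerevisiae and S. paradoxus.
orfs1_total = 143        # ORFs found only in S. cerevisiae (ORFs1)
orfs1_selected = 19      # ORFs1 under purifying selection
dup_low, dup_high = 1, 5 # range of new genes attributed to duplication

# Fraction of ORFs1 showing purifying selection.
fraction_selected = orfs1_selected / orfs1_total

# How many selected de novo ORFs there are per duplication-derived gene,
# at either end of the quoted 1 to 5 range.
ratio_low = orfs1_selected / dup_high
ratio_high = orfs1_selected / dup_low

print(f"{fraction_selected:.1%} of ORFs1 are under purifying selection")
print(f"{ratio_low:.1f} to {ratio_high:.0f} selected de novo ORFs per duplicated gene")
```

On these numbers roughly 13% of ORFs1 are under purifying selection, and they outnumber the duplication-derived genes by somewhere between about 4-fold and 19-fold.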
Perhaps the main take-home message from these analyses is that the imposition of arbitrary annotation boundaries (e.g. the 100 codon cut-off) can lead to an artifactual understanding of the genome. The findings of widespread non-coding transcription, and the potential for marginal, non-functional translation, mean that genes exist on a continuum, and their RNA and protein products exist on spectra of functionality. These ‘shades of grey’ may actually be an important source of evolutionary potential.
Carvunis AR, Rolland T, Wapinski I, Calderwood MA, Yildirim MA, Simonis N, Charloteaux B, Hidalgo CA, Barbette J, Santhanam B, Brar GA, Weissman JS, Regev A, Thierry-Mieg N, Cusick ME, & Vidal M (2012). Proto-genes and de novo gene birth. Nature PMID: 22722833
Tautz D, & Domazet-Lošo T (2011). The evolutionary origin of orphan genes. Nature reviews. Genetics, 12 (10), 692-702 PMID: 21878963