
Gene Birth, de novo

What are the genomic mechanisms responsible for the creation of evolutionary novelty? The view that has emerged from the findings of molecular and developmental biology in recent decades is that the primary substrate for evolutionary change is the regulatory network between genes, rather than the emergence of new genes. For instance, animal development repeatedly utilises a relatively limited ‘toolkit’ of highly conserved families of transcription factors, signalling molecules and signal transduction components. Most morphological novelty seems to emerge via new deployments and combinations of this repertoire of pre-existing genes. No one doubts that new genes arise during evolution, but in line with the ‘predominantly regulatory’ model, most gene birth has been considered to involve duplication of genes, followed by their reorganisation and diversification. However, whole genome sequencing has found that every evolutionary lineage contains protein-coding genes that lack homologues in other lineages: orphan genes. In fact, as many as one third of all annotated genes are orphans. Although orphan genes can arise by processes of duplication and (rapid) diversification, it appears likely that the primary mechanism for their emergence is de novo evolution from previously non-coding sequence.

De novo gene birth requires non-genic sequences first to be transcribed and to acquire open reading frames (ORFs), and these ORFs then to be translated. There are some obvious conceptual hurdles to overcome in building a model that explains how this sequence of events could occur to any significant extent: how does non-genic sequence become translated, and wouldn’t any polypeptide translation products be insignificant? Carvunis et al. have tried to surmount these problems by postulating a model of de novo gene birth that proceeds through intermediate, reversible ‘proto-gene’ stages between the emergence of an ORF and bona fide functional protein-coding genes.

A major problem in understanding the genome is that ORFs in genomic sequence are classified using a minimal length threshold. Although ORFs encoding functional polypeptides as small as 9 amino acids long have been discovered, the standard length threshold used to delineate genic ORFs is 300nt (equating to a 100aa protein). In the budding yeast Saccharomyces cerevisiae, ~6,000 ORFs are annotated as genes, whilst ~261,000 unannotated ORFs longer than 3 codons are considered non-genic ORFs. The majority of the S. cerevisiae genome is transcribed, and a number of putatively non-coding transcripts have been shown to associate with ribosomes. Carvunis et al. postulated that a certain amount of translation of non-genic ORFs may go on, and that although the polypeptides produced may not be functional, provided they were not toxic (and hence selected against), these proto-genes could be maintained in the genome. Proto-genes would provide adaptive potential, and a subset could be retained over time if they provided a selective advantage. New genes originating de novo would be expected to be initially shorter, less expressed and more rapidly evolving than established genes.
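To make the effect of a length threshold concrete, here is a minimal Python sketch of a forward-strand ORF scan. It is an illustration only, not the annotation procedure used for the yeast genome: the function name `find_orfs`, the ATG-to-stop definition of an ORF, the choice to count the stop codon in the length, and the toy sequence are all assumptions made for this example.

```python
# Illustrative sketch: ORFs are defined here as ATG ... stop, forward strand only.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_nt=0):
    """Return sorted (start, end, length_nt) tuples for ORFs of at least min_nt.

    Lengths include the stop codon (an assumption of this sketch).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                length = i + 3 - start
                if length >= min_nt:
                    orfs.append((start, i + 3, length))
                start = None
    return sorted(orfs)

# Toy sequence: a 9nt ORF followed by a 306nt ORF in a different frame.
toy = "ATGAAATAA" + "CC" + "ATG" + "GCT" * 100 + "TGA"

all_orfs = find_orfs(toy)               # both ORFs are reported
genic_only = find_orfs(toy, min_nt=300) # the short ORF silently disappears
```

With no threshold both ORFs are found; with the conventional 300nt cut-off only the long one survives, which is exactly how short, potentially translated ORFs drop out of annotation.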

Carvunis et al.’s model for de novo gene birth leads to a number of predictions. Firstly, there should be an evolutionary continuum between non-genic ORFs and bona fide genes with respect to such characteristics as length, expression level and sequence composition. Secondly, many non-genic ORFs should be translated. Thirdly, some recently emerged ORFs should be adaptively advantageous and hence retained by natural selection.

To test these predictions, Carvunis et al. estimated the order of emergence of S. cerevisiae ORFs, based on their level of conservation amongst ascomycete fungi. Annotated ORFs were classified into 10 groups. For instance, those ORFs found only in S. cerevisiae constituted ORFs1, which accounted for ~2% of the total. ~12% were only conserved within the four closely related Saccharomyces species (ORFs1-4). The weak conservation and poor characterisation of ORFs1-4 mean that their annotation as genes is debatable, whereas the ~88% of annotated S. cerevisiae ORFs that had homologues in more distant species (ORFs5-10) can be more confidently considered genes. The authors also classified ~108,000 unannotated ORFs longer than 30nt as having a conservation level of 0 (ORFs0). Hence, ORFs0 and ORFs1-4 were considered as candidate proto-genes, whilst ORFs5-10 were classed as bona fide genes.
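The grouping described above boils down to a simple rule on the conservation level, which can be sketched in Python as follows. The function name `classify_orf`, the label strings and the example ORF identifiers are hypothetical and chosen purely for illustration; the paper's actual assignment of conservation levels relied on phylogenetic comparison that is not reproduced here.

```python
from collections import Counter

def classify_orf(conservation_level):
    """Map a conservation level (0-10) to the class used in the text above.

    Levels 0-4 are candidate proto-genes; levels 5-10 are bona fide genes.
    """
    if not 0 <= conservation_level <= 10:
        raise ValueError("conservation level must be between 0 and 10")
    return "candidate proto-gene" if conservation_level <= 4 else "gene"

# Hypothetical ORFs with made-up conservation levels, for illustration only.
example_levels = {"orf_a": 0, "orf_b": 1, "orf_c": 4, "orf_d": 5, "orf_e": 10}

counts = Counter(classify_orf(level) for level in example_levels.values())
```

Here `counts` tallies how many of the example ORFs fall on each side of the level 4/5 boundary.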

In agreement with the postulated continuum between non-genic ORFs and genes, Carvunis et al. found a positive correlation between the level of conservation and both gene length and expression level. A spectrum of codon usage was also observed: the relative abundances of amino acids encoded by ORFs1-4 were intermediate between those of the (hypothetical) translation products of ORFs0 and those of ORFs5-10.

To test the second prediction, Carvunis et al. used data on ribosomal occupancy to search for signatures of translation amongst ORFs0. Of these ~108,000 short, unannotated ORFs, 1,139 showed evidence of translation (termed ORFs0+).

The authors went on to measure the extent of selection operating on their classes of ORFs by comparing the genome sequences of 8 different S. cerevisiae strains. ~3% of ORFs0+ and ~9-25% of ORFs1-4 were found to be under purifying selection.

Carvunis et al. therefore classify the set of ORFs0+ (that showed translational activity), and those ORFs1-4 that don’t necessarily show evidence of being under purifying selection, as proto-genes. This set amounted to 1,891 ORFs displaying characteristics intermediate between non-genic ORFs and genes.

Although Carvunis et al. found evidence to support all three of their predictions, I found the most persuasive evidence in favour of the importance of de novo gene birth to be the fact that, since the divergence of S. cerevisiae and S. paradoxus, between 1 and 5 novel genes have arisen by gene duplication mechanisms, whilst 19 of the 143 ORFs1 (arising de novo in the same period) were found to be under purifying selection.

Perhaps the main take-home message from these analyses is that the imposition of arbitrary annotation boundaries (e.g. the 100 codon cut-off) can lead to artifactual understandings. The findings of widespread non-coding transcription, and the potential for marginal, non-functional translation, mean that genes exist on a continuum, and their RNA and protein products exist on spectra of functionality. These ‘shades of grey’ may actually be an important source of evolutionary potential.

Carvunis AR, Rolland T, Wapinski I, Calderwood MA, Yildirim MA, Simonis N, Charloteaux B, Hidalgo CA, Barbette J, Santhanam B, Brar GA, Weissman JS, Regev A, Thierry-Mieg N, Cusick ME, & Vidal M (2012). Proto-genes and de novo gene birth. Nature PMID: 22722833

Tautz D, & Domazet-Lošo T (2011). The evolutionary origin of orphan genes. Nature reviews. Genetics, 12 (10), 692-702 PMID: 21878963

lincRNAs in development and evolution

A new study identifying hundreds of long intervening noncoding RNAs (lincRNAs) in the zebrafish shows that these molecules have important conserved roles in vertebrate development.

Thousands of loci in mammalian genomes produce capped, polyadenylated, and often spliced RNA molecules that are greater than 200nt in length yet do not encode proteins. These lincRNAs have been shown to function in a number of cellular processes including X chromosome inactivation and transcriptional regulation. The roles of the vast majority of identified lincRNAs are however unknown.

To try to identify lincRNAs in the zebrafish, Ulitsky et al. designed a pipeline that integrates several genomic datasets. The first stage defined the boundaries of transcriptional units by combining maps of the genomic locations of the 3′ termini of polyadenylated transcripts with a genome-wide chromatin state map, based on a specific chromatin modification found in gene promoters, to define 5′ ends. After subtracting any transcription units known to encode proteins or small RNAs, and comparing with datasets of transcribed sequences, 567 lincRNA genes were defined. Their approach was quite stringent, so this is an underestimate of the total number of lincRNAs, and is especially biased against those with low levels of expression or highly tissue-restricted expression.
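The subtraction step in this pipeline can be pictured as an interval-overlap filter. The Python sketch below is only a schematic of that one step under simplifying assumptions: the `Unit` type, the function names, the half-open coordinate convention and the toy coordinates are all hypothetical, and the real pipeline applied further filters that are not modelled here.

```python
from typing import NamedTuple

class Unit(NamedTuple):
    """A transcription unit on one chromosome, half-open interval [start, end)."""
    chrom: str
    start: int
    end: int

def overlaps(a, b):
    """True if two units share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end

def subtract_known(candidates, known):
    """Keep only candidate units that overlap no known protein or small-RNA unit."""
    return [u for u in candidates if not any(overlaps(u, k) for k in known)]

# Toy data, invented for illustration.
candidates = [
    Unit("chr1", 100, 900),    # overlaps a known coding gene, so it is dropped
    Unit("chr1", 2000, 3500),  # intergenic, so it is kept
    Unit("chr2", 100, 900),    # same coordinates, different chromosome, kept
]
known = [Unit("chr1", 500, 1500)]

lincrna_candidates = subtract_known(candidates, known)
```

The quadratic scan is fine for a toy example; on genome-scale data an interval tree or sorted sweep would be the more usual choice.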

Within the dataset of 567 zebrafish lincRNA genes, only 29 instances of sequence conservation with mammalian lincRNAs were identified. This sequence homology typically spanned only small portions of the transcripts (308nt on average, against an average lincRNA length of 1,951nt). However, broader features of lincRNA gene structure, such as the distribution and length of exons and introns, were better conserved. The positional relationships between lincRNA genes and neighbouring genes (synteny) were also well conserved.

Analysis of the expression of a subset of the identified lincRNAs showed that a high proportion displayed tissue-specific embryonic expression patterns, most commonly in the developing central nervous system. To enquire further into the functional significance of lincRNAs, the researchers used antisense reagents (morpholinos) to interfere with the function of two of the lincRNAs with significant mammalian homology. In both cases, morpholinos that caused defective splicing or targeted the areas of conserved sequence caused developmental defects. These morphant phenotypes could be rescued by coinjection of the properly spliced lincRNA. Importantly, they could also be rescued by injection of the orthologous human or mouse lincRNAs. This showed that the developmental functions of these lincRNAs were conserved through vertebrate evolution.

One of the most interesting aspects of this paper is the discussion of the potential mechanisms of lincRNA gene evolution. A higher proportion of zebrafish lincRNA genes show sequence homology with mammalian protein-coding sequences than with mammalian lincRNA genes. 8.6% of zebrafish lincRNAs showed sequence similarity with zebrafish protein-coding genes as well. These findings suggest that some lincRNAs originated from protein-coding genes (and vice versa). In this scenario a lincRNA gene can arise either from a pseudogene that has already lost its protein-coding function, or from a gene that maintained both protein and lincRNA coding function before losing its protein-coding ability. This raises the possibility that some mRNAs might currently carry out lincRNA-type non-coding functions.

See also: Linking a lincRNA to active chromatin

Ulitsky, I., Shkumatava, A., Jan, C., Sive, H., & Bartel, D. (2011). Conserved Function of lincRNAs in Vertebrate Embryonic Development despite Rapid Sequence Evolution Cell, 147 (7), 1537-1550 DOI: 10.1016/j.cell.2011.11.055