FANTOM


FANTOM is an international research consortium first established in 2000 as part of the RIKEN research institute in Japan. The original meeting gathered international scientists from diverse backgrounds to help annotate the function of mouse cDNA clones generated by the Hayashizaki group. Since the initial FANTOM1 effort, the consortium has released multiple projects that look to understand the mechanisms governing the regulation of mammalian genomes. Their work has generated a large collection of shared data and helped advance biochemical and bioinformatic methodologies in genomics research.

Foundation

In 1995, researchers at the RIKEN institute began creating an encyclopedia of full-length cDNAs for the mouse genome. The goal of this 'Mouse Encyclopedia Project' was to provide a functional annotation of the mouse transcriptome. This mapping would provide a valuable resource for gene discovery and for understanding disease-causing genes and homology across species. It promised to be a formidable task from the outset. The methodologies of the time were insufficient to generate full-length cDNA clones at scale, and to be useful as a resource the annotations would have to be agreed upon by experts across different disciplines.
The first goal was to develop methods that allowed generation of full length cDNA libraries. Reverse transcriptase protocols at the time had difficulties with the secondary structure of mRNA, leading to abbreviated cDNAs that were difficult to align and invited further complications in downstream analysis. To surpass this limitation, a method utilizing trehalose was developed to allow reverse transcriptase to function at a higher temperature, relaxing secondary structures. Other methods were additionally developed to assist in the construction of clonal cDNA libraries. These include a biotin-based capture system to select for full length cDNA, a novel lambda phage vector that minimized biases when delivering cDNA into a plasmid, and an iterative strategy to enrich for cDNA that had yet to be sequenced.
Sequencing began in 1998 and progressed rapidly, producing 246 cDNA libraries that encompassed 21,076 cDNA clones across a large range of mouse cells and tissues. While this stage was largely successful, further limitations were encountered at the bioinformatic level. The sequenced cDNAs were annotated in a semi-automatic manner that utilized available databases to assign genes within a Gene Ontology framework. However, many novel sequences had no meaningful matches when searched with BLAST against gene databases.
After consulting Gerry Rubin, the organizer of the first genome annotation effort for Drosophila melanogaster, it became apparent that a robust system for annotation that incorporated computational prediction and manual curation was required for the novel sequences. Desiring input from experts in bioinformatics, genetics and other scientific fields, the RIKEN group organized the first FANTOM meeting.

FANTOM1

To facilitate the annotation of the mouse cDNA clones, the RIKEN research group developed a web-based service called FANTOM+ prior to the first meeting. Users could search for motifs, view pre-computed sequence similarity scores, as well as query other public databases and integrate relevant annotations into the FANTOM database. The assignment and functional annotation of the genes required multiple bioinformatic tools and databases. Predominant tools included BLASTN/BLASTX, FASTA/FASTY, DECODER, EST-WISE and HMMER, while both nucleic acid and protein databases such as SwissProt, UniGene and NCBI-nr were utilized. Concurrently, a collaboration with the Mouse Genome Informatics group allowed the RIKEN researchers to establish a validated set of clones that were identical between the two databases.
Armed with computational methodologies and over 20,000 cDNA sequences, the RIKEN group organized the first FANTOM meeting in Tsukuba City from August 28 to September 8, 2000. A diverse group of international scientists were recruited to discuss strategies and execute the annotation of the RIKEN clones. The assembled computational procedures allowed for sequence comparison and domain analysis to assign putative function using GO terms. Redundancy of the cDNA clones presented a challenge, requiring clustering strategies and referral to the MGI validation set to identify unique clones. The RIKEN set of clones was eventually reduced to 15,295 genes, although this was cautiously considered an overestimation.
Central to the curation efforts was the creation of the RIKEN definition. This provided a hierarchical and systematic means to assign functions to the clones based upon known genes, placing priority on previously established or well-curated knowledge. The hierarchical nature of the classification allowed for consistency when a sequence was highly similar to multiple different genes. Importantly, if no sequence similarity was found, the definition assigned putative function based upon predicted protein motif signatures, coding potential and matches to expressed sequence tag databases. Only in the absence of any predicted or representative similarity would a clone be considered ‘unclassifiable.’
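The priority ordering described above can be pictured as a simple cascade of checks. The following is only an illustrative sketch: the `CloneEvidence` fields, the function name and the wording of the returned labels are assumptions made for this example and are not the actual RIKEN definition rules or vocabulary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CloneEvidence:
    """Hypothetical summary of the evidence gathered for one cDNA clone."""
    known_gene_hit: Optional[str] = None   # best match to a curated, known gene
    motif_hit: Optional[str] = None        # predicted protein motif signature
    has_coding_potential: bool = False     # predicted to encode a protein
    est_hit: Optional[str] = None          # match in an EST database


def assign_definition(ev: CloneEvidence) -> str:
    """Return a label, giving priority to the best-established evidence."""
    # 1. Well-curated knowledge about known genes takes precedence.
    if ev.known_gene_hit:
        return ev.known_gene_hit
    # 2. Otherwise fall back to predicted evidence, in a fixed order.
    if ev.motif_hit:
        return f"hypothetical protein with motif {ev.motif_hit}"
    if ev.has_coding_potential:
        return "hypothetical protein"
    if ev.est_hit:
        return f"similar to EST {ev.est_hit}"
    # 3. Only when nothing at all is found is the clone left unclassified.
    return "unclassifiable"
```

Because the checks run in a fixed order, a clone with several kinds of evidence always receives the same label, which is the consistency property the hierarchy is meant to provide.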
The collected efforts of RIKEN/FANTOM resulted in a 2001 Nature publication. The results included the assignment of the 21,076 cDNA clones to 4,012 GO terms, identification of novel mouse genes and protein motifs, detection of likely alternative spliceforms, and the discovery of mouse genes orthologous to human disease genes. Additionally, the first sequenced human genome was published a week later and incorporated FANTOM’s results to predict the number of human genes.

FANTOM2

Having established and improved upon the protocols for full-length cDNA library generation, the RIKEN group continued to add to the FANTOM collection. Modifications to their methods allowed for further selection of rare and long transcripts, enabling identification of cDNA over 4kb in length. The second FANTOM meeting occurred May 2002 - by then the number of cDNA clones had increased by 39,694 to a total of 60,770.
One insight gained from FANTOM1 was that alternative polyadenylation was common in the mouse transcriptome, meaning that 3’-end clustering led to extensive redundancy. To address this, additional sequencing of the 5’-end was performed to identify unique clones. The FANTOM2 publication contributed a substantial addition of novel protein-coding transcripts. Arguably the most notable result of FANTOM2 was that efforts to select for long and rare transcripts had revealed a significant amount of non-protein-coding RNA.
Again, the FANTOM collection proved to be a fruitful resource. The non-coding RNA were identified as antisense RNA and long non-coding RNAs, poorly understood classes of regulatory RNA. The first published sequence of the mouse genome utilized the annotations established by FANTOM. Other efforts were able to describe entire protein families, such as the G protein-coupled receptors.

FANTOM3

An ultimate goal of FANTOM is to establish gene networks that capture the regulatory interactions of transcription, and to differentiate these interactions by cell type or state. To this end, it was realized that the polymorphic nature of the 5'-end of sequences would require extensive mapping. Characterizing transcription start sites (TSSs) would allow identification of promoters and differentiation of their usage between cell types. This also meant further developments in sequencing methods were needed. While full-length mouse cDNAs continued to be generated, the RIKEN-led researchers established Cap Analysis of Gene Expression (CAGE), a technique that would drive much of their future work.

Development of CAGE

CAGE was a continuation of the concepts developed for FANTOM1 - and used extensively in the following projects - for capturing 5' mRNA caps. Unlike previous efforts to generate full-length cDNA, CAGE examines fragments, or tags, that are 20-27 nucleotides in length. This provided an economical and high-throughput means of mapping TSSs, including promoter structure and activity.
The general steps are as follows: cDNA is reverse transcribed from mRNA using random or oligo dT primers. The cap trapper method is then employed to ensure selection of full-length cDNA. This entails adding biotin to the 5' cap, and subsequent capture with streptavidin beads after an RNase digestion step to remove single-stranded RNA that has not hybridized to cDNA. Following cap trapping, the cDNA is separated from the RNA-cDNA hybrid. A double-stranded CAGE linker that is also biotinylated is ligated to the 5' end of the cDNA, and the second strand of the cDNA is synthesized. The resulting double-stranded DNA is digested with the MmeI endonuclease, cutting the CAGE linker and producing a 20-27 bp CAGE tag. A second linker is added to the 3'-end and the tag is amplified by PCR. Finally, the CAGE tags are released from the 5' and 3' linkers. The tags can then be sequenced, concatenated or cloned. At the time, CAGE was carried out using the RISA 384 capillary sequencer that had been previously established by RIKEN.
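Since each CAGE tag begins at the capped 5' end of a transcript, TSS mapping can be thought of as counting where the 5' ends of aligned tags fall on the genome. The sketch below illustrates that idea only. It assumes the tags have already been sequenced and aligned, and the `AlignedTag` record, its 0-based half-open coordinates and the function names are hypothetical choices for this example rather than part of any FANTOM pipeline.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class AlignedTag(NamedTuple):
    """Hypothetical aligned CAGE tag (0-based start, exclusive end)."""
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def five_prime_position(tag: AlignedTag) -> int:
    """Genomic coordinate of the tag's 5' end, i.e. the candidate TSS."""
    # On the minus strand the 5' end is the last aligned base.
    return tag.start if tag.strand == "+" else tag.end - 1


def count_tss(tags: Iterable[AlignedTag]) -> Counter:
    """Count tags per (chromosome, position, strand) candidate TSS."""
    counts: Counter = Counter()
    for tag in tags:
        key: Tuple[str, int, str] = (tag.chrom, five_prime_position(tag), tag.strand)
        counts[key] += 1
    return counts
```

The tag count at a position then serves as a rough measure of how often transcription initiated there in the sample.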

Discoveries

The development of CAGE gave rise to a number of milestone findings. Importantly, RNA was found to be much more abundant in the mammalian transcriptome than previously thought, accompanied by the realization that the genome was pervasively transcribed.
Combining the methods of CAGE, gene identification signatures, and gene signature cloning, the ‘transcriptional landscape’ of the mammalian genome was mapped, characterizing the pattern of transcription control signals and the transcripts they generate. It was discovered that there are many more transcripts than the estimated 22,000 genes in the mouse genome, and that many of these transcriptional units have alternative promoters and polyadenylation sites.
Furthermore, it was discovered that ‘transcriptional forests’, clusters of transcripts that share common expression regions and regulatory events, are separated by ‘transcription deserts,’ and make up ~63% of the genome. A jointly released publication found that many of the transcripts in these forests show antisense transcription, and that most sense/antisense pairs show concordant regulation. Another notable result showed that many non-coding RNAs are dynamically expressed, with many being initiated in 3’ untranslated regions, and that they are positionally conserved across species.
The third milestone paper to come out of FANTOM3 investigated mammalian promoter architecture and evolution. It established two classes of mammalian promoters. The first are TATA box-enriched promoters, with well-defined transcriptional start sites. These promoters are evolutionarily conserved and are more commonly associated with tissue-specific genes. The second and more common class of promoters, broad CpG-rich promoters, are plastic, evolvable, and expressed in a wide range of cells and tissues. This study also demonstrated that CpG-rich promoters may be bidirectional and are highly susceptible to epigenetic control, making them a potential component of adaptive evolution.
The meeting for FANTOM3 occurred in September 2004. A collection of satellite publications that spawned from FANTOM3 was published in PLoS Genetics. They include further work on promoter properties, exon length and pseudo-messenger RNA.

FANTOM4

The rise of next-generation sequencing was significantly beneficial to the advancement of CAGE technology. Using the Roche-454 sequencer, the FANTOM group developed deepCAGE, increasing the throughput of CAGE to more than a million tags per sample. At these depths, researchers could now start constructing networks of gene regulatory interactions. The FANTOM4 meeting took place in December 2006.
While previous FANTOM projects examined a range of cell types, FANTOM4's purpose was to deeply interrogate the dynamics driving cellular differentiation. Analysis was confined to a human THP-1 cell line, providing time course data of a monoblast becoming a monocyte. DeepCAGE resolved TSSs at single-nucleotide resolution, pinpointing where transcription factors (TFs) bind. By monitoring time-dependent gene expression changes as cells differentiated, the researchers inferred which regulatory motifs are predictive of expression changes, the time dependency of TF activity, and the target genes of each TF. These efforts resulted in a transcriptional regulatory network, demonstrating that the differentiation process is highly complex and driven by a large number of TFs enacting both positive and negative regulatory interactions.
FANTOM4 also increased our understanding of retrotransposon transcription and transcriptional initiation RNAs. Retrotransposons contribute to repetitive elements in mammalian genomes and can affect multiple biological processes - like genomic evolution - as well as structures, such as alternative promoters and exons. It was demonstrated that retrotransposons are expressed in a cell and tissue specific manner, and approximately 250,000 previously unknown retrotransposon-driven TSSs were identified.
It was discovered that retrotransposons can influence mammalian transcription and transcriptional regulation of both coding and non-coding RNAs in various tissues. Further efforts found a genomically and evolutionarily widespread new class of RNAs, called transcription initiation RNAs (tiRNAs). These RNAs are very short and are typically found downstream of TSSs of CpG-rich promoters. tiRNAs are low in abundance and are associated with highly expressed genes, as well as RNA polymerase II binding and TSSs. More recent work has shown that tiRNAs may be able to modulate epigenetic states and local chromatin architecture. However, it is possible that these tiRNAs do not have a regulatory role and are simply a byproduct of transcription.
Following these initial findings, an atlas of combinatorial transcriptional regulation in mouse and humans was published by the RIKEN researchers. This work demonstrated that transcriptional complexes can interact within a network to control tissue identity/cell state, and that these networks are often dominated by ‘facilitator’ transcription factors which are broadly expressed across tissues/cells. It was found that about half of the measured regulatory interactions were conserved between mouse and human. FANTOM4 led to numerous satellite papers, investigating topics like promoter architecture, miRNA regulation and genomic regulatory blocks.

FANTOM5

The fifth round of FANTOM aimed to provide insight into the regulatory landscape of the transcriptome across as many cell states as possible. It continues to be a relevant resource of shared data. The project consisted of two phases: the first focused on steady state cells, while the second focused on temporal data. Advancements in next generation sequencing were leveraged to achieve FANTOM5’s great breadth, with single molecule sequencing allowing single base pair resolution of TSS activity from as little as 100 ng of RNA. Samples were collected from every human organ, as well as over 200 cancer lines, 30 time courses of cellular differentiation, mouse development time courses, and over 200 primary cell types. In total, 1,816 human and 1,016 mouse samples were profiled across both phases.
While similar to the ENCODE Project, FANTOM5 differs in two key ways. First, ENCODE utilized immortalised cell lines, while FANTOM5 focused on primary cells and tissues, which are more reflective of the actual biological processes responsible for maintaining cell type identity. Second, ENCODE utilized multiple genomic assays to capture the transcriptome and epigenome. FANTOM5 focused solely on the transcriptome, relying on other published work to infer features like cell type as defined by chromatin status. The FANTOM5 meeting took place in October 2011.

Phase 1

The first phase of FANTOM5 involved taking ‘snapshots’ of a wide range of steady state cell types using CAGE profiling across 975 human and 399 mouse samples. This initial effort resulted in two Nature papers, one describing the mammalian promoter landscape and the other describing active enhancers. Together, they provide an atlas of promoters, enhancers and TSSs across diverse cell types, acting as a ‘baseline’ for studying the complex landscape of transcription regulation. Specifically, single molecule CAGE profiles were generated using a HeliScope sequencer across 573 human primary cell samples, 128 mouse primary cell samples, 250 cancer cell lines, 152 human post-mortem tissues and 271 mouse developmental tissue samples.
A new method to identify the CAGE peaks was developed, called decomposition peak analysis. CAGE tags are clustered by proximity, followed by independent component analysis to decompose the peaks into non-overlapping regions. An enrichment step is applied to ensure the peaks correspond to TSSs, and external data from ESTs, histone H3 lysine 4 trimethylation marks and DNase hypersensitivity sites are used to support that the peaks are genuine TSSs.
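The first step, clustering tags by proximity, can be illustrated with a short sketch. It covers only that initial grouping, not the independent component analysis or the enrichment step. The function name and the 20 bp `max_gap` threshold are assumptions chosen for the example, not the parameters used by FANTOM5.

```python
from typing import Iterable, List, Tuple


def cluster_by_proximity(positions: Iterable[int], max_gap: int = 20) -> List[Tuple[int, int, int]]:
    """Group tag 5' positions on one strand into clusters.

    A position joins the current cluster when it lies within `max_gap`
    bases of the cluster's last position; otherwise it starts a new one.
    Returns (first_position, last_position, tag_count) per cluster.
    """
    clusters: List[List[int]] = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return [(c[0], c[-1], len(c)) for c in clusters]


# Three groups of tag positions separated by gaps larger than 20 bp.
print(cluster_by_proximity([100, 105, 110, 200, 215, 500]))
# [(100, 110, 3), (200, 215, 2), (500, 500, 1)]
```

Each resulting region would then be a candidate for further decomposition into separately regulated peaks.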
A key finding showed that the typical mammalian promoter contains multiple TSSs with differing expression patterns across samples. This implied that these TSSs are regulated separately, despite being within close proximity. Ubiquitously expressed promoters had the highest conservation in their sequences, while cell-specific promoters were less conserved. A further prominent result suggested that enhancer-derived RNA are transcribed in a cell/tissue specific manner, reflective of the activity of that enhancer.

Phase 2

While the first phase was focused on a steady state representation of cell states, the second phase looked to explore the dynamic process of transitioning cell states through the use of time course data. Again, CAGE was employed - this time over 19 human and 14 mouse time courses covering a range of cell types and biological stimuli that represented 408 distinct time points. This included the differentiation of stem cell or committed progenitor cells towards their terminal fates, as well as fully differentiated cells responding to growth factors or pathogens.
Unsupervised clustering was performed to identify a set of distinct response classes, examining patterns in expression fold changes compared to time 0. In this manner, the expression of enhancers, TF promoters and non-TF promoters was generalized on a temporal scale of the first 6 hours of the time course. Generally, the earliest response of the cells occurred at enhancers, with eRNA concentrations peaking as early as 15 minutes after time 0. Even in the classes that represent ‘later’ responses, enhancers tended to activate before proximal promoters. Variability was seen in the persistence of this activation - some enhancers rapidly returned to baseline after the burst at 15 minutes, while others persisted after promoter activation. Together, this suggests that eRNA may have differential roles in regulating gene activity.
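The quantity being compared across features, expression fold change relative to time 0, can be sketched as follows. This is a minimal illustration and not the clustering procedure itself. The function names, the pseudocount of 1 and the example expression values are assumptions made up for the example, and each series is assumed to be ordered with the time 0 measurement first.

```python
import math
from typing import List, Sequence, Tuple

TimePoint = Tuple[int, float]  # (minutes after time 0, expression level)


def log2_fold_changes(series: Sequence[TimePoint], pseudocount: float = 1.0) -> List[Tuple[int, float]]:
    """Log2 fold change of every time point relative to the first (time 0)."""
    baseline = series[0][1] + pseudocount  # pseudocount avoids division by zero
    return [(t, math.log2((value + pseudocount) / baseline)) for t, value in series]


def peak_time(series: Sequence[TimePoint], pseudocount: float = 1.0) -> int:
    """Time point at which the fold change relative to time 0 is largest."""
    changes = log2_fold_changes(series, pseudocount)
    return max(changes, key=lambda item: item[1])[0]


# Invented example values: an enhancer that bursts early and a promoter that follows.
enhancer = [(0, 1.0), (15, 7.0), (60, 3.0), (360, 1.0)]
promoter = [(0, 3.0), (15, 3.0), (60, 15.0), (360, 7.0)]
print(peak_time(enhancer), peak_time(promoter))  # 15 60
```

Profiles summarized in this way could then be passed to any unsupervised clustering method to group features into response classes.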

Additional Work

Aside from the typical sharing of data on the FANTOM database, FANTOM5 also introduced two bioinformatic tools for data exploration. ZENBU is a genome browser with additional functionality: users can upload BAM files of CAGE, short-RNA and ChIP-seq experiments and perform quality control, normalization, peak finding and annotation alongside visual comparisons. SSTAR, meanwhile, allows exploration and searches of the FANTOM5 samples and their genomic features.
The bounty of data produced by FANTOM5 continues to provide a resource for researchers looking to explain the regulatory mechanisms that shape processes like development. Often CAGE data in a specific cell/tissue type is used in conjunction with further epigenomic assays - one such example describes the interplay of DNA methylation and CAGE-defined regulatory sequences during differentiation of a granulocyte.
Three years after introducing the enhancer and promoter atlases, the FANTOM group released atlases for lncRNAs and microRNAs, incorporating FANTOM5 data. An overarching goal was to provide further insight into the earlier observation of pervasive transcription of the mammalian genome. The lncRNA work characterized 27,919 human lncRNA genes across 1,829 samples to stimulate research in the functional relevance of this poorly understood class of RNA. The results suggested that 69% of the identified lncRNAs had potential functionality, although more evidence is required to comment on whether the remaining 31% are merely transcriptional ‘noise’ from spurious transcription initiation. The miRNA atlas identified 1,357 human and 804 mouse miRNA promoters and demonstrated strong sequence conservation between the two species. It was also demonstrated that primary miRNA expression could be used as a proxy for mature miRNA levels.

FANTOM6

Currently underway, FANTOM6 aims to systematically characterize the role of lncRNA in the human genome. The biological function of these large and untranslated RNA is largely unknown. Based upon the few works that have examined lncRNA, it is believed that they are involved in regulating transcription, translation, post-translational modifications, and epigenetic marks. However, current knowledge of the extent and range of these putative regulatory interactions is rudimentary.
There are numerous challenges to address for this next rendition of FANTOM. In particular, lncRNAs are ill-defined - they lack conservation and vary greatly in size, ranging from 200 to over one million nucleotides in length. Unlike coding transcripts, which are found in the cytosol for translation, lncRNAs are found primarily in the nucleus - a much more complex landscape of RNA. In general, lncRNAs have lower expression levels than coding transcripts, but there is great variability in this expression which can be obscured by cell type or localization within the nucleus. Furthermore, the functional classification of lncRNAs remains hotly debated - it is unknown if lncRNAs can be grouped based on common function/mechanisms of action, or by active domains.
FANTOM has laid out a three-pronged experimental strategy to explore these unknowns. A reference transcriptome and epigenome profile of different cell types will be constructed as a baseline for each cell type. Next, using lncRNAs identified in previous publications, FANTOM5 data and further CAGE profiling, perturbation experiments will be conducted to evaluate changes in cellular molecular phenotype. Lastly, complementary technology will be used to functionally annotate/classify a selected subset of lncRNAs. These techniques will be aimed at elucidating lncRNA secondary structure, their association with proteins and chromatin, and mapping long-range interactions of lncRNA throughout the genome.