Prof. Alan Christoffels

Position: Director of SANBI
Department: South African National Bioinformatics Institute (SANBI)
Faculty: Faculty of Natural Sciences
Qualifications: BSc (Microbiology, Biochemistry) UCT, BSc Hons (Pharmacology) UCT, MedMSc (Genetics) Stellenbosch University, PhD (Bioinformatics) UWC
ORCID: 0000-0002-0420-2916
Researcher ID: C-3269-2014
Tel: 021 959 2969
Fax: 021 959 2512
Email: alan@sanbi.ac.za

Recent submissions

Now showing 1 - 19 of 19
  • Item
    An integrated and comparative approach towards identification, characterization and functional annotation of candidate genes for drought tolerance in sorghum (Sorghum bicolor (L.) Moench)
    (BioMed Central, 2017) Woldesemayat, Adugna Abdi; Van Heusden, Peter; Ndimba, Bongani K.; Christoffels, Alan
    BACKGROUND: Drought is the most disastrous abiotic stress that severely affects agricultural productivity worldwide. Understanding the biological basis of drought-regulated traits requires identification and in-depth characterization of genetic determinants using model organisms and high-throughput technologies. However, studies on drought tolerance have generally been limited to the traditional candidate gene approach, which targets only a single gene in a pathway that is related to a trait. In this study, we used sorghum, one of the model crops that is well adapted to arid regions, to mine genes and define determinants for drought tolerance using drought expression libraries and RNA-seq data. RESULTS: We provide an integrated and comparative in silico candidate gene identification, characterization and annotation approach, with an emphasis on genes playing a prominent role in conferring drought tolerance in sorghum. A total of 470 non-redundant functionally annotated drought responsive genes (DRGs) were identified using experimental data from drought responses by employing pairwise sequence similarity searches, pathway and InterPro domain analysis, expression profiling and orthology relation. Comparison of the genomic locations between these genes and sorghum quantitative trait loci (QTLs) showed that 40% of these genes were co-localized with QTLs known for drought tolerance. The genome reannotation conducted using the Program to Assemble Spliced Alignment (PASA) resulted in 9.6% of existing single gene models being updated. In addition, 210 putative novel genes were identified using AUGUSTUS and PASA based analysis on the expression dataset. Among these, 50% were single exonic, 69.5% represented drought responsive and 5.7% were complete gene structure models. Analysis of biochemical metabolism revealed 14 metabolic pathways that are related to drought tolerance and also had a strong biological network among categories of genes involved. Identification of these pathways signifies the interplay of biochemical reactions that make up the metabolic network, constituting a fundamental interface for the sorghum defence mechanism against drought stress. CONCLUSIONS: This study suggests untapped natural variability in sorghum that could be used for developing drought tolerance. The data presented here may be regarded as an initial reference point in functional and comparative genomics in the Gramineae family.
  • Item
    Whole-genome sequencing for an enhanced understanding of genetic variation among South Africans
    (Nature Publishing Group, 2017) Choudhury, Ananyo; Christoffels, Alan; Gamieldien, Junaid; Sefid-Dashti, Mahjoubeh J.
    The Southern African Human Genome Programme is a national initiative that aspires to unlock the unique genetic character of southern African populations for a better understanding of human genetic diversity. In this pilot study the Southern African Human Genome Programme characterizes the genomes of 24 individuals (8 Coloured and 16 black southeastern Bantu-speakers) using deep whole-genome sequencing. A total of ~16 million unique variants are identified. Despite the shallow time depth since divergence between the two main southeastern Bantu-speaking groups (Nguni and Sotho-Tswana), principal component analysis and structure analysis reveal significant (p < 10−6) differentiation, and FST analysis identifies regions with high divergence. The Coloured individuals show evidence of varying proportions of admixture with Khoesan, Bantu-speakers, Europeans, and populations from the Indian sub-continent. Whole-genome sequencing data reveal extensive genomic diversity, increasing our understanding of the complex and region-specific history of African populations and highlighting its potential impact on biomedical research and genetic susceptibility to disease.
  • Item
    Genome sequence of the tsetse fly (Glossina morsitans): vector of African trypanosomiasis
    (American Association for the Advancement of Science, 2014) Christoffels, Alan; Obiero, George F.; Harkins, Gordon William
    Tsetse flies are the sole vectors of human African trypanosomiasis throughout sub-Saharan Africa. Both sexes of adult tsetse feed exclusively on blood and contribute to disease transmission. Notable differences between tsetse and other disease vectors include obligate microbial symbioses, viviparous reproduction, and lactation. Here, we describe the sequence and annotation of the 366-megabase Glossina morsitans morsitans genome. Analysis of the genome and the 12,308 predicted protein-encoding genes led to multiple discoveries, including chromosomal integrations of bacterial (Wolbachia) genome sequences, a family of lactation-specific proteins, reduced complement of host pathogen recognition proteins, and reduced olfaction/chemosensory associated genes. These genome data provide a foundation for research into trypanosomiasis prevention and yield important insights with broad implications for multiple aspects of tsetse biology.
  • Item
    Taste and odorant receptors of the coelacanth: a gene repertoire in transition
    (Wiley, 2014) Picone, Barbara; Hesse, Uljana; Panji, Sumir; Van Heusden, Peter; Jonas, Mario; Christoffels, Alan
    G-protein coupled chemosensory receptors (GPCR-CRs) aid in the perception of odors and tastes in vertebrates. So far, six GPCR-CR families have been identified that are conserved in most vertebrate species. Phylogenetic analyses indicate differing evolutionary dynamics between teleost fish and tetrapods. The coelacanth Latimeria chalumnae belongs to the lobe-finned fishes, which represent a phylogenetic link between these two groups. We searched the genome of L. chalumnae for GPCR-CRs and found that coelacanth taste receptors are more similar to those in tetrapods than in teleost fish: two coelacanth T1R2s co-segregate with the tetrapod T1R2s that recognize sweet substances, and our phylogenetic analyses indicate that the teleost T1R2s are more closely related to T1R1s (umami taste receptors) than to tetrapod T1R2s. Furthermore, coelacanths are the first fish with a large repertoire of bitter taste receptors (58 T2Rs). Considering current knowledge on the feeding habits of coelacanths, the question arises whether perception of bitter taste is the only function of these receptors. Similar to teleost fish, coelacanths have a variety of olfactory receptors (ORs) necessary for perception of water-soluble substances. However, they also have seven genes in the two tetrapod OR subfamilies predicted to recognize airborne molecules. The two coelacanth vomeronasal receptor families are larger than those in teleost fish and, similar to tetrapods, form V1R and V2R monophyletic clades. This may point to an advanced development of the vomeronasal organ as reported for lungfish. Our results show that the intermediate position of Latimeria in the phylogeny is reflected in its GPCR-CR repertoire.
  • Item
    Transcriptomic analyses reveal novel genes with sexually dimorphic expression in the zebrafish gonad and brain
    (PLoS ONE, 2008) Sreenivasan, Rajini; Cai, Minnie; Bartfai, Richard; Wang, Xingang; Orban, Laszlo; Christoffels, Alan
    Our knowledge on zebrafish reproduction is very limited. We generated a gonad-derived cDNA microarray from zebrafish and used it to analyze large-scale gene expression profiles in adult gonads and other organs. We have identified 116638 gonad-derived zebrafish expressed sequence tags (ESTs), 21% of which were isolated in our lab. Following in silico normalization, we constructed a gonad-derived microarray comprising 6370 unique, full-length cDNAs from differentiating and adult gonads. Labeled targets from adult gonad, brain, kidney and ‘rest-of-body’ from both sexes were hybridized onto the microarray. Our analyses revealed 1366, 881 and 656 differentially expressed transcripts (34.7% novel) that showed highest expression in ovary, testis and both gonads respectively. Hierarchical clustering showed correlation of the two gonadal transcriptomes and their similarities to those of the brains. In addition, we have identified 276 genes showing sexually dimorphic expression both between the brains and between the gonads. By in situ hybridization, we showed that the gonadal transcripts with the strongest array signal intensities were germline-expressed. We found that five members of the GTP-binding septin gene family, from which only one member (septin 4) has previously been implicated in reproduction in mice, were all strongly expressed in the gonads. We have generated a gonad-derived zebrafish cDNA microarray and demonstrated its usefulness in identifying genes with sexually dimorphic co-expression in both the gonads and the brains. We have also provided the first evidence of large-scale differential gene expression between female and male brains of a teleost. Our microarray would be useful for studying gonad development, differentiation and function not only in zebrafish but also in related teleosts via cross-species hybridizations. Since several genes have been shown to play similar roles in gonadogenesis in zebrafish and other vertebrates, our array may even provide information on genetic disorders affecting gonadal phenotypes and fertility in mammals.
  • Item
    Comparative analysis of testis and ovary transcriptomes in zebrafish by combining experimental and computational tools
    (Wiley, 2004) Li, Yang; Chia, Jer M.; Bartfai, Richard; Christoffels, Alan; Yue, Gen H.; Ding, Ke; Ho, Mei Y.; Hill, James A.; Stupka, Elia; Orban, Laszlo
    Studies on the zebrafish model have contributed to our understanding of several important developmental processes, especially those that can be easily studied in the embryo. However, knowledge on late events such as gonad differentiation in the zebrafish is still limited. Here we analyze the gene sets expressed in the adult zebrafish testis and ovary in an attempt to identify genes with a potential role in (zebra)fish gonad development and function. We produced 10 533 expressed sequence tags (ESTs) from zebrafish testis or ovary and downloaded an additional 23 642 gonad-derived sequences from the zebrafish EST database. We clustered these sequences together with over 13 000 kidney-derived zebrafish ESTs to study partial transcriptomes for these three organs. We searched for genes with gonad-specific expression by screening macroarrays containing at least 2600 unique cDNA inserts with testis-, ovary- and kidney-derived cDNA probes. Clones hybridizing to only one of the two gonad probes were selected, and subsequently screened with computational tools to identify 72 genes with potentially testis-specific and 97 genes with potentially ovary-specific expression, respectively. PCR-amplification confirmed gonad-specificity for 21 of the 45 clones tested (all without known function). Our study, which involves over 47 000 EST sequences and specialized cDNA arrays, is the first analysis of adult organ transcriptomes of zebrafish at such a scale. The study of genes expressed in adult zebrafish testis and ovary will provide useful information on regulation of gene expression in teleost gonads and might also contribute to our understanding of the development and differentiation of reproductive organs in vertebrates.
  • Item
    Mice and men: Their promoter properties
    (PLoS Genetics, 2006) Bajic, Vladimir B.; Tan, Sin Lam; Christoffels, Alan; Schonbach, Christian; Lipovich, Leonard; Yang, Liang; Hofmann, Oliver; Kruger, Adele; Hide, Winston; Kai, Chikatoshi; Kawai, Jun; Hume, David A.; Carninci, Piero; Hayashizaki, Yoshihide
    Using the two largest collections of Mus musculus and Homo sapiens transcription start sites (TSSs) determined based on CAGE tags, ditags, full-length cDNAs, and other transcript data, we describe the compositional landscape surrounding TSSs with the aim of gaining better insight into the properties of mammalian promoters. We classified TSSs into four types based on compositional properties of regions immediately surrounding them. These properties highlighted distinctive features in the extended core promoters that helped us delineate boundaries of the transcription initiation domain space for both species. The TSS types were analyzed for associations with initiating dinucleotides, CpG islands, TATA boxes, and an extensive collection of statistically significant cis-elements in mouse and human. We found that different TSS types show preferences for different sets of initiating dinucleotides and cis-elements. Through Gene Ontology and eVOC categories and tissue expression libraries we linked TSS characteristics to expression. Moreover, we show a link of TSS characteristics to very specific genomic organization in an example of immune-response-related genes (GO:0006955). Our results shed light on the global properties of the two transcriptomes not revealed before and therefore provide the framework for better understanding of the transcriptional mechanisms in the two species, as well as a framework for development of new and more efficient promoter- and gene-finding tools.
  • Item
    The African coelacanth genome provides insights into tetrapod evolution
    (Macmillan Publishers, 2013) Christoffels, Alan; Hesse, Uljana; Gamieldien, Junaid; Panji, Sumir; Picone, Barbara; Van Heusden, Peter
    The discovery of a living coelacanth specimen in 1938 was remarkable, as this lineage of lobe-finned fish was thought to have become extinct 70 million years ago. The modern coelacanth looks remarkably similar to many of its ancient relatives, and its evolutionary proximity to our own fish ancestors provides a glimpse of the fish that first walked on land. Here we report the genome sequence of the African coelacanth, Latimeria chalumnae. Through a phylogenomic analysis, we conclude that the lungfish, and not the coelacanth, is the closest living relative of tetrapods. Coelacanth protein-coding genes are significantly more slowly evolving than those of tetrapods, unlike other genomic features. Analyses of changes in genes and regulatory elements during the vertebrate adaptation to land highlight genes involved in immunity, nitrogen excretion and the development of fins, tail, ear, eye, brain and olfaction. Functional assays of enhancers involved in the fin-to-limb transition and in the emergence of extra-embryonic tissues show the importance of the coelacanth genome as a blueprint for understanding tetrapod evolution.
  • Item
    Genome-wide SNP identification by high-throughput sequencing and selective mapping allows sequence assembly positioning using a framework genetic linkage map
    (BioMed Central, 2010) Celton, Jean M.; Christoffels, Alan; Sargant, Daniel J.; Xu, Xiangming; Rees, Jasper G.
    Determining the position and order of contigs and scaffolds from a genome assembly within an organism’s genome remains a technical challenge in a majority of sequencing projects. In order to exploit contemporary technologies for DNA sequencing, we developed a strategy for whole genome single nucleotide polymorphism sequencing allowing the positioning of sequence contigs onto a linkage map using the bin mapping method. The strategy was tested on a draft genome of the fungal pathogen Venturia inaequalis, the causal agent of apple scab, and further validated using sequence contigs derived from the diploid plant genome Fragaria vesca. Using our novel method we were able to anchor 70% and 92% of sequence assemblies for V. inaequalis and F. vesca, respectively, to genetic linkage maps. We demonstrated the utility of this approach by accurately determining the bin map positions of the majority of the large sequence contigs from each genome sequence and validated our method by mapping single sequence repeat markers derived from sequence contigs on a full mapping population.
  • Item
    DAMPD: a manually curated antimicrobial peptide database
    (Oxford University Press, 2012) Sundararajan, Vijayaraghava S.; Gabere, Musa N.; Pretorius, Ashley; Adam, Saleem; Christoffels, Alan; Minna, Lehvaslaiho; Archer, John A.C.; Bajic, Vladimir B.
    The demand for antimicrobial peptides (AMPs) is rising because of the increased occurrence of pathogens that are tolerant or resistant to conventional antibiotics. Since naturally occurring AMPs could serve as templates for the development of new anti-infectious agents to which pathogens are not resistant, a resource that contains relevant information on AMPs is of great interest. To that end, we developed the Dragon Antimicrobial Peptide Database (DAMPD, http://apps.sanbi.ac.za/dampd) that contains 1232 manually curated AMPs. DAMPD is an update and a replacement of the ANTIMIC database. In DAMPD, an integrated interface allows simple querying based on taxonomy, species, AMP family, citation, keywords and a combination of search terms and fields (Advanced Search). A number of tools such as Blast, ClustalW, HMMER, Hydrocalculator, SignalP and AMP predictor, as well as a number of other resources that provide additional information about the results, are also provided and integrated into DAMPD to augment biological analysis of AMPs.
  • Item
    A uniquely African focus: bioinformatics is a field that has grown exponentially in the past few years, and it is becoming increasingly important to collaborate as the field continues to gather momentum
    (2015) Christoffels, Alan
    Bioinformatics is a field that has grown exponentially in the past few years, and it is becoming increasingly important to collaborate as the field continues to gather momentum.
  • Item
    Gonad differentiation in zebrafish is regulated by the canonical Wnt signalling pathway
    (Society for the Study of Reproduction, 2014) Sreenivasan, Rajini; Jiang, Junhai; Wang, Xingang; Bartfai, Richard; Kwan, Hsiao Y.; Christoffels, Alan; Orban, Laszlo
    Zebrafish males undergo a “juvenile ovary-to-testis” gonadal transformation process. Several genes, including nuclear receptor subfamily 5, group A (nr5a) and anti-Müllerian hormone (amh), and pathways such as Tp53-mediated germ-cell apoptosis have been implicated in zebrafish testis formation. However, our knowledge of the regulation of this complex process is incomplete, and much remains to be investigated about the molecular pathways and network of genes that control it. Using a microarray-based analysis of transforming zebrafish male gonads, we demonstrated that their transcriptomes undergo transition from an ovary-like pattern to an ovotestis to a testis-like profile. Microarray results also validated the previous histological and immunohistochemical observation that there is high variation in the duration and extent of commitment to the juvenile ovary phase among individuals. Interestingly, global gene expression profiling of diverging zebrafish juvenile ovaries and transforming ovotestes revealed that some members of the canonical Wnt/beta-catenin signaling pathway were differentially expressed between these two phases. To investigate whether Wnt/beta-catenin signaling plays a role in zebrafish gonad differentiation, we used the Tg (hsp70l:dkk1b-GFP)w32 line to inhibit Wnt/beta-catenin signaling during gonad differentiation. Activation of dkk1b-GFP expression by heat shock resulted in an increased proportion of males and corresponding decrease in gonadal aromatase gene (cyp19a1a) expression. The Wnt target gene, lymphocyte enhancer binding factor 1 (lef1), was also down-regulated in the process. Together, these results provide the first functional evidence that, similarly to mammals, Wnt/beta-catenin signaling is a “pro-female” pathway that regulates gonad differentiation in zebrafish.
  • Item
    Molecular evolution of key receptor genes in primates and non-human primates
    (Science Publishing Group, 2014) Picone, Barbara; Christoffels, Alan
    African primates remain an unexplored source of information needed to complete our understanding of the origin and evolution of many human pathogens. Current studies have shown the importance of several human receptor genes implicated in host resistance or susceptibility to tuberculosis. The validation of these genes in Mycobacterium tuberculosis infection makes them an excellent model system to investigate the mode of selective pressures that may act on pathogen defense genes. To trace the evolutionary history of these genes, the report describes preliminary results for eight human receptor genes having either a significant or a possible association with tuberculosis (TB). Using a combination of maximum likelihood approaches, evidence of positive selection was detected for four genes. The analysis between species, nevertheless, shows a clear pattern of nucleotide variation mostly compatible with purifying selection.
  • Item
    Human African trypanosomiasis research gets boost: unravelling the tsetse genome
    (PLOS, 2014) Aksoy, Serap; Attardo, Geoffrey; Berriman, Matthew; Christoffels, Alan; Lehane, Mike; Masiga, Daniel K.; Touré, Yeya
    Human African trypanosomiasis (HAT), also known as sleeping sickness, is a neglected disease that impacts 70 million people distributed over 1.55 million km² in sub-Saharan Africa. Trypanosoma brucei gambiense accounts for almost 90% of the infections in central and western Africa, the remaining infections being from T. b. rhodesiense in eastern Africa. Furthermore, the animal diseases caused by related parasites inflict major economic losses on countries already strained. The parasites are transmitted to the mammalian hosts through the bite of an infected tsetse fly.
  • Item
    A glance at quality score: implication for de novo transcriptome reconstruction of Illumina reads
    (Frontiers, 2014) Mbandi, Stanley K.; Hesse, Uljana; Rees, Jasper G.; Christoffels, Alan
    Downstream analyses of short reads from next-generation sequencing platforms are often preceded by a pre-processing step that removes uncalled and wrongly called bases. Standard approaches rely on the associated base quality scores to retain a read, or a portion of it, when the score is above a predefined threshold. It is difficult to differentiate sequencing error from biological variation without a reference using a quality score. The effects of quality score based trimming have not been systematically studied in de novo transcriptome assembly. Using RNA-Seq data produced from Illumina, we teased out the effects of quality score based filtering or trimming on de novo transcriptome reconstruction. We showed that assemblies produced from reads subjected to different quality score thresholds contain truncated and missing transfrags when compared to those from untrimmed reads. Our data support the fact that de novo assembling of untrimmed data is challenging for de Bruijn graph assemblers. However, our results indicate that comparing the assemblies from untrimmed and trimmed read subsets can suggest appropriate filtering parameters and enable selection of the optimum de novo transcriptome assembly in non-model organisms.
  • Item
    Evolution and structural analysis of Glossina morsitans (Diptera: Glossinidae) tetraspanins
    (MDPI, Basel, Switzerland, 2014) Murungi, Edwin K.; Kariithi, Henry M.; Adunga, Vincent; Obonya, Meshack; Christoffels, Alan
    Tetraspanins are important conserved integral membrane proteins expressed in many organisms. Although there is limited knowledge about the full repertoire, evolution and structural characteristics of individual members in various organisms, data obtained so far show that tetraspanins play major roles in membrane biology, visual processing, memory, olfactory signal processing, and mechanosensory antennal inputs. Thus, these proteins are potential targets for control of insect pests. Here, we report that the genome of the tsetse fly, Glossina morsitans (Diptera: Glossinidae), encodes at least seventeen tetraspanins (GmTsps), all containing the signature features found in the tetraspanin superfamily members. Whereas six of the GmTsps have been previously reported, eleven could be classified as novel because their amino acid sequences do not map to characterized tetraspanins in the available protein databases. We present a model of the GmTsps by using GmTsp42Ed, whose presence and expression have been recently detected by transcriptomics and proteomics analyses of G. morsitans. Phylogenetically, the identified GmTsps segregate into three major clusters. Structurally, the GmTsps are largely similar to vertebrate tetraspanins. In view of the exploitation of tetraspanins by organisms for survival, these proteins could be targeted using specific antibodies, recombinant large extracellular loop (LEL) domains, small-molecule mimetics and siRNAs as potential novel and efficacious putative targets to combat African trypanosomiasis by killing the tsetse fly vector.
  • Item
    Odorant and gustatory receptors in the tsetse fly Glossina morsitans morsitans
    (PLOS, 2014) Obiero, George F.; Nyanjom, Steven R. G.; Mireji, Paul O.; Christoffels, Alan; Robertson, Hugh M.; Masiga, Daniel K.
    Tsetse flies use olfactory and gustatory responses, through odorant and gustatory receptors (ORs and GRs), to interact with their environment. Glossina morsitans morsitans genome ORs and GRs were annotated using homologs of these genes in Drosophila melanogaster and an ab initio approach based on OR and GR specific motifs in G. m. morsitans gene models coupled to gene ontology (GO). Phylogenetic relationships among the ORs or GRs and the homologs were determined using Maximum Likelihood estimates. Relative expression levels among the G. m. morsitans ORs or GRs were established using RNA-seq data derived from adult female fly. Overall, 46 and 14 putative G. m. morsitans ORs and GRs respectively were recovered. These were reduced by 12 and 59 ORs and GRs respectively compared to D. melanogaster. Six of the ORs were homologous to a single D. melanogaster OR (DmOr67d) associated with mating deterrence in females. Sweet taste GRs, present in all the other Diptera, were not recovered in G. m. morsitans. The GRs associated with detection of CO2 were conserved in G. m. morsitans relative to D. melanogaster. RNA-sequence data analysis revealed expression of GmmOR15 locus represented over 90% of expression profiles for the ORs. The G. m. morsitans ORs or GRs were phylogenetically closer to those in D. melanogaster than to other insects assessed. We found the chemoreceptor repertoire in G. m. morsitans smaller than other Diptera, and we postulate that this may be related to the restricted diet of blood-meal for both sexes of tsetse flies. However, the clade of some specific receptors has been expanded, indicative of their potential importance in chemoreception in the tsetse.
  • Item
    International Glossina Genome Initiative 2004-2014: a driver for post-genomic era research on the African continent
    (PLOS, 2014) Christoffels, Alan; Masiga, Daniel K.; Berriman, Matthew; Lehane, Mike; Touré, Yeya; Aksoy, Serap
    Human African trypanosomiasis (HAT), also known as sleeping sickness, is a neglected disease that impacts 70 million people distributed over 1.55 million km² in sub-Saharan Africa and includes at least 50% of the population of the Democratic Republic of the Congo [1]. Trypanosoma brucei gambiense accounts for more than 98% of the infections in central and West Africa, the remaining infections being from Trypanosoma brucei rhodesiense in East Africa [2]. The parasites are transmitted to the hosts through the bite of an infected tsetse fly. Disease control is challenging as there are no vaccines, and effective, easily delivered drugs are still lacking. Treatment invariably involves lengthy hospitalization, with both medical and socioeconomic consequences.
  • Item
    Inferring bona fide transfrags in RNA-Seq derived-transcriptome assemblies of non-model organisms
    (BioMed Central, 2015) Mbandi, Stanley K.; Hesse, Uljana; Van Heusden, Peter; Christoffels, Alan
    Background: De novo transcriptome assembly of short transcribed fragments (transfrags) produced from sequencing-by-synthesis technologies often results in redundant datasets with differing levels of unassembled, partially assembled or mis-assembled transcripts. Post-assembly processing intended to reduce redundancy typically involves reassembly or clustering of assembled sequences. However, these approaches are mostly based on common word heuristics and often create clusters of biologically unrelated sequences, resulting in loss of unique transfrag annotations and propagation of mis-assemblies. Results: Here, we propose a structured framework that consists of a few steps in pipeline architecture for Inferring Functionally Relevant Assembly-derived Transcripts (IFRAT). IFRAT combines 1) removal of identical subsequences, 2) error tolerant CDS prediction, 3) identification of coding potential, and 4) complements BLAST with a multiple domain architecture annotation that reduces non-specific domain annotation. We demonstrate that independent of the assembler, IFRAT selects bona fide transfrags (with CDS and coding potential) from the transcriptome assembly of a model organism without relying on post-assembly clustering or reassembly. The robustness of IFRAT is inferred on RNA-Seq data of Neurospora crassa assembled using de Bruijn graph-based assemblers, in single (Trinity and Oases-25) and multiple (Oases-Merge and additive or pooled) k-mer modes. Single k-mer assemblies contained fewer transfrags compared to the multiple k-mer assemblies. However, Trinity identified a comparable number of predicted coding sequences and gene loci to the Oases pooled assembly. IFRAT selects bona fide transfrags representing over 94% of cumulative BLAST-derived functional annotations of the unfiltered assemblies. Between 4% and 6% are lost when orphan transfrags are excluded, and this represents only a tiny fraction of annotation derived from functional transference by sequence similarity. The median length of bona fide transfrags ranged from 1.5 kb (Trinity) to 2 kb (Oases), which is consistent with the average coding sequence length in fungi. The fraction of transfrags that could be associated with gene ontology terms ranged from 33% to 50%, which is also high for domain based annotation. We showed that unselected transfrags were mostly truncated and represent sequences from intronic, untranslated (5′ and 3′) regions and non-coding gene loci. Conclusions: IFRAT simplifies post-assembly processing, providing a reference transcriptome enriched with functionally relevant assembly-derived transcripts for non-model organisms.