South African National Bioinformatics Institute (SANBI)
News
How South Africa can stop HIV drug resistance in its tracks (22 May 2015)
UWC team slashes cost of drug test (11 June 2015)
UWC HIV drug resistance test innovation (18 February 2016)
Item Generation of a human gene index and its application to disease candidacy (University of the Western Cape, 2001) Christoffels, Alan; Hide, Winston; Faculty of Science
With easy access to technology to generate expressed sequence tags (ESTs), several groups have each sequenced thousands of ESTs. These ESTs benefit from consolidation and organization to deliver significant biological value. A number of EST projects are underway to extract maximum value from fragmented EST resources by constructing gene indices, where all transcripts are partitioned into index classes such that transcripts are put into the same index class if they represent the same gene. A gene index should therefore ideally represent a non-redundant set of transcripts. Indeed, most gene indices aim to reconstruct the gene complement of a genome, and their technological developments are directed at achieving this goal. The South African National Bioinformatics Institute (SANBI), on the other hand, embarked on the development of the Sequence Tag Alignment and Consensus Knowledgebase (STACK) database, which focused on the detection and visualisation of transcript variation in the context of developmental and pathological states, using all publicly available ESTs. Preliminary work on the STACK project partitioned the EST data into arbitrarily chosen tissue categories as a means of reducing the EST sequences to manageable sizes for subsequent processing. The tissue partitioning provided the template material for developing error-checking tools to analyse the information embedded in the error-laden EST sequences.
However, tissue partitioning increases redundancy in the sequence data because one gene can be expressed in multiple tissues, with the result that multiple tissue-partitioned transcripts will correspond to the same gene. Therefore, the sequence data represented by each tissue category had to be merged in order to obtain a comprehensive view of expressed transcript variation across all available tissues. The need to consolidate all EST information provided the impetus for developing a STACK human gene index, also referred to as a whole-body index. In this dissertation, I report on the development of a STACK human gene index represented by consensus transcripts where all constituent ESTs sample single or multiple tissues in order to provide the correct developmental and pathological context for investigating sequence variation. Furthermore, the human gene index is assessed as a resource for disease candidate gene discovery. A feasible approach to construction of a whole-body index required the ability to process error-prone EST data in excess of one million sequences (1,198,607 ESTs as of December 1998). In the absence of new clustering algorithms at that time, we successfully ported D2_CLUSTER, an EST clustering algorithm, to the high-performance shared multiprocessor machine, Origin2000. Improvements to the parallelised version of D2_CLUSTER included: (i) the ability to cluster sequences on as many as 126 processors (for example, 462000 ESTs were clustered in 31 hours on 126 R10000 MHz processors, Origin2000); (ii) enhanced memory management that allowed for clustering of mRNA sequences as long as 83000 base pairs; (iii) the ability to have the input sequence data accessible to all processors, allowing rapid access to the sequences; (iv) a restart module that allowed a job to be restarted if it was interrupted.
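The idea of partitioning transcripts into index classes can be illustrated with a minimal sketch. It is not D2_CLUSTER: the linking criterion (a minimum number of shared k-mers), the function names and the parameters `k` and `min_shared` are all assumptions chosen for illustration. Sequences that are linked, directly or through intermediates, end up in the same cluster.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8, min_shared=5):
    """Group sequences into clusters by transitive closure.

    Two sequences are linked when they share at least min_shared
    k-mers (a hypothetical criterion, much simpler than the d2
    word-count distance). Linked sequences are merged with union-find.
    Returns a sorted list of clusters, each a list of input indices.
    """
    parent = list(range(len(ests)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    words = [kmers(s, k) for s in ests]
    # All-against-all comparison: quadratic, fine only for a toy example.
    for i, j in combinations(range(len(ests)), 2):
        if len(words[i] & words[j]) >= min_shared:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(ests)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

The all-against-all loop is the part that becomes expensive at the scale of a million sequences, which is where a parallel implementation such as the one described in the abstract becomes necessary.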
The successful enhancements to the parallelised version of D2_CLUSTER, as listed above, allowed for the processing of EST datasets in excess of 1 million sequences. A hierarchical approach was adopted where 1,198,607 ESTs from GenBank release 110 (October 1998) were partitioned into "tissue bins" and each tissue bin was processed through a pipeline that included masking for contaminants, clustering, assembly, assembly analysis and consensus generation. A total of 478,707 consensus transcripts were generated for all the tissue categories and these sequences served as the input data for the generation of the whole-body index sequences. The clustering of all tissue-derived consensus transcripts was followed by the collapse of each consensus sequence to its individual ESTs prior to assembly and whole-body index consensus sequence generation. The hierarchical approach demonstrated a consolidation of the input EST data from 1,198,607 ESTs to 69,158 multi-sequence clusters and 162,439 singletons (or individual ESTs). Chromosomal locations were added to 25,793 whole-body index sequences through assignment of genetic markers such as radiation hybrid markers and Généthon markers. The whole-body index sequences were made available to the research community through a sequence-based search engine (http://ziggy.sanbi.ac.za/~alan/researchINDEX.html).
Item The contribution of exon-skipping events on chromosome 22 to protein coding diversity (Cold Spring Harbor Laboratory Press, 2001) Hide, Winston A.; Babenko, Vladimir N.; van Heusden, Peter A.
Completion of the human genome sequence provides evidence for a gene count with a lower bound of 30,000–40,000. Significant protein complexity may derive in part from multiple transcript isoforms. Recent EST-based studies have revealed that alternate transcription, including alternative splicing, polyadenylation and transcription start sites, occurs within at least 30–40% of human genes.
Transcript form surveys have yet to integrate the genomic context, expression, frequency, and contribution to protein diversity of isoform variation. We determine here the degree to which protein coding diversity may be influenced by alternate expression of transcripts by exhaustive manual confirmation of genome sequence annotation, and comparison to available transcript data to accurately associate skipped exon isoforms with genomic sequence. Relative expression levels of transcripts are estimated from EST database representation. The rigorous in silico method accurately identifies exon skipping using verified genome sequence. 545 genes have been studied in this first hand-curated assessment of exon skipping on chromosome 22.
Item HIV subtype C diversity: analysis of the relationship of sequence diversity to proposed epitope locations (University of the Western Cape, 2002) Ernstoff, Elana Ann; Hide, Winston; Seoighe, Cathal; South African National Bioinformatics Institute (SANBI); Faculty of Science
Southern Africa is facing one of the most serious HIV epidemics. This project contributes to the HIVNET, Network for Prevention Trials cohort for vaccine development. HIV's biology and rapid mutation rate have made vaccine design difficult. We examined HIV-1 subtype C diversity and how it relates to CTL epitope location along viral gag sequences. We found a negative correlation between codon sites under positive selection and epitope regions, suggesting epitope regions are evolutionarily conserved. It is possible that epitopes exist in non-conserved regions, yet fail to be detected due to the reference strain diverging from the circulating viral population. To test if CTL clustering is an artifact of the reference strain, we calculated differences between the gag codons and the reference strain. We found a weak negative correlation, suggesting epitopes in less conserved regions may be evading detection.
Locating conserved and optimal epitopes that can be recognized by CTLs is essential for the design of vaccine reagents.
Item HIV Subtype C Diversity: Analysis of the Relationship of Sequence Diversity to Proposed Epitope Locations (University of the Western Cape, 2002) Ernstoff, Elana Ann; Hide, Winston
Southern Africa is facing one of the most serious HIV epidemics. This project contributes to the HIVNET, Network for Prevention Trials cohort for vaccine development. HIV's biology and rapid mutation rate have made vaccine design difficult. We examined HIV-1 subtype C diversity and how it relates to CTL epitope location along viral gag sequences. We found a negative correlation between codon sites under positive selection and epitope regions, suggesting epitope regions are evolutionarily conserved. It is possible that epitopes exist in non-conserved regions, yet fail to be detected due to the reference strain diverging from the circulating viral population. To test if CTL clustering is an artifact of the reference strain, we calculated differences between the gag codons and the reference strain. We found a weak negative correlation, suggesting epitopes in less conserved regions may be evading detection. Locating conserved and optimal epitopes that can be recognized by CTLs is essential for the design of vaccine reagents.
Item Analyses of sequence divergence using completely sequenced genomes (University of the Western Cape, 2003) Nembaware, Victoria P.; Seoighe, Cathal
Using the complete genome of Saccharomyces cerevisiae, which duplicated after its speciation from Kluyveromyces lactis, a dataset of 119 putative S. cerevisiae - K. lactis ortholog pairs was constructed. S. cerevisiae paralogous pairs that are likely to have duplicated during the whole genome duplication of S. cerevisiae were obtained, and the approach taken in our previous work (Nembaware et al., 2002) was repeated to test whether the presence of a paralogue in S. cerevisiae had an effect on the rate of sequence divergence of the 119 pairs of orthologous genes.
We found, however, that substitutions at synonymous sites had reached saturation and this prevented us from being able to repeat the previous finding with S. cerevisiae and K. lactis. From this study, a publicly available web server (http://hamlyn.sanbi.ac.za/~victoria) that automates the calculation of Ka:Ks values given pairs of homologous CDS sequences is presented.
Item Assessment of genome visualization tools relevant to HIV genome research: development of a genome browser prototype (University of the Western Cape, 2004) Boardman, Anelda Philine; Hide, Winston; Faculty of Science
Over the past two decades of HIV research, effective vaccine candidates have been elusive. Traditionally, viral research has been characterized by a gene-by-gene approach, but in light of the availability of complete genome sequences and the tractable size of the HIV genome, a genomic approach may improve insight into the biology and epidemiology of this virus. A genomic approach to finding HIV vaccine candidates can be facilitated by the use of genome sequence visualization. Genome browsers have been used extensively by various groups to shed light on the biology and evolution of several organisms including human, mouse, rat, Drosophila and C. elegans. Application of a genome browser to HIV genomes and related annotations can yield insight into forces that drive evolution, identify highly conserved regions as well as regions that yield a strong immune response in patients, and track mutations that appear over the course of infection. Access to graphical representations of such information is bound to support the search for effective HIV vaccine candidates. This study aimed to determine whether a tool or application exists that can be modified to serve as a platform for development of an HIV visualization application, and to assess the viability of such an implementation.
Existing applications can only be assessed for their suitability as a basis for development of an HIV genome browser once a well-defined set of assessment criteria has been compiled.
Item Comparative analysis of testis and ovary transcriptomes in zebrafish by combining experimental and computational tools (Wiley, 2004) Li, Yang; Chia, Jer, M; Bartfai, Richard; Christoffels, Alan; Yue, Gen, H; Ding, Ke; Ho, Mei, Y; Hill, James, A; Stupka, Elia; Orban, Laszlo
Studies on the zebrafish model have contributed to our understanding of several important developmental processes, especially those that can be easily studied in the embryo. However, knowledge of late events such as gonad differentiation in the zebrafish is still limited. Here we performed an analysis of the gene sets expressed in the adult zebrafish testis and ovary in an attempt to identify genes with a potential role in (zebra)fish gonad development and function. We produced 10 533 expressed sequence tags (ESTs) from zebrafish testis or ovary and downloaded an additional 23 642 gonad-derived sequences from the zebrafish EST database. We clustered these sequences together with over 13 000 kidney-derived zebrafish ESTs to study partial transcriptomes for these three organs. We searched for genes with gonad-specific expression by screening macroarrays containing at least 2600 unique cDNA inserts with testis-, ovary- and kidney-derived cDNA probes. Clones hybridizing to only one of the two gonad probes were selected, and subsequently screened with computational tools to identify 72 genes with potentially testis-specific and 97 genes with potentially ovary-specific expression, respectively. PCR amplification confirmed gonad specificity for 21 of the 45 clones tested (all without known function). Our study, which involves over 47 000 EST sequences and specialized cDNA arrays, is the first analysis of adult organ transcriptomes of zebrafish at such a scale.
The study of genes expressed in adult zebrafish testis and ovary will provide useful information on regulation of gene expression in teleost gonads and might also contribute to our understanding of the development and differentiation of reproductive organs in vertebrates.
Item Opportunities in Africa for training in genome science (Academic Journals, 2004) Masiga, Daniel K.; Isokpehi, Raphael D.
Genome science is a new type of biology that unites genetics, molecular biology, computational biology and bioinformatics. The availability of the human genome sequence, as well as the genome sequences of several other organisms relevant to health, agriculture and the environment in Africa, necessitates the development and delivery of several types and levels of training that will enhance the use of genome data and the associated computational resources. A survey of initiatives that provide opportunities for training in genome science is presented. Current efforts to increase the ability of African scientists to computationally process and analyse genomic and post-genomic data have the potential to produce excellent scientists who perform cutting-edge, hypothesis-based research, and who will accelerate the continent's scientific and technological development.
Item FRAGS: Estimation of coding sequence substitution rates from fragmentary data (BMC, 2004) Swart, Estienne C; Hide, Winston A; Seoighe, Cathal
Rates of substitution in protein-coding sequences can provide important insights into evolutionary processes that are of biomedical and theoretical interest. Increased availability of coding sequence data has enabled researchers to estimate more accurately the coding sequence divergence of pairs of organisms. However, the use of different data sources, alignment protocols and methods to estimate substitution rates leads to widely varying estimates of key parameters that define the coding sequence divergence of orthologous genes.
Although complete genome sequence data are not available for all organisms, fragmentary sequence data can provide accurate estimates of substitution rates provided that an appropriate and consistent methodology is used and that differences in the estimates obtainable from different data sources are taken into account.
Item Detection of positive selection resulting from Nevirapine treatment in longitudinal HIV-1 reverse transcriptase sequences (University of the Western Cape, 2006) Ketwaroo, Bibi Farahnaz K.; Hide, Winston; Seoighe, Cathal; Scheffle, Konrad; South African National Bioinformatics Institute (SANBI); Faculty of Science
Nevirapine (NVP) is a cheap anti-retroviral drug used in poor countries worldwide, administered to pregnant women at the onset of labour to inhibit the HIV enzyme reverse transcriptase. Viruses which may get transmitted to newborns are deficient in this enzyme, and HIV-1 infection cannot be established, thereby preventing mother-to-child transmission (MTCT). In some cases, babies get infected and positive selection for viruses resistant to nevirapine may be inferred. Positive selection can be inferred from sequence data when the rate of nonsynonymous substitutions is significantly greater than the rate of synonymous substitutions. Unfortunately, it is found that available positive selection methods should not be used to analyse before- and after-NVP treatment sequence pairs associated with MTCT. Methods which use phylogenetic trees to infer positive selection trace synonymous and nonsynonymous substitutions further back in time than the short period during which selection for NVP resistance occurred. The other group of methods for inferring positive selection, the pairwise methods, do not have appreciable power, because they average substitution rates over all codons in a sequence pair and not just at single codons.
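The distinction between synonymous and nonsynonymous changes can be made concrete with a small sketch. It only counts differing codons between two aligned, in-frame sequences and classifies each by whether the encoded amino acid changes under the standard genetic code. It is an illustration, not any method from this thesis: the function name `count_changes` is hypothetical, and real rate estimates also normalise by the number of synonymous and nonsynonymous sites and correct for multiple substitutions.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_changes(before, after):
    """Count differing codons between two aligned coding sequences.

    Returns (synonymous, nonsynonymous): a differing codon pair is
    synonymous if both codons encode the same amino acid, otherwise
    nonsynonymous. Identical codons are ignored.
    """
    if len(before) != len(after) or len(before) % 3:
        raise ValueError("sequences must be aligned, in frame and of equal length")
    syn = nonsyn = 0
    for i in range(0, len(before), 3):
        c1, c2 = before[i:i + 3], after[i:i + 3]
        if c1 == c2:
            continue
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn
```

For instance, comparing `"AAAGTATAC"` with `"AACGTGTAC"` gives one nonsynonymous change (AAA to AAC, lysine to asparagine) and one synonymous change (GTA to GTG, both valine).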
We introduce a simple counting method, which we call the Pairwise Homologous Codons (PHoCs) method, with which we have inferred positive selection resulting from NVP treatment in longitudinal HIV-1 reverse transcriptase sequences. The PHoCs method estimates rates of substitutions between before- and after-NVP treatment codons, using a simple pairwise method.
Item High performance computing and algorithm development: application of dataset development to algorithm parameterization (University of the Western Cape, 2006) Jonas, Mario Ricardo Edward; Hide, Winston A; South African National Bioinformatics Institute (SANBI); Faculty of Science
A number of technologies exist that capture data from biological systems. In addition, several computational tools, which aim to organize the data resulting from these technologies, have been created. The ability of these tools to organize the information into biologically meaningful results, however, needs to be stringently tested. The research contained herein focuses on data produced by technology that records short Expressed Sequence Tags (ESTs).
Item Mice and men: Their promoter properties (PLoS Genetics, 2006) Bajic, Vladimir B.; Tan, Sin lam; Christoffels, Alan; Schonbach, Christian; Lipovich, Leonard; Yang, Liang; Hofmann, Oliver; Kruger, Adele; Hide, Winston; Kai, Chikatoshi; Kawai, Jun; Hume, David, A.; Carninci, Piero; Hayashizaki, Yoshihide
Using the two largest collections of Mus musculus and Homo sapiens transcription start sites (TSSs) determined based on CAGE tags, ditags, full-length cDNAs, and other transcript data, we describe the compositional landscape surrounding TSSs with the aim of gaining better insight into the properties of mammalian promoters. We classified TSSs into four types based on compositional properties of regions immediately surrounding them.
These properties highlighted distinctive features in the extended core promoters that helped us delineate boundaries of the transcription initiation domain space for both species. The TSS types were analyzed for associations with initiating dinucleotides, CpG islands, TATA boxes, and an extensive collection of statistically significant cis-elements in mouse and human. We found that different TSS types show preferences for different sets of initiating dinucleotides and cis-elements. Through Gene Ontology and eVOC categories and tissue expression libraries we linked TSS characteristics to expression. Moreover, we show a link of TSS characteristics to very specific genomic organization in an example of immune-response-related genes (GO:0006955). Our results shed light on the global properties of the two transcriptomes not revealed before and therefore provide the framework for better understanding of the transcriptional mechanisms in the two species, as well as a framework for development of new and more efficient promoter- and gene-finding tools.
Item miRNAMatcher: High throughput miRNA discovery using regular expressions obtained via a genetic algorithm (University of the Western Cape, 2008) Duvenage, Eugene; Bajic, Vladimir; Faculty of Science
In summary, techniques to discover miRNA currently exist; however, both require many calculations to be performed during identification, limiting their use at a genomic level. Machine learning techniques currently provide the best results by combining a number of calculated and statistically derived features to identify miRNA candidates; however, almost all of these still include computationally intensive secondary-structure calculations.
It is the aim of this project to produce a miRNA identification process that minimises and simplifies the computational elements required during the identification process.
Item Market segmentation and factors affecting stock returns on the JSE (2008) Chimanga, Artwell S.; Kotze, Danelle
This study examines the relationship between stock returns and market segmentation. Monthly returns of stocks listed on the JSE from 1997 to 2007 are analysed using mainly factor analysis and cluster analysis techniques. Evidence supporting the use of multi-index models in explaining the return-generating process on the JSE is found. The results provide additional support for Van Rensburg's (1997) hypothesis on market segmentation on the JSE.
Item Computational verification of published human mutations (University of the Western Cape, 2008) Kamanu, Frederick Kinyua; Lehväslaiho, Heikki; Bajic, Vladimir; Faculty of Science
The completion of the Human Genome Project, a remarkable feat by any measure, has provided over three billion bases of reference nucleotides for comparative studies. The next, and perhaps more challenging, step is to analyse sequence variation and relate this information to important phenotypes. Most human sequence variations are characterized by structural complexity and are, hence, associated with abnormal functional dynamics. This thesis covers the assembly of a computational platform for verifying these variations, based on accurate, published, experimental data.
Item DDESC: Dragon database for exploration of sodium channels in human (BMC Cancer, 2008) Sagar, Sunil; Kaur, Mandeep; Dawe, Adam; Seshadri, Sundararajan V.; Christoffels, Alan; Schaefer, Ulf; Radovanovic, Aleksander; Bajic, Vladimir B.
Sodium channels are heteromultimeric, integral membrane proteins that belong to a superfamily of ion channels.
Mutations in genes encoding sodium channel proteins have been linked with several inherited genetic disorders such as febrile epilepsy, Brugada syndrome, ventricular fibrillation, long QT syndrome, or channelopathy-associated insensitivity to pain. In spite of the significant effects that sodium channel proteins/genes could have on human health, there is no publicly available resource focused on sodium channels that would support exploration of sodium channel related information. We report here the Dragon Database for Exploration of Sodium Channels in Human (DDESC), which provides comprehensive information related to sodium channels regarding different entities, such as "genes and proteins", "metabolites and enzymes", "toxins", "chemicals with pharmacological effects", "disease concepts", "human anatomy", "pathways and pathway reactions" and their potential links. DDESC is compiled based on text- and data-mining. It allows users to explore potential associations between different entities related to sodium channels in human, as well as to automatically generate novel hypotheses. DDESC is the first publicly available resource where the information related to sodium channels in human can be explored at different levels.
Item Transcriptomic analyses reveal novel genes with sexually dimorphic expression in the zebrafish gonad and brain (PLoS ONE, 2008) Sreenivasan, Rajini; Cai, Minnie; Bartfai, Richard; Wang, Xingang; Orban, Laszlo; Christoffels, Alan
Our knowledge of zebrafish reproduction is very limited. We generated a gonad-derived cDNA microarray from zebrafish and used it to analyze large-scale gene expression profiles in adult gonads and other organs. We have identified 116638 gonad-derived zebrafish expressed sequence tags (ESTs), 21% of which were isolated in our lab. Following in silico normalization, we constructed a gonad-derived microarray comprising 6370 unique, full-length cDNAs from differentiating and adult gonads.
Labeled targets from adult gonad, brain, kidney and ‘rest-of-body’ from both sexes were hybridized onto the microarray. Our analyses revealed 1366, 881 and 656 differentially expressed transcripts (34.7% novel) that showed highest expression in ovary, testis and both gonads respectively. Hierarchical clustering showed correlation of the two gonadal transcriptomes and their similarities to those of the brains. In addition, we have identified 276 genes showing sexually dimorphic expression both between the brains and between the gonads. By in situ hybridization, we showed that the gonadal transcripts with the strongest array signal intensities were germline-expressed. We found that five members of the GTP-binding septin gene family, from which only one member (septin 4) has previously been implicated in reproduction in mice, were all strongly expressed in the gonads. We have generated a gonad-derived zebrafish cDNA microarray and demonstrated its usefulness in identifying genes with sexually dimorphic co-expression in both the gonads and the brains. We have also provided the first evidence of large-scale differential gene expression between female and male brains of a teleost. Our microarray would be useful for studying gonad development, differentiation and function not only in zebrafish but also in related teleosts via cross-species hybridizations. 
Since several genes have been shown to play similar roles in gonadogenesis in zebrafish and other vertebrates, our array may even provide information on genetic disorders affecting gonadal phenotypes and fertility in mammals.
Item Development and implementation of ontology-based systems for mammalian gene expression profiling (University of the Western Cape, 2009) Kruger, Adele; Hide, Winston
The use of ontologies in the mapping of gene expression events provides an effective and comparable method to determine the expression profile of an entire genome across a large collection of experiments derived from different expression sources. In this dissertation I describe the development of the developmental human and mouse eVOC ontologies and demonstrate the ontologies by identifying genes showing a bias for developmental brain expression in human and mouse, identifying transcription factor complexes, and exploring the mouse orthologs of human cancer/testis genes.
Item An evolutionary genomics approach towards analysis of genes implicated in transmission of trypanosomes between tsetse fly and mammalian host (2009) Mwangi, Sarah Wambui; Christoffels, Alan
Human African trypanosomiasis is the world's third most important parasitic disease affecting human health, after malaria and schistosomiasis. The World Health Organization estimates that approximately 60 million people are at risk in sub-Saharan Africa and that trypanosomiasis causes up to 50,000 deaths per year. Current management of human African trypanosomiasis relies on active surveillance and chemotherapy of infected patients. Efforts to develop a vaccine to immunize the human host have been hampered by antigenic variation of the parasite's cell coat. The advent of the genome era has opened up opportunities for developing novel strategies for interrupting the transmission cycle of trypanosomes, specifically using any of the three players: the human host, the tsetse fly vector and/or the parasite.
The human genome has been deciphered and the genomes of several trypanosome species have been sequenced. Sequencing of additional neglected trypanosome species is in progress. The tsetse fly genome is currently being sequenced as part of the genomic activities of the International Glossina Genome Initiative (IGGI). In an attempt to support the tsetse fly sequencing effort, expressed sequence tags (ESTs) from various tissues and developmental stages of Glossina morsitans have been generated. In this study, tsetse fly EST data was analyzed using bioinformatics approaches, focusing on transcripts encoding serpin genes implicated in the immune defenses of tsetse flies. Glossina morsitans homologues of Drosophila melanogaster serpin4, serpin5 and serpin27A and of Anopheles gambiae serpin10 were identified in the tsetse fly EST contigs. Comparison of the reactive center loop of tsetse fly serpins with human α-1-antitrypsin suggests that these tsetse serpins are inhibitory. Preliminary EST clustering did not succeed in assembling 3564 Tsal-encoded ESTs into one contig. In this study, these ESTs were assembled together with three published Tsal cDNAs. A total of 29 Tsal-encoded contigs were generated. An analysis of the sequence variation within the Tsal EST assembled contigs identified five single base mismatches, namely A-T, T-A, G-T and T-G. Results from this study form a basis on which genetic and biochemical experimental studies can be designed, a process that will be successfully carried out once we have a reference genome. Specifically, such studies could aim at genetically modifying tsetse fly populations so that they are inhospitable to trypanosomes.
Ultimately, this will supplement current vector control strategies towards elimination of human African trypanosomiasis.
Item Concept Based Knowledge Discovery from Biomedical Literature (University of the Western Cape, 2009) Radovanovic, Aleksandar; Bajic, Vladimir; Faculty of Science
This thesis describes and introduces novel methods for knowledge discovery and presents a software system that is able to extract information from biomedical literature, review interesting connections between various biomedical concepts and, in so doing, generate new hypotheses. The experimental results obtained by using the methods described in this thesis are compared to currently published results obtained by other methods, and a number of case studies are described. This thesis shows how the technology presented can be integrated with the researcher's own knowledge, experimentation and observations for optimal progression of scientific research.