Philosophiae Doctor - PhD (Bioinformatics)
Browsing by Issue Date
Now showing 1 - 20 of 40
Item Generation of a human gene index and its application to disease candidacy (University of the Western Cape, 2001) Christoffels, Alan; Hide, Winston; Faculty of Science
With easy access to technology to generate expressed sequence tags (ESTs), several groups have sequenced from thousands to several thousands of ESTs. These ESTs benefit from consolidation and organization to deliver significant biological value. A number of EST projects are underway to extract maximum value from fragmented EST resources by constructing gene indices, where all transcripts are partitioned into index classes such that transcripts are put into the same index class if they represent the same gene. Therefore a gene index should ideally represent a non-redundant set of transcripts. Indeed, most gene indices aim to reconstruct the gene complement of a genome and their technological developments are directed at achieving this goal. The South African National Bioinformatics Institute (SANBI), on the other hand, embarked on the development of the sequence alignment and consensus knowledgebase (STACK) database that focused on the detection and visualisation of transcript variation in the context of developmental and pathological states, using all publicly available ESTs. Preliminary work on the STACK project employed an approach of partitioning the EST data into arbitrarily chosen tissue categories as a means of reducing the EST sequences to manageable sizes for subsequent processing. The tissue partitioning provided the template material for developing error-checking tools to analyse the information embedded in the error-laden EST sequences. 
However, tissue partitioning increases redundancy in the sequence data because one gene can be expressed in multiple tissues, with the result that multiple tissue-partitioned transcripts will correspond to the same gene. Therefore, the sequence data represented by each tissue category had to be merged in order to obtain a comprehensive view of expressed transcript variation across all available tissues. The need to consolidate all EST information provided the impetus for developing a STACK human gene index, also referred to as a whole-body index. In this dissertation, I report on the development of a STACK human gene index represented by consensus transcripts where all constituent ESTs sample single or multiple tissues in order to provide the correct developmental and pathological context for investigating sequence variation. Furthermore, the availability of a human gene index is assessed as a disease candidate gene discovery resource. A feasible approach to construction of a whole-body index required the ability to process error-prone EST data in excess of one million sequences (1,198,607 ESTs as of December 1998). In the absence of new clustering algorithms at that time, we successfully ported D2_CLUSTER, an EST clustering algorithm, to the high performance shared multiprocessor machine, Origin2000. Improvements to the parallelised version of D2_CLUSTER included: (i) ability to cluster sequences on as many as 126 processors. For example, 462,000 ESTs were clustered in 31 hours on 126 R10000 MHz processors, Origin2000. (ii) enhanced memory management that allowed for clustering of mRNA sequences as long as 83,000 base pairs. (iii) ability to have the input sequence data accessible to all processors, allowing rapid access to the sequences. (iv) a restart module that allowed a job to be restarted if it was interrupted. 
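The word-based comparison that underlies this kind of EST clustering can be sketched as follows. This is a minimal Python illustration, not the D2_CLUSTER implementation: it assumes a simplified d2-style distance (the sum of squared differences in word counts over whole sequences, without the windowing a production tool would use) and joins sequences by single linkage. The function names, the word length `k` and the `threshold` value are hypothetical choices made for the example.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k=6):
    """Count every overlapping word of length k in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_distance(a, b, k=6):
    """Sum of squared differences in word counts between two sequences."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum((ca[w] - cb[w]) ** 2 for w in ca.keys() | cb.keys())


def cluster(seqs, k=6, threshold=10):
    """Single-linkage clustering: join sequences whose distance <= threshold."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if d2_distance(seqs[i], seqs[j], k) <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

The all-against-all loop is quadratic in the number of sequences, which gives a sense of why a dataset of more than a million ESTs called for a parallel machine and for partitioning the input.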
The successful enhancements to the parallelised version of D2_CLUSTER, as listed above, allowed for the processing of EST datasets in excess of 1 million sequences. A hierarchical approach was adopted where 1,198,607 ESTs from GenBank release 110 (October 1998) were partitioned into "tissue bins" and each tissue bin was processed through a pipeline that included masking for contaminants, clustering, assembly, assembly analysis and consensus generation. A total of 478,707 consensus transcripts were generated for all the tissue categories and these sequences served as the input data for the generation of the whole-body index sequences. The clustering of all tissue-derived consensus transcripts was followed by the collapse of each consensus sequence to its individual ESTs prior to assembly and whole-body index consensus sequence generation. The hierarchical approach demonstrated a consolidation of the input EST data from 1,198,607 ESTs to 69,158 multi-sequence clusters and 162,439 singletons (or individual ESTs). Chromosomal locations were added to 25,793 whole-body index sequences through assignment of genetic markers such as radiation hybrid markers and Généthon markers. The whole-body index sequences were made available to the research community through a sequence-based search engine (http://ziggy.sanbi.ac.za/~alan/researchINDEX.html).
Item Detection of positive selection resulting from Nevirapine treatment in longitudinal HIV-1 reverse transcriptase sequences (University of the Western Cape, 2006) Ketwaroo, Bibi Farahnaz K.; Hide, Winston; Seoighe, Cathal; Scheffle, Konrad; South African National Bioinformatics Institute (SANBI); Faculty of Science
Nevirapine (NVP) is a cheap anti-retroviral drug used in poor countries worldwide, administered to pregnant women at the onset of labour to inhibit the HIV enzyme reverse transcriptase. 
Viruses which may get transmitted to newborns are deficient in this enzyme, and HIV-1 infection cannot be established, thereby preventing mother-to-child transmission (MTCT). In some cases, babies get infected and positive selection for viruses resistant to nevirapine may be inferred. Positive selection can be inferred from sequence data when the rate of nonsynonymous substitutions is significantly greater than the rate of synonymous substitutions. Unfortunately, it is found that available positive selection methods should not be used to analyse before- and after-NVP treatment sequence pairs associated with MTCT. Methods which use phylogenetic trees to infer positive selection trace synonymous and nonsynonymous substitutions further back in time than the short time duration during which selection for NVP occurred. The other group of methods for inferring positive selection, the pairwise methods, do not have appreciable power, because they average substitution rates over all codons in a sequence pair and not just at single codons. We introduce a simple counting method which we call the Pairwise Homologous Codons (PHoCs) method, with which we have inferred positive selection resulting from NVP treatment in longitudinal HIV-1 reverse transcriptase sequences. The PHoCs method estimates rates of substitutions between before- and after-NVP treatment codons, using a simple pairwise method.
Item Concept Based Knowledge Discovery from Biomedical Literature (University of the Western Cape, 2009) Radovanovic, Aleksandar; Bajic, Vladimir; Faculty of Science
This thesis describes and introduces novel methods for knowledge discovery and presents a software system that is able to extract information from biomedical literature, review interesting connections between various biomedical concepts and, in so doing, generate new hypotheses. 
The experimental results obtained by using the methods described in this thesis are compared to currently published results obtained by other methods, and a number of case studies are described. This thesis shows how the technology presented can be integrated with the researchers' own knowledge, experimentation and observations for optimal progression of scientific research.
Item Development and implementation of ontology-based systems for mammalian gene expression profiling (2009) Kruger, Adéle; Hide, Winston
The use of ontologies in the mapping of gene expression events provides an effective and comparable method to determine the expression profile of an entire genome across a large collection of experiments derived from different expression sources. In this dissertation I describe the development of the developmental human and mouse eVOC ontologies and demonstrate the ontologies by identifying genes showing a bias for developmental brain expression in human and mouse, identifying transcription factor complexes, and exploring the mouse orthologs of human cancer/testis genes. Model organisms represent an important resource for understanding the fundamental aspects of mammalian biology. Mapping of biological phenomena between model organisms is complex and, if it is to be meaningful, a simplified representation can be a powerful means for comparison. The implementation of the ontologies has been illustrated here in two ways. Firstly, the ontologies have been used to illustrate methods to determine clusters of genes showing tissue-restricted expression in humans. The identification of tissue-restricted genes within an organism serves as an indication of the fine-tuning in the regulation of gene expression in a given tissue. Secondly, due to the differences in human and mouse gene expression on a temporal and spatial level, the ontologies were used to identify mouse orthologs of human cancer/testis genes showing cancer/testis characteristics. 
With the use of model systems such as mouse in the development of gene-targeted drugs in the treatment of disease, it is important to establish that the expression characteristics and profiles of a drug target in the model system are representative of the characteristics of the target in the system for which it is intended.
Item Computational analyses on transcriptional regulation in mammals (University of the Western Cape, 2009) Schmeier, Sebastian; Bajic, Vladimir
The genomes of various organisms have been sequenced and their transcriptomes elucidated. With the information about genes and gene products readily available, it has become of the utmost importance to decipher the underlying biological mechanisms that are involved in the transcriptional control of these genes. Transcription initiation is a fundamental process in living cells. It involves the interaction of transcription factors with DNA to regulate the transcription of a gene. Despite significant research during the last few decades into transcription factors and their role in gene regulation, we are still far from understanding the complete transcriptional machinery that acts within biological systems. In this dissertation two computational approaches are presented to contribute to a better understanding of the transcriptional control of genes in mammals. The first addresses the transcriptional regulation of microRNA genes and its influence on microRNA gene expression during monocytic differentiation. This is the first large-scale approach to decipher how microRNA genes are regulated by transcription factors during monocytic differentiation. The second approach relates to combinatorial gene regulation and the physical interaction of transcription factors. Here, a computational approach is used together with a novel form of numerical representation of transcription factors to predict their interactions. 
In this setup, the information necessary to predict the transcription factor interactions is kept at the lowest level to minimize the data acquisition overhead that often occurs in computational prediction tasks. Both approaches enhance our insights into transcriptional control and have an impact on the further study of gene regulation.
Item "Development and implementation of ontology-based systems for mammalian gene expression profiling" (University of the Western Cape, 2009) Kruger, Adele; Hide, Winston
The use of ontologies in the mapping of gene expression events provides an effective and comparable method to determine the expression profile of an entire genome across a large collection of experiments derived from different expression sources. In this dissertation I describe the development of the developmental human and mouse eVOC ontologies and demonstrate the ontologies by identifying genes showing a bias for developmental brain expression in human and mouse, identifying transcription factor complexes, and exploring the mouse orthologs of human cancer/testis genes. Model organisms represent an important resource for understanding the fundamental aspects of mammalian biology. Mapping of biological phenomena between model organisms is complex and, if it is to be meaningful, a simplified representation can be a powerful means for comparison. The implementation of the ontologies has been illustrated here in two ways. Firstly, the ontologies have been used to illustrate methods to determine clusters of genes showing tissue-restricted expression in humans. The identification of tissue-restricted genes within an organism serves as an indication of the fine-tuning in the regulation of gene expression in a given tissue. Secondly, due to the differences in human and mouse gene expression on a temporal and spatial level, the ontologies were used to identify mouse orthologs of human cancer/testis genes showing cancer/testis characteristics. 
With the use of model systems such as mouse in the development of gene-targeted drugs in the treatment of disease, it is important to establish that the expression characteristics and profiles of a drug target in the model system are representative of the characteristics of the target in the system for which it is intended.
Item Concept Based Knowledge Discovery From Biomedical Literature (University of the Western Cape, 2009) Radovanovic, Aleksandar; Bajic, Vladimir
Advancement in biomedical research and continuous growth of scientific literature available in electronic form call for innovative methods and tools for information management, knowledge discovery, and data integration. Many biomedical fields such as genomics, proteomics, metabolomics, genetics, and emerging disciplines like systems biology and conceptual biology require synergy between experimental, computational, data mining and text mining technologies. The large amount of biomedical information available in various repositories, such as the US National Library of Medicine Bibliographic Database, emerges as a potential source of textual data for knowledge discovery. Text mining, with its application of natural language processing and machine learning technologies to problems of knowledge discovery, is one of the most challenging fields in bioinformatics. This thesis describes and introduces novel methods for knowledge discovery and presents a software system that is able to extract information from biomedical literature, review interesting connections between various biomedical concepts and, in so doing, generate new hypotheses. The experimental results obtained by using the methods described in this thesis are compared to currently published results obtained by other methods, and a number of case studies are described. 
This thesis shows how the technology presented can be integrated with the researchers' own knowledge, experimentation and observations for optimal progression of scientific research.
Item Development of a hepatitis C virus knowledgebase with computational prediction of functional hypothesis of therapeutic relevance (2011) Samuel, Kojo Kwofie; Bajic, Vladimir; Christoffels, Alan
To ameliorate Hepatitis C Virus (HCV) therapeutic and diagnostic challenges requires robust intervention strategies, including approaches that leverage the plethora of rich data published in biomedical literature to gain greater understanding of HCV pathobiological mechanisms. The multitudes of metadata originating from HCV clinical trials as well as low and high-throughput experiments embedded in text corpora can be mined as data sources for the implementation of HCV-specific resources. HCV-customized resources may support the generation of worthy and testable hypotheses and reveal potential research clues to augment the pursuit of efficient diagnostic biomarkers and therapeutic targets. This research thesis reports the development of two freely available HCV-specific web-based resources: (i) Dragon Exploratory System on Hepatitis C Virus (DESHCV) accessible via http://apps.sanbi.ac.za/DESHCV/ or http://cbrc.kaust.edu.sa/deshcv/ and (ii) Hepatitis C Virus Protein Interaction Database (HCVpro) accessible via http://apps.sanbi.ac.za/hcvpro/ or http://cbrc.kaust.edu.sa/hcvpro/. DESHCV is a text mining system implemented using named concept recognition and co-occurrence-based approaches to computationally analyze about 32,000 HCV-related abstracts obtained from PubMed. As part of DESHCV development, the pre-constructed dictionaries of the Dragon Exploratory System (DES) were enriched with HCV biomedical concepts, including HCV proteins, name variants and symbols to enable HCV knowledge-specific exploration. The DESHCV query inputs consist of user-defined keywords, phrases and concepts. 
DESHCV is therefore an information extraction tool that enables users to computationally generate associations between concepts and supports the prediction of potential hypotheses with diagnostic and therapeutic relevance. Additionally, users can retrieve a list of abstracts containing tagged concepts that can be used to overcome the herculean task of manual biocuration. DESHCV has been used to simulate the previously reported thalidomide-chronic hepatitis C hypothesis and also to model a potentially novel thalidomide-amantadine hypothesis. HCVpro is a relational knowledgebase dedicated to housing experimentally detected HCV-HCV and HCV-human protein interaction information obtained from other databases and curated from biomedical journal articles. Additionally, the database contains consolidated biological information consisting of hepatocellular carcinoma (HCC) related genes, comprehensive reviews on HCV biology and drug development, functional genomics and molecular biology data, and cross-referenced links to canonical pathways and other essential biomedical databases. Users can retrieve enriched information including interaction metadata from HCVpro by using protein identifiers, gene chromosomal locations, experiment types used in detecting the interactions, PubMed IDs of journal articles reporting the interactions, annotated protein interaction IDs from external databases, and via “string searches”. The utility of HCVpro has been demonstrated by harnessing integrated data to suggest putative baseline clues that seem to support current diagnostic exploratory efforts directed towards vimentin. Furthermore, eight genes (ACLY, AZGP1, DDX3X, FGG, H19, SIAH1, SERPING1 and THBS1) have been recommended for possible investigation to evaluate their diagnostic potential. 
The data archived in HCVpro can be utilized to support protein-protein interaction network-based candidate HCC gene prioritization for possible validation by experimental biologists.
Item Development of a comprehensive annotation and curation framework for analysis of Glossina Morsitans Morsitans expressed sequence tags (University of the Western Cape, 2011) Wamalwa, Mark; Christoffels, Alan; South African National Bioinformatics Institute (SANBI); Faculty of Science
This study has successfully identified transcripts differentially expressed in the salivary gland and midgut and provides candidate genes that are critical to the response to parasite invasion. Furthermore, an open-source Glossina resource (G-ESTMAP) was developed that provides interactive features and browsing of functional genomics data for researchers working in the field of Trypanosomiasis on the African continent.
Item Prediction of antimicrobial peptides using hyperparameter optimized support vector machines (University of the Western Cape, 2011) Gabere, Musa Nur; Vladimir, Bajic; Christoffels, Alan; South African National Bioinformatics Institute (SANBI); Faculty of Science
Antimicrobial peptides (AMPs) play a key role in the innate immune response. They can be ubiquitously found in a wide range of eukaryotes including mammals, amphibians, insects, plants, and protozoa. In lower organisms, AMPs function merely as antibiotics by permeabilizing cell membranes and lysing invading microbes. Prediction of antimicrobial peptides is important because experimental methods used in characterizing AMPs are costly, time consuming and resource intensive, and identification of AMPs in insects can serve as a template for the design of novel antibiotics. In order to fulfil this, firstly, data on antimicrobial peptides is extracted from UniProt, manually curated and stored in a centralized database called the Dragon Antimicrobial Peptide Database (DAMPD). 
Secondly, based on the curated data, models to predict antimicrobial peptides are created using support vector machines (SVMs) with optimized hyperparameters. In particular, global optimization methods such as grid search, pattern search and derivative-free methods are utilised to optimize the SVM hyperparameters. These models are useful in characterizing unknown antimicrobial peptides. Finally, a webserver is created that will be used to predict antimicrobial peptides in haematophagous insects such as Glossina morsitans and Anopheles gambiae.
Item Development of a Hepatitis C Virus knowledgebase with computational prediction of functional hypothesis of therapeutic relevance (University of the Western Cape, 2011) Kojo, Kwofie Samuel; Bajic, Vladimir; Christoffels, Alan; South African National Bioinformatics Institute (SANBI)
To ameliorate Hepatitis C Virus (HCV) therapeutic and diagnostic challenges requires robust intervention strategies, including approaches that leverage the plethora of rich data published in biomedical literature to gain greater understanding of HCV pathobiological mechanisms. The multitudes of metadata originating from HCV clinical trials as well as low and high-throughput experiments embedded in text corpora can be mined as data sources for the implementation of HCV-specific resources. HCV-customized resources may support the generation of worthy and testable hypotheses and reveal potential research clues to augment the pursuit of efficient diagnostic biomarkers and therapeutic targets. This research thesis reports the development of two freely available HCV-specific web-based resources: (i) Dragon Exploratory System on Hepatitis C Virus (DESHCV) accessible via http://apps.sanbi.ac.za/DESHCV/ or http://cbrc.kaust.edu.sa/deshcv/ and (ii) Hepatitis C Virus Protein Interaction Database (HCVpro) accessible via http://apps.sanbi.ac.za/hcvpro/ or http://cbrc.kaust.edu.sa/hcvpro/. 
DESHCV is a text mining system implemented using named concept recognition and co-occurrence-based approaches to computationally analyze about 32,000 HCV-related abstracts obtained from PubMed. As part of DESHCV development, the pre-constructed dictionaries of the Dragon Exploratory System (DES) were enriched with HCV biomedical concepts, including HCV proteins, name variants and symbols to enable HCV knowledge-specific exploration. The DESHCV query inputs consist of user-defined keywords, phrases and concepts. DESHCV is therefore an information extraction tool that enables users to computationally generate associations between concepts and supports the prediction of potential hypotheses with diagnostic and therapeutic relevance. Additionally, users can retrieve a list of abstracts containing tagged concepts that can be used to overcome the herculean task of manual biocuration. DESHCV has been used to simulate the previously reported thalidomide-chronic hepatitis C hypothesis and also to model a potentially novel thalidomide-amantadine hypothesis. HCVpro is a relational knowledgebase dedicated to housing experimentally detected HCV-HCV and HCV-human protein interaction information obtained from other databases and curated from biomedical journal articles. Additionally, the database contains consolidated biological information consisting of hepatocellular carcinoma (HCC) related genes, comprehensive reviews on HCV biology and drug development, functional genomics and molecular biology data, and cross-referenced links to canonical pathways and other essential biomedical databases. Users can retrieve enriched information including interaction metadata from HCVpro by using protein identifiers, gene chromosomal locations, experiment types used in detecting the interactions, PubMed IDs of journal articles reporting the interactions, annotated protein interaction IDs from external databases, and via “string searches”. 
The utility of HCVpro has been demonstrated by harnessing integrated data to suggest putative baseline clues that seem to support current diagnostic exploratory efforts directed towards vimentin. Furthermore, eight genes (ACLY, AZGP1, DDX3X, FGG, H19, SIAH1, SERPING1 and THBS1) have been recommended for possible investigation to evaluate their diagnostic potential. The data archived in HCVpro can be utilized to support protein-protein interaction network-based candidate HCC gene prioritization for possible validation by experimental biologists.
Item Irradiation induced effects on 6H-SiC (University of the Western Cape, 2012) Sibuyi, Praise; Maaza, M.
The framework agreement reached in 2000 by the international community to launch the Generation IV program with 10 nations, in order to develop safe and reliable nuclear reactors, gave rise to increased interest in the study of SiC and of the effect of different types of irradiation on solids. Silicon carbide is a preferred candidate for use in harsh environments due to its excellent properties such as high chemical stability and strong mechanical strength. The pebble bed modular reactor (PBMR) technology promises to be the safest of all nuclear technologies developed to date. SiC has been considered as a candidate material for the fabrication of the pebble bed fuel cell. Its outstanding physical and chemical properties, even at high temperatures, render it a material of choice for the future nuclear industry as a whole and PBMR in particular. Due to the hostile environment created during normal reactor operation, some of these excellent properties are compromised. In order to use this material in such conditions, it should have at least a near-perfect crystal lattice to prevent defects that could compromise its strength and performance. A proper knowledge of the behavior of radiation-induced defects in SiC is vital. During irradiation, the crystal lattice becomes disordered, resulting in the production of defects in the lattice. 
These defects lead to the degradation of the excellent properties of the material. This thesis investigates the effects of various types of radiation on 6H-SiC. We have investigated the effects of radiation-induced damage to SiC, with a description of the beds and the importance of the stability of the SiC-C interface under the effects of radiation (γ-rays, hot neutrons). The irradiated samples of 6H-SiC have been studied with various spectroscopic and structural characterization methods. Surface-sensitive techniques such as Raman spectroscopy, UV-Vis, Photoluminescence and Atomic Force Microscopy are employed in several complementary ways to probe the effect of irradiation on SiC. The results obtained are discussed in detail.
Item Modulation of soybean and maize antioxidant activities by Caffeic acid and nitric oxide under salt stress (University of the Western Cape, 2012) Klein, Ashwil Johan; Ludidi, Ndomelele Ndiko; Keyster, Marshall
This study explores the roles of exogenously applied nitric oxide, exogenously applied caffeic acid and salt stress on the antioxidant system in cereal (exemplified by maize) and legume (using soybean as an example) plants, together with their influence on membrane integrity and cell death. This study investigates changes in H2O2 content, root lipid peroxidation, root cell death and antioxidant enzymatic activity in maize roots in response to exogenously applied nitric oxide (NO) and salt stress. This part of the study is based on the partially understood interaction between NO and reactive oxygen species (ROS) such as H2O2 and the role of antioxidant enzymes in plant salt stress responses. The results show that application of salt (NaCl) results in elevated levels of H2O2 and an increase in lipid peroxidation, consequently leading to increased cell death. 
The study also shows that by regulating the production and detoxification of ROS through modulation of antioxidant enzymatic activities, NO plays a pivotal role in maize responses to salt stress. The study argues for NO as a regulator of redox homeostasis that prevents excessive ROS accumulation during exposure of maize to salinity stress that would otherwise be deleterious to maize. This study extends the role of exogenously applied NO to improve salt stress tolerance in cereal crops (maize) further to its role in enhancing salt stress tolerance in legumes. The effect of long-term exposure of soybean to NO and salt stress on root nodule antioxidant activity was investigated to demonstrate the role of NO in salt stress tolerance. The results show that ROS-scavenging antioxidative enzymes like SOD, GPX and GR are differentially regulated in response to exogenous application of NO and salt stress. It remains to be determined if the NO-induced changes in antioxidant enzyme activity under salt stress are sufficient to efficiently reduce ROS accumulation in soybean root nodules to levels close to those of unstressed soybean root nodules. Furthermore, this study investigates the effect of long-term exposure of soybean to exogenous caffeic acid (CA) and salt stress, on the basis of the established role of CA as an antioxidant and the involvement of antioxidant enzymes in plant salt stress responses. The effect of CA on soybean nodule number, biomass (determined on the basis of nodule dry weight, root dry weight and shoot dry weight), nodule NO content, and nodule cyclic guanosine monophosphate (cGMP) content in response to salt stress was investigated. Additionally, CA-induced changes in nodule ROS content, cell viability, lipid peroxidation and antioxidant enzyme activity as well as some genes that encode antioxidant enzymes were investigated in the presence or absence of salt stress. 
The study shows that long-term exposure of soybean to salt stress results in reduced biomass associated with accumulation of ROS, elevated levels of lipid peroxidation and elevated levels of cell death. However, exogenously applied CA reversed the negative effects of salt stress on soybean biomass, lipid peroxidation and cell death. CA reduced the salt stress-induced accumulation of ROS by mediating changes in root nodule antioxidant enzyme activity and gene expression. These CA-responsive antioxidant enzymes were found to be superoxide dismutase (SOD), ascorbate peroxidase (APX), glutathione peroxidase (GPX), and glutathione reductase (GR), which contributed to the scavenging of ROS in soybean nodules under salt stress. The work reported in Chapter 2 has been published in a peer-reviewed journal [Keyster M, Klein A, Ludidi N (2012) Caspase-like enzymatic activity and the ascorbate-glutathione cycle participate in salt stress tolerance of maize conferred by exogenously applied nitric oxide. Plant Signaling and Behavior 7: 349-360]. My contribution to the published paper was all the work that is presented in Chapter 2, whereas the rest of the work in the paper (which is not included in Chapter 2) was contributed by Dr Marshall Keyster.
Item Computational characterization of IRE-regulated genes in Glossina morsitans (University of Western Cape, 2013) Dashti, Zahra Jalali Sefid; Christoffels, Alan
Blood feeding is a habit exhibited by many insects. Considering the devastating impact of these insects on human health, it is important to focus research on understanding the biology behind blood-feeding, disease transmission and host-pathogen interactions. Such knowledge would pave the way for developing efficient preventative measures. Iron, an important element for species survival, is at the center of events controlling tsetse’s fitness and reproductive success. 
Hence, targeting genes involved in iron trafficking and sequestration would present a possible means of preventing disease transmission. Considering the dynamic and multi-factorial nature of iron metabolism, a well-coordinated regulatory system is expected to be at work. Despite extensive literature on the mechanism of iron regulation and the key factors responsible for maintaining its homeostasis in humans, less attention has been given to understanding such systems in insects, especially the blood-feeding insects. The availability of the genome sequences for several insect disease vectors allows for a more detailed analysis of the events controlling and preventing iron-induced toxicity following a blood meal. The International Glossina Genome Initiative (IGGI) has coordinated the sequencing and annotation of the Glossina morsitans genome, which has led to the identification of 12220 genes. This knowledge base, along with the current understanding of the IRE system in regulating iron metabolism, allowed the UTRs of Glossina genes to be investigated for the presence of these elements. Using a combination of motif enrichment and IRE stem-loop structure prediction, IRE-mediated regulation was inferred for 150 genes, among which 72 were identified with 5’-IREs and 78 with 3’-IREs. Of the identified IRE-regulated genes, the ferritin heavy chain and MRCK-alpha are the only genes known to have IREs, while the rest are novel genes for which putative roles in regulating iron levels in the tsetse fly have been assigned in this study. Moreover, the functional inference of the identified genes points to the enrichment of transcription and translation. Furthermore, several hypothetical proteins with no defined functions were identified as IRE-regulated.
These include TMP007137, TMP009128, TMP002546, TMP002921, TMP003628, TMP004581, TMP008259, TMP012389, TMP005219, TMP005827, TMP007908, TMP009332, TMP013384, TMP009102, TMP010544, TMP010707, TMP004292, TMP006517, TMP014030, TMP009821 and TMP003060, for which an iron-regulatory mechanism of action may be inferred. We further report 26 IRE-regulated secreted proteins in Glossina that present good candidates for further investigation pertaining to the development of novel vector control strategies. Using the predicted data on the identified IRE-regulated genes and their functional classification, we arrived at 29 genes with putative roles in iron trafficking, among which several unknown and hypothetical proteins are included. Thus a novel role is inferred for these genes in cellular binding and transport in the context of iron metabolism. It is therefore possible that these genes may have evolved in Glossina such that they compensate for the absence of an IRE-regulated mechanism for transferrin. Additionally, we propose 14 IRE-regulated genes involved in immune and stress response, which may indeed play crucial roles at the host-pathogen interface through their possible mechanisms of iron sequestration. Using subcellular localization analysis, we further categorized the putative IRE-regulated genes into several subcellular localizations, where the majority of genes were found within the nucleus and the cytosol. The detection of conserved motifs in a set of genes is an interesting yet sophisticated area of research that allows for identifying either co-regulated or orthologous genes, while further providing support for the putative function of a set of genes that would otherwise remain uncharacterized. This is based on the notion that co-regulated genes are often co-expressed to carry out a specific function.
As such, 14 regulatory elements were identified in the 5’- and 3’-UTRs of IRE-regulated genes involved in embryonic development and reproduction, inflammation and immune response, signaling pathways and neurogenesis, as well as DNA repair. This study further proposes several IRE-regulated genes as targets for micro-RNA regulation through identifying micro-RNA binding sites in their 3’-UTRs. Using a motif clustering approach, we clustered IRE-regulated genes based on the number of motifs they share. Significantly co-regulated genes sharing two or more motifs were determined to be critical targets for future investigation. The expression map of IRE-regulated genes was analyzed to better understand the events taking place from 3 hours to 15 days following a blood meal. Re-analysis of an Anopheles microarray chip showed the significant expression of three cell envelope and transport genes as an early response and six as a late response to a blood meal, which could indeed be assigned a putative role in iron trafficking. Genes identified in this study with implications in iron metabolism, whose timely expression allows for maintaining iron homeostasis, represent good targets for future work. Considering the important role of evolution in species adaptation to habits such as hematophagy, it is important to identify evolutionary signatures associated with these changes. To distinguish between evolutionary forces that are specific to iron metabolism in blood-feeding insects and those that are found in other insects, the IRE-regulated genes were clustered into orthologous groups using several blood-feeding and non-blood-feeding insect species. Assessment of different evolutionary scenarios using the Maximum Likelihood (ML) approach points to variations in the evolution of IRE-regulated genes between the two insect groups, whereby several genes indicate an increased mutation rate in the blood-feeding (BF) insect group relative to their non-blood-feeding counterparts.
These include TMP003602 (phosphoinositide3-kinase), TMP009157 (ubiquitin-conjugating enzyme9), TMP010317 (general transcription factor IIH subunit1), TMP011104 (serine-pyruvate mitochondrial), TMP013137 (pentatricopeptide Transcription and translation), TMP013886 (tRNA(uridine-2-o-)-methyl-transferase-trm7) and TMP014187 (mediator 100kD). Additionally, we have indicated the presence of positively selected sites within seven blood-feeding IRE-regulated genes, namely TMP002520 (nucleoporin), TMP008942 (eukaryotic translation initiation factor 3), TMP009871 (bruno-3 transcript), TMP010317 (general transcription factor IIH subunit1), TMP010673 (ferritin heavy-chain protein), TMP011104 (serine-pyruvate mitochondrial) and TMP011448 (brain chitinase and chia). Thus the results of this study provide an in-depth understanding of iron metabolism in Glossina morsitans and offer important targets for future validation, on the basis of which innovative control strategies may be designed.
Item Identification and characterization of microRNAs and their putative target genes in Anopheles funestus s.s (2013) Ali, Mushal Allam Mohamed Alhaj; Christoffels, Alan
The discovery of microRNAs (miRNAs) is one of the most exciting scientific breakthroughs of the last decade. miRNAs are short RNA molecules that do not encode proteins but instead regulate gene expression. Over the past several years, thousands of miRNAs have been identified in various insect genomes through cloning and sequencing, and even by computational prediction. However, information concerning possible roles of miRNAs in mosquitoes is limited. Within this context, we report here the first systematic analysis of these tiny RNAs and their target mRNAs in one of the principal African malaria vectors, Anopheles funestus s.s.
Firstly, to extend the known repertoire of miRNAs expressed in this insect, the small RNAs from four developmental stages (egg, larvae, pupae and adult females) were sequenced using next generation sequencing technology. A total of 98 miRNAs were identified, which included 65 known Anopheles miRNAs, 25 miRNAs conserved in other insects and 8 novel miRNAs that had not been reported in any species. We further characterized new variants for miR-2 and miR-927 and stem-loop precursors for miR-286 and miR-2944. The analysis showed that many miRNAs have stage-specific expression and are co-transcribed and co-regulated during development. Secondly, for a better understanding of the molecular details of miRNA function, we identified the target genes for the Anopheles miRNAs using a novel approach that identifies the genes that overlap among three target prediction tools, followed by filtering genes based on functional enrichment of GO terms and KEGG pathways. We found that most of the miRNAs are metabolic regulators. Moreover, the results suggest the implication of some miRNAs not only in development but also in insect-parasite interaction. Finally, we developed the InsecTar database (http://insectar.sanbi.ac.za) for miRNA targets in three mosquito species: Anopheles gambiae, Aedes aegypti, and Culex quinquefasciatus. The database incorporates prediction and functional analysis of these target genes and is intended to assist in exploring the roles of these regulatory molecules in insects. This type of analysis is a key step towards improving our understanding of the complexity and regulation mode of miRNAs in mosquitoes.
Moreover, this study opens the door for exploration of miRNAs in the regulation of critical physiological functions specific to vector arthropods, which may lead to novel approaches to combat mosquito-borne infectious diseases.
Item Computational characterisation of DNA methylomes in mycobacterium tuberculosis Beijing hyper- and hypo-virulent strains (University of the Western Cape, 2014) Naidu, Alecia Geraldine; Christoffels, Alan; Gey van Pittius, Nico
Mycobacterium tuberculosis, the causative agent of tuberculosis, is estimated to infect approximately one-third of the world’s population and is responsible for around 2 million deaths per year. The disease is endemic in South Africa, which has one of the world’s highest tuberculosis incidence and death rates. Strains of the M. tuberculosis Beijing genotype are characterised by an enhanced virulence capability over other M. tuberculosis strains and are the predominant strains observed in the Western Cape of South Africa. DNA methylation is a largely untapped area of research in M. tuberculosis and has been poorly described in the literature, particularly with respect to its connection to virulence, despite being well characterised, along with its role in virulence, in other pathogenic bacteria such as E. coli. The overall aim was to characterise a global DNA methylation profile for two M. tuberculosis Beijing strains, one hyper-virulent and one hypo-virulent, using single molecule real time sequencing technology. A further aim was to determine whether adenine methylation in promoter regions has a possible functional role. This study identified and characterised the DNA methylation profile at single-nucleotide resolution in these strains using Pacific Biosciences single molecule real time sequencing data. A computational approach was used to discern DNA methylation patterns between the hyper-virulent and hypo-virulent strains with a view to understanding virulence in the hyper-virulent strain.
Methylated motifs belonging to known Restriction Modification (RM) systems of the H37Rv reference genome were also identified. N6-methyladenine (m6A) and N4-methylcytosine (m4C) loci were identified in both strains. m6A was identified in both strains, occurring within the sequence motifs CACGCAG (Type II RM system) and GATNNNNRTAC/GTAYNNNNATC (Type I RM system), while the CTGGAGGA motif was found to be uniquely methylated in the hyper-virulent strain. Interestingly, the CACGCAG motif was significantly methylated (p = 9.9 x 10^-63) at a higher proportion in intergenic regions (~70%) than in genic regions in both the hyper-virulent and hypo-virulent strains, suggesting a role in gene regulation. There appeared to be a higher proportion of m6A occurring in intergenic regions than within genes for the hyper-virulent (61%) and hypo-virulent (62%) strains. Within genic regions, 35% of total m6A occurred uniquely within genes in the hyper-virulent strain, compared with 27.9% in the hypo-virulent strain.
Item Evolution of HIV-1 subtype C gp120 envelope sequences in the female genital tract and blood plasma during acute and chronic infection (University of the Western Cape, 2014) Ramdayal, Kavisha; Harkins, Gordon; Christoffels, Alan
Heterosexual transmission of HIV-1 via the female genital tract is the leading route of HIV infection in sub-Saharan Africa. Viruses then traffic between the cervical compartment and blood, ensuring pervasive infection. Previous studies have, however, reported the existence of genetically diverse viral populations in various tissue types, each evolving under separate selective pressures within a single individual, though it is still unclear how compartmentalization dynamics change over acute and chronic infection in the absence of ARVs.
To better characterize intrahost evolution and the movement of viruses between different anatomical tissue types, statistical and phylogenetic methods were used to reconstruct temporal dynamics between blood plasma and cervico-vaginal lavage (CVL) derived HIV-1 subtype C gp120 envelope sequences. A total of 206 cervical and 253 blood plasma sequences obtained from four treatment-naïve women enrolled in the CAPRISA Acute Infection study cohort in South Africa were evaluated for evidence of genotypic and phenotypic differences between viral populations from each tissue type up to 3.6 years post-infection. Evidence for tissue-specific differences in genetic diversity, V-loop length variation, codon-based selection, co-receptor usage, hypermutation, recombination and potential N-linked glycosylation (PNLG) site accumulation was investigated. Of the four participants studied, two, anonymously identified as CAP270 and CAP217, showed evidence of infection with a single HIV-1 variant, whereas CAP177 and CAP261 showed evidence of infection by more than one variant. As a result, genetic diversity, PNLG accumulation and the number of detectable recombination events along the gp120 env region were lowest in the former patients and highest in the latter. Overall, genetic diversity increased over the course of infection in all participants and correlated significantly with viral load measurements from the blood plasma in one of the four participants tested (i.e. CAP177). Employing a structured coalescent model approach, rates of viral migration between anatomical tissue types on time-measured genealogies were also estimated. No persistent evidence for the existence of separate viral populations in the cervix and blood plasma was found in any of the participants; instead, sequences generally clustered together by time point on Bayesian Maximum Clade Credibility (MCC) trees.
Clades that were monophyletic by tissue type consisted mostly of low diversity or monotypic sequences from the same time point, consistent with bursts of viral replication. Tissue-specific monophyletic clades also generally contained few sequences and were interspersed among sequences from both tissue types. Tree and distance-based statistical tests were employed to further evaluate the degree to which cervical and blood plasma viruses clustered together on Bayesian MCC trees using the Slatkin-Maddison (S-M), Simmonds Association index (AI), Monophyletic Clade (MC), Wright’s measure of population subdivision (FST) and Hudson’s Nearest Neighbour (Snn) statistics, in the presence and absence of monotypic and low diversity sequences. Statistical evidence for the presence of tissue-specific population structure disappeared or was greatly reduced after the removal of monotypic and low diversity sequences, except in CAP177 and CAP217, in 3/5 of longitudinal tree and distance-based tests. Analysis of phenotypic differences between viral populations from the blood plasma and cervix revealed inconsistent tissue-specific patterns in genetic diversity, codon-based selection, co-receptor usage, hypermutation, recombination, V-loop length variation and PNLG site accumulation during acute and chronic infection among all participants. There is therefore no evidence to support the existence of distinct viral populations within the blood plasma and cervical compartments longitudinally; however, slightly constrained populations may exist within the female genital tract at isolated time points, based on the statistical findings presented in this study.
Item Management and analysis of HIV-1 ultra-deep sequence data (University of the Western Cape, 2014) Shrestha, Ram Krishna; Travers, Simon A
The continued success of antiretroviral programmes in the treatment of HIV is dependent on access to a cost-effective HIV drug resistance test (HIV-DRT).
HIV-DRT involves sequencing a fragment of the HIV genome and characterising the presence or absence of mutations that confer resistance to one or more drugs. HIV-DRT using conventional DNA sequencing is prohibitively expensive (~US$150 per patient) for routine use in resource-limited settings such as many African countries. While the advent of ultra deep pyrosequencing (UDPS) approaches has considerably reduced (3-5 fold reduction) the cost of generating the sequence data, there has been an even more significant increase in the volume of data generated and the complexity involved in its analysis. In order to address this issue we have developed Seq2Res, a computational pipeline for HIV drug resistance testing from UDPS genotypic data. We have developed QTrim, software that undertakes high throughput quality trimming of UDPS sequencing data to ensure that subsequently analyzed data is of high quality. The comparison of QTrim to other widely used tools showed that it is equivalent to the next best method at trimming good quality data but outperforms all methods at trimming poor quality data. Further, we have developed, and evaluated, a computational approach for the analysis of UDPS sequence data generated using the novel Primer ID that enables the generation of a consensus sequence from all sequence reads originating from the same viral template, thus reducing the presence of PCR and sequencing induced errors in the dataset as well as reducing. We see that while the Primer ID approach does undoubtedly reduce the prevalence of PCR and sequencing induced errors, it artificially reduces the diversity of the subsequently analysed data, because a large volume of data is discarded when there are too few sequences for consensus sequence generation.
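As a rough illustration of the Primer ID consensus step described above, the following Python sketch groups reads by their Primer ID tag, builds a per-position majority consensus for each group, and discards groups with too few reads. It is not the Seq2Res implementation: the function name, the `min_reads` threshold of 3, and the assumption that reads within a group are already aligned to equal length are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict


def primer_id_consensus(reads, min_reads=3):
    """Build one consensus sequence per Primer ID.

    reads: iterable of (primer_id, sequence) pairs. Sequences sharing a
    Primer ID are assumed to be pre-aligned and of equal length.
    min_reads: hypothetical threshold; groups with fewer reads are discarded.
    Returns (consensus_by_id, discarded_ids).
    """
    groups = defaultdict(list)
    for primer_id, seq in reads:
        groups[primer_id].append(seq)

    consensus = {}
    discarded = []
    for primer_id, seqs in groups.items():
        if len(seqs) < min_reads:
            # Too few reads to call a consensus: the template is lost,
            # which is how diversity can be artificially reduced.
            discarded.append(primer_id)
            continue
        # Take the most common base in each alignment column.
        consensus[primer_id] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus, discarded


# Made-up toy reads for illustration only.
reads = [
    ("AAA1", "ACGT"),
    ("AAA1", "ACGT"),
    ("AAA1", "ACTT"),  # carries one PCR/sequencing error
    ("BBB2", "GGCA"),  # singleton, below the threshold
]
consensus, discarded = primer_id_consensus(reads)
print(consensus, discarded)
```

In this toy input the single erroneous base in the first group is outvoted, while the singleton template is dropped entirely, which mirrors the trade-off between error reduction and lost diversity noted above.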
We validated the sensitivity of the Seq2Res pipeline using two real biological datasets from the Stanford HIV Database and five simulated datasets. The Seq2Res results correlated fully with those of the Stanford database, and Seq2Res additionally identified a drug resistance mutation (DRM) that had been incorrectly interpreted by the Stanford approach. Further, the analysis of the simulated datasets showed that Seq2Res is capable of accurately identifying DRMs at all prevalence levels down to at least 1% of the sequence data generated from a viral population. Finally, we applied Seq2Res to UDPS resistance data generated from as many as 641 individuals as part of the CIPRA-SA study to evaluate the effectiveness of UDPS HIV drug resistance genotyping in resource-limited settings with a high burden of HIV infections. We find that, despite the FLX coverage being almost three times that of the Junior platform, resistance genotyping results are directly comparable between the two approaches at a range of prevalence levels as low as 1%. Further, we find no significant difference between UDPS sequencing and the "gold standard" Sanger-based approach, thus indicating that pooling data from as many as 48 patients and sequencing using the Roche/454 Junior platform is a viable approach for HIV drug resistance genotyping. Further, we explored the presence of resistant minor variants in individuals' viral populations and find that the identification of minor resistant variants in individuals exposed to nevirapine through PMTCT correlates with the time since exposure. We conclude that HIV resistance genotyping is now a viable prospect for resource-limited settings with a high burden of HIV infections and that UDPS approaches are at least as sensitive as the currently used Sanger-based sequencing approaches.
Further, the development of Seq2Res has provided a sensitive, easy to use and scalable technology that facilitates the routine use of UDPS for HIV drug resistance genotyping.
Item Genome-wide annotation of chemosensory and glutamate-gated receptors, and related genes in Glossina morsitans morsitans tsetse fly (University of the Western Cape, 2014) Obiero, George Fredrick Opondo; Christoffels, Alan; Mireji, Paul O.; Masiga, Daniel
Tsetse flies are the sole vectors of trypanosomes that cause nagana and sleeping sickness in animals and humans respectively in tropical Africa. Tsetse are unique: adults of both sexes are exclusive blood-feeders, and females are mated young and give birth to a single mature larva per pregnancy in sheltered habitats. Tsetse use chemoreception to detect and respond to chemical stimuli, helping them to locate hosts, mates, and larviposition and resting sites. The detection is facilitated by chemoreceptors expressed on sensory neurons to cause specific responses. Specific molecular factors that mediate these responses are poorly understood in tsetse flies. This study aimed to identify and characterize genes that potentially mediate chemoreception in Glossina morsitans morsitans tsetse flies. These genes included sensory odorant (OR), gustatory (GR) and ionotropic (IR) receptor genes, and related genes for odorant-binding (OBP), chemosensory (CSP) and sensory neuron membrane (SNMP) proteins. Synaptic transmission in higher brain sites may involve ionotropic glutamate-gated (iGluR) and metabotropic glutamate-gated (mGluR) receptors. The genes were annotated in the G. m. morsitans genome scaffold assembly GMOY1.1 (Yale strain) using orthologs from D. melanogaster as queries via the TBLASTX algorithm at an e-value below 1e-03. Positive BLAST hits were seeded as gene constructs in their respective scaffolds and used as a genomic reference onto which female fly-derived RNA sequence reads were mapped using the CLC Genomics Workbench suite.
Seeded gene models were modified using RNA-Seq reads, then viewed and re-edited using the Artemis genome viewer tool. The genome was iteratively searched using the G. m. morsitans gene model sequences to recover additional similar hit sequences. The gene models were confirmed through comparisons against the NCBI conserved domains database (CDD) and the non-redundant Swiss-Prot database. Trans-membrane domains and secretory peptides were predicted using the TMHMM and SignalP tools respectively. Putative functions of the genes were confirmed via Blast2GO searches against the gene ontology database. Evolutionary relationships amongst and between the genes were established by maximum likelihood estimation, using the best-fitting amino acid model test in the MEGA5 suite and the PhyML tool. Expression profiles of genes were estimated from the RNA-seq data via the CLC Genomics RNA-sequence analysis pipeline. Overall, 46 ORs, 14 GRs, and 19 IRs were identified, of which 21, 6 and 4 were manually identified for ORs, GRs, and IRs respectively. Additionally, 15 iGluRs, 6 mGluRs, 5 CSPs, 15 CD36-like, and 32 OBPs were identified. Six copies of OR genes (GmmOR41-46) were homologous to DmelOr67d, a single-copy cis-vaccenyl acetate (cVA) receptor. Genes whose receptor homologs are associated with responses to CO2, GmmGR1-4, had higher expression profiles than the other Glossina GR genes. Known core-receptor homologs OR1, IR8a, IR25a and IR64a were conserved, and three species-specific divergent IRs (IR10a, IR56b and IR56d) were identified. Homologs of GluRIID, IR93a, and sweet taste receptors (Gr5a and Gr64a) were not identified in the genome. The homolog of the LUSH protein, GmmOBP26, and the sensory neuron membrane receptors SNMP1 and SNMP2 were conserved in the genome. The results indicate a reduced repertoire of chemosensory genes and suggest a reduced host range of the tsetse flies compared to other Diptera.
Genes in multiple copies suggest their prioritization in chemoreception, which in turn may be tied to high specificity in host selection. Genes with high sequence conservation and expression profiles probably relate to their broad expression and utility within the fly nervous system. These results lay a foundation for future comparative studies with other insects, provide opportunities for functional studies, and form the basis for re-examining new approaches for improving tsetse control tools and possible drug targets based on chemoreception.
Item Computational genomics approaches for kidney diseases in Africa (University of the Western Cape, 2015) Mapiye, Darlington Shingirirai; Tiffin, Nicki; Gamieldien, Junaid
End stage renal disease (ESRD), a more severe form of kidney disease, is considered to be a complex trait that may involve multiple processes which work together on a background of significant genetic susceptibility. Black Africans have been shown to bear an unequal burden of this disease compared to white Europeans, Americans and Caucasians. Despite this, most of the genetic and epidemiological advances made in understanding the aetiology of kidney diseases have been made in populations outside of sub-Saharan Africa (SSA). Very little research has been undertaken to investigate key genetic factors that drive ESRD in Africans compared to patients from the rest of the world's populations. Therefore, the primary aim of this Bioinformatics thesis was twofold: firstly, to develop and apply a whole exome sequencing (WES) analysis pipeline and use it to understand a genetic mechanism underlying ESRD in a South African population of mixed ancestry. I hypothesized that the pipeline would enable the discovery of highly penetrant rare variants with large effect size, which are expected to explain an important fraction of the genetic aetiology and pathogenesis of ESRD in these African patients.
Secondly, the aim was to develop and set up a multicenter clinical database that would capture a plethora of clinical data for patients with Lupus, one of the risk factors of ESRD. From WES of six family members (five cases and one control), a total of 23 196 SNVs, 1445 insertions and 1340 deletions overlapped amongst all affected family members. The variants were consistent with the autosomal dominant inheritance pattern inferred in this family. Of these, only 1550 SNVs, 67 insertions and 112 deletions were present in all affected family members but absent in the unaffected family member. Following detailed evaluation of evidence for variant implication and pathogenicity, only 3 very rare heterozygous missense variants in 3 genes, COL4A1 [p.R476W], ICAM1 [p.P352L] and COL16A1 [p.T116M], were considered potentially disease causing. Computational relatedness analysis revealed the approximate amount of DNA shared by family members and confirmed the reported relatedness. Genotyping for the Y chromosome was additionally performed to assist in confirming sample identity. The clinical database has been designed and is being piloted at Groote Schuur medical Hospital at the University of Cape Town. Currently, about 290 patients have already been entered in the registry. The resources and methodologies developed in this thesis have the potential to contribute not only to the understanding of ESRD and its risk factors, but also to the successful application of WES in clinical practice. Importantly, the thesis contributes significant information on the genetics of ESRD based on an African family and will also improve scientific infrastructure on the African continent. Clinical databasing will go a long way towards enabling clinicians to collect and store standardised clinical data for their patients.
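The family-based filter mentioned in this abstract (variants present in all affected family members but absent in the unaffected member) can be sketched as plain set operations. The Python below is a minimal illustration under stated assumptions, not the thesis pipeline: the function name, the sample names and the `(chrom, pos, ref, alt)` tuple used to key a variant are all hypothetical.

```python
def segregating_variants(affected, unaffected):
    """Return variants carried by every affected sample and no unaffected sample.

    Both arguments map a sample name to an iterable of variant keys,
    here hypothetical (chrom, pos, ref, alt) tuples.
    """
    if not affected:
        raise ValueError("at least one affected sample is required")
    # Variants shared by all affected family members.
    shared = set.intersection(*(set(calls) for calls in affected.values()))
    # Any variant seen in an unaffected family member is excluded.
    in_unaffected = set().union(*(set(calls) for calls in unaffected.values()))
    return shared - in_unaffected


# Made-up toy calls for illustration only; not data from the thesis.
affected = {
    "case_1": [("chr1", 100, "C", "T"), ("chr2", 200, "G", "A"), ("chr3", 300, "A", "G")],
    "case_2": [("chr1", 100, "C", "T"), ("chr2", 200, "G", "A")],
}
unaffected = {"control_1": [("chr2", 200, "G", "A")]}

candidates = segregating_variants(affected, unaffected)
print(sorted(candidates))
```

A real analysis would still need the downstream steps the abstract describes, such as filtering by rarity and evaluating pathogenicity evidence, before any variant is considered potentially disease causing.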