Philosophiae Doctor - PhD (Bioinformatics)
Baobab LIMS: An open source biobank laboratory information management system for resource-limited settings (University of the Western Cape, 2019) Bendou, Hocine; Christoffels, Alan
A laboratory information management system (LIMS) is central to the informatics infrastructure that underlies biobanking activities. To date, a wide range of commercial and open source LIMS are available. The decision to opt for one LIMS over another is often influenced by the needs of the biobank clients and researchers, as well as by the available financial resources. However, finding a LIMS that incorporates all possible requirements of a biobank is often a complicated endeavour. The need to implement biobank standard operating procedures and to stimulate the use of standards for biobank data representation motivated the development of Baobab LIMS, an open source LIMS for biobanking. Baobab LIMS comprises modules for biospecimen kit assembly, shipping of biospecimen kits, storage management, analysis requests, reporting, and invoicing. Baobab LIMS is based on the Plone web-content management framework, a client-server system in which the end user accesses the system securely over the internet through a standard web browser, eliminating the need for standalone installations on every machine. The Baobab LIMS components were tested and evaluated in three human biobanks. Testing the LIMS modules helped to map the biobanks' requirements to the LIMS functionalities and revealed new user suggestions, such as enhancing the online documentation. The user suggestions are shown to be important for both strengthening the LIMS and sustaining the biobanks.
Ultimately, the practical LIMS evaluations showed that Baobab LIMS can be used to manage the operations of human biobanks with relatively different biobanking workflows.

Coding of tsetse repellents by olfactory sensory neurons: towards the improvement and the development of novel tsetse repellents (University of the Western Cape, 2020) Souleymane, Diallo; Christoffels, Alan
Tsetse flies are the biological vectors of human and animal trypanosomiasis and are hence of medical and veterinary importance. The sense of smell plays a significant role in the ecological interactions of tsetse, such as finding blood meal sources, resting sites and larviposition sites, and in mating. Tsetse olfactory behaviour can be exploited for their management; however, olfactory studies in tsetse flies are still fragmentary. In this PhD thesis, using scanning electron microscopy, electrophysiology, behavioural assays, bioinformatics and molecular biology techniques, I investigated olfaction in the tsetse fly Glossina fuscipes fuscipes using behaviourally well-studied odorants, comparing a tsetse repellent with an attractant odour. Insect olfaction is mediated by olfactory sensory neurons (OSNs) located in olfactory sensilla, which are cuticular structures exposed to the environment through pores and which create a platform for chemical communication. The sensillum shaft houses the dendrites of the OSNs, which are protected by a fluid called the sensillum lymph; it is produced by support cells and contains a variety of olfactory proteins, including odorant binding proteins (OBPs) and chemosensory proteins (CSPs). Olfactory receptors are expressed on the dendrites of the OSNs. In my PhD studies I tried to decipher the sense of smell in the tsetse fly. In the second chapter, I demonstrated that G. f. fuscipes is equipped with diverse olfactory sensilla, which vary among basiconic, trichoid and coeloconic types. I also demonstrated differences in shape, length and number between sensilla types, as well as sexual dimorphism.
There is a major difference between males and females: males have unique club-shaped basiconic sensilla in the pits, which are absent from female pits. In my third chapter, I investigated the odorant receptors (ORs), which are expressed on the dendrites of the OSNs. G. f. fuscipes has 42 ORs, which had not been functionally characterised. I used behaviourally well-studied odorants: a tsetse repellent composed of a four-component blend. Using feeding bioassays, I demonstrated that the tsetse repellent is also a strong antifeedant for both G. pallidipes and G. f. fuscipes compared with the attractant odour, which adds to the value of the tsetse repellent. The attractant odour, by contrast, enhanced the feeding index. Using DREAM (deorphanization of receptors based on expression alterations of mRNA levels), I found that in G. f. fuscipes, following a short in vivo exposure to the individual tsetse repellent components as well as to an attractant volatile chemical, OSNs that respond to these compounds altered their mRNA expression in two opposite directions, with significant downregulation or upregulation in the number of transcripts corresponding to the OR that they expressed and that interacted with the odorant. I also found that odorants with opposite valence already segregate distinctly at their cellular and molecular targets at the periphery, that is, at the reception of odorants by OSNs, which is the basis of sophisticated olfactory behaviour. Deorphanization of ORs in non-model insects is a challenge; here, by combining DREAM with molecular dynamics (as docking scores), physiology and homology modelling with Drosophila, a well-studied model insect, I was able to predict putative receptors for the tsetse repellent components and an attractant odour. However, many ORs were neutral, showing that they were not activated by the odorants and demonstrating the selectivity of the technique as well as of the receptors.
In my fourth chapter, I investigated OBP structures and their interactions with odorant molecules. I demonstrated that OBPs are expressed both in the antenna and in other tissues, such as legs, and that OBP expression varies between tissues as well as between sexes. I also demonstrated that odorants induced a fast alteration in OBP mRNA expression in live insects: some odorants decreased the transcription of genes corresponding to the activated OBP, others increased expression many-fold, and others were neutral after 5 hours of exposure. Moreover, subsequent behavioural data showed that the behavioural response of G. f. fuscipes towards 1-octen-3-ol decreased significantly when putative 1-octen-3-ol OBPs were silenced by feeding double-stranded RNA (dsRNA). In summary, the findings that odorant exposure affects OBP mRNA levels, together with the physicochemical properties of the OBPs and the effect of silencing them on the behavioural response, demonstrate that OBPs are involved in odour detection and affect the percept of a given odorant. The expression of OBPs in olfactory tissues such as the antenna, their interaction with odorants and the effect of their silencing on behavioural responses show their direct involvement in odour detection and reception. Furthermore, their expression in other tissues such as legs indicates that they might also have a role in other physiological functions, such as taste.

Coding of tsetse repellents by olfactory sensory neurons: towards the improvement and the development of novel tsetse repellents (University of the Western Cape, 2021) Souleymane, Diallo; Christoffels, Alan
Tsetse flies are the biological vectors of human and animal trypanosomiasis and are hence of medical and veterinary importance. The sense of smell plays a significant role in the ecological interactions of tsetse, such as finding blood meal sources, resting sites and larviposition sites, and in mating. Tsetse olfactory behaviour can be exploited for their management; however, olfactory studies in tsetse flies are still fragmentary. In this PhD thesis, using scanning electron microscopy, electrophysiology, behavioural assays, bioinformatics and molecular biology techniques, I investigated olfaction in the tsetse fly Glossina fuscipes fuscipes using behaviourally well-studied odorants, comparing a tsetse repellent with an attractant odour. Insect olfaction is mediated by olfactory sensory neurons (OSNs) located in olfactory sensilla, which are cuticular structures exposed to the environment through pores and which create a platform for chemical communication.

Computational analyses on transcriptional regulation in mammals (University of the Western Cape, 2009) Schmeier, Sebastian; Bajic, Vladimir
The genomes of various organisms have been sequenced and their transcriptomes elucidated. With information about genes and gene products readily available, it has become of the utmost importance to decipher the underlying biological mechanisms involved in the transcriptional control of these genes. Transcription initiation is a fundamental process in living cells. It involves the interaction of transcription factors with DNA to regulate the transcription of a gene. Despite significant research during the last few decades into transcription factors and their role in gene regulation, we are still far from understanding the complete transcriptional machinery that acts within biological systems. In this dissertation two computational approaches are presented to contribute to a better understanding of the transcriptional control of genes in mammals. The first addresses the transcriptional regulation of microRNA genes and its influence on microRNA gene expression during monocytic differentiation. This is the first large-scale approach to decipher how microRNA genes are regulated by transcription factors during monocytic differentiation. The second approach relates to combinatorial gene regulation and the physical interaction of transcription factors. Here, a computational approach is used together with a novel form of numerical representation of transcription factors to predict their interactions.
In this setup, the information necessary to predict the transcription factor interactions is kept to a minimum to reduce the data acquisition overhead that often occurs in computational prediction tasks. Both approaches enhance our insights into transcriptional control and have an impact on the further study of gene regulation.

Computational analysis of multi-omic data for the elucidation of molecular mechanisms of neuroblastoma (University of the Western Cape, 2021) Giwa, Abdulazeez; Bendou, Hocine
Neuroblastoma is the most common extracranial solid tumor in childhood. The survival rates of patients with neuroblastoma, especially those in the high-risk category, are still low despite varied therapies. A detailed understanding of the molecular mechanisms underlying the pathogenesis of neuroblastoma is essential to develop better therapeutics and improve the poor survival rates. This study provides a multi-omic analysis of neuroblastoma datasets from the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) neuroblastoma project and the Gene Expression Omnibus (GEO) data portals to better understand the molecular mechanisms of neuroblastoma.

Computational analysis of multilevel omics data for the elucidation of molecular mechanisms of cancer (University of the Western Cape, 2015) Fatai, Azeez Ayomide; Gamieldien, Junaid
Cancer is a group of diseases that arise from irreversible genomic and epigenomic alterations that result in unrestrained proliferation of abnormal cells. A detailed understanding of the molecular mechanisms underlying a cancer would aid the identification of most, if not all, genes responsible for its progression and the development of molecularly targeted chemotherapy. The challenge of recurrence after treatment shows that our understanding of cancer mechanisms is still poor.
As a contribution to overcoming this challenge, we provide an integrative multi-omic analysis of glioblastoma multiforme (GBM), for which large data sets on different classes of genomic and epigenomic alterations have been made available in the Cancer Genome Atlas data portal. The first part of this study involves protein network analysis for the elucidation of GBM tumourigenic molecular mechanisms, identification of driver genes, prioritization of genes in chromosomal regions with copy number alteration, and co-expression and transcriptional analysis. Functional modules were obtained by edge-betweenness clustering of a protein network constructed from genes with predicted functional impact mutations and differentially expressed genes. Pathway enrichment analysis was performed on each module to identify statistical overrepresentation of signaling pathways. Known and novel candidate cancer driver genes were identified in the modules, and functionally relevant genes in chromosomal regions altered by homologous deletion or high-level amplification were prioritized with the protein network. Co-expressed modules enriched in cancer biological processes and transcription factor targets were identified using network genes that demonstrated high expression variance. Our findings show that GBM's molecular mechanisms are much more complex than those reported in previous studies. We next identified differentially expressed miRNAs for which target genes associated with the protein network were also differentially expressed. MiRNAs and target genes were prioritized based on the number of targeted genes and targeting miRNAs, respectively. MiRNAs that correlated with time to progression were selected by an elastic net-penalized Cox regression model for survival analysis. These miRNAs were combined into a signature that independently predicted adjuvant therapy-linked progression-free survival in GBM and its subtypes and overall survival in GBM.
The results show that miRNAs play significant roles in GBM progression and patient survival. Finally, a prognostic mRNA signature that independently predicted progression-free and overall survival was identified. Pathway enrichment analysis was carried out on genes with high expression variance across a cohort to identify those in chemoradioresistance-associated pathways. A support vector machine-based method was then used to identify a set of genes that discriminated between rapidly and slowly progressing GBM patients, with a minimal cross-validation error rate of 5%. The prognostic value of the gene set was demonstrated by its ability to predict adjuvant therapy-linked progression-free and overall survival in GBM and its subtypes and was validated in an independent data set. We have identified a set of genes involved in tumourigenic mechanisms that could potentially be exploited as targets in drug development for the treatment of primary and recurrent GBM. Furthermore, given their demonstrated accuracy in this study, the identified miRNA and mRNA signatures have strong potential to be combined and developed into a robust clinical test for predicting prognosis and treatment response.

Computational characterisation of DNA methylomes in Mycobacterium tuberculosis Beijing hyper- and hypo-virulent strains (University of the Western Cape, 2014) Naidu, Alecia Geraldine; Christoffels, Alan; Gey van Pittius, Nico
Mycobacterium tuberculosis, the causative agent of tuberculosis, is estimated to infect approximately one-third of the world's population and is responsible for around 2 million deaths per year. The disease is endemic in South Africa, which has one of the world's highest tuberculosis incidence and death rates. Strains of the M. tuberculosis Beijing genotype are characterised by enhanced virulence compared with other M. tuberculosis strains and are the predominant strains observed in the Western Cape of South Africa.
DNA methylation is a largely untapped area of research in M. tuberculosis and has been poorly described in the literature, especially its connection to virulence, even though methylation and its role in virulence are well characterised in other pathogenic bacteria such as E. coli. The overall aim was to characterise a global DNA methylation profile for two M. tuberculosis Beijing strains, one hyper-virulent and one hypo-virulent, using single molecule real time sequencing technology. A further aim was to determine whether adenine methylation in promoter regions has a possible functional role. This study identified and characterised the DNA methylation profile at single nucleotide resolution in these strains using Pacific Biosciences single molecule real time sequencing data. A computational approach was used to discern DNA methylation patterns between the hyper-virulent and hypo-virulent strains with a view to understanding virulence in the hyper-virulent strain. Methylated motifs that belong to known Restriction Modification (RM) systems of the H37Rv reference genome were also identified. N6-methyladenine (m6A) and N4-methylcytosine (m4C) loci were identified in both strains. m6A was identified in both strains within the sequence motifs CACGCAG (Type II RM system) and GATNNNNRTAC/GTAYNNNNATC (Type I RM system), while the CTGGAGGA motif was found to be uniquely methylated in the hyper-virulent strain. Interestingly, the CACGCAG motif was significantly methylated (p = 9.9 × 10^-63) at a higher proportion in intergenic regions (~70%) than in genic regions in both the hyper-virulent and hypo-virulent strains, suggesting a role in gene regulation. There appeared to be a higher proportion of m6A in intergenic regions than within genes for both the hyper-virulent (61%) and hypo-virulent (62%) strains.
Analysis of the genic proportion revealed that 35% of total m6A occurred uniquely within genes in the hyper-virulent strain, compared with 27.9% in uniquely methylated genes in the hypo-virulent strain.

Computational characterization of IRE-regulated genes in Glossina morsitans (University of the Western Cape, 2013) Dashti, Zahra Jalali Sefid; Christoffels, Alan
Blood feeding is a habit exhibited by many insects. Considering the devastating impact of these insects on human health, it is important to focus research on understanding the biology behind blood-feeding, disease transmission and host-pathogen interactions. Such knowledge would pave the way for developing efficient preventative measures. Iron, an important element for species survival, is at the center of events controlling tsetse fitness and reproductive success. Hence, targeting genes involved in iron trafficking and sequestration would present a possible means of preventing disease transmission. Considering the dynamic and multi-factorial nature of iron metabolism, a well-coordinated regulatory system is expected to be at work. Despite extensive literature on the mechanism of iron regulation and the key factors responsible for maintaining its homeostasis in humans, less attention has been given to understanding such systems in insects, especially blood-feeding insects. The availability of genome sequences for several insect disease vectors allows for a more detailed identification and characterization of the events that control and prevent iron-induced toxicity following a blood meal. The International Glossina Genome Initiative (IGGI) has coordinated the sequencing and annotation of the Glossina morsitans genome, which has led to the identification of 12220 genes. This knowledge base, along with the current understanding of the IRE system in regulating iron metabolism, allowed the UTRs of Glossina genes to be investigated for the presence of these elements.
Using a combination of motif enrichment and IRE stem-loop structure prediction, IRE-mediated regulation was inferred for 150 genes, of which 72 were identified with 5'-IREs and 78 with 3'-IREs. Of the identified IRE-regulated genes, the ferritin heavy chain and MRCK-alpha are the only genes known to have IREs, while the rest are novel genes for which putative roles in regulating iron levels in the tsetse fly have been assigned in this study. Moreover, the functional inference of the identified genes points to an enrichment of transcription and translation. Furthermore, several hypothetical proteins with no defined functions were identified as IRE-regulated. These include TMP007137, TMP009128, TMP002546, TMP002921, TMP003628, TMP004581, TMP008259, TMP012389, TMP005219, TMP005827, TMP007908, TMP009332, TMP013384, TMP009102, TMP010544, TMP010707, TMP004292, TMP006517, TMP014030, TMP009821 and TMP003060, for which an iron-regulatory mechanism of action may be inferred. We further report 26 IRE-regulated secreted proteins in Glossina that are good candidates for further investigation towards the development of novel vector control strategies. Using the predicted data on the identified IRE-regulated genes and their functional classification, we arrived at 29 genes with putative roles in iron trafficking, including several unknown and hypothetical proteins. Thus a novel role is inferred for these genes in cellular binding and transport in the context of iron metabolism. It is therefore possible that these genes have evolved in Glossina such that they compensate for the absence of an IRE-regulated mechanism for transferrin. Additionally, we propose 14 IRE-regulated genes involved in immune and stress response, which may play crucial roles at the host-pathogen interface through their possible mechanisms of iron sequestration.
Using subcellular localization analysis, we further categorized the putative IRE-regulated genes into several subcellular localizations; the majority of genes were found within the nucleus and the cytosol. The detection of conserved motifs in a set of genes is an interesting yet sophisticated area of research that allows either co-regulated or orthologous genes to be identified, while further providing support for the putative function of a set of genes that would otherwise remain uncharacterized. This is based on the notion that co-regulated genes are often co-expressed to carry out a specific function. As such, 14 regulatory elements were identified in the 5'- and 3'-UTRs of IRE-regulated genes involved in embryonic development and reproduction, inflammation and immune response, signaling pathways and neurogenesis, as well as DNA repair. This study further proposes several IRE-regulated genes as targets for microRNA regulation by identifying microRNA binding sites in their 3'-UTRs. Using a motif clustering approach, we clustered IRE-regulated genes based on the number of motifs they share. Significantly co-regulated genes sharing two or more motifs were identified as critical targets for future investigation. The expression map of IRE-regulated genes was analyzed to better understand the events taking place from 3 hours to 15 days following a blood meal. Re-analysis of an Anopheles microarray chip showed the significant expression of three cell envelope and transport genes as an early response and six as a late response to a blood meal, which could be assigned a putative role in iron trafficking. Genes identified in this study with implications in iron metabolism, whose timely expression allows iron homeostasis to be maintained, represent good targets for future work.
Considering the important role of evolution in species adaptation to habits such as hematophagy, it is important to identify evolutionary signatures associated with these changes. To distinguish between evolutionary forces that are specific to iron metabolism in blood-feeding insects and those that are found in other insects, the IRE-regulated genes were clustered into orthologous groups using several blood-feeding and non-blood-feeding insect species. Assessment of different evolutionary scenarios using the Maximum Likelihood (ML) approach points to variations in the evolution of IRE-regulated genes between the two insect groups, whereby several genes indicate an increased mutation rate in the blood-feeding insect group relative to their non-blood-feeding counterparts. These include TMP003602 (phosphoinositide 3-kinase), TMP009157 (ubiquitin-conjugating enzyme 9), TMP010317 (general transcription factor IIH subunit 1), TMP011104 (serine-pyruvate mitochondrial), TMP013137 (pentatricopeptide Transcription and translation), TMP013886 (tRNA(uridine-2-o-)-methyl-transferase-trm7) and TMP014187 (mediator 100kD). Additionally, we have indicated the presence of positively selected sites within seven blood-feeding IRE-regulated genes, namely TMP002520 (nucleoporin), TMP008942 (eukaryotic translation initiation factor 3), TMP009871 (bruno-3 transcript), TMP010317 (general transcription factor IIH subunit 1), TMP010673 (ferritin heavy-chain protein), TMP011104 (serine-pyruvate mitochondrial) and TMP011448 (brain chitinase and chia).
Thus the results of this study provide an in-depth understanding of iron metabolism in Glossina morsitans and identify important targets for future validation, based on which innovative control strategies may be designed.

Computational genomics approaches for kidney diseases in Africa (University of the Western Cape, 2015) Mapiye, Darlington Shingirirai; Tiffin, Nicki; Gamieldien, Junaid
End stage renal disease (ESRD), a more severe form of kidney disease, is considered to be a complex trait that may involve multiple processes which work together on a background of significant genetic susceptibility. Black Africans have been shown to bear an unequal burden of this disease compared to white Europeans, Americans and Caucasians. Despite this, most of the genetic and epidemiological advances in understanding the aetiology of kidney diseases have been made in populations outside sub-Saharan Africa (SSA). Very little research has been undertaken to investigate the key genetic factors that drive ESRD in Africans compared to patients from the rest of the world. Therefore, the primary aim of this Bioinformatics thesis was twofold: firstly, to develop and apply a whole exome sequencing (WES) analysis pipeline and use it to understand a genetic mechanism underlying ESRD in a South African population of mixed ancestry. I hypothesized that the pipeline would enable the discovery of highly penetrant rare variants with large effect sizes, which are expected to explain an important fraction of the genetic aetiology and pathogenesis of ESRD in these African patients. Secondly, the aim was to develop and set up a multicenter clinical database that would capture a plethora of clinical data for patients with Lupus, one of the risk factors for ESRD. From WES of six family members (five cases and one control), a total of 23 196 SNVs, 1445 insertions and 1340 deletions overlapped amongst all affected family members.
The variants were consistent with the autosomal dominant inheritance pattern inferred in this family. Of these, only 1550 SNVs, 67 insertions and 112 deletions were present in all affected family members but absent in the unaffected family member. Following detailed evaluation of the evidence for variant implication and pathogenicity, only 3 very rare heterozygous missense variants in 3 genes, COL4A1 [p.R476W], ICAM1 [p.P352L] and COL16A1 [p.T116M], were considered potentially disease causing. Computational relatedness analysis revealed the approximate amount of DNA shared by family members and confirmed the reported relatedness. Genotyping of the Y chromosome was additionally performed to assist in confirming sample identity. The clinical database has been designed and is being piloted at Groote Schuur medical Hospital at the University of Cape Town. Currently, about 290 patients have been entered in the registry. The resources and methodologies developed in this thesis have the potential to contribute not only to the understanding of ESRD and its risk factors, but also to the successful application of WES in clinical practice. Importantly, the thesis contributes significant information on the genetics of ESRD based on an African family and will also improve scientific infrastructure on the African continent. Clinical databasing will go a long way towards enabling clinicians to collect and store standardised clinical data for their patients.

Computational prediction of host-pathogen protein-protein interactions (University of the Western Cape, 2017) Ahmed, Ibrahim H.I.; Christoffels, Alan; Witbooi, Peter
Supervised machine learning approaches have been applied successfully to the prediction of protein-protein interactions (PPIs) within a single organism, i.e., intra-species predictions.
However, because of the absence of large amounts of experimentally validated PPI data for training and testing, fewer studies have successfully applied these techniques to host-pathogen PPI, i.e., inter-species comparisons. Among the host-pathogen studies, most have focused on human-virus interactions and specifically human-HIV PPI data. Additional improvements to machine learning techniques and feature sets are important to improve the classification accuracy of host-pathogen protein-protein interaction prediction. The primary aim of this bioinformatics thesis was to develop a binary classifier with an appropriate feature set for host-pathogen protein-protein interaction prediction using published human-Hepatitis C virus PPI, and to test the model on available host-pathogen data for human-Bacillus anthracis PPI. Twelve different feature sets were compared to find the optimal set. The feature selection process revealed that our novel quadruple feature (a subsequence of four consecutive amino acids) combined with sequence similarity and human interactome network properties (such as degree, clustering coefficient, and betweenness centrality) formed the best set. The optimal feature set outperformed those in the relevant published material, giving 95.9% sensitivity, 91.6% specificity and 89.0% accuracy. Using our optimal feature set, we developed a neural network model to predict PPI between human and Mycobacterium tuberculosis. The strategy is to develop a model trained with intra-species PPI data and extend it to inter-species prediction. However, the lack of experimentally validated PPI data between human and Mycobacterium tuberculosis (M. tuberculosis) led us to first assess the feasibility of using validated intra-species PPI data to build a model for inter-species PPI.
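The quadruple feature mentioned above, a subsequence of four consecutive amino acids, can be illustrated by sliding a window of length four along a protein sequence and counting each subsequence. This is only a sketch under assumptions: the abstract does not say how the counts are normalised or encoded for the classifier, and the function name and the toy sequence are hypothetical.

```python
from collections import Counter


def quadruple_counts(sequence, k=4):
    """Count every overlapping subsequence of k consecutive residues.

    With k=4 this yields counts of amino acid quadruples. Sequences
    shorter than k produce an empty Counter.
    """
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Toy peptide, invented for illustration.
counts = quadruple_counts("MKTAMKTA")
```

For the eight-residue toy peptide there are five overlapping windows, and the quadruple MKTA occurs twice. Such counts could then be combined with other features (for example network properties) into a single feature vector, but how that is done in the thesis is not stated here.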
In this model we used human intra-species PPI data combined with Bacillus anthracis intra-species data to develop a binary classification model and extended the model to human-Bacillus anthracis inter-species prediction. Thus, we tested our hypotheses on known human-Bacillus anthracis PPI data, and the results show good performance, with an average accuracy of 89.0%. The same approach was extended to the prediction of PPI between human and Mycobacterium tuberculosis. The predicted human-M. tuberculosis PPI data were further validated using functional enrichment of experimentally verified secretory proteins in M. tuberculosis, cellular compartment analysis and pathway enrichment analysis. Results show that five of the M. tuberculosis secretory proteins within an infected host macrophage that correspond to the mycobacterial virulent strain H37Rv were extracted from the human-M. tuberculosis PPI dataset predicted by our model. Finally, a web server was created to predict PPIs between human and Mycobacterium tuberculosis, which is available online at http://hppredict.sanbi.ac.za. In summary, the concepts, techniques and technologies developed as part of this thesis have the potential to contribute not only to the understanding of PPI between human and Mycobacterium tuberculosis, but can also be extended to other pathogens. Further materials related to this study are available at ftp://ftp.sanbi.ac.za/machine learning.
Item Computational strategies to identify, prioritize and design potential antimalarial agents from natural products (University of the Western Cape, 2015) Egieyeh, Samuel Ayodele; Christoffels, Alan; Malan, Sarel; Syce, James
Introduction: There is an exigent need to develop novel antimalarial drugs in view of the mounting disease burden and emergent resistance to the presently used drugs against the malarial parasites. A large number of natural products, especially those used in ethnomedicine for malaria, have shown varying in-vitro antiplasmodial activities.
Facilitating antimalarial drug development from this wealth of natural products is an imperative and laudable mission to pursue. However, limited resources, high costs, low prospects of success and the high cost of failure during preclinical and clinical studies might militate against the pursuit of this mission. Chemoinformatics techniques can simulate and predict essential molecular properties required to characterize compounds, thus eliminating the cost of equipment and reagents to conduct essential preclinical studies, especially on compounds that may fail during drug development. Therefore, applying chemoinformatics techniques to natural products with in-vitro antiplasmodial activities may facilitate identification and prioritization of those natural products with potential for a novel mechanism of action, desirable pharmacokinetics and a high likelihood of development into antimalarial drugs. In addition, unique structural features mined from these natural products may be templates to design new potential antimalarial compounds. Method: Four chemoinformatics techniques were applied to a collection of selected natural products with in-vitro antiplasmodial activity (NAA) and currently registered antimalarial drugs (CRAD): molecular property profiling, molecular scaffold analysis, machine learning and design of a virtual compound library. Molecular property profiling included computation of key molecular descriptors, physicochemical properties, molecular similarity analysis, estimation of drug-likeness, in-silico pharmacokinetic profiling and exploration of the structure-activity landscape. Analysis of variance was used to assess statistically significant differences in these parameters between NAA and CRAD. Next, molecular scaffold exploration and diversity analyses were performed on three datasets (NAA, CRAD and malarial data from Medicines for Malaria Venture (MMV)) using scaffold counts and cumulative scaffold frequency plots.
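A cumulative scaffold frequency plot of the kind mentioned above is typically built by ranking scaffolds from most to least frequent and tracking the fraction of compounds covered. The sketch below computes the points of such a curve. It assumes that each compound has already been reduced to a scaffold identifier (for example a canonical scaffold string produced by a cheminformatics toolkit); the function name and the toy labels are hypothetical, and the exact procedure used in the thesis is not described in this abstract.

```python
from collections import Counter


def cumulative_scaffold_frequency(scaffolds):
    """Return (fraction of scaffolds, fraction of compounds) points.

    scaffolds holds one scaffold identifier per compound. Scaffolds are
    ranked from most to least frequent, so a steep early rise means a
    few scaffolds account for most compounds (low scaffold diversity).
    """
    counts = Counter(scaffolds)
    total_compounds = len(scaffolds)
    curve = []
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        curve.append((rank / len(counts), covered / total_compounds))
    return curve


# Six toy compounds spread over three hypothetical scaffolds.
curve = cumulative_scaffold_frequency(["A", "A", "A", "B", "B", "C"])
```

Here the most frequent scaffold alone covers half of the compounds. A dataset whose curve stays close to the diagonal would be more diverse by this measure.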
Scaffolds from the NAA were compared to those from CRAD and MMV. A Scaffold Tree was also generated for all the datasets. Thirdly, machine learning approaches were used to build four regression and four classifier models from bioactivity data of NAA using molecular descriptors and molecular fingerprints. Models were built and refined by leave-one-out cross-validation and evaluated with an independent test dataset. Applicability domain (AD), which defines the limit of reliable predictability by the models, was estimated from the training dataset and validated with the test dataset. Possible chemical features associated with reported antimalarial activities of the compounds were also extracted. Lastly, virtual compound libraries were generated with the unique molecular scaffolds identified from the NAA. The virtual compounds generated were characterized by evaluating selected molecular descriptors, toxicity profile, structural diversity from CRAD and prediction of antiplasmodial activity. Results: From the molecular property profiling, a total of 1040 natural products were selected and a total of 13 molecular descriptors were analyzed. Significant differences were observed between the natural products with in-vitro antiplasmodial activities (NAA) and currently registered antimalarial drugs (CRAD) for at least 11 of the molecular descriptors. Molecular similarity and chemical space analysis identified NAA that were structurally diverse from CRAD. Over 50% of NAA with desirable drug-like properties were identified. However, nearly 70% of NAA were identified as potentially "promiscuous" compounds. Structure-activity landscape analysis highlighted compound pairs that formed "activity cliffs". In all, prioritization strategies for the natural products with in-vitro antiplasmodial activities were proposed. The scaffold exploration and analysis results revealed that CRAD exhibited greater scaffold diversity, followed by NAA and MMV respectively. 
Unique scaffolds that were not contained in any other compounds in the CRAD datasets were identified in NAA. The Scaffold Tree showed the preponderance of ring systems in NAA and identified virtual scaffolds, which may be potential bioactive compounds or may elucidate possible synthetic routes for the NAA. From the machine learning study, the regression and classifier models that were most suitable for NAA were identified as model tree M5P (correlation coefficient = 0.84) and Sequential Minimal Optimization (accuracy = 73.46%) respectively. The test dataset fitted into the applicability domain (AD) defined by the training dataset. The "amine" group was observed to be essential for antimalarial activity in both the NAA and MMV datasets, but hydroxyl and carbonyl groups may also be relevant in the NAA dataset. The results of the characterization of the virtual compound library showed significant differences (p value < 0.05) between the virtual compound library and currently registered antimalarial drugs in some molecular descriptors (molecular weight, log partition coefficient, hydrogen bond donors and acceptors, polar surface area, shape index, chiral centres, and synthetic feasibility). Tumorigenic and mutagenic substructures were not observed in a large proportion (> 90%) of the virtual compound library. The virtual compound libraries showed sufficient diversity in structures and the majority were structurally diverse from currently registered antimalarial drugs. Finally, up to 70% of the virtual compounds were predicted as active antiplasmodial agents. Conclusions: Molecular property profiling of natural products with in-vitro antiplasmodial activities (NAA) and currently registered antimalarial drugs (CRAD) produced a wealth of information that may guide decisions and facilitate antimalarial drug development from natural products, and led to a prioritized list of natural products with in-vitro antiplasmodial activities.
Molecular scaffold analysis identified unique scaffolds and virtual scaffolds from NAA that possess desirable drug-like properties, which make them ideal starting points for molecular antimalarial drug design. The machine learning study built, evaluated and identified amply accurate regression and classifier models that were used for virtual screening of natural compound libraries to mine possible antimalarial compounds without the expense of bioactivity assays. Finally, a good number of the virtual compounds generated were structurally diverse from currently registered antimalarial drugs and potentially active antiplasmodial agents. Filtering and optimization may lead to a collection of virtual compounds with unique chemotypes that may be synthesized and added to the screening deck against Plasmodium.
Item Concept Based Knowledge Discovery from Biomedical Literature (University of the Western Cape, 2009) Radovanovic, Aleksandar; Bajic, Vladimir; Faculty of Science
This thesis describes and introduces novel methods for knowledge discovery and presents a software system that is able to extract information from biomedical literature, review interesting connections between various biomedical concepts and, in so doing, generate new hypotheses. The experimental results obtained by using the methods described in this thesis are compared to currently published results obtained by other methods, and a number of case studies are described.
This thesis shows how the technology presented can be integrated with the researchers' own knowledge, experimentation and observations for optimal progression of scientific research.
Item Concept Based Knowledge Discovery From Biomedical Literature (University of the Western Cape, 2009) Radovanovic, Aleksandar; Bajic, Vladimir
Advancement in biomedical research and continuous growth of scientific literature available in electronic form call for innovative methods and tools for information management, knowledge discovery, and data integration. Many biomedical fields such as genomics, proteomics, metabolomics, genetics, and emerging disciplines like systems biology and conceptual biology require synergy between experimental, computational, data mining and text mining technologies. A large amount of biomedical information available in various repositories, such as the US National Library of Medicine Bibliographic Database, emerges as a potential source of textual data for knowledge discovery. Text mining, and its application of natural language processing and machine learning technologies to problems of knowledge discovery, is one of the most challenging fields in bioinformatics. This thesis describes and introduces novel methods for knowledge discovery and presents a software system that is able to extract information from biomedical literature, review interesting connections between various biomedical concepts and, in so doing, generate new hypotheses. The experimental results obtained by using the methods described in this thesis are compared to currently published results obtained by other methods, and a number of case studies are described.
This thesis shows how the technology presented can be integrated with the researchers' own knowledge, experimentation and observations for optimal progression of scientific research.
Item Detection of positive selection resulting from Nevirapine treatment in longitudinal HIV-1 reverse transcriptase sequences (University of the Western Cape, 2006) Ketwaroo, Bibi Farahnaz K.; Hide, Winston; Seoighe, Cathal; Scheffle, Konrad; South African National Bioinformatics Institute (SANBI); Faculty of Science
Nevirapine (NVP) is a cheap anti-retroviral drug used in poor countries worldwide, administered to pregnant women at the onset of labour to inhibit the HIV enzyme reverse transcriptase. Viruses which may get transmitted to newborns are deficient in this enzyme, and HIV-1 infection cannot be established, thereby preventing mother to child transmission (MTCT). In some cases, babies get infected and positive selection for viruses resistant to nevirapine may be inferred. Positive selection can be inferred from sequence data when the rate of nonsynonymous substitutions is significantly greater than the rate of synonymous substitutions. Unfortunately, it is found that available positive selection methods should not be used to analyse before- and after-NVP treatment sequence pairs associated with MTCT. Methods which use phylogenetic trees to infer positive selection trace synonymous and nonsynonymous substitutions further back in time than the short time duration during which selection for NVP occurred. The other group of methods for inferring positive selection, the pairwise methods, do not have appreciable power, because they average substitution rates over all codons in a sequence pair and not just at single codons. We introduce a simple counting method which we call the Pairwise Homologous Codons (PHoCs) method, with which we have inferred positive selection resulting from NVP treatment in longitudinal HIV-1 reverse transcriptase sequences.
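The distinction between synonymous and nonsynonymous substitutions that underlies this kind of analysis can be made concrete by comparing homologous codons in a before/after sequence pair. The sketch below is not the PHoCs method itself, whose details are not given in this abstract. It only tallies raw codon changes using the standard genetic code and does not convert them into rates (no correction for the number of synonymous and nonsynonymous sites). The function name and the toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_changes(before, after):
    """Return (synonymous, nonsynonymous) counts of changed codons.

    Both sequences must be in-frame, aligned and of equal length.
    Codons containing gaps or ambiguous bases are skipped.
    """
    if len(before) != len(after) or len(before) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame and of equal length")
    synonymous = nonsynonymous = 0
    for i in range(0, len(before), 3):
        a = before[i:i + 3].upper()
        b = after[i:i + 3].upper()
        if a == b or a not in CODON_TABLE or b not in CODON_TABLE:
            continue
        if CODON_TABLE[a] == CODON_TABLE[b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Toy pair: AAA->AAC changes Lys to Asn, CTG->CTA keeps Leu, GTA is unchanged.
syn, nonsyn = count_codon_changes("AAACTGGTA", "AACCTAGTA")
```

In the toy pair one codon change is synonymous and one is nonsynonymous. A per-codon comparison across many sequence pairs, rather than an average over the whole sequence, is the general idea that the abstract contrasts with existing pairwise methods.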
The PHoCs method estimates rates of substitutions between before- and after-NVP treatment codons, using a simple pairwise method.
Item Development and implementation of ontology-based systems for mammalian gene expression profiling (2009) Kruger, Adéle; Hide, Winston
The use of ontologies in the mapping of gene expression events provides an effective and comparable method to determine the expression profile of an entire genome across a large collection of experiments derived from different expression sources. In this dissertation I describe the development of the developmental human and mouse eVOC ontologies and demonstrate the ontologies by identifying genes showing a bias for developmental brain expression in human and mouse, identifying transcription factor complexes, and exploring the mouse orthologs of human cancer/testis genes. Model organisms represent an important resource for understanding the fundamental aspects of mammalian biology. Mapping of biological phenomena between model organisms is complex and if it is to be meaningful, a simplified representation can be a powerful means for comparison. The implementation of the ontologies has been illustrated here in two ways. Firstly, the ontologies have been used to illustrate methods to determine clusters of genes showing tissue-restricted expression in humans. The identification of tissue-restricted genes within an organism serves as an indication of the fine-tuning in the regulation of gene expression in a given tissue. Secondly, due to the differences in human and mouse gene expression on a temporal and spatial level, the ontologies were used to identify mouse orthologs of human cancer/testis genes showing cancer/testis characteristics.
With the use of model systems such as mouse in the development of gene-targeted drugs in the treatment of disease, it is important to establish that the expression characteristics and profiles of a drug target in the model system are representative of the characteristics of the target in the system for which it is intended.
Item "Development and implementation of ontology-based systems for mammalian gene expression profiling" (University of the Western Cape, 2009) Kruger, Adele; Hide, Winston
The use of ontologies in the mapping of gene expression events provides an effective and comparable method to determine the expression profile of an entire genome across a large collection of experiments derived from different expression sources. In this dissertation I describe the development of the developmental human and mouse eVOC ontologies and demonstrate the ontologies by identifying genes showing a bias for developmental brain expression in human and mouse, identifying transcription factor complexes, and exploring the mouse orthologs of human cancer/testis genes. Model organisms represent an important resource for understanding the fundamental aspects of mammalian biology. Mapping of biological phenomena between model organisms is complex and if it is to be meaningful, a simplified representation can be a powerful means for comparison. The implementation of the ontologies has been illustrated here in two ways. Firstly, the ontologies have been used to illustrate methods to determine clusters of genes showing tissue-restricted expression in humans. The identification of tissue-restricted genes within an organism serves as an indication of the fine-tuning in the regulation of gene expression in a given tissue. Secondly, due to the differences in human and mouse gene expression on a temporal and spatial level, the ontologies were used to identify mouse orthologs of human cancer/testis genes showing cancer/testis characteristics.
With the use of model systems such as mouse in the development of gene-targeted drugs in the treatment of disease, it is important to establish that the expression characteristics and profiles of a drug target in the model system are representative of the characteristics of the target in the system for which it is intended.
Item Development of a comprehensive annotation and curation framework for analysis of Glossina Morsitans Morsitans expresses sequence tags (University of the Western Cape, 2011) Wamalwa, Mark; Christoffels, Alan; South African National Bioinformatics Institute (SANBI); Faculty of Science
This study has successfully identified transcripts differentially expressed in the salivary gland and midgut and provides candidate genes that are critical to the response to parasite invasion. Furthermore, an open-source Glossina resource (G-ESTMAP) was developed that provides interactive features and browsing of functional genomics data for researchers working in the field of Trypanosomiasis on the African continent.
Item Development of a data processing toolkit for the analysis of next-generation sequencing data generated using the primer ID approach (University of the Western Cape, 2018) Labuschagne, Jan Phillipus Lourens; Travers, Simon
Sequencing an HIV quasispecies with next generation sequencing technologies yields a dataset with significant amplification bias and errors resulting from both the PCR and sequencing steps. Both the amplification bias and sequencing error can be reduced by labelling each cDNA (generated during the reverse transcription of the viral RNA to DNA prior to PCR) with a random sequence tag called a Primer ID (PID). Processing PID data requires additional computational steps, presenting a barrier to the uptake of this method. MotifBinner is an R package designed to handle PID data with a focus on resolving potential problems in the dataset.
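The core idea of the Primer ID approach, grouping reads that carry the same random tag and collapsing each group to one consensus sequence, can be sketched in a few lines. The following is an illustrative Python sketch and not MotifBinner's implementation (MotifBinner is an R package). It assumes reads are already aligned and of equal length within a bin, uses a simple majority vote, and treats the minimum bin size as a hypothetical parameter. All function names are invented for the example.

```python
from collections import Counter, defaultdict


def bin_by_pid(reads):
    """Group read sequences by their PID tag.

    reads is an iterable of (pid, sequence) pairs.
    """
    bins = defaultdict(list)
    for pid, seq in reads:
        bins[pid].append(seq)
    return dict(bins)


def consensus(seqs):
    """Majority base per column; a tie between the top bases yields 'N'."""
    out = []
    for column in zip(*seqs):
        ranked = Counter(column).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            out.append("N")
        else:
            out.append(ranked[0][0])
    return "".join(out)


def consensus_per_bin(reads, min_bin_size=3):
    """Build one consensus per PID bin, skipping bins with too few reads."""
    return {
        pid: consensus(seqs)
        for pid, seqs in bin_by_pid(reads).items()
        if len(seqs) >= min_bin_size
    }


# Toy reads: three share the tag AAT (one carries an error), one has tag GGC.
reads = [("AAT", "ACGT"), ("AAT", "ACGT"), ("AAT", "ACCT"), ("GGC", "TTTT")]
result = consensus_per_bin(reads)
```

In the toy data the single erroneous base in the AAT bin is outvoted, and the one-read GGC bin is dropped because it is too small to support a consensus. Errors inside the PID tags themselves, outlier reads and chimeras need separate handling that this sketch does not attempt.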
MotifBinner groups sequences into bins by their PID tags, identifies and removes false unique bins produced from sequencing errors in the PID tags, and removes outlier sequences from within each bin. MotifBinner produces a consensus sequence for each bin, as well as a detailed report for the dataset, detailing the number of sequences per bin, the number of outlying sequences per bin, rates of chimerism, the number of degenerate letters in the final consensus sequences and the most divergent consensus sequences (potential contaminants). We characterized the ability of the PID approach to reduce the effect of sequencing error, to detect minority variants in viral quasispecies and to reduce the rates of PCR induced recombination. We produced reference samples with known variants at known frequencies to study the effectiveness of increasing PCR elongation time, decreasing the number of PCR cycles, and sample partitioning, by means of dPCR (droplet PCR), on PCR induced recombination. After sequencing these artificial samples with the PID approach, each consensus sequence was compared to the known variants. There are complex relationships between the sample preparation protocol and the characteristics of the resulting dataset. We produce a set of recommendations that can be used to choose the sample preparation that is most useful for the particular study. The AMP trial infuses HIV-negative patients with the VRC01 antibody and monitors for HIV infections. Accurately timing the infection event and reconstructing the founder viruses of these infections are critical for relating infection risk to antibody titer and homology between the founder virus and antibody binding sites. Dr. Paul Edlefsen at the Fred Hutch Cancer Research Institute developed a pipeline that performs infection timing and founder reconstruction.
Here, we document a portion of the pipeline, produce detailed tests for that portion of the pipeline and investigate the robustness of some of the tools used in the pipeline to violations of their assumptions.
Item Development of a hepatitis C virus knowledgebase with computational prediction of functional hypothesis of therapeutic relevance (2011) Samuel, Kojo Kwofie; Bajic, Vladimir; Christoffels, Alan
Ameliorating Hepatitis C Virus (HCV) therapeutic and diagnostic challenges requires robust intervention strategies, including approaches that leverage the plethora of rich data published in biomedical literature to gain greater understanding of HCV pathobiological mechanisms. The multitudes of metadata originating from HCV clinical trials as well as low and high-throughput experiments embedded in text corpora can be mined as data sources for the implementation of HCV-specific resources. HCV-customized resources may support the generation of worthy and testable hypotheses and reveal potential research clues to augment the pursuit of efficient diagnostic biomarkers and therapeutic targets. This research thesis reports the development of two freely available HCV-specific web-based resources: (i) Dragon Exploratory System on Hepatitis C Virus (DESHCV), accessible via http://apps.sanbi.ac.za/DESHCV/ or http://cbrc.kaust.edu.sa/deshcv/, and (ii) Hepatitis C Virus Protein Interaction Database (HCVpro), accessible via http://apps.sanbi.ac.za/hcvpro/ or http://cbrc.kaust.edu.sa/hcvpro/. DESHCV is a text mining system implemented using named concept recognition and co-occurrence-based approaches to computationally analyze about 32,000 HCV related abstracts obtained from PubMed. As part of DESHCV development, the pre-constructed dictionaries of the Dragon Exploratory System (DES) were enriched with HCV biomedical concepts, including HCV proteins, name variants and symbols, to enable HCV knowledge specific exploration.
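Dictionary-based concept recognition followed by co-occurrence counting, the general approach named above, can be sketched as follows. This is a simplified illustration and not the DESHCV implementation: the dictionary contents, the naive case-insensitive substring matching and the function names are all assumptions made for the example, and a real system would need proper tokenisation and disambiguation.

```python
from collections import Counter
from itertools import combinations


def tag_concepts(text, dictionary):
    """Return the set of concepts whose name variants appear in the text.

    dictionary maps a concept to a list of name variants. Matching is a
    naive case-insensitive substring test, used only for illustration.
    """
    lowered = text.lower()
    return {
        concept
        for concept, variants in dictionary.items()
        if any(variant.lower() in lowered for variant in variants)
    }


def cooccurrence_counts(abstracts, dictionary):
    """Count, for each concept pair, the abstracts that mention both."""
    pairs = Counter()
    for text in abstracts:
        found = sorted(tag_concepts(text, dictionary))
        pairs.update(combinations(found, 2))
    return pairs


# Hypothetical dictionary and abstracts, invented for the example.
dictionary = {
    "HCV": ["hepatitis C virus", "HCV"],
    "thalidomide": ["thalidomide"],
    "vimentin": ["vimentin"],
}
abstracts = [
    "Thalidomide was evaluated in chronic HCV infection.",
    "Vimentin expression in hepatitis C virus infected cells.",
    "Thalidomide and HCV.",
]
pairs = cooccurrence_counts(abstracts, dictionary)
```

Concept pairs that co-occur often can then be ranked as candidate associations for a researcher to inspect. Pairs that never co-occur directly but share a common neighbour are the usual starting point for hypothesis generation in this style of literature mining.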
The DESHCV query inputs consist of user-defined keywords, phrases and concepts. DESHCV is therefore an information extraction tool that enables users to computationally generate associations between concepts and supports the prediction of potential hypotheses with diagnostic and therapeutic relevance. Additionally, users can retrieve a list of abstracts containing tagged concepts that can be used to overcome the herculean task of manual biocuration. DESHCV has been used to simulate a previously reported thalidomide-chronic hepatitis C hypothesis and also to model a potentially novel thalidomide-amantadine hypothesis. HCVpro is a relational knowledgebase dedicated to housing experimentally detected HCV-HCV and HCV-human protein interaction information obtained from other databases and curated from biomedical journal articles. Additionally, the database contains consolidated biological information consisting of hepatocellular carcinoma (HCC) related genes, comprehensive reviews on HCV biology and drug development, functional genomics and molecular biology data, and cross-referenced links to canonical pathways and other essential biomedical databases. Users can retrieve enriched information including interaction metadata from HCVpro by using protein identifiers, gene chromosomal locations, experiment types used in detecting the interactions, PubMed IDs of journal articles reporting the interactions, annotated protein interaction IDs from external databases, and via "string searches". The utility of HCVpro has been demonstrated by harnessing integrated data to suggest putative baseline clues that seem to support current diagnostic exploratory efforts directed towards vimentin. Furthermore, eight genes comprising ACLY, AZGP1, DDX3X, FGG, H19, SIAH1, SERPING1 and THBS1 have been recommended for possible investigation to evaluate their diagnostic potential.
The data archived in HCVpro can be utilized to support protein-protein interaction network-based candidate HCC gene prioritization for possible validation by experimental biologists.