Philosophiae Doctor - PhD (Bioinformatics)
Browsing by Author "Faculty of Science"
Item Concept Based Knowledge Discovery from Biomedical Literature (University of the Western Cape, 2009) Radovanovic, Aleksandar; Bajic, Vladimir; Faculty of Science
This thesis introduces novel methods for knowledge discovery and presents a software system that extracts information from biomedical literature, reviews interesting connections between biomedical concepts and, in so doing, generates new hypotheses. The experimental results obtained with the methods described in this thesis are compared to published results obtained by other methods, and a number of case studies are described. The thesis shows how the technology presented can be integrated with the researcher's own knowledge, experimentation and observations for optimal progression of scientific research.

Item Detection of positive selection resulting from Nevirapine treatment in longitudinal HIV-1 reverse transcriptase sequences (University of the Western Cape, 2006) Ketwaroo, Bibi Farahnaz K.; Hide, Winston; Seoighe, Cathal; Scheffle, Konrad; South African National Bioinformatics Institute (SANBI); Faculty of Science
Nevirapine (NVP) is a cheap anti-retroviral drug used in poor countries worldwide. It is administered to pregnant women at the onset of labour to inhibit the HIV enzyme reverse transcriptase. Viruses which may get transmitted to newborns are deficient in this enzyme, and HIV-1 infection cannot be established, thereby preventing mother-to-child transmission (MTCT). In some cases, babies get infected and positive selection for viruses resistant to nevirapine may be inferred. Positive selection can be inferred from sequence data when the rate of nonsynonymous substitutions is significantly greater than the rate of synonymous substitutions. Unfortunately, we find that available positive selection methods should not be used to analyse before- and after-NVP treatment sequence pairs associated with MTCT.
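As a rough illustration of the synonymous versus nonsynonymous distinction described above, the following Python sketch counts codon differences between two aligned, in-frame sequences. It is not the method of the thesis: the function name `count_codon_changes` and the example sequences are hypothetical, and it reports raw counts rather than the normalised substitution rates that a real analysis would estimate.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codon_changes(before, after):
    """Return (synonymous, nonsynonymous) counts of differing codons.

    Both sequences must be aligned, in frame and of equal length.
    Codons containing gaps or ambiguous bases are skipped.
    """
    if len(before) != len(after) or len(before) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    syn = nonsyn = 0
    for i in range(0, len(before), 3):
        a = before[i:i + 3].upper()
        b = after[i:i + 3].upper()
        if a == b or a not in CODON_TABLE or b not in CODON_TABLE:
            continue
        if CODON_TABLE[a] == CODON_TABLE[b]:
            syn += 1      # same amino acid: synonymous change
        else:
            nonsyn += 1   # different amino acid: nonsynonymous change
    return syn, nonsyn


# Hypothetical toy pair: AAA->AAC changes Lys to Asn, GGT->GGC keeps Gly.
print(count_codon_changes("AAAGGTCTG", "AACGGCCTG"))
```

Comparing codons position by position, as here, keeps each site separate instead of averaging over the whole sequence pair.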
Methods which use phylogenetic trees to infer positive selection trace synonymous and nonsynonymous substitutions further back in time than the short period during which selection for NVP resistance occurred. The other group of methods for inferring positive selection, the pairwise methods, do not have appreciable power, because they average substitution rates over all codons in a sequence pair and not just at single codons. We introduce a simple counting method, which we call the Pairwise Homologous Codons (PHoCs) method, with which we have inferred positive selection resulting from NVP treatment in longitudinal HIV-1 reverse transcriptase sequences. The PHoCs method estimates rates of substitutions between before- and after-NVP treatment codons, using a simple pairwise method.

Item Development of a comprehensive annotation and curation framework for analysis of Glossina Morsitans Morsitans expressed sequence tags (University of the Western Cape, 2011) Wamalwa, Mark; Christoffels, Alan; South African National Bioinformatics Institute (SANBI); Faculty of Science
This study has successfully identified transcripts differentially expressed in the salivary gland and midgut and provides candidate genes that are critical to the response to parasite invasion. Furthermore, an open-source Glossina resource (G-ESTMAP) was developed that provides interactive features and browsing of functional genomics data for researchers working in the field of trypanosomiasis on the African continent.

Item Generation of a human gene index and its application to disease candidacy (University of the Western Cape, 2001) Christoffels, Alan; Hide, Winston; Faculty of Science
With easy access to technology to generate expressed sequence tags (ESTs), several groups have sequenced from thousands to several thousands of ESTs. These ESTs benefit from consolidation and organization to deliver significant biological value.
A number of EST projects are underway to extract maximum value from fragmented EST resources by constructing gene indices, where all transcripts are partitioned into index classes such that transcripts are put into the same index class if they represent the same gene. A gene index should therefore ideally represent a non-redundant set of transcripts. Indeed, most gene indices aim to reconstruct the gene complement of a genome and their technological developments are directed at achieving this goal. The South African National Bioinformatics Institute (SANBI), on the other hand, embarked on the development of the Sequence Tag Alignment and Consensus Knowledgebase (STACK) database, which focused on the detection and visualisation of transcript variation in the context of developmental and pathological states, using all publicly available ESTs. Preliminary work on the STACK project partitioned the EST data into arbitrarily chosen tissue categories as a means of reducing the EST sequences to manageable sizes for subsequent processing. The tissue partitioning provided the template material for developing error-checking tools to analyse the information embedded in the error-laden EST sequences. However, tissue partitioning increases redundancy in the sequence data because one gene can be expressed in multiple tissues, with the result that multiple tissue-partitioned transcripts will correspond to the same gene. Therefore, the sequence data represented by each tissue category had to be merged in order to obtain a comprehensive view of expressed transcript variation across all available tissues. The need to consolidate all EST information provided the impetus for developing a STACK human gene index, also referred to as a whole-body index.
In this dissertation, I report on the development of a STACK human gene index represented by consensus transcripts where all constituent ESTs sample single or multiple tissues in order to provide the correct developmental and pathological context for investigating sequence variation. Furthermore, the human gene index is assessed as a disease-candidate gene discovery resource. A feasible approach to construction of a whole-body index required the ability to process error-prone EST data in excess of one million sequences (1,198,607 ESTs as of December 1998). In the absence of new clustering algorithms at that time, we successfully ported D2_CLUSTER, an EST clustering algorithm, to the high performance shared multiprocessor machine, Origin2000. Improvements to the parallelised version of D2_CLUSTER included: (i) the ability to cluster sequences on as many as 126 processors (for example, 462000 ESTs were clustered in 31 hours on 126 R10000 MHz processors, Origin2000); (ii) enhanced memory management that allowed for clustering of mRNA sequences as long as 83000 base pairs; (iii) the ability to have the input sequence data accessible to all processors, allowing rapid access to the sequences; and (iv) a restart module that allowed a job to be restarted if it was interrupted. These enhancements to the parallelised version of D2_CLUSTER allowed for the processing of EST datasets in excess of 1 million sequences. A hierarchical approach was adopted where 1,198,607 ESTs from GenBank release 110 (October 1998) were partitioned into "tissue bins" and each tissue bin was processed through a pipeline that included masking for contaminants, clustering, assembly, assembly analysis and consensus generation. A total of 478,707 consensus transcripts were generated for all the tissue categories and these sequences served as the input data for the generation of the whole-body index sequences.
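The idea of placing transcripts in the same index class when they appear to represent the same gene can be sketched as single-linkage clustering. The Python example below is a simplified stand-in, not D2_CLUSTER: it links two sequences when they share a minimum number of identical words, and the function names, the word length `k`, the threshold `min_shared` and the toy sequences are all illustrative assumptions. The all-against-all comparison is only practical for small inputs.

```python
def kmers(seq, k):
    """Return the set of all words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_by_shared_words(seqs, k=8, min_shared=3):
    """Single-linkage clustering: link sequences sharing >= min_shared k-mers.

    Returns a sorted list of clusters, each a list of sequence indices.
    """
    parent = list(range(len(seqs)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    words = [kmers(s.upper(), k) for s in seqs]
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if len(words[i] & words[j]) >= min_shared:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical toy reads: 0 overlaps 1, 1 overlaps 2, 3 is unrelated.
reads = [
    "ATGGCGTACGTTAGCCGATAGG",
    "GTTAGCCGATAGGCTTAACCGGATCC",
    "CTTAACCGGATCCAAGGTTTCA",
    "TTTTTTTTCCCCCCCCAAAAAAAA",
]
print(cluster_by_shared_words(reads))
```

Because linkage is transitive, reads 0 and 2 end up in one cluster through read 1 even though they share no words directly, while read 3 remains a singleton.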
The clustering of all tissue-derived consensus transcripts was followed by the collapse of each consensus sequence to its individual ESTs prior to assembly and whole-body index consensus sequence generation. The hierarchical approach demonstrated a consolidation of the input EST data from 1,198,607 ESTs to 69,158 multi-sequence clusters and 162,439 singletons (or individual ESTs). Chromosomal locations were added to 25,793 whole-body index sequences through assignment of genetic markers such as radiation hybrid markers and Généthon markers. The whole-body index sequences were made available to the research community through a sequence-based search engine (http://ziggy.sanbi.ac.za/~alan/researchINDEX.html).

Item Prediction of antimicrobial peptides using hyperparameter optimized support vector machines (University of the Western Cape, 2011) Gabere, Musa Nur; Vladimir, Bajic; Christoffels, Alan; South African National Bioinformatics Institute (SANBI); Faculty of Science
Antimicrobial peptides (AMPs) play a key role in the innate immune response. They are found in a wide range of eukaryotes including mammals, amphibians, insects, plants, and protozoa. In lower organisms, AMPs function merely as antibiotics by permeabilizing cell membranes and lysing invading microbes. Prediction of antimicrobial peptides is important because experimental methods used in characterizing AMPs are costly, time-consuming and resource-intensive, and because identification of AMPs in insects can serve as a template for the design of novel antibiotics. To this end, data on antimicrobial peptides is first extracted from UniProt, manually curated and stored in a centralized database called the Dragon Antimicrobial Peptide Database (DAMPD). Secondly, based on the curated data, models to predict antimicrobial peptides are created using support vector machines with optimized hyperparameters.
In particular, global optimization methods such as grid search, pattern search and derivative-free methods are utilised to optimize the SVM hyperparameters. These models are useful in characterizing unknown antimicrobial peptides. Finally, a webserver is created that will be used to predict antimicrobial peptides in haematophagous insects such as Glossina morsitans and Anopheles gambiae.
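Grid search, the simplest of the optimization strategies named above, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the models of the thesis: it uses a linear SVM trained by batch subgradient descent instead of a kernel SVM, the grid tunes the penalty `C` and a learning rate `lr`, and the two-dimensional toy data stand in for real peptide features. All names are hypothetical.

```python
from itertools import product


def train_linear_svm(X, y, C, lr, epochs=200):
    """Minimise 0.5*||w||^2 + C*sum(hinge loss) by batch subgradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = list(w), 0.0  # gradient of the regulariser is w itself
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point violates the margin
                gw = [g - C * yi * xj for g, xj in zip(gw, xi)]
                gb -= C * yi
        w = [wj - lr * g for wj, g in zip(w, gw)]
        b -= lr * gb
    return w, b


def accuracy(model, X, y):
    w, b = model
    preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1
             for xi in X]
    return sum(p == yi for p, yi in zip(preds, y)) / len(y)


def grid_search(X, y, grid, folds=4):
    """Return (best_params, best_mean_accuracy) over all C/lr combinations."""
    best_params, best_score = None, -1.0
    for C, lr in product(grid["C"], grid["lr"]):
        scores = []
        for f in range(folds):
            held_out = set(range(f, len(X), folds))  # every folds-th point
            Xtr = [x for i, x in enumerate(X) if i not in held_out]
            ytr = [v for i, v in enumerate(y) if i not in held_out]
            Xte = [x for i, x in enumerate(X) if i in held_out]
            yte = [v for i, v in enumerate(y) if i in held_out]
            scores.append(accuracy(train_linear_svm(Xtr, ytr, C, lr), Xte, yte))
        mean = sum(scores) / folds
        if mean > best_score:
            best_params, best_score = {"C": C, "lr": lr}, mean
    return best_params, best_score


# Hypothetical toy data: two well-separated classes in two dimensions.
POS = [(2.0, 1.5), (1.5, 2.5), (2.5, 2.0), (1.0, 1.2),
       (3.0, 1.0), (1.8, 2.2), (2.2, 0.8), (1.2, 3.0)]
X = POS + [(-a, -b) for a, b in POS]
y = [1] * len(POS) + [-1] * len(POS)
GRID = {"C": [0.01, 0.1, 1.0], "lr": [0.001, 0.01]}

best_params, best_score = grid_search(X, y, GRID)
print(best_params, best_score)
```

Each grid point is scored by cross-validated accuracy, so the chosen hyperparameters are judged on held-out data. On data this easy many grid points tie, and the first one encountered is kept.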