Magister Scientiae - MSc (Bioinformatics)
Browsing by Author "Christoffels, Alan"
Now showing 1 - 19 of 19
Item A comparative genomics approach towards classifying immunity-related proteins in the tsetse fly (2009) Mpondo, Feziwe; Hide, Winston; Christoffels, Alan
Tsetse flies (Glossina spp) are vectors of African trypanosome (Trypanosoma spp) parasites, causative agents of Human African trypanosomiasis (sleeping sickness) and Nagana in livestock. Research suggests that tsetse fly immunity factors are key determinants in the success and failure of infection and the maturation process of parasites. An analysis of tsetse fly immunity factors is limited by the paucity of genomic data for Glossina spp. Nevertheless, completely sequenced and assembled genomes of Drosophila melanogaster, Anopheles gambiae and Aedes aegypti provide an opportunity to characterize protein families in species such as Glossina by using a comparative genomics approach. In this study we characterize thioester-containing proteins (TEPs), a sub-family of immunity-related proteins, in Glossina by leveraging the EST data for G. morsitans and the genomic resources of D. melanogaster, A. gambiae as well as A. aegypti. A total of 17 TEPs corresponding to Drosophila (four TEPs), Anopheles (eleven TEPs) and Aedes aegypti (two TEPs) were collected from published data supplemented with GenBank searches. In the absence of genome data for G. morsitans, 124 000 G. morsitans ESTs were clustered and assembled into 18 413 transcripts (contigs and singletons). Five Glossina contigs (Gmcn1115, Gmcn1116, Gmcn2398, Gmcn2281 and Gmcn4297) were identified as putative TEPs by BLAST searches. Phylogenetic analyses were conducted to determine the relationship of collected TEP proteins. Gmcn1115 clustered with DmtepI and DmtepII while Gmcn2398 is placed in a separate branch, suggesting that it is specific to G. morsitans. The TEPs are highly conserved within D. melanogaster, as reflected in the conservation of the thioester domain, while only two TEPs in A. gambiae and one in A. aegypti show conservation of the thioester domain, suggesting that these proteins are subjected to high levels of selection. Despite the absence of a sequenced genome for G. morsitans, at least two putative TEPs were identified from EST data.

Item A comparative genomics approach towards classifying immunity-related proteins in the tsetse fly (University of the Western Cape, 2009) Mpondo, Feziwe; Hide, Winston; Christoffels, Alan
Tsetse flies (Glossina spp) are vectors of African trypanosome (Trypanosoma spp) parasites, causative agents of Human African trypanosomiasis (sleeping sickness) and Nagana in livestock. Research suggests that tsetse fly immunity factors are key determinants in the success and failure of infection and the maturation process of parasites. An analysis of tsetse fly immunity factors is limited by the paucity of genomic data for Glossina spp. Nevertheless, completely sequenced and assembled genomes of Drosophila melanogaster, Anopheles gambiae and Aedes aegypti provide an opportunity to characterize protein families in species such as Glossina by using a comparative genomics approach. In this study, we characterize thioester-containing proteins (TEPs), a sub-family of immunity-related proteins, in Glossina by leveraging the EST data for G. morsitans and the genomic resources of D. melanogaster, A. gambiae as well as A. aegypti.

Item A computational characterisation of the relationship between genome structure and disease genes (University of the Western Cape, 2012) Kibler, Tracey Deborah; Tiffin, Nicki; Christoffels, Alan
This is a pilot study to investigate the relationship between disease gene status and the structure of the human genome with specific reference to regions of recombination. It compares certain characteristics of a control set of genes, with no reported association or function in any known disease, with a second set of well-curated genes with a known association to a disease.
One of the benefits of recombination is the introduction of new combinations of genetic variation in the genome. Recombination hotspots are regions on the chromosome where higher than normal frequencies of breaking and rejoining between homologous chromosomes occur during meiosis. The hotspot regions exhibit both a non-random distribution across the human genome and varying frequencies of breaking and rejoining. The study analyzed a set of features that represent general properties of human genes, namely base composition (percentage GC content), genetic variation (single nucleotide polymorphisms - SNPs), gene length, and positional effect (distance from chromosome end), in both the disease-associated gene set and the control set. These features were linked to recombination hotspots in the human genome and the frequency of recombination at these hotspots. Descriptive statistics were used to determine differences between the occurrences of these features in disease-associated genes compared to the control set, as well as differences in the occurrence of these same features in the subset of genes containing an internal recombination hotspot compared to the genes with no internal recombination hotspot. The study found that disease-associated genes are generally longer than those in the control set, which is consistent with previous studies. It also found that disease-associated genes are much more likely to contain a recombination hotspot than genes with no disease association. The study did not, however, find any association between disease gene status and the other features, namely GC content, SNP numbers or the position of a gene on the chromosome.
Further analysis of the data suggested that the increased probability of disease-associated genes containing a recombination hotspot is most likely an effect of longer gene length and that the presence of a recombination hotspot is not sufficient in its own right to cause disease gene status.

Item Data Science techniques for predicting plant genes involved in secondary metabolites production (University of the Western Cape, 2018) Muteba, Ben Ilunga; Christoffels, Alan
Plant genome analysis is currently experiencing a boost due to reduced costs associated with the development of next generation sequencing technologies. Knowledge of genetic background can be applied to guide targeted plant selection and breeding, and to facilitate natural product discovery and biological engineering. In medicinal plants, secondary metabolites are of particular interest because they often represent the main active ingredients associated with health-promoting qualities. Plant polyphenols are a highly diverse family of aromatic secondary metabolites that act as antimicrobial agents, UV protectants, and insect or herbivore repellents. Most of the genome mining tools developed to understand genetic materials have very seldom addressed secondary metabolite genes and biosynthesis pathways. Little significant research has been conducted to study key enzyme factors that can predict a class of secondary metabolite genes from polyketide synthases. The objectives of this study were twofold. Primarily, it aimed to identify the biological properties of secondary metabolite genes and the selection of a specific gene, naringenin-chalcone synthase or chalcone synthase (CHS). The study hypothesized that data science approaches in mining biological data, particularly secondary metabolite genes, would enable the compulsory disclosure of some aspects of secondary metabolite (SM).
Secondarily, the aim was to propose a proof of concept for classifying or predicting plant genes involved in polyphenol biosynthesis from data science techniques and convey these techniques in computational analysis through machine learning algorithms and mathematical and statistical approaches. Three specific challenges experienced while analysing secondary metabolite datasets were: 1) class imbalance, which refers to the lack of proportionality among protein sequence classes; 2) high dimensionality, which refers to the large feature space that arises when analysing bioinformatics datasets; and 3) differences in protein sequence length, that is, the fact that protein sequences have different lengths. Given these inherent issues, developing precise classification models and statistical models proves a challenge. Therefore, the prerequisite for effective SM plant gene mining is dedicated data science techniques that can collect, prepare and analyse SM genes.

Item Development of Open source Laboratory Information Management System (LIMS) For Human Biobanking (University of the Western Cape, 2018) Ademuyiwa, Toluwaleke; Christoffels, Alan
Biobanks are collections of biological samples and associated data for future use. The day-to-day activities in a biobank laboratory are underpinned by a laboratory information management system (LIMS). For example, the LIMS manages the execution of tests on biospecimens and tracks their movement and processing through the laboratory. There is a range of commercially available biobank LIMS systems on the market but their costs are prohibitive in a resource-limited setting. The cost of commercial off-the-shelf software includes the initial cost of acquiring the system, as well as the cost of maintenance and support throughout the software's life cycle. The Bika LIMS system, on the other hand, is free and open source software (FOSS) with decreased license cost, used routinely in non-medical laboratories.
Ideally, if Bika LIMS could be customised to handle human biospecimens, then both biobanks and genetics laboratories could benefit. Central to any biobank functionality in Bika LIMS is the ability to import information from routine biomedical equipment. We identified two instruments that are key to human biobanking and are lacking in Bika LIMS, namely the BioDrop µLITE and the Qubit Fluorometric instrument. Import interfaces for importing DNA/RNA concentration analyses from these instruments and management of the results with associated sample information would add value to the LIMS. The aim of the thesis was to customise Bika LIMS for utility in a biomedical laboratory. In collaboration with colleagues at Tygerberg medical school, the Bika LIMS software was customised to accommodate the DNA and RNA concentration analysis results for a pathology laboratory and the LIMS workflows were customised for use at Tygerberg medical school. In this process the manual operations of the Tygerberg medical school laboratory would migrate to the use of Bika LIMS. The analytical module in Bika LIMS was implemented in Python, using logic that allows importing of specific analyses. A template was created for the BioDrop µLITE and Qubit Fluorometric instruments and used for developing the interface for an analysis import form. The instruments generate results in CSV file format. A parser was created to read and parse the files uploaded from the import form, by splitting them into parts, extracting the data, and populating key-value pairs.
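The parsing step described above (split the uploaded CSV into parts, extract the data, populate key-value pairs) can be illustrated with a minimal Python sketch. This is not the Bika LIMS importer: the function name `parse_instrument_csv` and the column names `Sample ID`, `Concentration` and `Units` are hypothetical, because the actual BioDrop and Qubit export layouts are not given in this record.

```python
import csv
import io


def parse_instrument_csv(text, key_column="Sample ID"):
    """Split CSV text into rows and return {sample id: {column: value}}.

    The column names are hypothetical; real instrument exports may differ.
    """
    results = {}
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        sample_id = (row.get(key_column) or "").strip()
        if not sample_id:
            continue  # skip blank or malformed lines
        # Keep every other column as a key-value pair for this sample.
        results[sample_id] = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None and key != key_column
        }
    return results


# Hypothetical two-sample export used only to exercise the sketch.
example = "Sample ID,Concentration,Units\nS001,52.4,ng/uL\nS002,17.9,ng/uL\n"
parsed = parse_instrument_csv(example)
```

Keying the result by sample identifier is one plausible design, since it lets each imported value be matched to the sample information already held in the LIMS.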
The controller manages the submission of the form by initialising the parser that imports the specific file into the LIMS where it is managed by the configured Bika LIMS workflow.

Item Effects of nucleotide variation on the structure and function of human arylamine n-acetyltransferase 1 (2012) Akurugu, Wisdom Alemya; Christoffels, Alan
The human arylamine N-acetyltransferase 1 (NAT1) is critical in determining the duration of action and pharmacokinetics of amine-containing drugs such as para-aminosalicylic acid and para-aminobenzoyl glutamate used in clinical therapy of tuberculosis (TB), as well as influencing the balance between detoxification and metabolic activation of these drugs. SNPs in this enzyme are continuously being detected and indicate inter-ethnic and inter-individual variation in the enzyme function. The effects of nsSNPs on the structure and function of proteins are routinely analyzed using the SIFT and POLYPHEN-2 prediction algorithms. These two algorithms have a false-negative rate of as much as 25% of nsSNPs. This study aimed to explore the use of homology modeling, including residue interactions, Gibbs free energy change and solvent accessibility, as additional evidence for predicting nsSNP effects on enzyme function. This study evaluated the functional effects of 14 nsSNPs identified in a South African mixed ancestry population, of which 3 nsSNPs were previously identified in Caucasians. The SNPs were evaluated using structural analysis that included homology modeling, residue interactions, relative solvent accessibility, Gibbs free energy change and sequence conservation in addition to the routinely used nsSNP function prediction algorithms, SIFT and POLYPHEN-2. The structural analysis implemented in this study showed a loss of hydrogen bonds for S259R, thereby affecting protein function, which contradicts predictions obtained from the SIFT and POLYPHEN-2 algorithms.
The variant N245I was shown to be neutral, which contradicted the predictions from SIFT and POLYPHEN-2. Structural analysis predicted that variant R242M would affect protein stability and therefore NAT1 function, in agreement with POLYPHEN-2 predictions but contradicting predictions from SIFT. No structural changes were expected for variant E264K, in agreement with predictions obtained from POLYPHEN-2 but contradicting results from SIFT. The functions of the remaining 10 nsSNPs were consistent with those predicted by SIFT and POLYPHEN-2, namely that four variants (R117T, E167Q, T193S and T240S) do not affect the NAT1 function whereas R166T, F202V, Q210P, D229H, V231G and V235A could affect the enzyme function. This study provided the first evaluation of the functional effects of 11 newly characterized nsSNPs on the NAT1 tuberculosis drug-metabolizing enzyme. The six functionally important nsSNPs predicted by all three methods and the four SNPs with contradictory results will be tested experimentally by creating a SNP construct that will be cloned into an expression vector. These combined computational and experimental studies will advance our understanding of NAT1 structure-function relationships and allow us to interpret the NAT1 genetic polymorphisms in individuals who are slow or fast acetylators. The results, albeit from a small dataset, demonstrate that the routinely used algorithms are not without flaws and that improvements in functional prediction of nsSNPs can be obtained by close scrutiny of the molecular interactions of wild type and variant amino acids.

Item Enabling the processing of bioinformatics workflows where data is located through the use of cloud and container technologies (University of the Western Cape, 2019) de Beste, Eugene; Christoffels, Alan
The growing size of raw data and the lack of internet communication technology to keep up with that growth is introducing unique challenges to academic researchers.
This is especially true for those residing in rural areas or countries with sub-par telecommunication infrastructure. In this project I investigate the usefulness of cloud computing technology, data analysis workflow languages and portable computation for institutions that generate data. I introduce the concept of a software solution that could be used to simplify the way that researchers execute their analysis on data sets at remote sources, rather than having to move the data. The scope of this project involved conceptualising and designing a software system to simplify the use of a cloud environment as well as implementing a working prototype of said software for the OpenStack cloud computing platform. I conclude that it is possible to improve the performance of research pipelines by removing the need for researchers to have operating system or cloud computing knowledge and that utilising technologies such as this can ease the burden of moving data.

Item Establishing a framework for an African Genome Archive (University of the Western Cape, 2019) Southgate, Jamie; Christoffels, Alan
The generation of biomedical research data on the African continent is growing, with numerous studies realizing the importance of African genetic diversity in discoveries of human origins and disease susceptibility. The decrease in costs to purchase and utilize such tools has enabled research groups to produce datasets of significant scientific value. However, this success story has resulted in a new challenge for African researchers and institutions.
An increase in data scale and complexity has led to an imbalance of infrastructure and skills to manage, store and analyse this data.

Item An evaluation of galaxy and ruffus-scripting workflows system for DNA-seq analysis (University of the Western Cape, 2018) Oluwaseun, Ajayi Olabode; Christoffels, Alan
Functional genomics determines the biological functions of genes on a global scale by using large volumes of data obtained through techniques including next-generation sequencing (NGS). The application of NGS in biomedical research is gaining momentum, and with its adoption becoming more widespread, there is an increasing need for access to customizable computational workflows that can simplify, and offer access to, computer-intensive analyses of genomic data. In this study, the Galaxy and Ruffus frameworks were designed and implemented with a view to addressing the challenges faced in biomedical research. Galaxy, a graphical web-based framework, allows researchers to build a graphical NGS data analysis pipeline for accessible, reproducible, and collaborative data-sharing. Ruffus, a UNIX command-line framework used by bioinformaticians as a Python library to write scripts in object-oriented style, allows for building a workflow in terms of task dependencies and execution logic. In this study, a dual data analysis technique was explored which focuses on a comparative evaluation of the Galaxy and Ruffus frameworks that are used in composing analysis pipelines. To this end, we developed an analysis pipeline in Galaxy, and in Ruffus, for the analysis of Mycobacterium tuberculosis sequence data. Furthermore, this study aimed to compare the Galaxy framework to Ruffus, with preliminary analysis revealing that the analysis pipeline in Galaxy displayed a higher percentage of load and store instructions. In comparison, pipelines in Ruffus tended to be CPU bound and memory intensive. The CPU usage, memory utilization, and runtime execution are graphically represented in this study.
Our evaluation suggests that workflow frameworks have distinctly different features, from ease of use, flexibility, and portability to architectural designs.

Item An evolutionary genomics approach towards analysis of genes implicated in transmission of trypanosomes between tsetse fly and mammalian host (2009) Mwangi, Sarah Wambui; Christoffels, Alan
Human African trypanosomiasis is the world’s third most important parasitic disease affecting human health after malaria and schistosomiasis. The World Health Organization estimates that approximately 60 million people are at risk in sub-Saharan Africa, with up to 50,000 deaths per year caused by trypanosomiasis. Current management of human African trypanosomiasis relies on active surveillance and chemotherapy of infected patients. Efforts to develop a vaccine to immunize the human host have been hampered by antigenic variation of the parasite's cell coat. The advent of the genome era has opened up opportunities for developing novel strategies for interrupting the transmission cycle of trypanosomes, specifically using any of the three players: the human host, the tsetse fly vector and/or the parasite. The human genome has been deciphered and the genomes of several trypanosome species have been sequenced. Sequencing of additional neglected trypanosome species is in progress. The tsetse fly genome is currently being sequenced as part of the genomic activities of the International Glossina Genome Initiative (IGGI). In an attempt to support the tsetse fly sequencing effort, expressed sequence tags (ESTs) from various tissues and developmental stages of Glossina morsitans have been generated. In this study, tsetse fly EST data was analyzed using bioinformatics approaches, focusing on transcripts encoding serpin genes implicated in the immune defenses of tsetse flies. Glossina morsitans homologues to Drosophila melanogaster serpin4, serpin5, and serpin27A and Anopheles gambiae serpin10 were identified in the tsetse fly EST contigs.
Comparison of the reactive center loop of tsetse fly serpins with human α-1-antitrypsin suggests that these tsetse serpins are inhibitory. Preliminary EST clustering did not succeed in assembling 3564 Tsal-encoded ESTs into one contig. In this study, these ESTs were assembled together with three published Tsal cDNAs. A total of 29 Tsal-encoded contigs were generated. An analysis of the sequence variation within the Tsal EST assembled contigs identified five single base mismatches namely A-T, T-A, G-T and T-G. Results from this study form a basis onto which genetic and biochemical experimental studies can be designed, a process that will be successfully carried out once we have a reference genome. Specifically, such studies could aim at genetic modification of tsetse flies towards populations that are inhospitable to trypanosomes. Ultimately, this will supplement current vector control strategies towards elimination of human African trypanosomiasis.

Item Exploring the influence of organisational, environmental, and technological factors on information security policies and compliance at South African higher education institutions: Implications for biomedical research. (University of the Western Cape, 2020) Abiodun, Oluwafemi Peter; Christoffels, Alan; Anderson, Dominique
Headline reports on data breaches worldwide have resulted in heightened concerns about information security vulnerability. In Africa, South Africa is ranked among the top ‘at-risk’ countries with information security vulnerabilities and is the most cybercrime-targeted country. Globally, such cyber vulnerability incidents greatly affect the education sector, due, in part, to the fact that it holds more Personal Identifiable Information (PII) than other sectors.
PII refers to (but is not limited to) ID numbers, financial account numbers, and biomedical research data.

Item Exploring the influence of organisational, environmental, and technological factors on information security policies and compliance at South African higher education institutions: Implications for biomedical research. (University of the Western Cape, 2020) Abiodun, Oluwafemi Peter; Christoffels, Alan
Headline reports on data breaches worldwide have resulted in heightened concerns about information security vulnerability. In Africa, South Africa is ranked among the top ‘at-risk’ countries with information security vulnerabilities and is the most cybercrime-targeted country. Globally, such cyber vulnerability incidents greatly affect the education sector, due, in part, to the fact that it holds more Personal Identifiable Information (PII) than other sectors. PII refers to (but is not limited to) ID numbers, financial account numbers, and biomedical research data. In response to rising threats, South Africa has implemented a regulation called the Protection of Personal Information Act (POPIA), similar to the European Union General Data Protection Regulation (GDPR), which seeks to mitigate cybercrime and information security vulnerabilities. The extent to which African institutions, especially in South Africa, have embraced and responded to these two information security regulations remains vague, making it a crucial matter for biomedical researchers. This study aimed to assess whether the participating universities have proper and reliable information security practices, measures and management in place and whether they fall in line with both national (POPIA) and international (GDPR) regulations. In order to achieve this aim, the study undertook a qualitative exploratory analysis of information security management across three universities in South Africa.
A Technology, Organizational, and Environmental (TOE) model was employed to investigate factors that may influence effective information security measures. A purposeful sampling method was employed to interview participants from each university. From the technological standpoint, the Bring Your Own Device (BYOD) policy, whereby a student on average owns and connects between three and four internet-enabled devices to the network, has created difficulties for IT teams, particularly in the areas of authentication, explosive growth in bandwidth, and access control to secure university servers. In order to develop robust solutions that mitigate these concerns and are not perceived by users as overly prohibitive, executive management should acknowledge that security and privacy issues are a universal problem and not solely an IT problem, and equip the IT teams with the necessary tools and mechanisms to allow them to overcome commonplace challenges. At an organisational level, information security awareness training of all users within the university setting was identified as a key factor in protecting the integrity, confidentiality, and availability of information in highly networked environments. Furthermore, the University’s information security mission must not simply be a link on a website; it should be constantly reinforced by informing users during, and after, the awareness training. In terms of environmental factors, specifically the GDPR and POPIA legislation, one of the most practical and cost-effective ways universities can achieve data compliance requirements is to help staff (both teaching and non-teaching), students, and other employees understand the business value of all information.
Users who are more aware of the sensitivity of data, risks to the data, and their responsibilities when handling, storing, processing, and distributing data during their day-to-day activities will behave in a manner that makes compliance easier at the institutional level. Results obtained in this study helped to elucidate the current status, issues, and challenges which universities are facing in the area of information security management and compliance, particularly in the South African context. Findings from this study point to organizational factors being the most critical when compared to the technological and environmental contexts examined. Furthermore, several proposed information security policies were developed with a view to assisting biomedical practitioners within the institutional setting in protecting sensitive biomedical data.

Item The identification of biologically important secondary structures in disease-causing RNA viruses (University of the Western Cape, 2012) Tanov, Emil Pavlov; Harkins, Gordon W.; Christoffels, Alan
Viral genomes consist of either deoxyribonucleic acid (DNA) or ribonucleic acid (RNA). The viral RNA molecules are responsible for two functions: firstly, their sequences contain the genetic code, which encodes the viral proteins, and secondly, they may form structural elements important in the regulation of the viral life-cycle. Using a host of computational and bioinformatics techniques we investigated how predicted secondary structure may influence the evolutionary dynamics of a group of single-stranded RNA viruses from the Picornaviridae family. We detected significant and marginally significant correlations between regions predicted to be structured and synonymous substitution constraints in these regions, suggesting that selection may be acting on those sites to maintain the integrity of certain structures.
Additionally, coevolution analysis showed that nucleotides predicted to be base paired tended to co-evolve with one another in a complementary fashion in four out of the eleven species examined. Our analyses were then focused on individual structural elements within the genome-wide predicted structures. We ranked the predicted secondary structural elements according to their degree of evolutionary conservation, their associated synonymous substitution rates and the degree to which nucleotides predicted to be base paired coevolved with one another. Top ranking structures coincided with well characterized secondary structures that have been previously described in the literature. We also assessed the impact that genomic secondary structures had on the recombinational dynamics of picornavirus genomes, observing a strong tendency for recombination breakpoints to occur in non-coding regions. However, convincing evidence for the association between the distribution of predicted RNA structural elements and breakpoint clustering was not detected.

Item The identification of biologically important secondary structures in disease-causing RNA viruses. (University of the Western Cape, 2012) Tanov, Emil Pavlov; Harkins, Gordon W.; Christoffels, Alan
Viral genomes consist of either deoxyribonucleic acid (DNA) or ribonucleic acid (RNA). The viral RNA molecules are responsible for two functions: firstly, their sequences contain the genetic code, which encodes the viral proteins, and secondly, they may form structural elements important in the regulation of the viral life-cycle. Using a host of computational and bioinformatics techniques we investigated how predicted secondary structure may influence the evolutionary dynamics of a group of single-stranded RNA viruses from the Picornaviridae family.
We detected significant and marginally significant correlations between regions predicted to be structured and synonymous substitution constraints in these regions, suggesting that selection may be acting on those sites to maintain the integrity of certain structures. Additionally, coevolution analysis showed that nucleotides predicted to be base paired tended to co-evolve with one another in a complementary fashion in four out of the eleven species examined. Our analyses were then focused on individual structural elements within the genome-wide predicted structures. We ranked the predicted secondary structural elements according to their degree of evolutionary conservation, their associated synonymous substitution rates and the degree to which nucleotides predicted to be base paired coevolved with one another. Top ranking structures coincided with well characterized secondary structures that have been previously described in the literature. We also assessed the impact that genomic secondary structures had on the recombinational dynamics of picornavirus genomes, observing a strong tendency for recombination breakpoints to occur in non-coding regions. However, convincing evidence for the association between the distribution of predicted RNA structural elements and breakpoint clustering was not detected.

Item An investigation into the genetic basis of autosomal recessive Osteogenesis imperfecta (OI) III in a South African family of mixed ancestry (University of the Western Cape, 2022) Fernol, Susan Alicia; Christoffels, Alan
Osteogenesis Imperfecta (OI) is a rare skeletal dysplasia that is primarily characterized by bone fragility, recurrent fractures, and bone deformities. Over the years there has been an increase in the number of genes associated with OI. Currently there are twenty causative genes involved in OI spread across an autosomal dominant form, autosomal recessive form, and an X-linked form.
Among the different types of OI, progressively deforming OI has more than one causative gene associated with it, and both autosomal dominant (AD) and autosomal recessive (AR) modes of inheritance. A severe autosomal recessive form of OI type III has been studied in South Africa for more than 40 years.

Item A knowledgebase of stress reponsive gene regulatory elements in arabidopsis Thaliana (University of the Western Cape, 2011) Adam, Muhammed Saleem; Bajic, Vladimir; Christoffels, Alan; South African National Bioinformatics Institute (SANBI); Faculty of Science
Stress responsive genes play a key role in shaping the manner in which plants process and respond to environmental stress. Their gene products are linked to DNA transcription and its consequent translation into a response product. However, whilst these genes play a significant role in manufacturing responses to stressful stimuli, transcription factors coordinate access to these genes, specifically by accessing a gene's promoter region, which houses transcription factor binding sites. Here transcriptional elements play a key role in mediating responses to environmental stress, where each transcription factor binding site may constitute a potential response to a stress signal. Arabidopsis thaliana, a model organism, can be used to identify the mechanism of how transcription factors shape a plant's survival in a stressful environment. Whilst there are numerous plant stress research groups, globally there is a shortage of publicly available stress responsive gene databases. In addition, a number of previous databases such as the Generation Challenge Programme's comparative plant stress-responsive gene catalogue, Stresslink and DRASTIC have become defunct whilst others have stagnated. There is currently a single Arabidopsis thaliana stress response database, called STIFDB, which was launched in 2008 and only covers abiotic stresses as handled by major abiotic stress responsive transcription factor families.
Its data, sourced from microarray expression databases, contain numerous omissions and erroneous entries, and the database has not been updated since its inception. The Dragon Arabidopsis Stress Transcription Factor database (DASTF) was developed in response to the current lack of stress response gene resources. A total of 2333 entries were downloaded from SWISSPROT, manually curated and imported into DASTF. The entries represent 424 transcription factor families. Each entry has a corresponding SWISSPROT, ENTREZ GENBANK and TAIR accession number. The 5' untranslated regions (UTRs) of 417 families were scanned against TRANSFAC's binding site catalogue to identify binding sites. The relational database consists of two tables, a transcription factor table and a transcription factor family table, called DASTF_TF and TF_Family respectively. Using a two-tier client-server architecture, a webserver was built with PHP, APACHE and MYSQL, and the data was loaded into these tables with a PYTHON script. The DASTF database contains 60 entries that correspond to biotic stress, 167 that correspond to abiotic stress and 2106 that respond to biotic and/or abiotic stress. Users can search the database using text, family, chromosome and stress type search options. Online tools such as HMMER, CLUSTALW, BLAST and HYDROCALCULATOR have been integrated into the DASTF database. Users can upload sequences and use HMMER to identify which transcription factor family their sequences belong to.
The website can be accessed at http://apps.sanbi.ac.za/dastf/ and two updates per year are envisaged.
Item Normalization and statistical methods for crossplatform expression array analysis (University of the Western Cape, 2012) Mapiye, Darlington S; Gamieldien, Junaid; Christoffels, Alan
A large volume of gene expression data exists in public repositories like the NCBI’s Gene Expression Omnibus (GEO) and the EBI’s ArrayExpress, and a significant opportunity exists to re-use these data in various combinations for novel in-silico analyses that would otherwise be too costly to perform or for which the equivalent sample numbers would be difficult to collect. For example, combining and re-analysing large numbers of data sets from the same cancer type would increase statistical power while weakening the effects of individual study-specific variability, which would result in more reliable gene expression signatures. Similarly, as the number of normal control samples associated with various cancer datasets is often limiting, datasets can be combined to establish a reliable baseline for accurate differential expression analysis. However, combining different microarray studies is hampered by the fact that different studies use different analysis techniques, microarray platforms and experimental protocols. We have developed and optimised a method which transforms gene expression measurements from continuous to discrete data points by grouping similarly expressed genes into quantiles on a per-sample basis. We also cross-map each probe on each chip to the gene it represents, thereby enabling us to integrate experiments based on the genes they have in common across different platforms. We optimised the quantile discretization method on previously published prostate cancer datasets produced on two different array technologies and then applied it to a larger breast cancer dataset of 411 samples from 8 microarray platforms.
Statistical analysis of the breast cancer datasets identified 1371 differentially expressed genes. Cluster, gene set enrichment and pathway analysis identified functional groups that were previously described in breast cancer, and we also identified a novel module of genes encoding ribosomal proteins that has not been previously reported, but whose overall functions have been implicated in cancer development and progression. The former indicates that our integration method does not destroy the statistical signal in the original data, while the latter is strong evidence that the increased sample size increases the chances of finding novel gene expression signatures. Such signatures are also robust to inter-population variation, and show promise for translational applications like tumour grading, disease subtype classification, informing treatment selection and molecular prognostics.
Item Regulatory attributes of the carotenoid biosynthetic pathway in Arabidopsis Thaliana under abiotic stress (University of the Western Cape, 2012) Khan, Firdous; Christoffels, Alan
Carotenoids are tetraterpenoid (C40) molecules synthesized in plants, fungi, bacteria and algae via the carotenoid biosynthetic pathway (CBP). Some carotenoids are readily converted to vitamin A (VA) in humans, e.g. β-carotene, α-carotene and β-cryptoxanthin [1,2]. Vitamin A deficiency (VAD) affects millions, especially children under the age of five. The CBP in plants is a key source of pro-vitamin A and is vital to the biofortification of staple crops such as maize, rice and sorghum, which could alleviate the global VAD problem. However, the incomplete understanding of the regulation of the pathway limits the ability to predictably control carotenoid content at the systems level. Previous studies have shown that growth conditions, such as light, play a major role in the biosynthesis of carotenoids. A systems biology approach was therefore used to analyse microarray data sets derived from A.
thaliana grown under various conditions and treated with different stimuli. Thirty-two genes have previously been identified as being involved in the CBP. These genes were found to be highly differentially expressed depending on stress type. All stimuli except wounding, including drought, cold, heat, osmotic, oxidative and salt stress, had a significant influence on the CBP genes. Gene expression induced by abiotic stress occurred 30 min after exposure. These findings indicate that an immediate systemic signal is sent to the rest of the plant in response to stress. A correlation analysis revealed a strongly positive correlation between PSY and its co-expressed genes, suggesting they share a common regulatory mechanism. Promoter content analyses identified 20 enriched TFBMs among carotenoid genes. The most prevalent TFBMs found in the promoter regions of the CBP genes show a 1.25-3 fold increase in prevalence with a p-value < 0.05. Similar GO terms are enriched for CBP genes and their co-expressed genes. These findings indicate that carotenoid biosynthetic pathway genes and their co-expressed genes are involved in similar metabolic pathways and functional processes. This study identified cold, drought and heat as influencing carotenoid gene expression and has led to the identification of molecular switches that can be modulated to control the biosynthetic pathway. Four motifs without any GO annotation and with no known match in plant databases were identified using the MEME suite. In this study I propose that these predictions might be novel motifs that could be specific to carotenoid genes and may be directly involved in the regulation of carotenoid biosynthesis. These findings may lead to a better understanding of the underlying regulatory mechanisms involved in the biosynthesis of carotenoids.
Furthermore, these findings may assist in establishing ways of enhancing the production of carotenoids, especially pro-vitamin A, in Arabidopsis thaliana.
Item SNP based literature and data retrieval (University of the Western Cape, 2016) Veldsman, Werner Pieter; Christoffels, Alan
Reference single nucleotide polymorphism (refSNP) identifiers are used to earmark SNPs in the human genome. These identifiers are often found in variant call format (VCF) files. RefSNPs can be useful to include as terms submitted to search engines when sourcing biomedical literature. In this thesis, the development of a bioinformatics software package is motivated, planned and implemented as a web application (http://sniphunter.sanbi.ac.za) with an application programming interface (API). The purpose is to allow scientists searching for relevant literature to query a database using refSNP identifiers and potential keywords assigned to scientific literature by the authors. Multiple queries can be launched simultaneously using either the web interface or the API. In addition, a VCF file parser was developed and packaged with the application to allow users to upload VCF files, extract information from them and write it to a file format that can be interpreted by the novel search engine created during this project. The parsing feature is seamlessly integrated with the web application's user interface, meaning the user is not expected to learn a scripting language. This multi-faceted software system, called SNiPhunter, aims to save researchers time during life sciences literature procurement by suggesting articles based on the number of times a reference SNP identifier has been mentioned in an article. This will allow the user to make a quantitative estimate of the relevance of an article. A second novel feature is the inclusion of the email address of a corresponding author in the results returned to the user, which promotes communication between scientists.
Moreover, links to external functional information are provided to allow researchers to examine annotations associated with their reference SNP identifier of interest. Standard information such as digital object identifiers and publication dates, which is typically provided by other search engines, is also included in the results returned to the user.
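
As a rough illustration of the two ideas in the abstract above (pulling refSNP identifiers out of a VCF file and ranking an article by how often an identifier is mentioned), a minimal Python sketch might look as follows. It is not SNiPhunter's code: the function names `extract_refsnp_ids` and `count_mentions` are hypothetical, and the sketch only assumes the standard VCF layout, in which the third tab-separated column holds semicolon-separated identifiers.

```python
import re

# refSNP identifiers have the form "rs" followed by digits.
RSID = re.compile(r"^rs\d+$")


def extract_refsnp_ids(vcf_lines):
    """Return refSNP identifiers found in the ID column of VCF data lines.

    Hypothetical helper: header lines (starting with '#') and blank lines
    are skipped, and IDs that are not refSNPs (e.g. '.') are ignored.
    """
    ids = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # malformed line: no ID column
        for candidate in fields[2].split(";"):
            if RSID.match(candidate):
                ids.append(candidate)
    return ids


def count_mentions(text, rsid):
    """Count whole-word occurrences of one refSNP identifier in a text."""
    return len(re.findall(r"\b" + re.escape(rsid) + r"\b", text))


# Tiny in-memory example so the sketch runs without any files.
sample_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tDP=14",
    "20\t17330\t.\tT\tA\t3\tq10\tDP=11",
    "20\t1110696\trs6040355\tA\tG,T\t67\tPASS\tDP=10",
]
found_ids = extract_refsnp_ids(sample_vcf)
mentions = count_mentions("rs123 differs from rs1234; rs123 is common.", "rs123")
```

The word-boundary match keeps `rs123` from being counted inside `rs1234`, which matters if mention counts are used as a relevance estimate.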