Magister Scientiae - MSc (Bioinformatics)

Now showing 1 - 20 of 44
  • Item
    Regulatory attributes of the carotenoid biosynthetic pathway in Arabidopsis thaliana under abiotic stress
    (University of the Western Cape, 2012) Khan, Firdous; Christoffels, Alan
    Carotenoids are tetraprenoid (C40) molecules synthesized in plants, fungi, bacteria and algae via the carotenoid biosynthetic pathway (CBP). Some carotenoids are readily converted to vitamin A (VA) in humans, e.g. β-carotene, α-carotene and β-cryptoxanthin [1,2]. Vitamin A deficiency (VAD) affects millions, especially children under the age of five. The CBP in plants is a key source of pro-vitamin A and is vital to the biofortification of staple crops such as maize, rice and sorghum, which could alleviate the global VAD problem. However, the incomplete understanding of the regulation of the pathway limits the ability to predictably control carotenoid content at the systems level. Previous studies have shown that growth conditions, such as light, play a major role in the biosynthesis of carotenoids. A systems biology approach was therefore used to analyse microarray data sets derived from A. thaliana grown under various conditions and treated with different stimuli. Thirty-two genes have previously been identified as being involved in the CBP. These genes were found to be highly differentially expressed depending on stress type. All stimuli except wounding, including drought, cold, heat, osmotic, oxidative and salt stress, had a significant influence on the CBP genes. Gene expression induced by abiotic stress occurred 30 min after exposure. These findings indicate that an immediate systemic signal is sent to the rest of the plant in response to stress. A correlation analysis revealed a strongly positive correlation between PSY and its co-expressed genes, suggesting they share a common regulatory mechanism. Promoter content analyses identified 20 enriched TFBMs among carotenoid genes. The most prevalent TFBMs found in the promoter regions of the CBP genes show a 1.25-3 fold increase in prevalence with a p-value < 0.05. Similar GO terms are enriched for CBP genes and their co-expressed genes.
These findings indicate that carotenoid biosynthetic pathway genes and their co-expressed genes are involved in similar metabolic pathways and functional processes. This study identified cold, drought and heat as influencing carotenoid gene expression and has led to the identification of molecular switches that can be modulated to control the biosynthetic pathway. Four motifs without any GO annotation and with no specific known motif in plant databases were identified using the MEME Suite. In this study I propose that these predictions might be novel motifs that could be specific to carotenoid genes and may be directly involved in the regulation of carotenoid biosynthesis. These findings may lead to a better understanding of the underlying regulatory mechanisms involved in the biosynthesis of carotenoids. Furthermore, these findings may assist in establishing ways of enhancing the production of carotenoids, especially pro-vitamin A, in Arabidopsis thaliana.
  • Item
    Analyses of sequence divergence using completely sequenced genomes
    (University of the Western Cape, 2003) Nembaware, Victoria P.; Seoighe, Cathal
    Using the complete genome of Saccharomyces cerevisiae, which underwent duplication after its speciation from Kluyveromyces lactis, a dataset of 119 putative S. cerevisiae - K. lactis ortholog pairs was constructed. S. cerevisiae paralogous pairs that are likely to have duplicated during the whole genome duplication of S. cerevisiae were obtained, and the approach taken in our previous work (Nembaware et al., 2002) was repeated to test whether the presence of a paralogue in S. cerevisiae had an effect on the rate of sequence divergence of the 119 pairs of orthologous genes. We found, however, that substitutions at synonymous sites had reached saturation, and this prevented us from being able to repeat the previous finding with S. cerevisiae and K. lactis. From this study, a publicly available web server (http://hamlyn.sanbi.ac.za/~victoria) that automates the calculation of Ka:Ks values given pairs of homologous CDS sequences is presented.
  • Item
    Development and implementation of ontology-based systems for mammalian gene expression profiling
    (University of the Western Cape, 2009) Kruger, Adele; Hide, Winston
    The use of ontologies in the mapping of gene expression events provides an effective and comparable method to determine the expression profile of an entire genome across a large collection of experiments derived from different expression sources. In this dissertation I describe the development of the developmental human and mouse eVOC ontologies and demonstrate the ontologies by identifying genes showing a bias for developmental brain expression in human and mouse, identifying transcription factor complexes, and exploring the mouse orthologs of human cancer/testis genes.
  • Item
    Identification of insertion-induced enhancers linked to gene drivers within non-coding DNA using a pipeline for diffuse large B-cell lymphoma H3K27ac ChIP-seq data
    (University of the Western Cape, 2022) Jassiem, Wardah; Bendou, Hocine
    Diffuse large B-cell lymphoma (DLBCL) is the most common subtype of non-Hodgkin lymphoma (NHL) and incorporates a diverse range of illnesses with varying biology, clinical manifestations, and therapeutic responses. Functional insertion mutations represent the driving mechanism behind many oncologic illnesses. Research has shown that the role of cancer-associated variants in the non-coding portion of the genome, which is enriched with enhancer elements, is greatly underappreciated. The present study designed a bioinformatics pipeline using Nextflow DSL2 to identify insertion-induced enhancers associated with DLBCL oncogenes within the non-coding genome using H3K27ac ChIP-seq data. Gapped DLBCL reads identified by bowtie were mapped to the human reference genome with bowtie2. Non-coding insertions were identified with BEDTools and verified by pBla.
  • Item
    An investigation into the genetic basis of autosomal recessive Osteogenesis imperfecta (OI) III in a South African family of mixed ancestry
    (University of the Western Cape, 2022) Fernol, Susan Alicia; Christoffels, Alan
    Osteogenesis Imperfecta (OI) is a rare skeletal dysplasia that is primarily characterized by bone fragility, recurrent fractures, and bone deformities. Over the years there has been an increase in the number of genes associated with OI. Currently there are twenty causative genes involved in OI, spread across an autosomal dominant (AD) form, an autosomal recessive (AR) form, and an X-linked form. Among the different types of OI, the progressively deforming OI has more than one causative gene associated with it, and both AD and AR modes of inheritance. A severe autosomal recessive form of OI type III has been studied in South Africa (SA) for more than 40 years.
  • Item
    A comparative genomics approach towards classifying immunity-related proteins in the tsetse fly
    (University of the Western Cape, 2009) Mpondo, Feziwe; Hide, Winston; Christoffels, Alan
    Tsetse flies (Glossina spp.) are vectors of African trypanosome (Trypanosoma spp.) parasites, causative agents of Human African trypanosomiasis (sleeping sickness) and Nagana in livestock. Research suggests that tsetse fly immunity factors are key determinants in the success and failure of infection and the maturation process of parasites. An analysis of tsetse fly immunity factors is limited by the paucity of genomic data for Glossina spp. Nevertheless, the completely sequenced and assembled genomes of Drosophila melanogaster, Anopheles gambiae and Aedes aegypti provide an opportunity to characterize protein families in species such as Glossina by using a comparative genomics approach. In this study, we characterize thioester-containing proteins (TEPs), a sub-family of immunity-related proteins, in Glossina by leveraging the EST data for G. morsitans and the genomic resources of D. melanogaster, A. gambiae as well as A. aegypti.
  • Item
    A deep learning approach to predicting potential virus species crossover using convolutional neural networks and viral protein sequence patterns
    (University of the Western Cape, 2022) Serage, Rudolph; Anderson, Dominique
    Medical science has made substantial progress toward diagnosing, understanding the pathogenesis of, and treating various causative agents of infectious disease; however, novel microbial pathogens continue to emerge, and existing pathogens continue to evolve alternative means to thrive in ever-changing environments. Various infectious disease etiological agents originate from animal reservoirs, and many have, over time, acquired the ability to cross the species barrier and alter their host range. The emergence and re-emergence of zoonotic pathogens is reported to be a consequence of changes in several factors, including ecological, behavioural, and socioeconomic variables which are arguably impossible to control. Computational methods with the capacity to evaluate large datasets are considered invaluable tools for predicting and tracking disease outbreaks and are especially powerful when combined with machine learning techniques.
  • Item
    Reconstruction of gene regulatory networks of inflammation-associated genes in different clinical stages of diffuse large B-cell lymphoma
    (University of the Western Cape, 2022) Mfuphi, Nomlindelo Witness; Bendou, Hocine
    Diffuse large B-cell lymphoma (DLBCL) is a heterogeneous malignancy that is driven by complex gene regulatory networks (GRNs). Numerous genes exert distinct effects on the progression and therapeutic outcome of DLBCL. Previous studies have associated DLBCL with inflammation but the GRNs involved in this mechanism have not yet been explored. The objectives of this current study are to reconstruct inflammation-associated networks and to understand the effects of inflammation on the pathogenesis and progression of DLBCL in different clinical stages.
  • Item
    HIV Subtype C Diversity: Analysis of the Relationship of Sequence Diversity to Proposed Epitope Locations
    (University of the Western Cape, 2002) Ernstoff, Elana Ann; Hide, Winston
    Southern Africa is facing one of the most serious HIV epidemics. This project contributes to the HIVNET, Network for Prevention Trials cohort for vaccine development. HIV's biology and rapid mutation rate have made vaccine design difficult. We examined HIV-1 subtype C diversity and how it relates to CTL epitope location along viral gag sequences. We found a negative correlation between codon sites under positive selection and epitope regions, suggesting epitope regions are evolutionarily conserved. It is possible that detected due to the reference regions, yet fail to be viral population. To test if CTL clustering is an we calculated differences between the gag codons and the a weak negative correlation, suggesting epitopes in less conserved regions may be evading detection. Locating conserved and optimal epitopes that can be recognized by CTLs is essential for the design of vaccine reagents.
  • Item
    Molecular modeling and simulation studies to prioritize sequence variants identified by whole-exome sequencing in a South African family with Parkinson's disease
    (University of the Western Cape, 2021) Hassan, Maryam; Cloete, Ruben
    Parkinson’s disease (PD) is a neurodegenerative disorder that occurs due to a loss of dopaminergic neurons in the substantia nigra. It is one of the most common neurodegenerative disorders, ranking second only to Alzheimer’s disease. Research on the genetic causes of PD over the past two decades has led to the discovery of several PD-associated genes. Currently, researchers have identified 23 genes that are linked to rare monogenic forms of PD with Mendelian inheritance. In sub-Saharan Africa (SSA), PD has received little attention due to factors such as underfunded healthcare infrastructure, the absence of epidemiological data, and a scarcity of neurologists. In the relatively few published studies, it has been shown that the known PD mutations play a minor role in disease etiology in SSA populations. In the current study, we follow up on previous work done in an MMed study investigating a South African family with several family members (mother and three sons) suffering from PD.
  • Item
    Molecular dynamic simulation studies of the South African HIV-1 Integrase subtype C protein to understand the structural impact of naturally occurring polymorphisms
    (University of the Western Cape, 2021) Isaacs, Matthew Darren; Cloete, Ruben Earl Ashley
    The viral Integrase (IN) protein is an essential enzyme of all known retroviruses, including HIV-1. It is responsible for the insertion of viral DNA into the human genome. It is known that HIV-1 is highly diverse with a high mutation rate, as evidenced by the presence of a large number of subtypes and even strains that have become resistant to antiretroviral drugs. It remains inconclusive what effect this diversity, in the form of naturally occurring polymorphisms/variants, exerts on IN in terms of its function, structure and susceptibility to IN inhibitory antiretroviral drugs. South Africa is home to the largest HIV-1 infected population, with (group M) subtype C being the most prevalent subtype. An investigation into IN is therefore pertinent, even more so with the introduction of the IN strand-transfer inhibitor (INSTI) Dolutegravir (DTG). This study makes use of computational methods to determine any structural and DTG drug binding differences between the South African subtype C IN protein and the subtype B IN protein. The methods employed included homology modelling to predict a three-dimensional model for HIV-1C IN, calculating the change in protein stability after variant introduction, and molecular dynamics simulation analysis to understand protein dynamics. Here we compared subtype C and B IN complexes without DTG and with DTG.
  • Item
    Molecular dynamic simulation studies of the South African HIV-1 Integrase subtype C protein to understand the structural impact of naturally occurring polymorphisms
    (University of the Western Cape, 2021) Isaacs, Darren Mathew; Cloete, Ruben Earl Ashley
    The viral Integrase (IN) protein is an essential enzyme of all known retroviruses, including HIV-1. It is responsible for the insertion of viral DNA into the human genome. It is known that HIV-1 is highly diverse with a high mutation rate, as evidenced by the presence of a large number of subtypes and even strains that have become resistant to antiretroviral drugs. It remains inconclusive what effect this diversity, in the form of naturally occurring polymorphisms/variants, exerts on IN in terms of its function, structure and susceptibility to IN inhibitory antiretroviral drugs. South Africa is home to the largest HIV-1 infected population, with (group M) subtype C being the most prevalent subtype. An investigation into IN is therefore pertinent, even more so with the introduction of the IN strand-transfer inhibitor (INSTI) Dolutegravir (DTG).
  • Item
    Investigating the structural effect of Raltegravir resistance associated mutations on the South African HIV-1 Integrase subtype C protein structure
    (University of the Western Cape, 2020) Chitongo, Rumbidzai; Cloete, Ruben
    Background and Aims: Human Immunodeficiency Virus (HIV) type 1 group M subtype C (HIV-1C) accounts for nearly half of global HIV-1 infections, with South Africa (SA) being one of the countries with the highest infection burden. In recent years, SA has made great strides in tackling its HIV epidemic, resulting in the country being recognized globally as the sub-Saharan country with the largest combination antiretroviral therapy (cART) programme. Regardless of the potency of cART, the efficacy of the treatment is limited and hampered by the emergence of drug resistance. Research on HIV-1 infections, the effect of antiretroviral (ARV) drugs and resistance to ARV drugs has been conducted extensively, but mainly on HIV-1 subtype B (HIV-1B), with less information known about HIV-1C. HIV-1’s viral Integrase (IN) enzyme has become a viable target for highly specific cART, due to its importance in the infection and replication cycle of the virus. The lack of a complete HIV-1C IN protein structure has negatively impacted the progress of structural studies of nucleoprotein reaction intermediates. The mechanism of HIV-1 viral DNA integration has been studied extensively at biochemical and cellular levels, but not at a molecular level. This study aims to use in silico methods that involve molecular modeling and molecular dynamic (MD) simulations to prioritize mutations that could affect HIV-1C IN binding to DNA and the IN strand-transfer inhibitor (INSTI) dolutegravir (DTG). The purpose is to help tailor more effective personalized treatment options for patients living with HIV in SA. This study will in part use patient-derived sequence data to identify mutations and model them into the protein structure to understand their impact on HIV-1C IN protein folding and dynamics.
Methods: Our sample cohort consisted of 11 sample sequences derived from SA HIV-1 treatment-experienced patients who were being treated with the INSTI raltegravir (RAL). The sequences were submitted to the Stanford HIV resistance database (HIVdb) to screen for any new/novel variants resulting from possible RAL failure. Some of these new variants were analysed to determine their effect, if any, on the binding of DTG to the HIV-1C IN protein. Additionally, an HIV-1C IN consensus sequence constructed from SA’s HIV-1 infected population was used to model a complete three-dimensional wild type (WT) HIV-1C IN homology model. All samples were sequenced by our collaborators at the Division of Medical Virology, Stellenbosch University together with the National Health Laboratory Services (NHLS), SA. The structure of the HIV-1CZA WT-IN enzyme was predicted using SWISS-MODEL, and the quality of the resulting model validated. Various analyses were conducted in order to study and assess the effect of the selected new variants on the protein structure and binding of DTG to the IN protein. The mutation Cutoff Scanning Matrix (mCSM) program was used to predict protein stability after mutation, while PyMOL helped to study any changes in polar contact activity before and after mutation. PyMOL was also used to generate four mutant HIV-1C IN complex structures, and these structures together with the WT IN were subjected to production MD simulations for 150 nanoseconds (ns). Trajectory analyses of the MD simulations were also conducted and reported. Results: A total of 21 new variants were detected in our sample cohort, of which only six were chosen for further analyses within the study. A homology model of HIV-1C IN was successfully constructed and validated. The structural quality assessment indicated high reliability of the HIV-1C IN tetrameric structure, with more than 90.0% confidence in modelled regions.
Of the six selected variants, only one (S119P) was calculated to be slightly stabilizing to the protein structure, with the other five found to be destabilizing to the IN protein structure. Variant S119P showed a loss in polar contacts that could destabilize the protein structure, while variant Y143R resulted in the gain of polar contacts, which could reduce the flexibility of the 140’s region, affecting drug binding. Similarly, mutant systems P3 (S119P, Y143R) and P4 (V150A, M154I) showed reduced hydrogen bond formation and the weakest non-bonded pairwise interaction energy. These two systems, P3 and P4, also showed significantly reduced or no polar contacts between DTG, magnesium (MG) ions and the IN protein, compared to the WT IN and P2 mutant IN systems. Interestingly, the WT structure and systems P1 (I113V) and P2 (L63I, V75M, Y143R) showed the highest non-bonded interaction energy, compared to systems P3 and P4. This was further supported by the polar interaction analyses of simulation clusters from the WT IN and mutant IN system P2 (L63I, V75M, Y143R), which were the only protein structures that formed polar contacts with DTG, MG ions and DDE motif residues, while P1 only made contacts with DNA and IN residues. Conclusion: Findings from this study lead to the conclusion that double mutants (S119P, Y143R) and (V150A, M154I) may result in a reduction in the efficacy of DTG, especially when in combination. Furthermore, variants identified in systems P1 and P2 may still allow for effective DTG binding to IN and outcompete viral DNA for host DNA to prevent strand transfer. To the best of our knowledge, this is the first study that uses the consensus WT HIV-1C IN sequence to build an accurate 3D homology model to understand the effect of less frequently detected/reported variants on DTG binding in a South African context. https://etd.
  • Item
    Exploring the influence of organisational, environmental, and technological factors on information security policies and compliance at South African higher education institutions: Implications for biomedical research.
    (University of the Western Cape, 2020) Abiodun, Oluwafemi Peter; Christoffels, Alan
    Headline reports on data breaches worldwide have resulted in heightened concerns about information security vulnerability. In Africa, South Africa is ranked among the top ‘at-risk’ countries with information security vulnerabilities and is the most cybercrime-targeted country. Globally, such cyber vulnerability incidents greatly affect the education sector, due, in part, to the fact that it holds more Personal Identifiable Information (PII) than other sectors. PII refers to (but is not limited to) ID numbers, financial account numbers, and biomedical research data. In response to rising threats, South Africa has implemented a regulation called the Protection of Personal Information Act (POPIA), similar to the European Union General Data Protection Regulation (GDPR), which seeks to mitigate cybercrime and information security vulnerabilities. The extent to which African institutions, especially in South Africa, have embraced and responded to these two information security regulations remains vague, making it a crucial matter for biomedical researchers. This study aimed to assess whether the participating universities have proper and reliable information security practices, measures and management in place and whether they fall in line with both national (POPIA) and international (GDPR) regulations. In order to achieve this aim, the study undertook a qualitative exploratory analysis of information security management across three universities in South Africa. A Technology, Organizational, and Environmental (TOE) model was employed to investigate factors that may influence effective information security measures. A purposeful sampling method was employed to interview participants from each university.
From the technological standpoint, the Bring Your Own Device (BYOD) policy, whereby on average a student owns and connects between three and four internet-enabled devices to the network, has created difficulties for IT teams, particularly in the areas of authentication, explosive growth in bandwidth, and access control to secure university servers. In order to develop robust solutions that mitigate these concerns and are not perceived by users as overly prohibitive, executive management should acknowledge that security and privacy issues are a universal problem and not solely an IT problem, and should equip the IT teams with the necessary tools and mechanisms to allow them to overcome commonplace challenges. At an organisational level, information security awareness training of all users within the university setting was identified as a key factor in protecting the integrity, confidentiality, and availability of information in highly networked environments. Furthermore, the University’s information security mission must not simply be a link on a website; it should be constantly reinforced by informing users during, and after, the awareness training. In terms of environmental factors, specifically the GDPR and POPIA legislation, one of the most practical and cost-effective ways universities can achieve data compliance requirements is to help staff (both teaching and non-teaching), students, and other employees understand the business value of all information. Users who are more aware of the sensitivity of data, the risks to the data, and their responsibilities when handling, storing, processing, and distributing data during their day-to-day activities will behave in a manner that makes compliance easier at the institutional level. Results obtained in this study helped to elucidate the current status, issues, and challenges which universities are facing in the area of information security management and compliance, particularly in the South African context.
Findings from this study point to organizational factors being the most critical when compared to the technological and environmental contexts examined. Furthermore, several proposed information security policies were developed with a view to assisting biomedical practitioners within the institutional setting in protecting sensitive biomedical data.
  • Item
    Exploring the influence of organisational, environmental, and technological factors on information security policies and compliance at South African higher education institutions: Implications for biomedical research.
    (University of the Western Cape, 2020) Abiodun, Oluwafemi Peter; Christoffels, Alan; Anderson, Dominique
    Headline reports on data breaches worldwide have resulted in heightened concerns about information security vulnerability. In Africa, South Africa is ranked among the top ‘at-risk’ countries with information security vulnerabilities and is the most cybercrime-targeted country. Globally, such cyber vulnerability incidents greatly affect the education sector, due, in part, to the fact that it holds more Personal Identifiable Information (PII) than other sectors. PII refers to (but is not limited to) ID numbers, financial account numbers, and biomedical research data.
  • Item
    Establishing a framework for an African Genome Archive
    (University of the Western Cape, 2019) Southgate, Jamie; Christoffels, Alan
    The generation of biomedical research data on the African continent is growing, with numerous studies realizing the importance of African genetic diversity in discoveries of human origins and disease susceptibility. The decrease in costs to purchase and utilize such tools has enabled research groups to produce datasets of significant scientific value. However, this success story has resulted in a new challenge for African researchers and institutions. An increase in data scale and complexity has led to an imbalance of infrastructure and skills to manage, store and analyse this data.
  • Item
    Data Science techniques for predicting plant genes involved in secondary metabolites production
    (University of the Western Cape, 2018) Muteba, Ben Ilunga; Christoffels, Alan
    Plant genome analysis is currently experiencing a boost due to reduced costs associated with the development of next generation sequencing technologies. Knowledge of the genetic background can be applied to guide targeted plant selection and breeding, and to facilitate natural product discovery and biological engineering. In medicinal plants, secondary metabolites are of particular interest because they often represent the main active ingredients associated with health-promoting qualities. Plant polyphenols are a highly diverse family of aromatic secondary metabolites that act as antimicrobial agents, UV protectants, and insect or herbivore repellents. Most of the genome mining tools developed to understand genetic materials have very seldom addressed secondary metabolite genes and biosynthesis pathways. Little significant research has been conducted to study key enzyme factors that can predict a class of secondary metabolite genes from polyketide synthases. The objectives of this study were twofold. Primarily, it aimed to identify the biological properties of secondary metabolite genes and to select a specific gene, naringenin-chalcone synthase or chalcone synthase (CHS). The study hypothesized that data science approaches in mining biological data, particularly secondary metabolite genes, would enable the compulsory disclosure of some aspects of secondary metabolite (SM). Secondarily, the aim was to propose a proof of concept for classifying or predicting plant genes involved in polyphenol biosynthesis from data science techniques and convey these techniques in computational analysis through machine learning algorithms and mathematical and statistical approaches.
Three specific challenges experienced while analysing secondary metabolite datasets were: 1) class imbalance, which refers to a lack of proportionality among protein sequence classes; 2) high dimensionality, which refers to the large feature space that arises when analysing bioinformatics datasets; and 3) differences in protein sequence lengths. Given these inherent issues, developing precise classification models and statistical models proves a challenge. Therefore, the prerequisite for effective SM plant gene mining is dedicated data science techniques that can collect, prepare and analyse SM genes.
  • Item
    Development of a simple artificial intelligence method to accurately subtype breast cancers based on gene expression barcodes
    (University of the Western Cape, 2018) Esterhuysen, Fanechka Naomi; Gamieldien, Junaid
    INTRODUCTION: Breast cancer is a highly heterogeneous disease. The complexity of achieving an accurate diagnosis and an effective treatment regimen lies within this heterogeneity. Subtypes of the disease are not simply molecular, i.e. hormone receptor over-expression or absence, but the tumour itself is heterogeneous in terms of tissue of origin, metastases, and histopathological variability. Accurate tumour classification vastly improves treatment decisions, patient outcomes and 5-year survival rates. Gene expression studies aided by transcriptomic technologies such as microarrays and next-generation sequencing (e.g. RNA-Sequencing) have improved oncology researchers' and clinicians' understanding of the complex molecular portraits of malignant breast tumours. Mechanisms governing cancers, which include tumorigenesis, gene fusions, gene over-expression and suppression, and cellular process and pathway involvement, have been elucidated through comprehensive analyses of the cancer transcriptome. Over the past 20 years, gene expression signatures discovered with both microarray and RNA-Seq have reached clinical and commercial application through the development of tests such as Mammaprint®, OncotypeDX®, and FoundationOne® CDx, all of which focus on chemotherapy sensitivity, prediction of cancer recurrence, and tumour mutational level. The Gene Expression Barcode (GExB) algorithm was developed to allow for easy interpretation and integration of microarray data through data normalization with frozen RMA (fRMA) preprocessing and conversion of relative gene expression to a sequence of 1's and 0's. Unfortunately, the algorithm has not yet been developed for RNA-Seq data. However, implementation of the GExB with feature-selection would contribute to a robust machine-learning-based breast cancer and subtype classifier. METHODOLOGY: For microarray data, we applied the GExB algorithm to generate barcodes for normal breast and breast tumour samples.
A two-class classifier for malignancy was developed through feature-selection on barcoded samples by selecting for genes with 85% stable absence or presence within a tissue type, and differentially stable between tissues. A multi-class feature-selection method was employed to identify genes with variable expression in one subtype, but 80% stable absence or presence in all other subtypes, i.e. 80% in n-1 subtypes. For RNA-Seq data, a barcoding method needed to be developed which could mimic the GExB algorithm for microarray data. A z-score-to-barcode method was implemented, together with differential gene expression analysis and selection of the top 100 genes as informative features for classification purposes. The accuracy and discriminatory capability of both the microarray-based gene signatures and the RNA-Seq-based gene signatures were assessed through unsupervised and supervised machine-learning algorithms, i.e., K-means and Hierarchical clustering, as well as binary and multi-class Support Vector Machine (SVM) implementations. RESULTS: The GExB-FS method for microarray data yielded an 85-probe and a 346-probe informative set for the two-class and multi-class classifiers, respectively. The two-class classifier predicted samples as either normal or malignant with 100% accuracy, and the multi-class classifier predicted molecular subtype with 96.5% accuracy with SVM. Combining RNA-Seq DE analysis for feature-selection with the z-score-to-barcode method resulted in a two-class classifier for malignancy, and a multi-class classifier for normal-from-healthy, normal-adjacent-tumour (from cancer patients), and breast tumour samples with 100% accuracy. Most notably, a normal-adjacent-tumour gene expression signature emerged, which differentiated it from normal breast tissues in healthy individuals. CONCLUSION: A potentially novel method for microarray and RNA-Seq data transformation, feature selection and classifier development was established.
The universal application of the microarray signatures and the validity of the z-score-to-barcode method were demonstrated by 95% accurate classification of RNA-Seq barcoded samples with a microarray-discovered gene expression signature. The results from this comprehensive study into the discovery of robust gene expression signatures hold immense potential for further R&D towards implementation at the clinical endpoint, and translation to simpler and cost-effective laboratory methods such as qtPCR-based tests.
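The barcoding and feature-selection steps described in this abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not give the z-score cut-off or the exact stability rule, so the `threshold=0.0` default, the function names, and the toy gene names are all assumptions made for illustration. Only the 85% stability level is taken from the abstract.

```python
from statistics import mean, pstdev


def zscore_barcode(expr, threshold=0.0):
    """Convert expression values to 0/1 barcodes, one gene at a time.

    expr maps a gene name to a list of expression values (one per sample).
    A sample gets 1 when its z-score for that gene exceeds `threshold`
    (an assumed cut-off, not taken from the thesis), otherwise 0.
    """
    barcodes = {}
    for gene, values in expr.items():
        mu = mean(values)
        sigma = pstdev(values)
        if sigma == 0:
            # A constant gene carries no information; call it absent everywhere.
            barcodes[gene] = [0] * len(values)
            continue
        barcodes[gene] = [1 if (v - mu) / sigma > threshold else 0 for v in values]
    return barcodes


def stable_genes(barcodes, sample_idx, min_fraction=0.85):
    """Return gene -> state (0 or 1) for genes that show the same state
    in at least `min_fraction` of the selected samples."""
    stable = {}
    for gene, bits in barcodes.items():
        chosen = [bits[i] for i in sample_idx]
        frac_on = sum(chosen) / len(chosen)
        if frac_on >= min_fraction:
            stable[gene] = 1
        elif 1 - frac_on >= min_fraction:
            stable[gene] = 0
    return stable


def differentially_stable(barcodes, group_a, group_b, min_fraction=0.85):
    """Genes that are stable in both groups but in opposite states."""
    a = stable_genes(barcodes, group_a, min_fraction)
    b = stable_genes(barcodes, group_b, min_fraction)
    return sorted(g for g in a if g in b and a[g] != b[g])


if __name__ == "__main__":
    # Toy data: samples 0-2 stand in for normal tissue, 3-5 for tumour.
    expr = {
        "GENE_A": [1.0, 1.2, 0.9, 8.0, 8.5, 7.9],  # low in normal, high in tumour
        "GENE_B": [5.0, 5.1, 4.9, 5.0, 5.2, 5.1],  # no consistent difference
    }
    barcodes = zscore_barcode(expr)
    print(differentially_stable(barcodes, [0, 1, 2], [3, 4, 5]))  # ['GENE_A']
```

The selected genes would then serve as features for a clustering or SVM classifier; that step is omitted here to keep the sketch dependency-free.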
  • Item
    Enabling the processing of bioinformatics workflows where data is located through the use of cloud and container technologies
    (University of the Western Cape, 2019) de Beste, Eugene; Christoffels, Alan
    The growing size of raw data, and the failure of internet communication technology to keep pace with that growth, is introducing unique challenges to academic researchers. This is especially true for those residing in rural areas or countries with sub-par telecommunication infrastructure. In this project I investigate the usefulness of cloud computing technology, data analysis workflow languages and portable computation for institutions that generate data. I introduce the concept of a software solution that could be used to simplify the way that researchers execute their analysis on data sets at remote sources, rather than having to move the data. The scope of this project involved conceptualising and designing a software system to simplify the use of a cloud environment, as well as implementing a working prototype of said software for the OpenStack cloud computing platform. I conclude that it is possible to improve the performance of research pipelines by removing the need for researchers to have operating system or cloud computing knowledge, and that utilising technologies such as this can ease the burden of moving data.
  • Item
    An evaluation of galaxy and ruffus-scripting workflows system for DNA-seq analysis
    (University of the Western Cape, 2018) Oluwaseun, Ajayi Olabode; Christoffels, Alan
    Functional genomics determines the biological functions of genes on a global scale by using large volumes of data obtained through techniques including next-generation sequencing (NGS). The application of NGS in biomedical research is gaining momentum, and with its adoption becoming more widespread, there is an increasing need for access to customizable computational workflows that can simplify, and offer access to, compute-intensive analyses of genomic data. In this study, the Galaxy and Ruffus frameworks were designed and implemented with a view to addressing the challenges faced in biomedical research. Galaxy, a graphical web-based framework, allows researchers to build a graphical NGS data analysis pipeline for accessible, reproducible, and collaborative data-sharing. Ruffus, a UNIX command-line framework used by bioinformaticians as a Python library to write scripts in an object-oriented style, allows for building a workflow in terms of task dependencies and execution logic. In this study, a dual data analysis technique was explored which focuses on a comparative evaluation of the Galaxy and Ruffus frameworks that are used in composing analysis pipelines. To this end, we developed an analysis pipeline in both Galaxy and Ruffus for the analysis of Mycobacterium tuberculosis sequence data. Furthermore, this study aimed to compare the Galaxy framework to Ruffus, with preliminary analysis revealing that the analysis pipeline in Galaxy displayed a higher percentage of load and store instructions. In comparison, pipelines in Ruffus tended to be CPU bound and memory intensive. The CPU usage, memory utilization, and runtime execution are graphically represented in this study. Our evaluation suggests that workflow frameworks differ distinctly in features ranging from ease of use, flexibility, and portability to architectural design.
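The kind of per-step measurement this abstract compares (runtime, CPU usage, memory) can be sketched with the Python standard library alone. This is a rough illustration under stated assumptions, not the tooling used in the thesis: `profile_step` and `gc_content` are hypothetical names, and `tracemalloc` only sees memory allocated by the Python interpreter, so external tools launched by a pipeline would need a different measurement approach.

```python
import time
import tracemalloc


def profile_step(func, *args, **kwargs):
    """Run one pipeline step and report wall time, CPU time and peak memory.

    Returns (result, metrics). Peak memory covers Python-level allocations
    only, as traced by tracemalloc.
    """
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        result = func(*args, **kwargs)
    finally:
        cpu_seconds = time.process_time() - cpu_start
        wall_seconds = time.perf_counter() - wall_start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    metrics = {
        "wall_seconds": wall_seconds,
        "cpu_seconds": cpu_seconds,
        "peak_bytes": peak_bytes,
    }
    return result, metrics


def gc_content(sequences):
    """Hypothetical stand-in for a pipeline step: fraction of G/C bases."""
    total = sum(len(s) for s in sequences)
    gc = sum(s.count("G") + s.count("C") for s in sequences)
    return gc / total


if __name__ == "__main__":
    result, metrics = profile_step(gc_content, ["GGCC", "ATAT"])
    print(result, metrics)
```

A step whose CPU time is close to its wall time is CPU bound, while a large gap between the two suggests time spent waiting on I/O.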