Browsing by Author "Hide, Winston"
Now showing 1 - 19 of 19
Item Ancient Genes in Cancer Gene Expression? (University of the Western Cape, 2004) Panji, Sumir; Hide, Winston
Background: The cancer/testis (CT) antigens are a class of germ cell specific genes not expressed in somatic cells, the exceptions being placental cells and 20% - 40% of cancer types. The ability of CT antigens to elicit humoral immune responses, their restricted expression profile, and the absence of major histocompatibility complex expression in male germline cells have contributed to the growing attraction of CT antigens as ideal prospective cancer vaccination candidates. Motivation: Presently there are M CT gene families containing a total of 97 gene products and isoforms. Owing to improvements in the sensitivity and specificity of rapid serological immunodetection assays, e.g. serological analysis of recombinant cDNA expression libraries (SEREX), the number of novel CT genes and gene families will increase. Hence, characterization of this unique subset of CT genes is fundamental to our understanding of this rapidly emerging novel subset of genes. Objectives: The sequencing of the human genome provides a useful biological framework for the categorization and systematization of rapidly accumulating biological information. A genomic approach was used to ascertain the locations of the CT genes in the human genome and to determine whether the genomic locations of the CT genes are nonrandom. An in-silico expression study was conducted for the CT genes with the aim of establishing whether CT gene expression is restricted to the testis. A portion of the human genome housing the largest proportion of the CT genes was selected for analysis in order to determine whether the surrounding genomic architecture influences CT gene expression.
A comparative genomics approach was used in determining if the CT genes are "ancient genes."
Item Assessment of genome visualization tools relevant to HIV genome research: development of a genome browser prototype (University of the Western Cape, 2004) Boardman, Anelda Philine; Hide, Winston; Faculty of Science
Over the past two decades of HIV research, effective vaccine candidates have been elusive. Traditionally viral research has been characterized by a gene-by-gene approach, but in the light of the availability of complete genome sequences and the tractable size of the HIV genome, a genomic approach may improve insight into the biology and epidemiology of this virus. A genomic approach to finding HIV vaccine candidates can be facilitated by the use of genome sequence visualization. Genome browsers have been used extensively by various groups to shed light on the biology and evolution of several organisms including human, mouse, rat, Drosophila and C. elegans. Application of a genome browser to HIV genomes and related annotations can yield insight into forces that drive evolution, identify highly conserved regions as well as regions that yield a strong immune response in patients, and track mutations that appear over the course of infection. Access to graphical representations of such information is bound to support the search for effective HIV vaccine candidates. This study aimed to answer the question of whether a tool or application exists that can be modified to be used as a platform for development of an HIV visualization application and to assess the viability of such an implementation.
Existing applications can only be assessed for their suitability as a basis for development of an HIV genome browser once a well-defined set of assessment criteria has been compiled.
Item A comparative genomics approach towards classifying immunity-related proteins in the tsetse fly (2009) Mpondo, Feziwe; Hide, Winston; Christoffels, Alan
Tsetse flies (Glossina spp) are vectors of African trypanosome (Trypanosoma spp) parasites, causative agents of Human African trypanosomiasis (sleeping sickness) and Nagana in livestock. Research suggests that tsetse fly immunity factors are key determinants in the success and failure of infection and the maturation process of parasites. An analysis of tsetse fly immunity factors is limited by the paucity of genomic data for Glossina spp. Nevertheless, completely sequenced and assembled genomes of Drosophila melanogaster, Anopheles gambiae and Aedes aegypti provide an opportunity to characterize protein families in species such as Glossina by using a comparative genomics approach. In this study we characterize thioester-containing proteins (TEPs), a sub-family of immunity-related proteins, in Glossina by leveraging the EST data for G. morsitans and the genomic resources of D. melanogaster, A. gambiae as well as A. aegypti. A total of 17 TEPs corresponding to Drosophila (four TEPs), Anopheles (eleven TEPs) and Aedes aegypti (two TEPs) were collected from published data supplemented with GenBank searches. In the absence of genome data for G. morsitans, 124 000 G. morsitans ESTs were clustered and assembled into 18 413 transcripts (contigs and singletons). Five Glossina contigs (Gmcn1115, Gmcn1116, Gmcn2398, Gmcn2281 and Gmcn4297) were identified as putative TEPs by BLAST searches. Phylogenetic analyses were conducted to determine the relationship of collected TEP proteins. Gmcn1115 clustered with DmtepI and DmtepII while Gmcn2398 is placed in a separate branch, suggesting that it is specific to G.
morsitans. The TEPs are highly conserved within D. melanogaster as reflected in the conservation of the thioester domain, while only two TEPs in A. gambiae and one in A. aegypti show conservation of the thioester domain, suggesting that these proteins are subjected to high levels of selection. Despite the absence of a sequenced genome for G. morsitans, at least two putative TEPs were identified from EST data.
Item A comparative genomics approach towards classifying immunity-related proteins in the tsetse fly (University of the Western Cape, 2009) Mpondo, Feziwe; Hide, Winston; Christoffels, Alan
Tsetse flies (Glossina spp) are vectors of African trypanosome (Trypanosoma spp) parasites, causative agents of Human African trypanosomiasis (sleeping sickness) and Nagana in livestock. Research suggests that tsetse fly immunity factors are key determinants in the success and failure of infection and the maturation process of parasites. An analysis of tsetse fly immunity factors is limited by the paucity of genomic data for Glossina spp. Nevertheless, completely sequenced and assembled genomes of Drosophila melanogaster, Anopheles gambiae and Aedes aegypti provide an opportunity to characterize protein families in species such as Glossina by using a comparative genomics approach. In this study, we characterize thioester-containing proteins (TEPs), a sub-family of immunity-related proteins, in Glossina by leveraging the EST data for G. morsitans and the genomic resources of D. melanogaster, A. gambiae as well as A.
aegypti.
Item Detection of positive selection resulting from Nevirapine treatment in longitudinal HIV-1 reverse transcriptase sequences (University of the Western Cape, 2006) Ketwaroo, Bibi Farahnaz K.; Hide, Winston; Seoighe, Cathal; Scheffle, Konrad; South African National Bioinformatics Institute (SANBI); Faculty of Science
Nevirapine (NVP) is a cheap anti-retroviral drug used in poor countries worldwide, administered to pregnant women at the onset of labour to inhibit the HIV enzyme reverse transcriptase. Viruses which may get transmitted to newborns are deficient in this enzyme, and HIV-1 infection cannot be established, thereby preventing mother to child transmission (MTCT). In some cases, babies get infected and positive selection for viruses resistant to nevirapine may be inferred. Positive selection can be inferred from sequence data when the rate of nonsynonymous substitutions is significantly greater than the rate of synonymous substitutions. Unfortunately, it is found that available positive selection methods should not be used to analyse before- and after-NVP treatment sequence pairs associated with MTCT. Methods which use phylogenetic trees to infer positive selection trace synonymous and nonsynonymous substitutions further back in time than the short time duration during which selection for NVP occurred. The other group of methods for inferring positive selection, the pairwise methods, do not have appreciable power, because they average substitution rates over all codons in a sequence pair and not just at single codons. We introduce a simple counting method, which we call the Pairwise Homologous Codons (PHoCs) method, with which we have inferred positive selection resulting from NVP treatment in longitudinal HIV-1 reverse transcriptase sequences.
The PHoCs method estimates rates of substitutions between before- and after-NVP treatment codons, using a simple pairwise method.
Item The development and application of informatics-based systems for the analysis of the human transcriptome (University of the Western Cape, 2003) Kelso, Janet; Hide, Winston; Faculty of Science
Despite the fact that the sequence of the human genome is now complete, it has become clear that the elucidation of the transcriptome is more complicated than previously expected. There is mounting evidence for unexpected and previously underestimated phenomena such as alternative splicing in the transcriptome. As a result, the identification of novel transcripts arising from the genome continues. Furthermore, as the volume of transcript data grows it is becoming increasingly difficult to integrate expression information which is from different sources, is stored in disparate locations, and is described using differing terminologies. Determining the function of translated transcripts also remains a complex task. Information about the expression profile (the location and timing of transcript expression) provides evidence that can be used in understanding the role of the expressed transcript in the organ or tissue under study, or in the developmental pathways or disease phenotype observed. In this dissertation I present novel computational approaches with direct biological applications to two distinct but increasingly important areas of gene expression research. The first addresses detection and characterisation of alternatively spliced transcripts. The second is the construction of a hierarchical controlled vocabulary for gene expression data and the annotation of expression libraries with controlled terms from the hierarchies.
In the final chapter the biological questions that can be approached, and the discoveries that can be made using these systems, are illustrated with a view to demonstrating how the application of informatics can both enable and accelerate biological insight into the human transcriptome.
Item Development and implementation of ontology-based systems for mammalian gene expression profiling (2009) Kruger, Adéle; Hide, Winston
The use of ontologies in the mapping of gene expression events provides an effective and comparable method to determine the expression profile of an entire genome across a large collection of experiments derived from different expression sources. In this dissertation I describe the development of the developmental human and mouse eVOC ontologies and demonstrate the ontologies by identifying genes showing a bias for developmental brain expression in human and mouse, identifying transcription factor complexes, and exploring the mouse orthologs of human cancer/testis genes. Model organisms represent an important resource for understanding the fundamental aspects of mammalian biology. Mapping of biological phenomena between model organisms is complex and if it is to be meaningful, a simplified representation can be a powerful means for comparison. The implementation of the ontologies has been illustrated here in two ways. Firstly, the ontologies have been used to illustrate methods to determine clusters of genes showing tissue-restricted expression in humans. The identification of tissue-restricted genes within an organism serves as an indication of the fine-tuning in the regulation of gene expression in a given tissue. Secondly, due to the differences in human and mouse gene expression on a temporal and spatial level, the ontologies were used to identify mouse orthologs of human cancer/testis genes showing cancer/testis characteristics.
With the use of model systems such as mouse in the development of gene-targeted drugs in the treatment of disease, it is important to establish that the expression characteristics and profiles of a drug target in the model system are representative of the characteristics of the target in the system for which it is intended.
Item Development and implementation of ontology-based systems for mammalian gene expression profiling (University of the Western Cape, 2009) Kruger, Adele; Hide, Winston
The use of ontologies in the mapping of gene expression events provides an effective and comparable method to determine the expression profile of an entire genome across a large collection of experiments derived from different expression sources. In this dissertation I describe the development of the developmental human and mouse eVOC ontologies and demonstrate the ontologies by identifying genes showing a bias for developmental brain expression in human and mouse, identifying transcription factor complexes, and exploring the mouse orthologs of human cancer/testis genes.
Item "Development and implementation of ontology-based systems for mammalian gene expression profiling" (University of the Western Cape, 2009) Kruger, Adele; Hide, Winston
The use of ontologies in the mapping of gene expression events provides an effective and comparable method to determine the expression profile of an entire genome across a large collection of experiments derived from different expression sources. In this dissertation I describe the development of the developmental human and mouse eVOC ontologies and demonstrate the ontologies by identifying genes showing a bias for developmental brain expression in human and mouse, identifying transcription factor complexes, and exploring the mouse orthologs of human cancer/testis genes.
Model organisms represent an important resource for understanding the fundamental aspects of mammalian biology. Mapping of biological phenomena between model organisms is complex and if it is to be meaningful, a simplified representation can be a powerful means for comparison. The implementation of the ontologies has been illustrated here in two ways. Firstly, the ontologies have been used to illustrate methods to determine clusters of genes showing tissue-restricted expression in humans. The identification of tissue-restricted genes within an organism serves as an indication of the fine-tuning in the regulation of gene expression in a given tissue. Secondly, due to the differences in human and mouse gene expression on a temporal and spatial level, the ontologies were used to identify mouse orthologs of human cancer/testis genes showing cancer/testis characteristics. With the use of model systems such as mouse in the development of gene-targeted drugs in the treatment of disease, it is important to establish that the expression characteristics and profiles of a drug target in the model system are representative of the characteristics of the target in the system for which it is intended.
Item The genease activity of mung bean nuclease: fact or fiction? (University of the Western Cape, 2004) Kula, Nothemba; Hide, Winston; Dept. of Biotechnology; Faculty of Science
The action of Mung Bean Nuclease (MBN) on DNA makes it possible to clone intact gene fragments from genes of the malaria parasite, Plasmodium. This "genease" activity has provided a foundation for further investigation of the coding elements of the Plasmodium genome. MBN has been reported to cleave genomic DNA of Plasmodium preferentially at positions before and after genes, but not within gene coding regions. This mechanism has overcome the difficulty encountered in obtaining genes with low expression levels because the cleavage mechanism of the enzyme yields sequences of genes from genomic DNA rather than mRNA.
However, as potentially useful as MBN may be, evidence to support its genease activity comes from analysis of a limited number of genes. It is not clear whether this mechanism is specific to certain genes or species of Plasmodium or whether it is a general cleavage mechanism for Plasmodium DNA. There have also been some projects (Nomura et al., 2001; van Lin, Janse, and Waters, 2000) which have identified MBN-generated fragments which contain fragments of genes with both introns and exons, rather than the intact genes expected from MBN digestion of genomic DNA, which raises concerns about the efficiency of the MBN mechanism in generating complete genes. Using a large-scale, whole genome mapping approach, 7242 MBN-generated genome survey sequences (GSSs) have been mapped to determine their position relative to coding sequences within the complete genome sequence of the human malaria parasite Plasmodium falciparum and the incomplete genome of a rodent malaria parasite, Plasmodium berghei. The location of MBN cleavage sites was determined with respect to coding regions in orthologous genes, non-coding intergenic regions and exon-intron boundaries in these two species of Plasmodium. The survey illustrates that for P. falciparum 79% of GSSs had at least one terminal mapping within an ortholog coding sequence and 85% of GSSs which overlapped coding sequence boundaries mapped within 50 bp of the start or end of the gene. Similarly, despite the partial nature of P. berghei genome sequence information, 73% of P. berghei GSSs had at least one terminal mapping within an ortholog coding sequence and 37% of these mapped between 0-50 bp of the start or end of the gene. This indicates that a larger percentage of cleavage sites in both P. falciparum and P. berghei were found proximal to coding regions.
Furthermore, 86% of P. falciparum GSSs had at least one terminal mapping within a coding exon and 85% of GSSs which overlapped exon-intron boundaries mapped within 50 bp of the exon start and end site. The fact that 11% of GSSs mapped completely to intronic regions suggests that some introns contain specific sites sensitive to cleavage, and this also indicates that MBN cleavage of Plasmodium DNA does not always yield complete exons. Finally, the results presented herein were obtained from analysis of several thousand Plasmodium genes which have different coding sequences, in different locations on individual chromosomes/contigs in two different species of Plasmodium. Therefore it appears that the MBN mechanism is neither species specific nor limited to specific genes.
Item Generation of a human gene index and its application to disease candidacy (University of the Western Cape, 2001) Christoffels, Alan; Hide, Winston; Faculty of Science
With easy access to technology to generate expressed sequence tags (ESTs), several groups have sequenced from thousands to several thousands of ESTs. These ESTs benefit from consolidation and organization to deliver significant biological value. A number of EST projects are underway to extract maximum value from fragmented EST resources by constructing gene indices, where all transcripts are partitioned into index classes such that transcripts are put into the same index class if they represent the same gene. Therefore a gene index should ideally represent a non-redundant set of transcripts. Indeed, most gene indices aim to reconstruct the gene complement of a genome and their technological developments are directed at achieving this goal.
The South African National Bioinformatics Institute (SANBI), on the other hand, embarked on the development of the sequence alignment and consensus knowledgebase (STACK) database that focused on the detection and visualisation of transcript variation in the context of developmental and pathological states, using all publicly available ESTs. Preliminary work on the STACK project employed an approach of partitioning the EST data into arbitrarily chosen tissue categories as a means of reducing the EST sequences to manageable sizes for subsequent processing. The tissue partitioning provided the template material for developing error-checking tools to analyse the information embedded in the error-laden EST sequences. However, tissue partitioning increases redundancy in the sequence data because one gene can be expressed in multiple tissues, with the result that multiple tissue-partitioned transcripts will correspond to the same gene. Therefore, the sequence data represented by each tissue category had to be merged in order to obtain a comprehensive view of expressed transcript variation across all available tissues. The need to consolidate all EST information provided the impetus for developing a STACK human gene index, also referred to as a whole-body index. In this dissertation, I report on the development of a STACK human gene index represented by consensus transcripts where all constituent ESTs sample single or multiple tissues in order to provide the correct developmental and pathological context for investigating sequence variation. Furthermore, the availability of a human gene index is assessed as a disease-candidate gene discovery resource. A feasible approach to construction of a whole-body index required the ability to process error-prone EST data in excess of one million sequences (1,198,607 ESTs as of December 1998).
In the absence of new clustering algorithms at that time, we successfully ported D2_CLUSTER, an EST clustering algorithm, to the high performance shared multiprocessor machine, Origin2000. Improvements to the parallelised version of D2_CLUSTER included: (i) the ability to cluster sequences on as many as 126 processors; for example, 462000 ESTs were clustered in 31 hours on 126 R10000 MHz processors, Origin2000; (ii) enhanced memory management that allowed for clustering of mRNA sequences as long as 83000 base pairs; (iii) the ability to have the input sequence data accessible to all processors, allowing rapid access to the sequences; (iv) a restart module that allowed a job to be restarted if it was interrupted. The successful enhancements to the parallelised version of D2_CLUSTER, as listed above, allowed for the processing of EST datasets in excess of 1 million sequences. A hierarchical approach was adopted where 1,198,607 ESTs from GenBank release 110 (October 1998) were partitioned into "tissue bins" and each tissue bin was processed through a pipeline that included masking for contaminants, clustering, assembly, assembly analysis and consensus generation. A total of 478,707 consensus transcripts were generated for all the tissue categories and these sequences served as the input data for the generation of the whole-body index sequences. The clustering of all tissue-derived consensus transcripts was followed by the collapse of each consensus sequence to its individual ESTs prior to assembly and whole-body index consensus sequence generation. The hierarchical approach demonstrated a consolidation of the input EST data from 1,198,607 ESTs to 69,158 multi-sequence clusters and 162,439 singletons (or individual ESTs). Chromosomal locations were added to 25,793 whole-body index sequences through assignment of genetic markers such as radiation hybrid markers and Généthon markers.
The whole-body index sequences were made available to the research community through a sequence-based search engine (http://ziggy.sanbi.ac.za/~alan/researchINDEX.html).
Item HIV subtype C diversity: analysis of the relationship of sequence diversity to proposed epitope locations (University of the Western Cape, 2002) Ernstoff, Elana Ann; Hide, Winston; Seoighe, Cathal; South African National Bioinformatics Institute (SANBI); Faculty of Science
Southern Africa is facing one of the most serious HIV epidemics. This project contributes to the HIVNET, Network for Prevention Trials cohort for vaccine development. HIV's biology and rapid mutation rate have made vaccine design difficult. We examined HIV-1 subtype C diversity and how it relates to CTL epitope location along viral gag sequences. We found a negative correlation between codon sites under positive selection and epitope regions, suggesting epitope regions are evolutionarily conserved. It is possible that epitopes exist in non-conserved regions, yet fail to be detected due to the reference strain diverging from the circulating viral population. To test if CTL clustering is an artifact of the reference strain, we calculated differences between the gag codons and the reference strain. We found a weak negative correlation, suggesting epitopes in less conserved regions may be evading detection. Locating conserved and optimal epitopes that can be recognized by CTLs is essential for the design of vaccine reagents.
Item HIV Subtype C Diversity: Analysis of the Relationship of Sequence Diversity to Proposed Epitope Locations (University of the Western Cape, 2002) Ernstoff, Elana Ann; Hide, Winston
Southern Africa is facing one of the most serious HIV epidemics. This project contributes to the HIVNET, Network for Prevention Trials cohort for vaccine development. HIV's biology and rapid mutation rate have made vaccine design difficult.
We examined HIV-1 subtype C diversity and how it relates to CTL epitope location along viral gag sequences. We found a negative correlation between codon sites under positive selection and epitope regions, suggesting epitope regions are evolutionarily conserved. It is possible that epitopes exist in non-conserved regions, yet fail to be detected due to the reference strain diverging from the circulating viral population. To test if CTL clustering is an artifact of the reference strain, we calculated differences between the gag codons and the reference strain. We found a weak negative correlation, suggesting epitopes in less conserved regions may be evading detection. Locating conserved and optimal epitopes that can be recognized by CTLs is essential for the design of vaccine reagents.
Item Identification of bacterial pathogenic gene classes subject to diversifying selection (University of the Western Cape, 2009) Panji, Sumir; Hide, Winston; Bajic, Vladimir; Faculty of Science
Availability of genome sequences for numerous bacterial species comprising different bacterial strains allows elucidation of species- and strain-specific adaptations that facilitate their survival in widely fluctuating micro-environments and enhance their pathogenic potential. Different bacterial species use different strategies in their pathogenesis, and the pathogenic potential of a bacterial species is dependent on its genomic complement of virulence factors. A bacterial virulence factor, within the context of this study, is defined as any endogenous protein product encoded by a gene that aids in the adhesion, invasion, colonization, persistence and pathogenesis of a bacterium within a host. Anecdotal evidence suggests that bacterial virulence genes are undergoing diversifying evolution to counteract the rapid adaptability of their host's immune defences.
Genome sequences of pathogenic bacterial species and strains provide unique opportunities to study the action of diversifying selection operating on different classes of bacterial genes.
Item Low detection of exon skipping in mouse genes orthologous to human genes on chromosome 22 (University of the Western Cape, 2002) Chern, Tzu-Ming; Hide, Winston; Faculty of Science
Alternative RNA splicing is one of the leading mechanisms contributing towards transcript and protein diversity. Several alternative splicing surveys have confirmed the frequent occurrence of exon skipping in human genes. However, the occurrence of exon skipping in mouse genes has not yet been extensively examined. Recent improvements in mouse genome sequencing have permitted the current study to explore the occurrence of exon skipping in mouse genes orthologous to human genes on chromosome 22. A low number (5/72 multi-exon genes) of mouse exon-skipped genes were captured through alignments of mouse ESTs to mouse genomic contigs. Exon-skipping events in two mouse exon-skipped genes (GNB1L, SMARCB1) appear to affect biological processes such as electron and protein transport. All mouse skipped exons were observed to have ubiquitous tissue expression. Comparison of our mouse exon-skipping events to human exon-skipping events on chromosome 22 previously detected by Hide et al. (2001) has revealed that mouse and human exon-skipping events were never observed together within an orthologous gene pair. Although the transcript identity between mouse and human orthologous transcripts was high (greater than 80% sequence identity), the exon order in these gene pairs may be different between mouse and human orthologous genes. The main factors contributing towards the low detection of mouse exon-skipping events include the lack of mouse transcripts matching mouse genomic sequences and the under-prediction of mouse exons.
These factors resulted in a large number (112/269) of mouse transcripts lacking matches to mouse genomic contigs, and nearly half (12/25) of the mouse multi-exon genes which have matching Ensembl transcript identifiers have under-predicted exons. The low frequency of mouse exon skipping on chromosome 22 cannot be extrapolated to represent a genome-wide estimate due to the small number of observed mouse exon-skipping events. However, when compared to a higher estimate (52/347) of exon skipping in human genes for chromosome 22 produced under similar conditions by Hide et al. (2001), it is possible that our mouse exon-skipping frequency may be lower than the human frequency. Our hypothesis contradicts a previous study by Brett et al. (2002), in which the authors claim that mouse and human alternative splicing is comparable. Our conclusion that the mouse exon-skipping frequency may be lower than the human estimate remains to be tested with a larger mouse multi-exon gene set. However, the mouse exon-skipping frequency may represent the highest estimate that can be obtained given that the current number (87) of mouse genes orthologous to chromosome 22 in Ensembl (v1 30th Jan. 2002) does not deviate significantly from our total number (72) of mouse multi-exon genes. The quality of the current mouse genomic data is higher than that of the data utilized in this study.
The capture of mouse exon-skipping events may increase as the quality and quantity of mouse genomic and transcript sequences improve.
Item Mice and men: Their promoter properties (PLoS Genetics, 2006) Bajic, Vladimir B.; Tan, Sin lam; Christoffels, Alan; Schonbach, Christian; Lipovich, Leonard; Yang, Liang; Hofmann, Oliver; Kruger, Adele; Hide, Winston; Kai, Chikatoshi; Kawai, Jun; Hume, David A.; Carninci, Piero; Hayashizaki, Yoshihide
Using the two largest collections of Mus musculus and Homo sapiens transcription start sites (TSSs) determined based on CAGE tags, ditags, full-length cDNAs, and other transcript data, we describe the compositional landscape surrounding TSSs with the aim of gaining better insight into the properties of mammalian promoters. We classified TSSs into four types based on compositional properties of regions immediately surrounding them. These properties highlighted distinctive features in the extended core promoters that helped us delineate boundaries of the transcription initiation domain space for both species. The TSS types were analyzed for associations with initiating dinucleotides, CpG islands, TATA boxes, and an extensive collection of statistically significant cis-elements in mouse and human. We found that different TSS types show preferences for different sets of initiating dinucleotides and cis-elements. Through Gene Ontology and eVOC categories and tissue expression libraries we linked TSS characteristics to expression. Moreover, we show a link of TSS characteristics to very specific genomic organization in an example of immune-response-related genes (GO:0006955).
Our results shed light on the global properties of the two transcriptomes not revealed before and therefore provide the framework for better understanding of the transcriptional mechanisms in the two species, as well as a framework for development of new and more efficient promoter- and gene-finding tools.
Item New algorithms for EST clustering (University of the Western Cape, 2000) Ptitsyn, Andrey; Hide, Winston; Davidson, Sean; Dept. of Microbiology; Faculty of Science
Summary: The expressed sequence tag database is a rich and fast-growing source of data for gene expression analysis and drug discovery. Clustering of raw EST data is a necessary step for further analysis and one of the most challenging problems of modern computational biology. There are a few systems designed for this purpose and a few more are currently under development. These systems are reviewed in the "Literature and software review". Different strategies of supervised and unsupervised clustering are discussed, as well as sequence comparison techniques, such as those based on alignment or oligonucleotide composition. Analysis of potential bottlenecks and estimation of the computational complexity of EST clustering is done in Chapter 2. This chapter also states the goals for the research and justifies the need for a new algorithm that has to be fast, but still sensitive to relatively short (40 bp) regions of local similarity. A new sequence comparison algorithm is developed and described in Chapter 3. This algorithm has linear computational complexity and sufficient sensitivity to detect short regions of local similarity between nucleotide sequences. The algorithm utilizes an asymmetric approach, in which one of the compared sequences is presented in the form of an oligonucleotide table, while the second sequence is in standard, linear form. A short window is moved along the linear sequence and all overlapping oligonucleotides of a constant length in the frame are compared against the oligonucleotide table.
The result of comparing two sequences is a single figure, which can be compared to a threshold. For each measure of sequence similarity, the probability of a false positive and of a false negative can be estimated. The algorithm was set up and implemented to recognize matching ESTs with overlapping regions of 40 bp with 95% identity, which is better than the resolution of contemporary EST clustering tools. This algorithm was used as the sequence comparison engine for two EST clustering programs, described in Chapter 4. These programs implement two different strategies: stringent and loose clustering. Both were tested on small but realistic benchmark data sets and show results similar to those of one of the best existing clustering programs, D2_cluster, but with a significant advantage in speed and in sensitivity to small overlapping regions of ESTs. On three different CPUs the new algorithm ran at least two times faster, leaving fewer singletons and producing bigger clusters. With parallel optimization this algorithm is capable of clustering millions of ESTs on relatively inexpensive computers. The loose clustering variant is a highly portable application, relying on third-party software for cluster assembly. It was built to the same specifications as D2_cluster and can be immediately included in the STACKPack package for EST clustering. The stringent clustering program produces already assembled clusters and can detect alternatively processed variants during the clustering process.
Item New Algorithms for EST clustering (University of the Western Cape, 2000) Ptitsyn, Andrey; Hide, Winston
The expressed sequence tag (EST) database is a rich and fast-growing source of data for gene expression analysis and drug discovery. Clustering of raw EST data is a necessary step for further analysis and one of the most challenging problems of modern computational biology. A few systems have been designed for this purpose, and a few more are currently under development.
These systems are reviewed in the "Literature and software review". Different strategies of supervised and unsupervised clustering are discussed, as well as sequence comparison techniques, such as those based on alignment or oligonucleotide composition. Potential bottlenecks are analyzed and the computational complexity of EST clustering is estimated in Chapter 2. This chapter also states the goals of the research and justifies the need for a new algorithm that has to be fast, but still sensitive to relatively short (40 bp) regions of local similarity. A new sequence comparison algorithm is developed and described in Chapter 3. This algorithm has linear computational complexity and sufficient sensitivity to detect short regions of local similarity between nucleotide sequences. The algorithm uses an asymmetric approach, in which one of the compared sequences is represented as an oligonucleotide table, while the second sequence is kept in standard, linear form. A short window is moved along the linear sequence and all overlapping oligonucleotides of a constant length in the frame are compared against the oligonucleotide table. The result of comparing two sequences is a single figure, which can be compared to a threshold. For each measure of sequence similarity, the probability of a false positive and of a false negative can be estimated. The algorithm was set up and implemented to recognize matching ESTs with overlapping regions of 40 bp with 95% identity, which is better than the resolution of contemporary EST clustering tools. This algorithm was used as the sequence comparison engine for two EST clustering programs, described in Chapter 4. These programs implement two different strategies: stringent and loose clustering. Both were tested on small but realistic benchmark data sets and show results similar to those of one of the best existing clustering programs, D2_cluster, but with a significant advantage in speed and in sensitivity to small overlapping regions of ESTs.
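The table-and-window comparison described in this abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the abstract does not give the oligonucleotide length, the window size, the scoring formula, or the threshold, so the function names, `k=8`, `window=40`, and `threshold=25` are all hypothetical choices made only for the example.

```python
def build_oligo_table(seq, k):
    """Return the set of all overlapping k-mers (oligonucleotides) in seq.

    This plays the role of the "oligonucleotide table" for one sequence.
    """
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_window_score(table, seq, k, window):
    """Slide a window along seq (kept in linear form) and return the
    largest number of k-mers inside any single window that also occur
    in the table. The scoring rule itself is an assumption."""
    # hits[i] is 1 when the k-mer starting at position i is in the table.
    hits = [1 if seq[i:i + k] in table else 0
            for i in range(len(seq) - k + 1)]
    per_window = window - k + 1  # number of k-mers that fit in one window
    if per_window <= 0 or not hits:
        return 0
    per_window = min(per_window, len(hits))
    current = sum(hits[:per_window])
    best = current
    # A rolling sum keeps the scan linear in the length of seq.
    for i in range(per_window, len(hits)):
        current += hits[i] - hits[i - per_window]
        best = max(best, current)
    return best


def sequences_match(a, b, k=8, window=40, threshold=25):
    """Reduce the comparison to a single figure and test it against a
    threshold (all default parameter values are hypothetical)."""
    table = build_oligo_table(a, k)
    return best_window_score(table, b, k, window) >= threshold
```

The asymmetry is visible in the sketch: only one sequence is turned into a table, and each lookup on the other sequence is a constant-time set membership test, so the scan is linear in the sequence length for a fixed `k`.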
On three different CPUs the new algorithm ran at least two times faster, leaving fewer singletons and producing bigger clusters. With parallel optimization this algorithm is capable of clustering millions of ESTs on relatively inexpensive computers. The loose clustering variant is a highly portable application, relying on third-party software for cluster assembly. It was built to the same specifications as D2_cluster and can be immediately included in the STACKPack package for EST clustering. The stringent clustering program produces already assembled clusters and can detect alternatively processed variants during the clustering process.
Item Novel genomic approaches for the identification of virulence genes and drug targets in pathogenic bacteria (University of the Western Cape, 2001) Gamieldien, Junaid; Hide, Winston; Faculty of Science
While the many completely sequenced genomes of bacterial pathogens contain all the determinants of the host-pathogen interaction, and also every possible drug target and recombinant vaccine candidate, computational tools for selecting suitable candidates for further experimental analyses remain limited. The overall objective of my PhD project was to design reusable systems that employ the two most important features of bacterial evolution, horizontal gene transfer and adaptive mutation, for the identification of potentially novel virulence-associated factors and possible drug targets. In this dissertation, I report the development of two novel technologies that uncover novel virulence-associated factors and mechanisms employed by bacterial pathogens to effectively inhabit the host niche. More importantly, I illustrate that these technologies may present a reliable starting point for the development of screens for novel drug targets and vaccine candidates, significantly reducing the time needed to develop novel therapeutic strategies.
Our initial analyses of proteins predicted from the preliminary genomic sequences released by the Sanger Centre indicated that a significant number appeared to be more similar to eukaryotic proteins than to their bacterial orthologs. In order to determine whether acquisition of genetic material from eukaryotes has played a role in the evolution of pathogenic bacteria, we developed a system that detects genes in a bacterial genome that have been acquired by interkingdom horizontal gene transfer. Initially, 19 eukaryotic genes were identified in the genome of Mycobacterium tuberculosis, of which two were later found in the genome of Pseudomonas aeruginosa, along with two novel eukaryotic genes. Surprisingly, six of the M. tuberculosis genes and all four eukaryotic genes in P. aeruginosa may be involved in modulating the host immune response through altering the steroid balance and the production of pro-inflammatory lipids. We also compared the genome of the H37Rv M. tuberculosis strain to that of the CDC-1551 strain, which was sequenced by TIGR, and found that the organisms were virtually identical with respect to their gene content. We therefore hypothesized that the differences in virulence may be due to evolved differences in shared genes, rather than the presence or absence of unique genes. Using this observation as rationale, we developed a system that compares the orthologous gene complements of two strains of a bacterial species and mines for genes that have undergone adaptive evolution as a means to identify possibly novel virulence-associated genes. By applying this system to the genome sequences of two strains of Helicobacter pylori and Neisseria meningitidis, we identified 41 and 44 genes that are under positive selection in these organisms, respectively. As approximately 50% of the genes encode known or potential virulence factors, the remaining genes may also be implicated in virulence or pathoadaptation. Furthermore, 21 H.
pylori genes, none of which are classic virulence factors or associated with a pathogenicity island, were tested for a role in colonization by gene knockout experiments. Of these, 61% were found to be either essential or involved in effective stomach colonization in a mouse infection model. Substantial circumstantial and empirical evidence is thus presented that finding genes under positive selection is a reliable method of identifying novel virulence-associated genes and promising leads for drug targets.