Data Science techniques for predicting plant genes involved in secondary metabolites production

dc.contributor.advisorChristoffels, Alan
dc.contributor.authorMuteba, Ben Ilunga
dc.date.accessioned2019-10-01T10:40:41Z
dc.date.accessioned2024-05-17T07:20:11Z
dc.date.available2019-10-01T10:40:41Z
dc.date.available2024-05-17T07:20:11Z
dc.date.issued2018
dc.descriptionMasters of Scienceen_US
dc.description.abstractPlant genome analysis is currently experiencing a boost due to reduced costs associated with the development of next generation sequencing technologies. Knowledge on genetic background can be applied to guide targeted plant selection and breeding, and to facilitate natural product discovery and biological engineering. In medicinal plants, secondary metabolites are of particular interest because they often represent the main active ingredients associated with health-promoting qualities. Plant polyphenols are a highly diverse family of aromatic secondary metabolites that act as antimicrobial agents, UV protectants, and insect or herbivore repellents. Most of the genome mining tools developed to understand genetic materials have very seldom addressed secondary metabolite genes and biosynthesis pathways. Little significant research has been conducted to study key enzyme factors that can predict a class of secondary metabolite genes from polyketide synthases. The objectives of this study were twofold: Primarily, it aimed to identify the biological properties of secondary metabolite genes and the selection of a specific gene, naringenin-chalcone synthase or chalcone synthase (CHS). The study hypothesized that data science approaches in mining biological data, particularly secondary metabolite genes, would enable the compulsory disclosure of some aspects of secondary metabolite (SM). Secondarily, the aim was to propose a proof of concept for classifying or predicting plant genes involved in polyphenol biosynthesis from data science techniques and convey these techniques in computational analysis through machine learning algorithms and mathematical and statistical approaches. Three specific challenges experienced while analysing secondary metabolite datasets were: 1) class imbalance, which refers to lack of proportionality among protein sequence classes; 2) high dimensionality, which alludes to a phenomenon feature space that arises when analysing bioinformatics datasets; and 3) the difference in protein sequences lengths, which alludes to a phenomenon that protein sequences have different lengths. Considering these inherent issues, developing precise classification models and statistical models proves a challenge. Therefore, the prerequisite for effective SM plant gene mining is dedicated data science techniques that can collect, prepare and analyse SM genes.en_US
dc.identifier.urihttps://hdl.handle.net/10566/15240
dc.language.isoenen_US
dc.publisherUniversity of the Western Capeen_US
dc.rights.holderUniversity of the Western Capeen_US
dc.subjectMedicinal plantsen_US
dc.subjectPolyphenolsen_US
dc.subjectFeature selectionen_US
dc.subjectData visualisationen_US
dc.subjectFeature engineeringen_US
dc.titleData Science techniques for predicting plant genes involved in secondary metabolites productionen_US

Files

Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
muteba_msc_nsc_2018.pdf
Size:
8.04 MB
Format:
Adobe Portable Document Format
License bundle
Now showing 1 - 1 of 1
No Thumbnail Available
Name:
license.txt
Size:
1.71 KB
Format:
Plain Text
Description: