Data Science techniques for predicting plant genes involved in secondary metabolites production
Date
2018
Publisher
University of the Western Cape
Abstract
Plant genome analysis is currently experiencing a boost due to reduced costs associated with the
development of next-generation sequencing technologies. Knowledge of the genetic background can be
applied to guide targeted plant selection and breeding, and to facilitate natural product discovery and
biological engineering. In medicinal plants, secondary metabolites are of particular interest because they
often represent the main active ingredients associated with health-promoting qualities.
Plant polyphenols are a highly diverse family of aromatic secondary metabolites that act as antimicrobial
agents, UV protectants, and insect or herbivore repellents. Most genome mining tools developed
to understand genetic material have seldom addressed secondary metabolite genes and biosynthetic
pathways. Little research has been conducted on the key enzyme factors that can predict a
class of secondary metabolite genes from polyketide synthases.
The objectives of this study were twofold. The primary aim was to identify the biological properties of
secondary metabolite genes and to select a specific gene, naringenin-chalcone synthase or
chalcone synthase (CHS). The study hypothesized that data science approaches to mining biological data,
particularly secondary metabolite genes, would enable the disclosure of some aspects of
secondary metabolites (SM).
The secondary aim was to propose a proof of concept for classifying or predicting plant genes involved
in polyphenol biosynthesis using data science techniques, and to apply these techniques in computational
analysis through machine learning algorithms and mathematical and statistical approaches.
Three specific challenges were encountered while analysing secondary metabolite datasets: 1) class
imbalance, which refers to the lack of proportionality among protein sequence classes; 2) high
dimensionality, which refers to the very large feature space that arises when analysing bioinformatics
datasets; and 3) variable sequence length, which refers to the fact that protein sequences differ
in length.
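As an illustration only, and not the method used in the thesis, the first and third challenges can be sketched in a few lines of Python. The sketch assumes amino acid composition as a fixed-length encoding for sequences of any length, and inverse-frequency weights as one common way to compensate for class imbalance. The function names, the toy sequences and the labels are hypothetical and were chosen for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition_vector(sequence):
    """Map a protein sequence of any length to 20 residue frequencies."""
    counts = Counter(sequence.upper())
    # Only standard residues contribute to the total.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Toy, made-up sequences and labels (assumption: for illustration only).
sequences = ["MVTVEE", "MKKLLAAGG", "ACD", "MSTNPKPQRKTK"]
labels = ["CHS", "other", "other", "other"]

features = [composition_vector(s) for s in sequences]
weights = balanced_class_weights(labels)
```

Every sequence becomes a vector of the same length regardless of how many residues it has, and the rare class receives a larger weight than the common one. The sketch does not address high dimensionality, which arises with richer encodings than the one assumed here.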
Given these inherent issues, developing precise classification and statistical models is
challenging. Effective SM plant gene mining therefore requires dedicated data science
techniques that can collect, prepare and analyse SM genes.
Description
Master of Science
Keywords
Medicinal plants, Polyphenols, Feature selection, Data visualisation, Feature engineering