Philosophiae Doctor - PhD (Statistics and Population Studies)
Browsing by Author "Blignaut, Renette"
Imputation techniques for non-ordered categorical missing data (University of the Western Cape, 2016) Karangwa, Innocent; Kotze, Danelle; Blignaut, Renette
Missing data are common in survey data sets. Enrolled subjects often do not have data recorded for all variables of interest. The inappropriate handling of missing data may lead to bias in the estimates and incorrect inferences. Therefore, special attention is needed when analysing incomplete data. The multivariate normal imputation (MVNI) and the multiple imputation by chained equations (MICE) have emerged as the best techniques to impute, or fill in, missing data. The former assumes a normal distribution of the variables in the imputation model, but can also handle missing data whose distributions are not normal. The latter fills in missing values taking into account the distributional form of the variables to be imputed. The aim of this study was to determine the performance of these methods when data are missing at random (MAR) or completely at random (MCAR) on unordered or nominal categorical variables treated as predictors or response variables in the regression models. Both dichotomous and polytomous variables were considered in the analysis. The baseline data used were from the 2007 Demographic and Health Survey (DHS) from the Democratic Republic of Congo. The analysis model of interest was the logistic regression model of the woman’s contraceptive method use status on her marital status, with or without controlling for other covariates (continuous, nominal and ordinal). Based on the data set with missing values, data sets with missing at random and missing completely at random observations on either the covariates or response variables measured on a nominal scale were first simulated, and then used for imputation purposes.
Under the MVNI method, unordered categorical variables were first dichotomised, and then K − 1 (where K is the number of levels of the categorical variable of interest) dichotomised variables were included in the imputation model, leaving the remaining category as a reference. These variables were imputed as continuous variables using a linear regression model. Imputation with MICE considered the distributional form of each variable to be imputed. That is, imputations were drawn using binary and multinomial logistic regressions for dichotomous and polytomous variables respectively. The performance of these methods was evaluated in terms of bias and standard errors in the regression coefficients that were estimated to determine the association between the woman’s contraceptive method use status and her marital status, with or without controlling for other types of variables. The analysis was first done assuming that the sample was not weighted; then the sample weight was taken into account to assess whether the sample design would affect the performance of the multiple imputation methods of interest, namely MVNI and MICE. As expected, the results showed that for all the models, MVNI and MICE produced less biased estimates and smaller standard errors than the case deletion (CD) method, which discards items with missing values from the analysis. Moreover, it was found that when data were missing (MCAR or MAR) on the nominal variables that were treated as predictors in the regression model, MVNI reduced bias in the regression coefficients and standard errors compared to MICE, for both unweighted and weighted data sets. On the other hand, the results indicated that MICE outperformed MVNI when data were missing on the response variables, whether binary or polytomous. Furthermore, it was noted that the sample design (sample weights), the rates of missingness and the missing data mechanisms (MCAR or MAR) did not affect the behaviour of the multiple imputation methods that were considered in this study.
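The K − 1 dichotomisation step described above can be illustrated with a minimal Python sketch. This is not the thesis's code: the function name `dummy_code`, the use of `None` for a missing entry, and the marital status values are all hypothetical, chosen only to show how one level is left out as the reference and how a missing category stays missing in every indicator column.

```python
def dummy_code(values, reference):
    """Expand a nominal variable into K - 1 indicator columns.

    `values` holds category labels, with None marking a missing entry.
    `reference` is the level left out of the coding (an illustrative choice).
    """
    levels = sorted({v for v in values if v is not None})
    if reference not in levels:
        raise ValueError("reference level not present in the data")
    columns = {}
    for level in levels:
        if level == reference:
            continue  # the reference category gets no column of its own
        # A missing category stays missing in every indicator column.
        columns[level] = [None if v is None else int(v == level) for v in values]
    return columns


# Hypothetical example data: three levels, so K - 1 = 2 indicator columns.
marital_status = ["married", "single", None, "widowed", "single"]
indicators = dummy_code(marital_status, reference="single")
for level, column in indicators.items():
    print(level, column)
```

In an MVNI-style imputation, the `None` entries in these indicator columns would be the values filled in as if continuous; how imputed values are mapped back to categories is not stated in the abstract and is not shown here.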
Thus, based on these results, it can be concluded that when missing values are present on the outcome variables measured on a nominal scale in regression models, the distributional form of the variable with missing values should be taken into account. When these variables are used as predictors (with missing observations), the parametric imputation approach (MVNI) would be a better option than MICE.
Missing imputation methods explored in big data analytics (University of the Western Cape, 2018) Brydon, Humphrey Charles; Blignaut, Renette
The aim of this study is to look at the methods and processes involved in imputing missing data and, more specifically, complete missing blocks of data. A further aim of this study is to look at the effect that the imputed data has on the accuracy of various predictive models constructed on the imputed data and hence determine if the imputation method involved is suitable. The identification of the missingness mechanism present in the data should be the first process to follow in order to identify a possible imputation method. The identification of a suitable imputation method is easier if the mechanism can be identified as one of the following: missing completely at random (MCAR), missing at random (MAR) or not missing at random (NMAR). Predictive models constructed on the complete imputed data sets are shown to be less accurate when those data sets employed a hot-deck imputation method. The data sets which employed either a single or multiple Markov chain Monte Carlo (MCMC) or the Fully Conditional Specification (FCS) imputation methods are shown to result in predictive models that are more accurate. The addition of an iterative bagging technique in the modelling procedure is shown to produce highly accurate prediction estimates. The bagging technique is applied to variants of the neural network, a decision tree and a multiple linear regression (MLR) modelling procedure.
A stochastic gradient boosted decision tree (SGBT) is also constructed as a comparison to the bagged decision tree. Final models are constructed from 200 iterations of the various modelling procedures using a 60% sampling ratio in the bagging procedure. It is further shown that the addition of the bagging technique in the MLR modelling procedure can produce an MLR model that is more accurate than that of the other more advanced modelling procedures under certain conditions. The evaluation of the predictive models constructed on imputed data is shown to vary based on the type of fit statistic used. It is shown that the average squared error reports little difference in the accuracy levels when compared to the results of the Mean Absolute Prediction Error (MAPE). The MAPE fit statistic is able to magnify the difference in the prediction errors reported. The Normalized Mean Bias Error (NMBE) results show that all predictive models constructed produced estimates that were an over-prediction, although these did vary depending on the data set and modelling procedure used. The Nash-Sutcliffe efficiency (NSE) was used to compare the accuracy of the predictive models in the context of imputed data. The NSE statistic showed that the estimates of the models constructed on the imputed data sets employing a multiple imputation method were highly accurate. The NSE statistic results reported that the estimates from the predictive models constructed on the hot-deck imputed data were inaccurate and that a mean substitution of the fully observed data would have been a better method of imputation. The conclusion reached in this study shows that the choice of imputation method as well as that of the predictive model is dependent on the data used.
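To make the NSE comparison above concrete, the following minimal Python sketch computes the average squared error and the Nash-Sutcliffe efficiency in its standard form, one minus the ratio of the residual sum of squares to the sum of squares around the observed mean. The function names and the small numeric series are hypothetical and are not taken from the thesis. Under this definition, an NSE of 0 corresponds to predicting the observed mean for every case, so a negative NSE signals predictions that do worse than mean substitution.

```python
def average_squared_error(observed, predicted):
    """Mean of the squared differences between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def nash_sutcliffe(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 - residual SS / SS around the observed mean."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / spread


# Hypothetical values, for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
good = [2.5, 3.5, 6.5, 7.5]   # close to the observed values
poor = [8.0, 2.0, 9.0, 1.0]   # worse than predicting the mean
mean_only = [sum(observed) / len(observed)] * len(observed)

nse_good = nash_sutcliffe(observed, good)
nse_poor = nash_sutcliffe(observed, poor)
nse_mean = nash_sutcliffe(observed, mean_only)
print(nse_good, nse_poor, nse_mean)
```

The NMBE and MAPE statistics are left out of this sketch because their exact definitions and sign conventions are not given in the abstract.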
Four unique combinations of imputation methods and modelling procedures were identified for the data considered in this study.
Survival modelling and analysis of HIV/AIDS patients on HIV care and antiretroviral treatment to determine longevity prognostic factors (University of the Western Cape, 2016) Maposa, Innocent; Blignaut, Renette
The HIV/AIDS pandemic has been a torment to the African developmental agenda, especially in the Southern African Development Community (SADC) countries, for the past two decades. The disease and condition tend to affect the productive age groups. Children have also not been spared from the severe effects associated with the disease. The advent of antiretroviral treatment (ART) has brought great relief to governments and patients in these regions. More people living with HIV/AIDS have experienced a boost in their survival prospects and hence their contribution to national developmental projects. Survival analysis methods are usually used in biostatistics, epidemiological modelling and clinical research to model time to event data. The most interesting aspect of this analysis comes when survival models are used to determine risk factors for the survival of patients undergoing some treatment or living with a certain disease condition. The purpose of this thesis was to determine prognostic risk factors for patients' survival whilst on ART. The study sought to highlight the risk factors that impact the survival time negatively at different survival time points. The study utilized a sample of paediatric and adult datasets from Namibia and Zimbabwe respectively. The paediatric dataset from Katutura hospital (Namibia) comprised the adolescents and children on ART, whilst the adult dataset from Bulawayo hospital (Zimbabwe) comprised those patients on ART in the 15 years and above age categories. All datasets used in this thesis were based on retrospective cohorts followed for some period of time.
Different methods to reduce errors in parameter estimation were applied to the datasets. The proportional hazards, Bayesian proportional hazards and the censored quantile regression models were utilized in this study. The results from the proportional hazards model show that most of the variables considered were not significant overall. The Bayesian proportional hazards model shows us that all the considered factors had different risk profiles at the different quartiles of the survival times. This highlights that by using the proportional hazards models, we only get a fixed constant effect of the risk factors, yet in reality, the effect of risk factors differs at different survival time points. This picture was strongly highlighted by the censored quantile regression model, which indicated that some variables were significant in the early periods of initiation whilst they did not significantly affect survival time at any other points in the survival time distribution. The censored quantile regression models clearly demonstrate that there are significant insights gained on the dynamics of how different prognostic risk factors affect patient survival time across the survival time distribution compared to when we use proportional hazards and Bayesian proportional hazards models. However, the advantages of using the proportional hazards framework, due to the estimation of hazard rates as well as its application in the competing risks framework, are still unassailable. The hazard rate estimation under the censored quantile regression framework is an area that is still under development and the computational aspects are yet to be incorporated into mainstream statistical software. This study concludes that, with the current literature and computational support, using both model frameworks to ascertain the dynamic effects of different prognostic risk factors for survival in people living with HIV/AIDS and on ART would give researchers more insights.
These insights will then help public health policy makers to draft relevant targeted policies aimed at improving these patients' survival time on treatment.