Statistical modelling of clustered and incomplete data with applications in population health studies in developing countries

Adegboye, Oyelola Abdulwasiu

Files

Adegboye_PHD_2014.pdf (17.96 MB)

Date

2014

Authors

Adegboye, Oyelola Abdulwasiu

Publisher

University of the Western Cape

Abstract

The United Nations (UN) Millennium Development Goals (MDGs) comprise eight goals to be achieved by the year 2015, namely: eradicating extreme poverty and hunger; achieving universal primary education; promoting gender equality and women's empowerment; reducing child mortality; improving maternal health; combating HIV/AIDS, malaria and other diseases; ensuring environmental sustainability; and developing a global partnership for development. Public health studies often produce complex data sets that may be clustered, multivariate, longitudinal, hierarchical, spatial, temporal or spatio-temporal. The result is correlated data, for which the assumption of independence among observations may not be appropriate. Shared genetic traits in studies of illness and shared household characteristics among family members in studies of poverty are examples of sources of correlation. In cross-sectional studies, individuals may be nested within sub-clusters (e.g., families) that are nested within clusters (e.g., environment), thus causing correlation within clusters. Ignoring the structure of the data may result in asymptotically biased parameter estimates. Clustering may also arise from geographical location or time (spatial and temporal). A crucial step in modelling correlated data is the specification of the dependency by choosing the covariance/correlation function. However, the right choice for a particular application is often unclear, and diagnostic tests have to be carried out after fitting a model. Focusing on developing countries, this study investigates the prospects of achieving the MDGs through the development of flexible predictor statistical models. The first objective of this study is to explore the existing methods for modelling correlated data sets (hierarchical, multilevel and spatial) and then apply the methods in a novel way to several data sets addressing the underlying MDGs.
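To make the idea of "choosing the covariance/correlation function" concrete, the sketch below builds three commonly used correlation structures with NumPy: exchangeable (for clusters such as households), AR(1) (for repeated measures over time) and exponential decay with distance (for spatial data). The function names and parameters (`rho`, `range_param`) are illustrative assumptions and are not taken from the thesis, which may use different structures.

```python
import numpy as np


def exchangeable_corr(n, rho):
    """n x n exchangeable correlation: every pair of members shares rho."""
    return (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))


def ar1_corr(n, rho):
    """n x n AR(1) correlation: corr(i, j) = rho ** |i - j|."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def exponential_spatial_corr(coords, range_param):
    """Exponential spatial correlation: exp(-distance / range_param).

    coords is an (n, d) array of site coordinates (hypothetical input).
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return np.exp(-dist / range_param)
```

Each function returns a symmetric matrix with ones on the diagonal; which structure suits a given data set is exactly the modelling choice the abstract describes as often unclear.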
One of the most challenging issues in spatial or spatio-temporal analysis is the choice of a valid yet flexible correlation (covariance) structure. When the data are high-dimensional, that is, when the number of spatial locations or time points that produced the observations is large, the analysis presents great computational challenges. It is debatable whether some of the classical correlation structures adequately reflect the dependency in the data. The second objective is to propose a new flexible technique for handling spatial, temporal and spatio-temporal correlations. The goal is to resolve the dependency problems by proposing a more robust method for modelling spatial correlation. The techniques are used for different correlation structures and then combined to form the resulting estimating equations using the platform of the Generalized Method of Moments. The proposed model is therefore built on a foundation of Generalized Estimating Equations, which has the advantage of producing consistent regression parameter estimates under mild conditions, because the estimation of the regression parameters is separated from the modelling of the correlation. Thirdly, to account for spatio-temporal correlation in data sets, a method that decouples the two sources of correlation is proposed. Specifically, the spatial and temporal effects were modelled separately and then combined optimally. The approach circumvents the need to invert the full covariance matrix and simplifies the modelling of complex relationships such as anisotropy, which is known to be extremely difficult or [...]
Lastly, large public health data sets often contain a high proportion of zero counts, and it is very difficult to distinguish between "true zeros" and "imputed" zeros. Such zeros can arise from the reporting mechanism, as a result of insecurity or technical and logistical issues. The focus is therefore on the implementation of a technique capable of handling this problem. The study assumes that "imputed" zeros are a random event, considers the option of discarding the zeros, and then fits a conditional Poisson model, conditioning on all counts greater than 0.
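A Poisson model conditioned on counts greater than 0 is commonly called a zero-truncated Poisson, with probability P(Y = y | Y > 0) = λ^y e^(-λ) / (y! (1 - e^(-λ))) for y = 1, 2, .... The following is a minimal sketch of fitting such a model by maximum likelihood. It assumes a single common rate λ with no covariates, which is a simplification for illustration and not the thesis's own implementation; the function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln


def ztp_negloglik(log_lam, y):
    """Negative log-likelihood of a zero-truncated Poisson, given log(lambda)."""
    lam = np.exp(log_lam)
    # log P(Y = y | Y > 0) = y*log(lam) - lam - log(y!) - log(1 - exp(-lam))
    ll = y * log_lam - lam - gammaln(y + 1) - np.log1p(-np.exp(-lam))
    return -ll.sum()


def fit_zero_truncated_poisson(counts):
    """Discard all zeros, then estimate lambda from the positive counts."""
    y = np.asarray(counts, dtype=float)
    y = y[y > 0]  # zeros are dropped, whether "true" or "imputed"
    res = minimize_scalar(
        ztp_negloglik, bounds=(-10.0, 10.0), args=(y,), method="bounded"
    )
    return float(np.exp(res.x))
```

Optimising over log(λ) keeps the rate positive without explicit constraints. The estimate satisfies mean(y) = λ / (1 - e^(-λ)) over the positive counts, so it is always smaller than the naive mean of the non-zero observations.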