Previous academic work and contributions to pharmaceutical conferences
I am Orlando Dohring. This page lists my previous academic work and contributions to pharmaceutical conferences: links to the documents come first, followed by their abstracts further down.

Academic work:
- PhD Thesis: Identification of breed contributions in crossbred dogs
 - MPhil Thesis: Peak selection in metabolic profiles using functional data analysis
Contributions to Statisticians in the Pharmaceutical Industry (PSI) conferences:
- Talk PSI 2018: Introduction to Machine Learning for Longitudinal Medical Data
- Poster PSI 2017: Big Data Meets Pharma
- Poster PSI 2016: Sparse Principal Component Analysis for clinical variable selection in longitudinal data
- PhD Thesis Abstract: Identification of breed contributions in crossbred dogs:

There has recently been strong public interest in the interrogation of canine ancestries using direct-to-consumer (DTC) genetic ancestry inference tools. Our goal is to improve the accuracy of the associated computational tools by developing superior algorithms for identifying the breed composition of mixed-breed dogs. Genetic test data, based on SNP markers, was provided by Mars Veterinary.

We approach this ancestry inference problem from two main directions. The first approach is optimized for datasets composed of a small number of ancestry informative markers (AIMs). We first compute haplotype frequencies from purebred ancestral panels, which characterize genetic variation within breeds and are used to predict breed compositions. Because of the large number of possible breed combinations in admixed dogs, we sample this search space approximately with a Metropolis-Hastings algorithm. As proposal density we either uniformly sample new breeds for the lineage, or we bias the Markov chain so that breeds in the lineage are more likely to be replaced by similar breeds. The second direction is dominated by HMM approaches, which view genotypes as realizations of latent variable sequences corresponding to breeds. In this approach an admixed canine sample is viewed as a linear combination of segments from dogs in the ancestral panel.

Results were evaluated using two different performance measures. First, we looked at a generalization of binary ROC curves to multi-class classification problems. Second, to judge breed contribution approximations more accurately, we computed the difference between expected and predicted breed contributions. Experimental results on a synthetic, admixed test dataset using AIMs showed that the MCMC approach successfully predicts breed proportions for a variety of lineage complexities.
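The Metropolis-Hastings search over breed lineages described above can be sketched roughly as follows. This is a hypothetical minimal illustration only: the breed names, the toy likelihood, and the uniform proposal are invented stand-ins for the thesis's haplotype-frequency model.

```python
import math
import random

# Hypothetical stand-ins; the real model scores lineages with haplotype
# frequencies from purebred ancestral panels.
BREEDS = ["Labrador", "Poodle", "Beagle", "Boxer"]

def log_likelihood(lineage, observed_breeds):
    # Toy score: reward every lineage slot that matches a truly present breed.
    return sum(1.0 for b in lineage if b in observed_breeds)

def propose(lineage):
    # Uniform proposal: replace one randomly chosen slot with a random breed.
    candidate = list(lineage)
    candidate[random.randrange(len(candidate))] = random.choice(BREEDS)
    return candidate

def metropolis_hastings(observed_breeds, n_slots=4, n_iter=2000, seed=0):
    random.seed(seed)
    current = [random.choice(BREEDS) for _ in range(n_slots)]
    cur_ll = log_likelihood(current, observed_breeds)
    for _ in range(n_iter):
        candidate = propose(current)
        cand_ll = log_likelihood(candidate, observed_breeds)
        # Accept with probability min(1, exp(cand_ll - cur_ll)).
        if random.random() < math.exp(min(0.0, cand_ll - cur_ll)):
            current, cur_ll = candidate, cand_ll
    return current

result = metropolis_hastings({"Labrador", "Poodle"})
```

The similarity-biased proposal mentioned in the abstract would replace `propose` with a kernel that prefers breeds close to the one being swapped out.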
However, due to exploration in the MCMC algorithm, true breed contributions are underestimated. The HMM approach performed less well, presumably because it uses less of the information in the dataset.

- MPhil Thesis Abstract: Peak selection in metabolic profiles using functional data analysis:

In this thesis we describe sparse principal component analysis (PCA) methods and apply them to the analysis of short multivariate time series in order to perform both dimensionality reduction and variable selection. We take a functional data analysis (FDA) modelling approach in which each time series is treated as a continuous smooth function of time, or curve. These techniques have been applied to analyse time series data arising in the area of metabonomics, the study of chemical processes involving small-molecule metabolites in a cell. We use experimental data from the COnsortium for MEtabonomic Toxicology (COMET) project, formed by six pharmaceutical companies and Imperial College London, UK. In the COMET project, repeated measurements of several metabolites over time were collected from rats subjected to different drug treatments. The aim of our study is to detect important metabolites by analysing the multivariate time series.

Multivariate functional PCA is an exploratory technique to describe the observed time series. In its standard form, PCA involves linear combinations of all variables (i.e. metabolite peaks) and does not perform variable selection. In order to select a subset of important metabolites we introduce sparsity into the model. We develop a novel functional Sparse Grouped Principal Component Analysis (SGPCA) algorithm using ideas related to the Least Absolute Shrinkage and Selection Operator (LASSO), a regularized regression technique, with grouped variables. This SGPCA algorithm detects a sparse linear combination of metabolites which explains a large proportion of the variance.
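The LASSO-style sparsity behind this kind of method can be illustrated by soft-thresholding a PCA loading vector. This is a toy sketch of the general idea, not the SGPCA algorithm itself; the data and threshold are invented.

```python
import numpy as np

# Toy sketch: LASSO-style shrinkage applied to the leading PCA loading,
# so that unimportant variables get a coefficient of exactly zero.
def soft_threshold(v, lam):
    # Shrink coefficients towards zero; small ones become exactly zero.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
X[:, 0] += 3.0 * rng.normal(size=50)   # make variable 0 dominate the variance
X = X - X.mean(axis=0)

# Leading PCA loading via SVD, then sparsified by soft-thresholding.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
sparse_loading = soft_threshold(Vt[0], lam=0.2)
selected = np.flatnonzero(sparse_loading)  # indices of retained variables
```

A grouped variant, as in SGPCA, would threshold whole blocks of coefficients (e.g. all basis coefficients of one metabolite curve) together rather than one coefficient at a time.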
Apart from SGPCA, we also propose two alternative approaches for metabolite selection. The first is based on thresholding the multivariate functional PCA solution, while the second computes the variance of each metabolite curve independently and then ranks the curves in decreasing order of importance. To the best of our knowledge, this is the first application of sparse functional PCA methods to the problem of modelling multivariate metabonomic time series data and selecting a subset of metabolite peaks. We present comprehensive experimental results using simulated data and COMET project data, for different multivariate and functional PCA variants from the literature and for SGPCA. Simulation results show that the SGPCA algorithm recovers a high proportion of the truly important metabolite variables. Furthermore, applying SGPCA to the COMET dataset, we identify a small number of important metabolites independently for two different treatment conditions. A comparison of the selected metabolites in both treatment conditions reveals an overlap of over 75 percent.

- Talk PSI 2018 Abstract: Introduction to Machine Learning for Longitudinal Medical Data:

In the era of big data there has been a surge in collected biomedical data, which has provided ample challenges for distributed computing but also posed novel inference questions. Application areas range from bioinformatics (disease diagnosis from microarray data, drug discovery from molecular compounds), medical imaging (brain reconstruction, organ segmentation, tumour detection from MRI/CT/X-ray images), sensing (anomaly detection, human activity recognition from images and wearable devices) and public health (prediction of epidemic alerts from social media data and meta-information in mobile devices) to healthcare informatics (inference regarding length of hospital stay, readmission probability within the next days, mortality prediction from electronic health records).
Classical machine learning techniques, such as logistic regression, neural networks, support vector machines and Gaussian processes, have performed very well in non-temporal prediction tasks but typically rely on an independence assumption. However, many recent applications have a longitudinal context in the form of short- and long-term dependencies, e.g. local spatial features in brain images, sentiment in medical reports, and summaries of medical research. Hidden Markov Models have proved popular for modelling longitudinal data but become computationally infeasible for large numbers of hidden states. Recently, advances in parallel computing have led to the widespread use of deep learning approaches, such as recurrent neural networks and convolutional networks, which have attracted attention due to their impressive results on sequence data. Finally, we will look in more detail at a case study from healthcare analytics which infers disease type from multiple irregularly sampled longitudinal observations, such as blood pressure, heart rate and blood oxygen saturation.

- Poster PSI 2017 Abstract: Big Data Meets Pharma:

In this work we present a tutorial introduction showing how SAS can be leveraged for large datasets in the pharmaceutical sector. Big data plays an increasingly important role in drug compound discovery, genomic data analysis in clinical trials, and real-time streaming data from wearable devices or sensors which monitor patients’ health and treatment compliance. SAS adopted Hadoop as a highly scalable data platform for data warehouse operations, descriptive statistics and statistical analysis with a bias towards machine learning approaches. However, Hadoop’s MapReduce framework is slow and batch-oriented, which makes it poorly suited to iterative, multi-step parallel algorithms that rely on in-memory computation.
To address these limitations, SAS added layers for in-memory computation, interactive data queries using a SQL variant, support for streaming analytics, and predictive models implemented in SAS Visual Statistics/Analytics. In the data science sector, the similar open-source Apache Spark project, with its machine learning library MLlib, is commonly used. Both Visual Statistics and MLlib have implementations for linear/logistic regression, decision-tree based classifiers, and clustering. Furthermore, SAS focuses on group-by processing and GLMs, while MLlib has methods for feature extraction, dimensionality reduction, SVM classifiers, matrix completion and basic hypothesis tests. At the moment the SAS Hadoop implementation is a good choice for data management and dataset derivations, which can often be parallelized. However, there is currently a lack of procedures typical in pharmaceutical statistics, such as mixed effects models for repeated measurements analysis or survival analysis models.

- Poster PSI 2016 Abstract: Sparse Principal Component Analysis for clinical variable selection in longitudinal data:

Background: Data collection is a time-consuming and expensive process. To minimise costs and reduce time, statistical methods can be applied to determine which variables are required for a clinical trial. Principal component analysis (PCA) is a popular exploratory technique to select a subset of variables at one timepoint. For multiple timepoints, each variable’s measurements are typically aggregated, which ignores temporal relationships. An alternative method is Sparse Grouped Principal Component Analysis (SGPCA), which also incorporates the temporal relationship of each variable. SGPCA is based on ideas related to the Least Absolute Shrinkage and Selection Operator (LASSO), a regularised regression technique, with grouped variables.
SGPCA selects a sparse linear combination of temporal variables, where each patient is represented as a short multivariate time series modelled as a continuous smooth function of time using functional data analysis (FDA).

Aim: Compare the ability of PCA and SGPCA to identify required variables for clinical trials.

Methods: PCA and SGPCA will be applied to a longitudinal clinical dataset to select required variables. We will compare the required variables and the amount of variability retained by each technique under the SGPCA model.

Conclusion: This research will raise awareness of techniques to identify required variables in clinical trials, and aims to demonstrate the potential benefit of incorporating temporal relationships in variable selection.
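As a rough illustration of longitudinal variable selection, one can represent each patient as a (variables x timepoints) array and rank variables by how much they vary between patients across the whole time course, rather than at a single aggregated timepoint. This is a toy sketch under invented data, not PCA or SGPCA themselves.

```python
import numpy as np

# Toy illustration: rank longitudinal variables by between-patient variance
# summed over timepoints. The data and the planted effect are invented.
rng = np.random.default_rng(1)
n_patients, n_vars, n_times = 30, 5, 8
data = rng.normal(size=(n_patients, n_vars, n_times))
# Give variable 2 a strong patient-level effect so it varies most between patients.
data[:, 2, :] += 3.0 * rng.normal(size=(n_patients, 1))

# Score each variable by its between-patient variance, summed over timepoints.
scores = data.var(axis=0).sum(axis=1)   # shape: (n_vars,)
ranking = np.argsort(scores)[::-1]      # most variable first
```

PCA on aggregated measurements would collapse the timepoint axis before scoring, which is exactly the loss of temporal structure the poster argues SGPCA avoids.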