Cancer Control Research 5R01CA158113-02
Johnson, Valen Earl
CONSISTENT MODEL SELECTION IN THE P>>N SETTING
DESCRIPTION (provided by applicant): Among the most fundamental and commonly encountered statistical problems in medical research is the problem of model selection. Model selection is the process by which researchers identify the relationships between measured quantities; thus it plays a central role in the analysis of essentially all high-throughput screening data. Model selection procedures represent the primary analytical mechanism through which the associations between diseases and large numbers of biochemical, genetic and pharmacological variables are discovered. The fundamental hypothesis tested in this application is that a new class of model selection procedures can be used to effectively identify associations between biological variables and disease outcomes, even in settings where there are many more potential biological correlates than there are observations on each variable. The goals of this project are to develop these variable selection procedures so that they can be applied to high-throughput screening data, and to apply the resulting methodology in three important application areas. To achieve these goals, the following specific aims will be addressed. Known theoretical properties of the proposed model selection procedures will be extended to cases in which there are many more biological measurements available than there are observations on each measurement (i.e., the p >> n setting). Constraints on the number of variables that can be included in final models for outcome variables will be determined, and efficient numerical algorithms will be developed so that these methods can be applied to actual high-throughput screening data. The new model selection procedures will be used to define binary classification algorithms that can predict clinical outcomes from high-dimensional gene expression data sets.
The new model selection procedures will be used to identify and analyze interactions between genes that are associated with cancer and other diseases in genome-wide association studies using single-nucleotide polymorphism data. The new model selection procedures will be used to analyze biological pathways as informed by high-throughput molecular interrogation data. The algorithms developed during this project constitute a major innovation in the field of model selection and will provide medical researchers with a new and unique set of tools for effectively identifying biological associations among biomarkers, disease attributes, and patient outcomes from high-throughput screening data.
PUBLIC HEALTH RELEVANCE: Model selection procedures are statistical techniques that allow researchers to discover the associations between disease and the large number of variables that are measured in emerging high-throughput screening technologies. For example, model selection techniques are used to discover which genes are associated with particular forms of cancer. This project proposes a new class of model selection procedures that will make it easier for researchers to discover such associations.