Lorenzo Masoero (Amazon)
17 December 2021 @ 12:00 - 13:00
“Improved prediction and optimal sequencing strategies for genomic variant discovery via Bayesian nonparametrics”
Abstract: Despite the advent of Big Data, data gathering in many domains can still be an expensive process that necessitates careful planning when operating under a fixed, limited budget. For instance, sequencing new genomic data is a complex procedure that requires careful tuning: researchers can spend resources to sequence a greater number of genomes (quantity), or spend resources to sequence genomes with increased accuracy (quality). In this talk, I consider the common setting in which scientists have already conducted a pilot study to reveal variants in a genome and are contemplating a follow-up study. Spending additional resources has the potential to reveal new variations in the genome, and thereby new genetic insights. Therefore, practitioners are interested in (i) predicting how many new discoveries they will make under different experimental design choices. In turn, they can leverage these predictions to optimally allocate available resources in the design of a future experiment, for example (ii) to maximize the number of future discoveries or (iii) to optimize the usefulness of a future experiment for the task at hand, such as the power of an associated statistical test.
I discuss novel methodologies to solve the problems mentioned above. Our approach relies on a Bayesian nonparametric formulation that facilitates (i) prediction of the number of new variants in the follow-up study based on the pilot study. We show empirically that, when experimental conditions are kept constant between the pilot and follow-up, our method’s prediction is competitive with the best existing methods. Unlike current methods, though, our new method allows practitioners to change experimental conditions between the pilot and the follow-up. We demonstrate how this distinction allows our method to be used for more realistic predictions and for optimal allocation of a fixed budget between quality and quantity. In particular, we first show how, under a fixed budget, our predictions can be used to maximize (ii) the number of new genomic variants discovered in a follow-up study. Last, we show how our framework can guide practitioners in other experimental design problems, and specifically how to achieve (iii) the highest possible power in statistical tests in the context of rare-variant association studies.