Genetic models to predict the development of colorectal cancer.

Type of content
Theses / Dissertations
Thesis discipline
Biological Sciences
Degree name
Master of Science
Publisher
University of Canterbury
Language
English
Date
2021
Authors
Ainsworth, Rachel
Abstract

Background: Survival rates for colorectal cancer are highest when the cancer is diagnosed at an early stage, but very few cancers are diagnosed before they progress to later stages. A model that could predict who will develop colorectal cancer from genetic information would allow targeted screening of high-risk individuals. Genome-wide association studies (GWAS) have identified ~100 genetic variants (SNPs) that are individually associated with the development of colorectal cancer, but models built using these SNPs do not identify all high-risk individuals (AUC of 0.629).

Methods: To improve the performance of polygenic risk score models, three methods were tested: first, the use of rare allele principal components; second, the identification of clusters of colorectal cancer patients with the same underlying genetic causes of cancer; third, the incorporation of interactions within gradient boosted tree models.

Results: Both rare and common allele principal components were found to identify population groups, but their use did not improve the performance of models predicting the development of colorectal cancer. Clusters representing similar underlying genetic causes of colorectal cancer could not be identified, although models that predict the location of colorectal cancer performed significantly better than models built with linear discriminant analysis (p-value = 0.022). Gradient boosted tree models performed significantly better than linear models at predicting the development of colorectal cancer on the same dataset (p-value = 0.0258). However, there was only weak evidence of interactions in the gradient boosted tree models. When variables were selected with random forests or gradient boosted trees, some of the SNPs selected had missing genotypes that were highly favourable or unfavourable for colorectal cancer (odds ratios of 0.446 and 1.77).

Conclusion: The performance of models that identify individuals at high risk of developing colorectal cancer may be improved through the use of gradient boosted tree models. The treatment of missing genotypes warrants further study because of the strong odds ratios attached to some missing genotypes.

Rights
All Rights Reserved