Abstract
Several studies have made it possible to envision a translational application of plasma DNA sequencing in cancer diagnosis and monitoring. However, the extremely low concentration of circulating tumour DNA (ctDNA) fragments among the total cell-free DNA (cfDNA) remains a formidable challenge, and statistical models have yet to improve enough to be of practical use. In this study, we appraised the predictive value of a variety of binary classification models based on cfDNA sequencing, using fragmentation features extracted around transcription start sites (TSSs). We investigated (1) features summarising mapped fragment density around each TSS, (2) long non-coding RNA (lncRNA) genes versus coding genes and (3) selection criteria to generate gene classes to be assigned by the model. Given that, in healthy samples, most of the cfDNA comes from lymphomyeloid lineages, we could identify the model parametrisation with the best accuracy in those lineages using publicly available datasets of healthy patients’ cfDNA. Our results show that (1) the way tissue-specific gene classes are defined matters more than which fragmentation features are included, and (2) in particular, lncRNAs are more tissue-specific than coding genes and stand out in terms of both sensitivity and specificity.
Author summary
Dying cells, even in healthy individuals, release a fraction of the digested fragments of their genetic material into the bloodstream. Interestingly, these circulating cell-free DNA (cfDNA) fragments bear the footprint left by nucleosomes, whose positions depend on the transcriptional state specific to each tissue. This footprint, given away by the sizes and genomic positions of cfDNA fragments, can be revealed by deep sequencing, and statistical models can be trained to recognise the tissue from which those fragments originate.
This information, if made sensitive enough, could form the basis of so-called “liquid biopsies”, allowing clinicians to diagnose a number of diseases, including cancer, at an early stage or to monitor them precisely.
In this work, we comprehensively evaluated the features of circulating DNA fragments in the vicinity of transcription start sites to increase the ability of statistical models to recognise the tissue of origin of cfDNA from healthy individuals. Broadly speaking, nucleosome patterns allow a statistical model to distinguish active from inactive genes, or tissue-specific from housekeeping genes. The purpose of this study was to find the classes of genes with the strongest ability to recognise the true tissue of origin. From this work, we conclude that long non-coding RNA genes allow for a more sensitive and specific detection of the tissue of origin.
Publisher
Cold Spring Harbor Laboratory