In recent years, pipelines combining massively parallel reporter assays (MPRAs) with next-generation sequencing (NGS) have provided large-scale datasets for investigating biological mechanisms in detail. However, larger datasets often bring greater complexity. As a result, theories derived from low-throughput experiments lose explanatory power, and new methods are required to build predictive models. Here we focus on modeling the functions of nucleic acid sequences as a case study of massive-throughput assays. We report a deep learning approach, Seq2DFunc, which trains a two-dimensional convolutional neural network (CNN) on an ordered graph representation of nucleic acid sequences to predict their functions. To compare the performance of Seq2DFunc with conventional methods, we obtained a customized database on a CRISPR RNA processing assay. For this specific assay, analyses of sequence and RNA structure determinants failed to explain the results regardless of dataset size. One-dimensional (1D) CNNs trained on raw sequences generally failed to converge with fewer than 10,000 sequences. By contrast, Seq2DFunc trained on ∼7,000 sequences still achieved 86% accuracy. Given a sufficient training dataset (∼120,000 sequences), Seq2DFunc (96% accuracy, 0.93 F1-score) still outperformed the best 1D CNN (92% accuracy, 0.83 F1-score). We anticipate that Seq2DFunc can serve as a versatile downstream tool for deciphering massive-throughput assays in many fundamental studies. In addition, the ability to train on smaller datasets is especially beneficial for reducing the experimental budget or the required sequencing depth.