Why can't we just use the omic sequence data that’s already publicly available?
Updated: Dec 7, 2021
I have shared my thoughts on publicly available omic (DNA / RNA / protein) sequence data before. In my opinion, there is no good reason to delay discoveries or hypothesis validation when sequenced omic data is already publicly available. When analyzing patients' disease sequence data, starting the analysis with readily available sequence data can strengthen the later validation step with newly sequenced samples. For rare diseases, it is especially important to use as much sequence data as possible to increase the chance of a meaningful discovery.
Unfortunately, there are barriers to putting public omic sequence data to immediate use. In theory, every publicly available sample should come as an accessible pair: the sequence data and its matching annotation data. The annotation data could describe everything about the sample, such as disease, gender, tissue, sample culture information, sensitivity to therapy, and response to therapy. In reality, there is no table that lists every sample's annotation together with its matching omic sequence data. This is the first barrier to using publicly available data. The National Institutes of Health (NIH) is putting an immense amount of effort into collecting this data; however, there is still work to be done to make it easily accessible.
The second barrier is that, before the publicly available sample data can be used, the samples' sequence data must be processed with identical bioinformatics tools and the clinical data must be aligned to a uniform dictionary of values. The ORT public data service uses a proprietary search mechanism over its up-to-date catalog to select these sample sets from the public domain based on annotation or sequence data.
See below for an example of retrieving a sample with matching RNA sequence data. The same example would apply equally to DNA or protein sequence data. There are multiple ways to search for sample data. The easiest is to select the samples of interest based on their annotation data, for example by searching for all samples with a specific disease name.
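To make the annotation-based search concrete, here is a minimal Python sketch of selecting samples by disease name. It is only an illustration and not the ORT search mechanism, which is proprietary: the `Sample` fields, the `find_samples` helper, and the catalog records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One catalog entry: annotation fields plus the type of sequence data."""
    sample_id: str   # hypothetical identifier, not a real accession
    disease: str
    tissue: str
    data_type: str   # "RNA", "DNA", or "Protein"


# Hypothetical catalog records, invented for illustration only.
CATALOG = [
    Sample("S001", "melanoma", "skin", "RNA"),
    Sample("S002", "melanoma", "skin", "DNA"),
    Sample("S003", "glioblastoma", "brain", "RNA"),
    Sample("S004", "Melanoma", "lymph node", "RNA"),
]


def find_samples(catalog, disease, data_type="RNA"):
    """Return samples whose disease matches (case-insensitively)
    and that have sequence data of the requested type."""
    wanted = disease.strip().lower()
    return [
        s for s in catalog
        if s.disease.strip().lower() == wanted and s.data_type == data_type
    ]


matches = find_samples(CATALOG, "melanoma")
print([s.sample_id for s in matches])
```

Passing `data_type="DNA"` or `data_type="Protein"` selects the other omic types in the same way.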
There are many sequencing technologies and bioinformatics tools available (see figures below for RNA-seq technologies and processing-step tools). New bioinformatics tools are developed regularly, and each offers a range of strengths and weaknesses in accuracy, compute, and storage needs. The publicly available sequence data was processed with a mix and match of these tools. The ORT public data service processes each of the searched samples with the same bioinformatics tools. This uniformity allows downstream analysis to make an apples-to-apples comparison of harmonized sample data. The raw sequence data of the samples in the selected disease sample list is processed together using a defined set of tools and a common reference genome as input (a reference genome is required for bulk RNA sequence processing). The ORT public data service also harmonizes all sample annotation data, so that the annotation data shares the same common processing baseline.
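As a rough illustration of what aligning annotation data to a uniform dictionary of values can look like, the Python sketch below maps free-text annotation values onto one agreed vocabulary. It is not ORT's implementation: the `VALUE_DICTIONARY` contents, the `harmonize` function, and the choice to mark unrecognized values as "unknown" are all assumptions made for the example.

```python
# Hypothetical controlled vocabulary: for each annotation field,
# map the spellings seen in public records to one canonical value.
VALUE_DICTIONARY = {
    "gender": {"m": "male", "male": "male", "f": "female", "female": "female"},
    "tissue": {
        "skin": "skin",
        "cutaneous": "skin",
        "brain": "brain",
        "cerebral cortex": "brain",
    },
}


def harmonize(record):
    """Return a copy of one annotation record with controlled fields
    rewritten to their canonical values."""
    harmonized = {}
    for field, value in record.items():
        mapping = VALUE_DICTIONARY.get(field)
        if mapping is None:
            # Field has no controlled vocabulary in this sketch: keep as-is.
            harmonized[field] = value
            continue
        key = str(value).strip().lower()
        # Assumption: values outside the vocabulary are flagged, not guessed.
        harmonized[field] = mapping.get(key, "unknown")
    return harmonized


raw = {"sample_id": "S001", "gender": "M", "tissue": "Cutaneous"}
print(harmonize(raw))
```

Once every record passes through the same mapping, samples annotated by different submitters can be filtered and compared on the same values.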
ORT is sharing its data with the biotech community and is looking forward to driving precision medicine onward with its premium harmonized omic and annotation sample data.