Unlocking Genomic Discoveries: Seamless Access to the Sequence Read Archive

Sharing scientific data gives researchers access to an unprecedented number of studies. With the NIH’s updated Data Management and Sharing Policy, the amount of publicly available multi-omics data is expected to grow rapidly. Easy access to this data without redundant copies ensures that the latest versions are available to all, in keeping with the FAIR principles.

A new approach, now available on all the Seven Bridges Platforms, gives researchers quick access to data housed in the Sequence Read Archive (SRA) without downloading files or paying any storage fees. It was developed by Velsera in close collaboration with the SRA team, the dbGaP team, and the RAS team, with funding from the Common Fund and the NCI, and with the support of ODSS personnel and the NCBI leadership.

The Sequence Read Archive (SRA) is the largest public repository of high-throughput sequencing data, with data from various life forms, metagenomic studies, and environmental surveys. SRA stores raw sequencing data and alignment details, supporting reproducibility and facilitating new discoveries through data analysis. SRA includes sequencing data submitted to the database of Genotypes and Phenotypes (dbGaP), which contains insights into a myriad of human conditions and diseases, allowing researchers to make novel discoveries without needing to run an entirely new study. However, using data from dbGaP has always required downloading files with the SRA Toolkit or copying files to a user-owned cloud bucket. This process can take significant time, generates multiple copies of the data, and either consumes valuable server or computer space or incurs cloud storage charges for a large amount of data.

SRA at your fingertips, on all the Seven Bridges Platforms

The “SRA to DRS” tool provides access to data in the SRA by leveraging the Data Repository Service (DRS) standard developed by the Global Alliance for Genomics and Health (GA4GH). It was developed in close collaboration with the SRA team, the dbGaP team, and the Researcher Authentication Service (RAS) team, with funding from the Common Fund and the National Cancer Institute (NCI), and with the support of the Office of Data Science Strategy (ODSS) personnel and the National Center for Biotechnology Information (NCBI) leadership. This connection to DRS provides symbolic links to the data source, eliminating redundant copies, accelerating discovery, and reducing costs.
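To make the "link instead of copy" idea more concrete, here is a minimal sketch of how a hostname-based DRS URI maps to the HTTPS endpoint defined by the GA4GH DRS v1 specification. The host `drs.example.org` and the object ID `abc123` are placeholders, not real SRA identifiers, and the helper name `drs_uri_to_url` is hypothetical. The platform's own import function handles this resolution for you, so this is only meant to illustrate what a DRS reference points to.

```python
from urllib.parse import urlparse


def drs_uri_to_url(drs_uri: str) -> str:
    """Map a hostname-based DRS URI (drs://<host>/<id>) to the
    GA4GH DRS v1 object endpoint on that host."""
    parsed = urlparse(drs_uri)
    object_id = parsed.path.lstrip("/")
    if parsed.scheme != "drs" or not parsed.netloc or not object_id:
        raise ValueError(f"Not a hostname-based DRS URI: {drs_uri!r}")
    # The DRS v1 specification places objects under /ga4gh/drs/v1/objects/<id>.
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"


# Placeholder URI for illustration only.
example_url = drs_uri_to_url("drs://drs.example.org/abc123")
print(example_url)
```

A GET request to the resulting URL, with the appropriate authorization, would return metadata describing the object and how to access its bytes; no copy of the file is made just by holding the reference.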

Commercial users have access to all open-access data in SRA on SBPLA

To protect individuals who contribute data, much of the human genomics data is controlled access, requiring an approved data access request through dbGaP. Controlled data can be accessed only through CAVATICA, one of our NIH-funded academic platforms, thanks to the integration with the Researcher Authentication Service (RAS). RAS ensures that those accessing files via DRS have proper authorization to do so. With Velsera’s interoperability stack, it is then possible to move data between any of the Seven Bridges Platforms.

Commercial users interested in accessing controlled-access data can work with their scientific partner by emailing support@velsera.com to gain access.

The SRA housed 23 petabytes of raw data in 2019 and was projected to grow roughly three-fold by 2023. The SRA compresses raw data into the SRA file format to store and distribute the growing amount of data more efficiently. Although the raw files for many studies are kept in cold storage, effectively making them unavailable through DRS, most of the SRA Normalized and SRA Lite files are kept in hot storage and are available through this approach. With the “SRA fasterq-dump” tool available on the platform, which extracts the data into the raw format (e.g., FASTQ, BAM), the .sra files can be fed directly into workflows, streamlining the data analysis process.

The data are immediately available in SRA Normalized and SRA Lite formats or in raw format. If the files are in the `.sra` format, the `SRA fastq-dump` application, present on all of Velsera’s platforms, can extract the data into the raw format (e.g., FASTQ, BAM) so that they can be passed directly to any downstream analysis application. This new approach provides two immediate major benefits: 1) it avoids storage costs for the user, who can reallocate the funds tied to storage to analysis, generating additional meaningful scientific insights; 2) it optimizes the resources already allocated via the STRIDES initiative from ODSS, eliminating double spending and making full use of the data on the cloud.

How to Use the New Approach

Follow these simple steps to use the new approach. For any questions, please contact support@velsera.com and our 24/7/365 teams will be happy to help!

Step 1: Go to SRA (https://www.ncbi.nlm.nih.gov/Traces/study/) to find your study of interest. Copy the SRA study number or download the metadata file found on SRA trace (Figure 1).

Figure 1
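If you download the metadata file in Step 1, it can be useful to check which runs it lists before submitting it to the converter. The sketch below assumes the file is a comma-separated table with a `Run` column holding the run accessions; that layout is an assumption, so adjust the column name if your file differs. The accessions in the sample are made up for illustration, and the helper name `read_run_accessions` is hypothetical.

```python
import csv
import io


def read_run_accessions(handle, column="Run"):
    """Return the unique, non-empty values of `column`, in file order.

    `column` defaults to "Run", which is an assumption about the
    metadata file layout rather than something stated in this article.
    """
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"Column {column!r} not found in metadata file")
    accessions = []
    for row in reader:
        value = (row[column] or "").strip()
        if value and value not in accessions:
            accessions.append(value)
    return accessions


# Made-up sample standing in for a downloaded metadata file.
sample = (
    "Run,Assay Type,Bytes\n"
    "SRR0000001,WGS,1024\n"
    "SRR0000002,RNA-Seq,2048\n"
    "SRR0000001,WGS,1024\n"
)
runs = read_run_accessions(io.StringIO(sample))
print(runs)
```

For a real file, replace the `io.StringIO(sample)` argument with `open("your_metadata_file.csv", newline="")`.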

Step 2: Go to the Seven Bridges Platform Public Apps Gallery and select the “SRA to DRS converter” app (Figure 2). As input, the tool takes either the metadata file or the SRA study number found on SRA Trace in Step 1.

Figure 2

Step 3: Run the task on the platform. The execution time is proportional to the number of files requested, so please be patient. For instance, a single file might need a few minutes, while a complete study could take up to 45 minutes. The output includes an execution log, an online DRS Files List, and an Updated SRA Table. The Updated SRA Table provides information about each of the files of interest, specifically indicating whether or not they are available in hot storage. The “online DRS Files List” can then be used directly to import files into a project using the “GA4GH Data Repository Service (DRS)” import function (Figure 3).

Step 4: Go to the project you want to add these files to. Select the “Add files” option and choose the “GA4GH Data Repository Service (DRS)” import function (Figure 3). Copy or upload the “online DRS Files List”.

Figure 3

Step 5: The DRS-linked files will be visible in the project (Figure 4) and can be used the same way as any project file to run any analysis, without incurring storage costs.

Figure 4

Step 6 (data dependent): If the imported data are not in the desired raw format, you can use the “SRA fasterq-dump” application and add it to any workflow as a pre-processing step, extracting the files on the fly and providing them as input to the subsequent steps of the workflow.

Figure 5
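For readers who want a sense of what this extraction step does under the hood, here is a minimal sketch that assembles a command line for `fasterq-dump` from the SRA Toolkit. The `--outdir`, `--threads`, and `--split-files` options are standard `fasterq-dump` options, but the file name, output directory, and thread count are placeholders, and the helper name `build_fasterq_dump_command` is hypothetical. The platform app may be configured differently, so treat this as an illustration rather than a description of its settings.

```python
import shlex


def build_fasterq_dump_command(sra_input, outdir, threads=4):
    """Assemble a fasterq-dump command line as a list of arguments.

    `sra_input` may be a run accession or a path to a local .sra file.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return [
        "fasterq-dump",
        str(sra_input),
        "--outdir", str(outdir),      # directory that receives the FASTQ files
        "--threads", str(threads),    # number of worker threads
        "--split-files",              # write paired reads to separate files
    ]


# Placeholder inputs for illustration only.
cmd = build_fasterq_dump_command("SRR0000001.sra", "fastq_out", threads=8)
print(shlex.join(cmd))
```

On a machine with the SRA Toolkit installed, the list can be passed to `subprocess.run(cmd, check=True)` to perform the extraction; the resulting FASTQ files can then be handed to downstream tools.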