Extensive whole-genome sequencing research has made it possible to examine rare variants (RVs) in both noncoding and coding regions and their links to complex human diseases and traits.
The STAARpipeline tools offer researchers a computationally efficient and robust approach to RV association detection. They can automatically annotate whole-genome/whole-exome sequencing (WGS/WES) studies and perform flexible noncoding and coding RV association analysis [1]. The tools include single-variant, gene-centric coding, gene-centric noncoding, noncoding RNA (ncRNA), and sliding-window tests for continuous or dichotomous outcomes [2], and they incorporate multiple functional annotations to empower RV (set) association analysis using the variant-set Test for Association using Annotation infoRmation (STAAR) method [3]. The main highlights of STAARpipeline compared to other GWAS tools are:
To enable researchers to use the STAARpipeline together with publicly available data on Velsera platforms and participating ecosystems, Velsera collaborated with the tool authors, Xihao Li (Assistant Professor at UNC) and Zilin Li (Professor at Northeast Normal University, China), to create STAARpipeline Common Workflow Language (CWL) tools. These tools are available to researchers on the NHLBI BioData Catalyst Powered by Seven Bridges platform for null model fitting, single-variant testing, and aggregate association testing.
Running the STAARpipeline requires thorough variant annotation. The Functional Annotation of Variants Online Resource (FAVOR) is a comprehensive whole-genome variant annotation database and variant browser that provides hundreds of functional annotation scores, spanning a variety of biological functional dimensions, for all possible 9 billion single nucleotide variants (SNVs) and 80 million observed short insertions/deletions (indels) [4]. It integrates data from multiple sources, including CADD v1.5, GENCODE v31, ANNOVAR, WGSA, ClinVar, ENCODE, SnpEff, 1000 Genomes, TOPMed Bravo Freeze 8, gnomAD v3, and other individual studies [4]. The FAVOR database is used to functionally annotate the genotype data in the Genomic Data Structure (GDS) file [9, 10] of any WGS/WES study, and the result is stored in an annotated GDS (aGDS) file (Figure 1). FAVOR also includes annotation principal components (aPCs), which summarize multiple aspects of variant function by calculating the first variant-specific PC from the individual functional annotation scores in a functional category [3, 4]. For example, aPC-Protein-Function is the first PC of the seven individual standardized protein function scores. The FAVORannotator app, available on BDC-SB, uses the FAVOR Essential Database [11] to annotate variants.
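To make the aPC idea concrete, the following is a minimal sketch of taking the first principal component of a small set of standardized annotation scores. It is not the FAVOR implementation: the function names, the toy data, and the use of power iteration are assumptions made for illustration, and any further scaling that FAVOR applies to its published aPCs is not reproduced here.

```python
import math
import random


def standardize(column):
    """Scale one annotation column to mean 0 and variance 1."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / n) or 1.0
    return [(x - mean) / sd for x in column]


def first_pc_scores(columns, n_iter=200):
    """Per-variant scores on the first PC of the standardized columns.

    `columns` holds one list of scores per annotation (hypothetical layout).
    Power iteration on the annotation correlation matrix finds the loadings.
    """
    z = [standardize(col) for col in columns]
    p, n = len(z), len(z[0])
    # Correlation matrix between annotations (p x p).
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / n for j in range(p)]
            for i in range(p)]
    # Start from equal weights and repeatedly multiply by the matrix.
    w = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(corr[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    # Project every variant onto the leading direction.
    return [sum(w[i] * z[i][v] for i in range(p)) for v in range(n)]


# Toy data (assumption): 7 noisy scores driven by one shared signal.
random.seed(0)
shared = [random.gauss(0, 1) for _ in range(500)]
toy_scores = [[s + 0.5 * random.gauss(0, 1) for s in shared] for _ in range(7)]
apc = first_pc_scores(toy_scores)
```

In this toy setting the single summary score tracks the shared signal more closely than any one noisy annotation does, which is the motivation for collapsing a functional category into one aPC.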
With the emerging need to identify the genetic components of complex traits in the noncoding genome, the STAARpipeline was designed for both noncoding and coding rare variant association detection across the genome [1]. It has been used in various rare variant association studies, which ran multiple aggregate tests across the genome to identify gene-specific functional categories and noncoding genomic regions influencing plasma lipid concentrations (low-density lipoprotein cholesterol, high-density lipoprotein cholesterol, triglycerides, and total cholesterol) [1, 12, 13], fasting glucose and fasting insulin [1], kidney function [1], telomere length [1], height [14], inflammatory biomarkers [15], and circulating metabolites [16], using TOPMed WGS data. Researchers can now run a functionally informed genome-wide analysis using the STAARpipeline on the Seven Bridges Platforms using the following apps:
In contrast to GWAS, which start from a phenotype and analyse variants across the genome to find associations with that phenotype, phenome-wide association studies (PheWAS) take the reverse approach (Figure 2): they start from a specific variant and analyse many phenotypes to investigate whether any are associated with it [18]. This enables researchers to find multiple phenotypes that are each associated with the same variant, using the STAARpipeline PheWAS app, also available on NHLBI BioData Catalyst Powered by Seven Bridges. The STAARpipeline PheWAS tool is designed to run single-variant and aggregate testing on biobank-scale WGS/WES data in a resource-efficient fashion [17]. Like the STAARpipeline app used for GWAS, it can perform single-variant, gene-centric coding, gene-centric noncoding, ncRNA, and sliding-window tests for continuous or dichotomous outcomes, and it incorporates multiple functional annotations to empower rare variant (set) association analysis for each phenotype using the STAAR method [17].
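The reversed loop structure of a PheWAS can be sketched in a few lines: the variant is fixed and the iteration runs over phenotypes. The sketch below is only an illustration under stated assumptions. It swaps in a simple marginal score test with a normal approximation rather than the STAAR method, it ignores covariates and relatedness (which the real pipeline handles through its null model), and the function names, phenotype names, and simulated data are all hypothetical.

```python
import math
import random


def score_test_pvalue(genotype, phenotype):
    """Two-sided p-value for a marginal variant-phenotype association.

    Uses z = r * sqrt(n), where r is the sample correlation, with a
    normal approximation. No covariate adjustment (simplifying assumption).
    """
    n = len(genotype)
    mg = sum(genotype) / n
    mp = sum(phenotype) / n
    sgp = sum((g - mg) * (y - mp) for g, y in zip(genotype, phenotype))
    sgg = sum((g - mg) ** 2 for g in genotype)
    spp = sum((y - mp) ** 2 for y in phenotype)
    if sgg == 0 or spp == 0:
        return 1.0  # no variation, nothing to test
    z = sgp / math.sqrt(sgg * spp) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def phewas(genotype, phenotypes):
    """Test one fixed variant against every phenotype in turn."""
    return {name: score_test_pvalue(genotype, values)
            for name, values in phenotypes.items()}


# Simulated data (assumption): one variant carried by roughly 5% of samples.
random.seed(1)
n_samples = 2000
genotype = [1 if random.random() < 0.05 else 0 for _ in range(n_samples)]
phenotypes = {
    "trait_a": [0.8 * g + random.gauss(0, 1) for g in genotype],  # affected
    "trait_b": [random.gauss(0, 1) for _ in genotype],            # unrelated
}
results = phewas(genotype, phenotypes)
```

A GWAS would instead hold the phenotype fixed and loop over variants; in both designs the number of tests performed determines the multiple-testing correction that the resulting p-values need.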
You can upload your own data to Velsera’s Seven Bridges environment and run the STAARpipeline, or access publicly available data such as TOPMed on the NHLBI BioData Catalyst Powered by Seven Bridges platform. Additional information on how to get started is available in the Seven Bridges QuickStart, CGC Knowledge Center, BDC, and CAVATICA documentation. Please contact us if you have any questions or need support.
For more details on the pipeline, its specific inputs and outputs, as well as detailed instructions on how to run the workflow, please see its description page on Velsera’s Seven Bridges Public Apps Gallery.
References
[1] Li, Z., Li, X., Zhou, H. et al. A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies. Nat Methods 19, 1599–1611 (2022). https://doi.org/10.1038/s41592-022-01640-x
[2] https://tinyurl.com/staarpipelineapps
[3] Li, X., Li, Z., Zhou, H. et al. Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat Genet 52, 969–983 (2020). https://doi.org/10.1038/s41588-020-0676-4
[4] Zhou, H., Arapoglou, T., Li, X. et al. FAVOR: functional annotation of variants online resource and annotator for variation across the human genome. Nucleic Acids Res 51, D1300–D1311 (2023). https://doi.org/10.1093/nar/gkac966
[5] https://favor.genohub.org/
[6] Li, Z., Li, X., Liu, Y. et al. Dynamic Scan Procedure for Detecting Rare-Variant Association Regions in Whole-Genome Sequencing Studies. Am J Hum Genet 104, 802–814 (2019). https://doi.org/10.1016/j.ajhg.2019.03.002
[7] Gogarten, S.M., Sofer, T., Chen, H. et al. Genetic association testing using the GENESIS R/Bioconductor package. Bioinformatics 35, 5346–5348 (2019). https://doi.org/10.1093/bioinformatics/btz567
[8] https://github.com/rounakdey/FastSparseGRM
[9] Zheng, X., Levine, D., Shen, J. et al. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 28, 3326–3328 (2012). https://doi.org/10.1093/bioinformatics/bts606
[10] Zheng, X., Gogarten, S.M., Lawrence, M. et al. SeqArray—a storage-efficient high-performance data format for WGS variant calls, Bioinformatics 33, 2251–2257 (2017). https://doi.org/10.1093/bioinformatics/btx145
[11] https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1VGTJI
[12] Selvaraj, M.S., Li, X., Li, Z. et al. Whole genome sequence analysis of blood lipid levels in >66,000 individuals. Nat Commun 13, 5995 (2022). https://doi.org/10.1038/s41467-022-33510-7
[13] Wang, Y., Selvaraj, M.S., Li, X. et al. Rare variants in long non-coding RNAs are associated with blood lipid levels in the TOPMed whole-genome sequencing study. Am J Hum Genet 110, 1704–1717 (2023). https://doi.org/10.1016/j.ajhg.2023.09.003
[14] Hawkes, G., Beaumont, R.N., Li, Z. et al. Whole genome association testing in 333,100 individuals across three biobanks identifies rare non-coding single variant and genomic aggregate associations with height. Preprint version (2023). https://doi.org/10.1101/2023.11.19.566520
[15] Jiang, M., Gaynor, S.M., Li, X. et al. Whole genome sequencing based analysis of inflammation biomarkers in the Trans-Omics for Precision Medicine (TOPMed) consortium. Hum Mol Genet (2024). https://doi.org/10.1093/hmg/ddae050
[16] Feofanova, E.V., Brown, M.R., Alkis, T. et al. Whole-Genome Sequencing Analysis of Human Metabolome in Multi-Ethnic Populations. Nat Commun 14, 3111 (2023). https://doi.org/10.1038/s41467-023-38800-2
[17] https://tinyurl.com/staarpipelinephewasapps
[18] Robinson, J.R., Denny, J.C., Roden, D.M. et al. Genome-wide and Phenome-wide Approaches to Understand Variable Drug Actions in Electronic Health Records. Clin Transl Sci 11, 112–122 (2018). https://doi.org/10.1111/cts.12522