Overview of STAARpipeline apps

Extensive whole-genome sequencing research has facilitated the examination of rare variants (RVs) in noncoding and coding regions and their links to complex human diseases and traits.

The STAARpipeline tools offer researchers a computationally efficient and robust approach to RV association detection. They can automatically annotate whole-genome/whole-exome sequencing (WGS/WES) studies and perform flexible noncoding and coding RV association analysis [1]. The tools include single variant, gene-centric coding, gene-centric noncoding, noncoding RNA (ncRNA), and sliding window tests for continuous or dichotomous outcomes [2], and they incorporate multiple functional annotations to empower RV (set) association analysis using the variant-Set Test for Association using Annotation infoRmation (STAAR) method [3]. Compared with other GWAS tools, the main highlights of STAARpipeline are:

Functional annotation of both noncoding and coding variants and creation of an annotated genotype dataset using the multi-faceted functional annotation database FAVOR [1, 4, 5].

Association analysis power is increased by dynamically incorporating functional annotations.

Additional strategies for grouping noncoding variants in gene-centric analysis based on functional annotations within STAAR [1]. STAAR statistics weight variants using multiple annotation Principal Components (aPCs), which provide multi-dimensional summaries of variant annotations and capture their multi-faceted biological impact, together with three integrative scores (CADD, LINSIGHT, and FATHMM-XF).

The STAAR-O procedure is used to calculate the P value of each variant set. It is an omnibus test that aggregates multiple annotation-weighted burden, SKAT, and ACAT-V tests in the STAAR framework [1].

A flexible, data-adaptive window size rare variant association test. This approach extends the SCANG (scan the genome) method [6] by incorporating multiple functional annotations through STAAR (SCANG-STAAR), while accounting for both relatedness and population structure through the generalized linear mixed model (GLMM) framework for quantitative and dichotomous traits [1]. This option is currently under development in the STAARpipeline app on the NHLBI BioData Catalyst Powered by Seven Bridges (BDC-SB) platform.

The STAAR method accounts for both relatedness and population structure, as well as longitudinal follow-up designs, for both quantitative and dichotomous traits, using a generalized linear mixed model (GLMM) framework that includes linear and logistic mixed models [3].

The STAAR method is computationally scalable for very large WGS studies and biobanks of hundreds of thousands of samples, using sparse genetic relatedness matrices (GRMs) [3,7,8].
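To make the omnibus idea behind STAAR-O more concrete, the following is a minimal Python sketch of the Cauchy combination (ACAT) rule, assuming, as described in the STAAR paper [3], that this rule is what combines the P values of the annotation-weighted burden, SKAT, and ACAT-V tests. The function name and the example P values are hypothetical. This is an illustration only, not the STAARpipeline implementation, and it omits the numerical safeguards that production code needs for extremely small P values.

```python
import math


def cauchy_combine(p_values, weights=None):
    """Combine P values with the Cauchy combination (ACAT) rule.

    Each P value is mapped to a Cauchy-distributed quantity, the weighted
    average is taken, and the average is mapped back to a P value.
    """
    if not p_values:
        raise ValueError("at least one P value is required")
    if weights is None:
        weights = [1.0] * len(p_values)  # equal weights by default
    if len(weights) != len(p_values):
        raise ValueError("weights and p_values must have the same length")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    statistic = 0.0
    for p, w in zip(p_values, weights):
        if not 0.0 < p < 1.0:
            raise ValueError("P values must lie strictly between 0 and 1")
        statistic += (w / total) * math.tan((0.5 - p) * math.pi)

    # Upper tail of the standard Cauchy distribution at `statistic`.
    return 0.5 - math.atan(statistic) / math.pi


# Hypothetical component P values for a single variant set.
component_p = {"burden": 0.04, "SKAT": 0.002, "ACAT-V": 0.3}
omnibus_p = cauchy_combine(list(component_p.values()))
print(f"omnibus P value: {omnibus_p:.4g}")
```

The combined P value is driven mainly by the smallest component P values, which is why an omnibus test of this kind can stay powerful when only one of the component tests detects the signal.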

To enable researchers to use the STAARpipeline together with publicly available data on Velsera platforms and in participating ecosystems, Velsera collaborated with the tool authors, Xihao Li (Assistant Professor at UNC) and Zilin Li (Professor at Northeast Normal University in China), to create STAARpipeline Common Workflow Language (CWL) tools. These tools are available to researchers on the NHLBI BioData Catalyst Powered by Seven Bridges platform for null model fitting, single variant testing, and aggregate association testing.

Running the STAARpipeline requires thorough variant annotation. The Functional Annotation of Variants Online Resource (FAVOR) is a comprehensive whole-genome variant annotation database and variant browser that provides hundreds of functional annotation scores from a variety of biological functional dimensions for all possible 9 billion single nucleotide variants (SNVs) and 80 million observed short insertions/deletions (indels) [4]. It integrates data from multiple sources, including CADD v1.5, GENCODE v31, ANNOVAR, WGSA, ClinVar, ENCODE, SnpEff, 1000 Genomes, TOPMed Bravo Freeze 8, gnomAD v3, and other individual studies [4]. The FAVOR database is used to functionally annotate the genotype data in the Genomic Data Structure (GDS) file [9, 10] of any WGS/WES study, and the result is stored in an annotated GDS (aGDS) file (Figure 1). FAVOR also includes annotation Principal Components (aPCs), which summarize multiple aspects of variant function by calculating the first variant-specific PC from the individual functional annotation scores in a functional category [3, 4]. For example, aPC-Protein-Function is the first PC of the seven individual standardized protein function scores. The FAVORannotator app, available on BDC-SB, uses the FAVOR Essential Database [11] to annotate variants.
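The aPC idea described above, taking the first principal component of the standardized scores within one functional category, can be sketched in a few lines of Python. This is a generic first-PC illustration under stated assumptions, not FAVOR's actual procedure: the function names and toy scores are hypothetical, any FAVOR-specific transformations or rescaling of the scores are not reproduced, and power iteration is used in place of a library decomposition only to keep the sketch dependency-free.

```python
import math


def standardize(column):
    """Center a column of scores and scale it to unit sample variance."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    if sd == 0:
        raise ValueError("a constant annotation score cannot be standardized")
    return [(x - mean) / sd for x in column]


def first_pc(rows, iterations=500):
    """Return (scores, loadings) for the first PC of the standardized columns.

    rows: one list of annotation scores per variant (variants x annotations).
    """
    columns = [standardize(list(col)) for col in zip(*rows)]
    n, k = len(rows), len(columns)

    # Correlation matrix of the annotation scores.
    corr = [[sum(a * b for a, b in zip(columns[i], columns[j])) / (n - 1)
             for j in range(k)] for i in range(k)]

    # Power iteration for the leading eigenvector. The uniform start vector
    # is adequate when the scores are mostly positively correlated.
    loadings = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        product = [sum(corr[i][j] * loadings[j] for j in range(k))
                   for i in range(k)]
        norm = math.sqrt(sum(x * x for x in product))
        loadings = [x / norm for x in product]

    # One summary value per variant: its projection on the first PC.
    scores = [sum(columns[j][i] * loadings[j] for j in range(k))
              for i in range(n)]
    return scores, loadings


# Hypothetical scores: 5 variants x 3 annotations from one category.
toy_scores = [
    [0.1, 0.2, 0.0],
    [0.9, 0.8, 1.0],
    [0.4, 0.5, 0.3],
    [0.7, 0.9, 0.8],
    [0.2, 0.1, 0.3],
]
pc_scores, pc_loadings = first_pc(toy_scores)
for variant_index, value in enumerate(pc_scores):
    print(f"variant {variant_index}: first-PC score {value:.3f}")
```

Each variant ends up with a single number that summarizes several correlated annotation scores, which is the role the aPCs play as variant weights in the STAAR tests.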

Running GWAS and PheWAS using the STAARpipeline on the Seven Bridges Platforms

With the emerging need to identify genetic components of complex traits in the noncoding genome, the STAARpipeline was designed for both noncoding and coding rare variant association detection across the genome [1]. STAARpipeline has been used in various rare variant association studies that apply multiple aggregate tests across the genome to identify gene-specific functional categories and noncoding genomic regions influencing plasma lipid concentrations (low-density lipoprotein cholesterol, high-density lipoprotein cholesterol, triglycerides, and total cholesterol) [1, 12, 13], fasting glucose and fasting insulin [1], kidney function [1], telomere length [1], height [14], inflammatory biomarkers [15], and circulating metabolites [16], using TOPMed WGS data. Researchers can now run a functionally informed genome-wide analysis with the STAARpipeline on the Seven Bridges Platforms using the following apps:

FAVORannotator – Automatically annotates the variants from WGS/WES studies and integrates the functional annotations with the genotype data in GDS format.

STAARpipeline – Performs phenotype-genotype association analyses using the STAAR procedure.

STAARpipeline PheWAS – Runs single variant and aggregate tests on biobank-scale whole-genome/whole-exome sequencing data in a resource-efficient fashion for Phenome-Wide Association Studies (PheWAS) [17].

STAARpipelineSummary VarSet – Takes the single variant or aggregate test results generated by the STAARpipeline app, summarizes them across all chromosomes, and creates a unified list of results. This tool can also perform conditional analysis for (unconditionally) significant single variants or variant sets by adjusting for a given list of known variants.

STAARpipelineSummary IndVar – Extracts information on individual variants from a user-specified variant set, for variants belonging to a specific gene category or genetic region.

In contrast to GWAS, which starts from a phenotype and analyzes variants across the genome to find associations with that phenotype, a phenome-wide association study (PheWAS) takes the reverse approach (Figure 2): it starts from a specific variant and analyzes many phenotypes to investigate whether they are associated with that variant [18]. This enables researchers to find multiple phenotypes that are individually associated with the same variant, using the STAARpipeline PheWAS app, which is also available on NHLBI BioData Catalyst Powered by Seven Bridges. The STAARpipeline PheWAS tool is designed to run single variant and aggregate tests on biobank-scale WGS/WES data in a resource-efficient fashion [17]. Like the STAARpipeline app used for GWAS, it can perform single variant, gene-centric coding, gene-centric noncoding, ncRNA, and sliding window tests for continuous or dichotomous outcomes. It also incorporates multiple functional annotations to empower rare variant (set) association analysis for each phenotype using the STAAR method [17].
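The difference in direction between the two designs can be shown with a toy Python sketch. It assumes a plain marginal score test for a continuous outcome (the statistic n·r² compared with a chi-square distribution with 1 degree of freedom), with no covariates, no relatedness adjustment, and no functional annotations, so it is far simpler than the mixed-model tests in the STAARpipeline apps. All names and data are hypothetical.

```python
import math


def association_p(genotype, phenotype):
    """Marginal score-test P value for one variant and one continuous trait.

    The statistic n * r^2 is referred to a chi-square distribution with
    1 degree of freedom, whose upper tail equals erfc(sqrt(x / 2)).
    """
    n = len(genotype)
    g_mean = sum(genotype) / n
    y_mean = sum(phenotype) / n
    s_gy = sum((g - g_mean) * (y - y_mean) for g, y in zip(genotype, phenotype))
    s_gg = sum((g - g_mean) ** 2 for g in genotype)
    s_yy = sum((y - y_mean) ** 2 for y in phenotype)
    if s_gg == 0 or s_yy == 0:
        return 1.0  # no variation, so no evidence of association
    chi_square = n * s_gy * s_gy / (s_gg * s_yy)
    return math.erfc(math.sqrt(chi_square / 2.0))


def gwas(variants, phenotype):
    """GWAS direction: one phenotype, scan over many variants."""
    return {name: association_p(g, phenotype) for name, g in variants.items()}


def phewas(genotype, phenotypes):
    """PheWAS direction: one variant, scan over many phenotypes."""
    return {name: association_p(genotype, y) for name, y in phenotypes.items()}


# Hypothetical genotype dosages (0/1/2) and trait values for 8 samples.
variants = {
    "variant_A": [0, 1, 2, 0, 1, 2, 0, 1],
    "variant_B": [0, 0, 1, 0, 2, 0, 1, 0],
}
phenotypes = {
    "trait_1": [0.1, 1.2, 2.1, -0.2, 0.9, 1.8, 0.3, 1.1],
    "trait_2": [1.0, 0.2, 0.5, 0.9, 0.1, 0.7, 0.3, 0.6],
}

gwas_result = gwas(variants, phenotypes["trait_1"])
phewas_result = phewas(variants["variant_A"], phenotypes)
print("GWAS of trait_1:", gwas_result)
print("PheWAS of variant_A:", phewas_result)
```

Both scans call the same test; only the loop changes, from variants for a fixed phenotype to phenotypes for a fixed variant.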

You can easily upload your data to Velsera’s Seven Bridges environment and run the STAARpipeline, or access publicly available data such as TOPMed on the NHLBI BioData Catalyst Powered by Seven Bridges platform. Additional information on how to get started is available in the Seven Bridges QuickStart, CGC Knowledge Center, BDC, and CAVATICA documentation. Please contact us if you have any questions or need support.

For more details on the pipeline, its specific inputs and outputs, as well as detailed instructions on how to run the workflow, please see its description page on Velsera’s Seven Bridges Public Apps Gallery.

References

[1] Li, Z., Li, X., Zhou, H. et al. A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies. Nat Methods 19, 1599–1611 (2022). https://doi.org/10.1038/s41592-022-01640-x

[2] https://tinyurl.com/staarpipelineapps

[3] Li, X., Li, Z., Zhou, H. et al. Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat Genet 52, 969–983 (2020). https://doi.org/10.1038/s41588-020-0676-4

[4] Zhou, H., Arapoglou, T., Li, X. et al. FAVOR: functional annotation of variants online resource and annotator for variation across the human genome. Nucleic Acids Res 51, D1300–D1311 (2023). https://doi.org/10.1093/nar/gkac966

[5] https://favor.genohub.org/

[6] Li, Z., Li, X., Liu, Y. et al. Dynamic Scan Procedure for Detecting Rare-Variant Association Regions in Whole-Genome Sequencing Studies. Am J Hum Genet 104, 802–814 (2019). https://doi.org/10.1016/j.ajhg.2019.03.002

[7] Gogarten, S.M., Sofer, T., Chen, H. et al. Genetic association testing using the GENESIS R/Bioconductor package. Bioinformatics 35, 5346–5348 (2019). https://doi.org/10.1093/bioinformatics/btz567

[8] https://github.com/rounakdey/FastSparseGRM

[9] Zheng, X., Levine, D., Shen, J. et al. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 28, 3326–3328 (2012). https://doi.org/10.1093/bioinformatics/bts606

[10] Zheng, X., Gogarten, S.M., Lawrence, M. et al. SeqArray—a storage-efficient high-performance data format for WGS variant calls, Bioinformatics 33, 2251–2257 (2017). https://doi.org/10.1093/bioinformatics/btx145

[11] https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1VGTJI

[12] Selvaraj, M.S., Li, X., Li, Z. et al. Whole genome sequence analysis of blood lipid levels in >66,000 individuals. Nat Commun 13, 5995 (2022). https://doi.org/10.1038/s41467-022-33510-7

[13] Wang, Y., Selvaraj, M.S., Li, X. et al. Rare variants in long non-coding RNAs are associated with blood lipid levels in the TOPMed whole-genome sequencing study. Am J Hum Genet 110, 1704–1717 (2023). https://doi.org/10.1016/j.ajhg.2023.09.003

[14] Hawkes, G., Beaumont, R.N., Li, Z. et al. Whole genome association testing in 333,100 individuals across three biobanks identifies rare non-coding single variant and genomic aggregate associations with height. Preprint version (2023). https://doi.org/10.1101/2023.11.19.566520

[15] Jiang, M., Gaynor, S.M., Li, X. et al. Whole genome sequencing based analysis of inflammation biomarkers in the Trans-Omics for Precision Medicine (TOPMed) consortium. Hum Mol Genet (2024). https://doi.org/10.1093/hmg/ddae050

[16] Feofanova, E.V., Brown, M.R., Alkis, T. et al. Whole-Genome Sequencing Analysis of Human Metabolome in Multi-Ethnic Populations. Nat Commun 14, 3111 (2023). https://doi.org/10.1038/s41467-023-38800-2

[17] https://tinyurl.com/staarpipelinephewasapps

[18] Robinson, J.R., Denny, J.C., Roden, D.M. et al. Genome-wide and Phenome-wide Approaches to Understand Variable Drug Actions in Electronic Health Records. Clin Transl Sci 11, 112–122 (2018). https://doi.org/10.1111/cts.12522