News & Updates
GP2 Release Notes – December 2021
Storage updates and release schedules: We aim to update the stored data at least quarterly, as scheduled releases. Each release is a top level directory within gp2_tier1 and gp2_tier2, which contain public summary level data and private participant level data, respectively. For example, release 1 on November 29th 2021 is in the top level directory /release1_29112021 in both the tier1 and tier2 storage buckets.
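The directory naming convention can be sketched in Python. This is a minimal sketch: it assumes the suffix is the release date written as DDMMYYYY, which is inferred from the single example given (29112021 for November 29th 2021). The function name release_dir is hypothetical.

```python
from datetime import date


def release_dir(release_number: int, release_date: date) -> str:
    """Build the top level directory name for a release.

    Assumes the pattern release<N>_<DDMMYYYY>, inferred from the
    example /release1_29112021 (release 1, November 29th 2021).
    """
    return f"release{release_number}_{release_date.strftime('%d%m%Y')}"


print(release_dir(1, date(2021, 11, 29)))  # release1_29112021
```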
Contact: For questions relating to data processing, please email firstname.lastname@example.org.
Release specific info follows below.
For a current list of samples, studies, cohorts and geographic territories covered by GP2 please see the GP2 website here.
- 4908 samples are added in this release, bringing the number of available GP2 samples to 4908.
- New genotype samples were processed using GenoTools version 0.1 [https://github.com/dvitale199/GenoTools]. All samples were imputed to the TOPMed reference panel, as detailed in the GenoTools pipeline.
- All data provided is GRCh38 (hg38).
* GDPR note: Currently, none of the data in this release is governed by GDPR.
Bucket and Directory Structure:
Bucket and Directory Overview:
- gp2_tier1 - The bucket for summary statistics and other non-participant level data. Its top level directories always correspond to releases, with the same structure mirrored in each release.
- summary_statistics - The file META5_no23_with_rsids2.txt contains open access summary statistics from the most recent Parkinson’s GWAS (excluding 23andMe samples, from Nalls et al 2019, https://pubmed.ncbi.nlm.nih.gov/31701892/). It can be found here as well as in the tier 2 storage bucket. Column headers conform to the standard METAL meta-analysis output [https://genome.sph.umich.edu/wiki/METAL_Documentation].
- gp2_tier2 - The bucket for participant level data. Its top level directories always correspond to releases, with the same structure mirrored in each release. Its contents are described below.
- raw_genotypes - PLINK binary files for each ancestry group for all samples passing quality control prior to imputation. Each PLINK binary includes all attempted variants from the array for that ancestry group. Note that all known duplicate samples were removed, but related samples remain to allow flexibility in community analyses.
- imputed_genotypes - All genotype data has been imputed using the TOPMed reference panel and is contained in the PLINK2 files chr*.*. Prior to upload, these files were filtered for minor allele count > 10 and imputation quality > 0.3, as is industry standard. The data were separated into genetically defined ancestry groups prior to imputation, with one file set per group.
- meta_data - The meta data in the HDF5 file GP2_round1.QC.metrics.h5 currently comprises QC, ancestry counts, ancestry labels, confusion matrix, new samples UMAP, projected principal components, pruned samples, reference principal component, reference UMAP, and total (samples and reference) UMAP.
- clinical_data - The data dictionary explaining the columns can be found in release1_29112021_data_dictionary.csv.
- summary_statistics - This includes basic summary statistics from gp2_tier1.
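The filter described for imputed_genotypes can be illustrated with a short Python sketch. This is not the GenoTools implementation. The Variant record, its field names and the example values are hypothetical; only the two thresholds (minor allele count > 10, imputation quality > 0.3) come from the description above.

```python
from dataclasses import dataclass

MIN_MAC = 10        # keep variants with minor allele count > 10
MIN_QUALITY = 0.3   # keep variants with imputation quality > 0.3


@dataclass
class Variant:
    """Hypothetical per-variant record, for illustration only."""
    variant_id: str
    alt_allele_count: int
    total_alleles: int
    imputation_quality: float

    @property
    def minor_allele_count(self) -> int:
        # The minor allele is whichever of the two alleles is rarer.
        return min(self.alt_allele_count,
                   self.total_alleles - self.alt_allele_count)


def passes_release_filter(v: Variant) -> bool:
    """Apply both thresholds; both comparisons are strict."""
    return (v.minor_allele_count > MIN_MAC
            and v.imputation_quality > MIN_QUALITY)


# Made-up example values, not taken from the release.
variants = [
    Variant("chr1:1000:A:G", 25, 2000, 0.95),    # kept
    Variant("chr1:2000:C:T", 10, 2000, 0.99),    # dropped: count is not > 10
    Variant("chr1:3000:G:A", 500, 2000, 0.25),   # dropped: quality is not > 0.3
    Variant("chr1:4000:T:C", 1984, 2000, 0.80),  # kept: minor allele count is 16
]
kept = [v.variant_id for v in variants if passes_release_filter(v)]
print(kept)  # ['chr1:1000:A:G', 'chr1:4000:T:C']
```

Because the uploaded files are already filtered, a check like this is mainly useful for confirming the thresholds or for applying stricter ones downstream.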
Ancestry group definitions
- AAC - African Admixed
- AFR - African Ancestry
- AJ - Ashkenazi Jewish
- AMR - Latino and Indigenous Americas populations
- EUR - general European ancestry
- EAS - East Asian ancestry
- SAS - South Asian ancestry
- FIN - Finnish population isolate
This release does not contain AFR or FIN because the sample sizes were too small to achieve adequate imputation quality.
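For scripts that loop over ancestry groups, the labels above can be kept in a small lookup table. This is only a sketch: the variable names are hypothetical, and the codes, descriptions and excluded groups are copied from the list above.

```python
# Ancestry codes and descriptions as listed in these release notes.
ANCESTRY_GROUPS = {
    "AAC": "African Admixed",
    "AFR": "African Ancestry",
    "AJ": "Ashkenazi Jewish",
    "AMR": "Latino and Indigenous Americas populations",
    "EUR": "general European ancestry",
    "EAS": "East Asian ancestry",
    "SAS": "South Asian ancestry",
    "FIN": "Finnish population isolate",
}

# Groups stated as absent from this release.
EXCLUDED_FROM_RELEASE = {"AFR", "FIN"}

# Dicts preserve insertion order, so this follows the order of the list above.
included = [code for code in ANCESTRY_GROUPS
            if code not in EXCLUDED_FROM_RELEASE]
print(included)  # ['AAC', 'AJ', 'AMR', 'EUR', 'EAS', 'SAS']
```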