READ ME

This text describes the data presented in the paper "Genome-wide association study identifies susceptibility loci for acute myeloid leukemia", Lin et al.

========================
Introductory information
========================
Files included in the data deposit (include a short description of what data are contained): 

1) Genome-wide association summary statistics file for risk of developing acute myeloid leukemia.

Key words used to describe the data: acute myeloid leukemia; AML

========================== 
Methodological information
==========================
The primary outcome assessed in this study was risk of developing acute myeloid leukemia (AML). For each GWAS, association tests were performed for all AML cases and for cytogenetically normal AML, assuming an additive genetic model, with nominally significant principal components included in the analysis as covariates. Association summary statistics were combined for variants common to all four GWAS in fixed-effects models using PLINK. Cochran’s Q statistic was used to test for heterogeneity and the I^2 statistic was used to quantify variation due to heterogeneity.
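
The fixed-effects combination and the heterogeneity statistics described above can be sketched as follows. This is a minimal illustration of standard inverse-variance weighting, not the PLINK code used in the study. The function name fixed_effect_meta and its inputs (per-study effect estimates and standard errors for a single variant) are assumptions made for the example; the deposited file contains only the combined results.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance fixed-effects meta-analysis for one variant.

    betas, ses: per-study effect estimates (e.g. log odds ratios) and their
    standard errors. Hypothetical inputs; not columns of the deposited file.
    """
    if len(betas) != len(ses) or len(betas) < 2:
        raise ValueError("need matching beta/SE lists from at least two studies")

    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in ses]
    w_sum = sum(weights)

    beta = sum(w * b for w, b in zip(weights, betas)) / w_sum
    se = math.sqrt(1.0 / w_sum)
    z = beta / se
    # Two-sided P value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I^2: percentage of variation attributable to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    return {"beta": beta, "se": se, "z": z, "p": p, "q": q, "df": df, "i2": i2}
```

For example, fixed_effect_meta([0.0, 2.0], [1.0, 1.0]) pools two equally weighted studies to an estimate of 1.0, with Q = 2 on 1 degree of freedom and I^2 = 50%.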

Date(s) of data collection: August 2021

Geographic coverage of data: Europe and the United States of America

Data validation (how was the data checked, proofed and cleaned): For each GWAS, we excluded SNPs with extreme departure from Hardy-Weinberg equilibrium (HWE; P < 10^-3 in either cases or controls) and those with a low call rate (< 95%). We also excluded SNPs that showed significant differences (P < 10^-3) between genotype batches and those with significant differences (P < 0.05) in missingness between cases and controls. Individual samples with a call rate of < 95% or with extreme heterozygosity rates (+/- 3 standard deviations from the mean) were also excluded from each GWAS. Individuals were removed such that no two individuals had an estimated relatedness pihat > 0.1875, both within and across GWAS. The individual with the higher call rate was retained unless relatedness was identified between a case and a control, in which case the case was preferentially retained. Ancestry was assessed using principal component analysis with super-populations from the 1000 Genomes Project as a reference, and individuals of non-European ancestry were excluded based on the first two principal components. In order to minimise any impact of population stratification among the European population, we excluded outlying cases and controls identified using principal components 1 and 2 for each GWAS.
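
The sample-level rules above (call rate, heterozygosity, and the choice of which related individual to remove) can be illustrated with a short sketch. It is not the pipeline used in the study. The record layout (id, call_rate, het_rate, is_case) and the function names are hypothetical. Computing the heterozygosity mean and standard deviation over all samples before any exclusion, and removing the second individual of a pair when call rates tie, are assumptions.

```python
import statistics

PIHAT_THRESHOLD = 0.1875  # pairs above this estimated relatedness are pruned


def sample_qc(samples, min_call_rate=0.95, het_sd=3.0):
    """Return the ids of samples passing call-rate and heterozygosity filters.

    samples: list of dicts with keys "id", "call_rate", "het_rate"
    (a hypothetical layout chosen for this example).
    """
    het_rates = [s["het_rate"] for s in samples]
    mean = statistics.mean(het_rates)
    sd = statistics.stdev(het_rates)

    keep = []
    for s in samples:
        if s["call_rate"] < min_call_rate:
            continue  # low call rate
        if abs(s["het_rate"] - mean) > het_sd * sd:
            continue  # heterozygosity more than het_sd SDs from the mean
        keep.append(s["id"])
    return keep


def drop_from_related_pair(a, b):
    """Given two related samples (pihat > PIHAT_THRESHOLD), return the id to remove.

    A case is retained over a control; otherwise the sample with the
    higher call rate is retained.
    """
    if a["is_case"] != b["is_case"]:
        return b["id"] if a["is_case"] else a["id"]
    return a["id"] if a["call_rate"] < b["call_rate"] else b["id"]
```

Note that with the sample standard deviation, a single outlier can only exceed 3 SDs when the cohort contains at least 11 samples, so the heterozygosity filter is only meaningful on cohort-sized inputs.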


=========================
Data-specific information
=========================
Definitions of names, labels, acronyms or specialist terminology used for variables, records and their values: 
phenotype; chromosome number; base position in genome build 37; SNP base; allele 1; allele 2; meta P value (fixed-effect); meta P value (random-effect); Q statistic; I^2 value; ref sequence identifier
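
As a reading aid, the eleven variables listed above could be mapped onto a typed record as sketched below. The field names, the tab delimiter, the column order (taken from the order of the list above) and the treatment of "NA" as a missing value are all assumptions for illustration; check them against the deposited file before use.

```python
from dataclasses import dataclass
from typing import Optional


def _num(text: str) -> Optional[float]:
    """Parse a numeric field, treating empty or "NA" as missing (assumed convention)."""
    return None if text in ("", "NA") else float(text)


@dataclass
class SummaryRow:
    # Field names are hypothetical; they follow the order of the variable list.
    phenotype: str
    chromosome: str
    position_b37: int          # base position in genome build 37
    snp_base: str
    allele1: str
    allele2: str
    p_fixed: Optional[float]   # meta P value (fixed-effect)
    p_random: Optional[float]  # meta P value (random-effect)
    q: Optional[float]         # Q statistic
    i2: Optional[float]        # I^2 value
    ref_id: str                # ref sequence identifier


def parse_row(line: str, sep: str = "\t") -> SummaryRow:
    """Split one delimited line into a SummaryRow (assumes exactly 11 fields)."""
    f = line.rstrip("\n").split(sep)
    if len(f) != 11:
        raise ValueError("expected 11 fields, got %d" % len(f))
    return SummaryRow(
        f[0], f[1], int(f[2]), f[3], f[4], f[5],
        _num(f[6]), _num(f[7]), _num(f[8]), _num(f[9]), f[10],
    )
```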


=======
Contact
=======
Please contact rdm@ncl.ac.uk for further information.