Preprocessing of genetic data

The UKB has performed genome-wide genotyping on roughly half a million participants. DNA was extracted from blood samples collected at the participants’ visit to the assessment centre. Genotypes were called with a pipeline from Affymetrix Research Services Laboratory, which was applied to 106 batches of approximately 4,700 samples each. The processing pipeline is described in detail in Bycroft et al.¹. The marker-based QC pipeline incorporated tests for batch and array effects, deviations from Hardy-Weinberg equilibrium, sex effects, and discordance across control replicates. The sample-based QC pipeline used two measures, missing rate and heterozygosity; abnormal values in either measure indicated a poor-quality sample. Samples with a missing rate larger than 5% were marked. Because heterozygosity depends on population structure, it was adjusted by regressing it on the first six principal components of a PCA. Affected samples were not excluded from the UKB release but were flagged, so that researchers could apply their own thresholds. The top 40 principal components were also included in the UKB data release. In addition, the UKB determined ancestral backgrounds using the PCA, estimated kinship coefficients to account for family structure, and performed genotype imputation as well as HLA imputation and validation¹.
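
To illustrate the sample-based QC logic, the following minimal Python sketch regresses heterozygosity on the principal components and flags samples by missing rate or by outlying adjusted heterozygosity. It is not the UKB implementation: the function names, the plain least-squares fit, and the outlier rule of three standard deviations (`n_sd`) are assumptions made for illustration; only the 5% missing-rate threshold and the use of principal components as covariates are taken from the text.

```python
import statistics


def least_squares(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination.

    Illustrative only: it fails with a division by zero if X'X is singular.
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def adjusted_heterozygosity(het, pcs):
    """Residual heterozygosity after regressing on the PCs plus an intercept."""
    X = [[1.0] + list(p) for p in pcs]
    beta = least_squares(X, het)
    return [h - sum(b * x for b, x in zip(beta, row))
            for h, row in zip(het, X)]


def flag_samples(missing, het, pcs, max_missing=0.05, n_sd=3.0):
    """Return one boolean per sample; True marks a potentially poor sample.

    max_missing follows the 5% threshold in the text; n_sd is an assumed
    outlier rule, not a value reported by the UKB.
    """
    resid = adjusted_heterozygosity(het, pcs)
    sd = statistics.pstdev(resid)
    return [m > max_missing or abs(r) > n_sd * sd
            for m, r in zip(missing, resid)]
```

Here `pcs` holds one list of principal-component scores per sample (six values per sample in the setting described above). The function returns flags rather than removing samples, mirroring the decision to leave thresholds to the researcher.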

For the analyses in this project, we applied the QC measures recommended by Bycroft et al.¹. In summary, we excluded SNPs with an imputation quality below 0.5, a minor allele frequency below 0.001, or missingness above 5%, as well as SNPs that failed the Hardy-Weinberg equilibrium test at p < 1 × 10⁻⁹.
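
The SNP-level exclusion criteria above can be written as a single predicate. The sketch below is illustrative only: the `SnpStats` record and its field names are hypothetical, and it assumes the per-SNP summary statistics have already been computed by other tools. The thresholds are those stated in the text.

```python
from dataclasses import dataclass


@dataclass
class SnpStats:
    """Hypothetical per-SNP summary record; field names are illustrative."""
    snp_id: str
    info: float         # imputation quality score
    maf: float          # minor allele frequency
    missingness: float  # fraction of samples with a missing genotype
    hwe_p: float        # p-value of the Hardy-Weinberg equilibrium test


def passes_qc(snp, min_info=0.5, min_maf=0.001,
              max_missing=0.05, min_hwe_p=1e-9):
    """Return True if the SNP survives all four exclusion criteria."""
    return (snp.info >= min_info
            and snp.maf >= min_maf
            and snp.missingness <= max_missing
            and snp.hwe_p >= min_hwe_p)


def filter_snps(snps):
    """Keep only the SNPs that pass QC, preserving input order."""
    return [s for s in snps if passes_qc(s)]
```

Values exactly at a threshold are retained, because the text excludes SNPs strictly below (or, for missingness, strictly above) each cut-off.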