QC of Target Data¶
Obtaining the target data¶
Target data consist of individual-level genotype-phenotype data, usually generated within your lab/department/collaboration. For this tutorial, we have simulated some genotype-phenotype data using the 1000 Genomes Project European samples. You can download the data here
Unzip the data as follow:
unzip EUR.zip
Note
You will need plink
in this section, which can be download from here.
Install the program plink
and include its location in your PATH directory, which allows us to use plink
instead of ./plink
in the commands below. If PLINK is not in your PATH directory and is instead in your working directory, replace all instances of plink
in the tutorial with ./plink
.
QC checklist: Target data¶
Below are the QC steps that comprise the QC checklist for the target data.
# Sample size¶
We recommend that users only perform PRS analyses on target data of at least 100 individuals. The sample size of our target data here is 503 individuals.
# File transfer¶
Usually we do not need to download and transfer the target data file because it is typically generated locally. However, the file should contain an md5sum code in case we send the data file to collaborators who may want to confirm that the file has not changed during the transfer.
What is the md5sum code for each of the target files?
File | md5sum |
---|---|
EUR.bed | 98bcef133f683b1272d3ea5f97742e0e |
EUR.bim | 6b286002904880055a9c94e01522f059 |
EUR.cov | 85ed18288c708e095385418517e9c3bd |
EUR.fam | e7b856f0c7bcaffc8405926d08386e97 |
EUR.height | dd445ce969a81cded20da5c88b82d4df |
# Genome build¶
As stated in the base data section, the genome build for our base and target data is the same, as it should be.
# Standard GWAS QC¶
The target data must be quality controlled to at least the standards implemented in GWAS studies, e.g. removing SNPs with low genotyping rate, low minor allele frequency, out of Hardy-Weinberg Equilibrium, removing individuals with low genotyping rate (see Marees et al).
The following plink
command applies some of these QC metrics to the target data:
plink \
--bfile EUR \
--maf 0.01 \
--hwe 1e-6 \
--geno 0.01 \
--mind 0.01 \
--write-snplist \
--make-just-fam \
--out EUR.QC
Each of the parameters corresponds to the following
Parameter | Value | Description |
---|---|---|
bfile | EUR | Informs plink that the input genotype files should have a prefix of EUR |
maf | 0.01 | Removes all SNPs with minor allele frequency less than 0.01. Genotyping errors typically have a larger influence on SNPs with low MAF. Studies with large sample sizes could apply a lower MAF threshold |
hwe | 1e-6 | Removes SNPs with low P-value from the Hardy-Weinberg Equilibrium Fisher's exact or chi-squared test. SNPs with significant P-values from the HWE test are more likely affected by genotyping error or the effects of natural selection. Filtering should be performed on the control samples to avoid filtering SNPs that are causal (under selection in cases). When phenotype information is included, plink will automatically perform the filtering in the controls. |
geno | 0.01 | Excludes SNPs that are missing in a high fraction of subjects. A two-stage filtering process is usually performed (see Marees et al). |
mind | 0.01 | Excludes individuals who have a high rate of genotype missingness, since this may indicate problems in the DNA sample or processing. (see Marees et al for more details). |
make-just-fam | - | Informs plink to only generate the QC'ed sample name to avoid generating the .bed file. |
write-snplist | - | Informs plink to only generate the QC'ed SNP list to avoid generating the .bed file. |
out | EUR.QC | Informs plink that all output should have a prefix of EUR.QC |
How many SNPs and samples were filtered?
14
samples were removed due to a high rate of genotype missingness5,353
SNP were removed due to missing genotype data944
SNPs were removed due to being out of Hardy-Weinberg Equilibrium5,061
SNPs were removed due to low minor allele frequency
Note
Normally, we can generate a new genotype file using the new sample list.
However, this will use up a lot of storage space. Using plink
's
--extract
, --exclude
, --keep
, --remove
, --make-just-fam
and --write-snplist
functions, we can work
solely on the list of samples and SNPs without duplicating the
genotype file, reducing the storage space usage.
Very high or low heterozygosity rates in individuals could be due to DNA contamination or to high levels of inbreeding. Therefore, samples with extreme heterozygosity are typically removed prior to downstream analyses.
First, we perform pruning to remove highly correlated SNPs:
plink \
--bfile EUR \
--keep EUR.QC.fam \
--extract EUR.QC.snplist \
--indep-pairwise 200 50 0.25 \
--out EUR.QC
Each of the parameters corresponds to the following
Parameter | Value | Description |
---|---|---|
bfile | EUR | Informs plink that the input genotype files should have a prefix of EUR |
keep | EUR.QC.fam | Informs plink that we only want to use samples in EUR.QC.fam in the analysis |
extract | EUR.QC.snplist | Informs plink that we only want to use SNPs in EUR.QC.snplist in the analysis |
indep-pairwise | 200 50 0.25 | Informs plink that we wish to perform pruning with a window size of 200 variants, sliding across the genome with step size of 50 variants at a time, and filter out any SNPs with LD \(r^2\) higher than 0.25 |
out | EUR.QC | Informs plink that all output should have a prefix of EUR.QC |
This will generate two files 1) EUR.QC.prune.in and 2) EUR.QC.prune.out. All SNPs within EUR.QC.prune.in have a pairwise \(r^2 < 0.25\).
Heterozygosity rates can then be computed using plink
:
plink \
--bfile EUR \
--extract EUR.QC.prune.in \
--keep EUR.QC.fam \
--het \
--out EUR.QC
This will generate the EUR.QC.het file, which contains F coefficient estimates for assessing heterozygosity.
We will remove individuals with F coefficients that are more than 3 standard deviation (SD) units from the mean, which can be performed using the following R
command (assuming that you have R downloaded, then you can open an R
session by typing R
in your terminal):
dat <- read.table("EUR.QC.het", header=T) # Read in the EUR.het file, specify it has header
m <- mean(dat$F) # Calculate the mean
s <- sd(dat$F) # Calculate the SD
valid <- subset(dat, F <= m+3*s & F >= m-3*s) # Get any samples with F coefficient within 3 SD of the population mean
write.table(valid[,c(1,2)], "EUR.valid.sample", quote=F, row.names=F) # print FID and IID for valid samples
q() # exit R
library(data.table)
# Read in file
dat <- fread("EUR.QC.het")
# Get samples with F coefficient within 3 SD of the population mean
valid <- dat[F<=mean(F)+3*sd(F) & F>=mean(F)-3*sd(F)]
# print FID and IID for valid samples
fwrite(valid[,c("FID","IID")], "EUR.valid.sample", sep="\t")
q() # exit R
How many samples were excluded due to high heterozygosity rate?
2
samples were excluded
# Ambiguous SNPs¶
These were removed during the base data QC.
# Mismatching SNPs¶
SNPs that have mismatching alleles reported in the base and target data may be resolvable by strand-flipping the alleles to their complementary alleles in e.g. the target data, such as for a SNP with A/C in the base data and G/T in the target. This can be achieved with the following steps:
Note
Most PRS software will perform strand-flipping automatically, thus this step is usually not required.
1. Load the bim file, the summary statistic and the QC SNP list into R
# Read in bim file
bim <- read.table("EUR.bim")
colnames(bim) <- c("CHR", "SNP", "CM", "BP", "B.A1", "B.A2")
# Read in QCed SNPs
qc <- read.table("EUR.QC.snplist", header = F, stringsAsFactors = F)
# Read in the GWAS data
height <-
read.table(gzfile("Height.QC.gz"),
header = T,
stringsAsFactors = F,
sep="\t")
# Change all alleles to upper case for easy comparison
height$A1 <- toupper(height$A1)
height$A2 <- toupper(height$A2)
bim$B.A1 <- toupper(bim$B.A1)
bim$B.A2 <- toupper(bim$B.A2)
# magrittr allow us to do piping, which help to reduce the
# amount of intermediate data types
library(data.table)
library(magrittr)
# Read in bim file
bim <- fread("EUR.bim") %>%
# Note: . represents the output from previous step
# The syntax here means, setnames of the data read from
# the bim file, and replace the original column names by
# the new names
setnames(., colnames(.), c("CHR", "SNP", "CM", "BP", "B.A1", "B.A2")) %>%
# And immediately change the alleles to upper cases
.[,c("B.A1","B.A2"):=list(toupper(B.A1), toupper(B.A2))]
# Read in summary statistic data (require data.table v1.12.0+)
height <- fread("Height.QC.gz") %>%
# And immediately change the alleles to upper cases
.[,c("A1","A2"):=list(toupper(A1), toupper(A2))]
# Read in QCed SNPs
qc <- fread("EUR.QC.snplist", header=F)
2. Identify SNPs that require strand flipping
# Merge summary statistic with target
info <- merge(bim, height, by = c("SNP", "CHR", "BP"))
# Filter QCed SNPs
info <- info[info$SNP %in% qc$V1,]
# Function for finding the complementary allele
complement <- function(x) {
switch (
x,
"A" = "T",
"C" = "G",
"T" = "A",
"G" = "C",
return(NA)
)
}
# Get SNPs that have the same alleles across base and target
info.match <- subset(info, A1 == B.A1 & A2 == B.A2)
# Identify SNPs that are complementary between base and target
info$C.A1 <- sapply(info$B.A1, complement)
info$C.A2 <- sapply(info$B.A2, complement)
info.complement <- subset(info, A1 == C.A1 & A2 == C.A2)
# Update the complementary alleles in the bim file
# This allow us to match the allele in subsequent analysis
complement.snps <- bim$SNP %in% info.complement$SNP
bim[complement.snps,]$B.A1 <-
sapply(bim[complement.snps,]$B.A1, complement)
bim[complement.snps,]$B.A2 <-
sapply(bim[complement.snps,]$B.A2, complement)
# Merge summary statistic with target
info <- merge(bim, height, by=c("SNP", "CHR", "BP")) %>%
# And filter out QCed SNPs
.[SNP %in% qc[,V1]]
# Function for calculating the complementary allele
complement <- function(x){
switch (x,
"A" = "T",
"C" = "G",
"T" = "A",
"G" = "C",
return(NA)
)
}
# Get SNPs that have the same alleles across base and target
info.match <- info[A1 == B.A1 & A2 == B.A2, SNP]
# Identify SNPs that are complementary between base and target
com.snps <- info[sapply(B.A1, complement) == A1 &
sapply(B.A2, complement) == A2, SNP]
# Now update the bim file
bim[SNP %in% com.snps, c("B.A1", "B.A2") :=
list(sapply(B.A1, complement),
sapply(B.A2, complement))]
3. Identify SNPs that require recoding in the target (to ensure the coding allele in the target data is the effective allele in the base summary statistic)
# identify SNPs that need recoding
info.recode <- subset(info, A1 == B.A2 & A2 == B.A1)
# Update the recode SNPs
recode.snps <- bim$SNP %in% info.recode$SNP
tmp <- bim[recode.snps,]$B.A1
bim[recode.snps,]$B.A1 <- bim[recode.snps,]$B.A2
bim[recode.snps,]$B.A2 <- tmp
# identify SNPs that need recoding & complement
info.crecode <- subset(info, A1 == C.A2 & A2 == C.A1)
# Update the recode + strand flip SNPs
com.snps <- bim$SNP %in% info.crecode$SNP
tmp <- bim[com.snps,]$B.A1
bim[com.snps,]$B.A1 <- as.character(sapply(bim[com.snps,]$B.A2, complement))
bim[com.snps,]$B.A2 <- as.character(sapply(tmp, complement))
# Output updated bim file
write.table(
bim[,c("SNP", "B.A1")],
"EUR.a1",
quote = F,
row.names = F,
col.names = F,
sep="\t"
)
# identify SNPs that need recoding
recode.snps <- info[B.A1==A2 & B.A2==A1, SNP]
# Update the bim file
bim[SNP %in% recode.snps, c("B.A1", "B.A2") :=
list(B.A2, B.A1)]
# identify SNPs that need recoding & complement
com.recode <- info[sapply(B.A1, complement) == A2 &
sapply(B.A2, complement) == A1, SNP]
# Now update the bim file
bim[SNP %in% com.recode, c("B.A1", "B.A2") :=
list(sapply(B.A2, complement),
sapply(B.A1, complement))]
# Write the updated bim file
fwrite(bim[,c("SNP", "B.A1")], "EUR.a1", col.names=F, sep="\t")
4. Identify SNPs that have different allele in base and target (usually due to difference in genome build or Indel)
mismatch <-
bim$SNP[!(bim$SNP %in% info.match$SNP |
bim$SNP %in% info.complement$SNP |
bim$SNP %in% info.recode$SNP |
bim$SNP %in% info.crecode$SNP)]
write.table(
mismatch,
"EUR.mismatch",
quote = F,
row.names = F,
col.names = F
)
q() # exit R
mismatch <- bim[!(SNP %in% info.match |
SNP %in% com.snps |
SNP %in% recode.snps |
SNP %in% com.recode), SNP]
write.table(mismatch, "EUR.mismatch", quote=F, row.names=F, col.names=F)
q() # exit R
We can then use the EUR.a1 file to update the A1 alleles
# Duplicate SNPs¶
Make sure to remove any duplicate SNPs in your target data (these target data were simulated and so include no duplicated SNPs).
# Sex chromosomes¶
Sometimes sample mislabelling can occur, which may lead to invalid results. One indication of a mislabelled sample is a difference between reported sex and that indicated by the sex chromosomes. While this may be due to a difference in sex and gender identity, it could also reflect mislabeling of samples or misreporting and, thus, individuals in which there is a mismatch between biological and reported sex are typically removed. A sex check can be performed in PLINK, in which individuals are called as females if their X chromosome homozygosity estimate (F statistic) is < 0.2 and as males if the estimate is > 0.8.
Before performing a sex check, pruning should be performed (see here).
A sex check can then easily be conducted using plink
plink \
--bfile EUR \
--extract EUR.QC.prune.in \
--keep EUR.valid.sample \
--check-sex \
--out EUR.QC
This will generate a file called EUR.QC.sexcheck containing the F-statistics for each individual. Individuals are typically called as being biologically male if the F-statistic is > 0.8 and biologically female if F < 0.2.
# Read in file
valid <- read.table("EUR.valid.sample", header=T)
dat <- read.table("EUR.QC.sexcheck", header=T)
valid <- subset(dat, STATUS=="OK" & FID %in% valid$FID)
write.table(valid[,c("FID", "IID")], "EUR.QC.valid", row.names=F, col.names=F, sep="\t", quote=F)
q() # exit R
library(data.table)
# Read in file
valid <- fread("EUR.valid.sample")
dat <- fread("EUR.QC.sexcheck")[FID%in%valid$FID]
fwrite(dat[STATUS=="OK",c("FID","IID")], "EUR.QC.valid", sep="\t")
q() # exit R
How many samples were excluded due mismatched Sex information?
4
samples were excluded
# Sample overlap¶
Since the target data were simulated there are no overlapping samples between the base and target data here (see the relevant section of the paper for discussion of the importance of avoiding sample overlap).
# Relatedness¶
Closely related individuals in the target data may lead to overfitted results, limiting the generalisability of the results.
Before calculating the relatedness, pruning should be performed (see here). Individuals that have a first or second degree relative in the sample (\(\hat{\pi} > 0.125\)) can be removed with the following command:
plink \
--bfile EUR \
--extract EUR.QC.prune.in \
--keep EUR.QC.valid \
--rel-cutoff 0.125 \
--out EUR.QC
How many related samples were excluded?
0
samples were excluded
Note
A greedy algorithm is used to remove closely related individuals in a way that optimizes the size of the sample retained. However, the algorithm is dependent on the random seed used, which can generate different results. Therefore, to reproduce the same result, you will need to specify the same random seed.
PLINK's algorithm for removing related individuals does not account for the phenotype under study. To minimize the removal of cases of a disease, the following algorithm can be used instead: GreedyRelated.
Generate final QC'ed target data file¶
After performing the full analysis, you can generate a QC'ed data set with the following command:
plink \
--bfile EUR \
--make-bed \
--keep EUR.QC.rel.id \
--out EUR.QC \
--extract EUR.QC.snplist \
--exclude EUR.mismatch \
--a1-allele EUR.a1
Each of the parameters corresponds to the following
Parameter | Value | Description |
---|---|---|
bfile | EUR | Informs plink that the input genotype files should have a prefix of EUR |
keep | EUR.QC.rel.id | Informs plink that we only want to keep samples in EUR.QC.rel.id |
extract | EUR.QC.snplist | Informs plink that we only want to use SNPs in EUR.QC.snplist in the analysis |
exclude | EUR.mismatch | Informs plink that we wish to remove any SNPs in EUR.mismatch |
a1-allele | EUR.a1 | Fix all A1 alleles to those specified in EUR.a1 |
out | EUR.QC | Informs plink that all output should have a prefix of EUR.QC |