QC of Target Data

Obtaining the target data

Target data consist of individual-level genotype-phenotype data, usually generated within your lab/department/collaboration. For this tutorial, we have simulated some genotype-phenotype data using the 1000 Genomes Project European samples. You can download the data here

Unzip the data as follow:

unzip EUR.zip

Note

You will need plink in this section, which can be download from here.

Install the program plink and include its location in your PATH directory, which allows us to use plink instead of ./plink in the commands below. If PLINK is not in your PATH directory and is instead in your working directory, replace all instances of plink in the tutorial with ./plink.

QC checklist: Target data

Below are the QC steps that comprise the QC checklist for the target data.

# Sample size

We recommend that users only perform PRS analyses on target data of at least 100 individuals. The sample size of our target data here is 503 individuals.

# File transfer

Usually we do not need to download and transfer the target data file because it is typically generated locally. However, the file should contain an md5sum code in case we send the data file to collaborators who may want to confirm that the file has not changed during the transfer.

What is the md5sum code for each of the target files?
File md5sum
EUR.bed 98bcef133f683b1272d3ea5f97742e0e
EUR.bim 6b286002904880055a9c94e01522f059
EUR.cov 85ed18288c708e095385418517e9c3bd
EUR.fam e7b856f0c7bcaffc8405926d08386e97
EUR.height dd445ce969a81cded20da5c88b82d4df

# Genome build

As stated in the base data section, the genome build for our base and target data is the same, as it should be.

# Standard GWAS QC

The target data must be quality controlled to at least the standards implemented in GWAS studies, e.g. removing SNPs with low genotyping rate, low minor allele frequency, out of Hardy-Weinberg Equilibrium, removing individuals with low genotyping rate (see Marees et al).

The following plink command applies some of these QC metrics to the target data:

plink \
    --bfile EUR \
    --maf 0.01 \
    --hwe 1e-6 \
    --geno 0.01 \
    --mind 0.01 \
    --write-snplist \
    --make-just-fam \
    --out EUR.QC

Each of the parameters corresponds to the following

Parameter Value Description
bfile EUR Informs plink that the input genotype files should have a prefix of EUR
maf 0.01 Removes all SNPs with minor allele frequency less than 0.01. Genotyping errors typically have a larger influence on SNPs with low MAF. Studies with large sample sizes could apply a lower MAF threshold
hwe 1e-6 Removes SNPs with low P-value from the Hardy-Weinberg Equilibrium Fisher's exact or chi-squared test. SNPs with significant P-values from the HWE test are more likely affected by genotyping error or the effects of natural selection. Filtering should be performed on the control samples to avoid filtering SNPs that are causal (under selection in cases). When phenotype information is included, plink will automatically perform the filtering in the controls.
geno 0.01 Excludes SNPs that are missing in a high fraction of subjects. A two-stage filtering process is usually performed (see Marees et al).
mind 0.01 Excludes individuals who have a high rate of genotype missingness, since this may indicate problems in the DNA sample or processing. (see Marees et al for more details).
make-just-fam - Informs plink to only generate the QC'ed sample name to avoid generating the .bed file.
write-snplist - Informs plink to only generate the QC'ed SNP list to avoid generating the .bed file.
out EUR.QC Informs plink that all output should have a prefix of EUR.QC
How many SNPs and samples were filtered?
  • 14 samples were removed due to a high rate of genotype missingness
  • 5,353 SNP were removed due to missing genotype data
  • 944 SNPs were removed due to being out of Hardy-Weinberg Equilibrium
  • 5,061 SNPs were removed due to low minor allele frequency

Note

Normally, we can generate a new genotype file using the new sample list. However, this will use up a lot of storage space. Using plink's --extract, --exclude, --keep, --remove, --make-just-fam and --write-snplist functions, we can work solely on the list of samples and SNPs without duplicating the genotype file, reducing the storage space usage.

Very high or low heterozygosity rates in individuals could be due to DNA contamination or to high levels of inbreeding. Therefore, samples with extreme heterozygosity are typically removed prior to downstream analyses.

First, we perform pruning to remove highly correlated SNPs:

plink \
    --bfile EUR \
    --keep EUR.QC.fam \
    --extract EUR.QC.snplist \
    --indep-pairwise 200 50 0.25 \
    --out EUR.QC

Each of the parameters corresponds to the following

Parameter Value Description
bfile EUR Informs plink that the input genotype files should have a prefix of EUR
keep EUR.QC.fam Informs plink that we only want to use samples in EUR.QC.fam in the analysis
extract EUR.QC.snplist Informs plink that we only want to use SNPs in EUR.QC.snplist in the analysis
indep-pairwise 200 50 0.25 Informs plink that we wish to perform pruning with a window size of 200 variants, sliding across the genome with step size of 50 variants at a time, and filter out any SNPs with LD \(r^2\) higher than 0.25
out EUR.QC Informs plink that all output should have a prefix of EUR.QC

This will generate two files 1) EUR.QC.prune.in and 2) EUR.QC.prune.out. All SNPs within EUR.QC.prune.in have a pairwise \(r^2 < 0.25\).

Heterozygosity rates can then be computed using plink:

plink \
    --bfile EUR \
    --extract EUR.QC.prune.in \
    --keep EUR.QC.fam \
    --het \
    --out EUR.QC

This will generate the EUR.QC.het file, which contains F coefficient estimates for assessing heterozygosity. We will remove individuals with F coefficients that are more than 3 standard deviation (SD) units from the mean, which can be performed using the following R command (assuming that you have R downloaded, then you can open an R session by typing R in your terminal):

dat <- read.table("EUR.QC.het", header=T) # Read in the EUR.het file, specify it has header
m <- mean(dat$F) # Calculate the mean  
s <- sd(dat$F) # Calculate the SD
valid <- subset(dat, F <= m+3*s & F >= m-3*s) # Get any samples with F coefficient within 3 SD of the population mean
write.table(valid[,c(1,2)], "EUR.valid.sample", quote=F, row.names=F) # print FID and IID for valid samples
q() # exit R
library(data.table)
# Read in file
dat <- fread("EUR.QC.het")
# Get samples with F coefficient within 3 SD of the population mean
valid <- dat[F<=mean(F)+3*sd(F) & F>=mean(F)-3*sd(F)] 
# print FID and IID for valid samples
fwrite(valid[,c("FID","IID")], "EUR.valid.sample", sep="\t") 
q() # exit R
How many samples were excluded due to high heterozygosity rate?
  • 2 samples were excluded

# Ambiguous SNPs

These were removed during the base data QC.

# Mismatching SNPs

SNPs that have mismatching alleles reported in the base and target data may be resolvable by strand-flipping the alleles to their complementary alleles in e.g. the target data, such as for a SNP with A/C in the base data and G/T in the target. This can be achieved with the following steps:

Note

Most PRS software will perform strand-flipping automatically, thus this step is usually not required.

1. Load the bim file, the summary statistic and the QC SNP list into R

# Read in bim file
bim <- read.table("EUR.bim")
colnames(bim) <- c("CHR", "SNP", "CM", "BP", "B.A1", "B.A2")
# Read in QCed SNPs
qc <- read.table("EUR.QC.snplist", header = F, stringsAsFactors = F)
# Read in the GWAS data
height <-
    read.table(gzfile("Height.QC.gz"),
            header = T,
            stringsAsFactors = F, 
            sep="\t")
# Change all alleles to upper case for easy comparison
height$A1 <- toupper(height$A1)
height$A2 <- toupper(height$A2)
bim$B.A1 <- toupper(bim$B.A1)
bim$B.A2 <- toupper(bim$B.A2)
# magrittr allow us to do piping, which help to reduce the 
# amount of intermediate data types
library(data.table)
library(magrittr)
# Read in bim file 
bim <- fread("EUR.bim") %>%
    # Note: . represents the output from previous step
    # The syntax here means, setnames of the data read from
    # the bim file, and replace the original column names by 
    # the new names
    setnames(., colnames(.), c("CHR", "SNP", "CM", "BP", "B.A1", "B.A2")) %>%
    # And immediately change the alleles to upper cases
    .[,c("B.A1","B.A2"):=list(toupper(B.A1), toupper(B.A2))]
# Read in summary statistic data (require data.table v1.12.0+)
height <- fread("Height.QC.gz") %>%
    # And immediately change the alleles to upper cases
    .[,c("A1","A2"):=list(toupper(A1), toupper(A2))]
# Read in QCed SNPs
qc <- fread("EUR.QC.snplist", header=F)

2. Identify SNPs that require strand flipping

# Merge summary statistic with target
info <- merge(bim, height, by = c("SNP", "CHR", "BP"))
# Filter QCed SNPs
info <- info[info$SNP %in% qc$V1,]
# Function for finding the complementary allele
complement <- function(x) {
    switch (
        x,
        "A" = "T",
        "C" = "G",
        "T" = "A",
        "G" = "C",
        return(NA)
    )
}
# Get SNPs that have the same alleles across base and target
info.match <- subset(info, A1 == B.A1 & A2 == B.A2)
# Identify SNPs that are complementary between base and target
info$C.A1 <- sapply(info$B.A1, complement)
info$C.A2 <- sapply(info$B.A2, complement)
info.complement <- subset(info, A1 == C.A1 & A2 == C.A2)
# Update the complementary alleles in the bim file
# This allow us to match the allele in subsequent analysis
complement.snps <- bim$SNP %in% info.complement$SNP
bim[complement.snps,]$B.A1 <-
    sapply(bim[complement.snps,]$B.A1, complement)
bim[complement.snps,]$B.A2 <-
    sapply(bim[complement.snps,]$B.A2, complement)
# Merge summary statistic with target
info <- merge(bim, height, by=c("SNP", "CHR", "BP")) %>%
    # And filter out QCed SNPs
    .[SNP %in% qc[,V1]]

# Function for calculating the complementary allele
complement <- function(x){
    switch (x,
        "A" = "T",
        "C" = "G",
        "T" = "A",
        "G" = "C",
        return(NA)
    )
} 
# Get SNPs that have the same alleles across base and target
info.match <- info[A1 == B.A1 & A2 == B.A2, SNP]
# Identify SNPs that are complementary between base and target
com.snps <- info[sapply(B.A1, complement) == A1 &
                    sapply(B.A2, complement) == A2, SNP]
# Now update the bim file
bim[SNP %in% com.snps, c("B.A1", "B.A2") :=
        list(sapply(B.A1, complement),
            sapply(B.A2, complement))]

3. Identify SNPs that require recoding in the target (to ensure the coding allele in the target data is the effective allele in the base summary statistic)

# identify SNPs that need recoding
info.recode <- subset(info, A1 == B.A2 & A2 == B.A1)
# Update the recode SNPs
recode.snps <- bim$SNP %in% info.recode$SNP
tmp <- bim[recode.snps,]$B.A1
bim[recode.snps,]$B.A1 <- bim[recode.snps,]$B.A2
bim[recode.snps,]$B.A2 <- tmp

# identify SNPs that need recoding & complement
info.crecode <- subset(info, A1 == C.A2 & A2 == C.A1)
# Update the recode + strand flip SNPs
com.snps <- bim$SNP %in% info.crecode$SNP
tmp <- bim[com.snps,]$B.A1
bim[com.snps,]$B.A1 <- as.character(sapply(bim[com.snps,]$B.A2, complement))
bim[com.snps,]$B.A2 <- as.character(sapply(tmp, complement))

# Output updated bim file
write.table(
    bim[,c("SNP", "B.A1")],
    "EUR.a1",
    quote = F,
    row.names = F,
    col.names = F,
    sep="\t"
)
# identify SNPs that need recoding
recode.snps <- info[B.A1==A2 & B.A2==A1, SNP]
# Update the bim file
bim[SNP %in% recode.snps, c("B.A1", "B.A2") :=
        list(B.A2, B.A1)]

# identify SNPs that need recoding & complement
com.recode <- info[sapply(B.A1, complement) == A2 &
                    sapply(B.A2, complement) == A1, SNP]
# Now update the bim file
bim[SNP %in% com.recode, c("B.A1", "B.A2") :=
        list(sapply(B.A2, complement),
            sapply(B.A1, complement))]
# Write the updated bim file
fwrite(bim[,c("SNP", "B.A1")], "EUR.a1", col.names=F, sep="\t")

4. Identify SNPs that have different allele in base and target (usually due to difference in genome build or Indel)

mismatch <-
    bim$SNP[!(bim$SNP %in% info.match$SNP |
                bim$SNP %in% info.complement$SNP | 
                bim$SNP %in% info.recode$SNP |
                bim$SNP %in% info.crecode$SNP)]
write.table(
    mismatch,
    "EUR.mismatch",
    quote = F,
    row.names = F,
    col.names = F
)
q() # exit R
mismatch <- bim[!(SNP %in% info.match |
                    SNP %in% com.snps |
                    SNP %in% recode.snps |
                    SNP %in% com.recode), SNP]
write.table(mismatch, "EUR.mismatch", quote=F, row.names=F, col.names=F)
q() # exit R

We can then use the EUR.a1 file to update the A1 alleles

# Duplicate SNPs

Make sure to remove any duplicate SNPs in your target data (these target data were simulated and so include no duplicated SNPs).

# Sex chromosomes

Sometimes sample mislabelling can occur, which may lead to invalid results. One indication of a mislabelled sample is a difference between reported sex and that indicated by the sex chromosomes. While this may be due to a difference in sex and gender identity, it could also reflect mislabeling of samples or misreporting and, thus, individuals in which there is a mismatch between biological and reported sex are typically removed. A sex check can be performed in PLINK, in which individuals are called as females if their X chromosome homozygosity estimate (F statistic) is < 0.2 and as males if the estimate is > 0.8.

Before performing a sex check, pruning should be performed (see here). A sex check can then easily be conducted using plink

plink \
    --bfile EUR \
    --extract EUR.QC.prune.in \
    --keep EUR.valid.sample \
    --check-sex \
    --out EUR.QC

This will generate a file called EUR.QC.sexcheck containing the F-statistics for each individual. Individuals are typically called as being biologically male if the F-statistic is > 0.8 and biologically female if F < 0.2.

# Read in file
valid <- read.table("EUR.valid.sample", header=T)
dat <- read.table("EUR.QC.sexcheck", header=T)
valid <- subset(dat, STATUS=="OK" & FID %in% valid$FID)
write.table(valid[,c("FID", "IID")], "EUR.QC.valid", row.names=F, col.names=F, sep="\t", quote=F) 
q() # exit R
library(data.table)
# Read in file
valid <- fread("EUR.valid.sample")
dat <- fread("EUR.QC.sexcheck")[FID%in%valid$FID]
fwrite(dat[STATUS=="OK",c("FID","IID")], "EUR.QC.valid", sep="\t") 
q() # exit R
How many samples were excluded due mismatched Sex information?
  • 4 samples were excluded

# Sample overlap

Since the target data were simulated there are no overlapping samples between the base and target data here (see the relevant section of the paper for discussion of the importance of avoiding sample overlap).

# Relatedness

Closely related individuals in the target data may lead to overfitted results, limiting the generalisability of the results.

Before calculating the relatedness, pruning should be performed (see here). Individuals that have a first or second degree relative in the sample (\(\hat{\pi} > 0.125\)) can be removed with the following command:

plink \
    --bfile EUR \
    --extract EUR.QC.prune.in \
    --keep EUR.QC.valid \
    --rel-cutoff 0.125 \
    --out EUR.QC
How many related samples were excluded?
  • 0 samples were excluded

Note

A greedy algorithm is used to remove closely related individuals in a way that optimizes the size of the sample retained. However, the algorithm is dependent on the random seed used, which can generate different results. Therefore, to reproduce the same result, you will need to specify the same random seed.

PLINK's algorithm for removing related individuals does not account for the phenotype under study. To minimize the removal of cases of a disease, the following algorithm can be used instead: GreedyRelated.

Generate final QC'ed target data file

After performing the full analysis, you can generate a QC'ed data set with the following command:

plink \
    --bfile EUR \
    --make-bed \
    --keep EUR.QC.rel.id \
    --out EUR.QC \
    --extract EUR.QC.snplist \
    --exclude EUR.mismatch \
    --a1-allele EUR.a1

Each of the parameters corresponds to the following

Parameter Value Description
bfile EUR Informs plink that the input genotype files should have a prefix of EUR
keep EUR.QC.rel.id Informs plink that we only want to keep samples in EUR.QC.rel.id
extract EUR.QC.snplist Informs plink that we only want to use SNPs in EUR.QC.snplist in the analysis
exclude EUR.mismatch Informs plink that we wish to remove any SNPs in EUR.mismatch
a1-allele EUR.a1 Fix all A1 alleles to those specified in EUR.a1
out EUR.QC Informs plink that all output should have a prefix of EUR.QC