Background¶
lassosum
is one of the dedicated PRS programs which is an R
package that uses penalised regression (LASSO) in its approach to PRS calculation.
Installing lassosum¶
Note
The script used here is based on lassosum version 0.4.4
Note
For more details, please refer to lassosum's homepage
You can install lassosum
and its dependencies in R
with the following command:
install.packages(c("devtools","RcppArmadillo", "data.table", "Matrix"), dependencies=TRUE)
library(devtools)
install_github("tshmak/lassosum")
Required Data¶
Again, we assume that we have the following files (or you can download it from here):
File Name | Description |
---|---|
Height.QC.gz | The post-QCed summary statistic |
EUR.QC.bed | The genotype file after performing some basic filtering |
EUR.QC.bim | This file contains the SNPs that passed the basic filtering |
EUR.QC.fam | This file contains the samples that passed the basic filtering |
EUR.height | This file contains the phenotype of the samples |
EUR.cov | This file contains the covariates of the samples |
EUR.eigenvec | This file contains the PCs of the samples |
Running PRS analysis¶
We can run lassosum as follows:
library(lassosum)
# Prefer to work with data.table as it speeds up file reading
library(data.table)
library(methods)
library(magrittr)
# For multi-threading, you can use the parallel package and
# invoke cl which is then passed to lassosum.pipeline
library(parallel)
# This will invoke 2 threads.
cl <- makeCluster(2)
sum.stat <- "Height.QC.gz"
bfile <- "EUR.QC"
# Read in and process the covariates
covariate <- fread("EUR.cov")
pcs <- fread("EUR.eigenvec") %>%
setnames(., colnames(.), c("FID","IID", paste0("PC",1:6)))
# Need as.data.frame here as lassosum doesn't handle data.table
# covariates very well
cov <- merge(covariate, pcs)
# We will need the EUR.hg19 file provided by lassosum
# which are LD regions defined in Berisa and Pickrell (2015) for the European population and the hg19 genome.
ld.file <- "EUR.hg19"
# output prefix
prefix <- "EUR"
# Read in the target phenotype file
target.pheno <- fread("EUR.height")[,c("FID", "IID", "Height")]
# Read in the summary statistics
ss <- fread(sum.stat)
# Remove P-value = 0, which causes problem in the transformation
ss <- ss[!P == 0]
# Transform the P-values into correlation
cor <- p2cor(p = ss$P,
n = ss$N,
sign = log(ss$OR)
)
fam <- fread(paste0(bfile, ".fam"))
fam[,ID:=do.call(paste, c(.SD, sep=":")),.SDcols=c(1:2)]
# Run the lassosum pipeline
# The cluster parameter is used for multi-threading
# You can ignore that if you do not wish to perform multi-threaded processing
out <- lassosum.pipeline(
cor = cor,
chr = ss$CHR,
pos = ss$BP,
A1 = ss$A1,
A2 = ss$A2,
ref.bfile = bfile,
test.bfile = bfile,
LDblocks = ld.file,
cluster=cl
)
# Store the R2 results
target.res <- validate(out, pheno = as.data.frame(target.pheno), covar=as.data.frame(cov))
# Get the maximum R2
r2 <- max(target.res$validation.table$value)^2
How much phenotypic variation does the "best-fit" PRS explain?
0.2474679