Superpc for R: Tutorial

Eric Bair, Robert Tibshirani
This document describes the installation and use of the superpc package for R.
1. Introduction

The package superpc provides R functions for carrying out prediction by "supervised principal components". It can predict a censored survival outcome, or a quantitative outcome. It is especially useful for correlating patient survival or other quantitative parameters with gene expression data.

The methodology is described in the papers:
Semi-supervised methods for predicting patient survival from gene expression data (Bair, Tibshirani), PLoS Biology
Prediction by supervised principal components (Bair, Hastie, Paul, Tibshirani) Stanford tech report

2. The Basic idea

Supervised principal components is a generalization of principal components regression. The first (or first few) principal components are the linear combinations of the features that capture the directions of largest variation in a dataset. But these directions may or may not be related to an outcome variable of interest. To find linear combinations that are related to an outcome variable, we compute univariate scores for each gene and then retain only those features whose score exceeds a threshold. A principal components analysis is carried out using only the data from these selected features.

Finally, these "supervised principal components" are used in a regression model to predict the outcome. To summarize, the steps are:

  1. Compute (univariate) standard regression coefficients for each feature
  2. Form a reduced data matrix consisting of only those features whose univariate coefficient exceeds a threshold theta in absolute value (theta is estimated by cross-validation)
  3. Compute the first (or first few) principal components of the reduced data matrix
  4. Use these principal component(s) in a regression model to predict the outcome

This idea can be used in standard regression problems with a quantitative outcome, and also in generalized regression problems such as survival analysis. In the latter problem, the regression coefficients in step (1) are obtained from a proportional hazards model. The superpc R package handles these two cases: standard regression and survival data.
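For a quantitative outcome, the four steps above can be sketched in a few lines of plain R. This is a minimal illustration only, not the package's implementation; x holds features in rows and samples in columns (as in the sample session below), and the threshold theta here is just a hypothetical fixed value rather than one chosen by cross-validation.

# Step 1: univariate regression score (t-statistic) for each feature
scores <- apply(x, 1, function(xi) summary(lm(y ~ xi))$coefficients[2, "t value"])

# Step 2: reduced data matrix of features whose |score| exceeds theta
theta <- 2
x.red <- x[abs(scores) > theta, , drop=FALSE]

# Step 3: first principal component of the reduced matrix
# (with samples in columns, the first right singular vector gives one score per sample)
pc1 <- svd(x.red)$v[, 1]

# Step 4: regress the outcome on the supervised principal component
summary(lm(y ~ pc1))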

There is one more important point: the features (e.g. genes) which are important in the prediction are not necessarily the ones that passed the screen in step 2. Other features may have as high a correlation with the supervised PC predictor. So we compute an importance score for each feature equal to its correlation with the supervised PC predictor. A reduced predictor is formed by soft-thresholding the importance scores, and using these shrunken scores as weights. The soft-thresholding sets the weight of some features to zero, hence throwing them out of the model. The amount of shrinkage is determined by cross-validation. The reduced predictor often performs as well as or better than the supervised PC predictor, and is more interpretable.
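Soft-thresholding itself is simple to state: each score is shrunk toward zero by a fixed amount delta, and any score smaller than delta in absolute value is set exactly to zero. A sketch in plain R (delta here is a hypothetical shrinkage amount; in the package it is chosen by cross-validation):

soft.threshold <- function(scores, delta) sign(scores) * pmax(abs(scores) - delta, 0)

soft.threshold(c(0.9, -0.4, 0.1, -0.75), delta=0.3)
# gives 0.60 -0.10 0.00 -0.45; the third feature gets weight zero and drops out of the model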

3. Installing superpc

  1. You first need to install a recent version of the R statistical package. This is free, and can be found at The R Project.

    Follow the instructions. Click on http://cran.r-project.org/mirrors.html under Download. Pick a mirror closest to you. You probably want a pre-compiled version. For Windows, choose the Windows (95 or later) link, select base, and download SetupR.exe. Installation takes 5-10 minutes.

    R is a great package, and is worth knowing about!

  2. Download the appropriate Unix/Linux version or Windows version for your platform from the superpc distribution site. Often, browsers offer you a choice of opening the file or saving it. Elect to save it and remember the name of the folder where you saved it.
  3. For Unix/Linux type
    R CMD INSTALL -l mylib superpc_1.00.tar.gz
    In Windows, click on the R icon on your desktop, pull down the Packages menu item and select Install packages from local zip file. A file chooser dialog will allow you to select the file superpc_1.00.zip that you just saved in the previous step.
The above concludes the installation process and needs to be done just once.

4. Using superpc

For Unix/Linux, start R and type

library(superpc,lib.loc="mylib")

In Windows, while in R pull down the Packages menu item and select Load package. Select superpc. In Windows you will find it helpful to go to the Misc menu and turn off buffered output. This forces R output to be immediately written to the console.

You are now ready to use superpc.

5. A Sample Session

Please note that in what follows, code section lines starting with "#" are comments.



# generate some synthetic survival data.
#
# there are 1000 features and 100 samples;
#  the outcome is highly correlated with the first principal component of the
#  first 80 features


set.seed(464)


x<-matrix(rnorm(1000*100),ncol=100)
v1<- svd(x[1:80,])$v[,1]

y<-2+5*v1+ .05*rnorm(100)

xtest<-x
ytest<-2+5*v1+ .05*rnorm(100)
censoring.status<- sample(c(rep(1,80),rep(0,20)))
censoring.status.test<- sample(c(rep(1,80),rep(0,20)))

featurenames <- paste("feature",as.character(1:1000),sep="")



# create train and test data objects. censoring.status=1 means the event occurred;
#  censoring.status=0 means censored

data<-list(x=x,y=y, censoring.status= censoring.status, featurenames=featurenames)
data.test<-list(x=xtest,y=ytest, censoring.status=censoring.status.test, featurenames= featurenames)


# train  the model. This step just computes the  scores for each feature

train.obj<- superpc.train(data, type="survival")

# note for regression (non-survival) data, we leave the component "censoring.status"
# out of the data object, and call superpc.train with type="regression".
# otherwise the superpc commands are all the same
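# for example, a regression version of this step might look like the
# following sketch (not run in this session; data.reg is a hypothetical
# object built from x, a quantitative outcome y, and featurenames only):
#
# data.reg <- list(x=x, y=y, featurenames=featurenames)
# train.reg <- superpc.train(data.reg, type="regression")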



# cross-validate the model

cv.obj<-superpc.cv(train.obj, data)

#plot the cross-validation curves. From this plot we see that the 1st 
# principal component is significant and the best threshold  is around 0.7

superpc.plotcv(cv.obj)


#See pdf version of the cv plot 

# here we have the luxury of  test data, so we can compute the  likelihood ratio statistic
# over the test data and plot them. We see that the threshold of 0.7
# works pretty well
#  

lrtest.obj<-superpc.lrtest.curv(train.obj, data,data.test)

superpc.plot.lrtest(lrtest.obj)

#See pdf version of the lrtest plot


# now we derive the predictor of survival for the test data,
# and then use it
# as the predictor in a Cox model. We see that the 1st supervised PC is
# highly significant; the next two are not

fit.cts<- superpc.predict(train.obj, data, data.test, threshold=0.7, n.components=3, prediction.type="continuous")


superpc.fit.to.outcome(train.obj, data.test, fit.cts$v.pred)


# sometimes a discrete (categorical) predictor is attractive.
# Here we form two groups by cutting the predictor at its median
#and then plot Kaplan-Meier curves for the two groups
  
fit.groups<- superpc.predict(train.obj, data, data.test, threshold=0.7, n.components=1, prediction.type="discrete")

superpc.fit.to.outcome(train.obj, data.test, fit.groups$v.pred)

plot(survfit(Surv(data.test$y,data.test$censoring.status)~fit.groups$v.pred), col=2:3, xlab="time", ylab="Prob survival")

#See pdf version of the survival plot


# Finally, we look for a predictor of survival using a small number of
# genes (rather than all 1000 genes). We do this by computing an importance
# score for each gene equal to its correlation with the supervised PC predictor.
# Then we soft-threshold the importance scores, and use the shrunken
# scores as gene weights to form a reduced predictor. Cross-validation
# gives us an estimate of the best amount to shrink and an idea of
# how well the shrunken predictor works.


fit.red<- superpc.predict.red(train.obj, data, data.test, threshold=0.7)

fit.redcv<- superpc.predict.red.cv(fit.red, cv.obj,  data,  threshold=0.7)

superpc.plotred.lrtest(fit.redcv)

#See pdf version of this plot


# Lastly, we list the significant genes, in order of decreasing importance score

superpc.listfeatures(data.test, train.obj, fit.red, 1, shrinkage=0.17)



A note on interpretation:

The signs of the scores (latent factors) v.pred returned by superpc.predict are chosen so that the regression of the outcome on each factor has a positive coefficient. This is just a convention to aid in interpretation.


For survival data, this means
Higher score => higher risk (worse survival)

For regression data, this means
Higher score => higher mean of the outcome

How about the direction of effect for each individual feature (gene)? The function superpc.listfeatures reports an importance score equal to the correlation between each feature and the latent factor.

Hence for survival data,

Importance score positive means
increase in value of feature=> higher risk (worse survival)

For regression data,

Importance score positive means
increase in value of feature => higher mean of the outcome

The reverse is true for a negative importance score.