The package superpc
provides R functions for carrying out
prediction by "supervised principal components". It can predict a censored
survival outcome, or a quantitative outcome. It is especially useful for
correlating patient survival or other quantitative parameters with
gene expression data.
The methodology is described in the papers:
Semi-supervised methods for predicting patient survival from gene expression papers
(Bair, Tibshirani) PLOS
Prediction by supervised principal components (Bair, Hastie, Paul, Tibshirani)
Stanford tech report
Finally, these "supervised principal components" are used in a regression model to predict the outcome. To summarize, the steps are:
This idea can be used in standard regression problems with a quantitative outcome, and also in generalized regression problems such as survival analysis. In the latter problem, the regression coefficients in step (1) are obtained from a proportional hazards model. The superpc R package handles these two cases: standard regression and survival data.
There is one more important point: the features (e.g genes) which important in the prediction are not necessarily the ones that passed the screen in step 2. There are other features that may have as high a correlation with the supervised PC predictor. So we compute an importance score for each feature equal to its correlation with the supervised PC predictor. A reduced predictor is formed by soft-thresholding the importance scores, and using these shrunken scores as weights. The soft-thresholding sets the weight of some features to zero, hence throwing them out of the model. The amount of shrinkage is determined by cross-validation. The reduced predictor often performs as well or better than than the supervised PC predictor, and is more interpretable.
superpc
Follow the instructions. Click on
http://cran.r-project.org/mirrors.html under
Download. Pick a mirror closest to you. You probably want a
pre-compiled version. For windows, choose the
Windows (95 or later) link, select base
, and download SetupR.exe
.
Installation takes 5-10 minutes.
R is a great package, and is worth knowing about!
R CMD INSTALL -l mylib superpc_1.00.tar.gz
In Windows, click on the R icon on your desktop, pull down the
Packages
menu item and select
Install packages from local zip file
. A file chooser dialog will
allow you to select the file
superpc_1.00.zip
that you just saved in the previous step.
superpc
For Unix/Linux, start R and type
library(superpc,lib.loc="mylib")
In Windows, while in R pull down the Packages
menu item and
select Load package
. Select superpc
. In Windows you will find
it helpful to go the Misc
menu and turn off buffered output. This
forces R output to be immediately written to the console.
You are now ready to use superpc
Please note that in what follows, code section lines starting with "#" are comments.
# generate some synthetic survival data. # # there are 1000 features and 60 samples # the outcome is highly correlated with the first principal component of the # first 80 features set.seed(464) x<-matrix(rnorm(1000*100),ncol=100) v1<- svd(x[1:80,])$v[,1] y<-2+5*v1+ .05*rnorm(100) xtest<-x ytest<-2+5*v1+ .05*rnorm(100) censoring.status<- sample(c(rep(1,80),rep(0,20))) censoring.status.test<- sample(c(rep(1,80),rep(0,20))) featurenames <- paste("feature",as.character(1:1000),sep="") # create train and test data objects. censoring.status=1 means the event occurred; # censoring.status=0 means censored data<-list(x=x,y=y, censoring.status= censoring.status, featurenames=featurenames) data.test<-list(x=xtest,y=ytest, censoring.status=censoring.status.test, featurenames= featurenames) # train the model. This step just computes the scores for each feature train.obj<- superpc.train(data, type="survival") # note for regression (non-survival) data, we leave the component "censoring.status" # out of the data object, and call superpc.train with type="regression". # otherwise the superpc commands are all the same # cross-validate the model cv.obj<-superpc.cv(train.obj, data) #plot the cross-validation curves. From this plot we see that the 1st # principal component is significant and the best threshold is around 0.7 superpc.plotcv(cv.obj) #See pdf version of the cv plot # here we have the luxury of test data, so we can compute the likelihood ratio statistic # over the test data and plot them. We see that the threshold of 0.7 # works pretty well # lrtest.obj<-superpc.lrtest.curv(train.obj, data,data.test) superpc.plot.lrtest(lrtest.obj) #See pdf version of the lrtest plot # now we derive the predictor of survival for the test data, # and then then use it # as the predictor in a Cox model . We see that the 1st supervised PC is # highly significant; the next two are not fit.cts<- superpc.predict(train.obj, data, data.test, threshold=0.7, n.components=3, prediction.type="continuous") superpc.fit.to.outcome(train.obj, data.test, fit.cts$v.pred) # sometimes a discrete (categorical) predictor is attractive. # Here we form two groups by cutting the predictor at its median #and then plot Kaplan-Meier curves for the two groups fit.groups<- superpc.predict(train.obj, data, data.test, threshold=0.7, n.components=1, prediction.type="discrete") superpc.fit.to.outcome(train.obj, data.test, fit.groups$v.pred) plot(survfit(Surv(data.test$y,data.test$censoring.status)~fit.groups$v.pred), col=2:3, xlab="time", ylab="Prob survival") #See pdf version of the survival plot # Finally, we look for a predictor of survival a small number of #genes (rather than all 1000 genes). We do this by computing an importance # score for each equal its correlation with the supervised PC predictor. # Then we soft threshold the importance scores, and use the shrunken # scores as gene weights to from a reduced predictor. Cross-validation # gives us an estimate of the best amount to shrink and an idea of #how well the shrunken predictor works. fit.red<- superpc.predict.red(train.obj, data, data.test, threshold=0.7) fit.redcv<- superpc.predict.red.cv(fit.red, cv.obj, data, threshold=0.7) superpc.plotred.lrtest(fit.redcv) #See pdf version of this plot # Finally we list the significant genes, in order of decreasing importance score superpc.listfeatures(data.test, train.obj, fit.red, 1, shrinkage=0.17)
v.pred
returned by superpc.predict
are chosen so that the
regression of the outcome on each factor has a positive coefficient.
This is just a convention to aid in interpretation.
For survival data, this means
Higher score => higher risk (worse survival)
For regression data, this means
Higher score => higher mean of the outcome
How about the direction of effect for each individual feature (gene)?
The function superpc.listfeatures
reports an importance
score equal to the correlation between each feature and the latent factor.
Hence for survival data,
Importance score positive means
increase in value of feature=> higher risk (worse survival)
For regression data,
Importance score positive means
increase in value of feature => higher mean of the outcome
The reverse are true for Importance score negative.