--- title: "treeDA vignette" author: "Julia Fukuyama" date: "`r Sys.Date()`" output: rmarkdown::html_document: toc: true toc_float: true theme: lumen keep_md: true vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{treeDA vignette} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE} library(knitr) opts_chunk$set(fig.width = 8, fig.height = 4) ``` ## Package overview Here we will describe how to use the treeDA package. The package provides functions to perform sparse discriminant analysis informed by the tree. The method was developed for microbiome data, but it could in principle be applied to any data with the same tree structure. The idea behind the package is that when we have predictor variables which are structured according to a tree, the mean values of the predictor variables at each node in the tree are natural predictor variables, and can be used in addition to the initial predictors defined at the leaves. For microbiome data, this means using both the abundances of the initial set of taxa as well as the abundances "pseudo-taxa", which correspond to nodes in the tree and are the agglomeration of all the taxa which descend from that node. Without regularization, using both sets of predictors would yield an ill-defined problem because the node predictors are linear combinations of the leaf predictors. However, when we add regularization, the problem becomes well-posed and we can obtain a unique solution. Intuitively, the regularization allows us to incorporate the intuition that a solution where one node is selected is more parsimonious than one in which all the leaves descending from that node are selected. This package is based on the implementation of sparse discriminant analysis implemented in the [`sparseLDA`](https://CRAN.R-project.org/package=sparseLDA) package. The main function, `treeda`, creates the node and leaf predictors, performs sparse discriminant analysis on the combination of node and leaf predictors, and then translates the results back in terms of leaf predictors only. The package also includes functions to perform cross-validation and plotting, which will be demonstrated in this vignette. ## Setup and data description Our first step is to load the required packages and data. We will illustrate the method on an antibiotic dataset (`AntibioticPhyloseq`) provided by the package `adaptiveGPCA`. Note that no other elements of the `adaptiveGPCA` package are used in this tutorial. The antibiotic dataset consists of measurements taken from three subjects before, during, and after taking each of two courses of an antibiotic. The major groupings in the data are by subject (called `ind` in the phyloseq object) and by the the antibiotic condition. The antibiotic treatment is discretized into abx/no abx in a variable called `type`, where abx corresponds to samples taken when the subject was taking the antibiotic and the week following, and no abx corresponds to all the other samples. ```{r packages} library(treeDA) library(ggplot2) library(phyloseq) library(adaptiveGPCA) library(Matrix) data(AntibioticPhyloseq) theme_set(theme_bw()) ``` ## Model fitting The main function in the package is called `treeda`. It takes a response vector giving the classes to be separated, a matrix of predictor variables which are related to each other by a tree, the tree which describes the relationships between the predictor variables, and the sparsity (p, the number of predictors to use). In the antibiotic dataset, we have several potential discriminatory variables. 
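To make the idea of a node "pseudo-taxon" concrete, here is a minimal sketch that builds one such predictor by hand: it picks an internal node of the tree and sums the abundances of all the leaves that descend from it (the agglomeration described above). This is purely illustrative and uses `ape::extract.clade`, which is not part of treeDA; `treeda` constructs the node predictors for every node internally (and, as discussed below, can additionally center and scale them), so the exact scaling it uses may differ.

```{r node-predictor-sketch, eval = FALSE}
## Illustration only: construct the predictor for a single internal node by hand
library(ape)
tree = phy_tree(AntibioticPhyloseq)
otus = as(otu_table(AntibioticPhyloseq), "matrix")
if (taxa_are_rows(AntibioticPhyloseq)) otus = t(otus)  # samples x taxa
node = length(tree$tip.label) + 2                # an arbitrary internal node
clade.tips = extract.clade(tree, node)$tip.label # leaves descending from it
## the node's pseudo-taxon: agglomerated abundance of its descendant leaves
node.predictor = rowSums(otus[, clade.tips, drop = FALSE])
head(node.predictor)
```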
## Model fitting

The main function in the package is called `treeda`. It takes a response vector giving the classes to be separated, a matrix of predictor variables which are related to each other by a tree, the tree which describes the relationships between the predictor variables, and the sparsity (p, the number of predictors to use). In the antibiotic dataset, we have several potential discriminatory variables. One of these describes whether the sample was taken while the subject was taking antibiotics or immediately afterward, and we can try to find taxa which discriminate between these two groups using the following command:

```{r treeda-type}
out.treeda = treeda(response = sample_data(AntibioticPhyloseq)$type,
                    predictors = otu_table(AntibioticPhyloseq),
                    tree = phy_tree(AntibioticPhyloseq),
                    p = 15)
```

## Model inspection

Here the output of the model is stored in an object called `out.treeda`. The print function will give an overview of the fitted model, including the number of predictors used and the confusion matrix for the training data.

```{r treeda-type-print}
out.treeda
```

From this, we see that 15 predictors were used (since this was what we specified in the initial call to the function). These predictors potentially include nodes in the tree (corresponding to taxonomic clades) and leaves on the tree (corresponding to individual species). The combination of nodes and leaves can be written purely in terms of the leaves (or species, or OTUs), in which case the model is using 903 of the leaves. This indicates that some of the nodes which were selected as predictive were quite deep in the tree and corresponded to large groups of taxa. Finally, the confusion matrix shows us how well the model does on the training data: we see that a total of 16 cases were classified incorrectly, split approximately evenly between cases which were actually from the abx condition and those which were actually from the no abx condition.

The object containing the output from the fit also contains other information. These are:

- `means`: The mean value of each predictor. This is only included if the call to `treeda` included `center = TRUE`, in which case the means are stored so that new data can be centered using the mean values from the training data.
- `sds`: The standard deviation of each predictor. As with the means, this is only included if the call to `treeda` included `scale = TRUE`, in which case the standard deviations are stored so that new data can be scaled using the standard deviations from the training data.
- `leafCoefficients`: A representation of the discriminating axis using only the leaves. This is a list containing `beta`, which are the coefficients, and `intercept`, which is the constant term.
- `input`: A list containing the response, predictors, and tree used to fit the model.
- `nPredictors`: The number of predictors (in the node + leaf space) used in the model.
- `nLeafPredictors`: The number of predictors in the leaf space used in the model.
- `sda`: The sda object used in fitting the model.
- `class.names`: The names of the classes to be discriminated between.
- `projections`: The projections of the observations on the discriminating axes.
- `classProperties`: The prior probabilities, mean in the discriminating space, and variance in the discriminating space of the classes.
- `predictedClasses`: Predicted classes for each observation (used in the short example following this list).
- `rss`: Residual sum of squares: the sum of squared distances between each observation and its class mean in the discriminating space.
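As a quick illustration of how these components can be used, the training confusion matrix printed above can be re-tabulated directly from `predictedClasses` with base R:

```{r confusion-from-predicted, eval = FALSE}
## Tabulate predicted against observed classes on the training data;
## this carries the same information as the confusion matrix printed above
table(observed = sample_data(AntibioticPhyloseq)$type,
      predicted = out.treeda$predictedClasses)
```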
## Sample plotting

Once we have fit the model, we can look at the samples projected onto the discriminating axis. These projections are found in `out.treeda$projections`, and we can see them plotted for the antibiotic data below. In the figure below we also separate out the samples by individual to see whether the model works better for some individuals than others. We see that positive scores along the discriminating axis correspond to the no abx condition, and that there is some difference between the individuals but that the quality of the model is approximately the same across the three subjects.

```{r treeda-type-sample-plot}
ggplot(data.frame(sample_data(AntibioticPhyloseq),
                  projections = out.treeda$projections)) +
    geom_point(aes(x = ind, y = projections, color = type))
```

## Coefficient plotting

We can also look at the coefficient vector describing the discriminating axis using the `plot_coefficients` function. This gives a plot of the tree with the leaf coefficients aligned underneath.

```{r treeda-type-coef-plot}
plot_coefficients(out.treeda)
```

For comparison, we can look at the results when we try to discriminate between individuals instead of between the abx/no abx conditions. We try this with the same amount of sparsity, p = 15.

```{r treeda-ind}
out.treeda.ind = treeda(response = sample_data(AntibioticPhyloseq)$ind,
                        predictors = otu_table(AntibioticPhyloseq),
                        tree = phy_tree(AntibioticPhyloseq),
                        p = 15)
out.treeda.ind
```

In this case, since we have three classes we obtain two discriminating axes, each of which uses 15 node or leaf predictors for a total of 30 predictors. This corresponds to only 85 leaves on the tree, indicating that the nodes which were chosen corresponded to individual leaves or to much smaller clades than when our aim was to discriminate between the abx and no abx conditions. We can see this more clearly when we look at the coefficient plot, where there are many more singleton leaves with non-zero coefficients than we saw in the corresponding plot for the abx/no abx model. Note that this model contains two discriminating axes because we have three classes, while the abx/no abx model had only one discriminating axis because there were two classes.

```{r treeda-ind-coef-plot}
plot_coefficients(out.treeda.ind)
```

## Cross validation

We would often like to choose the sparsity level automatically instead of manually. A common way of doing this is by cross validation, which we have implemented in the function `treedacv`. It takes most of the same arguments as `treeda`: a vector containing the response, or the classes for each of the observations, a matrix of predictors which are related to each other by a tree, and the tree. In addition, the number of folds for the cross validation needs to be specified (the `folds` argument), along with a vector giving the levels of sparsity to be compared by cross validation (the `pvec` argument). The `folds` argument can be given either as a single number, in which case the observations will be partitioned into that number of folds, or as a vector assigning each observation to a fold. In the latter case, the vector should have length equal to the number of observations, and its elements should be integers between 1 and the desired number of folds.
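As a sketch of the vector form of `folds` (not used below, where we let `treedacv` construct the partition from `folds = 4`), one could, for example, assign all samples from the same subject to the same fold:

```{r subject-folds-sketch, eval = FALSE}
## Vector form of `folds`: one integer per sample, here grouping by subject
## (three subjects in this dataset, so three folds)
subject.folds = as.integer(factor(sample_data(AntibioticPhyloseq)$ind))
table(subject.folds)
```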
Here we are using four-fold cross validation, discriminating between the abx/no abx conditions, and comparing levels of sparsity between 1 and 15. When we print the output from `treedacv`, it tells us both which value of p (amount of sparsity) corresponded to the minimum CV error, and what the smallest value of p was which was within one standard error of the minimum CV error. (The intuition behind using this value of p instead of that with the minimum CV error is that we would like the most parsimonious model which is statistically indistinguishable from that with the minimum CV error.) For us, the minimum CV error is at 11, but if we were following the one standard error rule we would use 7.

```{r treeda-ind-cv}
set.seed(0)
out.treedacv = treedacv(response = sample_data(AntibioticPhyloseq)$type,
                        predictors = otu_table(AntibioticPhyloseq),
                        tree = phy_tree(AntibioticPhyloseq),
                        folds = 4,
                        pvec = 1:15)
out.treedacv
```

The results from the cross validation are stored in `out.treedacv$loss.df`. This data frame contains the CV error for each fold, the mean CV error, and the standard error of the CV error for each value of p. We can use this data frame to plot the CV error as a function of the sparsity, or we can use the plotting function defined by the package, as shown below.

```{r treeda-ind-plot-cv, fig.width = 5, fig.height = 3}
plot(out.treedacv)
```

This plot confirms what we said earlier: 11 predictors corresponds to the minimum cross validation error, and 7 predictors corresponds to the sparsest solution which is within one standard error of the minimum cross validation error. We can then fit the model with 11 predictors to all the data and look at the plot of the coefficients along the discriminating axis.

```{r treeda-ind-fit-2}
out.treeda.11 = treeda(response = sample_data(AntibioticPhyloseq)$type,
                       predictors = otu_table(AntibioticPhyloseq),
                       tree = phy_tree(AntibioticPhyloseq),
                       p = 11)
out.treeda.11
```

```{r treeda-ind-plot-coef}
plot_coefficients(out.treeda.11)
```

From the coefficient plot above, we might be interested in the relatively large group of taxa with the largest positive coefficients. Since the samples in the `abx` condition have positive scores on the discriminating axis, taxa with positive coefficients are over-represented in the `abx` condition. We can find out what these taxa are by examining the leaf coefficient vector. We first convert the Matrix object containing the leaf coefficients into a vector, then find all the taxa which have the maximum positive coefficient, and then print out the unique elements of the taxonomy table corresponding to those taxa. We see that this is a group of 74 Lachnospiraceae. They are mostly not annotated beyond the family level, but one is annotated as being from the genus *Moryella*.

```{r taxa-examine}
coef = as.vector(out.treeda.11$leafCoefficients$beta)
taxa.max = which(coef == max(coef))
length(taxa.max)
unique(tax_table(AntibioticPhyloseq)[taxa.max,])
```
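Finally, because the training means and standard deviations are stored in the fitted object, the model can also be used to classify new samples measured on the same taxa. The sketch below assumes that treeDA provides a `predict` method for `treeda` objects with a `newdata` argument (see `?predict.treeda`); for illustration it simply feeds the training OTU table back in, but in practice `newdata` would hold held-out samples.

```{r predict-sketch, eval = FALSE}
## Hedged sketch: assumes predict() accepts a treeda fit and a newdata matrix
## containing the same taxa as the training predictors
pred = predict(out.treeda.11, newdata = otu_table(AntibioticPhyloseq))
str(pred)  # inspect the returned object rather than assuming its structure
```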