--- title: "adaptiveGPCA vignette" author: "Julia Fukuyama" date: "`r Sys.Date()`" output: rmarkdown::html_document: toc: true toc_depth: 2 toc_float: true theme: lumen keep_md: true vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{adaptvieGPCA vignette} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE} library(knitr) opts_chunk$set(fig.width = 6, fig.height = 4) ``` ## Overview ### Adaptive gPCA Adaptive gPCA ([Fukuyama, J. (2017)](https://arxiv.org/abs/1702.00501)) is a flexible method for incorporating side information about the variables into principal components analysis. The method has a single parameter, $r$, which controls how much weight to give to the side information: $r = 0$ corresponds to standard PCA (the side information is not used at all), and $r = 1$ corresponds to the maximum amount of weight on the side information. The output from adaptive gPCA is analogous to that given by PCA: we get representations of the samples and the variables, which can be visualized as either independently or as a biplot. The difference with standard PCA is that, assuming $r > 0$, the variable loadings are regularized so that similar variables are encouraged to have similar loadings on the principal axes. ### Package functionality There are two main ways of using this package. The function `adaptivegpca` will choose the value of $r$ automatically, while the function `gpcaFullFamily` produces analogous output for the full range of values of $r$, which is then chosen manually. We will illustrate both methods in this vignette. Although adaptive gPCA can be applied more generally, the method was developed to incorporate phylogenetic structure in microbiome data analysis, and so there are also functions to interface with [phyloseq](http://bioconductor.org/packages/release/bioc/html/phyloseq.html) objects and functions to visualize adaptive gPCA output along with a phylogenetic tree. ### Data The data set we will use to illustrate the functionality of this package is an antibiotic time course study[^1]. In this experiment, three subjects were given two courses of antibiotics, and the abundances of bacteria in the gut microbiome were measured before, during, and after each course of antibiotics. The data is stored in a phyloseq object called `AntibioticPhyloseq`, which contains variance-stabilized and centered abundances on the `otu_table` slot, a phylogenetic tree describing the relationships between the bacterial species measured in the `phy_tree` slot, and information about the samples in the `sample_data` slot. We want a low-dimensional representation of the samples which takes into account the phylogenetic similarities between the bacteria, which is what adaptive gPCA will provide for us. [^1]: Dethlefsen, L., and D. A. Relman. "Incomplete recovery and individualized responses of the human distal gut microbiota to repeated antibiotic perturbation." Proc Natl Acad Sci USA 108. Suppl 1 (2010): 4554-4561. ## Fitting an agpca model Now that we have described the data we will be using, we will show how to use the package. The first step is to load the required packages and data. ```{r} library(adaptiveGPCA) library(ggplot2) library(phyloseq) data(AntibioticPhyloseq) theme_set(theme_bw()) ``` Next, we create the inputs required for the `adaptivegpca` function. If we have a `phyloseq` object containing taxon abundances and a phylogenetic tree, the function `processPhyloseq` will create a list with the following elements: - A centered data matrix, stored as `X` in the output. - A similarity matrix based on the phylogeny, stored as `Q` in the output. - A vector of weights, stored as `weights`. The weights are only relevant for correspondence analysis, so if the function is invoked with `ca = FALSE` (the default), the weights vector will be a vector containing only 1's and can be ignored. ```{r} pp = processPhyloseq(AntibioticPhyloseq) ``` Next, we pass our data matrix and our similarity matrix to the `adaptivegpca` function. The arguments to `adaptivegpca` are - `X`: A centered data matrix. - `Q`: A symmetric, positive definite matrix describing the similarities between the variables. - `k`: The number of axes to keep, defaults to 2. - `weights`: A vector with sample weights, defaults to a vector of all 1's (equal weights on each sample). The matrices `X` and `Q` will be generated from a phyloseq object using the OTU table and the phylogenetic tree with the `processPhyloseq` function. If you have another kind of data, you can generate these matrices yourself. Below we use the output from `processPhyloseq` to fit an adaptive gPCA model. ```{r} out.agpca = adaptivegpca(pp$X, pp$Q, k = 2) ``` The output from the `adaptivegpca` function is stored in an object of class `adaptivegpca`, which has some generic printing and plotting functions associated with it. As such, if we just print out the object, it will give us some basic information about the results, namely the number of axes, the value of $r$ chosen, and the fraction of the variance explained by the top axes. ```{r} out.agpca ``` The output is a list with several elements. These are: - `U`: The sample scores on the principal axes. - `V`: The variable loadings before rotation. These are not used for plotting --- they need to be rotated into the space of the samples for the biplot representation. - `QV`: The variable loadings rotated to be in the same space as the sample scores. These are the variable loadings that should be plotted for a biplot representation of the samples and the variables. - `vars`: The variances along the principal axes. - `r`: The value of $r$ chosen by adaptivegpca, corresponding to how much regularization to perform according to the structure of the variables. $r = 0$ is standard PCA (variable structure is not considered at all), and $r = 1$ is the maximal amount of regularization. - `evals`: The eigenvalues of the modified similarity matrix. This is described in more detail in the paper, but it is an alternate way of understanding how much regularization is performed by the method. The larger the top eigenvalues are compared with the subsequent ones, the more regularization there is toward the top eigenvectors of the similarity matrix. ## Plotting the output We can also plot the output from `adaptivegpca` using a generic plotting function. By default, this will produce a scree plot, as shown below: ```{r} plot(out.agpca) ``` The plotting function has a `type` argument, which can be either `scree`, `samples`, or `variables`. `type = scree` will give a plot showing the fraction of the variance explained by each of the axes, `type = samples` will show the sample scores along the principal axes, and `type = variables` will show the variable loadings on the principal axes. The `axes` argument allows the user to specify which axes to plot, by default these are axes 1 and 2. ```{r} plot(out.agpca, type = "samples", axes = c(1,2)) plot(out.agpca, type = "variables", axes = c(1,2)) ``` Finally, the `inspectTaxonomy` function allows interactive inspection of the taxa or variables plot, assuming you started the analysis with a phyloseq object. This function will open a browser window with a representation of the phylogenetic tree and the taxon loadings on the adaptive gPCA axes. Taxa in this plot can be selected by "brushing" the plot (drawing a rectangular box), and the selected taxa will be highlighted on the phylogenetic tree and their taxonomic assignments will be printed below. An example call is given below: ```{r eval = FALSE} inspectTaxonomy(out.agpca, AntibioticPhyloseq) ``` The arguments to `inspectTaxonomy` are - `agpcafit`: The output from `adaptivegpca`. - `physeq`: A phyloseq object with a taxonomy table and a phylogenetic tree. Should be the same one used for for the adaptive gPCA fit. - `axes`: Which axes to plot. Defaults to the first two axes. - `br.length`: Should the tree include branch lengths? The trees tend to look better when the branches are all plotted as being the same length, so this defaults to `FALSE`. - `height`: The height of the plotting area in pixels. Defaults to 600. If you click the "done" button in the browser window, the app will close and the function will return a table describing the taxonomy, position along selected axes, and names of the selected taxa. ## Interactive choice of $r$ We can also choose the value of $r$ manually. In the previous section, we described how to use the `adaptivegpca` function, which chooses the value of $r$ automatically, but if we would like to see the results for different values of $r$ and to choose one manually, we can first use the `gpcaFullFamily` function to create a full set of models and then use the `visualizeFullFamily` function to visualize the biplots for each value of $r$. `gpcaFullFamily` takes the same arguments as `adaptivegpca`, along with some additional arguments describing the form of the output. The `visualizeFullFamily` function is a shiny gadget. Calling this function will open a browser window where you can visualize the data set for a range of values of $r$. Clicking "done" in this window will give as output an object of the same format as that given by the `adaptivegpca` function, the difference being that the value of $r$ was chosen manually instead of automatically. The `sample_data`/`sample_mapping` and `var_data`/`var_mapping` arguments allow you to customize the visualization. Without these arguments, the first two axes will be plotted for both the samples and the variables. If you include these arguments, you can customize the plots by providing aesthetic mappings for ggplot. Note that the code below is only included as an example and not evaluated in the vignette. ```{r, eval = FALSE} out.ff = gpcaFullFamily(pp$X, pp$Q, k = 2) out.agpca = visualizeFullFamily(out.ff, sample_data = sample_data(AntibioticPhyloseq), sample_mapping = aes(x = Axis1, y = Axis2, color = type), var_data = tax_table(AntibioticPhyloseq), var_mapping = aes(x = Axis1, y = Axis2, color = Phylum)) ``` In either case, we can plot the results. This time we will do it manually, using ggplot. The sample scores on the principal axes are located in `out.agpca$U` and the loadings of the variables on the principal axes are located in `out.agpca$QV`. To plot the samples, we make a data frame that includes `out.agpca$U` and the sample data, and to plot the taxa we make a data frame containing `out.agpca$QV` and the taxonomy table. ```{r fig.width=7} ggplot(data.frame(out.agpca$U, sample_data(AntibioticPhyloseq))) + geom_point(aes(x = Axis1, y = Axis2, color = type, shape = ind)) + xlab("Axis 1") + ylab("Axis 2") ``` ```{r fig.width=9} ggplot(data.frame(out.agpca$QV, tax_table(AntibioticPhyloseq))) + geom_point(aes(x = Axis1, y = Axis2, color = Phylum)) + xlab("Axis 1") + ylab("Axis 2") ``` ## Correspondence analysis Finally, note that the `processPhyloseq` function also has an argument `ca` (for correspondence analysis). This should be used with phyloseq objects containing raw counts, and it will process a phyloseq object so as to do an adaptive gPCA version of correspondence analysis (this entails transforming counts to relative abunadnces, computing sample weights based on the overall counts for the samples, and finally doing a weighted centering of the relative abundances).