Title: Temporal Tensor Decomposition, a Dimensionality Reduction Tool for Longitudinal Multivariate Data
Description: TEMPoral TEnsor Decomposition (TEMPTED) is a dimension reduction method for multivariate longitudinal data with varying temporal sampling. It formats the data into a temporal tensor and decomposes it into a summation of low-dimensional components, each consisting of a subject loading vector, a feature loading vector, and a continuous temporal loading function. These loadings provide a low-dimensional representation of subjects or samples and can be used to identify features associated with clusters of subjects or samples. TEMPTED allows subjects to have different temporal sampling, so time points do not need to be binned and missing time points do not need to be imputed.
Authors: Pixu Shi
Maintainer: Pixu Shi <[email protected]>
License: GPL-3
Version: 0.1.1
Built: 2024-11-18 04:54:24 UTC
Source: https://github.com/pixushi/tempted
This function aggregates features into "meta features" by calculating a weighted summation of the features, using the feature loading of each component as weights. It can also aggregate features across multiple components by ranking the features by a linear combination of the feature loadings from those components.
aggregate_feature(res_tempted, mean_svd = NULL, datlist, pct = 1, contrast = NULL)
res_tempted | Output of tempted.
mean_svd | Output of svd_centralize. Default is NULL.
datlist | Output of format_tempted.
pct | The percent of features to aggregate, with features ranked by the absolute value of the feature loading of each component. Default is 1, which means 100% of features are aggregated. Setting pct < 1 aggregates only the top-ranking features.
contrast | A matrix choosing how components are combined: each column is a contrast of length r, used to calculate a linear combination of the feature loadings of the r components.
A list of results:
The meta feature obtained by aggregating the observed temporal tensor, returned as metafeature_aggregate. It is a data.frame with four columns: "value" for the meta feature values, "subID" for the subject ID, "timepoint" for the time points, and "PC" indicating which component was used to construct the meta feature.
The meta feature obtained by aggregating the denoised temporal tensor. It has the same structure as metafeature_aggregate.
The contrast used to linearly combine the components from input.
A matrix of TRUE/FALSE indicating which features are aggregated in each component and contrast.
Shi P, Martino C, Han R, Janssen S, Buck G, Serrano M, Owzar K, Knight R, Shenhav L, Zhang AR. (2023) Time-Informed Dimensionality Reduction for Longitudinal Microbiome Studies. bioRxiv. doi: 10.1101/550749. https://www.biorxiv.org/content/10.1101/550749.
# Take a subset of the samples so the example runs faster
# Here we are taking samples from the odd months
sub_sample <- rownames(meta_table)[(meta_table$day_of_life%/%12)%%2==1]
count_table_sub <- count_table[sub_sample,]
processed_table_sub <- processed_table[sub_sample,]
meta_table_sub <- meta_table[sub_sample,]
datlist <- format_tempted(count_table_sub, meta_table_sub$day_of_life,
                          meta_table_sub$studyid, pseudo=0.5, transform="clr")
mean_svd <- svd_centralize(datlist, r=1)
res_tempted <- tempted(mean_svd$datlist, r=2, smooth=1e-5)
contrast <- matrix(c(1/2,1), 2, 1)
res_aggregate <- aggregate_feature(res_tempted, mean_svd, datlist,
                                   pct=1, contrast=contrast)
# plot the aggregated features
group <- unique(meta_table[, c("studyid", "delivery")])
plot_metafeature(res_aggregate$metafeature_aggregate, group, bws=30)
This function is used to calculate the kernel matrix for the RKHS regression that iteratively updates the temporal loading function.
bernoulli_kernel(x, y)
x, y | Two values between which the Bernoulli kernel is calculated.
The calculated kernel between x and y.
Han, R., Shi, P. and Zhang, A.R. (2023) Guaranteed functional tensor singular value decomposition. Journal of the American Statistical Association, pp.1-13. doi: 10.1080/01621459.2022.2153689.
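A minimal usage sketch (the vectorized, kernel-matrix behavior shown here is an assumption based on the description above, not an official example):
# hypothetical grid of time points
tgrid <- seq(0, 1, length.out = 5)
# kernel matrix between all pairs of time points, as used in the RKHS regression
K <- bernoulli_kernel(tgrid, tgrid)
dim(K)  # expected: 5 x 5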
OTU read count table from the ECAM data
count_table
A data.frame with rows representing samples, matched with the data.frame meta_table, and columns representing microbial features (i.e. OTUs). Each entry is a read count.
Bokulich, Nicholas A., et al. "Antibiotics, birth mode, and diet shape microbiome maturation during early life." Science translational medicine 8.343 (2016): 343ra82-343ra82.
This function estimates the subject loading of the testing data based on the feature and temporal loadings from the training data, so that the testing data are represented in the same low-dimensional space as the training data.
est_test_subject(datlist, res_tempted, mean_svd = NULL)
datlist | Testing data formatted into a datlist in the same fashion as the training data. The same transformation needs to be used for both training and testing data.
res_tempted | Output of tempted run on the training data.
mean_svd | Output of svd_centralize run on the training data. Default is NULL.
The estimated subject loading of the testing data, a subject by r matrix.
Shi P, Martino C, Han R, Janssen S, Buck G, Serrano M, Owzar K, Knight R, Shenhav L, Zhang AR. (2023) Time-Informed Dimensionality Reduction for Longitudinal Microbiome Studies. bioRxiv. doi: 10.1101/550749. https://www.biorxiv.org/content/10.1101/550749.
# Take a subset of the samples so the example runs faster
# Here we are taking samples from the odd months
sub_sample <- rownames(meta_table)[(meta_table$day_of_life%/%12)%%2==1]
count_table_sub <- count_table[sub_sample,]
processed_table_sub <- processed_table[sub_sample,]
meta_table_sub <- meta_table[sub_sample,]
# split the example data into training and testing
id_test <- meta_table_sub$studyid=="2"
count_train <- count_table_sub[!id_test,]
meta_train <- meta_table_sub[!id_test,]
count_test <- count_table_sub[id_test,]
meta_test <- meta_table_sub[id_test,]
# run tempted on training data
datlist_train <- format_tempted(count_train, meta_train$day_of_life,
                                meta_train$studyid, threshold=0.95,
                                pseudo=0.5, transform="clr")
mean_svd_train <- svd_centralize(datlist_train, r=1)
res_tempted_train <- tempted(mean_svd_train$datlist, r=2, smooth=1e-5)
# get the overlapping features
count_test <- count_test[,rownames(datlist_train[[1]])[-1]]
datlist_test <- format_tempted(count_test, meta_test$day_of_life,
                               meta_test$studyid, threshold=1,
                               pseudo=0.5, transform="clr")
# estimate the subject loading of the testing subject
sub_test <- est_test_subject(datlist_test, res_tempted_train, mean_svd_train)
# train logistic regression classifier on training subjects
metauni <- unique(meta_table_sub[,c("studyid", "delivery")])
rownames(metauni) <- metauni$studyid
Atrain <- as.data.frame(res_tempted_train$A_hat)
Atrain$delivery <- metauni[rownames(Atrain),"delivery"]=="Cesarean"
glm_train <- glm(delivery ~ PC1+PC2, data=Atrain, family=binomial(link="logit"))
summary(glm_train)
# predict the label of testing subject, whose true label is "Cesarean"
predict(glm_train, newdata=as.data.frame(sub_test), type="response")
This function applies a variety of transformations to the read counts and formats the sample by feature table and meta data into a data list that can be used as the input of tempted and svd_centralize. For data that are not read counts, or data that are not microbiome data, the user can apply their desired transformation before formatting the data into a list.
format_tempted(featuretable, timepoint, subjectID, threshold = 0.95, pseudo = NULL, transform = "clr")
featuretable | A sample by feature matrix.
timepoint | The time stamp of each sample, matched with the rows of featuretable.
subjectID | The subject ID of each sample, matched with the rows of featuretable.
threshold | A threshold for feature filtering for microbiome data. Features with zero value percentage > threshold will be excluded. Default is 0.95.
pseudo | A small number to add to all the counts before normalizing into proportions and log transformation. Default is 1/2 of the smallest non-zero value, determined separately for each sample. This pseudo count is only added when the chosen transformation requires it (e.g. transform="clr").
transform | The transformation applied to the data. Default is "clr", the centered log-ratio transformation commonly used for compositional count data; use "none" for data already preprocessed by the user.
A length n list of matrices that can be used as the input of tempted and svd_centralize. Each matrix represents a subject, with columns representing samples from this subject, the first row representing the sampling time points, and the following rows representing the feature values.
Examples can be found in tempted.
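A toy sketch of the returned structure (hypothetical values, not derived from the ECAM data): each matrix is one subject, row 1 holds the sampling time points, and the remaining rows hold feature values.
datlist_toy <- list(
  subject1 = rbind(timepoint = c(1, 5, 9),
                   featureA  = c(0.2, 0.4, 0.1),
                   featureB  = c(0.8, 0.6, 0.9)),
  subject2 = rbind(timepoint = c(2, 6),   # subjects may have different time points
                   featureA  = c(0.3, 0.5),
                   featureB  = c(0.7, 0.5))
)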
Metadata table from the ECAM data
meta_table
A data.frame with rows representing samples, matched with the data.frames count_table and processed_table, and three columns:
studyid | character denoting the subject ID of the infants.
delivery | character denoting the delivery mode of the infants.
day_of_life | character denoting the age of the infants, measured in days, when the microbiome sample was taken.
Bokulich, Nicholas A., et al. "Antibiotics, birth mode, and diet shape microbiome maturation during early life." Science translational medicine 8.343 (2016): 343ra82-343ra82.
This is a handy function to plot the smoothed mean and error bands for multiple features.
plot_feature_summary(feature_mat, time_vec, group_vec, coverage = 0.95, bws = NULL, nrow = 1)
feature_mat | A sample by feature matrix. Each feature will be plotted separately as a facet. The features can be original features, meta features, log ratios, or any variables of interest.
time_vec | A vector of time points matched to the rows of feature_mat.
group_vec | A factor variable indicating the group membership of samples, matched to the rows of feature_mat.
coverage | The coverage rate for the error band. Default is 0.95.
bws | The smoothness parameter for the smoothing lines and error bands. A larger value means a smoother line. Default is NULL, in which case the bandwidth is calculated automatically by the underlying smoothing function.
nrow | The number of rows in which to lay out the feature facets.
A ggplot2 object.
# plot the summary of selected features
feat.names <- c("OTU4447072", "OTU4467447")
proportion_table <- count_table/rowSums(count_table)
plot_feature_summary(proportion_table[,feat.names],
                     meta_table$day_of_life,
                     meta_table$delivery,
                     bws=30)
This function plots the smoothed mean and error band of meta features, grouped by a factor variable provided by the user.
plot_metafeature(metafeature, group, coverage = 0.95, bws = NULL, nrow = 1)
metafeature | The meta features to plot, e.g. the metafeature_aggregate output of aggregate_feature or the metafeature_ratio output of ratio_feature.
group | A subject by 2 data.frame with the first column for subject ID and the second column for group membership.
coverage | The coverage rate for the error band. Default is 0.95.
bws | The smoothness parameter for the smoothing lines and error bands. A larger value means a smoother line. Default is NULL, in which case the bandwidth is calculated automatically by the underlying smoothing function.
nrow | The number of rows in which to lay out the feature facets.
A ggplot2 object.
Examples can be found in tempted_all, ratio_feature, and aggregate_feature.
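A minimal sketch (reusing res_ratio and meta_table from the ratio_feature example later in this manual):
group <- unique(meta_table[, c("studyid", "delivery")])
plot_metafeature(res_ratio$metafeature_ratio, group, bws=30)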
This function uses ggplot2::geom_line() to plot the temporal loading functions from tempted.
plot_time_loading(res, r = NULL, ...)
res | Output of function tempted.
r | The number of components to plot. By default, all components estimated by tempted are plotted.
... | Arguments passed to ggplot2::geom_line().
A ggplot2 object.
Examples can be found in tempted_all and tempted.
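A minimal sketch (reusing res_tempted from the tempted example later in this manual; arguments in "..." such as linewidth are passed to ggplot2::geom_line()):
plot_time_loading(res_tempted, r=2, linewidth=2)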
Centered log-ratio (CLR) transformed OTU table from the ECAM data
processed_table
A data.frame with rows representing samples, matched with the data.frame meta_table, and columns representing microbial features (i.e. OTUs). Entries do not need to be further transformed and will be used directly by tempted. This data.frame is used to illustrate how tempted can be applied to general multivariate longitudinal data already preprocessed by the user.
Bokulich, Nicholas A., et al. "Antibiotics, birth mode, and diet shape microbiome maturation during early life." Science translational medicine 8.343 (2016): 343ra82-343ra82.
Top and bottom ranking features are picked based on feature loadings (and their contrasts). The log ratio abundance of the top ranking features over the bottom ranking features is produced as the main result. This function and its result are designed for longitudinal microbiome data and may not be meaningful for other types of temporal data.
ratio_feature(res_tempted, datlist, pct = 0.05, absolute = FALSE, contrast = NULL)
res_tempted | Output of tempted.
datlist | Output of format_tempted with option transform="none".
pct | The percent of features to sum up. Default is 0.05, i.e. 5%.
absolute | Whether to rank features by the absolute value of the feature loadings rather than their signed values. Default is FALSE.
contrast | A matrix choosing how components are combined: each column is a contrast of length r, used to calculate a linear combination of the feature loadings of the r components.
A list of results:
The log ratio abundance of the top over bottom ranking features, returned as metafeature_ratio. It is a data.frame with four columns: "value" for the log ratio values, "subID" for the subject ID, "timepoint" for the time points, and "PC" indicating which component was used to construct the meta feature.
The contrast used to linearly combine the components from input.
A matrix of TRUE/FALSE indicating which features are ranked top in each component (and contrast) and used as the numerator of the log ratio.
A matrix of TRUE/FALSE indicating which features are ranked bottom in each component (and contrast) and used as the denominator of the log ratio.
Shi P, Martino C, Han R, Janssen S, Buck G, Serrano M, Owzar K, Knight R, Shenhav L, Zhang AR. (2023) Time-Informed Dimensionality Reduction for Longitudinal Microbiome Studies. bioRxiv. doi: 10.1101/550749. https://www.biorxiv.org/content/10.1101/550749.
# Take a subset of the samples so the example runs faster
# Here we are taking samples from the odd months
sub_sample <- rownames(meta_table)[(meta_table$day_of_life%/%12)%%2==1]
count_table_sub <- count_table[sub_sample,]
processed_table_sub <- processed_table[sub_sample,]
meta_table_sub <- meta_table[sub_sample,]
datlist <- format_tempted(count_table_sub, meta_table_sub$day_of_life,
                          meta_table_sub$studyid, pseudo=0.5, transform="clr")
mean_svd <- svd_centralize(datlist, r=1)
res_tempted <- tempted(mean_svd$datlist, r=2, smooth=1e-5)
datalist_raw <- format_tempted(count_table_sub, meta_table_sub$day_of_life,
                               meta_table_sub$studyid, transform="none")
contrast <- cbind(c(1,1), c(1,-1))
res_ratio <- ratio_feature(res_tempted, datalist_raw, pct=0.1,
                           absolute=FALSE, contrast=contrast)
group <- unique(meta_table[, c("studyid", "delivery")])
# plot the log ratios
plot_metafeature(res_ratio$metafeature_ratio, group, bws=30)
This function reconstructs the temporal tensor from the low-dimensional components extracted by tempted and svd_centralize.
reconstruct(res_tempted, res_svd = NULL, datlist = NULL, r_reconstruct = NULL)
res_tempted | Output of function tempted.
res_svd | Output of function svd_centralize. Default is NULL, for decompositions run without mean subtraction.
datlist | Output of function format_tempted. Only needed when res_svd=NULL, i.e. when the mean structure was not removed.
r_reconstruct | The number of components from TEMPTED to be used for the tensor reconstruction.
datlist_est | The reconstructed tensor, stored in the same datlist format as the output of format_tempted.
# Take a subset of the samples so the example runs faster
# Here we are taking samples from the odd months
sub_sample <- rownames(meta_table)[(meta_table$day_of_life%/%12)%%2==1]
count_table_sub <- count_table[sub_sample,]
processed_table_sub <- processed_table[sub_sample,]
meta_table_sub <- meta_table[sub_sample,]
# reconstruct with mean subtraction
datlist <- format_tempted(processed_table_sub, meta_table_sub$day_of_life,
                          meta_table_sub$studyid, pseudo=NULL, transform="none")
mean_svd <- svd_centralize(datlist, r=1)
res_tempted <- tempted(mean_svd$datlist, r=2, smooth=1e-5)
datlist_est <- reconstruct(res_tempted, mean_svd, datlist=NULL, r_reconstruct=2)
vec_est <- unlist(sapply(datlist_est, function(x){x[-1,]}))
vec_obs <- unlist(sapply(datlist, function(x){x[-1,]}))
R2 <- 1-sum((vec_est-vec_obs)^2)/sum(vec_obs^2)
R2
# reconstruct without mean subtraction
datlist <- format_tempted(processed_table_sub, meta_table_sub$day_of_life,
                          meta_table_sub$studyid, pseudo=NULL, transform="none")
res_tempted <- tempted(datlist, r=2, smooth=1e-5)
datlist_est <- reconstruct(res_tempted, res_svd=NULL, datlist=datlist, r_reconstruct=2)
vec_est <- unlist(sapply(datlist_est, function(x){x[-1,]}))
vec_obs <- unlist(sapply(datlist, function(x){x[-1,]}))
R2 <- 1-sum((vec_est-vec_obs)^2)/sum(vec_obs^2)
R2
This function first averages the feature values over all time points for each subject to form a subject by feature matrix. Next, it performs a singular value decomposition of this matrix and constructs the matrix's rank-r approximation. Then, it subtracts this rank-r subject by feature matrix from the temporal tensor.
svd_centralize(datlist, r = 1)
datlist | A length n list of matrices. Each matrix represents a subject, with columns representing samples from this subject, the first row representing the sampling time points, and the following rows representing the feature values.
r | The number of ranks in the mean structure. Default is 1.
A list of results:
The new temporal tensor after the mean structure is removed (returned as datlist).
The subject singular vector of the mean structure, a subject by r matrix.
The feature singular vector of the mean structure, a feature by r matrix.
The singular value of the mean structure, a length r vector.
Shi P, Martino C, Han R, Janssen S, Buck G, Serrano M, Owzar K, Knight R, Shenhav L, Zhang AR. (2023) Time-Informed Dimensionality Reduction for Longitudinal Microbiome Studies. bioRxiv. doi: 10.1101/550749. https://www.biorxiv.org/content/10.1101/550749.
Examples can be found in tempted.
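A conceptual sketch of this mean removal on toy data (this mirrors the description above and is not the package's internal code):
# subject-by-feature matrix of per-subject temporal averages (toy values)
set.seed(1)
M <- matrix(rnorm(5*4), 5, 4)
s <- svd(M)
r <- 1
# rank-r approximation of the mean structure
M_r <- s$u[, 1:r, drop=FALSE] %*% diag(s$d[1:r], r, r) %*% t(s$v[, 1:r, drop=FALSE])
# M_r[i, ] is then subtracted from subject i's feature values at every time point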
This function constructs a de-noised version of the temporal tensor using the low-rank components obtained by svd_centralize and tempted.
tdenoise(res_tempted, mean_svd = NULL)
res_tempted | Output of tempted.
mean_svd | Output of svd_centralize. Default is NULL.
The de-noised functional tensor.
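A minimal usage sketch (reusing mean_svd and res_tempted from the tempted example below):
tensor_denoised <- tdenoise(res_tempted, mean_svd)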
This is the main function of the tempted package.
tempted(datlist, r = 3, smooth = 1e-06, interval = NULL, resolution = 101, maxiter = 20, epsilon = 1e-04)
datlist | A length n list of matrices. Each matrix represents a subject, with columns representing samples from this subject, the first row representing the sampling time points, and the following rows representing the feature values.
r | Number of components to decompose into, i.e. the rank of the CP-type decomposition. Default is 3.
smooth | Smoothing parameter for the RKHS norm. Larger means smoother temporal loading functions. Default is 1e-6. The value can be adjusted depending on the dataset by checking the smoothness of the estimated temporal loading functions in a plot.
interval | The range of time points to run the decomposition on. Default is the range of all observed time points. The user can set it to a shorter interval than the observed range.
resolution | Number of time points at which to evaluate the temporal loading function. Default is 101. It does not affect the subject or feature loadings.
maxiter | Maximum number of iterations. Default is 20.
epsilon | Convergence criterion for the difference between iterations. Default is 1e-4.
The estimated loadings:
Subject loading (returned as A_hat), a subject by r matrix.
Feature loading, a feature by r matrix.
Temporal loading function, a resolution by r matrix.
The time points where the temporal loading function is evaluated.
Eigenvalue, a length r vector.
Variance explained by each component. This is the R-squared of the linear regression of the vectorized temporal tensor against the vectorized low-rank reconstruction using individual components.
Variance explained by the first few components accumulated. This is the R-squared of the linear regression of the vectorized temporal tensor against the vectorized low-rank reconstruction using the first few components.
# Take a subset of the samples so the example runs faster
# Here we are taking samples from the odd months
sub_sample <- rownames(meta_table)[(meta_table$day_of_life%/%12)%%2==1]
count_table_sub <- count_table[sub_sample,]
processed_table_sub <- processed_table[sub_sample,]
meta_table_sub <- meta_table[sub_sample,]
# for count data from longitudinal microbiome studies
datlist <- format_tempted(count_table_sub, meta_table_sub$day_of_life,
                          meta_table_sub$studyid, pseudo=0.5, transform="clr")
mean_svd <- svd_centralize(datlist, r=1)
res_tempted <- tempted(mean_svd$datlist, r=2, smooth=1e-5)
# for preprocessed data that do not need to be transformed
datlist <- format_tempted(processed_table_sub, meta_table_sub$day_of_life,
                          meta_table_sub$studyid, pseudo=NULL, transform="none")
mean_svd <- svd_centralize(datlist, r=1)
res_tempted <- tempted(mean_svd$datlist, r=2, smooth=1e-5)
# plot the temporal loading
plot_time_loading(res_tempted, r=2)
This function wraps the functions format_tempted, svd_centralize, tempted, ratio_feature, and aggregate_feature.
tempted_all(featuretable, timepoint, subjectID, threshold = 0.95, pseudo = NULL,
            transform = "clr", r = 3, smooth = 1e-06, interval = NULL,
            resolution = 51, maxiter = 20, epsilon = 1e-04, r_svd = 1,
            do_ratio = TRUE, pct_ratio = 0.05, absolute = FALSE,
            pct_aggregate = 1, contrast = NULL)
featuretable | A sample by feature matrix. It is an input for format_tempted.
timepoint | The time stamp of each sample, matched with the rows of featuretable. It is an input for format_tempted.
subjectID | The subject ID of each sample, matched with the rows of featuretable. It is an input for format_tempted.
threshold | A threshold for feature filtering for microbiome data. Features with zero value percentage >= threshold will be excluded. Default is 0.95. It is an input for format_tempted.
pseudo | A small number to add to all the counts before normalizing into proportions and log transformation. Default is 1/2 of the smallest non-zero value, determined separately for each sample. This pseudo count is only added when the chosen transformation requires it (e.g. transform="clr"). It is an input for format_tempted.
transform | The transformation applied to the data. Default is "clr"; use "none" for data already preprocessed by the user. It is an input for format_tempted.
r | Number of components to decompose into, i.e. the rank of the CP-type decomposition. Default is 3. It is an input for tempted.
smooth | Smoothing parameter for the RKHS norm. Larger means smoother temporal loading functions. Default is 1e-6. The value can be adjusted depending on the dataset by checking the smoothness of the estimated temporal loading functions in a plot. It is an input for tempted.
interval | The range of time points to run the decomposition on. Default is the range of all observed time points. The user can set it to a shorter interval than the observed range. It is an input for tempted.
resolution | Number of time points at which to evaluate the temporal loading function. Default is 51. It does not affect the subject or feature loadings. It is an input for tempted.
maxiter | Maximum number of iterations. Default is 20. It is an input for tempted.
epsilon | Convergence criterion for the difference between iterations. Default is 1e-4. It is an input for tempted.
r_svd | The number of ranks in the mean structure. Default is 1. It is an input for svd_centralize.
do_ratio | Whether to calculate the log ratio of features. Default is TRUE.
pct_ratio | The percent of features to sum up. Default is 0.05, i.e. 5%. It is the input pct for ratio_feature.
absolute | Whether to rank features by the absolute value of the feature loadings rather than their signed values. Default is FALSE. It is an input for ratio_feature.
pct_aggregate | The percent of features to aggregate, with features ranked by the absolute value of the feature loading of each component. Default is 1, which means 100% of features are aggregated. It is the input pct for aggregate_feature.
contrast | A matrix choosing how components are combined: each column is a contrast of length r, used to calculate a linear combination of the feature loadings of the r components. It is an input for ratio_feature and aggregate_feature.
A list including all the input and output of functions format_tempted, svd_centralize, tempted, ratio_feature, and aggregate_feature:
All the input options of function tempted_all.
Output of format_tempted with option transform="none".
Output of format_tempted.
Output of svd_centralize.
Subject loading, a subject by r matrix.
Feature loading, a feature by r matrix.
Temporal loading function, a resolution by r matrix.
The time points where the temporal loading function is evaluated.
Eigenvalue, a length r vector.
Variance explained by each component. This is the R-squared of the linear regression of the vectorized temporal tensor against the vectorized low-rank reconstruction using individual components.
Variance explained by the first few components accumulated. This is the R-squared of the linear regression of the vectorized temporal tensor against the vectorized low-rank reconstruction using the first few components.
The log ratio abundance of the top over bottom ranking features, returned as metafeature_ratio. It is a data.frame with four columns: "value" for the log ratio values, "subID" for the subject ID, "timepoint" for the time points, and "PC" indicating which component was used to construct the meta feature.
A matrix of TRUE/FALSE indicating which features are ranked top in each component (and contrast) and used as the numerator of the log ratio.
A matrix of TRUE/FALSE indicating which features are ranked bottom in each component (and contrast) and used as the denominator of the log ratio.
The meta feature obtained by aggregating the observed temporal tensor. It is a data.frame with four columns: "value" for the meta feature values, "subID" for the subject ID, "timepoint" for the time points, and "PC" indicating which component was used to construct the meta feature.
A matrix of TRUE/FALSE indicating which features are aggregated in each component and contrast.
The contrast used to linearly combine the components from input.
Shi P, Martino C, Han R, Janssen S, Buck G, Serrano M, Owzar K, Knight R, Shenhav L, Zhang AR. (2023) Time-Informed Dimensionality Reduction for Longitudinal Microbiome Studies. bioRxiv. doi: 10.1101/550749. https://www.biorxiv.org/content/10.1101/550749.
# Take a subset of the samples so the example runs faster
# Here we are taking samples from the odd months
sub_sample <- rownames(meta_table)[(meta_table$day_of_life%/%12)%%2==1]
count_table_sub <- count_table[sub_sample,]
processed_table_sub <- processed_table[sub_sample,]
meta_table_sub <- meta_table[sub_sample,]
# for preprocessed data that do not need to be transformed
res.processed <- tempted_all(processed_table_sub, meta_table_sub$day_of_life,
                             meta_table_sub$studyid, threshold=1,
                             transform="none", r=2, smooth=1e-5, do_ratio=FALSE)
# for count data that will have pseudo added and clr transformed
res.count <- tempted_all(count_table_sub, meta_table_sub$day_of_life,
                         meta_table_sub$studyid, threshold=0.95,
                         transform="clr", pseudo=0.5, r=2, smooth=1e-5,
                         pct_ratio=0.1, pct_aggregate=1)
# for proportional data that will have pseudo added and clr transformed
res.proportion <- tempted_all(count_table_sub/rowSums(count_table_sub),
                              meta_table_sub$day_of_life, meta_table_sub$studyid,
                              threshold=0.95, transform="clr", pseudo=NULL,
                              r=2, smooth=1e-5, pct_ratio=0.1, pct_aggregate=1)
# plot the temporal loading and subject trajectories grouped by delivery mode
plot_time_loading(res.proportion, r=2)
group <- unique(meta_table[,c("studyid", "delivery")])
# plot the aggregated features
plot_metafeature(res.proportion$metafeature_aggregate, group, bws=30)