1. Overview
TDEseq is implemented as an open source R package for detecting genes with temporal dynamic expression patterns in time-series scRNA-seq transcriptomic studies. TDEseq primarily builds upon the linear additive mixed model (LAMM) framework, with a random effect term to account for correlated samples in time-resolved or time-course scRNA-seq studies. In this model, we typically introduce the quadratic I-splines and cubic C-splines as basis functions, which facilitate the detection of four potential temporal gene expression patterns, i.e., growth, recession, peak, and trough. This vignette will illustrate some uses and visulization of TDEseq.
2. Installation
TDEseq is implemented as an R package, which can be installed from GitHub.
library(devtools)
install_github("fanyue322/TDEseq")
3. Usage
TDEseq starts with the normalized gene expression scRNA-seq data and the corresponding time points.
Application on mouse liver development data
We used a sample mouse liver development during embyro process scRNA-seq data as a main example which contains expression measurement of 14,226 genes on 345 cells. This dataset can be downloaded from their original study GSE90047. We also saved the processed data that we used in our examples in RData format, which can be downloaded from here.
library(TDEseq)
load('scRNA-seq_liver_embryo.rda')
counts<-Seurat::GetAssayData(seurat,slot ='counts') ##raw counts data
data.norm<-Seurat::GetAssayData(seurat,slot ='data') ##log normalized data
meta.data<-seurat@meta.data ##metadata
Detecting temporal DE genes through TDEseq
To run TDEseq on scRNA-seq data, a user should consider the minimal parameter set for the wrapper function TDEseq
:
· data.norm(normalized scRNA-seq data)
· meta.data (a dataframe contained time points and individual information)
TDEseq takes the normalized data and a meta data as inputs, and output the p-value and pattern information for each gene. The gene expression data format:
E10.5D_1_01 | E10.5D_1_02 | E10.5D_1_03 | E10.5D_1_04 | E10.5D_1_05 | E10.5D_1_06 | |
---|---|---|---|---|---|---|
Gnai3 | 1.2575920 | 1.3618800 | 1.0647014 | 1.0578996 | 1.4092147 | 1.1380617 |
Cdc45 | 1.0757164 | 0.2382423 | 0.9187661 | 0.9054553 | 0.3632458 | 0.7534112 |
H19 | 2.0985043 | 2.1759738 | 1.8616150 | 1.6786874 | 1.5646516 | 1.6727125 |
Scml2 | 0.0000000 | 0.0000000 | 0.1631383 | 0.0000000 | 0.0000000 | 0.2138798 |
Apoh | 0.0308504 | 0.0092337 | 0.1028170 | 0.0086277 | 0.0351299 | 0.0252298 |
Narf | 0.3709296 | 0.7995276 | 0.2663387 | 0.4411944 | 0.2089655 | 0.0000000 |
The meta data should contain the information of time points/stage for each cells.
stage | group | |
---|---|---|
E10.5D_1_01 | 10.5 | E10.5D |
E11.5D_1_06 | 11.5 | E11.5D |
E11.5E_2_10 | 11.5 | E11.5E |
E13.5E_2_01 | 13.5 | E13.5E |
E14.5E_1_13 | 14.5 | E14.5E |
E15.5D_4_12 | 15.5 | E15.5D |
Where the “stage column contains the time points/stage for each cell (embryo day in this example) and the”group" column contains the individual information (index of the mice in this example). Here, we present a Linear TDEseq without batch effects controlling. TDEseq analysis can be performed as follows: #### Create TDEseqObject Users can create TDEseqObject from gene expression data
tde <- CreateTDEseqObject(counts = counts,data=data.norm,meta.data=meta.data)
Or from Seurat object directly
tde <- CreateTDEseqObject(counts = seurat)
Run TDEseq analysis
tde_method <- "cell"
tde_param <- list(sample.var = "group",
stage.var = "stage",
fit.model = "lm",
tde.thr = 0.05,
num.core=5)
tde <- tdeseq(object = tde, tde.method = tde_method, tde.param=tde_param)
An example seurat object file can be downloaded from here.
The temporal differential expression results can be accessed as a data frame in the output object via result<-GetTDEseqAssayData(tde,'tde')
.
By default, TDEseq uses the Linear version. Alternatively, one can perform the mixed version of TDEseq, which accounts for individual level batch effects, such a batch structure could arise due to sample-level variation. This variable imposing the batch structure on the expression observations and can be specified by setting sample.var="group"
. By accounting for batch effects in TDEseq, we set the fit.model='lmm'
, e.g. :
tde_method <- "cell"
tde_param <- list(sample.var = "group",
stage.var = "stage",
fit.model = "lmm",
num.core=10)
tde <- tdeseq(object = tde, tde.method = tde_method, tde.param=tde_param)
In this case the group information for each cell is required. The group information represents the sample-level variation of the scRNA-seq data.
Parameter settings
In TDEseq()
, besides the default parameters, users can also set other parameters as below to filter genes or cells:
tde_param <- list(sample.var = "batch",
stage.var = "stage",
fit.model = "lm",
pct = 0.1,
tde.thr = 0.05,
lfc = 0.1,
max.gcells = Inf,
min.tcells = 3,
num.core=10)
tde <- tdeseq(object = tde, tde.param=tde_param)
- Remove time points with too few cells by setting
min.tcells
. Here, time points with less than 3 cells will be removed. - Filter genes that are only expressed in a few cells by setting
pct
. Here, genes with more than 90% of zero counts will be filtered out. - Filter genes that show small average X-fold difference (log-scale) between any two time points by setting
lfc
. Here, we limit testing to genes which show at least 0.1-fold difference between any two time points. - Downsample cells by setting
max.gcells
. If max.gcells is smaller than the given number of cells in a sample, the down-sampling will be active. Here, we do not perform downsampling by settingmax.gcells=Inf
. - Perform parallel computing with 10 core by setting
num.core=10
.
Association of gene expression with time points
The exploration of the data analysis consist of checking whether gene expression is associated with time points and the corresponding pattern. This can be interpreted as testing whether the average gene expression is significantly changing along time points with a specific pattern.
gene | increasing.pvalue | decreasing.pvalue | convex.pvalue | concave.pvalue | pvalue | padj | aic.inc | aic.dec | aic.cov | aic.con | SignificantDE | pattern | ChangePoint |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Apoh | 0.00 | 0.81 | 0.0563112 | 0.6699999 | 0 | 0 | 518.56329 | 1211.3285 | 529.5136 | 533.9445 | Yes | Growth | NA |
Ccnd2 | 0.86 | 0.00 | 0.0000000 | 0.7599999 | 0 | 0 | 618.50895 | 357.5597 | 363.9423 | 409.1350 | Yes | Recession | NA |
Pemt | 0.00 | 0.76 | 0.0000000 | 0.7100000 | 0 | 0 | -202.52977 | 570.8639 | -200.4116 | 431.1143 | Yes | Growth | NA |
Haao | 0.00 | 0.84 | 0.0234969 | 0.0252611 | 0 | 0 | 94.26462 | 462.7439 | 122.3448 | 122.6517 | Yes | Growth | NA |
Here, increasing.pvalue
,decreasing.pvalue
,convex.pvalue
,concave.pvalue
are the p-values after testing whether gene expression is changing with a growth, recession, trough or peak pattern.
pval
is the p-value after Chuachy combination rule, and padj
is the adjusted p-values after BY-correction.
aic.inc
,aic.dec
,aic.cov
, aic.con
are the corresponding AIC score. TDEseq determines the pattern
of each gene according to the AIC score, and ChangePoint
represents the time points that gene expression trend changes(only exists for trough or peak genes).
4. Visualization
Here, we present the visualization of global expression patterns via a heatmap. We base this analysis on a classification of temporal differentially expressed genes according to the corresponding patterns. The function PatternHeatmap
takes an output object of the class TDEseq. The output of PatternHeatmap
is a object of the class ComplexHeatmap which can be plotted with print()
. To performed PatternHeatmap
, please first install ComplexHeatmap, Seurat and ggplot2 packages. To perform visulization function included in TDEseq, please library these packages first.
suppressPackageStartupMessages({
library("Seurat")
library("ggplot2")
library("ComplexHeatmap")
library("circlize")
})
Parameter settings
obj
: The results of TDEseq analysis
`stage.id’: The column name of time points in metadata.
features
: Genes to be shown in heatmap. Default is NULL.
features.show
: Genes to be annotated in heatmap.
features.num
: Number of genes to be shown for each patterns. Default is 50.
cols
: Color of the heatmap. Default is c(“navy”, “white”, “firebrick3”).
p<-PatternHeatmap(obj=tde,stage.id='stage',features.show=c("Cse1l","Cdk4","Cbx5","Sardh","Birc5","G3bp1","Pah"))
print(p)
5. Apply TDEseq on real scRNA-seq data
All scRNA-seq data sets used in our originial paper are available on here. For more time series scRNA-seq data, please refer TEDD database
6. Web server of TDEseq
TDEseq is also developed within a web framework with and hosted at our server. This web framework minimizes inherent dependencies on specific hardware, software packages and libraries and file-system attributes. Through this interface, users can provide a gene expression matrix in csv format and a metadata in csv format and run TDEseq analysis directly.
How to register a new user
To register a new user follow these steps:
- Visit our server and click Register button
- The sign-up form will appear. Follow the directions by entering the required information. Select “Role” as TDEseq.
- You will receive a text message from our server with a verification code. Enter the code to complete the account verification.
Usage
Our web version of TDEseq are organized into three steps: Upload data, Parameter settings and Visualization.
A. Upload data. The user can choose to upload normalised scRNA-seq gene expression data (genes by cells) in csv format as well as a metadata including time points and group information (specifying colnames as “Time” and “Group” and rownames as cell barcodes), or R data format including Expdata and Metadata formatted as the csv file.
B. Parameter settings. The web version of TDEseq provides identical parameter settings to the R package version of TDEseq. Users can select appropriate parameters according to the parameter description. The detailed description of each patameters is as follow:
logFC_threshold
: Limit testing to genes which show the maximum on average X-fold difference (log-scale) between any two time points. Default is 0.0.
pct
: Only test genes that are detected in a minimum fraction of pct cells. Default is 0.1.
threshold
: Only return markers that have an FDR(BY adjusted) < return.thresh. Default is 0.05.
max_cells_per_ident
: Down sample each identity class to a max number. Default is no downsampling.
min_cells_per_timepoints
:Minimum number of cells in one of the time points.
LMM
:Denotes which methods to use. Default is FALSE(Linear TDEseq). Set LMM=TRUE
to perform Mixed TDEseq.
pseudocell
: Whether perform pseudo cell strategy or not. Default is NULL. Alternatively, users can set pseudocell as an integer(i.e. 20) to perform pseudocell analysis. Note that to perform pseudocell strategy, raw counts matrix instead of normalized gene expression data is required.
C. Visualization. The web version of TDEseq provides two approaches for user to visualise the analysis results. For the heatmap
, the temporal expression genes are diveded into 4 types according to their pattern (growth, recession, peak and trough) and the corresponding heatmap is displayed with the clustered genes (As the temporal distinct patterns figure in our original paper). Users can select the genes of interest or the top 50 genes for each pattern ranked by p-values to display for the heatmap. Another approach is the line
, which shows the smooth line estimated by TDEseq across all time points.