This file illustrates the normalization and differential analysis of RNAseq data on a real life dataset. The data are provided by courtesy of the transcriptomic platform of IPS2 and the name of the genes were shuffled. Hence the results are not biologically interpretable. They are composed of three files, all included in the directory data
:
D1-counts.txt
contains the raw counts of the experiment (13 columns: the first one contains the gene names, the others correspond to 12 different samples);
D1-genesLength.txt
contains information about gene lengths;
D1-targets.txt
contains information about the sample and the experimental design.
In the rest of this file, we:
1/ first import and describe the data;
2/ then perform and compare different normalizations;
3/ finally perform a differential analysis using different methods and models.
We advice that you use the file by reading the R code while trying to make sense of it. Then, you run it and analyze the results. The packages devtools
, ggplot2
, gridExtra
, reshape2
, mixOmics
, RColorBrewer
, DESeq
, edgeR
and VennDiagram
are required to compile this file. Information about my system (including package and R versions) are provided at the end of this file, in Section “Session information”.
The packages are loaded with:
library(ggplot2)
library(gridExtra)
library(reshape2)
library(mixOmics)
library(RColorBrewer)
library(DESeq)
library(edgeR)
library(VennDiagram)
library(devtools)
The data used in this practical session correspond to 12 samples of RNAseq obtained from microdissections on plants. Plants have four different genotypes: the wild type (“wt”“) plant and three types of mutants and are observed three times in the same conditions. for information Mutants 1 and 2 have a similar phenotype (more complicated leaves than the wild type) whereas mutant 3 has the opposite phenotype (simpler leaves than the wild type).
The datasets are included in the repository data
located at the root of the material for this practical session. They can be loaded using:
raw_counts <- read.table("../data/D1-counts.txt", header = TRUE, row.names = 1)
raw_counts <- as.matrix(raw_counts)
design <- read.table("../data/D1-targets.txt", header = TRUE,
stringsAsFactors = FALSE)
gene_lengths <- scan("../data/D1-genesLength.txt")
The dimensions of the raw count matrix are equal to:
dim(raw_counts)
## [1] 50897 12
which shows that the data contains 12 columns (samples) and 50897 rows (genes). And a quick look is obtained by:
head(raw_counts)
## wt_1 wt_2 wt_3 mut1_1 mut1_2 mut1_3 mut2_1 mut2_2 mut2_3
## Medtr0001s0010.1 0 0 0 1 0 1 0 0 0
## Medtr0001s0070.1 0 0 0 0 0 0 0 0 0
## Medtr0001s0100.1 0 0 0 0 0 0 0 0 0
## Medtr0001s0120.1 0 0 0 0 0 0 0 0 0
## Medtr0001s0160.1 0 0 0 0 0 0 0 0 0
## Medtr0001s0190.1 0 0 0 0 0 0 0 0 0
## mut3_1 mut3_2 mut3_3
## Medtr0001s0010.1 0 1 0
## Medtr0001s0070.1 0 0 0
## Medtr0001s0100.1 0 0 0
## Medtr0001s0120.1 0 0 0
## Medtr0001s0160.1 0 0 0
## Medtr0001s0190.1 0 0 0
which shows that the row names are gene names and that the data are made of counts. Many of these counts are equal to 0. A basic exploratory analysis of the count data is provided in the next section.
The experimental design is described in the object design
:
design
## labels group replicat
## 1 wt_1 wt repbio1
## 2 wt_2 wt repbio2
## 3 wt_3 wt repbio3
## 4 mut1_1 mut1 repbio1
## 5 mut1_2 mut1 repbio2
## 6 mut1_3 mut1 repbio3
## 7 mut2_1 mut2 repbio1
## 8 mut2_2 mut2 repbio2
## 9 mut2_3 mut2 repbio3
## 10 mut3_1 mut3 repbio1
## 11 mut3_2 mut3 repbio2
## 12 mut3_3 mut3 repbio3
that contains 3 columns: the first one labels
is the sample name, the second one group
is the group of the sample (wild type or one of three types of mutant) and the last one replicat
is the biological replicate number.
We start the analysis by filtering out the genes for which no count has been found:
raw_counts_wn <- raw_counts[rowSums(raw_counts) > 0, ]
dim(raw_counts_wn)
## [1] 27916 12
The number of genes which have been filtered out is thus equal to 22981.
It is often useful, to visualize the count distribution, to compute “pseudo counts”, which are log-transformed counts:
pseudo_counts <- log2(raw_counts_wn + 1)
head(pseudo_counts)
## wt_1 wt_2 wt_3 mut1_1 mut1_2 mut1_3
## Medtr0001s0010.1 0.000000 0.000000 0.000000 1.000000 0.000000 1.000000
## Medtr0001s0200.1 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000
## Medtr0001s0260.1 6.357552 6.768184 6.643856 6.169925 6.507795 6.189825
## Medtr0001s0360.1 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000
## Medtr0001s0490.1 1.000000 2.000000 0.000000 0.000000 0.000000 0.000000
## Medtr0002s0040.1 7.707359 6.918863 7.826548 2.584963 2.000000 4.643856
## mut2_1 mut2_2 mut2_3 mut3_1 mut3_2 mut3_3
## Medtr0001s0010.1 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
## Medtr0001s0200.1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## Medtr0001s0260.1 6.209453 6.392317 6.672425 6.820179 6.554589 6.539159
## Medtr0001s0360.1 0.000000 1.584963 0.000000 1.000000 0.000000 0.000000
## Medtr0001s0490.1 0.000000 0.000000 1.584963 1.000000 0.000000 1.000000
## Medtr0002s0040.1 6.643856 7.651052 8.044394 3.169925 3.459432 1.584963
df_raw <- melt(pseudo_counts, id = rownames(raw_counts_wn))
names(df_raw)[1:2]<- c("id", "sample")
df_raw$method <- rep("Raw counts", nrow(df_raw))
head(df_raw)
## id sample value method
## 1 Medtr0001s0010.1 wt_1 0.000000 Raw counts
## 2 Medtr0001s0200.1 wt_1 0.000000 Raw counts
## 3 Medtr0001s0260.1 wt_1 6.357552 Raw counts
## 4 Medtr0001s0360.1 wt_1 0.000000 Raw counts
## 5 Medtr0001s0490.1 wt_1 1.000000 Raw counts
## 6 Medtr0002s0040.1 wt_1 7.707359 Raw counts
The object df_raw
will be used later to compare the effect of different normalization on the count distribution in the different samples.
First, we provide an overview of the distribution of the first sample (without the genes that have been filtered out) by ploting the histograms of raw counts and pseudo counts:
df <- data.frame(rcounts = raw_counts_wn[ ,1], prcounts = pseudo_counts[ ,1])
p <- ggplot(data=df, aes(x = rcounts, y = ..density..))
p <- p + geom_histogram(fill = "lightblue")
p <- p + theme_bw()
p <- p + ggtitle(paste0("count distribution '", design$labels[1], "'"))
p <- p + xlab("counts")
p2 <- ggplot(data=df, aes(x = prcounts, y = ..density..))
p2 <- p2 + geom_histogram(fill = "lightblue")
p2 <- p2 + theme_bw()
p2 <- p2 + ggtitle(paste0("count distribution - '", design$labels[1], "'"))
p2 <- p2 + xlab(expression(log[2](counts + 1)))
grid.arrange(p, p2, ncol = 2)
This figure shows that the count distribution is highly skewed with a large proportion of genes with a count equal to 0 and a few number of genes with an extreme count (more that \(2^{15}\)).
Another important feature of RNAseq data is the fact that they are overdispersed. This means that the variance of the count of a given gene over different biological samples within a given condition is larger than the average count for the same gene. This feature is illustrated by plotting the graphics of the mean vs variance for condition “wt” for all genes with an average count smaller than 5000 (otherwise the chart can not be read easily):
df <- data.frame(mean = apply(raw_counts_wn[ ,design$group == "wt"], 1, mean),
var = apply(raw_counts_wn[ ,design$group == "wt"], 1, var))
df <- df[df$mean <= 5000, ]
p <- ggplot(data=df, aes(x = mean, y = var))
p <- p + geom_point(colour = "orange")
p <- p + theme_bw()
p <- p + geom_abline(aes(intercept=0, slope=1))
p <- p + ggtitle("Variance versus mean in counts") + ylab("variance")
print(p)
In this figure, the black line is the “y=x” diagonal. It is easy to see that, for most genes, the variance is much larger than the mean.
Finally, a PCA is performed to identify the main sources of variability in the dataset. Pseudo-counts are used to perform this PCA.
resPCA <- pca(t(pseudo_counts), ncomp = 12)
plot(resPCA)
The first component explains more variance than the other components which almost all equally reproduce variance. The projection of the samples on the first two PCs is obtained with:
plotIndiv(resPCA, group = design$group, col.per.group = brewer.pal(4, "Dark2"))
This figure shows that mutants 1 and 3 are well separated from the other wild type but that wild type is not well separated from mutant 2.
This section performs different normalization approaches for RNAseq data. These normalization are performed using either DESeq (RLE) or edgeR (TC, RPKM, UQ, TMM). The final count distribution is compared to the raw count distribution in the last part of this section.
To perform the normalization with DESeq, an object of type CountDataSet
has to be created:
dge <- newCountDataSet(raw_counts_wn, conditions = design$group)
dge
## CountDataSet (storageMode: environment)
## assayData: 27916 features, 12 samples
## element names: counts
## protocolData: none
## phenoData
## sampleNames: wt_1 wt_2 ... mut3_3 (12 total)
## varLabels: sizeFactor condition
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:
In this object, the attribute sizeFactor
is at first missing:
sizeFactors(dge)
## wt_1 wt_2 wt_3 mut1_1 mut1_2 mut1_3 mut2_1 mut2_2 mut2_3 mut3_1
## NA NA NA NA NA NA NA NA NA NA
## mut3_2 mut3_3
## NA NA
The function estimateSizeFactors()
allows to estimate the size factors. The dge object is unchanged but the attribute sizeFactor
is now given:
dge <- estimateSizeFactors(dge)
sizeFactors(dge)
## wt_1 wt_2 wt_3 mut1_1 mut1_2 mut1_3 mut2_1
## 0.9753728 1.1491346 1.0187414 0.9167701 1.0326086 1.1436435 0.9664323
## mut2_2 mut2_3 mut3_1 mut3_2 mut3_3
## 0.9570126 0.9826726 1.1355102 0.9271699 0.9202688
Normalized counts can be obtained using the function counts
with the option normalized = TRUE
. The calculation of the normalized counts is performed with the formula \(\tilde{K}_{gj} = \frac{K_{gj}}{s_j}\). The following command lines compare the result of the counts
function with this calculation:
deseq_normcount <- counts(dge, normalized = TRUE)
test_normcount <- sweep(raw_counts_wn, 2, sizeFactors(dge), "/")
sum(test_normcount != deseq_normcount)
## [1] 0
Finally, pseudo counts are computed and stored in a data frame for further comparison.
pseudo_deseq <- log2(deseq_normcount + 1)
df_deseq <- melt(pseudo_deseq, id = rownames(raw_counts_wn))
names(df_deseq)[1:2]<- c("id", "sample")
df_deseq$method <- rep("DESeq (RLE)", nrow(df_raw))
In this section, the package edgeR is perform to compare the other normalization approaches. To do so, a DGEList object has to be created from the count data:
dge2 <- DGEList(raw_counts_wn)
dge2
## An object of class "DGEList"
## $counts
## wt_1 wt_2 wt_3 mut1_1 mut1_2 mut1_3 mut2_1 mut2_2 mut2_3
## Medtr0001s0010.1 0 0 0 1 0 1 0 0 0
## Medtr0001s0200.1 0 1 0 0 0 0 0 0 0
## Medtr0001s0260.1 81 108 99 71 90 72 73 83 101
## Medtr0001s0360.1 0 0 1 0 0 0 0 2 0
## Medtr0001s0490.1 1 3 0 0 0 0 0 0 2
## mut3_1 mut3_2 mut3_3
## Medtr0001s0010.1 0 1 0
## Medtr0001s0200.1 0 0 0
## Medtr0001s0260.1 112 93 92
## Medtr0001s0360.1 1 0 0
## Medtr0001s0490.1 1 0 1
## 27911 more rows ...
##
## $samples
## group lib.size norm.factors
## wt_1 1 5932414 1
## wt_2 1 7129204 1
## wt_3 1 6223759 1
## mut1_1 1 5577595 1
## mut1_2 1 6272344 1
## 7 more rows ...
The library sizes and the normalization factors for all samples are obtained by:
dge2$samples
## group lib.size norm.factors
## wt_1 1 5932414 1
## wt_2 1 7129204 1
## wt_3 1 6223759 1
## mut1_1 1 5577595 1
## mut1_2 1 6272344 1
## mut1_3 1 6634542 1
## mut2_1 1 5897356 1
## mut2_2 1 5785184 1
## mut2_3 1 5767730 1
## mut3_1 1 7106881 1
## mut3_2 1 5872052 1
## mut3_3 1 6035692 1
Normalization by Total Count (TC) is obtained directly by the function cpm
. Pseudo-counts (\(\log_2\) transformed counts) are computed and stored in a data frame for later use.
pseudo_TC <- log2(cpm(dge2) + 1)
df_TC <- melt(pseudo_TC, id = rownames(raw_counts_wn))
names(df_TC)[1:2] <- c ("id", "sample")
df_TC$method <- rep("TC", nrow(df_TC))
RPKM needs information about gene lengths which have to be passed to the function rpkm
. Again, pseudo-counts are computed and stored in a data frame for further use.
gene_lengths_wn <- gene_lengths[rowSums(raw_counts) > 0]
pseudo_RPKM <- log2(rpkm(dge2, gene.length = gene_lengths_wn) + 1)
df_RPKM <- melt(pseudo_RPKM, id = rownames(raw_counts_wn))
names(df_RPKM)[1:2] <- c ("id", "sample")
df_RPKM$method <- rep("RPKM", nrow(df_RPKM))
Upper quartile normalization is obtained with the function calcNormFactors
. This function changes the variable norm.factors
in the DGEList object and thus also the way the functions cpm
and rpkm
are computing counts: now, a normalized library size, equal to the original library size multiplied by the normalization factor, is used to compute normalized counts: \(\tilde{K}_{gj} = \frac{K_{gj}}{\mbox{modified LS}}*10^6\). The normalization factors are computed using:
dge2 <- calcNormFactors(dge2, method = "upperquartile")
dge2$samples
## group lib.size norm.factors
## wt_1 1 5932414 1.0047308
## wt_2 1 7129204 0.9810476
## wt_3 1 6223759 1.0130558
## mut1_1 1 5577595 1.0068753
## mut1_2 1 6272344 0.9942229
## mut1_3 1 6634542 1.0593861
## mut2_1 1 5897356 0.9873347
## mut2_2 1 5785184 1.0124342
## mut2_3 1 5767730 1.0453656
## mut3_1 1 7106881 0.9792811
## mut3_2 1 5872052 0.9857220
## mut3_3 1 6035692 0.9361638
The previous formula is compared to the result of the function cpm
in the following command lines:
test_normcount <- sweep(dge2$counts, 2,
dge2$samples$lib.size*dge2$samples$norm.factors / 10^6,
"/")
range(as.vector(test_normcount - cpm(dge2)))
## [1] -1.818989e-12 0.000000e+00
which shows no difference. Finally, pseudo counts are obtained and stored in a data frame:
pseudo_UQ <- log2(cpm(dge2) + 1)
df_UQ <- melt(pseudo_UQ, id = rownames(raw_counts_wn))
names(df_UQ)[1:2] <- c ("id", "sample")
df_UQ$method <- rep("UQ", nrow(df_UQ))
TMM normalization works similarly as UQ and updates the value of the variable norm.factors
:
dge2 <- calcNormFactors(dge2, method = "TMM")
dge2$samples
## group lib.size norm.factors
## wt_1 1 5932414 1.0082044
## wt_2 1 7129204 0.9850930
## wt_3 1 6223759 1.0047944
## mut1_1 1 5577595 1.0064979
## mut1_2 1 6272344 1.0094894
## mut1_3 1 6634542 1.0593265
## mut2_1 1 5897356 0.9977013
## mut2_2 1 5785184 1.0096719
## mut2_3 1 5767730 1.0387295
## mut3_1 1 7106881 0.9791254
## mut3_2 1 5872052 0.9637238
## mut3_3 1 6035692 0.9429276
Normalized pseudo counts are obtained with the function cpm
and stored in a data frame:
pseudo_TMM <- log2(cpm(dge2) + 1)
df_TMM <- melt(pseudo_TMM, id = rownames(raw_counts_wn))
names(df_TMM)[1:2] <- c ("id", "sample")
df_TMM$method <- rep("TMM", nrow(df_TMM))
This last section shows the comparison between all normalization methods (and original raw data) for this data set. First, the distributions of all samples and all normalization methods are compared with boxplots and density plots:
df_allnorm <- rbind(df_raw, df_deseq, df_TC, df_RPKM, df_UQ, df_TMM)
df_allnorm$method <- factor(df_allnorm$method,
levels = c("Raw counts", "DESeq (RLE)", "TC",
"RPKM", "TMM", "UQ"))
p <- ggplot(data=df_allnorm, aes(x=sample, y=value, fill=method))
p <- p + geom_boxplot()
p <- p + theme_bw()
p <- p + ggtitle("Boxplots of normalized pseudo counts\n
for all samples by normalization methods")
p <- p + facet_grid(. ~ method)
p <- p + ylab(expression(log[2] ~ (normalized ~ count + 1))) + xlab("")
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(),
axis.ticks.x = element_blank())
print(p)
p <- ggplot(data=df_allnorm, aes(x=value, colour=sample))
p <- p + geom_density()
p <- p + theme_bw()
p <- p + ggtitle("Density of normalized pseudo counts\n
for all samples by normalization methods")
p <- p + facet_grid(. ~ method)
p <- p + ylab(expression(log[2] ~ (normalized ~ count + 1))) + xlab("")
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(),
axis.ticks.x = element_blank())
print(p)
Visually, DESeq, UQ and TMM seem to behave similarly and provide comparable distributions between samples. Finally, a PCA is performed on the pseudo counts obtained after TMM normalization:
resPCA <- pca(t(pseudo_TMM), ncomp = 12)
plot(resPCA)
The first component explains more variance than the other components which almost all equally reproduce variance. The projection of the samples on the first two PCs is obtained with:
plotIndiv(resPCA, group = design$group, col.per.group = brewer.pal(4, "Dark2"))
This figure shows that a better separation between wild type samples and mutant 2 samples. On the contrary mutants 2 and 1 are less well separated than in raw count data. This is a biologically plausible result since the organization of the different mutants are concordant with their phenotypes. The normalization has thus improved the data quality.
This section will compare the results of different types of approach to obtain genes which are differentially expressed between the wild type plants and each of the mutants:
a standard NB exact test between two conditions;
a GLM with only the plant type effect;
a GLM with the plant and genotype effects;
LM after a transformation of the data.
All analyses are performed with the R package edgeR, except for the last one which is made with limma. A last section compares the results obtained using the different methods.
The following command lines loop over the different mutant types (stored in the vector all_conditions
) and perform the following operations:
first, a DGEList object is created with the wild type samples and the current mutant type samples. The “wt” condition is chosen as the reference level. Normalization is performed using TMM;
then, the functions estimateCommonDisp
and estimateTagwiseDisp
respectively estimate a common dispersion parameter and uses it as a prior to estimate gene specific dispersions \(\phi_g\) for all \(g\). The “Biological Coefficient of Variation” plot is also displayed which displays \(\phi_g\) versus the average (normalized) count for all genes;
finally, an exact test is performed with the function exactTest
to find genes which are differentially expressed between the two conditions (wild type and mutant). The output of the function is printed and p-values and adjusted p-values (Benjamini and Hochberg approach) are stored in pvals_pairwiseExact
. The function topTags
is used to print the genes with the smallest p-values and the function decideTestsDGE
is used to extract genes whose adjusted p-value is smaller than 5%. The names of those genes as well as the fact that they are up/down regulated in the mutant type are saved in DGE_pairwiseExact
.
all_conditions <- setdiff(unique(design$group), "wt")
DEG_pairwiseExact <- NULL
pvals_pairwiseExact <- NULL
par(mfrow = c(1,3))
for (ind in seq_along(all_conditions)) {
# create dataset with 'wt' and current mutant and normalize (TMM)
sel_cols <- union(grep("wt", colnames(raw_counts_wn)),
grep(all_conditions[ind], colnames(raw_counts_wn)))
cur_counts <- raw_counts_wn[ ,sel_cols]
group <- relevel(as.factor(design$group[sel_cols]), ref = "wt")
cur_dge <- DGEList(cur_counts, group = group)
cur_dge <- calcNormFactors(cur_dge, method = "TMM")
# estimate dispersions
cur_dge <- estimateCommonDisp(cur_dge)
cur_dge <- estimateTagwiseDisp(cur_dge)
plotBCV(cur_dge,
main = paste0("BCV plot 'wt' vs '", all_conditions[ind], "'"))
# exact test
res_et <- exactTest(cur_dge)
cat("Exact test results:\n")
print(res_et)
pvals_pairwiseExact <- c(pvals_pairwiseExact, res_et$table$PValue,
p.adjust(res_et$table$PValue, method = "BH"))
cat("Top 10 DEG for 'wt' vs '", all_conditions[ind], "':\n")
print(topTags(res_et))
cur_res <- decideTestsDGE(res_et, adjust.method = "BH", p.value = 0.05)
print(cur_res)
sel_deg <- which(cur_res[ ,1] != 0)
cur_res <- cbind(rownames(cur_counts)[sel_deg], cur_res[sel_deg,1],
rep(all_conditions[ind], length(sel_deg)))
DEG_pairwiseExact <- rbind(DEG_pairwiseExact, cur_res)
}
## Exact test results:
## An object of class "DGEExact"
## $table
## logFC logCPM PValue
## Medtr0001s0010.1 2.6776698 -1.431356 0.5100990
## Medtr0001s0200.1 -1.8534460 -1.539448 1.0000000
## Medtr0001s0260.1 -0.2643299 3.819200 0.2980937
## Medtr0001s0360.1 -1.8661639 -1.538425 1.0000000
## Medtr0001s0490.1 -3.5116569 -1.241010 0.1584303
## 27911 more rows ...
##
## $comparison
## [1] "wt" "mut1"
##
## $genes
## NULL
##
## Top 10 DEG for 'wt' vs ' mut1 ':
## Comparison of groups: mut1-wt
## logFC logCPM PValue FDR
## Medtr7g079200.1 10.428936 5.646475 1.084483e-128 3.027442e-124
## Medtr6g034805.1 -5.621053 3.157062 3.813887e-36 5.323424e-32
## Medtr2g028530.1 -9.381843 2.800091 2.136579e-31 1.988158e-27
## Medtr3g092475.1 -2.761047 4.927626 1.219502e-29 8.510903e-26
## Medtr7g116150.1 6.314021 2.897514 3.541276e-29 1.977165e-25
## Medtr6g051975.1 8.734298 2.175163 6.808323e-27 3.167686e-23
## Medtr6g445300.1 -8.201467 1.703607 4.842629e-20 1.931241e-16
## Medtr2g054640.1 1.999710 6.178289 2.147915e-19 7.495149e-16
## Medtr0024s0070.1 6.840129 2.166796 4.779187e-19 1.482397e-15
## Medtr0002s0040.1 -4.131323 3.991809 7.549003e-18 2.107380e-14
## TestResults matrix
## [1] 0 0 0 0 0
## 27911 more rows ...
## Exact test results:
## An object of class "DGEExact"
## $table
## logFC logCPM PValue
## Medtr0001s0010.1 0.00000000 -1.613063 1.0000000
## Medtr0001s0200.1 -1.82136561 -1.498958 1.0000000
## Medtr0001s0260.1 -0.04218856 3.920703 0.8831503
## Medtr0001s0360.1 0.88126012 -1.290109 1.0000000
## Medtr0001s0490.1 -0.75807651 -1.031106 0.7324608
## 27911 more rows ...
##
## $comparison
## [1] "wt" "mut2"
##
## $genes
## NULL
##
## Top 10 DEG for 'wt' vs ' mut2 ':
## Comparison of groups: mut2-wt
## logFC logCPM PValue FDR
## Medtr7g079200.1 6.264926 1.624597 7.995954e-09 0.0001026475
## Medtr2g054640.1 1.394714 5.716404 8.559340e-09 0.0001026475
## Medtr4g035895.1 -1.788930 6.346641 1.103104e-08 0.0001026475
## Medtr3g073360.1 -6.724211 2.152298 4.357221e-08 0.0002551703
## Medtr2g006070.1 2.088873 4.624899 4.570323e-08 0.0002551703
## Medtr8g467910.1 3.591974 2.640817 1.190725e-07 0.0005540047
## Medtr3g036280.1 -2.653538 2.594173 1.455923e-07 0.0005556901
## Medtr8g073335.1 -4.249142 4.712342 1.592464e-07 0.0005556901
## Medtr4g117170.1 -2.288529 2.964487 1.974584e-07 0.0006124721
## Medtr2g025455.1 -2.499058 3.965837 4.045263e-07 0.0011292757
## TestResults matrix
## [1] 0 0 0 0 0
## 27911 more rows ...
## Exact test results:
## An object of class "DGEExact"
## $table
## logFC logCPM PValue
## Medtr0001s0010.1 1.91369814 -1.558248 1.0000000
## Medtr0001s0200.1 -1.83589426 -1.560024 1.0000000
## Medtr0001s0260.1 0.12727186 3.961898 0.5998775
## Medtr0001s0360.1 0.04230309 -1.452970 1.0000000
## Medtr0001s0490.1 -0.80141626 -1.091878 0.7178534
## 27911 more rows ...
##
## $comparison
## [1] "wt" "mut3"
##
## $genes
## NULL
##
## Top 10 DEG for 'wt' vs ' mut3 ':
## Comparison of groups: mut3-wt
## logFC logCPM PValue FDR
## Medtr7g079200.1 10.398918 5.576979 1.152665e-137 3.217779e-133
## Medtr3g092475.1 -8.644769 4.695309 4.918009e-92 6.864556e-88
## Medtr6g034805.1 -5.376733 3.125204 2.710513e-33 2.522223e-29
## Medtr2g028530.1 -9.357128 2.765291 4.307709e-31 3.006350e-27
## Medtr5g024095.1 -5.182918 3.295389 1.039588e-28 5.804225e-25
## Medtr0002s0040.1 -4.712504 3.927106 4.775120e-28 2.221704e-24
## Medtr6g445300.1 -8.178158 1.674507 1.476583e-19 5.888613e-16
## Medtr6g051975.1 8.104036 1.561939 5.101524e-19 1.780177e-15
## Medtr0044s0130.1 -4.041116 3.016525 2.071577e-18 6.425572e-15
## Medtr7g066070.1 -4.195235 2.974816 8.616258e-17 2.405315e-13
## TestResults matrix
## [1] 0 0 0 0 0
## 27911 more rows ...
Finally, pvals_pairwiseExact
and DEG_pairwiseExact
are transformed into data frames for further use:
DEG_pairwiseExact <- as.data.frame(DEG_pairwiseExact)
names(DEG_pairwiseExact) <- c("name", "UD", "condition")
listDEG_pairwiseExact <- unique(DEG_pairwiseExact$name)
pvals_pairwiseExact <- data.frame("pvalue" = pvals_pairwiseExact,
"type" = rep(rep(c("raw", "adjusted"),
each = nrow(raw_counts_wn)),
length(all_conditions)),
"condition" = rep(all_conditions,
each=nrow(raw_counts_wn)*2))
In particular, the number of DEGs corresponding to each mutant type, the distribution of up/down regulated DEGs versus the mutant type and the total number of (unique) DEGs are obtained with:
table(DEG_pairwiseExact$condition)
##
## mut1 mut2 mut3
## 362 51 127
table(DEG_pairwiseExact$condition, DEG_pairwiseExact$UD) # 1 means 'up-regulated'
##
## 1 -1
## mut1 283 79
## mut2 13 38
## mut3 53 74
length(listDEG_pairwiseExact)
## [1] 480
Finally, the distributions of p-values and adjusted p-values for all three tests are displayed by:
p <- ggplot(data = pvals_pairwiseExact, aes(x = pvalue, fill = type))
p <- p + geom_histogram()
p <- p + theme_bw()
p <- p + ggtitle("Histograms of raw/adjusted p-values for exact test")
p <- p + facet_grid(type ~ condition)
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(),
axis.ticks.x = element_blank())
print(p)
In this section, we fit a generalized linear model to account for the group effect. The model writes \(K_{gj} \sim \mbox{NB}(\mu_{gj}, \phi_g)\) with \(\mathbb{E}(\log K_{gj}) = \log(s_j) + \log(\lambda_{gj})\). Here, \(j\) denotes one of the four genotypes (wild type or mutant 1, 2 and 3) and \(\lambda_{gj}\) is explained by: \(\log(\lambda_{g,\textrm{mutant}}) = \lambda_{g0} + \beta_{g1} \mathbb{I}_{\textrm{mutant}_1} + \beta_{g2} \mathbb{I}_{\textrm{mutant}_2} + \beta_{g3} \mathbb{I}_{\textrm{mutant}_3}\). In this model, the wild type plant is the reference genotype and thus, testing if the second (resp., the third, the fourth) coefficient in the model is equal to zero is equivalent to find DEGs between the wild type and mutant 1 (resp. 2, 3).
First, a DGEList object is created with all samples and is normalized by TMM:
# create dataset with 'wt' and current mutant and normalize (TMM)
cur_dge <- DGEList(raw_counts_wn, group = design$group)
cur_dge <- calcNormFactors(cur_dge, method = "TMM")
Then, a design matrix is created: the group condition is stored in a variable group
in which the reference level is set to “wt”. Then, this variable is used in the function model.matrix
to obtain the design matrix:
group <- relevel(as.factor(design$group), ref = "wt")
design_matrix <- model.matrix(~ group)
design_matrix
## (Intercept) groupmut1 groupmut2 groupmut3
## 1 1 0 0 0
## 2 1 0 0 0
## 3 1 0 0 0
## 4 1 1 0 0
## 5 1 1 0 0
## 6 1 1 0 0
## 7 1 0 1 0
## 8 1 0 1 0
## 9 1 0 1 0
## 10 1 0 0 1
## 11 1 0 0 1
## 12 1 0 0 1
## attr(,"assign")
## [1] 0 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
Then, the dispersion is estimated in three steps: a common dispersion is first estimated, then a trended dispersion (which relates the dispersion to overall level of expression of the gene) and finally a gene specific dispersion:
cur_dge <- estimateGLMCommonDisp(cur_dge, design_matrix)
cur_dge <- estimateGLMTrendedDisp(cur_dge, design_matrix)
cur_dge <- estimateGLMTagwiseDisp(cur_dge, design_matrix)
plotBCV(cur_dge, main = paste0("BCV plot"))
Finally, the model is fitted with the function glmFit
. Then, coefficients are tested using the function glmLRT
in which the argument coef
is used to select a given mutant type. topTags
is used to print the DEGs with the smallest p-values for each mutant type. Also, decideTestsDGE
allows to extract DEGs and the results (p-values, DEGs and information about up and down regulation) are all saved in pvals_GLM1
and listDEG_GLM1
similarly as what was done for the results of the Exact Test.
# GLM fit
fit <- glmFit(cur_dge, design_matrix)
DEG_GLM1 <- NULL
pvals_GLM1 <- NULL
for (ind in 1:3) {
res_GLM1 <- glmLRT(fit, coef = ind+1)
pvals_GLM1 <- c(pvals_GLM1, res_GLM1$table$PValue,
p.adjust(res_GLM1$table$PValue, method = "BH"))
cat("Top 10 DEG for 'wt' vs '", all_conditions[ind], "':\n")
print(topTags(res_GLM1))
cur_res <- decideTestsDGE(res_GLM1, adjust.method = "BH", p.value = 0.05)
print(cur_res)
sel_deg <- which(cur_res[ ,1] != 0)
cur_res <- cbind(rownames(raw_counts_wn)[sel_deg], cur_res[sel_deg,1],
rep(all_conditions[ind], length(sel_deg)))
DEG_GLM1 <- rbind(DEG_GLM1, cur_res)
}
## Top 10 DEG for 'wt' vs ' mut1 ':
## Coefficient: groupmut1
## logFC logCPM LR PValue FDR
## Medtr7g079200.1 10.406886 5.655526 237.35327 1.485330e-53 4.146448e-49
## Medtr6g034805.1 -5.629671 3.149726 177.65306 1.577141e-40 2.201374e-36
## Medtr2g028530.1 -9.344921 2.667981 149.38073 2.367662e-34 2.203189e-30
## Medtr6g051975.1 8.687375 1.917974 128.19714 1.016301e-29 7.092767e-26
## Medtr6g445300.1 -8.164661 1.570747 89.69065 2.784673e-21 1.554739e-17
## Medtr2g436330.1 -2.315931 3.962352 84.01861 4.901388e-20 2.280452e-16
## Medtr2g054640.1 1.988198 6.048253 74.99260 4.724826e-18 1.884261e-14
## Medtr0002s0040.1 -4.149904 4.018977 63.85203 1.341238e-15 4.680250e-12
## Medtr3g043760.1 1.793755 4.098330 58.45934 2.075362e-14 6.437312e-11
## Medtr0196s0020.1 2.158905 5.643368 52.81686 3.661446e-13 1.022129e-09
## TestResults matrix
## [1] 0 0 0 0 0
## 27911 more rows ...
## Top 10 DEG for 'wt' vs ' mut2 ':
## Coefficient: groupmut2
## logFC logCPM LR PValue FDR
## Medtr7g079200.1 6.268296 5.655526 69.02311 9.731628e-17 2.716681e-12
## Medtr2g006070.1 2.090780 5.149846 39.47340 3.325582e-10 4.641847e-06
## Medtr2g054640.1 1.394952 6.048253 37.52270 9.035520e-10 8.407852e-06
## Medtr4g035895.1 -1.788964 5.999316 34.82308 3.610668e-09 2.519885e-05
## Medtr4g117170.1 -2.288755 2.788306 29.74795 4.920270e-08 2.747085e-04
## Medtr1g018840.1 -2.335843 4.731286 28.06178 1.175040e-07 5.467067e-04
## Medtr7g046260.1 -2.263240 2.882804 25.90610 3.584354e-07 1.429441e-03
## Medtr7g116150.1 4.615098 2.544361 25.58139 4.241107e-07 1.479934e-03
## Medtr5g006710.1 -1.853209 5.174918 24.18845 8.735477e-07 2.375306e-03
## Medtr8g035810.1 -1.771165 4.080911 24.15524 8.887385e-07 2.375306e-03
## TestResults matrix
## [1] 0 0 0 0 0
## 27911 more rows ...
## Top 10 DEG for 'wt' vs ' mut3 ':
## Coefficient: groupmut3
## logFC logCPM LR PValue FDR
## Medtr7g079200.1 10.387927 5.655526 236.46749 2.317253e-53 6.468845e-49
## Medtr6g034805.1 -5.380946 3.149726 170.06094 7.175589e-39 1.001569e-34
## Medtr2g028530.1 -9.344921 2.667981 146.78328 8.752026e-34 8.144052e-30
## Medtr0044s0130.1 -4.049583 3.911734 98.46413 3.309650e-23 2.309805e-19
## Medtr6g051975.1 8.083961 1.917974 96.47444 9.040336e-23 5.047401e-19
## Medtr6g445300.1 -8.164661 1.570747 87.87981 6.955508e-21 3.236166e-17
## Medtr7g066070.1 -4.202221 3.470112 85.52089 2.292665e-20 9.143149e-17
## Medtr5g024095.1 -5.180186 3.732410 75.05921 4.568074e-18 1.544177e-14
## Medtr0002s0040.1 -4.722476 4.018977 74.88940 4.978360e-18 1.544177e-14
## Medtr3g055120.1 -3.806717 2.945337 59.59270 1.166674e-14 3.256886e-11
## TestResults matrix
## [1] 0 0 0 0 0
## 27911 more rows ...
DEG_GLM1 <- as.data.frame(DEG_GLM1)
names(DEG_GLM1) <- c("name", "UD", "condition")
listDEG_GLM1 <- unique(DEG_GLM1$name)
pvals_GLM1 <- data.frame("pvalue" = pvals_GLM1,
"type" = rep(rep(c("raw", "adjusted"),
each = nrow(raw_counts_wn)),
length(all_conditions)),
"condition" = rep(all_conditions,
each=nrow(raw_counts_wn)*2))
The distributions of p-values and adjusted p-values for all three mutant types are obtained with:
p <- ggplot(data = pvals_GLM1, aes(x = pvalue, fill = type))
p <- p + geom_histogram()
p <- p + theme_bw()
p <- p + ggtitle("Histogram of raw/adjusted p-values for exact test")
p <- p + facet_grid(type ~ condition)
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(),
axis.ticks.x = element_blank())
print(p)
The total number of DEGs for each mutant type, the distribution of up/down regulated DEGs for each mutant type and the number of (unique) DEGs among the three mutant types are:
table(DEG_GLM1$condition)
##
## mut1 mut2 mut3
## 329 75 165
table(DEG_GLM1$condition, DEG_GLM1$UD) # 1 means 'up-regulated'
##
## 1 -1
## mut1 243 86
## mut2 21 54
## mut3 54 111
length(listDEG_GLM1)
## [1] 501
Compared to what was found with the Exact test, the number of obtained DEGs is of the same order (only a bit larger). A more precise comparison is made in the last part of this section.
The previous model does not take into account all information about the experimental design. Only the group effect has been tested whereas an additional effect might influence the expression level: the replicat effect. In this section we fit a model with an additive contribution of both effects. This problem is equivalent to introduce a (fixed) individual effect in the model. In this situation, the average level of \(\log\) normalized counts is expressed as: \(\log(\lambda_{g,\textrm{rep},\textrm{mutant}}) = \lambda_{g0} + \beta_{g1} \mathbb{I}_{\textrm{rep}_1} + \beta_{g2} \mathbb{I}_{\textrm{rep}_2} + \beta_{g3} \mathbb{I}_{\textrm{mutant}_1} + \beta_{g4} \mathbb{I}_{\textrm{mutant}_2} + \beta_{g5} \mathbb{I}_{\textrm{mutant}_3}\) and DEGs for each of the three mutant types are obtained by testing coefficients \(\beta_{g3}\), \(\beta_{g4}\) and \(\beta_{g5}\).
The method is performed similarly as before: a DGEList is created and normalization factors are obtained with the TMM approach. Then a design matrix that incorporates the two factors in an additive model is created:
# create dataset with 'wt' and current mutant and normalize (TMM)
cur_dge <- DGEList(raw_counts_wn, group = design$group)
cur_dge <- calcNormFactors(cur_dge, method = "TMM")
# create design matrix
design_matrix <- model.matrix(~ design$replicat + group)
design_matrix
## (Intercept) design$replicatrepbio2 design$replicatrepbio3 groupmut1
## 1 1 0 0 0
## 2 1 1 0 0
## 3 1 0 1 0
## 4 1 0 0 1
## 5 1 1 0 1
## 6 1 0 1 1
## 7 1 0 0 0
## 8 1 1 0 0
## 9 1 0 1 0
## 10 1 0 0 0
## 11 1 1 0 0
## 12 1 0 1 0
## groupmut2 groupmut3
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## 7 1 0
## 8 1 0
## 9 1 0
## 10 0 1
## 11 0 1
## 12 0 1
## attr(,"assign")
## [1] 0 1 1 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$`design$replicat`
## [1] "contr.treatment"
##
## attr(,"contrasts")$group
## [1] "contr.treatment"
The dispersion is then estimated using the same approach as in the previous section:
cur_dge <- estimateGLMCommonDisp(cur_dge, design_matrix)
cur_dge <- estimateGLMTrendedDisp(cur_dge, design_matrix)
cur_dge <- estimateGLMTagwiseDisp(cur_dge, design_matrix)
plotBCV(cur_dge, main = paste0("BCV plot"))
And finally, the model is fit with glmFit
and tests are performed on the coefficients of interest with glmLRT
. Results extracted with topTags
and decideTestsDGE
are saved in pvals_GLM2
and listDEG_GLM2
:
# GLM fit
fit <- glmFit(cur_dge, design_matrix)
DEG_GLM2 <- NULL
pvals_GLM2 <- NULL
for (ind in 1:3) {
res_GLM2 <- glmLRT(fit, coef = ind+3)
pvals_GLM2 <- c(pvals_GLM2, res_GLM2$table$PValue,
p.adjust(res_GLM2$table$PValue, method = "BH"))
cat("Top 10 DEG for 'wt' vs '", all_conditions[ind], "':\n")
print(topTags(res_GLM2))
cur_res <- decideTestsDGE(res_GLM2, adjust.method = "BH", p.value = 0.05)
print(cur_res)
sel_deg <- which(cur_res[ ,1] != 0)
cur_res <- cbind(rownames(raw_counts_wn)[sel_deg], cur_res[sel_deg,1],
rep(all_conditions[ind], length(sel_deg)))
DEG_GLM2 <- rbind(DEG_GLM2, cur_res)
}
## Top 10 DEG for 'wt' vs ' mut1 ':
## Coefficient: groupmut1
## logFC logCPM LR PValue FDR
## Medtr7g079200.1 10.474995 5.655524 258.39513 3.840109e-58 1.072005e-53
## Medtr6g034805.1 -5.629742 3.149727 153.69225 2.703860e-35 3.774048e-31
## Medtr2g028530.1 -9.353618 2.667978 148.21874 4.249283e-34 3.954099e-30
## Medtr6g051975.1 8.687313 1.917958 116.21369 4.267458e-27 2.978259e-23
## Medtr6g445300.1 -8.180778 1.570739 85.50786 2.307823e-20 1.288504e-16
## Medtr2g436330.1 -2.323846 3.962353 83.96557 5.034671e-20 2.342465e-16
## Medtr0002s0040.1 -4.276996 4.018988 75.21492 4.221655e-18 1.683596e-14
## Medtr2g054640.1 1.997224 6.048253 70.74996 4.054917e-17 1.414963e-13
## Medtr0024s0070.1 7.031074 1.392281 69.41668 7.971152e-17 2.472474e-13
## Medtr0196s0020.1 2.164260 5.643357 54.42308 1.616534e-13 4.512717e-10
## TestResults matrix
## [1] 0 0 0 0 0
## 27911 more rows ...
## Top 10 DEG for 'wt' vs ' mut2 ':
## Coefficient: groupmut2
## logFC logCPM LR PValue FDR
## Medtr7g079200.1 6.165822 5.6555242 67.77259 1.834830e-16 5.122111e-12
## Medtr2g006070.1 2.172786 5.1498495 38.93033 4.392029e-10 6.130394e-06
## Medtr5g069205.1 -7.592679 0.4681538 35.35716 2.744580e-09 2.232755e-05
## Medtr2g054640.1 1.397947 6.0482526 35.05864 3.199248e-09 2.232755e-05
## Medtr7g006245.1 -6.932746 1.5022720 33.74887 6.270537e-09 3.500966e-05
## Medtr4g082315.1 -6.207794 -0.5376342 31.93052 1.597868e-08 7.434345e-05
## Medtr4g035895.1 -1.742579 5.9993076 30.56317 3.231701e-08 1.288802e-04
## Medtr4g117170.1 -2.253610 2.7882773 25.98388 3.442802e-07 1.084179e-03
## Medtr1g019080.1 -3.558257 2.4657083 25.95464 3.495348e-07 1.084179e-03
## Medtr0508s0010.1 1.517946 3.1080756 25.06802 5.534317e-07 1.432564e-03
## TestResults matrix
## [1] 0 0 0 0 0
## 27911 more rows ...
## Top 10 DEG for 'wt' vs ' mut3 ':
## Coefficient: groupmut3
## logFC logCPM LR PValue FDR
## Medtr7g079200.1 10.480542 5.655524 258.40033 3.830090e-58 1.069208e-53
## Medtr6g034805.1 -5.409679 3.149727 147.54793 5.955959e-34 8.313327e-30
## Medtr2g028530.1 -9.324268 2.667978 145.11142 2.030493e-33 1.889441e-29
## Medtr0044s0130.1 -4.008075 3.911732 105.42133 9.874120e-25 6.891149e-21
## Medtr5g024095.1 -5.143842 3.732409 90.00130 2.380041e-21 1.328824e-17
## Medtr6g051975.1 8.083946 1.917958 88.99814 3.951833e-21 1.683232e-17
## Medtr7g066070.1 -4.144825 3.470096 88.86791 4.220741e-21 1.683232e-17
## Medtr6g445300.1 -8.156283 1.570739 83.45149 6.529892e-20 2.278606e-16
## Medtr0002s0040.1 -4.643405 4.018988 81.37178 1.870091e-19 5.800606e-16
## Medtr3g055120.1 -3.742504 2.945316 62.19284 3.114179e-15 8.693541e-12
## TestResults matrix
## [1] 0 0 0 0 0
## 27911 more rows ...
DEG_GLM2 <- as.data.frame(DEG_GLM2)
names(DEG_GLM2) <- c("name", "UD", "condition")
listDEG_GLM2 <- unique(DEG_GLM2$name)
pvals_GLM2 <- data.frame("pvalue" = pvals_GLM2,
"type" = rep(rep(c("raw", "adjusted"),
each = nrow(raw_counts_wn)),
length(all_conditions)),
"condition" = rep(all_conditions,
each=nrow(raw_counts_wn)*2))
The distribution of p-values is given by:
p <- ggplot(data = pvals_GLM2, aes(x = pvalue, fill = type))
p <- p + geom_histogram()
p <- p + theme_bw()
p <- p + ggtitle("Histogram of raw/adjusted p-values for exact test")
p <- p + facet_grid(type ~ condition)
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(),
axis.ticks.x = element_blank())
print(p)
Finally, the total number of DEGs for each mutant type, the distribution of up/down regulated DEGs for each mutant type and the number of (unique) DEGs among the three mutant types are:
table(DEG_GLM2$condition)
##
## mut1 mut2 mut3
## 327 68 190
table(DEG_GLM2$condition, DEG_GLM2$UD) # 1 means 'up-regulated'
##
## 1 -1
## mut1 247 80
## mut2 21 47
## mut3 74 116
length(listDEG_GLM2)
## [1] 504
Again, the number of obtained DEGs is of the same order (only very slightly larger) than what was found with the other two methods.
In this last approach, the limma package is used to fit a linear model on transformed data. Similarly as in the previous section, the linear model is designed to have two additive effects, a group effect and a replicat effect.
This approach starts with the definition of a DGEList object as previsously and normalization factors are computed with the TMM approach. The function voom
is then used to transform the data with the same design matrix as in the previous model:
cur_dge <- DGEList(raw_counts_wn)
cur_dge <- calcNormFactors(cur_dge, method = "TMM")
vdge <- voom(cur_dge, design_matrix, plot = TRUE)
Then, the function lmFit
is invoked to fit the model with the same design matrix and finally, the coefficients are shrinked with a Bayesian approach implemented in eBayes
. The function decideTests
allows to extract the DEGs after multiple testing correction (DEGs are indicated in the column corresponding to the tested condition with a 1, for up regulated genes, or a -1, for down regulated genes):
# LM fit
fit <- lmFit(vdge, design_matrix)
fit <- eBayes(fit)
cur_gres <- decideTests(fit, adjust.method = "BH", p.value = 0.05)
head(cur_gres)
## (Intercept) design$replicatrepbio2 design$replicatrepbio3
## Medtr0001s0010.1 -1 0 0
## Medtr0001s0200.1 -1 0 0
## Medtr0001s0260.1 1 0 0
## Medtr0001s0360.1 -1 0 0
## Medtr0001s0490.1 -1 0 0
## Medtr0002s0040.1 1 0 0
## groupmut1 groupmut2 groupmut3
## Medtr0001s0010.1 0 0 0
## Medtr0001s0200.1 0 0 0
## Medtr0001s0260.1 0 0 0
## Medtr0001s0360.1 0 0 0
## Medtr0001s0490.1 0 0 0
## Medtr0002s0040.1 0 0 0
DEGs corresponding to each of the three mutant types and corresponding status (up/down regulated) as well as the p-values are saved in DEG_voom
and pvals_voom
:
DEG_voom <- NULL
pvals_voom <- NULL
for (ind in 1:3) {
res_voom <- fit$p.value
pvals_voom <- c(pvals_voom, res_voom[ ,ind+3],
p.adjust(res_voom[ ,ind+3], method = "BH"))
cat("Top 10 DEG for 'wt' vs '", all_conditions[ind], "':\n")
print(topTable(fit, coef = ind+3))
sel_deg <- which(cur_gres[ ,ind+3] != 0)
cur_res <- cbind(rownames(raw_counts_wn)[sel_deg],
cur_gres[sel_deg,ind+3],
rep(all_conditions[ind], length(sel_deg)))
DEG_voom <- rbind(DEG_voom, cur_res)
}
## Top 10 DEG for 'wt' vs ' mut1 ':
## logFC AveExpr t P.Value
## Medtr6g051975.1 6.739374 -0.43009811 20.245425 4.867541e-10
## Medtr2g028530.1 -7.342487 -0.03160517 -19.031312 9.429348e-10
## Medtr2g436330.1 -2.307696 3.63095676 -11.827159 1.383082e-07
## Medtr6g445300.1 -6.193881 -0.61791117 -16.455440 4.423730e-09
## Medtr7g079200.1 9.762553 3.00431424 13.080276 4.901524e-08
## Medtr1g026140.1 5.141030 -1.94784797 13.347305 3.974531e-08
## Medtr2g054640.1 1.998888 5.83811133 9.064372 1.990549e-06
## Medtr1g094840.1 1.543271 3.28196917 9.007936 2.115814e-06
## Medtr0024s0070.1 6.024795 -1.25354315 10.237915 5.964608e-07
## Medtr6g034805.1 -5.447605 1.26135305 -9.904975 8.296906e-07
## adj.P.Val B
## Medtr6g051975.1 1.316148e-05 8.472273
## Medtr2g028530.1 1.316148e-05 8.388615
## Medtr2g436330.1 6.435019e-04 7.741279
## Medtr6g445300.1 4.116429e-05 7.578494
## Medtr7g079200.1 2.736619e-04 6.817848
## Medtr1g026140.1 2.736619e-04 6.290930
## Medtr2g054640.1 4.543466e-03 5.443729
## Medtr1g094840.1 4.543466e-03 5.391351
## Medtr0024s0070.1 2.081350e-03 5.153579
## Medtr6g034805.1 2.316164e-03 5.012548
## Top 10 DEG for 'wt' vs ' mut2 ':
## logFC AveExpr t P.Value adj.P.Val
## Medtr2g054640.1 1.4013639 5.838111 6.224875 6.555715e-05 0.4575234
## Medtr0508s0010.1 1.5052159 2.813566 5.898000 1.044513e-04 0.4859770
## Medtr2g006070.1 2.2098591 4.829084 5.644011 1.515410e-04 0.5860616
## Medtr5g063740.1 -0.9408826 4.385796 -5.419138 2.122790e-04 0.5860616
## Medtr4g104170.1 -0.8233931 5.508924 -5.390001 2.218708e-04 0.5860616
## Medtr6g060340.1 1.0231411 3.078198 5.546134 1.753314e-04 0.5860616
## Medtr8g063830.1 -1.1594136 5.241249 -5.306687 2.519250e-04 0.5860616
## Medtr4g104510.1 0.8412229 4.870297 5.082582 3.562885e-04 0.6024473
## Medtr4g035895.1 -1.7142203 5.760276 -5.044998 3.778768e-04 0.6024473
## Medtr8g064370.1 1.2056288 5.229920 4.944765 4.424966e-04 0.6024473
## B
## Medtr2g054640.1 2.0827830
## Medtr0508s0010.1 1.2864437
## Medtr2g006070.1 1.1988973
## Medtr5g063740.1 0.9635849
## Medtr4g104170.1 0.9451127
## Medtr6g060340.1 0.9207591
## Medtr8g063830.1 0.8258245
## Medtr4g104510.1 0.5015316
## Medtr4g035895.1 0.4345313
## Medtr8g064370.1 0.2972860
## Top 10 DEG for 'wt' vs ' mut3 ':
## logFC AveExpr t P.Value
## Medtr2g028530.1 -7.292767 -0.03160517 -18.949440 9.873397e-10
## Medtr6g051975.1 6.144723 -0.43009811 17.919284 1.791176e-09
## Medtr6g445300.1 -6.142565 -0.61791117 -16.340956 4.762712e-09
## Medtr7g079200.1 9.745243 3.00431424 13.057438 4.991081e-08
## Medtr0079s0070.1 -5.041555 0.06658128 -11.292945 2.216452e-07
## Medtr6g034805.1 -5.711564 1.26135305 -10.392510 5.132504e-07
## Medtr2g436330.1 -1.381085 3.63095676 -8.310958 4.611795e-06
## Medtr4g075345.1 -4.888321 -1.36136651 -10.018936 7.403141e-07
## Medtr4g046737.1 1.441677 4.13880651 7.752055 8.932441e-06
## Medtr0044s0130.1 -3.916287 3.12999397 -8.280226 4.778367e-06
## adj.P.Val B
## Medtr2g028530.1 2.500124e-05 7.417968
## Medtr6g051975.1 2.500124e-05 6.902496
## Medtr6g445300.1 4.431862e-05 6.667024
## Medtr7g079200.1 3.483275e-04 6.120476
## Medtr0079s0070.1 1.237490e-03 4.777822
## Medtr6g034805.1 2.387983e-03 4.777653
## Medtr2g436330.1 1.212663e-02 4.632578
## Medtr4g075345.1 2.952372e-03 4.181997
## Medtr4g046737.1 1.858069e-02 4.017856
## Medtr0044s0130.1 1.212663e-02 3.696410
DEG_voom <- as.data.frame(DEG_voom)
names(DEG_voom) <- c("name", "UD", "condition")
listDEG_voom <- unique(DEG_voom$name)
pvals_voom <- data.frame("pvalue" = pvals_voom,
"type" = rep(rep(c("raw", "adjusted"),
each = nrow(raw_counts_wn)),
length(all_conditions)),
"condition" = rep(all_conditions,
each=nrow(raw_counts_wn)*2))
The distribution of p-values is given by:
p <- ggplot(data = pvals_voom, aes(x = pvalue, fill = type))
p <- p + geom_histogram()
p <- p + theme_bw()
p <- p + ggtitle("Histogram of raw/adjusted p-values for exact test")
p <- p + facet_grid(type ~ condition)
p <- p + theme(title = element_text(size=10), axis.text.x = element_blank(),
axis.ticks.x = element_blank())
print(p)
The total number of DEGs for each mutant type, the distribution of up/down regulated DEGs for each mutant type and the number of (unique) DEGs among the three mutant types are:
table(DEG_voom$condition)
##
## mut1 mut3
## 30 18
table(DEG_voom$condition, DEG_voom$UD) # 1 means 'up-regulated'
##
## 1 -1
## mut1 19 11
## mut3 5 13
length(listDEG_voom)
## [1] 35
The number of DEGs found with this approach is much smaller than what was obtained with the Negative Binomial models. In particular, the linear models does not find any DEG for mutant 2.
In this last part, a Venn diagram is displayed for the results (unique genes for any type of mutant) of the four approaches:
vd <- venn.diagram(x=list("Exact test" = listDEG_pairwiseExact,
"GLM\n group" = listDEG_GLM1,
"GLM\n group + replicate" = listDEG_GLM2,
"voom" = listDEG_voom),
fill = brewer.pal(4, "Set3"),
cat.col = c("darkgreen", "black", "darkblue", "darkred"),
cat.cex = 1.5, fontface="bold", filename=NULL)
grid.draw(vd)
This chart yields to several conclusions:
first, the results of voom are very consistant with the results of the GLM approach using the same design matrix (with two additive factors), only they are much more stringent (the number of genes found by that method is more than 10 times smaller than what was found by the GLM approach);
second, a large number of genes (\(321 + 33 = 354\) out of approximately 500) are in common between all three negative binomial approaches. As expected, the two more consistent methods are the two GLM approaches (with an additional number of \(73+2=75\) DEGs in common and \(31+41+59+16 = 147\) DEGs that are specific of a given model). The Exact Test has 31 more genes in common with the GLM on group only and 16 more DEGs in common with the GLM on group and replicate;
the 59 DEGs that were found exclusively by GLM with two additive factors might be interesting because they are DEGs which can not be found if the genotype effect is not included in the model.
To obtain results similar to mines, make sure that your session has identical features than the one used to compile this document:
session_info()
## Session info --------------------------------------------------------------
## setting value
## version R version 3.3.1 (2016-06-21)
## system x86_64, linux-gnu
## ui X11
## language en_US:en
## collate en_US.UTF-8
## tz <NA>
## date 2016-07-03
## Packages ------------------------------------------------------------------
## package * version date source
## annotate 1.48.0 2016-07-03 Bioconductor
## AnnotationDbi 1.32.3 2016-07-03 Bioconductor
## assertthat 0.1 2013-12-06 CRAN (R 3.3.1)
## Biobase * 2.30.0 2016-07-03 Bioconductor
## BiocGenerics * 0.16.1 2016-07-03 Bioconductor
## colorspace 1.2-6 2015-03-11 CRAN (R 3.3.1)
## corpcor 1.6.8 2015-07-08 CRAN (R 3.3.0)
## DBI 0.4-1 2016-05-08 CRAN (R 3.3.1)
## DESeq * 1.22.1 2016-07-03 Bioconductor
## devtools * 1.12.0 2016-06-24 CRAN (R 3.3.1)
## digest 0.6.9 2016-01-08 CRAN (R 3.3.1)
## dplyr 0.5.0 2016-06-24 CRAN (R 3.3.1)
## edgeR * 3.12.1 2016-07-03 Bioconductor
## ellipse 0.3-8 2013-04-13 CRAN (R 3.3.0)
## evaluate 0.9 2016-04-29 CRAN (R 3.3.0)
## formatR 1.4 2016-05-09 CRAN (R 3.3.0)
## futile.logger * 1.4.1 2015-04-20 CRAN (R 3.2.2)
## futile.options 1.0.0 2010-04-06 CRAN (R 3.2.2)
## genefilter 1.52.1 2016-07-03 Bioconductor
## geneplotter 1.48.0 2016-07-03 Bioconductor
## ggplot2 * 2.1.0 2016-03-01 CRAN (R 3.3.1)
## gridExtra * 2.2.1 2016-02-29 CRAN (R 3.3.0)
## gtable 0.2.0 2016-02-26 CRAN (R 3.3.1)
## htmltools 0.3.5 2016-03-21 CRAN (R 3.3.0)
## igraph 1.0.1 2015-06-26 CRAN (R 3.3.0)
## IRanges 2.4.8 2016-07-03 Bioconductor
## knitr 1.13 2016-05-09 CRAN (R 3.3.0)
## labeling 0.3 2014-08-23 CRAN (R 3.2.2)
## lambda.r 1.1.7 2015-03-20 CRAN (R 3.2.2)
## lattice * 0.20-33 2015-07-14 CRAN (R 3.2.1)
## limma * 3.26.9 2016-07-03 Bioconductor
## locfit * 1.5-9.1 2013-04-20 CRAN (R 3.2.2)
## magrittr 1.5 2014-11-22 CRAN (R 3.2.2)
## MASS * 7.3-45 2016-04-21 CRAN (R 3.3.1)
## Matrix 1.2-6 2016-05-02 CRAN (R 3.3.1)
## memoise 1.0.0 2016-01-29 CRAN (R 3.3.0)
## mixOmics * 6.0.0 2016-06-14 CRAN (R 3.3.1)
## munsell 0.4.3 2016-02-13 CRAN (R 3.3.1)
## plyr 1.8.4 2016-06-08 CRAN (R 3.3.1)
## R6 2.1.2 2016-01-26 CRAN (R 3.3.0)
## RColorBrewer * 1.1-2 2014-12-07 CRAN (R 3.3.1)
## Rcpp 0.12.5 2016-05-14 CRAN (R 3.3.1)
## reshape2 * 1.4.1 2014-12-06 CRAN (R 3.2.2)
## rgl 0.94.1131 2014-09-08 CRAN (R 3.1.1)
## rmarkdown 0.9.6 2016-05-01 CRAN (R 3.3.0)
## RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.2)
## S4Vectors 0.8.11 2016-07-03 Bioconductor
## scales 0.4.0 2016-02-26 CRAN (R 3.3.1)
## stringi 1.1.1 2016-05-27 CRAN (R 3.3.1)
## stringr 1.0.0 2015-04-30 CRAN (R 3.2.2)
## survival 2.39-5 2016-06-26 CRAN (R 3.3.1)
## tibble 1.0 2016-03-23 CRAN (R 3.3.1)
## tidyr 0.5.1 2016-06-14 CRAN (R 3.3.1)
## VennDiagram * 1.6.17 2016-04-18 CRAN (R 3.3.0)
## withr 1.0.2 2016-06-20 CRAN (R 3.3.1)
## XML 3.98-1.4 2016-03-01 CRAN (R 3.3.1)
## xtable 1.8-2 2016-02-05 CRAN (R 3.3.1)
## yaml 2.1.13 2014-06-12 CRAN (R 3.3.0)