Code
library("reshape2")
library("ggplot2")
library("GENIE3")
library("igraph")
library("PRROC")
library("rfPermute")
This practical’s aim is to perform network inference using GENIE3
(Huynh-Thu et al. 2010). The dataset used to illustrate this practical is an extract of the expression data and reconstructed network described in (Nicolas et al. 2012), that is related to the bacteria Bacillus subtilis. Note that gene and sample names have been anonymized.
Be also sure to have the proper R packages installed (usage of renv
is strongly recommended):
library("reshape2")
library("ggplot2")
library("GENIE3")
library("igraph")
library("PRROC")
library("rfPermute")
Expression data is included in the file ../data/expr.csv
and are loaded with:
<- read.table("../data/expr.csv", sep = "\t", header = TRUE)
expr dim(expr)
[1] 457 270
The data are organized with genes in rows and samples in columns, the first column being the gene name:
head(expr[, 1:10])
A global visualization of the gene expression is available using a heatmap:
<- melt(expr, id.vars = "Name")
df_heatmap names(df_heatmap) <- c("gene", "sample", "expression")
<- ggplot(df_heatmap, aes(sample, gene)) +
p geom_tile(aes(fill = expression), color = "white") +
scale_fill_gradient(low = "yellow", high = "red") +
ylab("genes ") + xlab("samples") + theme(axis.text = element_blank()) +
labs(fill = "Expression level")
p
Network inference will be performed with GENIE3
:
?GENIE3
GENIE3
requires an expression matrix with genes in rows and gene names as row names:
<- as.matrix(expr[, 2:ncol(expr)])
expr_matrix rownames(expr_matrix) <- expr$Name
<- expr_matrix
expr_matrix set.seed(1055)
<- GENIE3(expr_matrix, nTrees = 50, verbose = TRUE) res_GENIE3
Note: This simulation is given as a mere illustration of the process. However, we can not expect much from the results: * 50 trees is probably not enough to obtain a good performance; * \(\sigma\) factors, which are major regulators in bacteria, are not specified in the original data (where they could have been).
The result is then compared with the ground true network, available in ../data/net.rds
as an igraph
object:
<- readRDS("../data/net.rds")
ref_net ref_net
IGRAPH afc5784 DN-- 457 710 --
+ attr: name (v/c)
+ edges from afc5784 (vertex names):
[1] W0285->W0285 W0285->F0297 W0285->C0769 W0285->O0665 W0285->X0991
[6] W0285->K0548 W0285->Y0468 W0285->Z0620 W0285->X0574 W0285->P0189
[11] W0285->V0342 W0285->J0957 W0285->F0308 W0285->X0773 W0285->R0994
[16] W0285->Q0692 W0285->Z0954 W0285->D0661 W0285->S0907 W0285->Y0771
[21] W0285->A0935 W0285->O0136 W0285->V0123 W0285->G0870 W0285->J0696
[26] W0285->S0104 W0285->U0802 W0285->I0168 W0285->F0181 W0285->D0014
[31] W0285->V0861 W0285->N0624 W0285->X0018 W0285->D0148 W0285->Z0131
[36] W0285->P0275 W0285->F0851 W0285->Y0754 W0285->O0313 W0285->Z0109
+ ... omitted several edges
edge_density(ref_net)
[1] 0.003407041
The true network can be displayed using:
par(mar = rep(0, 4))
set.seed(1121)
plot(ref_net, vertex.size = 3, vertex.color = "lightgreen",
vertex.frame.color = "lightgreen", edge.color = "grey",
vertex.label = rep(NA, vcount(ref_net)))
For the sake of simplicity, we will further use the undirected version of the network. We also extract the corresponding adjacency matrix:
<- as.undirected(ref_net, mode = "collapse")
undirected_ref <- as_adj(undirected_ref, sparse = FALSE)
ref_adj diag(ref_adj) <- 0
Weight distribution of the solution is explored to visually set a relevant threshold:
<- c(res_GENIE3[upper.tri(res_GENIE3)],
all_weights lower.tri(res_GENIE3)])
res_GENIE3[<- data.frame("weights" = all_weights)
df <- ggplot(df, aes(x = weights)) + geom_histogram() + theme_bw() +
p scale_x_log10() + geom_vline(xintercept = 1e-2, color = "darkred")
p
The corresponding (undirected) network is then deduced:
<- res_GENIE3
net1 < 1e-2] <- 0
net1[res_GENIE3 >= 1e-2] <- 1
net1[res_GENIE3 <- graph_from_adjacency_matrix(net1, mode = "max")
net1 net1
IGRAPH 7138c93 UN-- 457 12142 --
+ attr: name (v/c)
+ edges from 7138c93 (vertex names):
[1] W0285--C0769 W0285--P0255 W0285--S0077 W0285--K0840 W0285--Y0311
[6] W0285--N0741 W0285--N0010 W0285--F0096 W0285--N0057 W0285--D0254
[11] W0285--U0658 W0285--Z0569 W0285--J0880 W0285--D0054 W0285--L0362
[16] W0285--E0486 W0285--E0301 W0285--Z0025 W0285--V0132 W0285--Y0083
[21] W0285--R0994 W0285--V0864 W0285--B0497 W0285--K0663 W0285--S0155
[26] W0285--R0764 W0285--O0113 W0285--V0123 W0285--X0082 W0285--C0834
[31] W0285--U0180 W0285--V0928 W0285--G0023 W0285--E0461 W0285--W0606
[36] W0285--N0763 W0285--X0912 W0285--P0487 W0285--E0983 W0285--F0717
+ ... omitted several edges
It has a density equal to:
edge_density(net1)
[1] 0.1165304
which is larger than the density of the true network.
Visually, it also looks quite different:
par(mar = rep(0, 4))
set.seed(1112)
plot(net1, vertex.size = 3, vertex.color = "lightgreen",
vertex.frame.color = "lightgreen", edge.color = "grey",
vertex.label = rep(NA, vcount(net1)))
It is mostly explained by a very different degree distribution (not using prior information on \(\sigma\) factor is probably instrumental in this difference):
<- data.frame("true" = degree(ref_net), "GENIE3" = degree(net1))
df <- ggplot(df, aes(x = true, y = GENIE3)) + geom_point(alpha = 0.4) +
p theme_bw()
p
Another approach would find a threshold based on a realistic value of the network density (obtained from the true network, here):
<- edge_density(undirected_ref)
ref_density <- pmax(res_GENIE3[upper.tri(res_GENIE3)],
max_weight t(res_GENIE3)[upper.tri(res_GENIE3)])
<- quantile(max_weight, probs = 1 - ref_density)
thresh thresh
99.31955%
0.05164564
This threshold is used to define a second network:
<- res_GENIE3
net2 < thresh] <- 0
net2[res_GENIE3 >= thresh] <- 1
net2[res_GENIE3 <- graph_from_adjacency_matrix(net2, mode = "max")
net2 net2
IGRAPH ec7c922 UN-- 457 709 --
+ attr: name (v/c)
+ edges from ec7c922 (vertex names):
[1] W0285--C0769 W0285--Y0311 W0285--N0010 W0285--V0123 W0285--F0717
[6] W0285--T0776 F0297--R0154 C0769--X0082 C0769--L0809 C0769--G0023
[11] C0769--P0487 C0769--X0681 C0769--S0522 C0769--T0353 E0873--L0863
[16] P0255--I0595 P0255--B0357 P0255--V0583 P0255--C0521 O0665--N0927
[21] O0665--S0636 O0665--H0534 R0740--P0524 R0740--U0911 P0524--U0911
[26] J0702--Z0491 J0702--O0570 J0702--I0968 X0991--C0593 X0991--R0994
[31] X0991--R0578 X0991--K0643 D0934--Z0569 D0934--R0910 D0934--H0859
[36] T0196--L0743 T0196--V0933 Y0468--J0269 V0395--J0880 V0395--E0301
+ ... omitted several edges
This network has a density close to the true network, as expected:
edge_density(net2)
[1] 0.006804484
But it still does not match the true network visually:
par(mar = rep(0, 4))
set.seed(1112)
plot(net2, vertex.size = 3, vertex.color = "lightgreen",
vertex.frame.color = "lightgreen", edge.color = "grey",
vertex.label = rep(NA, vcount(net1)))
And, again, it has a very different degree distribution:
<- data.frame("true" = degree(ref_net), "GENIE3" = degree(net2))
df <- ggplot(df, aes(x = true, y = GENIE3)) + geom_point(alpha = 0.4) +
p theme_bw()
p
ROC and PR curves are finally obtained, using the set of weights (maximum between the two weights for each edge) for positive (true) and negative (wrong) edges:
<- res_GENIE3
true_edges == 0] <- 0
true_edges[ref_adj <- pmax(true_edges[upper.tri(true_edges)],
true_edges t(true_edges)[upper.tri(true_edges)])
<- true_edges[true_edges != 0]
true_edges
<- res_GENIE3
wrong_edges == 1] <- 0
wrong_edges[ref_adj <- pmax(wrong_edges[upper.tri(wrong_edges)],
wrong_edges t(wrong_edges)[upper.tri(wrong_edges)])
<- wrong_edges[wrong_edges != 0]
wrong_edges
<- roc.curve(scores.class0 = wrong_edges, scores.class1 = true_edges,
roc curve = TRUE)
roc
ROC curve
Area under curve:
0.501703
Curve for scores from 1.636326e-06 to 0.1349187
( can be plotted with plot(x) )
plot(roc)
<- pr.curve(scores.class0 = wrong_edges, scores.class1 = true_edges,
pr curve = TRUE)
pr
Precision-recall curve
Area under curve (Integral):
0.9929469
Area under curve (Davis & Goadrich):
0.9929468
Curve for scores from 1.636326e-06 to 0.1349187
( can be plotted with plot(x) )
plot(pr)
rfPermute
can be used to obtain a \(p\)-value describing the confidence of each edge. It requires to be run on each gene used as a target for the prediction, in turn, which would be too long for this practical. We just illustrate its use with one of the gene, with a very reduced (and not relevant) number of permutations:
<- data.frame(t(expr_matrix))
expr_df set.seed(1241)
<- rfPermute(W0285 ~ ., data = expr_df, num.rep = 10,
res_rfPermute p = round(sqrt(500)))
The results contain a \(p\)-value field that is summarized below:
res_rfPermute
An rfPermute model
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 152
No. of permutation replicates: 10
Start time: 2023-09-18 13:06:53
End time: 2023-09-18 13:09:12
Run time: 2.32 mins
summary(res_rfPermute$pval[, 2, "scaled"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.09091 1.00000 1.00000 0.89872 1.00000 1.00000
Using a threshold of 10% on the \(p\)-value, we obtain the predictors of “W0285” that we can compare to the true neighborhood:
<- neighborhood(undirected_ref, order = 1, nodes = "W0285")[[1]]$name
ref_nei cat("reference:", ref_nei, "\n")
reference: W0285 F0297 C0769 O0665 X0991 K0548 Y0468 Z0620 X0574 P0189 V0342 J0957 F0308 X0773 R0994 Q0692 Z0954 D0661 S0907 Y0771 A0935 O0136 V0123 G0870 J0696 S0104 U0802 I0168 F0181 D0014 V0861 N0624 X0018 D0148 Z0131 P0275 F0851 Y0754 O0313 Z0109 J0269 L0777 H0545 U0226 E0325 F0118 R0995 U0641 E0434 B0166 O0495 N0566 B0883 Z0078 W0606 J0233 N0278 Z0843 G0686 I0949 X0293 Y0904 N0763 N0896 S0139 N0094 Z0444 A0410 B0719 R0480 P0046 Q0690 R0906 C0200 X0912 W0547 C0837 D0649 M0953 U0978 F0161 M0393 H0637 K0305 O0093 O0789 B0300 U0423 A0241 A0437 P0206 S0759 P0407 S0176 R0578 T0352 Y0723 K0383 X0916 J0779 K0466 G0932 W0720 K0219 S0095 N0458 R0154 T0042 L0040 P0487 E0983 S0636 N0704 S0435 F0717 E0542 S0421 A0781 T0869 H0534 C0450 J0612 S0373 G0669 N0586 A0494 A0289 R0296
<- names(which(res_rfPermute$pval[, 2, "scaled"] < 0.1))
pred_nei cat("predicted:", pred_nei, "\n")
predicted: S0077 K0840 Y0311 N0741 N0010 P0156 U0658 E0301 C0799 M0635 Y0083 K0563 V0123 X0082 C0834 E0325 G0023 L0040 F0717 N0671 J0732 B0594 K0756 T0776 S0522 N0430 L0129
The intersection between the two neighborhood is finally obtained with:
intersect(ref_nei, pred_nei)
[1] "V0123" "E0325" "L0040" "F0717"
Note: Again, the previous analysis is not satisfactory. What would be sounder would be to use the results of GENIE3
as a prior list of interactions to test and to perform rfPermute
only to edges that have been selected during this prior step.
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Paris
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rfPermute_2.5.2 PRROC_1.3.1 igraph_1.5.1 GENIE3_1.22.0
[5] ggplot2_3.4.3 reshape2_1.4.4
loaded via a namespace (and not attached):
[1] randomForest_4.7-1.1 gtable_0.3.4 jsonlite_1.8.7
[4] dplyr_1.1.3 compiler_4.3.1 tidyselect_1.2.0
[7] Rcpp_1.0.11 stringr_1.5.0 swfscMisc_1.6.5
[10] parallel_4.3.1 scales_1.2.1 yaml_2.3.7
[13] fastmap_1.1.1 R6_2.5.1 plyr_1.8.8
[16] labeling_0.4.3 generics_0.1.3 knitr_1.43
[19] tibble_3.2.1 munsell_0.5.0 pillar_1.9.0
[22] rlang_1.1.1 utf8_1.2.3 stringi_1.7.12
[25] xfun_0.40 cli_3.6.1 withr_2.5.0
[28] magrittr_2.0.3 digest_0.6.33 grid_4.3.1
[31] rstudioapi_0.15.0 lifecycle_1.0.3 vctrs_0.6.3
[34] evaluate_0.21 glue_1.6.2 farver_2.1.1
[37] codetools_0.2-19 fansi_1.0.4 colorspace_2.1-0
[40] rmarkdown_2.24 tools_4.3.1 pkgconfig_2.0.3
[43] htmltools_0.5.6