Nathalie Vialaneix and Sébastien Déjean
12 October 2018
library(tidyverse)
What is tidyverse?
tidyverse is a collection of packages that aim to make data manipulation easier:
ggplot2: to make graphics
dplyr: to transform and summarize the content of datasets
tidyr: to transform the structure of data tables (see also reshape2)
purrr: to allow better functional programming in R
tibble: to provide a better framework for data tables
readr: to read datasets faster
A new operator
x %>% f() means f(x)
x %>% f(y) means f(x, y)
x %>% f(y, .) means f(y, x)
x <- 3
log(x)
[1] 1.098612
Exercise: use pipe to perform the same operation
x %>% log()
[1] 1.098612
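The three rewriting rules for %>% can be checked on concrete values. The sketch below uses x = 8 and y = 2 (illustrative values, not taken from the slides) together with base R's sqrt() and log(); the names r1, r2 and r3 are made up for the example. The pipe is loaded here from magrittr, and it is also available after library(tidyverse).

```r
library(magrittr)  # provides %>%

x <- 8
y <- 2

# x %>% f() means f(x)
r1 <- x %>% sqrt()      # same as sqrt(8)

# x %>% f(y) means f(x, y): x is inserted as the first argument
r2 <- x %>% log(y)      # same as log(8, 2), the base-2 logarithm of 8

# x %>% f(y, .) means f(y, x): the dot marks where x goes
r3 <- x %>% log(y, .)   # same as log(2, 8), the base-8 logarithm of 2

c(r1, r2, r3)
```

The second and third calls differ only by the dot, yet they compute different quantities, which is why the placeholder matters for functions whose arguments are not interchangeable.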
dplyr, like ggplot2, is based on a grammar, here a grammar of data manipulation. In this grammar, the main actions are:
select(): to select variables based on their names
arrange(): to modify the ordering of a dataset
filter(): to select observations based on their values
mutate(): to add new variables to a dataset (computed from existing variables)
summarise(): to summarize multiple values (usually combined with group_by())
data(diamonds)
set.seed(25091309)
sample1000 <- sample(1:nrow(diamonds), 1000, replace = FALSE)
diamonds <- diamonds[sample1000, ]
small_diam <- diamonds %>% select(cut, color, price)
small_diam
# A tibble: 1,000 x 3
cut color price
<ord> <ord> <int>
1 Very Good D 1658
2 Premium G 10766
3 Premium I 6173
4 Ideal E 5962
5 Premium F 2839
6 Premium H 5266
7 Ideal E 723
8 Premium G 11032
9 Fair D 4441
10 Good E 1246
# ... with 990 more rows
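Besides listing column names, select() accepts a few other ways of designating columns. The sketch below runs on a small made-up table (toy and its values are hypothetical, not the diamonds sample) and shows selection by name, by exclusion, by range and with the starts_with() helper.

```r
library(dplyr)
library(tibble)

# Hypothetical toy table with a few diamonds-like columns
toy <- tibble(
  carat = c(0.3, 1.2),
  cut   = c("Ideal", "Good"),
  price = c(400L, 5000L),
  x     = c(4.3, 6.8),
  y     = c(4.4, 6.9)
)

by_name   <- toy %>% select(cut, price)        # keep columns by name
by_drop   <- toy %>% select(-x, -y)            # drop columns with a minus sign
by_range  <- toy %>% select(carat:price)       # keep a range of adjacent columns
by_helper <- toy %>% select(starts_with("c"))  # keep columns whose name starts with "c"

names(by_helper)
```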
ordered_diams <- diamonds %>% arrange(desc(color))
ordered_diams
# A tibble: 1,000 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.310 Ideal J VS2 62.1 53.6 400 4.36 4.39 2.72
2 1.23 Ideal J VS2 61.8 56.0 4986 6.87 6.81 4.23
3 1.55 Ideal J SI1 62.4 57.0 8301 7.50 7.45 4.67
4 0.400 Very Good J VS1 63.4 58.0 810 4.64 4.61 2.93
5 0.710 Premium J VS2 62.8 61.0 1917 5.71 5.63 3.56
6 1.00 Good J SI1 58.7 62.0 3614 6.47 6.51 3.81
7 0.550 Very Good J VS2 62.6 57.0 1062 5.17 5.21 3.25
8 1.24 Very Good J SI2 63.3 60.0 3908 6.83 6.75 4.29
9 1.56 Good J VS2 62.3 64.0 8107 7.41 7.36 4.60
10 1.75 Ideal J VS2 62.1 56.0 9890 7.74 7.69 4.79
# ... with 990 more rows
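Because color is an ordered factor, desc(color) sorts by decreasing level (J first), not alphabetically on labels. arrange() also accepts several keys, with later keys breaking ties among earlier ones. A minimal sketch on a made-up table (toy and its values are hypothetical):

```r
library(dplyr)
library(tibble)

# Hypothetical toy table; color is an ordered factor with D < E < J
toy <- tibble(
  color = factor(c("E", "J", "D", "J"), levels = c("D", "E", "J"), ordered = TRUE),
  price = c(700L, 400L, 1600L, 5000L)
)

# Sort by decreasing color level, then by increasing price within each color
by_color <- toy %>% arrange(desc(color), price)
by_color
```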
vg_diam <- diamonds %>% filter(cut == "Ideal")
summary(vg_diam)
carat cut color clarity depth
Min. :0.2300 Fair : 0 D:43 VS2 :84 Min. :58.50
1st Qu.:0.3400 Good : 0 E:65 SI1 :61 1st Qu.:61.30
Median :0.5300 Very Good: 0 F:68 VS1 :58 Median :61.80
Mean :0.6758 Premium : 0 G:75 SI2 :47 Mean :61.66
3rd Qu.:0.9100 Ideal :357 H:50 VVS1 :46 3rd Qu.:62.20
Max. :2.1600 I:37 VVS2 :41 Max. :63.70
J:19 (Other):20
table price x y
Min. :53.00 Min. : 360 Min. :3.970 Min. :3.990
1st Qu.:55.00 1st Qu.: 855 1st Qu.:4.490 1st Qu.:4.470
Median :56.00 Median : 1761 Median :5.220 Median :5.240
Mean :56.09 Mean : 3123 Mean :5.439 Mean :5.453
3rd Qu.:57.00 3rd Qu.: 4008 3rd Qu.:6.220 3rd Qu.:6.230
Max. :60.00 Max. :18682 Max. :8.400 Max. :8.340
z
Min. :2.430
1st Qu.:2.760
Median :3.210
Mean :3.358
3rd Qu.:3.860
Max. :5.100
vg_diam
# A tibble: 357 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1.11 Ideal E SI2 60.6 56.0 5962 6.76 6.78 4.10
2 0.330 Ideal E VS2 61.5 57.0 723 4.41 4.47 2.73
3 0.310 Ideal D VS2 62.5 56.0 734 4.29 4.32 2.69
4 0.310 Ideal J VS2 62.1 53.6 400 4.36 4.39 2.72
5 0.530 Ideal D VS2 60.9 57.0 1783 5.17 5.24 3.17
6 0.330 Ideal F VVS1 61.9 56.0 955 4.42 4.47 2.75
7 0.380 Ideal D VS2 62.0 56.0 998 4.68 4.64 2.89
8 0.790 Ideal E SI1 62.0 57.0 3384 5.92 5.96 3.68
9 1.23 Ideal J VS2 61.8 56.0 4986 6.87 6.81 4.23
10 0.380 Ideal F VVS1 62.3 54.0 1096 4.64 4.70 2.91
# ... with 347 more rows
Exercise: Make a dataset with the top 10% most expensive diamonds.
top_exp <- diamonds %>% filter(price >= quantile(price, probs = 0.9))
summary(top_exp$price)
Min. 1st Qu. Median Mean 3rd Qu. Max.
9641 11168 13113 13399 15230 18788
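To see what the quantile-based filter keeps, it can help to run it on a table small enough to check by hand. The sketch below assumes twenty made-up prices 1, 2, ..., 20 (toy is hypothetical); with R's default quantile definition the 90% quantile is 18.1, so only the prices 19 and 20 pass the filter.

```r
library(dplyr)
library(tibble)

# Hypothetical prices 1..20
toy <- tibble(price = 1:20)

# Inside filter(), quantile() is evaluated on the whole price column
top <- toy %>% filter(price >= quantile(price, probs = 0.9))
top$price
```

With tied prices or a sample size that is not a multiple of ten, the number of rows kept is not necessarily exactly 10% of the data.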
large_diam <- diamonds %>%
  mutate(ratio = price / carat,
         nothing = NA,
         weird = paste(cut, color, sep = "-")) %>%
  select(carat, price, color, cut, ratio, nothing, weird)
large_diam
# A tibble: 1,000 x 7
carat price color cut ratio nothing weird
<dbl> <int> <ord> <ord> <dbl> <lgl> <chr>
1 0.510 1658 D Very Good 3251. NA Very Good-D
2 1.56 10766 G Premium 6901. NA Premium-G
3 1.51 6173 I Premium 4088. NA Premium-I
4 1.11 5962 E Ideal 5371. NA Ideal-E
5 0.710 2839 F Premium 3999. NA Premium-F
6 1.12 5266 H Premium 4702. NA Premium-H
7 0.330 723 E Ideal 2191. NA Ideal-E
8 1.71 11032 G Premium 6451. NA Premium-G
9 1.03 4441 D Fair 4312. NA Fair-D
10 0.530 1246 E Good 2351. NA Good-E
# ... with 990 more rows
Exercise: Make a dataset with only the diamonds with color 'D' and two additional variables: the \( \log_{10} \) of their price and the combination of clarity and cut.
ld <- diamonds %>%
  filter(color == "D") %>%
  mutate(log10 = log10(price),
         combo = paste(cut, clarity, sep = "-")) %>%
  select(price, cut, clarity, color, log10, combo)
ld
# A tibble: 113 x 6
price cut clarity color log10 combo
<int> <ord> <ord> <ord> <dbl> <chr>
1 1658 Very Good VS2 D 3.22 Very Good-VS2
2 4441 Fair SI1 D 3.65 Fair-SI1
3 734 Ideal VS2 D 2.87 Ideal-VS2
4 717 Good SI1 D 2.86 Good-SI1
5 1752 Premium VS2 D 3.24 Premium-VS2
6 1783 Ideal VS2 D 3.25 Ideal-VS2
7 998 Ideal VS2 D 3.00 Ideal-VS2
8 2079 Ideal VS2 D 3.32 Ideal-VS2
9 1838 Good SI1 D 3.26 Good-SI1
10 1753 Ideal VS2 D 3.24 Ideal-VS2
# ... with 103 more rows
new_diams <- diamonds %>%
  summarise(av_depth = mean(depth),
            sd_depth = sd(depth))
new_diams
# A tibble: 1 x 2
av_depth sd_depth
<dbl> <dbl>
1 61.7 1.55
new_diams <- diamonds %>%
  group_by(color) %>%
  summarise(av_price = mean(price),
            sd_price = sd(price))
new_diams
# A tibble: 7 x 3
color av_price sd_price
<ord> <dbl> <dbl>
1 D 2852. 3101.
2 E 3424. 3531.
3 F 3061. 3374.
4 G 4178. 4193.
5 H 4020. 4138.
6 I 5701. 4611.
7 J 4816. 3746.
Exercise: Summarize (with mean, sd and frequency) the dataset for all combinations of color and cut for colors 'D' and 'E'.
new_diamb <- diamonds %>%
  filter(color %in% c("D", "E")) %>%
  group_by(color, cut) %>%
  summarise(av_price = mean(price),
            sd_price = sd(price),
            count = length(price))
new_diamb
# A tibble: 10 x 5
# Groups: color [?]
color cut av_price sd_price count
<ord> <ord> <dbl> <dbl> <int>
1 D Fair 6240. 6888. 4
2 D Good 3817. 3817. 10
3 D Very Good 3097. 2999. 32
4 D Premium 2723. 3052. 24
5 D Ideal 2202. 2363. 43
6 E Fair 2614. 1802. 2
7 E Good 5091. 5162. 19
8 E Very Good 3770. 3757. 58
9 E Premium 3707. 3563. 47
10 E Ideal 2449. 2407. 65
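The group size can also be obtained with dplyr's n(), which counts the rows of the current group without referring to a particular column. A minimal sketch on a made-up table (toy and its values are hypothetical), small enough to check the results by hand:

```r
library(dplyr)
library(tibble)

# Hypothetical toy table: three 'D' prices and two 'E' prices
toy <- tibble(
  color = c("D", "D", "D", "E", "E"),
  price = c(100, 200, 300, 400, 600)
)

stats <- toy %>%
  group_by(color) %>%
  summarise(av_price = mean(price),
            sd_price = sd(price),
            count = n())   # number of rows in each group
stats
```

Here group D has mean 200, standard deviation 100 and count 3, and group E has mean 500 and count 2.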
Useful for:
p <- ggplot(new_diamb, aes(x = cut, y = av_price, colour = color, group = color)) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymin = av_price - sd_price / sqrt(count),
                    ymax = av_price + sd_price / sqrt(count)))
p
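The error bars drawn by geom_errorbar() in this plot extend one standard error of the mean on each side of the group average. Writing \( \bar{x} \) for av_price, \( s \) for sd_price and \( n \) for count (notation introduced here for illustration), the quantities passed to ymin and ymax are:

```latex
\[
  \mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{n}},
  \qquad
  y_{\min} = \bar{x} - \frac{s}{\sqrt{n}},
  \qquad
  y_{\max} = \bar{x} + \frac{s}{\sqrt{n}}
\]
```

Groups with very few diamonds, such as the 'Fair' cuts, therefore get wide bars, because \( n \) is small.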
tidyr is used to tidy datasets so that: 1/ each variable is in a column; 2/ each observation is in a row; 3/ each value is in a cell. The main functions are:
gather(): gathers multiple columns into key-value pairs (where the key is the former column name)
spread(): takes two columns (key, value) and spreads them into multiple columns (one column for each key)
separate() and extract(): pull apart a column holding multiple values, based on a separator or a regular expression
grades <- tibble(
Name = c("Tommy", "Mary", "Gary", "Cathy"),
Sexage = c("m.15", "f.15", "m.16", "f.14"),
Math = c(10, 15, 16, 14),
Philo = c(11, 13, 10, 12),
English = c(12, 13, 17, 10)
)
grades
# A tibble: 4 x 5
Name Sexage Math Philo English
<chr> <chr> <dbl> <dbl> <dbl>
1 Tommy m.15 10. 11. 12.
2 Mary f.15 15. 13. 13.
3 Gary m.16 16. 10. 17.
4 Cathy f.14 14. 12. 10.
grades <- grades %>%
separate(Sexage, into = c("Sex", "Age")) # default separator is any nonalphanumeric character
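For comparison, the same split can be written with extract(), which uses a regular expression with one capture group per new column. The sketch below works on a two-row made-up table (toy is hypothetical) and assumes the "letter, dot, digits" layout of Sexage; convert = TRUE is an optional argument that turns Age into an integer instead of leaving it as text. The call is namespaced as tidyr::extract() because magrittr also exports a function named extract().

```r
library(tidyr)
library(tibble)

# Hypothetical two-row table with the same Sexage layout as grades
toy <- tibble(Name = c("Tommy", "Mary"), Sexage = c("m.15", "f.15"))

# separate(): split on an explicit separator (a literal dot)
toy_sep <- toy %>%
  separate(Sexage, into = c("Sex", "Age"), sep = "\\.", convert = TRUE)

# extract(): one capture group per output column
toy_ext <- toy %>%
  tidyr::extract(Sexage, into = c("Sex", "Age"),
                 regex = "([a-z])\\.([0-9]+)", convert = TRUE)

toy_ext
```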
modif_grades <- grades %>%
gather(Math, Philo, English, key = Topic, value = Grade)
modif_grades
# A tibble: 12 x 5
Name Sex Age Topic Grade
<chr> <chr> <chr> <chr> <dbl>
1 Tommy m 15 Math 10.
2 Mary f 15 Math 15.
3 Gary m 16 Math 16.
4 Cathy f 14 Math 14.
5 Tommy m 15 Philo 11.
6 Mary f 15 Philo 13.
7 Gary m 16 Philo 10.
8 Cathy f 14 Philo 12.
9 Tommy m 15 English 12.
10 Mary f 15 English 13.
11 Gary m 16 English 17.
12 Cathy f 14 English 10.
Useful for:
p <- ggplot(modif_grades, aes(x = Topic, y = Grade)) + geom_boxplot() +
theme_bw()
p
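spread() performs the reverse operation of gather(): it turns a key column and a value column back into one column per key. The sketch below uses a small made-up long table (long and its values are hypothetical). It also shows pivot_wider(), which more recent tidyr releases (1.0.0 and later, an assumption about the installed version) recommend in place of spread().

```r
library(tidyr)
library(tibble)

# Hypothetical long table: one row per (Name, Topic) pair
long <- tibble(
  Name  = rep(c("Tommy", "Mary"), times = 2),
  Topic = rep(c("Math", "Philo"), each = 2),
  Grade = c(10, 15, 11, 13)
)

# One column per topic, one row per student
wide <- long %>% spread(key = Topic, value = Grade)

# Equivalent reshaping with the newer interface (assumes tidyr >= 1.0.0)
wide2 <- long %>% pivot_wider(names_from = Topic, values_from = Grade)

wide
```

The two results hold the same values, but the row order can differ between the two functions.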
Exercise:
# A tibble: 6 x 7
# Groups: Topic [3]
Name Sex Age Topic Grade minval maxval
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Mary f 15 Math 15. 14. 15.
2 Cathy f 14 Math 14. 14. 15.
3 Mary f 15 Philo 13. 12. 13.
4 Cathy f 14 Philo 12. 12. 13.
5 Mary f 15 English 13. 10. 13.
6 Cathy f 14 English 10. 10. 13.
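One pipeline consistent with the table above is sketched here, assuming the task is to keep the female students and to attach, for each topic, the minimum and maximum grade within that subset. The name female_grades is made up for this sketch, and the grades table is rebuilt so that the code runs on its own.

```r
library(dplyr)
library(tidyr)
library(tibble)

grades <- tibble(
  Name = c("Tommy", "Mary", "Gary", "Cathy"),
  Sexage = c("m.15", "f.15", "m.16", "f.14"),
  Math = c(10, 15, 16, 14),
  Philo = c(11, 13, 10, 12),
  English = c(12, 13, 17, 10)
)

female_grades <- grades %>%
  separate(Sexage, into = c("Sex", "Age")) %>%                # split "f.15" into "f" and "15"
  gather(Math, Philo, English, key = Topic, value = Grade) %>% # one row per student and topic
  filter(Sex == "f") %>%                                       # keep female students only
  group_by(Topic) %>%
  mutate(minval = min(Grade), maxval = max(Grade))             # per-topic extremes, rows kept

female_grades
```

Using mutate() after group_by(), instead of summarise(), keeps one row per student and repeats the group-level values on each row.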
Slides built with material coming from: