tidyr包简介

hadley大神的新作tidyr包,是reshape2的重构,也是由两大函数组成

package function1 function2
tidyr gather spread
reshape2 melt cast

以下内容是学习笔记,详细内容参见Hadley大神的github

Data tidying

tidyr提供了数据集内组织数据的标准方式,不用每次都另起炉灶

Defining tidy data

通过以下定义整齐的数据:

  1. Data structures(数据结构)。数据集大多数情况下都是行和列组成的数据框,通常有列名,有时候有行名
  2. Data semantics(数据语义)。数据集是值(数字和字符)的集合。每个值归属于一个变量和一个观测

Tidying messy datasets

以下是5种常见的脏数据集:

  1. 列名是值,不是变量名。如列名有”<10",">10”之类的分类值
  2. 列名包括多类型的值
  3. 变量储存为值
  4. 不同类型的数据存储在一张表
  5. 同一类型的数据存储在多张表

以下对5种脏数据集分别进行清理:

列名是值,不是变量名

原数据集

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
pew = tbl_df(read.csv("pew.csv", stringsAsFactors = FALSE, check.names = FALSE))
pew
#> Source: local data frame [18 x 11]
#>
#> religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k
#> (chr) (int) (int) (int) (int) (int) (int)
#> 1 Agnostic 27 34 60 81 76 137
#> 2 Atheist 12 27 37 52 35 70
#> 3 Buddhist 27 21 30 34 33 58
#> 4 Catholic 418 617 732 670 638 1116
#> 5 Don’t know/refused 15 14 15 11 10 35
#> 6 Evangelical Prot 575 869 1064 982 881 1486
#> 7 Hindu 1 9 7 9 11 34
#> 8 Historically Black Prot 228 244 236 238 197 223
#> 9 Jehovah's Witness 20 27 24 24 21 30
#> 10 Jewish 19 19 25 25 30 95
#> .. ... ... ... ... ... ... ...
#> Variables not shown: $75-100k (int), $100-150k (int), >150k (int), Don't
#> know/refused (int).

<$10k $10-20k $20-30k $30-40k $40-50k $50-75k合为income列,religion排序(-religion代表以religion分组)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
pew %>%
gather(income, frequency, -religion)
#> Source: local data frame [180 x 3]
#>
#> religion income frequency
#> (chr) (chr) (int)
#> 1 Agnostic <$10k 27
#> 2 Atheist <$10k 12
#> 3 Buddhist <$10k 27
#> 4 Catholic <$10k 418
#> 5 Don’t know/refused <$10k 15
#> 6 Evangelical Prot <$10k 575
#> 7 Hindu <$10k 1
#> 8 Historically Black Prot <$10k 228
#> 9 Jehovah's Witness <$10k 20
#> 10 Jewish <$10k 19
#> .. ... ... ...

原数据集

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
billboard = tbl_df(read.csv("billboard.csv", stringsAsFactors = FALSE))
billboard
#> Source: local data frame [317 x 81]
#>
#> year artist track time date.entered wk1
#> (int) (chr) (chr) (chr) (chr) (int)
#> 1 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 87
#> 2 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 91
#> 3 2000 3 Doors Down Kryptonite 3:53 2000-04-08 81
#> 4 2000 3 Doors Down Loser 4:24 2000-10-21 76
#> 5 2000 504 Boyz Wobble Wobble 3:35 2000-04-15 57
#> 6 2000 98^0 Give Me Just One Nig... 3:24 2000-08-19 51
#> 7 2000 A*Teens Dancing Queen 3:44 2000-07-08 97
#> 8 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 84
#> 9 2000 Aaliyah Try Again 4:03 2000-03-18 59
#> 10 2000 Adams, Yolanda Open My Heart 5:30 2000-08-26 76
#> .. ... ... ... ... ... ...
#> Variables not shown: wk2 (int), wk3 (int), wk4 (int), wk5 (int), wk6
#> (int), wk7 (int), wk8 (int), wk9 (int), wk10 (int), wk11 (int), wk12
#> (int), wk13 (int), wk14 (int), wk15 (int), wk16 (int), wk17 (int), wk18
#> (int), wk19 (int), wk20 (int), wk21 (int), wk22 (int), wk23 (int), wk24
#> (int), wk25 (int), wk26 (int), wk27 (int), wk28 (int), wk29 (int), wk30
#> (int), wk31 (int), wk32 (int), wk33 (int), wk34 (int), wk35 (int), wk36
#> (int), wk37 (int), wk38 (int), wk39 (int), wk40 (int), wk41 (int), wk42
#> (int), wk43 (int), wk44 (int), wk45 (int), wk46 (int), wk47 (int), wk48
#> (int), wk49 (int), wk50 (int), wk51 (int), wk52 (int), wk53 (int), wk54
#> (int), wk55 (int), wk56 (int), wk57 (int), wk58 (int), wk59 (int), wk60
#> (int), wk61 (int), wk62 (int), wk63 (int), wk64 (int), wk65 (int), wk66
#> (lgl), wk67 (lgl), wk68 (lgl), wk69 (lgl), wk70 (lgl), wk71 (lgl), wk72
#> (lgl), wk73 (lgl), wk74 (lgl), wk75 (lgl), wk76 (lgl).

w1~w76合为week

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
billboard2 = billboard %>%
gather(week, rank, wk1:wk76, na.rm = TRUE)
billboard2
#> Source: local data frame [5,307 x 7]
#>
#> year artist track time date.entered week
#> (int) (chr) (chr) (chr) (chr) (chr)
#> 1 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 wk1
#> 2 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 wk1
#> 3 2000 3 Doors Down Kryptonite 3:53 2000-04-08 wk1
#> 4 2000 3 Doors Down Loser 4:24 2000-10-21 wk1
#> 5 2000 504 Boyz Wobble Wobble 3:35 2000-04-15 wk1
#> 6 2000 98^0 Give Me Just One Nig... 3:24 2000-08-19 wk1
#> 7 2000 A*Teens Dancing Queen 3:44 2000-07-08 wk1
#> 8 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 wk1
#> 9 2000 Aaliyah Try Again 4:03 2000-03-18 wk1
#> 10 2000 Adams, Yolanda Open My Heart 5:30 2000-08-26 wk1
#> .. ... ... ... ... ... ...
#> Variables not shown: rank (int).

mutate添加新变量,extract_numeric函数取出week列的数值,date计算week对应的日期,select筛选列

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
billboard3 = billboard2 %>%
mutate(
week = extract_numeric(week),
date = as.Date(date.entered) + 7 * (week - 1)) %>%
select(-date.entered)
billboard3
#> Source: local data frame [5,307 x 7]
#>
#> year artist track time week rank
#> (int) (chr) (chr) (chr) (dbl) (int)
#> 1 2000 2 Pac Baby Don't Cry (Keep... 4:22 1 87
#> 2 2000 2Ge+her The Hardest Part Of ... 3:15 1 91
#> 3 2000 3 Doors Down Kryptonite 3:53 1 81
#> 4 2000 3 Doors Down Loser 4:24 1 76
#> 5 2000 504 Boyz Wobble Wobble 3:35 1 57
#> 6 2000 98^0 Give Me Just One Nig... 3:24 1 51
#> 7 2000 A*Teens Dancing Queen 3:44 1 97
#> 8 2000 Aaliyah I Don't Wanna 4:15 1 84
#> 9 2000 Aaliyah Try Again 4:03 1 59
#> 10 2000 Adams, Yolanda Open My Heart 5:30 1 76
#> .. ... ... ... ... ... ...
#> Variables not shown: date (date).

artist,track,week排序

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
billboard3 %>% arrange(artist, track, week)
#> Source: local data frame [5,307 x 7]
#>
#> year artist track time week rank date
#> (int) (chr) (chr) (chr) (dbl) (int) (date)
#> 1 2000 2 Pac Baby Don't Cry (Keep... 4:22 1 87 2000-02-26
#> 2 2000 2 Pac Baby Don't Cry (Keep... 4:22 2 82 2000-03-04
#> 3 2000 2 Pac Baby Don't Cry (Keep... 4:22 3 72 2000-03-11
#> 4 2000 2 Pac Baby Don't Cry (Keep... 4:22 4 77 2000-03-18
#> 5 2000 2 Pac Baby Don't Cry (Keep... 4:22 5 87 2000-03-25
#> 6 2000 2 Pac Baby Don't Cry (Keep... 4:22 6 94 2000-04-01
#> 7 2000 2 Pac Baby Don't Cry (Keep... 4:22 7 99 2000-04-08
#> 8 2000 2Ge+her The Hardest Part Of ... 3:15 1 91 2000-09-02
#> 9 2000 2Ge+her The Hardest Part Of ... 3:15 2 87 2000-09-09
#> 10 2000 2Ge+her The Hardest Part Of ... 3:15 3 92 2000-09-16
#> .. ... ... ... ... ... ... ...

列名包括多类型的值

原数据集

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
tb = tbl_df(read.csv("tb.csv", stringsAsFactors = FALSE))
tb
#> Source: local data frame [5,769 x 22]
#>
#> iso2 year m04 m514 m014 m1524 m2534 m3544 m4554 m5564 m65 mu
#> (chr) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int)
#> 1 AD 1989 NA NA NA NA NA NA NA NA NA NA
#> 2 AD 1990 NA NA NA NA NA NA NA NA NA NA
#> 3 AD 1991 NA NA NA NA NA NA NA NA NA NA
#> 4 AD 1992 NA NA NA NA NA NA NA NA NA NA
#> 5 AD 1993 NA NA NA NA NA NA NA NA NA NA
#> 6 AD 1994 NA NA NA NA NA NA NA NA NA NA
#> 7 AD 1996 NA NA 0 0 0 4 1 0 0 NA
#> 8 AD 1997 NA NA 0 0 1 2 2 1 6 NA
#> 9 AD 1998 NA NA 0 0 0 1 0 0 0 NA
#> 10 AD 1999 NA NA 0 0 0 1 1 0 0 NA
#> .. ... ... ... ... ... ... ... ... ... ... ... ...
#> Variables not shown: f04 (int), f514 (int), f014 (int), f1524 (int), f2534
#> (int), f3544 (int), f4554 (int), f5564 (int), f65 (int), fu (int).

先将m04~fu合为demo列,按iso2year排序

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
tb2 = tb %>%
gather(demo, n, -iso2, -year, na.rm = TRUE)
tb2
#> Source: local data frame [35,750 x 4]
#>
#> iso2 year demo n
#> (chr) (int) (chr) (int)
#> 1 AD 2005 m04 0
#> 2 AD 2006 m04 0
#> 3 AD 2008 m04 0
#> 4 AE 2006 m04 0
#> 5 AE 2007 m04 0
#> 6 AE 2008 m04 0
#> 7 AG 2007 m04 0
#> 8 AL 2005 m04 0
#> 9 AL 2006 m04 1
#> 10 AL 2007 m04 0
#> .. ... ... ... ...

separatedemo列拆为sexage1代表截断的位数,还可以按符号分割

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
tb3 = tb2 %>%
separate(demo, c("sex", "age"), 1)
tb3
#> Source: local data frame [35,750 x 5]
#>
#> iso2 year sex age n
#> (chr) (int) (chr) (chr) (int)
#> 1 AD 2005 m 04 0
#> 2 AD 2006 m 04 0
#> 3 AD 2008 m 04 0
#> 4 AE 2006 m 04 0
#> 5 AE 2007 m 04 0
#> 6 AE 2008 m 04 0
#> 7 AG 2007 m 04 0
#> 8 AL 2005 m 04 0
#> 9 AL 2006 m 04 1
#> 10 AL 2007 m 04 0
#> .. ... ... ... ... ...

变量储存为值

tmaxtmin为变量,但是储存为值

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
weather = tbl_df(read.csv("weather.csv", stringsAsFactors = FALSE))
weather
#> Source: local data frame [22 x 35]
#>
#> id year month element d1 d2 d3 d4 d5 d6 d7
#> (chr) (int) (int) (chr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
#> 1 MX17004 2010 1 tmax NA NA NA NA NA NA NA
#> 2 MX17004 2010 1 tmin NA NA NA NA NA NA NA
#> 3 MX17004 2010 2 tmax NA 27.3 24.1 NA NA NA NA
#> 4 MX17004 2010 2 tmin NA 14.4 14.4 NA NA NA NA
#> 5 MX17004 2010 3 tmax NA NA NA NA 32.1 NA NA
#> 6 MX17004 2010 3 tmin NA NA NA NA 14.2 NA NA
#> 7 MX17004 2010 4 tmax NA NA NA NA NA NA NA
#> 8 MX17004 2010 4 tmin NA NA NA NA NA NA NA
#> 9 MX17004 2010 5 tmax NA NA NA NA NA NA NA
#> 10 MX17004 2010 5 tmin NA NA NA NA NA NA NA
#> .. ... ... ... ... ... ... ... ... ... ... ...
#> Variables not shown: d8 (dbl), d9 (lgl), d10 (dbl), d11 (dbl), d12 (lgl),
#> d13 (dbl), d14 (dbl), d15 (dbl), d16 (dbl), d17 (dbl), d18 (lgl), d19
#> (lgl), d20 (lgl), d21 (lgl), d22 (lgl), d23 (dbl), d24 (lgl), d25 (dbl),
#> d26 (dbl), d27 (dbl), d28 (dbl), d29 (dbl), d30 (dbl), d31 (dbl).

d1~d31合为day

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
weather2 = weather %>%
gather(day, value, d1:d31, na.rm = TRUE)
weather2
#> Source: local data frame [66 x 6]
#>
#> id year month element day value
#> (chr) (int) (int) (chr) (chr) (dbl)
#> 1 MX17004 2010 12 tmax d1 29.9
#> 2 MX17004 2010 12 tmin d1 13.8
#> 3 MX17004 2010 2 tmax d2 27.3
#> 4 MX17004 2010 2 tmin d2 14.4
#> 5 MX17004 2010 11 tmax d2 31.3
#> 6 MX17004 2010 11 tmin d2 16.3
#> 7 MX17004 2010 2 tmax d3 24.1
#> 8 MX17004 2010 2 tmin d3 14.4
#> 9 MX17004 2010 7 tmax d3 28.6
#> 10 MX17004 2010 7 tmin d3 17.5
#> .. ... ... ... ... ... ...

day列分出数字,select筛选,arrange排序

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
weather3 = weather2 %>%
mutate(day = extract_numeric(day)) %>%
select(id, year, month, day, element, value) %>%
arrange(id, year, month, day)
weather3
#> Source: local data frame [66 x 6]
#>
#> id year month day element value
#> (chr) (int) (int) (dbl) (chr) (dbl)
#> 1 MX17004 2010 1 30 tmax 27.8
#> 2 MX17004 2010 1 30 tmin 14.5
#> 3 MX17004 2010 2 2 tmax 27.3
#> 4 MX17004 2010 2 2 tmin 14.4
#> 5 MX17004 2010 2 3 tmax 24.1
#> 6 MX17004 2010 2 3 tmin 14.4
#> 7 MX17004 2010 2 11 tmax 29.7
#> 8 MX17004 2010 2 11 tmin 13.4
#> 9 MX17004 2010 2 23 tmax 29.9
#> 10 MX17004 2010 2 23 tmin 10.7
#> .. ... ... ... ... ... ...

element拆为tmaxtmin

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
weather3 %>% spread(element, value)
#> Source: local data frame [33 x 6]
#>
#> id year month day tmax tmin
#> (chr) (int) (int) (dbl) (dbl) (dbl)
#> 1 MX17004 2010 1 30 27.8 14.5
#> 2 MX17004 2010 2 2 27.3 14.4
#> 3 MX17004 2010 2 3 24.1 14.4
#> 4 MX17004 2010 2 11 29.7 13.4
#> 5 MX17004 2010 2 23 29.9 10.7
#> 6 MX17004 2010 3 5 32.1 14.2
#> 7 MX17004 2010 3 10 34.5 16.8
#> 8 MX17004 2010 3 16 31.1 17.6
#> 9 MX17004 2010 4 27 36.3 16.7
#> 10 MX17004 2010 5 27 33.2 18.2
#> .. ... ... ... ... ... ...

不同类型的数据存储在一张表

unique去除重复值,mutate添加id

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
song = billboard3 %>%
select(artist, track, year, time) %>%
unique() %>%
mutate(song_id = row_number())
song
#> Source: local data frame [317 x 5]
#>
#> artist track year time song_id
#> (chr) (chr) (int) (chr) (int)
#> 1 2 Pac Baby Don't Cry (Keep... 2000 4:22 1
#> 2 2Ge+her The Hardest Part Of ... 2000 3:15 2
#> 3 3 Doors Down Kryptonite 2000 3:53 3
#> 4 3 Doors Down Loser 2000 4:24 4
#> 5 504 Boyz Wobble Wobble 2000 3:35 5
#> 6 98^0 Give Me Just One Nig... 2000 3:24 6
#> 7 A*Teens Dancing Queen 2000 3:44 7
#> 8 Aaliyah I Don't Wanna 2000 4:15 8
#> 9 Aaliyah Try Again 2000 4:03 9
#> 10 Adams, Yolanda Open My Heart 2000 5:30 10
#> .. ... ... ... ... ...

left_join左连接

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
rank = billboard3 %>%
left_join(song, c("artist", "track", "year", "time")) %>%
select(song_id, date, week, rank) %>%
arrange(song_id, date)
rank
#> Source: local data frame [5,307 x 4]
#>
#> song_id date week rank
#> (int) (date) (dbl) (int)
#> 1 1 2000-02-26 1 87
#> 2 1 2000-03-04 2 82
#> 3 1 2000-03-11 3 72
#> 4 1 2000-03-18 4 77
#> 5 1 2000-03-25 5 87
#> 6 1 2000-04-01 6 94
#> 7 1 2000-04-08 7 99
#> 8 2 2000-09-02 1 91
#> 9 2 2000-09-09 2 87
#> 10 2 2000-09-16 3 92
#> .. ... ... ... ...

同一类型的数据存储在多张表

将多张表里的数据合并到一起

1
2
3
4
library(plyr)
paths = dir("data", pattern = "\\.csv$", full.names = TRUE)
names(paths) = basename(paths)
ldply(paths, read.csv, stringsAsFactors = FALSE)

gather & spread

  1. gather. gather(data,key,value)
  2. spread. spread(data,key,value)
1
2
3
4
5
6
7
8
9
10
11
12
13
# 原数据
stocks = data.frame(
time = as.Date('2009-01-01') + 0:9,
X = rnorm(10,0,1),
Y = rnorm(10,0,2),
Z = rnorm(10,0,4)
)
# gather
stocks %>% gather(stock,price,-time)
stocks %>% gather(stock,price,X:Y)
# spread
stocksm = stocks %>% gather(stock,price,-time)
stocksm %>% spread(stock,price)
喂他一颗糖