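1 Introduction
Data sets come in long and wide forms. Raw data is usually wide, because that layout is convenient for data entry, viewing, and comparison. Its basic characteristic is that each observation occupies one row and each measurement occupies one column. In this layout not every column is a variable; a single variable may be spread across several columns. Long data is the format that most analysis software expects: each distinct variable is gathered into a single column, and each observation sits in its own row. Examples of each type follow:
Wide data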
tidyr::table4a
country      1999    2000
Afghanistan  745     2666
Brazil       37737   80488
China        212258  213766
A less obvious example
data.table::data.table(
Date = c("2009-01-01", "2009-01-02"),
Boeing.stock.price = c("$173.55", "$172.61"),
Amazon.stock.price = c("$174.90", "$171.42"),
Google.stock.price = c("$174.34", "$170.04")
)
Date        Boeing.stock.price  Amazon.stock.price  Google.stock.price
2009-01-01  $173.55             $174.90             $174.34
2009-01-02  $172.61             $171.42             $170.04
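The price table above hides a variable in its column names: the company. The following is a minimal sketch of one way to make that variable explicit, assuming the tidyr and dplyr packages are installed. The names stocks_wide, stocks_long, company, and price are illustrative and do not come from the original example, and tibble::tibble() stands in for data.table::data.table().

```r
library(dplyr)
library(tidyr)

# Same values as the example above, rebuilt as a tibble (an assumption made
# here so that the sketch only depends on the tidyverse).
stocks_wide <- tibble::tibble(
  Date = c("2009-01-01", "2009-01-02"),
  Boeing.stock.price = c("$173.55", "$172.61"),
  Amazon.stock.price = c("$174.90", "$171.42"),
  Google.stock.price = c("$174.34", "$170.04")
)

stocks_long <- stocks_wide %>%
  # Every column except Date holds a price; move the column names into
  # a "company" column and the cell contents into a "price" column.
  pivot_longer(cols = -Date, names_to = "company", values_to = "price") %>%
  mutate(
    Date = as.Date(Date),
    # Drop the ".stock.price" suffix so only the company name remains.
    company = sub("\\.stock\\.price$", "", company),
    # Drop the leading dollar sign and convert to a number.
    price = as.numeric(sub("^\\$", "", price))
  )

stocks_long
```

After this step each row is one company on one date, so the company can be used directly as a grouping variable.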
Long data
tidyr::table4a %>% gather(2:3, key = "year", value = "cases")
country      year  cases
Afghanistan  1999  745
Brazil       1999  37737
China        1999  212258
Afghanistan  2000  2666
Brazil       2000  80488
China        2000  213766
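gather() still works, but the tidyr documentation marks it as superseded and recommends pivot_longer() for new code. A roughly equivalent sketch follows. The name long_cases is illustrative, and the arrange() call is there only to reproduce the row order of the table above, because pivot_longer() keeps the rows of each country together.

```r
library(dplyr)
library(tidyr)

long_cases <- tidyr::table4a %>%
  pivot_longer(
    cols = c(`1999`, `2000`),  # the two year columns; backticks because the names are numbers
    names_to = "year",         # column names become values of "year"
    values_to = "cases"        # cell contents become values of "cases"
  ) %>%
  arrange(year, country)       # year-major order, as in the table above

long_cases
```

Note that year is still a character column at this point; converting it with as.integer() is a separate step.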
Narrow data
tidyr::table5 %>% mutate(year = str_c(century, year)) %>% select(-2)
country      year  rate
Afghanistan  1999  745/19987071
Afghanistan  2000  2666/20595360
Brazil       1999  37737/172006362
Brazil       2000  80488/174504898
China        1999  212258/1272915272
China        2000  213766/1280428583
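In the table above the rate column packs two variables, cases and population, into one string. One possible way to split them is sketched below, assuming tidyr, dplyr, and stringr are installed. The name tidy5 is illustrative only.

```r
library(dplyr)
library(tidyr)
library(stringr)

tidy5 <- tidyr::table5 %>%
  # Rebuild the four-digit year from the century and two-digit year columns.
  mutate(year = as.integer(str_c(century, year))) %>%
  select(-century) %>%
  # Split "cases/population" at the slash; convert = TRUE turns the
  # resulting strings into integers.
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

tidy5
```

With cases and population in their own columns, a rate can be computed with ordinary arithmetic instead of string handling.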
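Note that whether a data set counts as long or wide depends on the variables in the data set you are analyzing. In a rectangular data set (Rectangle Data), not every column name is a variable; whether something is a variable depends on the purpose of your analysis and computation. In other words, data are truly tidy for a given analysis only when each variable used in that analysis sits in its own column and each observation sits in its own row.
The ambiguity comes from the definition of tidy data. Tidiness depends on the variables in your data set. But what is a variable depends on what you are trying to do. To identify the variables that you need to work with, describe what you want to do with an equation. Each variable in the equation should correspond to a variable in your data. -Primer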
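As an illustration of the equation approach, suppose the goal is rate = cases / population. Both terms of the equation then need to be columns. In tidyr::table2 they are stored as values of a type column, so for this particular analysis the table has to be widened first. This is a sketch under that assumption; the name rates and the per-10,000 scaling are choices made here, not part of the original text.

```r
library(dplyr)
library(tidyr)

rates <- tidyr::table2 %>%
  # "cases" and "population" are values of `type`; turn each into a column
  # so that both terms of the equation exist as variables.
  pivot_wider(names_from = type, values_from = count) %>%
  # rate = cases / population, expressed per 10,000 people.
  mutate(rate = cases / population * 10000)

rates
```

For a different question, for example plotting every count by type, the original layout of table2 could be the more convenient one.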
1.1 Basic concepts
1.1.1 Variable
A variable is a quantity, quality, or property that you can measure.
A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. -Tidy data
1.1.2 Value
A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
1.1.3 Observation
An observation or case is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a case or data point.
An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes. -Tidy data
1.1.4 Wide data
The case list (Cases list) is the most common form of wide data in public health.
suppressPackageStartupMessages(library(tidyverse))
data(Oswego, package = "epiDisplay")
wide_data <- Oswego %>%
as_tibble() %>%
slice_sample(n = 5)
wide_data
age  sex  timesupper  ill    onsetdate  onsettime  bakedham  spinach  mashedpota  cabbagesal  jello  rolls  brownbread  milk   coffee  water  cakes  vanilla  chocolate  fruitsalad
24   F    NA          FALSE  NA         NA         TRUE      TRUE     TRUE        FALSE       FALSE  TRUE   FALSE       FALSE  TRUE    FALSE  FALSE  FALSE    FALSE      FALSE
36   M    NA          TRUE   04/18      2215       TRUE      TRUE     FALSE       TRUE        FALSE  TRUE   TRUE        FALSE  FALSE   FALSE  FALSE  TRUE     FALSE      FALSE
48   F    NA          TRUE   04/18      2400       TRUE      TRUE     TRUE        TRUE        TRUE   TRUE   TRUE        TRUE   TRUE    FALSE  TRUE   TRUE     TRUE       FALSE
70   M    1930        TRUE   04/18      2230       TRUE      TRUE     TRUE        FALSE       TRUE   TRUE   TRUE        FALSE  TRUE    TRUE   FALSE  TRUE     FALSE      FALSE
68   M    NA          TRUE   04/18      2130       TRUE      FALSE    TRUE        TRUE        FALSE  FALSE  TRUE        FALSE  TRUE    FALSE  FALSE  TRUE     FALSE      FALSE
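A case list like the one above is convenient to read, but comparing foods usually calls for a long layout with one row per person and food item. The sketch below uses a small made-up case list, not the Oswego data, so that it runs without the epiDisplay package. All names in it (case_list, exposures_long, attack_rates, food, ate) are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical three-person case list with the same shape as the table above:
# one row per person, one logical column per food item.
case_list <- tibble::tibble(
  id       = 1:3,
  ill      = c(TRUE, FALSE, TRUE),
  bakedham = c(TRUE, TRUE, FALSE),
  spinach  = c(TRUE, FALSE, FALSE),
  vanilla  = c(TRUE, FALSE, TRUE)
)

# Wide to long: one row per person and food item.
exposures_long <- case_list %>%
  pivot_longer(cols = bakedham:vanilla, names_to = "food", values_to = "ate")

# Proportion ill among those who did and did not eat each food.
attack_rates <- exposures_long %>%
  group_by(food, ate) %>%
  summarise(n = n(), attack_rate = mean(ill), .groups = "drop")

attack_rates
```

In the long layout the food item is an ordinary variable, so one grouped summary covers every food instead of one calculation per column.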
1.1.5 Messy data (Untidy Data)
“Tidy data sets are all alike; but every messy data set is messy in its own way.” — Hadley Wickham
The five most common problems with messy datasets, along with their remedies:
* Column headers are values, not variable names.
* Multiple variables are stored in one column.
* Variables are stored in both rows and columns.
* Multiple types of observational units are stored in the same table.
* A single observational unit is stored in multiple tables.
Narrow data stores multiple variables, or the values of multiple variables, in a single column.
Narrow data (molten data; see Tidy data) uses a literal key column and a literal value column to store multiple variables.
who %>%
slice_sample(n = 10)
country               iso2  iso3  year  new_sp_m014  new_sp_m1524  new_sp_m2534  new_sp_m3544  new_sp_m4554  new_sp_m5564  new_sp_m65  new_sp_f014  new_sp_f1524  new_sp_f2534  new_sp_f3544  new_sp_f4554  new_sp_f5564  new_sp_f65
Palau                 PW    PLW   1989  NA           NA            NA            NA            NA            NA            NA          NA           NA            NA            NA            NA            NA            NA
Seychelles            SC    SYC   1984  NA           NA            NA            NA            NA            NA            NA          NA           NA            NA            NA            NA            NA            NA
Oman                  OM    OMN   1992  NA           NA            NA            NA            NA            NA            NA          NA           NA            NA            NA            NA            NA            NA
Syrian Arab Republic  SY    SYR   1998  5            335           293           111           93            48            50          20           197           99            43            49            18            21
Namibia               NA    NAM   2011  48           337           844           660           361           152           138         78           427           653           410           185           100           110
Nepal                 NP    NPL   2008  81           150           1409          1558          1706          1515          792         107          832           820           704           630           523           226
Montserrat            MS    MSR   2005  NA           NA            NA            NA            1             NA            NA          NA           NA            NA            NA            NA            NA            NA
Cameroon              CM    CMR   1993  NA           NA            NA            NA            NA            NA            NA          NA           NA            NA            NA            NA            NA            NA
Zimbabwe              ZW    ZWE   1988  NA           NA            NA            NA            NA            NA            NA          NA           NA            NA            NA            NA            NA            NA
Namibia               NA    NAM   2006  86           347           1052          799           386           174           146         74           485           875           521           239           92            80
The remaining 42 columns (new_sn_m014 through newrel_f65, that is, the new_sn_, new_ep_, and newrel_ prefixes combined with the same 14 sex and age suffixes, m014 through f65) are NA in all ten sampled rows and are not shown.
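The column names of who encode several variables at once (diagnosis type, sex, and age group), and the cell values are case counts. One commonly used way to tidy it is sketched below, assuming tidyr, dplyr, and stringr are installed. The names who_tidy, key, type, sex, age, and cases are illustrative choices, and dropping iso2 and iso3 is an assumption made here because they repeat the information in country.

```r
library(dplyr)
library(tidyr)
library(stringr)

who_tidy <- tidyr::who %>%
  # Move the 56 count columns into a key column and a value column,
  # dropping the many cells that are NA.
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = "key",
    values_to = "cases",
    values_drop_na = TRUE
  ) %>%
  # "newrel" lacks the underscore that the other prefixes have; add it so
  # every key has the same three-part structure.
  mutate(key = str_replace(key, "newrel", "new_rel")) %>%
  # Split "new_sp_m014" into "new", "sp", "m014".
  separate(key, into = c("new", "type", "sexage"), sep = "_") %>%
  # Split "m014" after the first character into sex and age group.
  separate(sexage, into = c("sex", "age"), sep = 1) %>%
  # "new" is constant; iso2 and iso3 duplicate country.
  select(-new, -iso2, -iso3)

who_tidy
```

Each remaining row is one country, year, diagnosis type, sex, and age group with its case count, so every variable has its own column.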
1.1.6 Tidy data
Tidy data must satisfy the following conditions:
1. Each variable is in its own column
2. Each observation is in its own row
3. Each value is in its own cell (this follows from #1 and #2)
1.2 Data manipulation
Data manipulation includes variable-by-variable transformation (e.g., log or sqrt), as well as aggregation, filtering and reordering.
Filter: subsetting or removing observations based on some condition.
Transform: adding or modifying variables. These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume).
Aggregate: collapsing multiple values into a single value (e.g., by summing or taking means).
Sort: changing the order of observations.
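The four operations map naturally onto dplyr verbs. The sketch below applies each one to tidyr::table1, assuming dplyr and tidyr are installed. The object names (tb, tb_2000, tb_rate, tb_total, tb_sorted) and the per-10,000 scaling are illustrative choices, and dplyr is one of several packages that offer these operations.

```r
library(dplyr)
library(tidyr)

tb <- tidyr::table1  # columns: country, year, cases, population

# Filter: keep only the observations from the year 2000.
tb_2000 <- tb %>% filter(year == 2000)

# Transform: a single-variable change (log) and a multi-variable one (rate).
tb_rate <- tb %>%
  mutate(
    log_cases = log(cases),
    rate = cases / population * 10000
  )

# Aggregate: collapse the yearly counts into one total per country.
tb_total <- tb %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases), .groups = "drop")

# Sort: order the observations from most to fewest cases.
tb_sorted <- tb %>% arrange(desc(cases))
```

All four steps operate on whole columns by name, which is possible because table1 already has one variable per column and one observation per row.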