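1 Introduction

Datasets come in long and wide forms. Raw data is usually wide, because that layout is convenient for entry, viewing, and comparison. Its basic characteristic is that each observation occupies one row and each measurement occupies one column. In this layout not every column is a variable: a single variable may be spread across several columns. Long data, by contrast, is the format that most analysis software expects: each distinct variable is gathered into a single column, and each observation sits in its own row. Examples of each type follow: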

Wide data

tidyr::table4a

country      1999    2000
Afghanistan  745     2666
Brazil       37737   80488
China        212258  213766

A less obvious example

data.table::data.table(
  Date = c("2009-01-01", "2009-01-02"),
  Boeing.stock.price = c("$173.55", "$172.61"),
  Amazon.stock.price = c("$174.90", "$171.42"),
  Google.stock.price = c("$174.34", "$170.04")
)

Date        Boeing.stock.price  Amazon.stock.price  Google.stock.price
2009-01-01  $173.55             $174.90             $174.34
2009-01-02  $172.61             $171.42             $170.04
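The wide layout above hides two variables, company and price, inside the column names. As a minimal sketch (assuming tidyr and dplyr are installed; the names stocks_wide, stocks_long, company, and price are illustrative and do not come from the original), the same values could be reshaped so that each row holds one company on one date:

```r
library(tidyr)
library(dplyr)

# Same values as the wide table above, built as a plain data frame
stocks_wide <- data.frame(
  Date = c("2009-01-01", "2009-01-02"),
  Boeing.stock.price = c("$173.55", "$172.61"),
  Amazon.stock.price = c("$174.90", "$171.42"),
  Google.stock.price = c("$174.34", "$170.04")
)

stocks_long <- stocks_wide %>%
  # Move the three price columns into name/value pairs
  pivot_longer(cols = -Date, names_to = "company", values_to = "price") %>%
  mutate(
    Date = as.Date(Date),
    # Keep only the company name from the old column header
    company = sub("\\.stock\\.price$", "", company),
    # Drop the dollar sign so the price becomes numeric
    price = as.numeric(sub("$", "", price, fixed = TRUE))
  )

stocks_long
```

Once company is an explicit column, grouping or plotting by company no longer requires any knowledge of the column naming scheme.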

Long data

tidyr::table4a %>% gather(2:3, key = "year", value = cases)

country      year  cases
Afghanistan  1999  745
Brazil       1999  37737
China        1999  212258
Afghanistan  2000  2666
Brazil       2000  80488
China        2000  213766
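gather() still works, but tidyr has superseded it with pivot_longer(). A roughly equivalent call might look like the sketch below; long_cases is an illustrative name, and the rows come out ordered by country rather than by year.

```r
library(tidyr)
library(dplyr)

long_cases <- table4a %>%
  pivot_longer(
    cols = c(`1999`, `2000`),  # the two year columns of table4a
    names_to = "year",         # old column names become values of `year`
    values_to = "cases"        # old cell values become values of `cases`
  )

long_cases
```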

Narrow data

tidyr::table5 %>% mutate(year = str_c(century, year)) %>% select(-2)

country      year  rate
Afghanistan  1999  745/19987071
Afghanistan  2000  2666/20595360
Brazil       1999  37737/172006362
Brazil       2000  80488/174504898
China        1999  212258/1272915272
China        2000  213766/1280428583
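In this table the rate column packs two variables, cases and population, into a single string. One possible way to pull them apart, sketched here with tidyr's separate() (the result name table5_tidy is illustrative), is:

```r
library(tidyr)
library(dplyr)

table5_tidy <- table5 %>%
  # Rebuild the four-digit year from the century and two-digit year columns
  mutate(year = as.integer(paste0(century, year))) %>%
  select(-century) %>%
  # Split "cases/population" at the slash; convert = TRUE turns the pieces into integers
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

table5_tidy
```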

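Note that whether a dataset counts as long or wide depends on the variables in the dataset you are analyzing. In rectangular data, not every column name is a variable; what counts as a variable depends on the purpose of your analysis and computation. In other words, data is truly tidy only when, for the analysis at hand, each variable sits in a single column and each observation sits in its own row.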

The ambiguity comes from the definition of tidy data. Tidiness depends on the variables in your data set. But what is a variable depends on what you are trying to do. To identify the variables that you need to work with, describe what you want to do with an equation. Each variable in the equation should correspond to a variable in your data. -Primer
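To make the equation idea concrete, suppose the goal is rate = cases / population. Then cases and population are the variables, and each needs its own column. The sketch below assumes that goal and uses tidyr's built-in table2, where both quantities share a single count column; the scaling to a rate per 10,000 and the name rates are illustrative choices, not part of the original.

```r
library(tidyr)
library(dplyr)

rates <- table2 %>%
  # `type` holds the variable names ("cases", "population"); spread them into columns
  pivot_wider(names_from = type, values_from = count) %>%
  # With one column per variable, the equation maps directly onto the data
  mutate(rate = cases / population * 10000)

rates
```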

1.1 Basic concepts

1.1.1 Variable

A variable is a quantity, quality, or property that you can measure.

A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. -Tidy data

1.1.2 Value

A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.

1.1.3 Observation

An observation or case is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a case or data point.

An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes. -Tidy data

1.1.4 Wide data

The case list (line list) is the most common form of wide data in public health.

suppressPackageStartupMessages(library(tidyverse))
data(Oswego, package = "epiDisplay")

wide_data <- Oswego %>%
  as_tibble() %>%
  slice_sample(n = 5)
wide_data
# %>%
#   flextable::flextable()

age  sex  timesupper  ill    onsetdate  onsettime  bakedham  spinach  mashedpota  cabbagesal  jello  rolls  brownbread  milk   coffee  water  cakes  vanilla  chocolate  fruitsalad
24   F    NA          FALSE  NA         NA         TRUE      TRUE     TRUE        FALSE       FALSE  TRUE   FALSE       FALSE  TRUE    FALSE  FALSE  FALSE    FALSE      FALSE
36   M    NA          TRUE   04/18      2215       TRUE      TRUE     FALSE       TRUE        FALSE  TRUE   TRUE        FALSE  FALSE   FALSE  FALSE  TRUE     FALSE      FALSE
48   F    NA          TRUE   04/18      2400       TRUE      TRUE     TRUE        TRUE        TRUE   TRUE   TRUE        TRUE   TRUE    FALSE  TRUE   TRUE     TRUE       FALSE
70   M    1930        TRUE   04/18      2230       TRUE      TRUE     TRUE        FALSE       TRUE   TRUE   TRUE        FALSE  TRUE    TRUE   FALSE  TRUE     FALSE      FALSE
68   M    NA          TRUE   04/18      2130       TRUE      FALSE    TRUE        TRUE        FALSE  FALSE  TRUE        FALSE  TRUE    FALSE  FALSE  TRUE     FALSE      FALSE
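In a case list like this one, every food item is its own column, so comparing exposures means touching many columns at once. The sketch below shows how the food columns could be stacked into a long layout. It assumes tidyr and dplyr, uses a small hand-built excerpt (three food columns from the five sampled rows above) so that it runs without the epiDisplay package, and adds an id column that is hypothetical and not part of the original data.

```r
library(tidyr)
library(dplyr)

# Excerpt of the sampled rows shown above; `id` is a hypothetical row identifier
case_list <- tibble(
  id       = 1:5,
  age      = c(24, 36, 48, 70, 68),
  sex      = c("F", "M", "F", "M", "M"),
  ill      = c(FALSE, TRUE, TRUE, TRUE, TRUE),
  bakedham = c(TRUE, TRUE, TRUE, TRUE, TRUE),
  spinach  = c(TRUE, TRUE, TRUE, TRUE, FALSE),
  vanilla  = c(FALSE, TRUE, TRUE, TRUE, TRUE)
)

# One row per person-food pair instead of one column per food
food_long <- case_list %>%
  pivot_longer(
    cols = bakedham:vanilla,
    names_to = "food",
    values_to = "eaten"
  )

# With food as a variable, exposure counts become a plain grouped summary
exposure <- food_long %>%
  group_by(food) %>%
  summarise(
    n_eaten = sum(eaten),
    n_ill_among_eaters = sum(eaten & ill)
  )

exposure
```

The same pattern should carry over to the full Oswego data by widening the cols range to cover all food columns.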

1.1.5 Untidy data

“Tidy data sets are all alike; but every messy data set is messy in its own way.” — Hadley Wickham

The five most common problems with messy datasets:

* Column headers are values, not variable names.

* Multiple variables are stored in one column.

* Variables are stored in both rows and columns.

* Multiple types of observational units are stored in the same table.

* A single observational unit is stored in multiple tables.

Narrow data stores multiple variables, or the values of multiple variables, in a single column.

Narrow data (called molten data in Tidy data) uses a literal key column and a literal value column to store multiple variables.

who %>%
  slice_sample(n = 10)

country               iso2  iso3  year  new_sp_m014  new_sp_m1524  new_sp_m2534  new_sp_m3544  new_sp_m4554  new_sp_m5564  new_sp_m65  new_sp_f014  new_sp_f1524  new_sp_f2534  new_sp_f3544  new_sp_f4554  new_sp_f5564  new_sp_f65
Palau                 PW    PLW   1989  NA           NA            NA            NA            NA            NA            NA          NA           NA            NA            NA            NA            NA            NA
Seychelles            SC    SYC   1984  NA           NA            NA            NA            NA            NA            NA          NA           NA            NA            NA            NA            NA            NA
Oman                  OM    OMN   1992  NA           NA            NA            NA            NA            NA            NA          NA           NA            NA            NA            NA            NA            NA
Syrian Arab Republic  SY    SYR   1998  5            335           293           111           93            48            50          20           197           99            43            49            18            21
Namibia               NA    NAM   2011  48           337           844           660           361           152           138         78           427           653           410           185           100           110
Nepal                 NP    NPL   2008  81           150           1409          1558          1706          1515          792         107          832           820           704           630           523           226
Montserrat            MS    MSR   2005  NA           NA            NA            NA            1             NA            NA          NA           NA            NA            NA            NA            NA            NA
Cameroon              CM    CMR   1993  NA           NA            NA            NA            NA            NA            NA          NA           NA            NA            NA            NA            NA            NA
Zimbabwe              ZW    ZWE   1988  NA           NA            NA            NA            NA            NA            NA          NA           NA            NA            NA            NA            NA            NA
Namibia               NA    NAM   2006  86           347           1052          799           386           174           146         74           485           875           521           239           92            80

The output has 60 columns in total. The remaining 42 columns (new_sn_*, new_ep_*, and newrel_*, each with the same 14 suffixes m014, m1524, m2534, m3544, m4554, m5564, m65, f014, f1524, f2534, f3544, f4554, f5564, f65) are NA in all ten sampled rows and are omitted here.
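The who column names illustrate the first two problems in the list above at once: headers such as new_sp_m014 are values, and each header encodes several variables (diagnosis method, sex, and age group). A sketch of one way to untangle them with pivot_longer() follows; the regular expression mirrors the one used in tidyr's pivoting documentation, while the output names who_long, diagnosis, sex, age_group, and cases are illustrative choices.

```r
library(tidyr)
library(dplyr)

who_long <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    # Split each header into three pieces: e.g. "new_sp_m014" -> "sp", "m", "014"
    names_to = c("diagnosis", "sex", "age_group"),
    names_pattern = "new_?(.*)_(.)(.*)",
    values_to = "cases",
    # Most cells are NA, so drop them instead of keeping empty rows
    values_drop_na = TRUE
  )

who_long
```

Dropping the NA cells is a convenience that suits sparse tables like this one; set values_drop_na = FALSE if explicit missing values matter for the analysis.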

1.1.6 Tidy data

Tidy data must satisfy the following conditions:

1. Each variable is in its own column.

2. Each observation is in its own row.

3. Each value is in its own cell (this follows from #1 and #2).
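These rules cannot be verified mechanically in general, because what counts as an observation depends on the analysis. Still, a partial sanity check is possible once you decide which columns identify an observation. The sketch below assumes that a country-year pair is the observational unit and compares tidyr's table1 (tidy) with table2 (not tidy); the helper name duplicated_keys is hypothetical.

```r
library(tidyr)
library(dplyr)

# Hypothetical helper: list country-year pairs that appear on more than one row
duplicated_keys <- function(data) {
  data %>%
    count(country, year) %>%
    filter(n > 1)
}

nrow(duplicated_keys(table1))  # one row per country-year pair is expected here
nrow(duplicated_keys(table2))  # each pair is split over a cases row and a population row
```

Passing this check is necessary but not sufficient: a table can have unique keys and still store values in its column headers.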

1.2 Data manipulation

Data manipulation includes variable-by-variable transformation (e.g., log or sqrt), as well as aggregation, filtering, and reordering.

Filter: subsetting or removing observations based on some condition.

Transform: adding or modifying variables. These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume).

Aggregate: collapsing multiple values into a single value (e.g., by summing or taking means).

Sort: changing the order of observations.
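On tidy data, each of these four operations maps onto a single dplyr verb. The sketch below runs them on tidyr's table1; the result names and the per-10,000 scaling are illustrative assumptions rather than part of the original text.

```r
library(tidyr)
library(dplyr)

# Filter: keep only the observations from the year 2000
recent <- table1 %>%
  filter(year == 2000)

# Transform: derive a new variable from two existing ones
with_rate <- table1 %>%
  mutate(rate = cases / population * 10000)

# Aggregate: collapse the yearly values into one total per country
by_country <- table1 %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases))

# Sort: order observations from highest to lowest rate
sorted <- with_rate %>%
  arrange(desc(rate))

sorted
```

Because every variable is a column and every observation is a row, none of these steps needs to know anything about the table beyond its column names.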