This notebook would introduce the basic syntax and dplyr package of R Programming. Before digging into Spatial Analysis, let us be familiar with R. The installation of R and R Studio please see here.

1 Object

Object is the type of data. Commonly used object includes:

  1. vector
  2. factor
  3. matrix
  4. data frame
  5. list

1.1 Vector

⌾ Features of vector

  • a combination of values
  • one-dimensional
  • it can be numeric, character or logical

⌾ Use c() to create a vector

  • Numeric vector
vec1=c(1,2,3,4,5)
## [1] 1 2 3 4 5
  • Character vector
vec2=c("A","B","C","D","E")
## [1] "A" "B" "C" "D" "E"
  • Logical vector
vec3=c(T,F,T,F,T)
## [1]  TRUE FALSE  TRUE FALSE  TRUE

⌾ Retrieve the specific elements

  • Retrieve the third element in vec2
vec2[3]
## [1] "C"
  • Retrieve multiple elements in vec2
    continuous elements
vec2[2:4]
## [1] "B" "C" "D"
  • Retrieve multiple elements in vec2
    non-continuous elements
vec2[c(1,3,5)]
## [1] "A" "C" "E"

⌾ Create sequential numeric vector

seq(from=2, to=20, by=2)
##  [1]  2  4  6  8 10 12 14 16 18 20

⌾ Operations on vectors

Let us set a numeric vector first.

vec4=c(1,9,3,15,6,2,8,12)
  • Maximum & Minimum & Range
max(vec4)
## [1] 15
min(vec4)
## [1] 1
range(vec4)
## [1]  1 15
  • Sum & Mean & Median & Product
sum(vec4)
## [1] 56
mean(vec4)
## [1] 7
median(vec4)
## [1] 7
prod(vec4)
## [1] 466560
  • Variance & Standard deviation
var(vec4)
## [1] 24.57143
sd(vec4)
## [1] 4.956958
  • Length
    number of elements in the vector
length(vec4)
## [1] 8
  • Order & Sort
    sort would rearrange the sequence by ascending
    order would retrieve the index of ascending sequence
sort(vec4)
## [1]  1  2  3  6  8  9 12 15
order(vec4)
## [1] 1 6 3 5 7 2 8 4

See the result of order() above. It shows that the last value is “4”, which means that the index of largest value is “4”, that is, vec4[4] (15).

  • Unique
    remove the duplicated element
vec_dup=c(1,9,5,2,6,1,8,5,2)
unique(vec_dup)
## [1] 1 9 5 2 6 8
  • Check NA
    check if the element is NA
vec_na=c(1,9,5,NA,6,NA)
is.na(vec_na)
## [1] FALSE FALSE FALSE  TRUE FALSE  TRUE

⌾ Calculations with vectors

Let us use vec4 and set a new numeric vector vec5.

vec5=c(2,5,8,11,7,4,10,3)
  • Arithmetic (both vectors)
    add/ multiple two vectors

    Note that the lengths of the vectors should be the same while doing the calculations!

vec4+vec5
## [1]  3 14 11 26 13  6 18 15
vec4*vec5
## [1]   2  45  24 165  42   8  80  36
  • Arithmetic (vector & element)
    add/ multiple a value to all the elements in the vector
vec5+5
## [1]  7 10 13 16 12  9 15  8
vec5*5
## [1] 10 25 40 55 35 20 50 15

⌾ Convert data type

  • From character to numeric

    Note that sometimes we may get a “character” type of number when importing the data. We should convert them into “numeric” type in advance, so as to do calculations.

vec_cha=c("1","2","3","4","5")
  • Use class() to check out the data type
class(vec_cha)
## [1] "character"
  • Convert it by function as.numeric()
as.numeric(vec_cha)
## [1] 1 2 3 4 5

Also, we can convert the numeric to character as well. Use the function as.character().

1.2 Factor

⌾ Features of factor

  • converted from the character vector
  • should set the levels
  • some useful tips in R:
    • setting a factor is very useful when adjusting the sequence of legend in ggplot2
    • used to set the dummy variable in the econometric model

⌾ Use factor() to create a factor

  • Function & Argument
factor(c(...), levels=c(...))

Let us set a vector of the schools:

school=c("University","Kindergarten","Elementary","High","Junior High")

We know that the sequence should be Kindergarten >> Elementary >> Junior High >> High >> University.

  • Use factor() to set their levels
school=factor(school, levels=c("Kindergarten","Elementary","Junior High","High","University"))
## [1] University   Kindergarten Elementary   High         Junior High 
## Levels: Kindergarten Elementary Junior High High University

⌾ Check if school is the factor

is.factor(school)
## [1] TRUE
class(school)
## [1] "factor"

⌾ Calculate number of elements in each level

summary(school)
## Kindergarten   Elementary  Junior High         High   University 
##            1            1            1            1            1

⌾ Convert factor to numeric

  • Use as.numeric() to transform the factor to numeric
as.numeric(school)
## [1] 5 1 2 4 3

The numeric retrieved above represents the level of each elements.

1.3 Matrix

⌾ Features of matrix

  • contains multiple rows and columns
  • combination of multiple vectors with the same types (numeric, character or logical)

⌾ Use matrix() to create a matrix

  • Function & Argument
    • nrow= means number of rows in the matrix
    • ncol= means number of columns in the matrix
matrix(c(...), nrow=, ncol=)
mat1=matrix(c(1:15), nrow=3, ncol=5)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    4    7   10   13
## [2,]    2    5    8   11   14
## [3,]    3    6    9   12   15

See the result above, the original vector (c(1:15)) is filled by columns. What if we want it by row?

  • Set argument byrow=T
mat2=matrix(c(1:15), nrow=3, ncol=5, byrow=T)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15

⌾ Check out the number of rows, columns and dimension of matrix

nrow(mat2)
## [1] 3
ncol(mat2)
## [1] 5
dim(mat2)
## [1] 3 5

⌾ Retrieve the specific elements in the matrix

The index of row should be set at first, and then set the index of column.

mat2[2, 4]
## [1] 9

⌾ Transpose the matrix

t(mat2)
##      [,1] [,2] [,3]
## [1,]    1    6   11
## [2,]    2    7   12
## [3,]    3    8   13
## [4,]    4    9   14
## [5,]    5   10   15

⌾ Calculations on matrix

  • Sum of each row/ column
rowSums(mat2)
## [1] 15 40 65
colSums(mat2)
## [1] 18 21 24 27 30
  • Mean of each row/ column
rowMeans(mat2)
## [1]  3  8 13
colMeans(mat2)
## [1]  6  7  8  9 10

1.4 Data frame

⌾ Features of data frame

  • like a matrix but can have columns with different types
  • works like a spreadsheet (Excel)
  • commonly used in data analysis
  • can be tidy up by the package dplyr (see Section 2)

⌾ Use data.frame() to create a data frame

  • Function & Argument
data.frame(A=c(), B=c(), C=c())

A, B, and C in the function represent the column names of the table.

StuScore=data.frame(StudentID=c("ID1","ID2","ID3","ID4","ID5"),
           Name=c("Bob","Mary","Robert","Jason","Jane"),
           Score=c(60,80,40,50,100))
##   StudentID   Name Score
## 1       ID1    Bob    60
## 2       ID2   Mary    80
## 3       ID3 Robert    40
## 4       ID4  Jason    50
## 5       ID5   Jane   100

⌾ Check out the column names of data frame

colnames(StuScore)
## [1] "StudentID" "Name"      "Score"

⌾ Retrieve specific rows/ columns

  • Retrieve rows
    place the index in front of the comma
StuScore[c(1,3,5),]
##   StudentID   Name Score
## 1       ID1    Bob    60
## 3       ID3 Robert    40
## 5       ID5   Jane   100
  • Retrieve columns
    place the index after the comma
StuScore[, c(1,2)]
##   StudentID   Name
## 1       ID1    Bob
## 2       ID2   Mary
## 3       ID3 Robert
## 4       ID4  Jason
## 5       ID5   Jane
  • Retrieve columns by specifying the column names
    place the column name vector after the comma
StuScore[, c("StudentID","Name")]
##   StudentID   Name
## 1       ID1    Bob
## 2       ID2   Mary
## 3       ID3 Robert
## 4       ID4  Jason
## 5       ID5   Jane
  • Retrieve single column by data$column_name
StuScore$Name
## [1] "Bob"    "Mary"   "Robert" "Jason"  "Jane"

⌾ Retrieve specific element

  • Set the index of row and column
    data[row_index, column_index]
StuScore[2,2]
## [1] "Mary"
  • Edit the element by giving value
StuScore[2,2]="Jessica"

# check out the new data
StuScore
##   StudentID    Name Score
## 1       ID1     Bob    60
## 2       ID2 Jessica    80
## 3       ID3  Robert    40
## 4       ID4   Jason    50
## 5       ID5    Jane   100

⌾ Check out the first/ last 6 rows of data

Here we use the data iris provided by the R base. iris is a well-known data set that consists of 150 samples from each of three species of Iris flower (setosa, virginica, and versicolor). The columms include the length and the width of the sepals and petals (5 columns in total).

  • Use head() to retrieve the first 6 rows
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
  • Use tail() to retrieve the last 6 rows
tail(iris)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

⌾ Extend the data frame

Back to our StuScore example. If we want to add attributes of a new student (add new row):

  • Use rbind() to add new rows
# first create a new data
new_student=data.frame(StudentID="ID6", Name="Bob", Score=90)

# use rbind to merge data
StuScore=rbind(StuScore, new_student)

# check out the new data
StuScore
##   StudentID    Name Score
## 1       ID1     Bob    60
## 2       ID2 Jessica    80
## 3       ID3  Robert    40
## 4       ID4   Jason    50
## 5       ID5    Jane   100
## 6       ID6     Bob    90

Then, if we want to add a new attribute (e.g., Gender) for all students (add new column):

  • Use cbind() to add new columns
# first create a new data
Gender=c("M","F","M","M","F","M")

# use cbind to merge data
StuScore=cbind(StuScore, Gender)

# check out the new data
StuScore
##   StudentID    Name Score Gender
## 1       ID1     Bob    60      M
## 2       ID2 Jessica    80      F
## 3       ID3  Robert    40      M
## 4       ID4   Jason    50      M
## 5       ID5    Jane   100      F
## 6       ID6     Bob    90      M

Or it can simply add by using: data$new_column=c(...)

StuScore$Height=c(180,165,170,160,175,180)

# check out the new data
StuScore
##   StudentID    Name Score Gender Height
## 1       ID1     Bob    60      M    180
## 2       ID2 Jessica    80      F    165
## 3       ID3  Robert    40      M    170
## 4       ID4   Jason    50      M    160
## 5       ID5    Jane   100      F    175
## 6       ID6     Bob    90      M    180

⌾ Check out the duplicated rows

Let’s create a new data with duplicated rows.

StuScore_dup=data.frame(StudentID=c("ID1","ID2","ID4","ID3","ID4","ID5","ID2"),
           Name=c("Bob","Mary","Jason","Robert","Jason","Jane","Mary"),
           Score=c(60,80,40,100,40,100,80))
##   StudentID   Name Score
## 1       ID1    Bob    60
## 2       ID2   Mary    80
## 3       ID4  Jason    40
## 4       ID3 Robert   100
## 5       ID4  Jason    40
## 6       ID5   Jane   100
## 7       ID2   Mary    80

We can see some of data are duplicated (ID2 & ID4). Let’s use duplicated() to check out whether it is repeated for each row.

duplicated(StuScore_dup)
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE

The result shows that the row 5 and 7 are duplicated (TRUE). Note that the first appeared row would not be viewed as duplicated.

1.5 List

⌾ Features of list

  • collection of objects, which can be vector, matrix, or data frame
  • most complicated, many packages use list to return the result

⌾ Use list() to create a list

StuScore_list=list(StudentID=c("ID1","ID2","ID3","ID4","ID5"),
           Name=c("Bob","Mary","Robert","Jason","Jane"),
           Score=c(60,80,40,50,100),
           Class="A")
## $StudentID
## [1] "ID1" "ID2" "ID3" "ID4" "ID5"
## 
## $Name
## [1] "Bob"    "Mary"   "Robert" "Jason"  "Jane"  
## 
## $Score
## [1]  60  80  40  50 100
## 
## $Class
## [1] "A"

⌾ Retrieve the attribute of list

  • Use list$NAME to retrieve
StuScore_list$Score
## [1]  60  80  40  50 100
  • Use index to retrieve
StuScore_list[[2]]
## [1] "Bob"    "Mary"   "Robert" "Jason"  "Jane"

2 Tidy data

The package commonly used to tidy data frame is dplyr. Please install and import the package via the following code:

# install the package
install.packages("dplyr")

# import the package
library(dplyr)

Functions that would be introduced in this section are summarized in Table 2.1.

Table 2.1: Vital Functions in dplyr
Functions Usage
select() select the specific column of data frame
filter() filter the data frame by given conditions
mutate() add a new column (attribute) in data frame
group_by() group the data by specific columns, it is commonly followed by filter() or summarise()
summarise() conduct the basic calculation on data frame, it is commonly used with group_by()
left_join() join two data by a specific key (column)
arrange() arrange the data by specific columns
rename() rename the column name of data frame
distinct() remove the duplicated rows

Before introducing how to process the data by using package dplyr, let’s first see how to read the data.

⌾ Read data by read.csv()

  • From local directory
    when the file is in your computer
# function
read.csv("DIRECTORY")

# example
read.csv("C:/Users/Robert/Downloads/world.csv")

Note that the “single forward slash (/)” is used in the directory.
Do not use “single backslash (\)!
Alternatively, we can use”double backslash (\\)” instead.

  • From website
    when the file is on the website (with a url)
# function
read.csv("url")

# example
read.csv("https://raw.githubusercontent.com/ChiaJung-Yeh/R-Code-Sharing/main/R_Preliminary/world.csv")

Here we adopt the second method to obtain our data.

# read data
world=read.csv("https://raw.githubusercontent.com/ChiaJung-Yeh/R-Code-Sharing/main/R_Preliminary/world.csv")

# check out the data
world

2.1 Select

Select the column we want in the data frame. This function is the same as retrieving columns in data frame introduced in Section 1.4.

  • Function & Argument
select(DATA, col1, col2, ...)

col1 and col2 above represent the column names that would be retrieved.

⌾ Select columns by select()

# select data
world_sel1=select(world, name_long, continent)

# check out the first 6 rows
head(world_sel1)
##        name_long     continent
## 1           Fiji       Oceania
## 2       Tanzania        Africa
## 3 Western Sahara        Africa
## 4         Canada North America
## 5  United States North America
## 6     Kazakhstan          Asia

⌾ Select columns by given column names vector

First set up a character vector to store the column name that would be retrieved. Then, Use all_of() in select() to call all of the column stored in the vector.

# specified columns vector
sp_col=c("iso_a2", "region_un")

# select data with specified columns
world_sel2=select(world, all_of(sp_col))

# check out the first 6 rows
head(world_sel2)
##   iso_a2 region_un
## 1     FJ   Oceania
## 2     TZ    Africa
## 3     EH    Africa
## 4     CA  Americas
## 5     US  Americas
## 6     KZ      Asia

⌾ Delete columns in data frame

To do so, just simply add a minus (-) before the column name that should be removed.

# delete column
world_sel3=select(world, -continent, -region_un, -subregion, -type)

# check out the first 6 rows
head(world_sel3)
##   iso_a2      name_long    area_km2       pop  lifeExp gdpPercap
## 1     FJ           Fiji    19289.97    885806 69.96000  8222.254
## 2     TZ       Tanzania   932745.79  52234869 64.16300  2402.099
## 3     EH Western Sahara    96270.60        NA       NA        NA
## 4     CA         Canada 10036042.98  35535348 81.95305 43079.143
## 5     US  United States  9510743.74 318622525 78.84146 51921.985
## 6     KZ     Kazakhstan  2729810.51  17288285 71.62000 23587.338

2.2 Filter

Filter the data frame by given conditions.

  • Function & Argument
filter(DATA, condition1, condition2, ...)

condition1 and condition2 above represent the condition we set. Add more conditions in filter() means that ALL the conditions should be satisfied (condition1 & condition2 & …).

⌾ Numeric filter

filter the country where its population (pop) is over 100 million

# filter data by population
world_fil1=filter(world, pop>100000000)

# check out the first 6 rows
head(world_fil1)
##   iso_a2          name_long     continent region_un          subregion
## 1     US      United States North America  Americas   Northern America
## 2     ID          Indonesia          Asia      Asia South-Eastern Asia
## 3     RU Russian Federation        Europe    Europe     Eastern Europe
## 4     MX             Mexico North America  Americas    Central America
## 5     BR             Brazil South America  Americas      South America
## 6     NG            Nigeria        Africa    Africa     Western Africa
##                type   area_km2       pop  lifeExp gdpPercap
## 1           Country  9510743.7 318622525 78.84146 51921.985
## 2 Sovereign country  1819251.3 255131116 68.85600 10003.089
## 3 Sovereign country 17018507.4 143819666 70.74366 25284.586
## 4 Sovereign country  1969480.3 124221600 76.75300 16622.597
## 5 Sovereign country  8508557.1 204213133 75.04200 15374.262
## 6 Sovereign country   905071.7 176460502 52.54900  5671.901
# check how many countries satisfy with the requirement
nrow(world_fil1)
## [1] 12

⌾ Character filter

A %in% B is used to check if each of the element in vector A is listed in the vector B

filter the country where its continent (continent) is in Asia or Europe

# filter data by population
world_fil2=filter(world, continent %in% c("Asia", "Europe"))

# check out the first 6 rows
head(world_fil2)
##   iso_a2          name_long continent region_un          subregion
## 1     KZ         Kazakhstan      Asia      Asia       Central Asia
## 2     UZ         Uzbekistan      Asia      Asia       Central Asia
## 3     ID          Indonesia      Asia      Asia South-Eastern Asia
## 4     RU Russian Federation    Europe    Europe     Eastern Europe
## 5     NO             Norway    Europe    Europe    Northern Europe
## 6     TL        Timor-Leste      Asia      Asia South-Eastern Asia
##                type    area_km2       pop  lifeExp gdpPercap
## 1 Sovereign country  2729810.51  17288285 71.62000 23587.338
## 2 Sovereign country   461410.26  30757700 71.03900  5370.866
## 3 Sovereign country  1819251.33 255131116 68.85600 10003.089
## 4 Sovereign country 17018507.41 143819666 70.74366 25284.586
## 5 Sovereign country   397994.63        NA       NA        NA
## 6 Sovereign country    14714.93   1212814 68.28500  6262.906

⌾ Multiple filter (AND)

  • add conditions in the function filter()
  • alternatively, we can use & to connect all of the conditions in filter()

filter the country where its continent (continent) is in Asia or Europe AND the population (pop) is over 100 million

# filter data by multiple conditions
world_fil3=filter(world, continent %in% c("Asia", "Europe"), pop>100000000)

# check out the first 6 rows
head(world_fil3)
##   iso_a2          name_long continent region_un          subregion
## 1     ID          Indonesia      Asia      Asia South-Eastern Asia
## 2     RU Russian Federation    Europe    Europe     Eastern Europe
## 3     IN              India      Asia      Asia      Southern Asia
## 4     BD         Bangladesh      Asia      Asia      Southern Asia
## 5     PK           Pakistan      Asia      Asia      Southern Asia
## 6     CN              China      Asia      Asia       Eastern Asia
##                type   area_km2        pop  lifeExp gdpPercap
## 1 Sovereign country  1819251.3  255131116 68.85600 10003.089
## 2 Sovereign country 17018507.4  143819666 70.74366 25284.586
## 3 Sovereign country  3142892.1 1293859294 68.02100  5385.142
## 4 Sovereign country   133782.1  159405279 71.80300  2973.042
## 5 Sovereign country   874120.0  185546257 66.13900  4576.227
## 6           Country  9409830.5 1364270000 75.93200 12758.648
# check how many countries satisfy with the requirements
nrow(world_fil3)
## [1] 8

Or, we can use & to connect the conditions as follows.

# filter data by multiple conditions
world_fil4=filter(world, continent %in% c("Asia", "Europe") & pop>100000000)

⌾ Multiple filter (OR)

  • when any of the condition is met, retrieve the data
  • use | to connect all of the conditions in filter()
# filter data by multiple conditions
world_fil5=filter(world, continent %in% c("Asia", "Europe") | pop>100000000)

# check how many countries satisfy with the requirements
nrow(world_fil5)
## [1] 90

2.3 Mutate

Add the new columns (attributes) or update the columns in data frame. This function is similar to cbind() introduced in Section 1.4.

  • Function & Argument
mutate(DATA, A=..., B=..., ...)

A and B above represent the operation of generating the new columns.

⌾ Mutate a new column by column operations

add a column to calculate the population density of each country (pop_dens)

Population density is population (pop) divided by area (area_km2).

# mutate a column for population density
world_mut1=mutate(world, pop_dens=pop/area_km2)

# check out the first 6 rows
head(world_mut1[, c("name_long","continent","pop","area_km2","pop_dens")])
##        name_long     continent       pop    area_km2  pop_dens
## 1           Fiji       Oceania    885806    19289.97 45.920547
## 2       Tanzania        Africa  52234869   932745.79 56.001184
## 3 Western Sahara        Africa        NA    96270.60        NA
## 4         Canada North America  35535348 10036042.98  3.540773
## 5  United States North America 318622525  9510743.74 33.501326
## 6     Kazakhstan          Asia  17288285  2729810.51  6.333145

Since there are missing data on population in some of countries, there exists NA in pop_dens.

⌾ Mutate a new column with ifelse()

  • Function & Argument
ifelse(condition, [do when true], [do when false])
add a column to label the country if area and population are both higher than median (label 1 if true)

Note that there are NA in the column, we should set na.rm=T in the function median(), to remove NA in advance.

# calculate the medium first
pop_med=median(world$pop, na.rm=T)
area_med=median(world$area_km2, na.rm=T)

# mutate a column for label
world_mut2=mutate(world, pop_area=ifelse(pop>pop_med & area_km2>area_med, 1,0))

# check out the first 6 rows
head(world_mut2[, c("name_long","continent","pop","area_km2","pop_area")])
##        name_long     continent       pop    area_km2 pop_area
## 1           Fiji       Oceania    885806    19289.97        0
## 2       Tanzania        Africa  52234869   932745.79        1
## 3 Western Sahara        Africa        NA    96270.60        0
## 4         Canada North America  35535348 10036042.98        1
## 5  United States North America 318622525  9510743.74        1
## 6     Kazakhstan          Asia  17288285  2729810.51        1

⌾ Mutate a new column with case_when()

  • Function & Argument
case_when(DATA, 
          condition1 ~ result1, 
          condition2 ~ result2, 
          condition3 ~ result3,...)

It means that if the condition is satisfied, then retrieve the result.

add a column to classify the country into large area large population (LALP), large area small population (LASP), small area large population (SALP), and small area small population (SASP) by the median of each attribute

# calculate the medium first
pop_med=median(world$pop, na.rm=T)
area_med=median(world$area_km2, na.rm=T)

# mutate a column for classfication
world_mut3=mutate(world, pop_area_cla=case_when(
  pop>pop_med & area_km2>area_med ~ "LALP",
  pop>pop_med & area_km2<area_med ~ "LASP",
  pop<pop_med & area_km2>area_med ~ "SALP",
  pop<pop_med & area_km2<area_med ~ "SASP"))

# check out the first 6 rows
head(world_mut3[, c("name_long","continent","pop","area_km2","pop_area_cla")])
##        name_long     continent       pop    area_km2 pop_area_cla
## 1           Fiji       Oceania    885806    19289.97         SASP
## 2       Tanzania        Africa  52234869   932745.79         LALP
## 3 Western Sahara        Africa        NA    96270.60         <NA>
## 4         Canada North America  35535348 10036042.98         LALP
## 5  United States North America 318622525  9510743.74         LALP
## 6     Kazakhstan          Asia  17288285  2729810.51         LALP

2.4 Group by (& Summarise)

group_by() is usually followed by filter() or summarise(). To connect two different function, package dplyr provides a useful tools, pipe (%>%), to connect the functions. When using it, data should only place in the first function.

  • Function & Argument

group_by() %>% filter()

group_by(DATA, column)%>%
  filter(condition)

group_by() %>% summrise()

group_by(DATA, column)%>%
  summarise(operation)

⌾ Group by with filter

filter the largest population (pop) of each continent

# calculate the medium first
world_gro1=group_by(world, continent)%>%
  filter(pop==max(pop, na.rm=T))

# check out the result
world_gro1
## # A tibble: 6 x 10
## # Groups:   continent [6]
##   iso_a2 name_long  conti~1 regio~2 subre~3 type  area_~4    pop lifeExp gdpPe~5
##   <chr>  <chr>      <chr>   <chr>   <chr>   <chr>   <dbl>  <int>   <dbl>   <dbl>
## 1 US     United St~ North ~ Americ~ Northe~ Coun~  9.51e6 3.19e8    78.8  51922.
## 2 RU     Russian F~ Europe  Europe  Easter~ Sove~  1.70e7 1.44e8    70.7  25285.
## 3 BR     Brazil     South ~ Americ~ South ~ Sove~  8.51e6 2.04e8    75.0  15374.
## 4 NG     Nigeria    Africa  Africa  Wester~ Sove~  9.05e5 1.76e8    52.5   5672.
## 5 AU     Australia  Oceania Oceania Austra~ Coun~  7.69e6 2.35e7    82.3  43547.
## 6 CN     China      Asia    Asia    Easter~ Coun~  9.41e6 1.36e9    75.9  12759.
## # ... with abbreviated variable names 1: continent, 2: region_un, 3: subregion,
## #   4: area_km2, 5: gdpPercap

By using group_by() %>% filter(), all of the attributes would be retained.

⌾ Group by with summarise

calculate the total population (pop) and average life expectancy (lifeExp) of each continent

# calculate the medium first
world_gro2=group_by(world, continent)%>%
  summarise(total_pop=sum(pop, na.rm=T),
            ave_lifeExp=mean(lifeExp, na.rm=T))

# check out the result
world_gro2
## # A tibble: 8 x 3
##   continent                total_pop ave_lifeExp
##   <chr>                        <dbl>       <dbl>
## 1 Africa                  1154946633        61.2
## 2 Antarctica                       0       NaN  
## 3 Asia                    4311408059        72.6
## 4 Europe                   669036256        78.1
## 5 North America            565028684        75.0
## 6 Oceania                   37757833        74.0
## 7 Seven seas (open ocean)          0       NaN  
## 8 South America            412060811        73.8

By using group_by() %>% summarise(), it can retrieve the operation on specific column. Note that only the specified column are retained in the result.

2.5 Left join

Merge two different data by referencing the key (terminology in database management: primary key)

  • Function & Argument
left_join(DATA1, DATA2, by=c("col1"="col2"))

It means that using col1 in DATA1 and col2 in DATA2 to be the ket referencee, in order to merge two data.

⌾ Left join two data

Here we first read the coffee production data of the world. The code is as follows.

coffee=read.csv("https://raw.githubusercontent.com/ChiaJung-Yeh/R-Code-Sharing/main/R_Preliminary/coffee.csv")

Then we can merge these two data by using the country name. Note that the country name in data world and world_coffee are all name_long, and hence, we can simply set by="name_long", or even without setting it (the package will automatically find the common name to be the key.)

# left join two data
world_coffee=left_join(world, coffee, by="name_long")

# check out the first 6 rows
head(world_coffee)
##   iso_a2      name_long     continent region_un        subregion
## 1     FJ           Fiji       Oceania   Oceania        Melanesia
## 2     TZ       Tanzania        Africa    Africa   Eastern Africa
## 3     EH Western Sahara        Africa    Africa  Northern Africa
## 4     CA         Canada North America  Americas Northern America
## 5     US  United States North America  Americas Northern America
## 6     KZ     Kazakhstan          Asia      Asia     Central Asia
##                type    area_km2       pop  lifeExp gdpPercap
## 1 Sovereign country    19289.97    885806 69.96000  8222.254
## 2 Sovereign country   932745.79  52234869 64.16300  2402.099
## 3     Indeterminate    96270.60        NA       NA        NA
## 4 Sovereign country 10036042.98  35535348 81.95305 43079.143
## 5           Country  9510743.74 318622525 78.84146 51921.985
## 6 Sovereign country  2729810.51  17288285 71.62000 23587.338
##   coffee_production_2016 coffee_production_2017
## 1                     NA                     NA
## 2                     81                     66
## 3                     NA                     NA
## 4                     NA                     NA
## 5                     NA                     NA
## 6                     NA                     NA

The result show that only Tanzania produce coffee in the first 6 countries in data.

2.6 Arrange

Arrange the value of specific column.

  • Function & Argument
arrange(DATA, col)

It means that using col to arrange data by ascending value. Note that the data class of col should be numeric or factor.

⌾ Arrange data by ascending order

arrange the data by population with ascending value

# arrange data by population
world_arr1=arrange(world, pop)

# check out the first 6 rows
head(world_arr1[, c("name_long","continent","pop")])
##       name_long     continent    pop
## 1     Greenland North America  56295
## 2       Vanuatu       Oceania 258850
## 3 New Caledonia       Oceania 268050
## 4       Iceland        Europe 327386
## 5        Belize North America 351694
## 6       Bahamas North America 382169

⌾ Arrange data by descending order

The order in arrange() is ascending by default. What if we want the value descending?
Simply use function desc() in arrange() to do so.

arrange the data by population with descending value

# arrange data by population
world_arr2=arrange(world, desc(pop))

# check out the first 6 rows
head(world_arr2[, c("name_long","continent","pop")])
##       name_long     continent        pop
## 1         China          Asia 1364270000
## 2         India          Asia 1293859294
## 3 United States North America  318622525
## 4     Indonesia          Asia  255131116
## 5        Brazil South America  204213133
## 6      Pakistan          Asia  185546257

2.7 Rename

Rename the column name of the data.

  • Function & Argument
rename(DATA, var_new=var_old)

The function above would replace the column name var_old with var_new in the data.

⌾ Rename by rename()

rename name_long to country in data world

# rename the column
world_ren=rename(world, country=name_long)

# check out the first 6 rows
head(world_ren[, c("country","continent")])
##          country     continent
## 1           Fiji       Oceania
## 2       Tanzania        Africa
## 3 Western Sahara        Africa
## 4         Canada North America
## 5  United States North America
## 6     Kazakhstan          Asia

2.8 distinct

Remove the duplicated rows in data frame.

  • Function & Argument
distinct(DATA)

⌾ Distinct by distinct()

Here we first use function select() to choose the columns we want. We can then use duplicated() (see Section 1.4 ⌾ Check out the duplicated rows) to test if it exists duplicated rows. Last, we use distinct() to remove the repeated ones.

# select data only three columns: continent, region_un, and subregion
world_sub=select(world, continent, region_un, subregion)

# check out the first 6 rows
head(world_sub)
##       continent region_un        subregion
## 1       Oceania   Oceania        Melanesia
## 2        Africa    Africa   Eastern Africa
## 3        Africa    Africa  Northern Africa
## 4 North America  Americas Northern America
## 5 North America  Americas Northern America
## 6          Asia      Asia     Central Asia
# check if there is duplicated rows in world_sub (show first 10 results)
duplicated(world_sub)[1:10]
##  [1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE
# world_sub has duplicated rows, using distinct to retain the unique one
world_sub_dis=distinct(world_sub)

# check out the first 6 rows
head(world_sub_dis)
##       continent region_un          subregion
## 1       Oceania   Oceania          Melanesia
## 2        Africa    Africa     Eastern Africa
## 3        Africa    Africa    Northern Africa
## 4 North America  Americas   Northern America
## 5          Asia      Asia       Central Asia
## 6          Asia      Asia South-Eastern Asia

3 Closing

Now, we know the basic syntax of R programming, and are familiar with package dplyr. It is enough to do some big data analysis!!

In the class in 12/10 (Sat.), we need to dig into the GIS operation in R. Before the class, please install the following packages in advance. Just run the code below!

# sf (for GIS)
install.packages("sf")

# ggplot2 (for plot)
install.packages("ggplot2")

# spData (for GIS data)
install.packages("spData")

# TWspdata (for GIS data of Taiwan)
install.packages("devtools")
devtools::install_github("ChiaJung-Yeh/TWspdata")

# TDX (for connecting to transport data)
devtools::install_github("ChiaJung-Yeh/NYCU_TDX")

If you have any questions regarding the installation of R or the basic syntax of R programming, please feel free to contact TA (葉家榮).
E-mail: robert1328.mg10@nycu.edu.tw

See you soon!