This notebook would introduce the basic syntax and dplyr
package of R Programming. Before digging into Spatial Analysis, let us be familiar with R. The installation of R and R Studio please see here.
Object is the type of data. Commonly used object includes:
⌾ Features of vector
⌾ Use c()
to create a vector
vec1=c(1,2,3,4,5)
## [1] 1 2 3 4 5
vec2=c("A","B","C","D","E")
## [1] "A" "B" "C" "D" "E"
vec3=c(T,F,T,F,T)
## [1] TRUE FALSE TRUE FALSE TRUE
⌾ Retrieve the specific elements
vec2
vec2[3]
## [1] "C"
vec2
vec2[2:4]
## [1] "B" "C" "D"
vec2
vec2[c(1,3,5)]
## [1] "A" "C" "E"
⌾ Create sequential numeric vector
seq(from=2, to=20, by=2)
## [1] 2 4 6 8 10 12 14 16 18 20
⌾ Operations on vectors
Let us set a numeric vector first.
vec4=c(1,9,3,15,6,2,8,12)
max(vec4)
## [1] 15
min(vec4)
## [1] 1
range(vec4)
## [1] 1 15
sum(vec4)
## [1] 56
mean(vec4)
## [1] 7
median(vec4)
## [1] 7
prod(vec4)
## [1] 466560
var(vec4)
## [1] 24.57143
sd(vec4)
## [1] 4.956958
length(vec4)
## [1] 8
sort(vec4)
## [1] 1 2 3 6 8 9 12 15
order(vec4)
## [1] 1 6 3 5 7 2 8 4
See the result of order()
above. It shows that the last value is “4”, which means that the index of largest value is “4”, that is, vec4[4]
(15).
vec_dup=c(1,9,5,2,6,1,8,5,2)
unique(vec_dup)
## [1] 1 9 5 2 6 8
vec_na=c(1,9,5,NA,6,NA)
is.na(vec_na)
## [1] FALSE FALSE FALSE TRUE FALSE TRUE
⌾ Calculations with vectors
Let us use vec4
and set a new numeric vector vec5
.
vec5=c(2,5,8,11,7,4,10,3)
Note that the lengths of the vectors should be the same while doing the calculations!
vec4+vec5
## [1] 3 14 11 26 13 6 18 15
vec4*vec5
## [1] 2 45 24 165 42 8 80 36
vec5+5
## [1] 7 10 13 16 12 9 15 8
vec5*5
## [1] 10 25 40 55 35 20 50 15
⌾ Convert data type
Note that sometimes we may get a “character” type of number when importing the data. We should convert them into “numeric” type in advance, so as to do calculations.
vec_cha=c("1","2","3","4","5")
class()
to check out the data typeclass(vec_cha)
## [1] "character"
as.numeric()
as.numeric(vec_cha)
## [1] 1 2 3 4 5
Also, we can convert the numeric to character as well. Use the function as.character()
.
⌾ Features of factor
ggplot2
⌾ Use factor()
to create a factor
factor(c(...), levels=c(...))
Let us set a vector of the schools:
school=c("University","Kindergarten","Elementary","High","Junior High")
We know that the sequence should be Kindergarten >> Elementary >> Junior High >> High >> University.
factor()
to set their levelsschool=factor(school, levels=c("Kindergarten","Elementary","Junior High","High","University"))
## [1] University Kindergarten Elementary High Junior High
## Levels: Kindergarten Elementary Junior High High University
⌾ Check if school
is the factor
is.factor(school)
## [1] TRUE
class(school)
## [1] "factor"
⌾ Calculate number of elements in each level
summary(school)
## Kindergarten Elementary Junior High High University
## 1 1 1 1 1
⌾ Convert factor to numeric
as.numeric()
to transform the factor to numericas.numeric(school)
## [1] 5 1 2 4 3
The numeric retrieved above represents the level of each elements.
⌾ Features of matrix
⌾ Use matrix()
to create a matrix
nrow=
means number of rows in the matrixncol=
means number of columns in the matrixmatrix(c(...), nrow=, ncol=)
mat1=matrix(c(1:15), nrow=3, ncol=5)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 4 7 10 13
## [2,] 2 5 8 11 14
## [3,] 3 6 9 12 15
See the result above, the original vector (c(1:15)
) is filled by columns. What if we want it by row?
byrow=T
mat2=matrix(c(1:15), nrow=3, ncol=5, byrow=T)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10
## [3,] 11 12 13 14 15
⌾ Check out the number of rows, columns and dimension of matrix
nrow(mat2)
## [1] 3
ncol(mat2)
## [1] 5
dim(mat2)
## [1] 3 5
⌾ Retrieve the specific elements in the matrix
The index of row should be set at first, and then set the index of column.
mat2[2, 4]
## [1] 9
⌾ Transpose the matrix
t(mat2)
## [,1] [,2] [,3]
## [1,] 1 6 11
## [2,] 2 7 12
## [3,] 3 8 13
## [4,] 4 9 14
## [5,] 5 10 15
⌾ Calculations on matrix
rowSums(mat2)
## [1] 15 40 65
colSums(mat2)
## [1] 18 21 24 27 30
rowMeans(mat2)
## [1] 3 8 13
colMeans(mat2)
## [1] 6 7 8 9 10
⌾ Features of data frame
dplyr
(see Section 2)
⌾ Use data.frame()
to create a data frame
data.frame(A=c(), B=c(), C=c())
A
, B
, and C
in the function represent the column names of the table.
StuScore=data.frame(StudentID=c("ID1","ID2","ID3","ID4","ID5"),
Name=c("Bob","Mary","Robert","Jason","Jane"),
Score=c(60,80,40,50,100))
## StudentID Name Score
## 1 ID1 Bob 60
## 2 ID2 Mary 80
## 3 ID3 Robert 40
## 4 ID4 Jason 50
## 5 ID5 Jane 100
⌾ Check out the column names of data frame
colnames(StuScore)
## [1] "StudentID" "Name" "Score"
⌾ Retrieve specific rows/ columns
StuScore[c(1,3,5),]
## StudentID Name Score
## 1 ID1 Bob 60
## 3 ID3 Robert 40
## 5 ID5 Jane 100
StuScore[, c(1,2)]
## StudentID Name
## 1 ID1 Bob
## 2 ID2 Mary
## 3 ID3 Robert
## 4 ID4 Jason
## 5 ID5 Jane
StuScore[, c("StudentID","Name")]
## StudentID Name
## 1 ID1 Bob
## 2 ID2 Mary
## 3 ID3 Robert
## 4 ID4 Jason
## 5 ID5 Jane
data$column_name
StuScore$Name
## [1] "Bob" "Mary" "Robert" "Jason" "Jane"
⌾ Retrieve specific element
data[row_index, column_index]
StuScore[2,2]
## [1] "Mary"
StuScore[2,2]="Jessica"
# check out the new data
StuScore
## StudentID Name Score
## 1 ID1 Bob 60
## 2 ID2 Jessica 80
## 3 ID3 Robert 40
## 4 ID4 Jason 50
## 5 ID5 Jane 100
⌾ Check out the first/ last 6 rows of data
Here we use the data iris
provided by the R base. iris
is a well-known data set that consists of 150 samples from each of three species of Iris flower (setosa, virginica, and versicolor). The columms include the length and the width of the sepals and petals (5 columns in total).
head()
to retrieve the first 6 rowshead(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
tail()
to retrieve the last 6 rowstail(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
⌾ Extend the data frame
Back to our StuScore
example. If we want to add attributes of a new student (add new row):
rbind()
to add new rows# first create a new data
new_student=data.frame(StudentID="ID6", Name="Bob", Score=90)
# use rbind to merge data
StuScore=rbind(StuScore, new_student)
# check out the new data
StuScore
## StudentID Name Score
## 1 ID1 Bob 60
## 2 ID2 Jessica 80
## 3 ID3 Robert 40
## 4 ID4 Jason 50
## 5 ID5 Jane 100
## 6 ID6 Bob 90
Then, if we want to add a new attribute (e.g., Gender) for all students (add new column):
cbind()
to add new columns# first create a new data
Gender=c("M","F","M","M","F","M")
# use cbind to merge data
StuScore=cbind(StuScore, Gender)
# check out the new data
StuScore
## StudentID Name Score Gender
## 1 ID1 Bob 60 M
## 2 ID2 Jessica 80 F
## 3 ID3 Robert 40 M
## 4 ID4 Jason 50 M
## 5 ID5 Jane 100 F
## 6 ID6 Bob 90 M
Or it can simply add by using: data$new_column=c(...)
StuScore$Height=c(180,165,170,160,175,180)
# check out the new data
StuScore
## StudentID Name Score Gender Height
## 1 ID1 Bob 60 M 180
## 2 ID2 Jessica 80 F 165
## 3 ID3 Robert 40 M 170
## 4 ID4 Jason 50 M 160
## 5 ID5 Jane 100 F 175
## 6 ID6 Bob 90 M 180
⌾ Check out the duplicated rows
Let’s create a new data with duplicated rows.
StuScore_dup=data.frame(StudentID=c("ID1","ID2","ID4","ID3","ID4","ID5","ID2"),
Name=c("Bob","Mary","Jason","Robert","Jason","Jane","Mary"),
Score=c(60,80,40,100,40,100,80))
## StudentID Name Score
## 1 ID1 Bob 60
## 2 ID2 Mary 80
## 3 ID4 Jason 40
## 4 ID3 Robert 100
## 5 ID4 Jason 40
## 6 ID5 Jane 100
## 7 ID2 Mary 80
We can see some of data are duplicated (ID2 & ID4). Let’s use duplicated()
to check out whether it is repeated for each row.
duplicated(StuScore_dup)
## [1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE
The result shows that the row 5 and 7 are duplicated (TRUE
). Note that the first appeared row would not be viewed as duplicated.
⌾ Features of list
⌾ Use list()
to create a list
StuScore_list=list(StudentID=c("ID1","ID2","ID3","ID4","ID5"),
Name=c("Bob","Mary","Robert","Jason","Jane"),
Score=c(60,80,40,50,100),
Class="A")
## $StudentID
## [1] "ID1" "ID2" "ID3" "ID4" "ID5"
##
## $Name
## [1] "Bob" "Mary" "Robert" "Jason" "Jane"
##
## $Score
## [1] 60 80 40 50 100
##
## $Class
## [1] "A"
⌾ Retrieve the attribute of list
list$NAME
to retrieveStuScore_list$Score
## [1] 60 80 40 50 100
StuScore_list[[2]]
## [1] "Bob" "Mary" "Robert" "Jason" "Jane"
The package commonly used to tidy data frame is dplyr
. Please install and import the package via the following code:
# install the package
install.packages("dplyr")
# import the package
library(dplyr)
Functions that would be introduced in this section are summarized in Table 2.1.
Functions | Usage |
---|---|
select()
|
select the specific column of data frame |
filter()
|
filter the data frame by given conditions |
mutate()
|
add a new column (attribute) in data frame |
group_by()
|
group the data by specific columns, it is commonly followed by filter() or summarise()
|
summarise()
|
conduct the basic calculation on data frame, it is commonly used with group_by()
|
left_join()
|
join two data by a specific key (column) |
arrange()
|
arrange the data by specific columns |
rename()
|
rename the column name of data frame |
distinct()
|
remove the duplicated rows |
Before introducing how to process the data by using package dplyr
, let’s first see how to read the data.
⌾ Read data by read.csv()
# function
read.csv("DIRECTORY")
# example
read.csv("C:/Users/Robert/Downloads/world.csv")
Note that the “single forward slash (/)” is used in the directory.
Do not use “single backslash (\)!
Alternatively, we can use”double backslash (\\)” instead.
# function
read.csv("url")
# example
read.csv("https://raw.githubusercontent.com/ChiaJung-Yeh/R-Code-Sharing/main/R_Preliminary/world.csv")
Here we adopt the second method to obtain our data.
# read data
world=read.csv("https://raw.githubusercontent.com/ChiaJung-Yeh/R-Code-Sharing/main/R_Preliminary/world.csv")
# check out the data
world
Select the column we want in the data frame. This function is the same as retrieving columns in data frame introduced in Section 1.4.
select(DATA, col1, col2, ...)
col1
and col2
above represent the column names that would be retrieved.
⌾ Select columns by select()
# select data
world_sel1=select(world, name_long, continent)
# check out the first 6 rows
head(world_sel1)
## name_long continent
## 1 Fiji Oceania
## 2 Tanzania Africa
## 3 Western Sahara Africa
## 4 Canada North America
## 5 United States North America
## 6 Kazakhstan Asia
⌾ Select columns by given column names vector
First set up a character vector to store the column name that would be retrieved. Then, Use all_of()
in select()
to call all of the column stored in the vector.
# specified columns vector
sp_col=c("iso_a2", "region_un")
# select data with specified columns
world_sel2=select(world, all_of(sp_col))
# check out the first 6 rows
head(world_sel2)
## iso_a2 region_un
## 1 FJ Oceania
## 2 TZ Africa
## 3 EH Africa
## 4 CA Americas
## 5 US Americas
## 6 KZ Asia
⌾ Delete columns in data frame
To do so, just simply add a minus (-
) before the column name that should be removed.
# delete column
world_sel3=select(world, -continent, -region_un, -subregion, -type)
# check out the first 6 rows
head(world_sel3)
## iso_a2 name_long area_km2 pop lifeExp gdpPercap
## 1 FJ Fiji 19289.97 885806 69.96000 8222.254
## 2 TZ Tanzania 932745.79 52234869 64.16300 2402.099
## 3 EH Western Sahara 96270.60 NA NA NA
## 4 CA Canada 10036042.98 35535348 81.95305 43079.143
## 5 US United States 9510743.74 318622525 78.84146 51921.985
## 6 KZ Kazakhstan 2729810.51 17288285 71.62000 23587.338
Filter the data frame by given conditions.
filter(DATA, condition1, condition2, ...)
condition1
and condition2
above represent the condition we set. Add more conditions in filter()
means that ALL the conditions should be satisfied (condition1 & condition2 & …).
⌾ Numeric filter
filter the country where its population (pop
) is over 100 million
# filter data by population
world_fil1=filter(world, pop>100000000)
# check out the first 6 rows
head(world_fil1)
## iso_a2 name_long continent region_un subregion
## 1 US United States North America Americas Northern America
## 2 ID Indonesia Asia Asia South-Eastern Asia
## 3 RU Russian Federation Europe Europe Eastern Europe
## 4 MX Mexico North America Americas Central America
## 5 BR Brazil South America Americas South America
## 6 NG Nigeria Africa Africa Western Africa
## type area_km2 pop lifeExp gdpPercap
## 1 Country 9510743.7 318622525 78.84146 51921.985
## 2 Sovereign country 1819251.3 255131116 68.85600 10003.089
## 3 Sovereign country 17018507.4 143819666 70.74366 25284.586
## 4 Sovereign country 1969480.3 124221600 76.75300 16622.597
## 5 Sovereign country 8508557.1 204213133 75.04200 15374.262
## 6 Sovereign country 905071.7 176460502 52.54900 5671.901
# check how many countries satisfy with the requirement
nrow(world_fil1)
## [1] 12
⌾ Character filter
A %in% B
is used to check if each of the element in vector A is listed in the vector B
filter the country where its continent (continent
) is in Asia or Europe
# filter data by population
world_fil2=filter(world, continent %in% c("Asia", "Europe"))
# check out the first 6 rows
head(world_fil2)
## iso_a2 name_long continent region_un subregion
## 1 KZ Kazakhstan Asia Asia Central Asia
## 2 UZ Uzbekistan Asia Asia Central Asia
## 3 ID Indonesia Asia Asia South-Eastern Asia
## 4 RU Russian Federation Europe Europe Eastern Europe
## 5 NO Norway Europe Europe Northern Europe
## 6 TL Timor-Leste Asia Asia South-Eastern Asia
## type area_km2 pop lifeExp gdpPercap
## 1 Sovereign country 2729810.51 17288285 71.62000 23587.338
## 2 Sovereign country 461410.26 30757700 71.03900 5370.866
## 3 Sovereign country 1819251.33 255131116 68.85600 10003.089
## 4 Sovereign country 17018507.41 143819666 70.74366 25284.586
## 5 Sovereign country 397994.63 NA NA NA
## 6 Sovereign country 14714.93 1212814 68.28500 6262.906
⌾ Multiple filter (AND)
filter()
&
to connect all of the conditions in filter()
filter the country where its continent (continent
) is in Asia or Europe AND the population (pop
) is over 100 million
# filter data by multiple conditions
world_fil3=filter(world, continent %in% c("Asia", "Europe"), pop>100000000)
# check out the first 6 rows
head(world_fil3)
## iso_a2 name_long continent region_un subregion
## 1 ID Indonesia Asia Asia South-Eastern Asia
## 2 RU Russian Federation Europe Europe Eastern Europe
## 3 IN India Asia Asia Southern Asia
## 4 BD Bangladesh Asia Asia Southern Asia
## 5 PK Pakistan Asia Asia Southern Asia
## 6 CN China Asia Asia Eastern Asia
## type area_km2 pop lifeExp gdpPercap
## 1 Sovereign country 1819251.3 255131116 68.85600 10003.089
## 2 Sovereign country 17018507.4 143819666 70.74366 25284.586
## 3 Sovereign country 3142892.1 1293859294 68.02100 5385.142
## 4 Sovereign country 133782.1 159405279 71.80300 2973.042
## 5 Sovereign country 874120.0 185546257 66.13900 4576.227
## 6 Country 9409830.5 1364270000 75.93200 12758.648
# check how many countries satisfy with the requirements
nrow(world_fil3)
## [1] 8
Or, we can use &
to connect the conditions as follows.
# filter data by multiple conditions
world_fil4=filter(world, continent %in% c("Asia", "Europe") & pop>100000000)
⌾ Multiple filter (OR)
|
to connect all of the conditions in filter()
# filter data by multiple conditions
world_fil5=filter(world, continent %in% c("Asia", "Europe") | pop>100000000)
# check how many countries satisfy with the requirements
nrow(world_fil5)
## [1] 90
Add the new columns (attributes) or update the columns in data frame. This function is similar to cbind()
introduced in Section 1.4.
mutate(DATA, A=..., B=..., ...)
A
and B
above represent the operation of generating the new columns.
⌾ Mutate a new column by column operations
add a column to calculate the population density of each country (pop_dens
)
Population density is population (pop
) divided by area (area_km2
).
# mutate a column for population density
world_mut1=mutate(world, pop_dens=pop/area_km2)
# check out the first 6 rows
head(world_mut1[, c("name_long","continent","pop","area_km2","pop_dens")])
## name_long continent pop area_km2 pop_dens
## 1 Fiji Oceania 885806 19289.97 45.920547
## 2 Tanzania Africa 52234869 932745.79 56.001184
## 3 Western Sahara Africa NA 96270.60 NA
## 4 Canada North America 35535348 10036042.98 3.540773
## 5 United States North America 318622525 9510743.74 33.501326
## 6 Kazakhstan Asia 17288285 2729810.51 6.333145
Since there are missing data on population in some of countries, there exists NA in pop_dens
.
⌾ Mutate a new column with ifelse()
ifelse(condition, [do when true], [do when false])
add a column to label the country if area and population are both higher than median (label 1 if true)
Note that there are NA in the column, we should set na.rm=T
in the function median()
, to remove NA in advance.
# calculate the medium first
pop_med=median(world$pop, na.rm=T)
area_med=median(world$area_km2, na.rm=T)
# mutate a column for label
world_mut2=mutate(world, pop_area=ifelse(pop>pop_med & area_km2>area_med, 1,0))
# check out the first 6 rows
head(world_mut2[, c("name_long","continent","pop","area_km2","pop_area")])
## name_long continent pop area_km2 pop_area
## 1 Fiji Oceania 885806 19289.97 0
## 2 Tanzania Africa 52234869 932745.79 1
## 3 Western Sahara Africa NA 96270.60 0
## 4 Canada North America 35535348 10036042.98 1
## 5 United States North America 318622525 9510743.74 1
## 6 Kazakhstan Asia 17288285 2729810.51 1
⌾ Mutate a new column with case_when()
case_when(DATA,
condition1 ~ result1,
condition2 ~ result2,
condition3 ~ result3,...)
It means that if the condition
is satisfied, then retrieve the result
.
add a column to classify the country into large area large population (LALP), large area small population (LASP), small area large population (SALP), and small area small population (SASP) by the median of each attribute
# calculate the medium first
pop_med=median(world$pop, na.rm=T)
area_med=median(world$area_km2, na.rm=T)
# mutate a column for classfication
world_mut3=mutate(world, pop_area_cla=case_when(
pop>pop_med & area_km2>area_med ~ "LALP",
pop>pop_med & area_km2<area_med ~ "LASP",
pop<pop_med & area_km2>area_med ~ "SALP",
pop<pop_med & area_km2<area_med ~ "SASP"))
# check out the first 6 rows
head(world_mut3[, c("name_long","continent","pop","area_km2","pop_area_cla")])
## name_long continent pop area_km2 pop_area_cla
## 1 Fiji Oceania 885806 19289.97 SASP
## 2 Tanzania Africa 52234869 932745.79 LALP
## 3 Western Sahara Africa NA 96270.60 <NA>
## 4 Canada North America 35535348 10036042.98 LALP
## 5 United States North America 318622525 9510743.74 LALP
## 6 Kazakhstan Asia 17288285 2729810.51 LALP
group_by()
is usually followed by filter()
or summarise()
. To connect two different function, package dplyr
provides a useful tools, pipe (%>%
), to connect the functions. When using it, data should only place in the first function.
group_by() %>% filter()
group_by(DATA, column)%>%
filter(condition)
group_by() %>% summrise()
group_by(DATA, column)%>%
summarise(operation)
⌾ Group by with filter
filter the largest population (pop
) of each continent
# calculate the medium first
world_gro1=group_by(world, continent)%>%
filter(pop==max(pop, na.rm=T))
# check out the result
world_gro1
## # A tibble: 6 x 10
## # Groups: continent [6]
## iso_a2 name_long conti~1 regio~2 subre~3 type area_~4 pop lifeExp gdpPe~5
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl> <dbl>
## 1 US United St~ North ~ Americ~ Northe~ Coun~ 9.51e6 3.19e8 78.8 51922.
## 2 RU Russian F~ Europe Europe Easter~ Sove~ 1.70e7 1.44e8 70.7 25285.
## 3 BR Brazil South ~ Americ~ South ~ Sove~ 8.51e6 2.04e8 75.0 15374.
## 4 NG Nigeria Africa Africa Wester~ Sove~ 9.05e5 1.76e8 52.5 5672.
## 5 AU Australia Oceania Oceania Austra~ Coun~ 7.69e6 2.35e7 82.3 43547.
## 6 CN China Asia Asia Easter~ Coun~ 9.41e6 1.36e9 75.9 12759.
## # ... with abbreviated variable names 1: continent, 2: region_un, 3: subregion,
## # 4: area_km2, 5: gdpPercap
By using group_by() %>% filter()
, all of the attributes would be retained.
⌾ Group by with summarise
calculate the total population (pop
) and average life expectancy (lifeExp
) of each continent
# calculate the medium first
world_gro2=group_by(world, continent)%>%
summarise(total_pop=sum(pop, na.rm=T),
ave_lifeExp=mean(lifeExp, na.rm=T))
# check out the result
world_gro2
## # A tibble: 8 x 3
## continent total_pop ave_lifeExp
## <chr> <dbl> <dbl>
## 1 Africa 1154946633 61.2
## 2 Antarctica 0 NaN
## 3 Asia 4311408059 72.6
## 4 Europe 669036256 78.1
## 5 North America 565028684 75.0
## 6 Oceania 37757833 74.0
## 7 Seven seas (open ocean) 0 NaN
## 8 South America 412060811 73.8
By using group_by() %>% summarise()
, it can retrieve the operation on specific column. Note that only the specified column are retained in the result.
Merge two different data by referencing the key (terminology in database management: primary key)
left_join(DATA1, DATA2, by=c("col1"="col2"))
It means that using col1
in DATA1
and col2
in DATA2
to be the ket referencee, in order to merge two data.
⌾ Left join two data
Here we first read the coffee production data of the world. The code is as follows.
coffee=read.csv("https://raw.githubusercontent.com/ChiaJung-Yeh/R-Code-Sharing/main/R_Preliminary/coffee.csv")
Then we can merge these two data by using the country name. Note that the country name in data world
and world_coffee
are all name_long
, and hence, we can simply set by="name_long"
, or even without setting it (the package will automatically find the common name to be the key.)
# left join two data
world_coffee=left_join(world, coffee, by="name_long")
# check out the first 6 rows
head(world_coffee)
## iso_a2 name_long continent region_un subregion
## 1 FJ Fiji Oceania Oceania Melanesia
## 2 TZ Tanzania Africa Africa Eastern Africa
## 3 EH Western Sahara Africa Africa Northern Africa
## 4 CA Canada North America Americas Northern America
## 5 US United States North America Americas Northern America
## 6 KZ Kazakhstan Asia Asia Central Asia
## type area_km2 pop lifeExp gdpPercap
## 1 Sovereign country 19289.97 885806 69.96000 8222.254
## 2 Sovereign country 932745.79 52234869 64.16300 2402.099
## 3 Indeterminate 96270.60 NA NA NA
## 4 Sovereign country 10036042.98 35535348 81.95305 43079.143
## 5 Country 9510743.74 318622525 78.84146 51921.985
## 6 Sovereign country 2729810.51 17288285 71.62000 23587.338
## coffee_production_2016 coffee_production_2017
## 1 NA NA
## 2 81 66
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
The result show that only Tanzania produce coffee in the first 6 countries in data.
Arrange the value of specific column.
arrange(DATA, col)
It means that using col
to arrange data by ascending value. Note that the data class of col
should be numeric or factor.
⌾ Arrange data by ascending order
arrange the data by population with ascending value
# arrange data by population
world_arr1=arrange(world, pop)
# check out the first 6 rows
head(world_arr1[, c("name_long","continent","pop")])
## name_long continent pop
## 1 Greenland North America 56295
## 2 Vanuatu Oceania 258850
## 3 New Caledonia Oceania 268050
## 4 Iceland Europe 327386
## 5 Belize North America 351694
## 6 Bahamas North America 382169
⌾ Arrange data by descending order
The order in arrange()
is ascending by default. What if we want the value descending?
Simply use function desc()
in arrange()
to do so.
arrange the data by population with descending value
# arrange data by population
world_arr2=arrange(world, desc(pop))
# check out the first 6 rows
head(world_arr2[, c("name_long","continent","pop")])
## name_long continent pop
## 1 China Asia 1364270000
## 2 India Asia 1293859294
## 3 United States North America 318622525
## 4 Indonesia Asia 255131116
## 5 Brazil South America 204213133
## 6 Pakistan Asia 185546257
Rename the column name of the data.
rename(DATA, var_new=var_old)
The function above would replace the column name var_old
with var_new
in the data.
⌾ Rename by rename()
rename name_long
to country
in data world
# rename the column
world_ren=rename(world, country=name_long)
# check out the first 6 rows
head(world_ren[, c("country","continent")])
## country continent
## 1 Fiji Oceania
## 2 Tanzania Africa
## 3 Western Sahara Africa
## 4 Canada North America
## 5 United States North America
## 6 Kazakhstan Asia
Remove the duplicated rows in data frame.
distinct(DATA)
⌾ Distinct by distinct()
Here we first use function select()
to choose the columns we want. We can then use duplicated()
(see Section 1.4 ⌾ Check out the duplicated rows) to test if it exists duplicated rows. Last, we use distinct()
to remove the repeated ones.
# select data only three columns: continent, region_un, and subregion
world_sub=select(world, continent, region_un, subregion)
# check out the first 6 rows
head(world_sub)
## continent region_un subregion
## 1 Oceania Oceania Melanesia
## 2 Africa Africa Eastern Africa
## 3 Africa Africa Northern Africa
## 4 North America Americas Northern America
## 5 North America Americas Northern America
## 6 Asia Asia Central Asia
# check if there is duplicated rows in world_sub (show first 10 results)
duplicated(world_sub)[1:10]
## [1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE
# world_sub has duplicated rows, using distinct to retain the unique one
world_sub_dis=distinct(world_sub)
# check out the first 6 rows
head(world_sub_dis)
## continent region_un subregion
## 1 Oceania Oceania Melanesia
## 2 Africa Africa Eastern Africa
## 3 Africa Africa Northern Africa
## 4 North America Americas Northern America
## 5 Asia Asia Central Asia
## 6 Asia Asia South-Eastern Asia
Now, we know the basic syntax of R programming, and are familiar with package dplyr
. It is enough to do some big data analysis!!
In the class in 12/10 (Sat.), we need to dig into the GIS operation in R. Before the class, please install the following packages in advance. Just run the code below!
# sf (for GIS)
install.packages("sf")
# ggplot2 (for plot)
install.packages("ggplot2")
# spData (for GIS data)
install.packages("spData")
# TWspdata (for GIS data of Taiwan)
install.packages("devtools")
devtools::install_github("ChiaJung-Yeh/TWspdata")
# TDX (for connecting to transport data)
devtools::install_github("ChiaJung-Yeh/NYCU_TDX")
If you have any questions regarding the installation of R or the basic syntax of R programming, please feel free to contact TA (葉家榮).
E-mail: robert1328.mg10@nycu.edu.tw
See you soon!