8 R Fundamentals
A few advantages of R:
- Free and open source, compared to some other tools like Excel and SPSS.
- Optimized with vectorization.
8.1 Hello world for R
print('Hello world!')
#> [1] "Hello world!"
8.2 Essential concepts
- In R, assignment is `<-`, not `=`. `=` actually works, but it may cause confusion, so it is always recommended to use `<-`. The RStudio keybinding for `<-` is `alt+-`.
- `.` is NOT a special character in R, and can be used in variable names. So `is.na()` simply means a function called `is.na`. It is not a function `na` in a package `is` as in Python.
- In R, a block is defined by `{}`. Indentation is not that important.
- R has a better package management system than Python, and therefore in most cases you don’t need a virtual environment for R.
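The points above can be tried out in a few lines. This is only a minimal sketch; the names `x`, `y`, `my.value` and `msg` are made up for illustration.

```r
# Assignment with <- (recommended) and with = (works, but discouraged)
x <- 3
y = 4

# A dot is an ordinary character in a name, not an attribute or method access
my.value <- x + y

# Blocks are delimited by braces; the indentation is only for readability
if (is.na(my.value)) {
  msg <- "missing"
} else {
  msg <- "not missing"
}

print(my.value)  # 7
print(msg)       # "not missing"
```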
8.2.1 R Markdown / Quarto
The counterpart of a Jupyter notebook in R is a `.rmd`/`.qmd` file. Similar to a notebook, an R Markdown / Quarto file contains so-called code blocks; running the code inside them produces a document that combines text, code and code output.
In the following two sections about R, you are supposed to submit `.rmd`/`.qmd` files.
Quarto is an extension/continuation of R Markdown. Most R Markdown files can be directly translated to Quarto files without many modifications. The main difference between R Markdown and Quarto is that Quarto has better support for other languages such as Python and Julia. You may go to its homepage for more details.
This note is produced by Quarto.
The most important part of R Markdown / Quarto is the code block, that is
```{r}
print('Hello world!')
```
In Quarto, you may also write
```{python}
print('Hello world!')
```
There are many options to adjust how the code blocks are executed. You don’t need to worry about them right now. For now, just try to write your report together with code blocks.
8.3 Data structures
The main references here are [1] and [2].
8.3.1 Vectors
A vector is one of the basic data structures in R. It is created by the `c()` function. Sometimes it is called an atomic vector. It can store data of any basic type, but all elements of one vector share the same type. R recognizes six basic types: doubles, integers, characters, logicals, complex and raw.
The data type inside a vector can be checked by the `typeof()` function.
die <- c(1, 2, 3, 4, 5, 6)
typeof(die)
#> [1] "double"
For consecutive numbers, an easier way to create a vector is to use `:`.
die <- 1:6
Note that vector index starts from 1 in R, while list index starts from 0 in Python.
die[1]
#> [1] 1
When slicing with vectors, don’t forget to use `c()`.
die[c(2, 3)]
#> [1] 2 3
die[2:3]
#> [1] 2 3
You may use the `length()` function to get its length.
length(die)
#> [1] 6
8.3.2 Attributes
R objects may have attributes. Attributes won’t be shown by default when you show the object. You may find the attributes of an R object by calling the `attributes()` function.
The following example shows that the vector `die` defined in Section 8.3.1 doesn’t have attributes.
attributes(die)
#> NULL
Attributes can be read and written using the `attr()` function. See the following example.
Example 8.1
attr(die, 'date') <- '2022-01-01'
die
#> [1] 1 2 3 4 5 6
#> attr(,"date")
#> [1] "2022-01-01"
attr(die, 'date') <- NULL
die
#> [1] 1 2 3 4 5 6
You may think of attributes as metadata attached to an R object. They are used to tell some useful information about the object. Some functions will interact with certain attributes. R itself treats the attributes `class`, `comment`, `dim`, `dimnames`, `names`, `row.names` and `tsp` specially. We will only talk about `class` and `names` here. `dim` will be discussed in the next section. Others will be discussed when we use them.
- `class`: This is different from the class in Python. `class` in R is an attribute which talks about the class of an object. If the attribute `class` is not assigned to an object, the object will have an implicit class: `matrix`, `array`, `function`, `numeric` or the result of `typeof`.
`attr(x, 'class')` will show the “external” class of an object. You may also use `class(x)` to read and write the attribute `class`. If the `class` is not assigned, `class(x)` will show the implicit class, while `attr(x, 'class')` will show `NULL`.
Example 8.2
attr(die, 'class')
#> NULL
class(die)
#> [1] "integer"
class(die) <- 'a die'
attr(die, 'class')
#> [1] "a die"
class(die)
#> [1] "a die"
- `names`: This attribute is used to name each element in a vector. After the names are assigned, they won’t be displayed below the data like other attributes. They will be displayed above the data with correct alignment. Similar to `class`, you may use `names()` to read and write the attribute.
Example 8.3
names(die) <- c('one', 'two', 'three', 'four', 'five', 'six')
die
#> one two three four five six
#> 1 2 3 4 5 6
attributes(die)
#> $names
#> [1] "one" "two" "three" "four" "five" "six"
names(die)
#> [1] "one" "two" "three" "four" "five" "six"
is.vector(die)
#> [1] TRUE
When you store different types of data in a single vector in R, R will convert them into a single type. The default rules are
- if there are only logicals and numbers, logicals will be converted to numbers by `TRUE->1` and `FALSE->0`.
- if characters are present, all values are converted to characters as they are written.
c(1, TRUE)
#> [1] 1 1
c('1', 1, TRUE)
#> [1] "1" "1" "TRUE"
We can apply regular operators to vectors. By default, the operators are applied element-wise.
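A small sketch of element-wise operations, using two made-up vectors `v` and `w`:

```r
v <- c(1, 2, 3)
w <- c(10, 20, 30)

s <- v + w      # 11 22 33: elements are added position by position
p <- v * w      # 10 40 90: this is NOT a dot product
d <- v * 2      # 2 4 6: the single number is applied to every element
g <- v > 1      # FALSE TRUE TRUE: comparisons are element-wise too

print(s)
print(p)
print(d)
print(g)
```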
8.3.3 Matrices and arrays
m <- matrix(c(1,2,3,4,5,6), nrow=2)
m[1, ]
#> [1] 1 3 5
A matrix has a `dim` attribute.
dim(m)
#> [1] 2 3
Note that by assigning and removing the `dim` attribute, you may change the object between vectors and matrices.
Example 8.4
m
#> [,1] [,2] [,3]
#> [1,] 1 3 5
#> [2,] 2 4 6
is.matrix(m)
#> [1] TRUE
is.vector(m)
#> [1] FALSE
dim(m)
#> [1] 2 3
dim(m) <- NULL
m
#> [1] 1 2 3 4 5 6
is.matrix(m)
#> [1] FALSE
is.vector(m)
#> [1] TRUE
The `dim` attribute can also be a vector with more than two entries. In this case, the object becomes an array.
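As a hedged illustration, the following sketch turns a vector of 24 numbers into a 2 x 3 x 4 array by assigning a `dim` of length three. The names `a` and `b` are chosen only for this example.

```r
a <- 1:24
dim(a) <- c(2, 3, 4)   # 2 rows, 3 columns, 4 "layers"

is.matrix(a)           # FALSE: a matrix has exactly two dimensions
is.array(a)            # TRUE
dim(a)                 # 2 3 4
a[2, 3, 4]             # 24, the last element

# array() builds the same object in one step
b <- array(1:24, dim = c(2, 3, 4))
all(a == b)            # TRUE
```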
8.3.4 Factors
A factor is a special vector. It is a way to handle categorical data. The idea is to limit the possible values. In a factor, all possible values are called levels, which are stored as an attribute.
Example 8.5 We would like to talk about all months. We first define a vector of the valid levels:
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
Then we could start to transform some month vector into factors, by the function `factor()`.
x1 <- c("Dec", "Apr", "Jan", "Mar")
y1 <- factor(x1, levels=month_levels)
sort(x1)
#> [1] "Apr" "Dec" "Jan" "Mar"
sort(y1)
#> [1] Jan Mar Apr Dec
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Note that sorting `y1` is based on the levels.
x2 <- c("Dec", "Apr", "Jam", "Mar")
y2 <- factor(x2, levels=month_levels)
y2
#> [1] Dec Apr <NA> Mar
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Note that `y2` contains an `NA` value since there is an entry in `x2` that is not valid.
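To see that the levels really are stored as an attribute, here is a small sketch with a made-up factor `sizes`:

```r
# Hypothetical data: three observations, three allowed levels
sizes <- factor(c("low", "high", "low"), levels = c("low", "mid", "high"))

lv <- levels(sizes)          # "low" "mid" "high"
codes <- as.integer(sizes)   # 1 3 1: each value is stored as the position of its level
attr_names <- names(attributes(sizes))   # includes "levels" and "class"

print(lv)
print(codes)
print(attr_names)
```

Internally the factor is an integer vector of level positions, which is why sorting follows the order of the levels and not the alphabet.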
8.3.5 Lists
A list is very similar to a vector. The main difference is that a vector can only store values, while a list can store objects. The most typical example of an object is another vector. Please see the following example.
Example 8.6
c(1:2, 3:4)
#> [1] 1 2 3 4
list(1:2, 3:4)
#> [[1]]
#> [1] 1 2
#>
#> [[2]]
#> [1] 3 4
The attributes of an object are stored in a list.
m <- matrix(c(1,2,3,4,5,6), nrow=2)
a <- attributes(m)
class(a)
#> [1] "list"
8.3.6 data.frame
A data.frame is a list with the `class` attribute `data.frame`, together with some restrictions on the shape of each column. You may think about it in terms of tables.
df <- data.frame(face = c("ace", "two", "six"),
                 suit = c("clubs", "clubs", "clubs"),
                 value = c(1, 2, 3))
df
#> face suit value
#> 1 ace clubs 1
#> 2 two clubs 2
#> 3 six clubs 3
- A data.frame groups vectors. Each vector represents a column.
- Different columns can contain different types of data, but every cell within one column must be the same type of data.
- `data.frame()` can be used to create a data.frame.
- The type of a data.frame is a list. Similar to a matrix compared to a vector, a `data.frame` is a `list` with `class` `data.frame`, as well as a few other attributes.
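The last point can be checked directly. This sketch rebuilds the small `df` from above so that it runs on its own:

```r
df <- data.frame(face = c("ace", "two", "six"),
                 suit = c("clubs", "clubs", "clubs"),
                 value = c(1, 2, 3))

typeof(df)    # "list": the underlying storage is a list of columns
class(df)     # "data.frame"
is.list(df)   # TRUE
length(df)    # 3: the length of the list is the number of columns

# The attributes that turn the list into a table
attr_names <- sort(names(attributes(df)))
print(attr_names)   # "class" "names" "row.names"
```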
8.3.7 Examples
Example 8.7 Consider a data.frame representing a deck of cards. Here we use `expand.grid()` to perform the Cartesian product.
suit <- c('spades', 'hearts', 'clubs', 'diamonds')
face <- 1:13
deck <- expand.grid(suit, face)
head(deck)
#> Var1 Var2
#> 1 spades 1
#> 2 hearts 1
#> 3 clubs 1
#> 4 diamonds 1
#> 5 spades 2
#> 6 hearts 2
We may assign names to change the column names.
names(deck) <- c('suit', 'face')
head(deck)
#> suit face
#> 1 spades 1
#> 2 hearts 1
#> 3 clubs 1
#> 4 diamonds 1
#> 5 spades 2
#> 6 hearts 2
Note that since `suit` and `face` are two vectors, `merge()` can also do the Cartesian product. `expand.grid()` is good for both vectors and data.frame.
deck <- merge(suit, face)
head(deck)
#> x y
#> 1 spades 1
#> 2 hearts 1
#> 3 clubs 1
#> 4 diamonds 1
#> 5 spades 2
#> 6 hearts 2
8.3.8 Load data
8.3.8.1 Built-in datasets
R has many built-in datasets. You may use `data()` to see all of them. Here are a few common datasets.
`mtcars`: Motor Trend Car Road Tests. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
data(mtcars)
`iris`: The iris data set gives the measurements in centimeters of the variables sepal length, sepal width, petal length and petal width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
data(iris)
`ToothGrowth`: The ToothGrowth data set contains the result from an experiment studying the effect of vitamin C on tooth growth in 60 guinea pigs.
data(ToothGrowth)
`PlantGrowth`: Results obtained from an experiment to compare yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions.
data(PlantGrowth)
`USArrests`: This data set contains statistics about violent crime rates by US state.
data(USArrests)
8.3.8.2 Read from files
The built-in `read.csv()` function can directly read a `.csv` file into a data.frame.
Example 8.8 We use the file `yob1880.txt` from Chapter 5 here. Put the file in the working folder and run the following code.
df <- read.csv('yob1880.txt', header = FALSE)
head(df)
We may also manually assign column names.
names(df) <- c('name', 'sex', 'counts')
head(df)
#> name sex counts
#> 1 Mary F 7065
#> 2 Anna F 2604
#> 3 Emma F 2003
#> 4 Elizabeth F 1939
#> 5 Minnie F 1746
#> 6 Margaret F 1578
Saving data is straightforward.
write.csv(df, file='df.csv', row.names=FALSE)
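If `yob1880.txt` is not at hand, the write/read round trip can still be tried with a temporary file. This is a sketch with a made-up two-row data.frame; `small`, `path` and `back` are illustrative names.

```r
# A tiny data.frame to write out (illustrative data only)
small <- data.frame(a = c(1, 2), b = c('a', 'b'))

# tempfile() gives a path in the session's temporary folder
path <- tempfile(fileext = '.csv')
write.csv(small, file = path, row.names = FALSE)

# The header line written by write.csv() is used for the column names
back <- read.csv(path)
print(names(back))
print(nrow(back))
```

Without `row.names = FALSE`, the row names would be written as an extra first column and come back as a column when the file is read again.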
8.3.9 Flow control
8.3.9.1 for loop
Example 8.9
for (x in 1:10){
  print(x)
}
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
#> [1] 5
#> [1] 6
#> [1] 7
#> [1] 8
#> [1] 9
#> [1] 10
8.3.9.2 if-else
Example 8.10
a <- 200
b <- 33

if (b > a) {
  print("b is greater than a")
} else if (a == b) {
  print("a and b are equal")
} else {
  print("a is greater than b")
}
#> [1] "a is greater than b"
8.3.9.3 Functions
The standard format to define a function is `my_function <- function(input) {}`, where the function name is on the left side of `<-`, the input arguments are in the `()`, and the function body is in `{}`. The output of the last line of the function body is the return value of the function.
Example 8.11
myfunction <- function() {
  die <- 1:6
  sum(die)
}
myfunction()
#> [1] 21
If you just type the function name without `()`, R will return the definition of the function.
myfunction
#> function() {
#> die <- 1:6
#> sum(die)
#> }
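The example above takes no input. As a sketch of a function with arguments, here is a hypothetical `add_up()` with one required argument and one argument that has a default value:

```r
# x is required; times has a default value and may be omitted
add_up <- function(x, times = 1) {
  total <- sum(x)
  total * times   # the value of the last line is returned
}

r1 <- add_up(1:6)              # 21
r2 <- add_up(1:6, times = 2)   # 42
print(r1)
print(r2)
```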
The function `sample()`: `sample` takes a sample of the specified size from the elements of `x`, either with or without replacement.
`sample(x, size, replace = FALSE, prob = NULL)`:
- `x`: either a vector of one or more elements from which to choose, or a positive integer.
- `size`: a non-negative integer giving the number of items to choose.
- `replace`: should sampling be with replacement?
- `prob`: a vector of probability weights for obtaining the elements of the vector being sampled.
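A short sketch of the arguments in use. The seed and the weights are arbitrary choices for this example, and the results are random, so no output is shown.

```r
set.seed(1)   # arbitrary seed, only to make the run repeatable
die <- 1:6

# Roll two dice: sampling WITH replacement, so the same face may appear twice
roll <- sample(die, size = 2, replace = TRUE)

# With size omitted, sample() returns all elements in random order
shuffled <- sample(die)

# A loaded die: face 6 is five times as likely as each other face
loaded <- sample(die, size = 100, replace = TRUE, prob = c(rep(0.1, 5), 0.5))

print(roll)
print(shuffled)
print(table(loaded))
```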
8.4 R notations
8.4.1 Selecting Values
Let us start from a data.frame `df`. The basic usage is `df[ , ]`, where the first index subsets the rows and the second index subsets the columns. There are six ways to write indexes.
- Positive integers: the regular way. `df[i, j]` means the data in the ith row and jth column.
  - If both `i` and `j` are vectors, a data.frame will be returned.
  - If the selection covers only a single column, a vector will be returned. If you still want a data.frame, you may add the option `drop=FALSE`.
  - If only one index is provided, it refers to the column.
Example 8.12 We consider the simplified version of a deck. The deck only contains face values from 1 to 5.
deck[1:2, 1:2]
#> Var1 Var2
#> 1 spades 1
#> 2 hearts 1
deck[1:2, 1]
#> [1] spades hearts
#> Levels: spades hearts clubs diamonds
deck[1:2, 1, drop=FALSE]
#> Var1
#> 1 spades
#> 2 hearts
deck[1]
#> Var1
#> 1 spades
#> 2 hearts
#> 3 clubs
#> 4 diamonds
#> 5 spades
#> 6 hearts
#> 7 clubs
#> 8 diamonds
#> 9 spades
#> 10 hearts
#> 11 clubs
#> 12 diamonds
#> 13 spades
#> 14 hearts
#> 15 clubs
#> 16 diamonds
#> 17 spades
#> 18 hearts
#> 19 clubs
#> 20 diamonds
- Negative integers: remove the related index. For example,
  - `deck[-1, 1:3]` means it wants all rows except row 1, and columns 1 to 3.
  - `deck[-(2:20), 1:2]` means it wants all rows except rows 2 to 20, and columns 1 to 2.
  - Negative indexes and positive indexes cannot be used together in the same index.
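A sketch of negative indexing on the 20-row simplified deck, rebuilt here so the block runs on its own. The `tryCatch()` call is only there to show that mixing signs is an error without stopping the script.

```r
deck <- expand.grid(c('spades', 'hearts', 'clubs', 'diamonds'), 1:5)

# Drop row 1: 19 rows remain
rest <- deck[-1, ]
print(nrow(rest))

# Drop rows 2 to 20: only row 1 remains
first <- deck[-(2:20), 1:2]
print(first)

# Mixing negative and positive integers in one index raises an error
mixed <- tryCatch((1:5)[c(-1, 2)], error = function(e) conditionMessage(e))
print(mixed)   # the error message as a string
```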
- Blank Spaces: want every value in the dimension.
deck[, 1]
#> [1] spades hearts clubs diamonds spades hearts clubs diamonds
#> [9] spades hearts clubs diamonds spades hearts clubs diamonds
#> [17] spades hearts clubs diamonds
#> Levels: spades hearts clubs diamonds
deck[1, ]
#> Var1 Var2
#> 1 spades 1
- Logical values: select the rows or columns according to the value. The logical vector should have exactly the same number of elements as the dimension it indexes.
rows <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
deck[rows,]
#> Var1 Var2
#> 1 spades 1
#> 3 clubs 1
#> 5 spades 2
#> 6 hearts 2
#> 8 diamonds 2
#> 10 hearts 3
#> 11 clubs 3
#> 13 spades 4
#> 15 clubs 4
#> 16 diamonds 4
#> 18 hearts 5
#> 20 diamonds 5
deck[1:2, c(TRUE, FALSE)]
#> [1] spades hearts
#> Levels: spades hearts clubs diamonds
- Names: select columns based on the `names` attribute.
deck[, 'Var2']
#> [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5
8.4.2 Dollar signs and double brackets
Lists and data.frames obey an optional second system of notation. You can extract values using the `$` syntax: the data.frame’s name and the column name separated by a `$` will select a column and return a vector (since the data in each column is actually a vector).
Example 8.13 Here is an example about data.frames.
deck[, 1]
#> [1] spades hearts clubs diamonds spades hearts clubs diamonds
#> [9] spades hearts clubs diamonds spades hearts clubs diamonds
#> [17] spades hearts clubs diamonds
#> Levels: spades hearts clubs diamonds
deck$Var1
#> [1] spades hearts clubs diamonds spades hearts clubs diamonds
#> [9] spades hearts clubs diamonds spades hearts clubs diamonds
#> [17] spades hearts clubs diamonds
#> Levels: spades hearts clubs diamonds
Note that if we select from the data.frame using a single index, we will get a data.frame.
deck[1]
#> Var1
#> 1 spades
#> 2 hearts
#> 3 clubs
#> 4 diamonds
#> 5 spades
#> 6 hearts
#> 7 clubs
#> 8 diamonds
#> 9 spades
#> 10 hearts
#> 11 clubs
#> 12 diamonds
#> 13 spades
#> 14 hearts
#> 15 clubs
#> 16 diamonds
#> 17 spades
#> 18 hearts
#> 19 clubs
#> 20 diamonds
class(deck[1])
#> [1] "data.frame"
Example 8.14 Here is an example about lists.
lst <- list(numbers = c(1, 2), logical = TRUE, strings = c("a", "b", "c"))
lst$numbers
#> [1] 1 2
Note that if we select from the list using a single index, we will get a list.
lst[1]
#> $numbers
#> [1] 1 2
class(lst[1])
#> [1] "list"
Please think through these two examples and figure out the similarity between them.
Understanding the return value type is very important. Many R functions work with vectors, but they don’t work with lists. So using the correct way to get values is very important.
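The double brackets named in the section title behave like `$`: they return the element itself instead of a list wrapped around it. A minimal sketch with the same `lst` as in Example 8.14:

```r
lst <- list(numbers = c(1, 2), logical = TRUE, strings = c("a", "b", "c"))

class(lst[1])      # "list": single brackets keep the list wrapper
class(lst[[1]])    # "numeric": double brackets return the element itself

# Double brackets also accept a name, which is equivalent to using $
same <- identical(lst[["numbers"]], lst$numbers)   # TRUE

# sum() works on the extracted vector; sum(lst[1]) would raise an error
total <- sum(lst[[1]])   # 3
print(same)
print(total)
```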
There is a command called `attach()` which lets you access `deck$face` by just typing `face`. It is highly recommended NOT to do this. It is much better to make everything explicit, especially since typing is much easier when using an IDE.
8.5 Modifying values
8.5.1 Changing values in place
You can use R’s notation system to modify values within an R object.
- In general, when working with vectors, the two vectors should have the same length.
- If the lengths are different, R will repeat the shorter one to make it match the longer one. This is called the vector recycling rule. R will throw a warning if the longer length is not a multiple of the shorter one.
Example 8.15
1:4 + 1:2
#> [1] 2 4 4 6
1:4 + 1:3
#> Warning in 1:4 + 1:3: longer object length is not a multiple of shorter object
#> length
#> [1] 2 4 6 5
- We may create values that do not yet exist in the object. R will expand the object to accommodate the new values.
Example 8.16
vec <- 1:6
vec
#> [1] 1 2 3 4 5 6
vec[7] <- 0
vec
#> [1] 1 2 3 4 5 6 0
Example 8.17
df <- data.frame(a=c(1,2), b=c('a', 'b'))
df
#> a b
#> 1 1 a
#> 2 2 b
df$c <- 3:4
df
#> a b c
#> 1 1 a 3
#> 2 2 b 4
8.5.2 Logical subsetting
We could compare two vectors element-wise, and the result is a logical vector. Then we could use this result to subset the vector / data.frame.
Example 8.18
suit <- c('spades', 'hearts', 'clubs', 'diamonds')
face <- 1:5
deck <- expand.grid(suit, face)
deck$Var1 == 'hearts'
#> [1] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
#> [13] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
deck$Var2[deck$Var1 == 'hearts']
#> [1] 1 2 3 4 5
deck[deck$Var1 == 'hearts',]
#> Var1 Var2
#> 2 hearts 1
#> 6 hearts 2
#> 10 hearts 3
#> 14 hearts 4
#> 18 hearts 5
We could directly assign values to the subset. Note that the following assignment creates a new column, filled with `NA` values wherever the condition is not met.
deck$Var3[deck$Var1 == 'hearts'] <- 1
deck
#> Var1 Var2 Var3
#> 1 spades 1 NA
#> 2 hearts 1 1
#> 3 clubs 1 NA
#> 4 diamonds 1 NA
#> 5 spades 2 NA
#> 6 hearts 2 1
#> 7 clubs 2 NA
#> 8 diamonds 2 NA
#> 9 spades 3 NA
#> 10 hearts 3 1
#> 11 clubs 3 NA
#> 12 diamonds 3 NA
#> 13 spades 4 NA
#> 14 hearts 4 1
#> 15 clubs 4 NA
#> 16 diamonds 4 NA
#> 17 spades 5 NA
#> 18 hearts 5 1
#> 19 clubs 5 NA
#> 20 diamonds 5 NA
Other than the regular logical operators, R provides a special one: `%in%`.
`x %in% y`: Is `x` in the vector `y`?
If `x` is a vector, the output is a vector with the same length as `x`, telling whether each element of `x` is in `y` or not.
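A short sketch of `%in%` with made-up values:

```r
# One answer per element of the left-hand side
hits <- c(1, 5, 10) %in% c(1, 2, 3, 4, 5)
print(hits)   # TRUE TRUE FALSE

# With a single value on the left, the result is a single TRUE or FALSE
one <- 'hearts' %in% c('spades', 'hearts')
print(one)    # TRUE
```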
Other than the regular Boolean operators, R provides two special functions: `any` and `all`.
- `any(cond1, cond2, ...)`: Are any of these conditions true?
- `all(cond1, cond2, ...)`: Are all of these conditions true?
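Both functions reduce a logical vector to a single value, which is what `if` expects. A sketch with a made-up vector `x`:

```r
x <- c(2, 4, 6)

a1 <- any(x > 5)         # TRUE: 6 is greater than 5
a2 <- all(x > 5)         # FALSE: 2 and 4 are not
a3 <- all(x %% 2 == 0)   # TRUE: every element is even

print(c(a1, a2, a3))
```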
8.5.3 Missing values NA
In R, missing values are `NA`, and you can directly work with `NA`. Almost any computation involving `NA` will return `NA`.
`na.rm`: Many R functions come with the optional argument `na.rm`. If you set it to `TRUE`, the function will ignore `NA` when evaluating.
Example 8.19
mean(c(NA, 1:50))
#> [1] NA
mean(c(NA, 1:50), na.rm=TRUE)
#> [1] 25.5
`is.na()`: This is a function testing whether each element of an object is `NA`.
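A sketch of why `is.na()` is needed: comparing with `==` does not work, because a comparison involving `NA` is itself `NA`. The vector `v` is made up for this example.

```r
v <- c(1, NA, 3)

cmp <- v == NA          # NA NA NA: comparing with NA never gives TRUE or FALSE
miss <- is.na(v)        # FALSE TRUE FALSE

n_missing <- sum(miss)  # 1: TRUE counts as 1
clean <- v[!miss]       # 1 3: keep only the non-missing values

print(cmp)
print(miss)
print(n_missing)
print(clean)
```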
8.6 Exercises
Exercise 8.1 Start an R Markdown / Quarto file. In the first section write an R code block to print `Hello world!`.
Exercise 8.2 Which of these are character strings and which are numbers? `1`, `"1"`, `"one"`.
Exercise 8.3 Create an atomic vector that stores just the face names of the cards: the ace of spades, king of spades, queen of spades, jack of spades, and ten of spades. Which type of vector will you use to save the names?
Hint: The face name of the ace of spades would be `ace` and `spades` is the suit.
Exercise 8.4 Create the following matrix, which stores the name and suit of every card in a royal flush.
#> [,1] [,2]
#> [1,] "ace" "spades"
#> [2,] "king" "spades"
#> [3,] "queen" "spades"
#> [4,] "jack" "spades"
#> [5,] "ten" "spades"
Exercise 8.5 Many card games assign a numerical value to each card. For example, in blackjack, each face card is worth 10 points, each number card is worth between 2 and 10 points, and each ace is worth 1 or 11 points, depending on the final score.
Make a virtual playing card by combining “ace” “heart” and 1 into a vector. What type of atomic vector will result? Check if you are right, and explain your reason.
Exercise 8.6 Use a list to store a single playing card, like the ace of hearts, which has a point value of one. The list should save the face of the card, the suit, and the point value in separate elements.
Exercise 8.7 Consider the following data.frame.
suit <- c('spades', 'hearts', 'clubs', 'diamonds')
face <- 1:5
deck <- expand.grid(suit, face)
Please write some code to count how many rows have `Var1` equal to `hearts`.
Exercise 8.8 Convert the following sentences into tests written with R code.
- `w <- c(-1, 0, 1)`. Is `w` positive?
- `x <- c(5, 15)`. Is `x` greater than `10` and less than `20`?
- `y <- 'February'`. Is object `y` the word `February`?
- `z <- c("Monday", "Tuesday", "Friday")`. Is every value in `z` a day of the week?
Exercise 8.9 Please write a function to shuffle the rows of a data.frame. You may use the following data.frame `deck` for testing.
suit <- c('spades', 'hearts', 'clubs', 'diamonds')
face <- 1:13
deck <- expand.grid(suit, face)