8 R Fundamentals
A few advantages of R:
- Free and open source, compared with some other tools like Excel and SPSS.
- Optimized with vectorization.
8.1 Hello world for R
print('Hello world!')
#> [1] "Hello world!"
8.2 Essential concepts
- In R, the assignment operator is <-, not =. = actually works, but it may cause confusion, so it is always recommended to use <-. The RStudio keybinding for <- is alt+-.
- . is NOT a special character in R, and can be used in variable names. So is.na() simply means a function called is.na. It is not a function na in a package is as in Python.
- In R, a block is defined by {}. Indentation is not that important.
- R has a better package management system than Python, and therefore in most cases you don’t need a virtual environment for R.
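As a small illustration of the points above, here is a minimal sketch. The variable names (x, y, my.value, missing_flag) are made up for this example.

```r
# Assignment with <- (recommended) and with = (works, but discouraged)
x <- 3
y = 4

# A dot is an ordinary character in a name
my.value <- x + y

# is.na is a single function name, not a function na inside a package is
missing_flag <- is.na(my.value)

# A block is delimited by {}; the indentation is only for readability
if (my.value > 5) {
  print("my.value is greater than 5")
}
```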
8.2.1 R Markdown / Quarto
The counterpart of the Jupyter notebook in R is the .rmd/.qmd file. Similar to a notebook, an R Markdown / Quarto file contains so-called code blocks. Running the code inside them produces a document that combines text, code and code output.
In the following two sections about R, you are supposed to submit an .rmd/.qmd file.
Quarto is an extension/continuation of R Markdown. Most R Markdown files can be directly translated to Quarto files without many modifications. The main difference between R Markdown and Quarto is that Quarto has better support for other languages such as Python and Julia. You may go to its homepage for more details.
This note is produced by Quarto.
The most important part of R Markdown / Quarto is the code block, that is
```{r}
print('Hello world!')
```
In Quarto, you may also write
```{python}
print('Hello world!')
```
There are many options to adjust how the code blocks are executed. You don’t need to worry about them right now. Currently just try to write your report together with code blocks.
8.3 Data structures
The main references here are [1] and [2].
8.3.1 Vectors
A vector is one of the basic data structures in R. It is created by the c() function. Sometimes it is called an atomic vector. It can hold data of any of the basic types, but all elements of one vector must have the same type. R recognizes six basic types: doubles, integers, characters, logicals, complex and raw.
The data type inside a vector can be checked by the typeof() function.
die <- c(1, 2, 3, 4, 5, 6)
typeof(die)
#> [1] "double"
For consecutive numbers, an easier way to create a vector is to use :.
die <- 1:6
Note that vector indices start from 1 in R, while list indices start from 0 in Python.
die[1]
#> [1] 1
When slicing with vectors, don’t forget to use c().
die[c(2, 3)]
#> [1] 2 3
die[2:3]
#> [1] 2 3
You may use the length() function to get its length.
length(die)
#> [1] 6
8.3.2 Attributes
R objects may have attributes. Attributes won’t be shown by default when you show the object. You may find the attributes of an R object by calling the attributes() function.
The following example shows that the vector die defined in Section 8.3.1 doesn’t have attributes.
attributes(die)
#> NULL
Attributes can be read and written using the attr() function. See the following example.
Example 8.1
attr(die, 'date') <- '2022-01-01'
die
#> [1] 1 2 3 4 5 6
#> attr(,"date")
#> [1] "2022-01-01"
attr(die, 'date') <- NULL
die
#> [1] 1 2 3 4 5 6
You may think of attributes as metadata attached to an R object. They carry useful information about the object, and some functions interact with certain attributes. R itself treats the attributes class, comment, dim, dimnames, names, row.names and tsp specially. We will only talk about class and names here. dim will be discussed in the next section. Others will be discussed when we use them.
class: This is different from the class in Python. class in R is an attribute which describes the class of an object. If the attribute class is not assigned to an object, the object will have an implicit class: matrix, array, function, numeric or the result of typeof.
attr(x, 'class') will show the “external” (explicitly assigned) class of an object. You may also use class(x) to read and write the attribute class. If the class is not assigned, class(x) will show the implicit class, while attr(x, 'class') will show NULL.
Example 8.2
attr(die, 'class')
#> NULL
class(die)
#> [1] "integer"
class(die) <- 'a die'
attr(die, 'class')
#> [1] "a die"
class(die)
#> [1] "a die"
names: This attribute is used to name each element in a vector. After the names are assigned, they won’t be displayed below the data like other attributes. Instead they are displayed above the data with correct alignment. Similar to class, you may use names() to read and write the attribute.
Example 8.3
names(die) <- c('one', 'two', 'three', 'four', 'five', 'six')
die
#> one two three four five six
#> 1 2 3 4 5 6
attributes(die)
#> $names
#> [1] "one" "two" "three" "four" "five" "six"
names(die)
#> [1] "one" "two" "three" "four" "five" "six"
is.vector(die)
#> [1] TRUE
When you store different types of data into a single vector in R, R will convert them into a single type. The default rules are
- if there are only logicals and numbers, logicals will be converted to numbers by TRUE->1 and FALSE->0.
- if characters are present, everything is converted to characters as written.
c(1, TRUE)
#> [1] 1 1
c('1', 1, TRUE)
#> [1] "1" "1" "TRUE"
We can apply regular operators to vectors. The default way is to apply the operators element-wise.
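The element-wise behaviour can be sketched as follows. This is a minimal example that redefines die as 1:6. The result names (die_plus_one, die_squared, is_big, n_big) are chosen only for illustration.

```r
die <- 1:6

# Arithmetic is applied to each element separately
die_plus_one <- die + 1    # 2 3 4 5 6 7
die_squared <- die * die   # 1 4 9 16 25 36

# Comparisons are element-wise too and return a logical vector
is_big <- die > 3          # FALSE FALSE FALSE TRUE TRUE TRUE

# Logicals are converted to numbers in arithmetic, so sum() counts the TRUEs
n_big <- sum(is_big)       # 3
```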
8.3.3 matrices and arrays
m <- matrix(c(1,2,3,4,5,6), nrow=2)
m[1, ]
#> [1] 1 3 5
A matrix has a dim attribute.
dim(m)
#> [1] 2 3
Note that by assigning and removing the dim attribute, you may change the object between vectors and matrices.
Example 8.4
m
#> [,1] [,2] [,3]
#> [1,] 1 3 5
#> [2,] 2 4 6
is.matrix(m)
#> [1] TRUE
is.vector(m)
#> [1] FALSE
dim(m)
#> [1] 2 3
dim(m) <- NULL
m
#> [1] 1 2 3 4 5 6
is.matrix(m)
#> [1] FALSE
is.vector(m)
#> [1] TRUE
The dim attribute can be a vector with more than two entries. In this case, the object becomes an array.
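A minimal sketch of this idea, using made-up objects v and a: setting a dim of length two gives a matrix, while a dim of length three gives a three-dimensional array.

```r
# Six values arranged as a 2 x 3 matrix by setting dim
v <- 1:6
dim(v) <- c(2, 3)
is.matrix(v)   # TRUE

# Twelve values with a dim of length three become a 2 x 3 x 2 array
a <- 1:12
dim(a) <- c(2, 3, 2)
is.array(a)    # TRUE
is.matrix(a)   # FALSE: a matrix has exactly two dimensions
a[1, 2, 2]     # row 1, column 2 of the second 2 x 3 slice, which is 9
```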
8.3.4 factors
A factor is a special vector. It is a way to handle categorical data. The idea is to limit the possible values. In a factor the possible values are called levels, and they are stored as an attribute.
Example 8.5 We would like to talk about all months. We first define a vector of the valid levels:
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
Then we can transform a vector of months into a factor with the function factor().
x1 <- c("Dec", "Apr", "Jan", "Mar")
y1 <- factor(x1, levels=month_levels)
sort(x1)
#> [1] "Apr" "Dec" "Jan" "Mar"
sort(y1)
#> [1] Jan Mar Apr Dec
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Note that sorting y1 is based on the levels.
x2 <- c("Dec", "Apr", "Jam", "Mar")
y2 <- factor(x2, levels=month_levels)
y2
#> [1] Dec Apr <NA> Mar
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Note that y2 contains an NA value since there is an entry in x2 ("Jam") that is not a valid level.
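To look inside a factor, one may inspect its levels and the integer codes stored underneath. The following is a minimal sketch that rebuilds the two factors from Example 8.5 so that it can run on its own.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y1 <- factor(c("Dec", "Apr", "Jan", "Mar"), levels = month_levels)

levels(y1)       # the twelve allowed values, stored as an attribute
as.integer(y1)   # position of each value among the levels: 12 4 1 3
table(y1)        # counts per level, including levels that never occur

# Values outside the levels become NA and can be located with is.na()
y2 <- factor(c("Dec", "Apr", "Jam", "Mar"), levels = month_levels)
is.na(y2)        # FALSE FALSE TRUE FALSE
```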
8.3.5 Lists
A list is very similar to a vector. The main difference is that an atomic vector can only store individual values of a single type, while a list can store arbitrary objects, possibly of different types. The most typical example of such an object is another vector. Please see the following example.
Example 8.6
c(1:2, 3:4)
#> [1] 1 2 3 4
list(1:2, 3:4)
#> [[1]]
#> [1] 1 2
#>
#> [[2]]
#> [1] 3 4
The attributes of an object are stored in a list.
m <- matrix(c(1,2,3,4,5,6), nrow=2)
a <- attributes(m)
class(a)
#> [1] "list"
8.3.6 data.frame
A data.frame is a list with the class attribute data.frame, together with some restrictions on the shape of its columns (all columns must have the same length). You may think about it in terms of tables.
df <- data.frame(face = c("ace", "two", "six"),
suit = c("clubs", "clubs", "clubs"),
value = c(1, 2, 3))
df
#> face suit value
#> 1 ace clubs 1
#> 2 two clubs 2
#> 3 six clubs 3
- A data.frame groups vectors. Each vector represents a column.
- Different columns can contain different types of data, but every cell within one column must be the same type of data.
- data.frame() can be used to create a data.frame.
- The type of a data.frame is a list. Just as a matrix is a vector with a dim attribute, a data.frame is a list with class data.frame, as well as a few other attributes.
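These claims can be checked directly. The sketch below rebuilds the small df from above. The argument stringsAsFactors = FALSE is added here only so that the text columns stay character vectors regardless of the R version.

```r
df <- data.frame(face = c("ace", "two", "six"),
                 suit = c("clubs", "clubs", "clubs"),
                 value = c(1, 2, 3),
                 stringsAsFactors = FALSE)

typeof(df)              # "list"
class(df)               # "data.frame"
names(attributes(df))   # includes names, class and row.names
length(df)              # 3, the number of columns (list elements)
sapply(df, typeof)      # the type of each column
```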
8.3.7 Examples
Example 8.7 Consider a data.frame representing a deck of cards. Here we use expand.grid() to perform the Cartesian product.
suit <- c('spades', 'hearts', 'clubs', 'diamonds')
face <- 1:13
deck <- expand.grid(suit, face)
head(deck)
#> Var1 Var2
#> 1 spades 1
#> 2 hearts 1
#> 3 clubs 1
#> 4 diamonds 1
#> 5 spades 2
#> 6 hearts 2
We may assign names to change the column names.
names(deck) <- c('suit', 'face')
head(deck)
#> suit face
#> 1 spades 1
#> 2 hearts 1
#> 3 clubs 1
#> 4 diamonds 1
#> 5 spades 2
#> 6 hearts 2
Note that since suit and face are two vectors, merge() can also do the Cartesian product. expand.grid() is good for both vectors and data.frame.
deck <- merge(suit, face)
head(deck)
#> x y
#> 1 spades 1
#> 2 hearts 1
#> 3 clubs 1
#> 4 diamonds 1
#> 5 spades 2
#> 6 hearts 2
8.3.8 Load data
8.3.8.1 Built-in datasets
R has many built-in datasets. You may use data() to see all of them. Here are a few common datasets.
mtcars: Motor Trend Car Road Tests. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
data(mtcars)
iris: The iris data set gives the measurements in centimeters of the variables sepal length, sepal width, petal length and petal width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
data(iris)
ToothGrowth: The ToothGrowth data set contains the result from an experiment studying the effect of vitamin C on tooth growth in 60 guinea pigs.
data(ToothGrowth)
PlantGrowth: Results obtained from an experiment to compare yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions.
data(PlantGrowth)
USArrests: This data set contains statistics about violent crime rates by US state.
data(USArrests)
8.3.8.2 Read from files
The built-in read.csv() function can directly read a .csv file into a data.frame.
Example 8.8 We use the file yob1880.txt from Chapter 5 here. Put the file in the working folder and run the following code.
df <- read.csv('yob1880.txt', header = FALSE)
head(df)
We may also manually assign column names.
names(df) <- c('name', 'sex', 'counts')
head(df)
#> name sex counts
#> 1 Mary F 7065
#> 2 Anna F 2604
#> 3 Emma F 2003
#> 4 Elizabeth F 1939
#> 5 Minnie F 1746
#> 6 Margaret F 1578
Saving data is straightforward.
write.csv(df, file='df.csv', row.names=FALSE)
8.3.9 Flow control
8.3.9.1 for loop
Example 8.9
for (x in 1:10){
print(x)
}
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
#> [1] 5
#> [1] 6
#> [1] 7
#> [1] 8
#> [1] 9
#> [1] 10
8.3.9.2 if-else
Example 8.10
a <- 200
b <- 33
if (b > a) {
print("b is greater than a")
} else if (a == b) {
print("a and b are equal")
} else {
print("a is greater than b")
}
#> [1] "a is greater than b"
8.3.9.3 Functions
The standard format to define a function is my_function <- function(input) {} where the function name is on the left side of <-, the input arguments are in the (), and the function body is in {}. The value of the last expression evaluated in the function body is the return value of the function.
Example 8.11
myfunction <- function() {
die <- 1:6
sum(die)
}
myfunction()
#> [1] 21
If you just type the function name without (), R will return the definition of the function.
myfunction
#> function() {
#> die <- 1:6
#> sum(die)
#> }
The function sample(x, size, replace = FALSE, prob = NULL) takes a sample of the specified size from the elements of x, either with or without replacement.
- x: either a vector of one or more elements from which to choose, or a positive integer.
- size: a non-negative integer giving the number of items to choose.
- replace: should sampling be with replacement?
- prob: a vector of probability weights for obtaining the elements of the vector being sampled.
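The following sketch combines a function that takes an argument with sample(). The function name roll, its argument n_dice, and the seed value are assumptions made for this illustration. Because the draws are random, no particular output is shown.

```r
set.seed(1)  # arbitrary seed so that reruns give the same draws

# Roll n_dice fair dice and return the sum of the faces
roll <- function(n_dice = 2) {
  die <- 1:6
  dice <- sample(die, size = n_dice, replace = TRUE)
  sum(dice)
}

roll()              # two dice by default
roll(n_dice = 5)    # five dice

# With the default arguments, sample(x) returns a random permutation of x
shuffled <- sample(1:6)

# prob gives unequal weights: here a six is five times as likely as any other face
loaded <- sample(1:6, size = 10, replace = TRUE, prob = c(1, 1, 1, 1, 1, 5))
```

Setting replace = TRUE matters for dice: without it, the same face could not appear twice in one roll.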
8.4 R notations
8.4.1 Selecting Values
Let us start from a data.frame df. The basic usage is df[ , ], where the first index subsets the rows and the second index subsets the columns. There are several ways to write the indexes.
- Positive integers: the regular way. df[i, j] means the data in the ith row and jth column.
  - If both i and j are vectors, a data.frame will be returned.
  - If j selects a single column, the result is simplified to a vector. If you still want a data.frame, you may add the option drop=FALSE.
  - If only one index is provided, it refers to the column.
Example 8.12 We consider a simplified version of the deck, which only contains face values from 1 to 5. It is built with expand.grid() without renaming, so its columns are called Var1 and Var2.
deck[1:2, 1:2]
#> Var1 Var2
#> 1 spades 1
#> 2 hearts 1
deck[1:2, 1]
#> [1] spades hearts
#> Levels: spades hearts clubs diamonds
deck[1:2, 1, drop=FALSE]
#> Var1
#> 1 spades
#> 2 hearts
deck[1]
#> Var1
#> 1 spades
#> 2 hearts
#> 3 clubs
#> 4 diamonds
#> 5 spades
#> 6 hearts
#> 7 clubs
#> 8 diamonds
#> 9 spades
#> 10 hearts
#> 11 clubs
#> 12 diamonds
#> 13 spades
#> 14 hearts
#> 15 clubs
#> 16 diamonds
#> 17 spades
#> 18 hearts
#> 19 clubs
#> 20 diamonds
- Negative integers: remove the related index. For example,
  - deck[-1, 1:3] means it wants all rows except row 1, and columns 1 to 3.
  - deck[-(2:20), 1:2] means it wants all rows except rows 2 to 20, and columns 1 to 2.
  - Negative index and positive index cannot be used together in the same index.
- Blank spaces: want every value in the dimension.
deck[, 1]
#> [1] spades hearts clubs diamonds spades hearts clubs diamonds
#> [9] spades hearts clubs diamonds spades hearts clubs diamonds
#> [17] spades hearts clubs diamonds
#> Levels: spades hearts clubs diamonds
deck[1, ]
#> Var1 Var2
#> 1 spades 1
- Logical values: select the rows or columns according to the value. The dimension should have exactly the same number of elements as the logical vector.
rows <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
deck[rows,]
#> Var1 Var2
#> 1 spades 1
#> 3 clubs 1
#> 5 spades 2
#> 6 hearts 2
#> 8 diamonds 2
#> 10 hearts 3
#> 11 clubs 3
#> 13 spades 4
#> 15 clubs 4
#> 16 diamonds 4
#> 18 hearts 5
#> 20 diamonds 5
deck[1:2, c(TRUE, FALSE)]
#> [1] spades hearts
#> Levels: spades hearts clubs diamonds
- Names: select columns based on the names attribute.
deck[, 'Var2']
#> [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5
8.4.2 Dollar signs and double brackets
Lists and data.frames obey an optional second system of notation. You can extract values using the $ syntax: the data.frame’s name and the column name separated by a $ will select a column and return a vector (since the data in each column is actually a vector).
Example 8.13 Here is an example about data.frames.
deck[, 1]
#> [1] spades hearts clubs diamonds spades hearts clubs diamonds
#> [9] spades hearts clubs diamonds spades hearts clubs diamonds
#> [17] spades hearts clubs diamonds
#> Levels: spades hearts clubs diamonds
deck$Var1
#> [1] spades hearts clubs diamonds spades hearts clubs diamonds
#> [9] spades hearts clubs diamonds spades hearts clubs diamonds
#> [17] spades hearts clubs diamonds
#> Levels: spades hearts clubs diamonds
Note that if we select from the data.frame using a single index, we will get a data.frame.
deck[1]
#> Var1
#> 1 spades
#> 2 hearts
#> 3 clubs
#> 4 diamonds
#> 5 spades
#> 6 hearts
#> 7 clubs
#> 8 diamonds
#> 9 spades
#> 10 hearts
#> 11 clubs
#> 12 diamonds
#> 13 spades
#> 14 hearts
#> 15 clubs
#> 16 diamonds
#> 17 spades
#> 18 hearts
#> 19 clubs
#> 20 diamonds
class(deck[1])
#> [1] "data.frame"
Example 8.14 Here is an example about lists.
lst <- list(numbers = c(1, 2), logical = TRUE, strings = c("a", "b", "c"))
lst$numbers
#> [1] 1 2
Note that if we select from the list using a single index, we will get a list.
lst[1]
#> $numbers
#> [1] 1 2
class(lst[1])
#> [1] "list"
Please think through these two examples and figure out the similarity between them.
Understanding the type of the return value is very important. Many R functions work with vectors, but they don’t work with lists. So using the correct way to get values is very important.
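The difference in return types can be sketched with the list from Example 8.14. The double bracket [[ ]] extracts the element itself, just like $, while the single bracket keeps the list wrapper. The tryCatch() call and the name result are only used here to show the failure without stopping the script.

```r
lst <- list(numbers = c(1, 2), logical = TRUE, strings = c("a", "b", "c"))

class(lst[1])             # "list": single brackets keep the container
class(lst[[1]])           # "numeric": double brackets return the element itself
class(lst[["numbers"]])   # "numeric": double brackets also accept a name

sum(lst[[1]])             # 3
sum(lst$numbers)          # 3

# sum(lst[1]) fails because sum() does not accept a list
result <- tryCatch(sum(lst[1]), error = function(e) "error")
result                    # "error"
```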
There is a command called attach() which lets you access deck$face by just typing face. It is highly recommended NOT to do this. It is much better to make everything explicit, especially since autocompletion in an IDE makes typing easy.
8.5 Modifying values
8.5.1 Changing values in place
You can use R’s notation system to modify values within an R object.
- In general when working with vectors, the two vectors should have the same length.
- If the lengths are different, R will repeat the shorter one to match the longer one. This is called the vector recycling rule. R will throw a warning if the longer length is not a multiple of the shorter one.
Example 8.15
1:4 + 1:2
#> [1] 2 4 4 6
1:4 + 1:3
#> Warning in 1:4 + 1:3: longer object length is not a multiple of shorter object
#> length
#> [1] 2 4 6 5
- We may create values that do not yet exist in the object. R will expand the object to accommodate the new values.
Example 8.16
vec <- 1:6
vec
#> [1] 1 2 3 4 5 6
vec[7] <- 0
vec
#> [1] 1 2 3 4 5 6 0
Example 8.17
df <- data.frame(a=c(1,2), b=c('a', 'b'))
df
#> a b
#> 1 1 a
#> 2 2 b
df$c <- 3:4
df
#> a b c
#> 1 1 a 3
#> 2 2 b 4
8.5.2 Logical subsetting
We could compare two vectors element-wise, and the result is a logical vector. Then we could use this result to subset the vector / data.frame.
Example 8.18
suit <- c('spades', 'hearts', 'clubs', 'diamonds')
face <- 1:5
deck <- expand.grid(suit, face)
deck$Var1 == 'hearts'
#> [1] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
#> [13] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
deck$Var2[deck$Var1 == 'hearts']
#> [1] 1 2 3 4 5
deck[deck$Var1 == 'hearts',]
#> Var1 Var2
#> 2 hearts 1
#> 6 hearts 2
#> 10 hearts 3
#> 14 hearts 4
#> 18 hearts 5
We could directly assign values to the subset. Note that the following assignment creates a new column Var3, with NA in the rows that are not assigned.
deck$Var3[deck$Var1 == 'hearts'] <- 1
deck
#> Var1 Var2 Var3
#> 1 spades 1 NA
#> 2 hearts 1 1
#> 3 clubs 1 NA
#> 4 diamonds 1 NA
#> 5 spades 2 NA
#> 6 hearts 2 1
#> 7 clubs 2 NA
#> 8 diamonds 2 NA
#> 9 spades 3 NA
#> 10 hearts 3 1
#> 11 clubs 3 NA
#> 12 diamonds 3 NA
#> 13 spades 4 NA
#> 14 hearts 4 1
#> 15 clubs 4 NA
#> 16 diamonds 4 NA
#> 17 spades 5 NA
#> 18 hearts 5 1
#> 19 clubs 5 NA
#> 20 diamonds 5 NA
Other than the regular logical operators, R provides a special one: %in%.
x %in% y: Is x in the vector y?
If x is a vector, the output is a vector with the same length as x, telling whether each element of x is in y or not.
Other than the regular Boolean operators, R provides two special functions: any and all.
- any(cond1, cond2, ...): Are any of these conditions true?
- all(cond1, cond2, ...): Are all of these conditions true?
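A minimal sketch of %in%, any and all together. The vectors x, suits and v are made up for this example, and & is the usual element-wise "and".

```r
x <- c("hearts", "stars", "clubs")
suits <- c("spades", "hearts", "clubs", "diamonds")

in_suits <- x %in% suits     # TRUE FALSE TRUE
any(in_suits)                # TRUE: at least one element of x is a suit
all(in_suits)                # FALSE: "stars" is not a suit

# Element-wise comparisons can be combined first and summarized afterwards
v <- c(5, 15, 25)
between <- v > 10 & v < 20   # FALSE TRUE FALSE
any(between)                 # TRUE
```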
8.5.3 Missing values NA
In R, missing values are NA, and you can directly work with NA. In general, computations involving NA return NA.
na.rm: Many R functions come with the optional argument na.rm. If you set it to TRUE, the function will ignore NA values when computing the result.
Example 8.19
mean(c(NA, 1:50))
#> [1] NA
mean(c(NA, 1:50), na.rm=TRUE)
#> [1] 25.5
is.na(): This is a function testing whether each element of an object is NA.
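The following sketch, with a made-up vector v, shows why is.na() is needed: comparing with NA using == is itself a computation involving NA, so it returns NA rather than TRUE or FALSE.

```r
v <- c(1, NA, 3)

v == NA                # NA NA NA
is.na(v)               # FALSE TRUE FALSE
sum(is.na(v))          # 1, the number of missing values
v[!is.na(v)]           # 1 3, the non-missing values
sum(v)                 # NA
sum(v, na.rm = TRUE)   # 4
```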
8.6 Exercises
Exercise 8.1 Start an R Markdown / Quarto file. In the first section write an R code block to print Hello world!.
Exercise 8.2 Which of these are character strings and which are numbers? 1, "1", "one".
Exercise 8.3 Create an atomic vector that stores just the face names of the cards: the ace of spades, king of spades, queen of spades, jack of spades, and ten of spades. Which type of vector will you use to save the names?
Hint: The face name of the ace of spades would be ace and spades is the suit.
Exercise 8.4 Create the following matrix, which stores the name and suit of every card in a royal flush.
#> [,1] [,2]
#> [1,] "ace" "spades"
#> [2,] "king" "spades"
#> [3,] "queen" "spades"
#> [4,] "jack" "spades"
#> [5,] "ten" "spades"
Exercise 8.5 Many card games assign a numerical value to each card. For example, in blackjack, each face card is worth 10 points, each number card is worth between 2 and 10 points, and each ace is worth 1 or 11 points, depending on the final score.
Make a virtual playing card by combining “ace” “heart” and 1 into a vector. What type of atomic vector will result? Check if you are right, and explain your reason.
Exercise 8.6 Use a list to store a single playing card, like the ace of hearts, which has a point value of one. The list should save the face of the card, the suit, and the point value in separate elements.
Exercise 8.7 Consider the following data.frame.
suit <- c('spades', 'hearts', 'clubs', 'diamonds')
face <- 1:5
deck <- expand.grid(suit, face)
Please write some code to count how many rows have Var1 equal to hearts.
Exercise 8.8 Convert the following sentences into tests written with R code.
- w <- c(-1, 0, 1). Is w positive?
- x <- c(5, 15). Is x greater than 10 and less than 20?
- y <- 'February'. Is object y the word February?
- z <- c("Monday", "Tuesday", "Friday"). Is every value in z a day of the week?
Exercise 8.9 Please write a function to shuffle the rows of a data.frame. You may use the following data.frame deck for testing.
suit <- c('spades', 'hearts', 'clubs', 'diamonds')
face <- 1:13
deck <- expand.grid(suit, face)