8 R Fundamentals

A few advantages of R:

It is free and open source, unlike some other tools such as Excel and SPSS.
It is optimized for vectorized operations.

8.1 Hello world for R

print('Hello world!')
#> [1] "Hello world!"

8.2 Essential concepts

In R, the assignment operator is <-, not =. = also works in most situations, but it can cause confusion, so it is always recommended to use <-. The RStudio keybinding for <- is Alt+-.
. is NOT a special character in R and can be used in variable names. So is.na() is simply a function called is.na. It is not a function na in a package is, as it would be in Python.
In R, a block is delimited by {}. Indentation does not change how the code runs.
R’s package management is generally simpler than Python’s, so in most cases you don’t need a virtual environment for R.
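The points above can be illustrated with a minimal sketch. The variable name total.count is hypothetical and chosen only to show a dot inside a name.

```r
# `<-` is the conventional assignment operator.
total.count <- 3          # the dot is simply part of the variable name
is.na(total.count)        # calls the function named `is.na`
#> [1] FALSE

# Braces, not indentation, delimit a block.
if (total.count > 2) {
print("more than two")    # deliberately unindented; still inside the block
}
#> [1] "more than two"
```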

8.2.1 R Markdown / Quarto

The counterpart of a Jupyter notebook in R is a .rmd/.qmd file. As in a notebook, an R Markdown / Quarto file contains code blocks. Running them produces a document that combines text, code and code output.

In the following two sections about R, you are supposed to submit a .rmd/.qmd file.

Note

Quarto is an extension/continuation of R Markdown. Most R Markdown files can be converted to Quarto files with few modifications. The main difference between R Markdown and Quarto is that Quarto has better support for other languages such as Python and Julia. You may go to its homepage for more details.

This note is produced by Quarto.

The most important part of R Markdown / Quarto is the code block, that is

```{r}
print('Hello world!')
```

In Quarto, you may also write

```{python}
print('Hello world!')
```

There are many options to adjust how the code blocks are executed. You don’t need to worry about them right now. For now, just try to write your report together with code blocks.

8.3 Data structures

The main references here are [1] and [2].

8.3.1 Vectors

A vector is one of the basic data structures in R. It is created by the c() function and is sometimes called an atomic vector. All elements of a vector have the same type. R recognizes six basic types: double, integer, character, logical, complex and raw.

The data type inside a vector can be checked with the typeof() function.

die <- c(1, 2, 3, 4, 5, 6)
typeof(die)
#> [1] "double"
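As a quick sketch, typeof() can be called on one value of each of the six basic types. The values themselves are arbitrary.

```r
typeof(1)            # numbers are double by default
#> [1] "double"
typeof(1L)           # the L suffix creates an integer
#> [1] "integer"
typeof("one")
#> [1] "character"
typeof(TRUE)
#> [1] "logical"
typeof(1 + 2i)
#> [1] "complex"
typeof(as.raw(255))
#> [1] "raw"
```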

For consecutive numbers, an easier way to create a vector is to use :. Note that the result has type integer rather than double.

die <- 1:6

Caution

Note that vector indices start from 1 in R, while list indices start from 0 in Python.

die[1]
#> [1] 1

When selecting several elements by position, don’t forget to wrap the positions in c().

die[c(2, 3)]
#> [1] 2 3

die[2:3]
#> [1] 2 3

You may use the length() function to get the length of a vector.

length(die)
#> [1] 6

8.3.2 Attributes

R objects may have attributes. Attributes won’t be shown by default when you show the object. You may find the attributes of an R object by calling the attributes() function.

The following example shows that the vector die defined in Section 8.3.1 doesn’t have attributes.

attributes(die)
#> NULL

Attributes can be read and written using the attr() function. See the following example.

Example 8.1

attr(die, 'date') <- '2022-01-01'
die
#> [1] 1 2 3 4 5 6
#> attr(,"date")
#> [1] "2022-01-01"
attr(die, 'date') <- NULL
die
#> [1] 1 2 3 4 5 6

You may think of attributes as metadata attached to an R object. They carry useful information about the object, and some functions interact with certain attributes. R itself treats the attributes class, comment, dim, dimnames, names, row.names and tsp specially. We will only talk about class and names here. dim will be discussed in the next section. Others will be discussed when we use them.

class: This is different from a class in Python. In R, class is an attribute that records the class of an object. If the attribute class is not assigned to an object, the object will have an implicit class: matrix, array, function, numeric or the result of typeof().

attr(x, 'class') will show the “external” class of an object. You may also use class(x) to read and write attribute class. If the class is not assigned, class(x) will show the implicit class, while attr(x, 'class') will show NULL.

Example 8.2

attr(die, 'class')
#> NULL
class(die)
#> [1] "integer"
class(die) <- 'a die'
attr(die, 'class')
#> [1] "a die"
class(die)
#> [1] "a die"

names: This attribute is used to name each element in a vector. After the names are assigned, they won’t be displayed below the data like other attributes. They will be displayed above the data with correct alignment. Similar to class, you may use names() to read and write the attribute.

Example 8.3

names(die) <- c('one', 'two', 'three', 'four', 'five', 'six')
die
#>   one   two three  four  five   six 
#>     1     2     3     4     5     6
attributes(die)
#> $names
#> [1] "one"   "two"   "three" "four"  "five"  "six"
names(die)
#> [1] "one"   "two"   "three" "four"  "five"  "six"
is.vector(die)
#> [1] TRUE

Tip

When you store different types of data in a single vector, R converts them to a single type. The default rules are

if there are only logicals and numbers, logicals are converted to numbers by TRUE->1 and FALSE->0.
if characters are present, everything is converted to characters as it is written.

c(1, TRUE)
#> [1] 1 1
c('1', 1, TRUE)
#> [1] "1"    "1"    "TRUE"

Note

We can apply regular operators to vectors. By default, the operators are applied element-wise.
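A short sketch of element-wise behaviour, using a freshly defined die so that the block is self-contained:

```r
die <- 1:6
die + 1          # the single number is added to every element
#> [1] 2 3 4 5 6 7
die * die        # elements are multiplied pairwise
#> [1]  1  4  9 16 25 36
die > 3          # comparisons are element-wise as well
#> [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE
```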

8.3.3 Matrices and arrays

m <- matrix(c(1,2,3,4,5,6), nrow=2)
m[1, ]
#> [1] 1 3 5

A matrix has a dim attribute.

dim(m)
#> [1] 2 3

Note that by assigning and removing dim attribute, you may change the object between vectors and matrices.

Example 8.4

m
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6
is.matrix(m)
#> [1] TRUE
is.vector(m)
#> [1] FALSE
dim(m)
#> [1] 2 3
dim(m) <- NULL
m
#> [1] 1 2 3 4 5 6
is.matrix(m)
#> [1] FALSE
is.vector(m)
#> [1] TRUE

Note

The dim attribute of a matrix/vector can have more than two entries. In this case, the object becomes a higher-dimensional array.
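As an illustrative sketch, assigning a dim of length three turns a vector into a three-dimensional array. The name arr and the shape 2 x 3 x 2 are arbitrary choices.

```r
arr <- 1:12
dim(arr) <- c(2, 3, 2)   # 2 rows, 3 columns, 2 slices
is.array(arr)
#> [1] TRUE
is.matrix(arr)           # a matrix has exactly two dimensions
#> [1] FALSE
arr[1, 2, 2]             # row 1, column 2, slice 2
#> [1] 9
```

The values are filled column by column, which is why the element at position (1, 2, 2) is 9.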

8.3.4 Factors

A factor is a special vector. It is a way to handle categorical data. The idea is to limit the possible values. In a factor, the possible values are called levels, and they are stored as an attribute.

Example 8.5 We would like to talk about all months. We first define a vector of the valid levels:

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

Then we can turn a vector of months into a factor with the function factor().

x1 <- c("Dec", "Apr", "Jan", "Mar")
y1 <- factor(x1, levels=month_levels)
sort(x1)
#> [1] "Apr" "Dec" "Jan" "Mar"
sort(y1)
#> [1] Jan Mar Apr Dec
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Note that sorting y1 is based on the levels.

x2 <- c("Dec", "Apr", "Jam", "Mar")
y2 <- factor(x2, levels=month_levels)
y2
#> [1] Dec  Apr  <NA> Mar 
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Note that y2 contains an NA value since there is an entry in x2 ("Jam") that is not a valid level.
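To see how a factor is stored, the following sketch inspects a small factor. The three-level example is an assumption made to keep the output short.

```r
y <- factor(c("Dec", "Apr", "Jan"), levels = c("Jan", "Apr", "Dec"))
levels(y)          # the levels attribute
#> [1] "Jan" "Apr" "Dec"
typeof(y)          # the data itself is stored as integers
#> [1] "integer"
as.integer(y)      # position of each value within the levels
#> [1] 3 2 1
class(y)
#> [1] "factor"
```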

8.3.5 Lists

A list is very similar to a vector. The main difference is that a vector can only store values, while a list can store objects. The most typical example of an object is another vector. Please see the following example.

Example 8.6

c(1:2, 3:4)
#> [1] 1 2 3 4
list(1:2, 3:4)
#> [[1]]
#> [1] 1 2
#> 
#> [[2]]
#> [1] 3 4

Note

The attributes of an object are stored in a list.

m <- matrix(c(1,2,3,4,5,6), nrow=2)
a <- attributes(m)
class(a)
#> [1] "list"

8.3.6 `data.frame`

A data.frame is a list with the class attribute data.frame, together with the restriction that all columns have the same length. You may think of it as a table.

df <- data.frame(face = c("ace", "two", "six"),
                 suit = c("clubs", "clubs", "clubs"),
                 value = c(1, 2, 3))
df
#>   face  suit value
#> 1  ace clubs     1
#> 2  two clubs     2
#> 3  six clubs     3

A data.frame groups vectors. Each vector represents a column.
Different columns can contain different types of data, but every cell within one column must be the same type of data.
data.frame() can be used to create a data.frame.
The type of a data.frame is list. Just as a matrix is a vector with a dim attribute, a data.frame is a list with class data.frame, as well as a few other attributes.
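The last point can be checked directly. This is a small sketch with a two-column data.frame invented for the purpose.

```r
cards <- data.frame(face = c("ace", "two"), value = c(1, 2))
typeof(cards)     # the underlying type
#> [1] "list"
class(cards)      # the class attribute
#> [1] "data.frame"
is.list(cards)
#> [1] TRUE
length(cards)     # as for a list: the number of elements, i.e. columns
#> [1] 2
```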

8.3.7 Examples

Example 8.7 Consider a data.frame representing a deck of cards. Here we use expand.grid() to perform the Cartesian product.

suit <- c('spades', 'hearts', 'clubs', 'diamonds')
face <- 1:13
deck <- expand.grid(suit, face)
head(deck)
#>       Var1 Var2
#> 1   spades    1
#> 2   hearts    1
#> 3    clubs    1
#> 4 diamonds    1
#> 5   spades    2
#> 6   hearts    2

We may assign names to change the column names.

names(deck) <- c('suit', 'face')
head(deck)
#>       suit face
#> 1   spades    1
#> 2   hearts    1
#> 3    clubs    1
#> 4 diamonds    1
#> 5   spades    2
#> 6   hearts    2

Note that merge() can also compute the Cartesian product of the two vectors suit and face. expand.grid() is good for both vectors and data.frames.

deck <- merge(suit, face)
head(deck)
#>          x y
#> 1   spades 1
#> 2   hearts 1
#> 3    clubs 1
#> 4 diamonds 1
#> 5   spades 2
#> 6   hearts 2

8.3.8 Load data

8.3.8.1 Built-in datasets

R has many built-in datasets. You may use data() to see all of them. Here are a few common datasets.

mtcars: Motor Trend Car Road Tests: The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models)

data(mtcars)

iris: iris data set gives the measurements in centimeters of the variables sepal length, sepal width, petal length and petal width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

data(iris)

ToothGrowth: ToothGrowth data set contains the result from an experiment studying the effect of vitamin C on tooth growth in 60 guinea pigs.

data(ToothGrowth)

PlantGrowth: Results obtained from an experiment to compare yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions.

data(PlantGrowth)

USArrests: This data set contains statistics about violent crime rates by US state.

data(USArrests)

8.3.8.2 Read from files

The built-in read.csv() function can directly read a .csv file into a data.frame.

Example 8.8 We use the file yob1880.txt from Chapter 5 here. Put the file in the working folder and run the following code.

df <- read.csv('yob1880.txt', header = FALSE)
head(df)

We may also manually assign column names.

names(df) <- c('name', 'sex', 'counts')
head(df)
#>        name sex counts
#> 1      Mary   F   7065
#> 2      Anna   F   2604
#> 3      Emma   F   2003
#> 4 Elizabeth   F   1939
#> 5    Minnie   F   1746
#> 6  Margaret   F   1578

Note

Saving data is straightforward.

write.csv(df, file='df.csv', row.names=FALSE)
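As a sketch of a full round trip, the block below writes a small data.frame and reads it back. The data.frame small and the temporary file path are assumptions made so that the example runs without yob1880.txt.

```r
small <- data.frame(name = c("Mary", "Anna"), counts = c(7065, 2604))
path <- tempfile(fileext = ".csv")     # temporary location, for the demo only
write.csv(small, file = path, row.names = FALSE)

back <- read.csv(path)                 # the header line is read by default
back
#>   name counts
#> 1 Mary   7065
#> 2 Anna   2604
```

Without row.names = FALSE, write.csv() would also write the row names as an extra first column.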

8.3.9 Flow control

8.3.9.1 `for` loop

Example 8.9

for (x in 1:10){
    print(x)
}
#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
#> [1] 5
#> [1] 6
#> [1] 7
#> [1] 8
#> [1] 9
#> [1] 10

8.3.9.2 `if-else`

Example 8.10

a <- 200
b <- 33

if (b > a) {
  print("b is greater than a")
} else if (a == b) {
  print("a and b are equal")
} else {
  print("a is greater than b")
}
#> [1] "a is greater than b"

8.3.9.3 Functions

The standard format to define a function is my_function <- function(input) {}, where the function name is on the left side of <-, the input arguments are in the (), and the function body is in {}. The value of the last expression evaluated in the function body is the return value of the function.

Example 8.11

myfunction <- function() {
    die <- 1:6
    sum(die)
}

myfunction()
#> [1] 21

If you just type the function name without (), R will return the definition of the function.

myfunction
#> function() {
#>     die <- 1:6
#>     sum(die)
#> }
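The example above takes no input. As a sketch of a function with arguments, consider the following. The name roll_total and its arguments faces and bonus are hypothetical.

```r
# Sum the given faces and add an optional bonus (default 0).
roll_total <- function(faces, bonus = 0) {
    sum(faces) + bonus     # last expression: this is the return value
}

roll_total(1:6)
#> [1] 21
roll_total(1:6, bonus = 2)
#> [1] 23
```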

Tip

The function sample(x) takes a sample of the specified size from the elements of x, either with or without replacement.

sample(x, size, replace = FALSE, prob = NULL):

x: either a vector of one or more elements from which to choose, or a positive integer.
size: a non-negative integer giving the number of items to choose.
replace: should sampling be with replacement?
prob: a vector of probability weights for obtaining the elements of the vector being sampled.
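A brief sketch of how these arguments can be used. The seed value is arbitrary, and the actual draws are not shown because they depend on the random number generator.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
die <- 1:6

hand <- sample(die, size = 2)     # two different faces (no replacement)
length(hand)
#> [1] 2

rolls <- sample(die, size = 10, replace = TRUE)  # size > length(x) requires replacement
all(rolls %in% die)
#> [1] TRUE

shuffled <- sample(die)           # size defaults to length(x): a random permutation
sort(shuffled)
#> [1] 1 2 3 4 5 6
```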

8.4 R notations

8.4.1 Selecting Values

Let us start from a data.frame df. The basic usage is df[ , ], where the first index subsets the rows and the second index subsets the columns. There are several ways to write the indexes.

Positive integers: the regular way.

df[i, j] means the data in the ith row and jth column.
If both i and j are vectors with more than one element, a data.frame will be returned.
If j selects a single column, a vector will be returned, even when i is a vector. If you still want a data.frame, you may add the option drop=FALSE.
If only one index is provided, it refers to the column.

Example 8.12 We consider the simplified version of a deck. The deck only contains face values from 1 to 5. The columns keep the default names Var1 and Var2 given by expand.grid().

deck[1:2, 1:2]
#>     Var1 Var2
#> 1 spades    1
#> 2 hearts    1
deck[1:2, 1]
#> [1] spades hearts
#> Levels: spades hearts clubs diamonds
deck[1:2, 1, drop=FALSE]
#>     Var1
#> 1 spades
#> 2 hearts
deck[1]
#>        Var1
#> 1    spades
#> 2    hearts
#> 3     clubs
#> 4  diamonds
#> 5    spades
#> 6    hearts
#> 7     clubs
#> 8  diamonds
#> 9    spades
#> 10   hearts
#> 11    clubs
#> 12 diamonds
#> 13   spades
#> 14   hearts
#> 15    clubs
#> 16 diamonds
#> 17   spades
#> 18   hearts
#> 19    clubs
#> 20 diamonds

Negative integers: exclude the corresponding indices.

For example,

deck[-1, 1:2] selects all rows except row 1, and columns 1 to 2.
deck[-(2:20), 1:2] selects all rows except rows 2 to 20, and columns 1 to 2.
Negative index and positive index cannot be used together in the same index.
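The same rules apply to plain vectors, which makes for a compact sketch. The vector vec is made up for this illustration.

```r
vec <- c(10, 20, 30, 40, 50)
vec[-1]              # everything except the first element
#> [1] 20 30 40 50
vec[-(2:4)]          # everything except elements 2 to 4
#> [1] 10 50

# Mixing negative and positive indices is an error, so it is wrapped in try().
mixed <- try(vec[c(-1, 2)], silent = TRUE)
inherits(mixed, "try-error")
#> [1] TRUE
```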

Blank spaces: select every value in the dimension.

deck[, 1]
#>  [1] spades   hearts   clubs    diamonds spades   hearts   clubs    diamonds
#>  [9] spades   hearts   clubs    diamonds spades   hearts   clubs    diamonds
#> [17] spades   hearts   clubs    diamonds
#> Levels: spades hearts clubs diamonds
deck[1, ]
#>     Var1 Var2
#> 1 spades    1

Logical values: select the rows or columns according to the value. The dimension should have exactly the same number of elements as the logical vector.

rows <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
deck[rows,]
#>        Var1 Var2
#> 1    spades    1
#> 3     clubs    1
#> 5    spades    2
#> 6    hearts    2
#> 8  diamonds    2
#> 10   hearts    3
#> 11    clubs    3
#> 13   spades    4
#> 15    clubs    4
#> 16 diamonds    4
#> 18   hearts    5
#> 20 diamonds    5
deck[1:2, c(TRUE, FALSE)]
#> [1] spades hearts
#> Levels: spades hearts clubs diamonds

Names: select columns based on names attribute.

deck[, 'Var2']
#>  [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5

8.4.2 Dollar signs and double brackets

Lists and data.frames obey an optional second system of notation. You can extract values using $ syntax: the data.frame’s name and the column name separated by a $ will select a column and return a vector (since the data in each column is actually a vector).

Example 8.13 Here is an example about data.frames.

deck[, 1]
#>  [1] spades   hearts   clubs    diamonds spades   hearts   clubs    diamonds
#>  [9] spades   hearts   clubs    diamonds spades   hearts   clubs    diamonds
#> [17] spades   hearts   clubs    diamonds
#> Levels: spades hearts clubs diamonds
deck$Var1
#>  [1] spades   hearts   clubs    diamonds spades   hearts   clubs    diamonds
#>  [9] spades   hearts   clubs    diamonds spades   hearts   clubs    diamonds
#> [17] spades   hearts   clubs    diamonds
#> Levels: spades hearts clubs diamonds

Note that if we select from the data.frame using a single index in single brackets, we will get a data.frame.

deck[1]
#>        Var1
#> 1    spades
#> 2    hearts
#> 3     clubs
#> 4  diamonds
#> 5    spades
#> 6    hearts
#> 7     clubs
#> 8  diamonds
#> 9    spades
#> 10   hearts
#> 11    clubs
#> 12 diamonds
#> 13   spades
#> 14   hearts
#> 15    clubs
#> 16 diamonds
#> 17   spades
#> 18   hearts
#> 19    clubs
#> 20 diamonds
class(deck[1])
#> [1] "data.frame"

Example 8.14 Here is an example about lists.

lst <- list(numbers = c(1, 2), logical = TRUE, strings = c("a", "b", "c"))
lst$numbers
#> [1] 1 2

Note that if we select from the list using an index in single brackets, we will get a list.

lst[1]
#> $numbers
#> [1] 1 2
class(lst[1])
#> [1] "list"

Please think through these two examples and figure out the similarity between them.
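Double brackets behave like $: they return the element itself rather than a list containing it. A minimal sketch, reusing the list from Example 8.14:

```r
lst <- list(numbers = c(1, 2), logical = TRUE, strings = c("a", "b", "c"))

lst[["numbers"]]      # by name: the element itself
#> [1] 1 2
lst[[1]]              # by position: the same element
#> [1] 1 2
class(lst[[1]])
#> [1] "numeric"
sum(lst[[1]])         # works; sum(lst[1]) would fail because lst[1] is a list
#> [1] 3
```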

Caution

Understanding the type of the return value is very important. Many R functions work with vectors but not with lists, so using the correct way to extract values is very important.

Warning

There is a command called attach() which lets you access deck$face by just typing face. It is highly recommended NOT to do this. It is much better to make everything explicit, especially since typing is much easier when using an IDE.

8.5 Modifying values

8.5.1 Changing values in place

You can use R’s notation system to modify values within an R object.

In general, when working with two vectors, they should have the same length.
If the lengths are different, R will repeat the shorter one to make it match the longer one. This is called the vector recycling rule. R will throw a warning if the longer length is not a multiple of the shorter one.

Example 8.15

1:4 + 1:2
#> [1] 2 4 4 6
1:4 + 1:3
#> Warning in 1:4 + 1:3: longer object length is not a multiple of shorter object
#> length
#> [1] 2 4 6 5

We may create values that do not yet exist in the object. R will expand the object to accommodate the new values.

Example 8.16

vec <- 1:6
vec
#> [1] 1 2 3 4 5 6
vec[7] <- 0
vec
#> [1] 1 2 3 4 5 6 0

Example 8.17

df <- data.frame(a=c(1,2), b=c('a', 'b'))
df
#>   a b
#> 1 1 a
#> 2 2 b
df$c <- 3:4
df
#>   a b c
#> 1 1 a 3
#> 2 2 b 4

8.5.2 Logical subsetting

We could compare two vectors element-wise, and the result is a logical vector. Then we could use this result to subset the vector / data.frame.

Example 8.18

suit <- c('spades', 'hearts', 'clubs', 'diamonds')
face <- 1:5
deck <- expand.grid(suit, face)

deck$Var1 == 'hearts'
#>  [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
#> [13] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
deck$Var2[deck$Var1 == 'hearts']
#> [1] 1 2 3 4 5
deck[deck$Var1 == 'hearts',]
#>      Var1 Var2
#> 2  hearts    1
#> 6  hearts    2
#> 10 hearts    3
#> 14 hearts    4
#> 18 hearts    5

We could directly assign values to the subset. Note that the following assignment creates a new column Var3 and fills the rows that are not assigned with NA.

deck$Var3[deck$Var1 == 'hearts'] <- 1
deck
#>        Var1 Var2 Var3
#> 1    spades    1   NA
#> 2    hearts    1    1
#> 3     clubs    1   NA
#> 4  diamonds    1   NA
#> 5    spades    2   NA
#> 6    hearts    2    1
#> 7     clubs    2   NA
#> 8  diamonds    2   NA
#> 9    spades    3   NA
#> 10   hearts    3    1
#> 11    clubs    3   NA
#> 12 diamonds    3   NA
#> 13   spades    4   NA
#> 14   hearts    4    1
#> 15    clubs    4   NA
#> 16 diamonds    4   NA
#> 17   spades    5   NA
#> 18   hearts    5    1
#> 19    clubs    5   NA
#> 20 diamonds    5   NA

Tip

Other than the regular logical operators, R provides a special one: %in%.

x %in% y: Is x in the vector y?

If x is a vector, the output is a vector with the same length as x, telling whether each element of x is in y or not.
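For instance, as a small sketch with made-up values:

```r
suits <- c("spades", "hearts", "clubs", "diamonds")
"hearts" %in% suits
#> [1] TRUE
c("hearts", "stars") %in% suits    # one answer per element of the left side
#> [1]  TRUE FALSE
```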

Tip

Other than the regular Boolean operators, R provides two special ones: any and all.

any(cond1, cond2, ...): Are any of these conditions true?
all(cond1, cond2, ...): Are all of these conditions true?
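A short sketch of both functions. The vector v is an arbitrary example.

```r
v <- c(2, 8, 12)
any(v > 10)            # at least one element is greater than 10
#> [1] TRUE
all(v > 10)            # not every element is greater than 10
#> [1] FALSE
all(v > 0, v < 20)     # several conditions can be passed at once
#> [1] TRUE
```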

8.5.3 Missing values `NA`

In R, missing values are NA, and you can directly work with NA. Most computations involving NA will return NA.

na.rm: Many R functions that summarize data, such as mean() and sum(), come with the optional argument na.rm. If you set it to be TRUE, the function will ignore NA when evaluating the function.

Example 8.19

mean(c(NA, 1:50))
#> [1] NA
mean(c(NA, 1:50), na.rm=TRUE)
#> [1] 25.5

is.na(): This function tests, element by element, whether a value is NA.
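The following sketch shows why is.na() is needed instead of a comparison with ==. The vector scores is invented for the example.

```r
scores <- c(3, NA, 5)
scores == NA              # comparing with NA gives NA, not TRUE or FALSE
#> [1] NA NA NA
is.na(scores)
#> [1] FALSE  TRUE FALSE
sum(is.na(scores))        # count the missing values
#> [1] 1
scores[!is.na(scores)]    # keep only the values that are not missing
#> [1] 3 5
```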

8.6 Exercises

Exercise 8.1 Start an R Markdown / Quarto file. In the first section write an R code block to print Hello world!.

Exercise 8.2 Which of these are character strings and which are numbers? 1, "1", "one".

Exercise 8.3 Create an atomic vector that stores just the face names of the cards: the ace of spades, king of spades, queen of spades, jack of spades, and ten of spades. Which type of vector will you use to save the names?

Hint: The face name of the ace of spades would be ace and spades is the suit.

Exercise 8.4 Create the following matrix, which stores the name and suit of every card in a royal flush.

#>      [,1]    [,2]    
#> [1,] "ace"   "spades"
#> [2,] "king"  "spades"
#> [3,] "queen" "spades"
#> [4,] "jack"  "spades"
#> [5,] "ten"   "spades"

Exercise 8.5 Many card games assign a numerical value to each card. For example, in blackjack, each face card is worth 10 points, each number card is worth between 2 and 10 points, and each ace is worth 1 or 11 points, depending on the final score.

Make a virtual playing card by combining “ace” “heart” and 1 into a vector. What type of atomic vector will result? Check if you are right, and explain your reason.

Exercise 8.6 Use a list to store a single playing card, like the ace of hearts, which has a point value of one. The list should save the face of the card, the suit, and the point value in separate elements.

Exercise 8.7 Consider the following data.frame.

suit <- c('spades', 'hearts', 'clubs', 'diamonds')
face <- 1:5
deck <- expand.grid(suit, face)

Please write some code to count how many rows have Var1 equal to hearts.

Exercise 8.8 Convert the following sentences into tests written with R code.

w <- c(-1, 0, 1). Is w positive?
x <- c(5, 15). Is x greater than 10 and less than 20?
y <- 'February'. Is object y the word February?
z <- c("Monday", "Tuesday", "Friday"). Is every value in z a day of the week?

Exercise 8.9 Please write a function to shuffle the rows of a data.frame. You may use the following data.frame deck for testing.

suit <- c('spades', 'hearts', 'clubs', 'diamonds')
face <- 1:13
deck <- expand.grid(suit, face)
