7 R Fundamentals
A few advantages of R:
- Free and open source, compared with some other tools like Excel and SPSS.
- Optimized with vectorization.
7.1 Hello world for R
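A minimal sketch of the classic first program, assuming the base print() function is what this section has in mind. The variable name greeting is only illustrative.

```r
# Store the greeting, then print it.
# At the console R displays a character value with a leading [1].
greeting <- "Hello world!"
print(greeting)
```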
7.2 Essential concepts
- In R, the assignment operator is <-, not =. The = operator actually works, but it may cause confusion, so it is always recommended to use <-. The R Studio keybinding for <- is Alt+-.
- The dot (.) is NOT a special character in R, and can be used in variable names. So is.na() simply means a function called is.na. It is not a function na in a package is, as it would be in Python.
- In R, a block is defined by {}. Indentation is not that important.
- R has a better package management system than Python, and therefore in most cases you don’t need a virtual environment for R.
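The first three points above can be tried out in a few lines. This is only a sketch; all variable names are made up for illustration.

```r
# Both forms assign a value, but <- is the recommended style.
x <- 5
y = 5
same_value <- identical(x, y)

# The dot is part of the name: is.na is one single function.
missing_value <- NA
is_missing <- is.na(missing_value)

# A block is delimited by braces; the indentation is only for readability.
if (x > 3) {
  size_label <- "big"
} else {
  size_label <- "small"
}

print(c(same_value, is_missing))
print(size_label)
```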
7.2.1 R Markdown / Quarto
The counterpart of a Jupyter notebook in R is an .rmd/.qmd file. Similar to a notebook, an R Markdown / Quarto file contains so-called code blocks, which run the code inside them to produce a document containing text, code and code output.
In the following two sections about R, you are supposed to submit an .rmd/.qmd file.
The most important part of R Markdown / Quarto is the code block, that is
In Quarto, you may also write
There are many options to adjust how the code blocks are executed. You don’t need to worry about them right now. For now, just try to write your report together with code blocks.
7.2.2 Development tools
7.2.2.1 R Studio
For R, the almost default choice of IDE is R Studio. You may download and install it from the homepage.
Note that the company behind R Studio has been renamed Posit, while the IDE itself keeps its name.
7.2.2.2 R Studio Cloud
You may go directly to the homepage to use R Studio in the cloud. If you don’t use it a lot, it should be free.
7.2.2.3 Google Colab
You may use R in Google Colab. The link is colab.to/r. After you open the notebook, you may go to Edit->Notebook settings to change Runtime type to R.
The rest is similar to a Jupyter notebook, except that the code is now R code.
7.3 Data structures
The main references here are [1] and [2].
7.3.1 Vectors
A vector is one of the basic data structures in R. It is created by the c() function. Sometimes it is called an atomic vector. All elements of an atomic vector must have the same data type; if you mix types, R coerces them to a common one. R recognizes six basic types: double, integer, character, logical, complex and raw.
The data type inside a vector can be checked with the typeof() function.
For consecutive numbers, an easier way to create a vector is to use the : operator.
When selecting several elements of a vector by position, don’t forget to wrap the positions in c().
You may use the length() function to get its length.
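The points above can be sketched as follows. The vector die is the one referred to later in this chapter; its exact definition is assumed here, and the other names are illustrative.

```r
# A die stored as an atomic vector of doubles.
die <- c(1, 2, 3, 4, 5, 6)
typeof(die)              # "double"

# The : operator creates consecutive integers.
faces <- 1:6
typeof(faces)            # "integer"

# Mixing types coerces everything to one common type.
mixed <- c(1, "a", TRUE)
typeof(mixed)            # "character"

# Select several elements by wrapping the positions in c().
odd_faces <- die[c(1, 3, 5)]
odd_faces                # 1 3 5

length(die)              # 6
```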
7.3.2 Attributes
R objects may have attributes. Attributes are not part of the data itself, and some of them (such as dim) change how the object is displayed instead of being printed. You may find the attributes of an R object by calling the attributes() function.
The following example shows that the vector die defined in Section 7.3.1 doesn’t have attributes.
Attributes can be read and written using the attr() function. See the following example.
Example 7.1
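A minimal sketch of reading and writing attributes, assuming die is the vector of the six faces. The attribute name note is made up for illustration.

```r
die <- 1:6

# A freshly created vector has no attributes.
attributes(die)                     # NULL

# Write an attribute, then read it back ("note" is a made-up name).
attr(die, "note") <- "a fair die"
attr(die, "note")                   # "a fair die"

# attributes() now returns a list with one entry.
attributes(die)
```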
You may think of attributes as metadata attached to an R object. They are used to convey some useful information about the object. Some functions will interact with certain attributes. R itself treats the attributes class, comment, dim, dimnames, names, row.names and tsp specially. We will only talk about class and names here. dim will be discussed in the next section. Others will be discussed when we use them.
class: This is different from the class in Python. class in R is an attribute which describes the class of an object. If the attribute class is not assigned to an object, the object will have an implicit class: matrix, array, function, numeric or the result of typeof.
attr(x, 'class') will show the “external” class of an object. You may also use class(x) to read and write the attribute class. If the class is not assigned, class(x) will show the implicit class, while attr(x, 'class') will show NULL.
Example 7.2
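The difference between the implicit class and the class attribute can be sketched as follows. The class name my_class is made up for illustration.

```r
x <- 1:6

# No class attribute has been assigned, so attr() gives NULL ...
attr(x, "class")             # NULL
# ... while class() falls back to the implicit class.
class(x)                     # "integer"
class(c(1.5, 2.5))           # "numeric", although typeof() gives "double"
class(sum)                   # "function"

# Assigning a class ("my_class" is a made-up name) sets the attribute.
class(x) <- "my_class"
attr(x, "class")             # "my_class"
```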
names: This attribute is used to name each element in a vector. After the names are assigned, they won’t be displayed below the data like other attributes. They will be displayed above the data with correct alignment. Similar to class, you may use names() to read and write the attribute.
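A short sketch of the names attribute, again using a die as the running example. The variable third is illustrative.

```r
die <- 1:6

# Write the names attribute.
names(die) <- c("one", "two", "three", "four", "five", "six")
print(die)              # the names are printed above the values

# Names can also be used to look up elements.
third <- die["three"]

# Assigning NULL removes the names again.
names(die) <- NULL
attributes(die)         # NULL
```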
7.3.3 matrices and arrays
A matrix has a dim attribute.
Note that by assigning and removing the dim attribute, you may change the object between vectors and matrices.
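A minimal sketch of switching between a vector and a matrix through the dim attribute. The names v and m are illustrative.

```r
v <- 1:6

# Assigning dim turns the vector into a 2 x 3 matrix (filled column by column).
dim(v) <- c(2, 3)
is.matrix(v)        # TRUE
m <- v              # keep a copy of the matrix form
m[2, 3]             # 6

# Removing dim turns it back into a plain vector.
dim(v) <- NULL
is.matrix(v)        # FALSE
```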
7.3.4 factors
A factor is a special vector. It is a way to handle categorical data. The idea is to limit the possible values. In a factor, the possible values are called levels, and they are stored in the levels attribute.
Example 7.5 We would like to talk about all months. We first define a vector of the valid levels:
Then we could start to transform some month vectors into factors, by the function factor().
[1] "Apr" "Dec" "Jan" "Mar"
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Note that sorting y1 is based on the levels.
[1] Dec Apr <NA> Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Note that y2 contains an NA value since there is an entry in x2 that is not valid.
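The outputs in this example are consistent with code along the following lines. This is a reconstruction under assumptions: the name month_levels, the order of the elements in x1, and the invalid entry "Jam" in x2 are guesses that reproduce the printed results, not necessarily the original code.

```r
# The valid levels, in calendar order (month_levels is an assumed name).
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

x1 <- c("Dec", "Apr", "Jan", "Mar")    # element order is an assumption
sort(x1)                               # plain strings sort alphabetically

y1 <- factor(x1, levels = month_levels)
sort(y1)                               # a factor sorts by its levels

x2 <- c("Dec", "Apr", "Jam", "Mar")    # "Jam" is a made-up invalid entry
y2 <- factor(x2, levels = month_levels)
y2                                     # the invalid entry becomes NA

levels(y2)                             # the levels attribute
```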
7.3.5 Lists
A list is very similar to a vector. The main difference is that an atomic vector can only store values of one basic type, while a list can store arbitrary objects. The most typical example of such an object is another vector. Please see the following example.
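A small sketch of the difference, with made-up names and contents: the list keeps each element as its own object, while c() flattens everything into one atomic vector of a single type.

```r
# A list keeps three separate objects of different types and lengths.
card_info <- list(numbers = 1:3, flag = TRUE, words = c("ace", "king"))
typeof(card_info)        # "list"
length(card_info)        # 3
str(card_info)

# c() on the same pieces flattens them into one character vector.
flat <- c(1:3, c("ace", "king"))
typeof(flat)             # "character"
length(flat)             # 5
```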
7.3.6 data.frame
A data.frame is a list with the class attribute data.frame, together with some restrictions on the shape of its columns: every column must have the same length. You may think about it in terms of tables.
df <- data.frame(face = c("ace", "two", "six"),
suit = c("clubs", "clubs", "clubs"),
value = c(1, 2, 3))
df
face suit value
1 ace clubs 1
2 two clubs 2
3 six clubs 3
- A data.frame groups vectors. Each vector represents a column.
- Different columns can contain different types of data, but every cell within one column must be the same type of data.
- data.frame() can be used to create a data.frame.
- The type of a data.frame is a list. Just as a matrix is a vector with a dim attribute, a data.frame is a list with class data.frame, as well as a few other attributes.
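The last point can be checked directly on the df defined above; the sketch below recreates it so that it runs on its own.

```r
df <- data.frame(face = c("ace", "two", "six"),
                 suit = c("clubs", "clubs", "clubs"),
                 value = c(1, 2, 3))

typeof(df)                # "list"
class(df)                 # "data.frame"
names(attributes(df))     # names, class and row.names

# As a list, its elements are the columns.
length(df)                # 3
df[["value"]]             # the column as a plain vector
```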
7.3.7 Examples
Example 7.7 Consider a data.frame representing a deck of cards. Here we use expand.grid() to perform the Cartesian product.
suit <- c('spades', 'hearts', 'clubs', 'diamonds')
face <- 1:13
deck <- expand.grid(suit, face)
head(deck)
Var1 Var2
1 spades 1
2 hearts 1
3 clubs 1
4 diamonds 1
5 spades 2
6 hearts 2
We may assign names to change the column names.
suit face
1 spades 1
2 hearts 1
3 clubs 1
4 diamonds 1
5 spades 2
6 hearts 2
Note that since suit and face are two vectors, merge() can also do the Cartesian product. expand.grid() is good for both vectors and data.frames.
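A sketch of the renaming step and of the merge() alternative. It assumes the renaming was done with names(); the name deck2 is made up for illustration.

```r
suit <- c("spades", "hearts", "clubs", "diamonds")
face <- 1:13
deck <- expand.grid(suit, face)

# Assign column names through the names attribute.
names(deck) <- c("suit", "face")
head(deck)

# merge() on two vectors with no common column also gives every combination.
deck2 <- merge(suit, face)
nrow(deck2)               # 52, the same as nrow(deck)
```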
7.3.8 Load data
7.3.8.1 Built-in datasets
R has many built-in datasets. You may use data() to see all of them. Here are a few common datasets.
mtcars: Motor Trend Car Road Tests. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
iris: The iris data set gives the measurements in centimeters of the variables sepal length, sepal width, petal length and petal width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
ToothGrowth: The ToothGrowth data set contains the result from an experiment studying the effect of vitamin C on tooth growth in 60 guinea pigs.
PlantGrowth: Results obtained from an experiment to compare yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions.
USArrests: This data set contains statistics about violent crime rates by US state.
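A quick way to get a first look at these datasets is sketched below. It only uses base functions, and the datasets are available by name without loading anything.

```r
# First rows and overall shape of mtcars.
head(mtcars, 3)
dim(mtcars)              # 32 rows, 11 columns

# Number of flowers per species in iris.
table(iris$Species)      # 50 for each of the 3 species

# Structure and size of ToothGrowth.
str(ToothGrowth)
nrow(ToothGrowth)        # 60
```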
7.3.8.2 Read from files
The built-in read.csv() function can directly read a .csv file into a data.frame.
Example 7.8 We use the file yob1880.txt from Chapter 5 here. Put the file in the working folder and run the following code.
We may also manually assign column names.
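A sketch of both steps that runs without the real file. It assumes that yob1880.txt is a comma-separated file without a header row; the temporary file, its three rows, and the column names name, sex and count are made-up placeholders, not the actual data.

```r
# Write a tiny stand-in file (illustrative rows, not real data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("Alice,F,10", "Bob,M,7", "Carol,F,3"), tmp)

# header = FALSE because the file is assumed to have no header row.
raw <- read.csv(tmp, header = FALSE)
names(raw)                # default names V1, V2, V3

# Assign the column names manually while reading.
df <- read.csv(tmp, header = FALSE, col.names = c("name", "sex", "count"))
head(df)
```

With the real file in the working folder, tmp would simply be replaced by "yob1880.txt".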
7.3.9 Flow control
7.3.9.1 for loop
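A minimal sketch of a for loop in R. The loop variable takes each value of the vector in turn, and the body is the block in braces; the variable names are illustrative.

```r
# Print the square of each number from 1 to 3.
for (i in 1:3) {
  print(i^2)
}

# Accumulate a sum over the elements of a vector.
total <- 0
for (value in c(2, 4, 6)) {
  total <- total + value
}
total                     # 12
```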
7.3.9.2 if-else
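A minimal sketch of if-else in R, with made-up variable names. Note that if works on a single condition; the base function ifelse() is the element-wise counterpart for vectors.

```r
x <- -3

# if / else if / else on a single value.
if (x > 0) {
  label <- "positive"
} else if (x == 0) {
  label <- "zero"
} else {
  label <- "negative"
}
label                     # "negative"

# ifelse() applies the test to every element of a vector.
labels <- ifelse(c(-1, 0, 1) > 0, "positive", "not positive")
labels
```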
7.3.9.3 Functions
The standard format to define a function is my_function <- function(input) {}, where the function name is on the left side of <-, the input arguments are in the (), and the function body is in {}. The value of the last expression evaluated in the function body is the return value of the function.
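The format above can be sketched with a small dice-rolling function. The function name roll_dice and its arguments sides and n are made up for illustration.

```r
# roll_dice is a made-up example function with two default arguments.
roll_dice <- function(sides = 6, n = 2) {
  die <- 1:sides
  dice <- sample(die, size = n, replace = TRUE)
  sum(dice)               # last expression: this is the return value
}

# Call it with the defaults, then with explicit arguments.
result <- roll_dice()
result
roll_dice(sides = 20, n = 1)
```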
7.4 R notations
7.4.1 Selecting Values
Let us start from a data.frame df. The basic usage is df[ , ], where the first index subsets the rows and the second index subsets the columns. There are several ways to write the indexes.
- Positive integers: the regular way. df[i, j] means the data in the ith row and jth column.
  - If the selection covers more than one column, a data.frame will be returned.
  - If the selection covers a single column, a vector will be returned. If you still want a data.frame, you may add the option drop=FALSE.
  - If only one index is provided, it refers to the column.
Example 7.12 We consider the simplified version of a deck. The deck only contains face values from 1 to 5.
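The four outputs that follow are consistent with calls along these lines. This is a reconstruction under assumptions: the deck is assumed to be built with expand.grid() as in Example 7.7, and the names of the intermediate results are made up.

```r
# A simplified deck: 4 suits times face values 1 to 5.
suit <- c("spades", "hearts", "clubs", "diamonds")
deck <- expand.grid(suit, 1:5)

# Two rows, two columns: a data.frame.
two_cols <- deck[1:2, 1:2]
two_cols

# Two rows, one column: a vector (here a factor).
one_col <- deck[1:2, 1]
one_col

# Same selection, but kept as a data.frame.
one_col_df <- deck[1:2, 1, drop = FALSE]
one_col_df

# A single index refers to the column.
first_col <- deck[1]
first_col
```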
Var1 Var2
1 spades 1
2 hearts 1
[1] spades hearts
Levels: spades hearts clubs diamonds
Var1
1 spades
2 hearts
Var1
1 spades
2 hearts
3 clubs
4 diamonds
5 spades
6 hearts
7 clubs
8 diamonds
9 spades
10 hearts
11 clubs
12 diamonds
13 spades
14 hearts
15 clubs
16 diamonds
17 spades
18 hearts
19 clubs
20 diamonds
- Negative integers: remove the related index.
  For example, deck[-1, 1:2] means it wants all rows except row 1, and columns 1 to 2. deck[-(2:20), 1:2] means it wants all rows except rows 2 to 20, and columns 1 to 2.
- Negative and positive integers cannot be used together in the same index.
- Blank spaces: select every value in that dimension.
[1] spades hearts clubs diamonds spades hearts clubs diamonds
[9] spades hearts clubs diamonds spades hearts clubs diamonds
[17] spades hearts clubs diamonds
Levels: spades hearts clubs diamonds
Var1 Var2
1 spades 1
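Negative and blank indexes can be sketched on the same simplified deck. The deck is rebuilt here so that the block runs on its own, and the result names are made up.

```r
deck <- expand.grid(c("spades", "hearts", "clubs", "diamonds"), 1:5)

# Negative integers: drop rows 2 to 20, so only row 1 is left.
only_first <- deck[-(2:20), 1:2]
only_first

# Blank row index: every row of column 1.
all_suits <- deck[, 1]
all_suits

# Blank column index: every column of row 1.
first_row <- deck[1, ]
first_row
```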
- Logical values: select the rows or columns according to the value. The dimension should have exactly the same number of elements as the logical vector.
rows <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
deck[rows,]
Var1 Var2
1 spades 1
3 clubs 1
5 spades 2
6 hearts 2
8 diamonds 2
10 hearts 3
11 clubs 3
13 spades 4
15 clubs 4
16 diamonds 4
18 hearts 5
20 diamonds 5
[1] spades hearts
Levels: spades hearts clubs diamonds
- Names: select columns based on the names attribute.
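Selecting by name can be sketched as follows, assuming the default column names Var1 and Var2 produced by expand.grid(). The result names are made up.

```r
deck <- expand.grid(c("spades", "hearts", "clubs", "diamonds"), 1:5)
names(deck)                            # "Var1" "Var2"

# One column selected by name: a vector.
by_name <- deck[1:2, "Var1"]
by_name

# Several columns selected by name: a data.frame.
by_names <- deck[1:2, c("Var1", "Var2")]
by_names
```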
7.4.2 Dollar signs and double brackets
Lists and data.frames obey an optional second system of notation. You can extract values using the $ syntax: the data.frame’s name and the column name separated by a $ will select a column and return a vector (since the data in each column is actually a vector).
Example 7.13 Here is an example about data.frames.
[1] spades hearts clubs diamonds spades hearts clubs diamonds
[9] spades hearts clubs diamonds spades hearts clubs diamonds
[17] spades hearts clubs diamonds
Levels: spades hearts clubs diamonds
[1] spades hearts clubs diamonds spades hearts clubs diamonds
[9] spades hearts clubs diamonds spades hearts clubs diamonds
[17] spades hearts clubs diamonds
Levels: spades hearts clubs diamonds
Note that if we select from the data.frame using index, we will get a data.frame.
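The two identical outputs above are consistent with extracting the same column once with $ and once with double brackets. The sketch below assumes that, and contrasts both with single-bracket selection; the result names are made up.

```r
deck <- expand.grid(c("spades", "hearts", "clubs", "diamonds"), 1:5)

# $ and [[ ]] both return the column itself, as a vector.
by_dollar <- deck$Var1
by_double <- deck[[1]]        # deck[["Var1"]] works as well

# Single brackets keep the data.frame wrapper.
by_single <- deck[1]

identical(by_dollar, by_double)   # TRUE
class(by_single)                  # "data.frame"
```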
Example 7.14 Here is an example about lists.
Note that if we select from the list using index, we will get a list.
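The same contrast can be sketched for a list. The list lst and its element names are made up for illustration.

```r
lst <- list(numbers = c(1, 2), logical = TRUE, strings = c("a", "b", "c"))

# $ and [[ ]] return the element itself.
lst$numbers                   # the numeric vector 1 2
lst[[1]]                      # the same vector

# Single brackets return a shorter list that still wraps the element.
sub_list <- lst[1]
typeof(sub_list)              # "list"

# So arithmetic works on the extracted element, not on the sub-list.
sum(lst[[1]])                 # 3
```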
Please think through these two examples and figure out the similarity between them.
7.5 Modifying values
7.5.1 Changing values in place
You can use R’s notation system to modify values within an R object.
- In general when working with vectors, the two vectors should have the same length.
- If the lengths are different, R will repeat the shorter one to make it match the longer one. This is called the vector recycling rule. R will throw a warning if the longer length is not a multiple of the shorter one.
Example 7.15
- We may create values that do not yet exist in the object. R will expand the object to accommodate the new values.
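The three points above can be sketched on a small vector. All names are made up for illustration.

```r
vec <- c(0, 0, 0, 0, 0, 0)

# Change values in place.
vec[1] <- 1000
vec[c(1, 3, 5)] <- c(1, 1, 1)
vec[4:6] <- vec[4:6] + 1
vec                           # 1 0 1 1 2 1

# Recycling: the shorter vector is repeated to match the longer one.
recycled <- 1:6 + c(10, 20)
recycled                      # 11 22 13 24 15 26

# When 6 is not a multiple of 4, R warns; here the warning text is captured.
warning_text <- tryCatch(1:6 + 1:4, warning = function(w) conditionMessage(w))
warning_text

# Assigning to a position that does not exist yet extends the vector.
vec[7] <- 0
length(vec)                   # 7
```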
7.5.2 Logical subsetting
We could compare two vectors element-wise, and the result is a logical vector. Then we could use this result to subset the vector / data.frame.
Example 7.18
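The three outputs that follow are consistent with calls along these lines. This is a reconstruction under assumptions, using the simplified deck from Example 7.12; the result names are made up.

```r
deck <- expand.grid(c("spades", "hearts", "clubs", "diamonds"), 1:5)

# Element-wise comparison gives a logical vector.
is_heart <- deck$Var1 == "hearts"
is_heart

# Use it to subset a vector ...
heart_values <- deck$Var2[is_heart]
heart_values                  # 1 2 3 4 5

# ... or the rows of the data.frame.
hearts <- deck[is_heart, ]
hearts
```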
[1] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
[13] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
[1] 1 2 3 4 5
Var1 Var2
2 hearts 1
6 hearts 2
10 hearts 3
14 hearts 4
18 hearts 5
We could directly assign values to the subset. Note that the following assignment creates a new column with NA values.
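An assignment of the following form would produce such a column. This is a sketch under assumptions: the column name Var3 and the value 1 are taken from the output, and the exact original call is not known.

```r
deck <- expand.grid(c("spades", "hearts", "clubs", "diamonds"), 1:5)

# Var3 does not exist yet: R creates it, fills the selected rows with 1
# and leaves every other row as NA.
deck$Var3[deck$Var1 == "hearts"] <- 1
head(deck)
```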
Var1 Var2 Var3
1 spades 1 NA
2 hearts 1 1
3 clubs 1 NA
4 diamonds 1 NA
5 spades 2 NA
6 hearts 2 1
7 clubs 2 NA
8 diamonds 2 NA
9 spades 3 NA
10 hearts 3 1
11 clubs 3 NA
12 diamonds 3 NA
13 spades 4 NA
14 hearts 4 1
15 clubs 4 NA
16 diamonds 4 NA
17 spades 5 NA
18 hearts 5 1
19 clubs 5 NA
20 diamonds 5 NA
7.5.3 Missing values NA
In R, missing values are represented by NA, and you can work with NA directly. Most computations involving NA return NA.
na.rm: Many R functions come with the optional argument na.rm. If you set it to TRUE, the function will ignore NA when evaluating the function.
is.na(): This is a function testing whether an object is NA.
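A short sketch of these three points, with a made-up vector.

```r
# Computations involving NA usually give NA.
1 + NA                        # NA
NA == 1                       # NA

values <- c(NA, 1:50)
mean(values)                  # NA

# na.rm = TRUE ignores the missing value.
mean(values, na.rm = TRUE)    # 25.5

# is.na() tests each element; == cannot be used for this.
is.na(c(1, NA, 3))            # FALSE TRUE FALSE
sum(is.na(values))            # 1 missing value
```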
7.6 Exercises
Exercise 7.1 Which of these are character strings and which are numbers? 1, "1", "one".
Exercise 7.2 Create an atomic vector that stores just the face names of the cards: the ace of spades, king of spades, queen of spades, jack of spades, and ten of spades. Which type of vector will you use to save the names?
Hint: The face name of the ace of spades would be ace, and spades is the suit.
Exercise 7.3 Create the following matrix, which stores the name and suit of every card in a royal flush.
[,1] [,2]
[1,] "ace" "spades"
[2,] "king" "spades"
[3,] "queen" "spades"
[4,] "jack" "spades"
[5,] "ten" "spades"
Exercise 7.4 Many card games assign a numerical value to each card. For example, in blackjack, each face card is worth 10 points, each number card is worth between 2 and 10 points, and each ace is worth 1 or 11 points, depending on the final score.
Make a virtual playing card by combining “ace” “heart” and 1 into a vector. What type of atomic vector will result? Check if you are right, and explain your reason.
Exercise 7.5 Use a list to store a single playing card, like the ace of hearts, which has a point value of one. The list should save the face of the card, the suit, and the point value in separate elements.
Exercise 7.6 Consider the following data.frame.
Please write some code to count how many rows have Var1 equal to hearts.
Exercise 7.7 Convert the following sentences into tests written with R code.
- w <- c(-1, 0, 1). Is w positive?
- x <- c(5, 15). Is x greater than 10 and less than 20?
- y <- 'February'. Is object y the word February?
- z <- c("Monday", "Tuesday", "Friday"). Is every value in z a day of the week?
7.7 Projects
Exercise 7.8 Start an R Markdown / Quarto file. In the first section write an R code block to print Hello world!.