Chapter 2 Simple Beginnings

We begin by introducing the basic elements of R. You will use these less often than you might expect, but they are the building blocks for the higher-level tools introduced in Chapter 3, and offer the comfort of familiarity. Where we feel comparisons would aid understanding, we provide short examples in Python.

2.1 Learning Objectives

  • Name and describe R’s atomic data types and create objects of those types.
  • Explain what ‘scalar’ values actually are in R.
  • Identify correct and incorrect variable names in R.
  • Create vectors in R and index them to select single values, ranges of values, and selected values.
  • Explain the difference between NA and NULL and correctly use tests for each.
  • Explain the difference between a list and a vector.
  • Explain the difference between indexing with [ and with [[.
  • Use [ and [[ correctly to extract elements and sub-structures from data structures in R.
  • Create a named list in R.
  • Access elements by name using both [ and $ notation.
  • Correctly identify cases in which back-quoting is necessary when accessing elements via $.
  • Create and index matrices in R.
  • Create for loops and if/else statements in R.
  • Explain why vectors cannot be used directly in conditional expressions and correctly use all and any to combine their values.
  • Define functions taking a fixed number of named arguments and/or a variable number of arguments.
  • Explain what vectorization is and create vectorized equivalents of unnested loops containing simple conditional tests.

2.2 How do I say hello?

We begin with a traditional greeting. In Python, we write:

Hello, world!

We can run the equivalent R in the RStudio Console (Figure 2.1):

[1] "Hello, world!"

Figure 2.1: RStudio Console

Python prints what we asked for, but what does the [1] in R’s output signify? Is it perhaps something akin to a line number? Let’s take a closer look by evaluating a couple of expressions without calling print:

[1] "This is in single quotes."
[1] "This is in double quotes."

[1] doesn’t appear to be a line number; let’s ignore it for now and do a little more exploring.

Note that R uses double quotes to display strings even when we give it a single-quoted string (which is no worse than Python using single quotes when we’ve given it doubles).
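A minimal sketch of R commands that would produce output like that shown above; the variable same is a hypothetical name introduced only to make the point about quoting checkable:

```r
# Calling print explicitly displays the value, prefixed with its index.
print("Hello, world!")

# At the console, a bare expression is displayed in the same way.
'This is in single quotes.'
"This is in double quotes."

# Both quoting styles create the same character value.
same <- identical('quoted', "quoted")
```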

2.3 How do I add numbers?

In Python, we add numbers using +.


We can check the type of the result using type, which tells us that the result 6 is an integer:

<class 'int'>

What does R do?

[1] 6
[1] "double"

R’s type inspection function is called typeof rather than type, and it returns the type’s name as a string. That’s all fine, but it seems odd for integer addition to produce a double-precision floating-point result. Let’s try an experiment:

[1] "double"

Ah: by default, R represents numbers as floating-point values, even if they look like integers when written. We can force a literal value to be an integer by appending an upper-case L (which stands for “long integer”):

[1] "integer"

Arithmetic on integers does produce integers:

[1] "integer"

and if we want to convert a floating-point number to an integer we can do so:

[1] "integer"

But wait: what is that dot in as.integer’s name? Is there an object called as with a method called integer? The answer is “no”: . is (usually) just another character in R; like the underscore _, it is used to make names more readable.
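The experiments described in this section can be sketched as follows. The variable names are assumptions made for illustration; only typeof, the L suffix, and as.integer come from the text:

```r
# Plain numeric literals are stored as double-precision values.
sum_type <- typeof(1 + 2 + 3)          # "double"
literal_type <- typeof(6)              # "double"

# The L suffix asks for an integer literal.
long_type <- typeof(6L)                # "integer"

# Arithmetic on integers stays in integers.
int_sum_type <- typeof(1L + 2L + 3L)   # "integer"

# Explicit conversion from double to integer.
cast_type <- typeof(as.integer(6))     # "integer"
```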

2.4 How do I store many numbers together?

The Elder Gods do not bother to learn most of our names because there are so many of us and we are so ephemeral. Similarly, we only give a handful of values in our programs their own names; we lump the rest together into lists, matrices, and more esoteric structures so that we too can create, manipulate, and dispose of multitudes with a single imperious command.

The most common such structure in Python is the list. We create lists using square brackets and assign a list to a variable using =. If the variable does not exist, it is created:

[3, 5, 7, 11]

Since assignment is a statement rather than an expression, it has no result, so Python does not display anything when this command is run.

The equivalent operation in R uses a function called c, which stands for “combine” and which creates a vector:

[1]  3  5  7 11

Assignment is done using a left-pointing arrow <- (though other forms exist, which we will discuss later). Unlike Python, R treats assignment as an expression, but it returns its value invisibly, so we enter the name of the newly-created variable to get R to display its value.

Now that we can create vectors in R, we can explain the errant [1] in our previous examples. To start, let’s have a look at the lengths of various things in Python:

[3, 5, 7, 11] 4
TypeError: object of type 'int' has no len()

Fair enough: the length of a list is the number of elements it contains, and since a scalar like the integer 4 doesn’t contain elements, it has no length. What of R’s vectors?

[1] 4

Good—and numbers?

[1] 1

That’s surprising. Let’s have a closer look:

[1] "double"

That’s also unexpected: the type of the vector is the type of the elements it contains. This all becomes clear once we realize that there are no scalars in R. 4 is not a single lonely integer, but rather a vector of length one containing the value 4. When we display its value, the [1] that R prints is the index of its first value. We can prove this by creating and displaying a longer vector:

 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5
[26]  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10

In order to help us find our way in our data, R automatically breaks long lines and displays the starting index of each line. These indices also show us that R counts from 1 as humans do, rather than from zero. (There are a great many myths about why programming languages do the latter. The truth is stranger than any fiction could be.)
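A short sketch of the R side of this section. The names primes and long_vector are assumptions; the values match the output shown above:

```r
# c() combines its arguments into a vector.
primes <- c(3, 5, 7, 11)
primes

# A vector's length is its number of elements...
primes_length <- length(primes)   # 4

# ...and a lone number is a vector of length 1.
scalar_length <- length(4)        # 1

# The type of a vector is the type of its elements.
primes_type <- typeof(primes)     # "double"

# A long vector wraps when displayed; each line starts with the index of its first value.
long_vector <- c(1:10, 1:10, 1:10, 1:10)
long_vector
```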

2.5 How do I index a vector?

Python’s rules for indexing are simple once you understand them (a statement which is also true of quantum mechanics and necromancy). To avoid confusing indices with values, let’s create a list of color names and index that:

IndexError: list index out of range

Indexing the equivalent vector in R with the indices 1 to 3 produces unsurprising results:

[1] "eburnean"
[1] "wenge"

What happens if we go off the end?

[1] NA

R handles gaps in data using the special value NA (short for “not available”), and returns this value when we ask for a nonexistent element of a vector. But it does more than this—much more. In Python, a negative index counts backward from the end of a list. In R, we use a negative index to indicate a value that we don’t want:

[1] "glaucous" "wenge"   

But wait. If every value in R is a vector, then when we use 1 or -1 as an index, we’re actually using a vector to index another one. What happens if the index itself contains more than one value?

Error in colors[1, 2]: incorrect number of dimensions

That didn’t work because R interprets [i, j] as being row and column indices, and our vector has only one dimension. What if we create a vector with c(...) and use that as a subscript?

[1] "wenge"    "eburnean" "glaucous"

That works, and allows us to repeat elements:

[1] "eburnean" "eburnean" "eburnean"

Note that this is pull indexing, i.e., the value at location i in the index vector specifies which element of the source vector is being pulled into that location in the result vector (Figure 2.2).

Figure 2.2: Pull Indexing

We can also select out several elements:

[1] "wenge"

But we cannot simultaneously select elements in (with positive indices) and out (with negative ones):

Error in colors[c(1, -1)]: only 0's may be mixed with negative subscripts

That error message is suggestive: what happens if we use 0 as an index?

character(0)

In order to understand this rather cryptic response, we can try calling the function character ourselves with a positive argument:

[1] "" "" ""

Ah: character(N) constructs a vector of empty strings of the specified length. The expression character(0) presumably therefore means “an empty vector of type character”. From this, we conclude that the index 0 doesn’t correspond to any elements, so R gives us back something of the right type but with no content. As a check, let’s try indexing with 0 and 1 together:

[1] "eburnean"

So when 0 is mixed with either positive or negative indices, it is ignored, which will undoubtedly lead to some puzzling bugs. What if in-bounds and out-of-bounds indices are mixed?

[1] "eburnean" NA        

That is consistent with the behavior of single indices.
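The indexing rules explored in this section can be collected into one sketch. The vector colors comes from the text; the names on the left of each assignment are assumptions chosen for readability:

```r
colors <- c("eburnean", "glaucous", "wenge")

first <- colors[1]                  # "eburnean"
past_end <- colors[4]               # NA: there is no fourth element
without_first <- colors[-1]         # "glaucous" "wenge"

# A vector of indices pulls elements in the order given, with repeats allowed.
reordered <- colors[c(3, 1, 2)]     # "wenge" "eburnean" "glaucous"
repeated <- colors[c(1, 1, 1)]      # "eburnean" "eburnean" "eburnean"

# Several negative indices drop several elements.
without_two <- colors[c(-1, -2)]    # "wenge"

# Zero selects nothing on its own and is ignored alongside other indices.
nothing <- colors[0]                # character(0)
zero_ignored <- colors[c(0, 1)]     # "eburnean"

# In-bounds and out-of-bounds indices can be mixed.
mixed_bounds <- colors[c(1, 10)]    # "eburnean" NA
```

Mixing positive and negative indices, as in colors[c(1, -1)], is left out of the sketch because it stops with the error quoted above.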

2.6 How do I create new vectors from old?

Modern Python encourages programmers to use list comprehensions instead of loops, i.e., to write:

[6, 10, 14, 18]

instead of:

[6, 10, 14, 18]

If original is a NumPy array, we can shorten this to 2 * original. R provides this capability in the language itself:

[1]  6 10 14 18

Modern R strongly encourages us to vectorize computations in this way, i.e., to do operations on whole vectors at once rather than looping over their contents. To aid this, all arithmetic operations work element by element on vectors:

[1] 10.10000 20.05000 30.03333

If two vectors of unequal length are used together, the elements of the shorter are recycled. This behaves sensibly if one of the vectors is a scalar—it is just re-used as many times as necessary:

[1] 105 205 305

If both vectors have several elements, the shorter is repeated as often as necessary. This works, but is so likely to lead to hard-to-find bugs that R produces a warning message:

Warning in hundreds + thousands: longer object length is not a multiple of
shorter object length
[1] 1100 2200 1300
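A sketch of element-wise arithmetic and recycling. The names hundreds and thousands appear in the warning message above; original, tens, and the result names are assumptions, with values chosen to be consistent with the output shown:

```r
original <- c(3, 5, 7, 9)
doubled <- 2 * original                         # 6 10 14 18

tens <- c(10, 20, 30)
hundreds <- c(100, 200, 300)

# Every arithmetic operator works element by element.
combined <- tens + hundreds / (tens * hundreds)

# A length-1 vector is recycled to match the longer one.
shifted <- hundreds + 5                         # 105 205 305

# A shorter vector is recycled too; R warns because 3 is not a multiple of 2.
thousands <- c(1000, 2000)
recycled <- hundreds + thousands                # 1100 2200 1300, with a warning
```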

R also provides vectorized alternatives to if-else statements. If we use a vector containing the logical (or Boolean) values TRUE and FALSE as an index, it selects elements corresponding to TRUE values:

[1] "eburnean" "glaucous" "wenge"   
[1] "eburnean" "wenge"   

This is called logical indexing, though to the best of my knowledge illogical indexing is not provided as an alternative. The function ifelse uses this to do what its name suggests: select a value from one vector if a condition is TRUE, and a corresponding value from another vector if the condition is FALSE:

[1] "eburnean" "glaucous" "m"       

All three vectors are of the same length, and the first (the condition) is usually constructed using the values of one or both of the other vectors:

[1] "eburnean" "glaucous" "WENGE"   

Figure 2.3: Vector Conditionals
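One way to write logical indexing and ifelse, as a sketch. The condition name before_letter_m and the replacement vector c("e", "g", "m") are assumptions chosen to match the output shown above:

```r
colors <- c("eburnean", "glaucous", "wenge")

# TRUE keeps the corresponding element, FALSE drops it.
selected <- colors[c(TRUE, FALSE, TRUE)]         # "eburnean" "wenge"

# Comparisons are vectorized, so they produce a logical vector.
before_letter_m <- colors < "m"                  # TRUE TRUE FALSE

# ifelse takes from the second argument where the condition is TRUE
# and from the third argument where it is FALSE.
mixed <- ifelse(before_letter_m, colors, c("e", "g", "m"))
shouted <- ifelse(before_letter_m, colors, toupper(colors))
```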

2.7 How else does R represent the absence of data?

The special value NA means “there’s supposed to be a value here but we don’t know what it is.” A different value, NULL, represents the absence of a vector. It is not the same as a vector of zero length, though testing that statement produces a rather odd result:


The safe way to test if something is NULL is to use the function is.null:

[1] TRUE

Circling back, the safe way to test whether a value is NA is not to use direct comparison:

[1] NA

The result is NA because if we don’t know what the value is, we can’t know if it’s equal to threshold or not. Instead, we should always use the function is.na:

[1] TRUE
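The tests for NULL and NA might be sketched like this. The value given to threshold is an arbitrary assumption, and the result names are hypothetical:

```r
# Comparing NULL with an empty vector yields an empty logical vector,
# not TRUE or FALSE.
odd_result <- NULL == integer(0)

# is.null is the safe test for NULL.
null_check <- is.null(NULL)          # TRUE

# Comparing NA with anything gives NA...
threshold <- 1.75
direct_check <- NA == threshold      # NA

# ...so is.na is the safe test for missing values.
na_check <- is.na(NA)                # TRUE
```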

2.8 How can I store a mix of different types of objects?

One of the things that newcomers to R often trip over is the various ways in which structures can be indexed. All of the following are legal:

but they can behave differently depending on what kind of thing thing is. To explain, we must first take a look at lists.

A list in R is a vector that can contain values of many different types. (The technical term for this is heterogeneous, in contrast with a homogeneous data structure that can only contain one type of value.) We’ll use this list in our examples:

[[1]]
[1] "first"

[[2]]
[1]   2  20 200

[[3]]
[1] 3.3

The output tells us that the first element of thing is a vector of one element, that the second is a vector of three elements, and the third is again a vector of one element; the major indices are shown in [[…]], while the indices of the contained elements are shown in […]. (Again, remember that "first" and 3.3 are actually vectors of length 1.)

In keeping with R’s conventions, we will henceforth use [[ and [ to refer to the two kinds of indexing rather than [[…]] and […].

2.9 What is the difference between [ and [[?

The output above strongly suggests that we can get the elements of a list using [[ (double square brackets):

[1] "first"
[1]   2  20 200
[1] 3.3

Let’s have a look at the types of those three values:

[1] "character"
[1] "double"
[1] "double"

That seems sensible. Now, what do we get if we index with [ (single square brackets)?

[[1]]
[1] "first"

That looks like a list, not a vector—let’s check:

[1] "list"

This shows the difference between [[ and [: the former peels away a layer of data structure, returning only the sub-structure, while the latter gives us back a structure of the same type as the thing being indexed. Since a “scalar” is just a vector of length 1, there is no difference between [[ and [ when they are used to select a single element of an unnamed atomic vector:

[1] "second"
[1] "character"
[1] "second"
[1] "character"

Flattening and Recursive Indexing

If a list is just a vector of objects, why do we need the function list? Why can’t we create a list with c("first", c(2, 20, 200), 30)? The answer is that R flattens the arguments to c, so that c(c(1, 2), c(3, 4)) produces c(1, 2, 3, 4). It also does automatic type conversion: c("first", c(2, 20, 200), 30) produces a vector of character strings c("first", "2", "20", "200", "30"). This is helpful once you get used to it (which once again is true of both quantum mechanics and necromancy).

Another “helpful, ish” behavior is that using [[ with a list subsets recursively: if thing <- list(a = list(b = list(c = list(d = 1)))), then thing[[c("a", "b", "c", "d")]] selects the 1.
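A sketch that pulls together the list behavior described in the last two sections. The list thing and the nested example come from the text; nested, flattened, and the other result names are assumptions:

```r
thing <- list("first", c(2, 20, 200), 3.3)

# [[ returns the element itself.
element_type <- typeof(thing[[1]])     # "character"
second_type <- typeof(thing[[2]])      # "double"

# [ returns a list containing the selected element.
sublist_type <- typeof(thing[1])       # "list"

# c() flattens its arguments and converts them to a common type.
flattened <- c("first", c(2, 20, 200), 30)

# [[ with a character vector walks down through nested lists.
nested <- list(a = list(b = list(c = list(d = 1))))
deep <- nested[[c("a", "b", "c", "d")]]   # 1
```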

2.10 How can I access elements by name?

R allows us to name the elements in vectors and lists: if we assign c(one = 1, two = 2, three = 3) to names, then names["two"] is 2. We can use this to create a lookup table:

           m            f           nb            f            f            m 
      "Male"     "Female" "Non-binary"     "Female"     "Female"       "Male" 

If the structure in question is a list rather than an atomic vector of numbers, characters, or logicals, we can use the syntax lookup$m instead of lookup[["m"]]:

[1] "Male"

We will explore this in more detail when we look at the tidyverse in Chapter 3, since that is where access-by-name is used most often. For now, simply note that if the name of an element isn’t a legal variable name, we have to put it in back-quotes (backticks) to use it with $:

[1] "F"

If you have control, or at least the illusion thereof, choose names such as first_field that don’t require back-quoting.
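A sketch of access by name. The lookup table's contents follow the output shown above; values, lookup_list, another_list, and the field names are assumptions made for illustration:

```r
# A named character vector used as a lookup table.
lookup <- c(m = "Male", f = "Female", nb = "Non-binary")
values <- c("m", "f", "nb", "f", "f", "m")
translated <- lookup[values]

# With a list, $ extracts a single element by name.
lookup_list <- list(m = "Male", f = "Female", nb = "Non-binary")
male <- lookup_list$m                          # "Male"

# Names that are not legal variable names need back-quotes with $.
another_list <- list("first field" = "F", "second field" = "S")
first_field <- another_list$`first field`      # "F"
```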

2.11 How can I create and index a matrix?

Matrices are frequently used in statistics, so R provides built-in support for them. After a <- matrix(1:9, nrow = 3), a is a 3x3 matrix containing the values 1 through 9:

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Behind the scenes, a matrix is a vector with an attribute called dim that stores its dimensions:

[1] 3 3

a[3, 3] is a vector of length 1 containing the value 9 (again, “scalars” in R are actually vectors), while a[1,] is the vector c(1, 4, 7) (because we are selecting the first row of the matrix) and a[,1] is the vector c(1, 2, 3) (because we are selecting the first column of the matrix). Elements can still be accessed using a single index, which returns the value from that location in the underlying vector:

[1] 8
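The matrix operations described above, as a brief sketch. The matrix a comes from the text; the result names are assumptions:

```r
# matrix() fills column by column by default.
a <- matrix(1:9, nrow = 3)

dimensions <- dim(a)      # 3 3
corner <- a[3, 3]         # 9
first_row <- a[1, ]       # 1 4 7
first_column <- a[, 1]    # 1 2 3

# A single index reads from the underlying vector.
eighth <- a[8]            # 8
```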

2.12 How do I choose and repeat things?

We cherish the illusion of free will so much that we embed a pretense of it in our machines in the form of conditional statements using if and else. (Ironically, we then instruct those same machines to make the same decisions over and over. It’s no wonder they sometimes appear mad…) For example, here is a snippet of Python that uses for and if to display the signs of the numbers in a list:

The pos_neg of -15 is -1
The pos_neg of 0 is 0
The pos_neg of 15 is 1
The final value of v is 15

Its direct translation into R is:

The sign of -15 is -1
The sign of 0 is 0
The sign of 15 is 1
The final value of v is 15

There are a few things to note here:

  1. The parentheses in the loop header are required: we cannot simply write for v in values.
  2. The curly braces around the body of the loop and around the bodies of the conditional branches are optional, since each contains only a single statement. However, they should always be there to help readability.
  3. As in Python, the loop variable v persists after the loop is over.
  4. glue::glue (the function glue from the package of the same name) interpolates variables into strings in sensible ways. We will load this package and use plain old glue in the explanations that follow. (Note that R uses :: to get functions out of packages, where Python uses a dot.)
  5. We have called our temporary variable pos_neg rather than sign so that we don’t accidentally overwrite the rather useful built-in R function with the latter name. Name collisions of this sort are just as easy in R as they are in Python.
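A loop of the kind discussed in these notes might look like the sketch below. It uses base R's cat instead of glue::glue so that it runs without extra packages; that substitution is ours, and the variable names follow the notes above:

```r
values <- c(-15, 0, 15)

for (v in values) {
  # Choose -1, 0, or 1 according to the sign of v.
  if (v < 0) {
    pos_neg <- -1
  } else if (v == 0) {
    pos_neg <- 0
  } else {
    pos_neg <- 1
  }
  cat("The sign of", v, "is", pos_neg, "\n")
}

# The loop variable is still defined once the loop has finished.
cat("The final value of v is", v, "\n")
```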

2.13 How can I vectorize loops and conditionals?

The example above is not how we should write R: everything in that snippet can and should be vectorized. The simplest way to do this is to use the aforementioned built-in function:

[1] -1  0  1
The sign of -15 is -1
The sign of 0 is 0
The sign of 15 is 1

But what if the function we want doesn’t exist (or if we don’t know what it’s called)? In that case, the easiest approach is often to create a new vector whose values are derived from those of the vector we had and trust R to match up corresponding elements:

The sign of -15 is -1
The sign of 0 is 0
The sign of 15 is 1

This solution makes use of case_when, which is a vectorized analog of if/else if/else. Each branch uses the ~ operator to combine a Boolean test on the left with a result on the right. We will see other uses for ~ in subsequent chapters.
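As a base-R sketch of the same idea, nested calls to ifelse can stand in for case_when. This is a substitute for illustration rather than the approach described above, and the names builtin, by_hand, and messages are assumptions:

```r
values <- c(-15, 0, 15)

# The built-in function handles the whole vector at once.
builtin <- sign(values)                       # -1 0 1

# Derive a new vector from the old one, one test per branch.
by_hand <- ifelse(values < 0, -1,
                  ifelse(values == 0, 0, 1))

# paste is vectorized too, so it builds one message per element.
messages <- paste("The sign of", values, "is", by_hand)
messages
```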

2.14 How can I express a range of values?

for in R loops over the values in a vector, just as it does in Python. If we want to loop over the indices instead, we can use the function seq_along:

The length of color 1 is 1
The length of color 2 is 1
The length of color 3 is 1
The length of color 4 is 1

This output makes no sense until we remember that every value is a vector, and that length returns the length of a vector, so that length(colors[i]) is telling us that colors[i] contains one element. If we want the number of characters in the strings, we can use R’s built-in nchar or the more modern function stringr::str_length:

The length of color 1 is 8
The length of color 2 is 8
The length of color 3 is 8
The length of color 4 is 5

As you may already have guessed, seq_along returns a vector containing a sequence of integers:

[1] 1 2 3 4

Since sequences of this kind are used frequently, R lets us write them using range expressions:

[1]  5  6  7  8  9 10

Their most common use is as indices to vectors:

[1] "glaucous" "squamous"

We can similarly subtract a range of colors by index:

[1] "squamous" "wenge"   

However, R does not allow tripartite expressions of the form start:end:step. For that, we must use seq:

[1]  1  4  7 10

This example also shows that ranges in R are inclusive at both ends, i.e., they run up to and including the upper bound. As is traditional among programming language advocates, people claim that this is more natural and then cite some supportive anecdote as if it were proof.
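The range expressions in this section can be sketched as follows. The four-element colors vector is inferred from the output above, and the result names are assumptions:

```r
colors <- c("eburnean", "glaucous", "squamous", "wenge")

# Loop over indices rather than values; nchar counts characters.
for (i in seq_along(colors)) {
  cat("The length of color", i, "is", nchar(colors[i]), "\n")
}

indices <- seq_along(colors)     # 1 2 3 4
five_to_ten <- 5:10              # 5 6 7 8 9 10

# Ranges are most often used as indices.
middle <- colors[2:3]            # "glaucous" "squamous"
without_front <- colors[-1:-2]   # "squamous" "wenge"

# seq handles a step size; the upper bound is included when the step lands on it.
stepped <- seq(1, 10, 3)         # 1 4 7 10
```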

Repeating Things

The function rep repeats things, so rep("a", 3) is c("a", "a", "a"). If the second argument is a vector of the same length as the first, it specifies how many times each item in the first vector is to be repeated: rep(c("a", "b"), c(2, 3)) is c("a", "a", "b", "b", "b").

2.15 How can I use a vector in a conditional statement?

We cannot use a vector directly as a condition in an if statement. In the version of R used here, doing so produces a warning; since R 4.2.0 it is an error:

Warning in if (numbers) {: the condition has length > 1 and only the first
element will be used

Instead, we must collapse the vector into a single logical value:

[1] "This, on the other hand, should work."

The function all returns TRUE if every element in its argument is TRUE; it corresponds to a logical “and” of all its inputs. We can use a corresponding function any to check if at least one value is TRUE, which corresponds to a logical “or” across the whole input.
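A minimal sketch of collapsing a vector into a single condition. The vector numbers appears in the warning above; its contents and the name any_negative are assumptions:

```r
numbers <- c(0, 1, 2)

# all() is TRUE only if every element of its argument is TRUE.
if (all(numbers >= 0)) {
  print("This, on the other hand, should work.")
}

# any() is TRUE if at least one element is TRUE.
any_negative <- any(numbers < 0)   # FALSE
```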

2.16 How do I create and call functions?

As we have already seen, we call functions in R much as we do in Python:

[1] 6

We define a new function using the function keyword. This creates the function; to name it, we must assign the newly-created function to a variable:

[1] "right" "left" 

As this example shows, the result of a function is the value of the last expression evaluated within it. A function can return a value earlier using the return function; we can use return for the final value as well, but most R programmers do not.

[1] "right" "left" 

Returning NULL when our function’s inputs are invalid as we have done above is foolhardy, as doing so means that swap can fail without telling us that it has done so. Consider:


We will look at what we should do instead in Chapter 8.
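A sketch of a swap function along the lines described in this section. The body is an assumption reconstructed from the prose and the output above, not necessarily the original code:

```r
# Create a function and give it a name by assigning it to a variable.
swap <- function(pair) {
  if (length(pair) != 2) {
    return(NULL)   # early return; silently hiding the problem is bad practice
  }
  c(pair[2], pair[1])   # the last expression evaluated is the result
}

swapped <- swap(c("left", "right"))   # "right" "left"

# Invalid input gives NULL back without any error or warning.
silent_failure <- swap("one")
```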

2.17 How can I write a function that takes variable arguments?

If the number of arguments given to a function is not the number expected, R complains:

Error in swap("one", "two", "three"): unused arguments ("two", "three")

(Note that we are passing three separate values here, not a single vector containing three values.) If we want a function to handle a varying number of arguments, we represent the “extra” arguments with an ellipsis ... (three dots), which serves the same purpose as Python’s *args:


The function paste creates a string by combining its arguments with the specified separator.

R uses a special data structure to represent the extra arguments in .... If we want to work with those arguments one by one, we must explicitly convert ... to a list:

[1] 16
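Two hedged sketches of functions that take a variable number of arguments. The names join_with_title and add, and their bodies, are assumptions; the call to add is chosen so that it yields the 16 shown above:

```r
# Pass the extra arguments straight through to another function.
join_with_title <- function(title, ...) {
  paste(title, ..., sep = "\n")
}

# Convert ... to a list to work with the arguments one by one.
add <- function(...) {
  result <- 0
  for (value in list(...)) {
    result <- result + value
  }
  result
}

total <- add(1, 3, 5, 7)                                # 16
joined <- join_with_title("title", "first", "second")
```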

2.18 How can I provide default values for arguments?

Like Python and most other modern programming languages, R lets us define default values for arguments and then pass arguments by name:

first='with just first' second='second' third='third'
first='with first and second by position' second='positional' third='third'
first='with first and third by name' second='second' third='by name'

One caution: when you use a name in a function call, R ignores things that aren’t functions when looking up the function. This means that the call to orange() in the code below produces 110 rather than an error because purple(purple) is interpreted as “pass the value 10 into the globally-defined function purple” rather than “try to call a function 10(10)”:

[1] 110
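A sketch of default arguments and of the lookup rule just described. The function name example and its use of paste0 are assumptions; purple and orange come from the text, though their bodies here are reconstructed to produce 110:

```r
# second and third have defaults; first does not.
example <- function(first, second = "second", third = "third") {
  paste0("first='", first, "' second='", second, "' third='", third, "'")
}

example("with just first")
example("with first and second by position", "positional")
by_name <- example("with first and third by name", third = "by name")

# Function lookup skips non-function values with the same name.
purple <- function(x) x + 100

orange <- function() {
  purple <- 10      # a local variable that is not a function
  purple(purple)    # calls the global function purple with the local value 10
}

result <- orange()  # 110
```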

2.19 How can I hide the value that R returns?

If the value returned by a function isn’t assigned to something, R displays it. Since this usually isn’t what we want in library functions, we can use the function invisible to mark a value as “not to be printed” (though the value can still be assigned). For example, we can convert:

[1] 20

to this:

The calculation is still being done, but the output is suppressed.
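The conversion might look like the following sketch. The function names something and something_quiet are assumptions, chosen so that the first call displays the 20 shown above:

```r
# The result is displayed when it is not assigned.
something <- function(value) {
  10 * value
}
something(2)

# Wrapping the result in invisible() suppresses the display.
something_quiet <- function(value) {
  invisible(10 * value)
}
something_quiet(2)

# The value is still returned and can be assigned as usual.
stored <- something_quiet(2)   # 20
```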

2.20 How can I assign to a global variable from inside a function?

The assignment operator <<- means “assign to a variable outside the current scope”. As the example below shows, this means that what looks like creation of a new local variable can actually be modification of a global one:

[1] "new value"

This should only and always be done with care: modern R strongly encourages a functional style of programming in which functions do not modify their input data, and nobody thinks that modifying global variables is a good idea any more.
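A small sketch of the effect. The names var and demonstrate and the starting value are assumptions; the final value matches the output shown above:

```r
var <- "original value"

demonstrate <- function() {
  # This looks like creating a local variable, but <<- modifies the
  # existing variable in the enclosing scope.
  var <<- "new value"
}

demonstrate()
var
```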

2.21 Key Points

  • Use print(expression) to print the value of a single expression.
  • Variable names may include letters, digits, ., and _, but . should be avoided, as it sometimes has special meaning.
  • R’s atomic data types include logical, integer, double (also called numeric), and character.
  • R stores collections in homogeneous vectors of atomic types, or in heterogeneous lists.
  • ‘Scalars’ in R are actually vectors of length 1.
  • Vectors are created using the function c(...), and lists using the function list(...).
  • Vector indices from 1 to length(vector) select single elements.
  • Negative indices to vectors deselect elements from the result.
  • The index 0 on its own selects no elements, creating a vector or list of length 0.
  • The expression low:high creates the vector of integers from low to high inclusive.
  • Subscripting a vector with a vector of numbers selects the elements at those locations (possibly with repeats).
  • Subscripting a vector with a vector of logicals selects elements where the indexing vector is TRUE.
  • Values from short vectors (such as ‘scalars’) are repeated to match the lengths of longer vectors.
  • The special value NA represents missing values, and (almost all) operations involving NA produce NA.
  • The special value NULL represents a nonexistent vector, which is not the same as a vector of length 0.
  • A list is a heterogeneous vector capable of storing values of any type (including other lists).
  • Indexing with [ returns a structure of the same type as the structure being indexed (e.g., returns a list when applied to a list).
  • Indexing with [[ strips away one level of structure (i.e., returns the indicated element without any wrapping).
  • Use list('name' = value, ...) to name the elements of a list.
  • Use either L['name'] or L$name to access elements by name.
  • Use back-quotes around the name with $ notation if the name is not a legal R variable name.
  • Use matrix(values, nrow = N) to create a matrix with N rows containing the given values.
  • Use m[i, j] to get the value at the i’th row and j’th column of a matrix.
  • Use m[i,] to get a vector containing the values in the i’th row of a matrix.
  • Use m[,j] to get a vector containing the values in the j’th column of a matrix.
  • Use for (loop_variable in collection){ ...body... } to create a loop.
  • Use if (expression) { ...body... } else if (expression) { ...body... } else { ...body... } to create conditionals.
  • Conditions in if statements must have length 1; use any(...) and all(...) to collapse logical vectors to single values.
  • Use function(...arguments...) { ...body... } to create a function.
  • Use variable <- function(...arguments...) { ...body... } to create a function and give it a name.
  • The body of a function can be a single expression or a block in curly braces.
  • The last expression evaluated in a function is returned as its result.
  • Use return(expression) to return a result early from a function.