G Key Points
G.1 Simple Beginnings
- Use `print(expression)` to print the value of a single expression.
- Variable names may include letters, digits, `.`, and `_`, but `.` should be avoided, as it sometimes has special meaning.
- R’s atomic data types include logical, integer, double (also called numeric), and character.
- R stores collections in homogeneous vectors of atomic types, or in heterogeneous lists.
- ‘Scalars’ in R are actually vectors of length 1.
- Vectors are created using the function `c(...)`, and lists using `list(...)`.
- Vector indices from 1 to `length(vector)` select single elements.
- Negative indices to vectors deselect elements from the result.
- The index 0 on its own selects no elements, creating a vector or list of length 0.
- The expression `low:high` creates the vector of integers from `low` to `high` inclusive.
- Subscripting a vector with a vector of numbers selects the elements at those locations (possibly with repeats).
- Subscripting a vector with a vector of logicals selects elements where the indexing vector is `TRUE`.
- Values from short vectors (such as ‘scalars’) are repeated to match the lengths of longer vectors.
- The special value `NA` represents missing values, and (almost all) operations involving `NA` produce `NA`.
- The special value `NULL` represents a nonexistent vector, which is not the same as a vector of length 0.
- A list is a heterogeneous vector capable of storing values of any type (including other lists).
- Indexing with `[` returns a structure of the same type as the structure being indexed (e.g., returns a list when applied to a list).
- Indexing with `[[` strips away one level of structure (i.e., returns the indicated element without any wrapping).
- Use `list('name' = value, ...)` to name the elements of a list.
- Use either `L['name']` or `L$name` to access elements by name.
- Use back-quotes around the name with `$` notation if the name is not a legal R variable name.
- Use `matrix(values, nrow = N)` to create a matrix with `N` rows containing the given values.
- Use `m[i, j]` to get the value at the i’th row and j’th column of a matrix.
- Use `m[i,]` to get a vector containing the values in the i’th row of a matrix.
- Use `m[,j]` to get a vector containing the values in the j’th column of a matrix.
- Use `for (loop_variable in collection){ ...body... }` to create a loop.
- Use `if (expression) { ...body... } else if (expression) { ...body... } else { ...body... }` to create conditionals.
- Expression conditions must have length 1; use `any(...)` and `all(...)` to collapse logical vectors to single values.
- Use `function(...arguments...) { ...body... }` to create a function.
- Use `variable <- function(...arguments...) { ...body... }` to create a function and give it a name.
- The body of a function can be a single expression or a block in curly braces.
- The last expression evaluated in a function is returned as its result.
- Use `return(expression)` to return a result early from a function.
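
Several of the points above can be seen together in a short base R session. This is only an illustrative sketch: the names `values`, `person`, `m`, and `sign_label` are invented for the example and do not come from the text.

```r
# Vectors are homogeneous; a 'scalar' is just a vector of length 1.
values <- c(10, 20, 30, 40)
length(5)                    # 1

# Positive, negative, logical, and range indices.
values[2]                    # 20
values[-1]                   # 20 30 40
values[values > 15]          # 20 30 40
values[2:3]                  # 20 30

# Short vectors are recycled to match longer ones.
values + 1                   # 11 21 31 41

# NA propagates through (almost all) operations.
NA + 1                       # NA

# Lists are heterogeneous; `[` keeps the list wrapper, `[[` and `$` strip it.
person <- list(name = "Alan", age = 41)
class(person["name"])        # "list"
person[["name"]]             # "Alan"
person$age                   # 41

# Matrices are filled column by column by default.
m <- matrix(1:6, nrow = 2)
m[2, 3]                      # 6
m[1, ]                       # 1 3 5

# A function returns its last evaluated expression unless return() exits early.
sign_label <- function(x) {
  if (x < 0) {
    return("negative")
  }
  "non-negative"
}

# any() collapses a logical vector to a single value for use in `if`.
total <- 0
for (v in values) {
  if (any(v == c(20, 40))) {
    total <- total + v
  }
}
total                        # 60
```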
G.2 The Tidyverse
- `install.packages('name')` installs packages.
- `library(name)` (without quoting the name) loads a package.
- `library(tidyverse)` loads the entire collection of tidyverse libraries at once.
- `read_csv(filename)` reads CSV files that use the string ‘NA’ to represent missing values.
- `read_csv` infers each column’s data types based on the first thousand values it reads.
- A tibble is the tidyverse’s version of a data frame, which represents tabular data.
- `head(tibble)` and `tail(tibble)` inspect the first and last few rows of a tibble.
- `summary(tibble)` displays a summary of a tibble’s structure and values.
- `tibble$column` selects a column from a tibble, returning a vector as a result.
- `tibble['column']` selects a column from a tibble, returning a tibble as a result.
- `tibble[,c]` selects column `c` from a tibble, returning a tibble as a result.
- `tibble[r,]` selects row `r` from a tibble, returning a tibble as a result.
- Use ranges and logical vectors as indices to select multiple rows/columns or specific rows/columns from a tibble.
- `tibble[[c]]` selects column `c` from a tibble, returning a vector as a result.
- `min(...)`, `mean(...)`, `max(...)`, and `sd(...)` calculate the minimum, mean, maximum, and standard deviation of data.
- These aggregate functions include `NA`s in their calculations, and so will produce `NA` if the input data contains any.
- Use `func(data, na.rm = TRUE)` to remove `NA`s from data before calculations are done (but make sure this is statistically justified).
- `filter(tibble, condition)` selects rows from a tibble that pass a logical test on their values.
- `arrange(tibble, column)` or `arrange(tibble, desc(column))` arranges rows according to values in a column (the latter in descending order).
- `select(tibble, column, column, ...)` selects columns from a tibble.
- `select(tibble, -column)` removes a column from a tibble.
- `mutate(tibble, name = expression, name = expression, ...)` adds new columns to a tibble using values from existing columns.
- `group_by(tibble, column, column, ...)` groups rows that have the same values in the specified columns.
- `summarize(tibble, name = expression, name = expression)` aggregates tibble values (by groups if the rows have been grouped).
- `tibble %>% function(arguments)` performs the same operation as `function(tibble, arguments)`.
- Use `%>%` to create pipelines in which the left side of each `%>%` becomes the first argument of the next stage.
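
As a rough illustration of how these pieces fit together, the sketch below builds a small tibble by hand rather than reading a file. The table `measurements` and its columns `site` and `depth` are made up for the example.

```r
library(tibble)
library(dplyr)

# Hypothetical data: depth measurements with one missing value.
measurements <- tibble(
  site  = c("A", "A", "B", "B", "B"),
  depth = c(1.5, NA, 2.0, 3.0, 4.0)
)

# Aggregates return NA unless missing values are removed explicitly.
mean(measurements$depth)                 # NA
mean(measurements$depth, na.rm = TRUE)   # 2.625

# `$` and `[[` return vectors; `[` returns a tibble.
is.numeric(measurements[["depth"]])      # TRUE
is_tibble(measurements["depth"])         # TRUE

# Each stage receives the result of the previous one as its first argument.
by_site <- measurements %>%
  filter(!is.na(depth)) %>%
  mutate(depth_cm = depth * 100) %>%
  group_by(site) %>%
  summarize(mean_cm = mean(depth_cm), deepest = max(depth_cm)) %>%
  arrange(desc(mean_cm))

by_site$site      # "B" "A"
by_site$mean_cm   # 300 150
```

The pipeline drops the row with the missing depth up front instead of passing `na.rm = TRUE` to every aggregate. Whether that is appropriate depends on the analysis.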
G.3 Creating Packages
- Develop data-cleaning scripts one step at a time, checking intermediate results carefully.
- Use `read_csv` to read CSV-formatted tabular data into a tibble.
- Use the `skip` and `na` parameters of `read_csv` to skip rows and interpret certain values as `NA`.
- Use `str_replace` to replace portions of strings that match patterns with new strings.
- Use `is.numeric` to test if a value is a number and `as.numeric` to convert it to a number.
- Use `map` to apply a function to every element of a vector in turn.
- Use `map_dfc` and `map_dfr` to map functions across the columns and rows of a tibble.
- Pre-allocate storage in a list for each result from a loop and fill it in rather than repeatedly extending the list.
- An R package can contain code, data, and documentation.
- R code is distributed as compiled bytecode in packages, not as source.
- R packages are almost always distributed through CRAN, the Comprehensive R Archive Network.
- Most of a project’s metadata goes in a file called `DESCRIPTION`.
- Metadata related to imports and exports goes in a file called `NAMESPACE`.
- Add patterns to a file called `.Rbuildignore` to ignore files or directories when building a project.
- All source code for a package must go in the `R` sub-directory.
- `library` calls in a package’s source code will not be executed as the package is loaded after distribution.
- Data can be included in a package by putting it in the `data` sub-directory.
- Data must be in `.rda` format in order to be loaded as part of a package.
- Data in other formats can be put in the `inst/extdata` directory, and will be installed when the package is installed.
- Add comments starting with `#'` to an R file to document functions.
- Use roxygen2 to extract these comments to create manual pages in the `man` directory.
- Use `@export` directives in roxygen2 comment blocks to make functions visible outside a package.
- Add required libraries to the `Imports` section of the `DESCRIPTION` file to indicate that your package depends on them.
- Use `package::function` to access externally-defined functions inside a package.
- Alternatively, add `@import` directives to roxygen2 comment blocks to make external functions available inside the package.
- Import `.data` from `rlang` and use `.data$column` to refer to columns instead of using bare column names.
- Create a file called `R/package.R` and document `NULL` to document the package as a whole.
- Create a file called `R/dataset.R` and document the string `'dataset'` to document a dataset.
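
A file in a package's `R` sub-directory might look roughly like the sketch below, which combines the cleaning and documentation points above. The functions `percent_to_number` and `keep_large`, and the column name `value`, are hypothetical and exist only for illustration.

```r
#' Convert percentage strings to numbers
#'
#' A hypothetical helper used only for illustration.
#'
#' @param text Character vector such as "12%".
#' @return Numeric vector with the percent signs removed.
#' @export
percent_to_number <- function(text) {
  # package::function avoids relying on a library() call inside package code.
  as.numeric(stringr::str_replace(text, "%", ""))
}

#' Keep rows whose value exceeds a threshold
#'
#' @param data A tibble with a numeric column called `value` (assumed name).
#' @param threshold Numeric cut-off.
#' @importFrom rlang .data
#' @export
keep_large <- function(data, threshold) {
  # .data$value refers to the column without using a bare name.
  dplyr::filter(data, .data$value > threshold)
}

# Pre-allocate a list and fill it in, rather than growing it inside the loop.
raw <- c("12%", "7%", "30%")
results <- vector("list", length(raw))
for (i in seq_along(raw)) {
  results[[i]] <- percent_to_number(raw[[i]])
}
unlist(results)                                  # 12 7 30

# purrr::map applies the function to each element and returns a list.
mapped <- purrr::map(raw, percent_to_number)

keep_large(tibble::tibble(value = c(1, 5, 10)), 4)   # rows with 5 and 10
```

Running roxygen2 over such a file would turn the `#'` blocks into manual pages and `NAMESPACE` entries. The loop and the `map` call at the bottom are there only to exercise the functions and would not normally live in package source.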
G.4 Non-Standard Evaluation
- R uses lazy evaluation: expressions are evaluated when their values are needed, not before.
- Use `expr` to create an expression without evaluating it.
- Use `eval` to evaluate an expression in the context of some data.
- Use `enquo` to create a quosure containing an unevaluated expression and its environment.
- Use `quo_get_expr` to get the expression out of a quosure.
- Use `!!` to splice the expression in a quosure into a function call.
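
The following sketch walks through these tools in order, assuming the rlang and dplyr packages are installed. The helper functions `lazy`, `show_expression`, and `mean_of` are invented for the example.

```r
library(rlang)
library(dplyr)

# Lazy evaluation: the argument is never used, so stop() is never run.
lazy <- function(x) "x was never evaluated"
lazy(stop("boom"))                         # "x was never evaluated"

# expr() captures code without evaluating it; eval() runs it against some data.
e <- expr(a + b)
eval(e, list(a = 1, b = 2))                # 3

# enquo() captures the caller's expression and its environment as a quosure.
show_expression <- function(x) {
  q <- enquo(x)
  quo_get_expr(q)
}
show_expression(price * 2)                 # price * 2

# !! unquotes the quosure inside a tidyverse call.
mean_of <- function(data, column) {
  column <- enquo(column)
  summarize(data, result = mean(!!column))
}
mean_of(data.frame(price = c(1, 2, 3)), price)$result   # 2
```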
G.5 Intellectual Debt
- Don’t use `setwd`.
- The formula operator `~` delays evaluation of its operand or operands.
- `~` was created to allow users to pass formulas into functions, but is used more generally to delay evaluation.
- Some tidyverse functions define `.` to be the whole data, `.x` and `.y` to be the first and second arguments, and `..N` to be the N’th argument.
- These convenience parameters are primarily used when the data being passed to a pipelined function needs to go somewhere other than in the first parameter’s slot.
- ‘Copy-on-modify’ means that data is aliased until something attempts to modify it, at which point it is duplicated, so that data always appears to be unchanged.
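
A brief sketch of the formula shorthand, the `.` placeholder, and copy-on-modify, assuming purrr and magrittr are installed. It uses the built-in `mtcars` data, and the variable names are illustrative only.

```r
library(purrr)
library(magrittr)

# A formula is stored unevaluated, so the undefined name causes no error.
f <- ~ undefined_name + 1
class(f)                                   # "formula"

# purrr turns one-sided formulas into functions of .x and .y.
doubled <- map_dbl(c(1, 2, 3), ~ .x * 2)   # 2 4 6
sums <- map2_dbl(c(1, 2), c(10, 20), ~ .x + .y)   # 11 22

# `.` sends piped data to a slot other than the first argument.
model <- mtcars %>% lm(mpg ~ wt, data = .)
class(model)                               # "lm"

# Copy-on-modify: changing the alias leaves the original untouched.
original <- c(1, 2, 3)
alias <- original
alias[1] <- 100
original                                   # 1 2 3
```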
G.6 Testing and Error Handling
- Operations signal conditions in R when errors occur.
- The three built-in levels of conditions are messages, warnings, and errors.
- Programs can signal these themselves using the functions `message`, `warning`, and `stop`.
- Operations can be placed in a call to the function `try` to suppress errors, but this is a bad idea.
- Operations can be placed in a call to the function `tryCatch` to handle errors.
- Use testthat to write unit tests for R.
- Put unit tests for an R package in the `tests/testthat` directory.
- Put tests in files called `test_group.R` and call them `test_something`.
- Use `test_dir` to run tests from a particular directory that match a pattern.
- Write tests for data transformation steps as well as library functions.
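
The sketch below shows one way to signal and handle a condition, followed by a small testthat test. The function `safe_divide` is a made-up example, and the test is written inline here, although in a package it would normally live under `tests/testthat`.

```r
library(testthat)

# Signal an error with stop() when the input is unusable.
safe_divide <- function(x, y) {
  if (y == 0) {
    stop("division by zero")
  }
  x / y
}

# tryCatch() lets the caller handle the error instead of silently hiding it.
result <- tryCatch(
  safe_divide(1, 0),
  error = function(e) {
    message("recovered: ", conditionMessage(e))
    NA_real_
  }
)
result                                     # NA

# Warnings can be handled in the same way.
converted <- tryCatch(as.integer("x"), warning = function(w) -1L)
converted                                  # -1

# A unit test checks both the normal path and the error path.
test_that("safe_divide handles ordinary and zero divisors", {
  expect_equal(safe_divide(6, 3), 2)
  expect_error(safe_divide(1, 0), "division by zero")
})
```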
G.7 Advanced Topics
- The `reticulate` library allows R programs to access data in Python programs and vice versa.
- Use `py$whatever` to access a top-level Python variable from R.
- Use `r.whatever` to access a top-level R definition from Python.
- R is always indexed from 1 (even in Python) and Python is always indexed from 0 (even in R).
- Numbers in R are floating point by default, so use a trailing ‘L’ to force a value to be an integer.
- A Python script run from an R session believes it is the main script, i.e., `__name__` is `'__main__'` inside the Python script.
- S3 is the most commonly used object-oriented programming system in R.
- Every object can store metadata about itself in attributes, which are set and queried with `attr`.
- The `dim` attribute stores the dimensions of a matrix (which is physically stored as a vector).
- The `class` attribute of an object defines its class or classes (it may have several character entries).
- When `F(X, ...)` is called, and `X` has class `C`, R looks for a function called `F.C` (the `.` is just a naming convention).
- If an object has multiple classes in its `class` attribute, R looks for a corresponding method for each in turn.
- Every user defined class `C` should have functions `new_C` (to create it), `validate_C` (to validate its integrity), and `C` (to create and validate).
- Use the `DBI` package to work with relational databases.
- Use `DBI::dbConnect(...)` with database-specific parameters to connect to a specific database.
- Use `dbGetQuery(connection, "query")` to send an SQL query string to a database and get a data frame of results.
- Parameterize queries using `:name` as a placeholder in the query and `params = list(name = value)` as a third parameter to `dbGetQuery` to specify actual values.
- Use `dbFetch` in a `while` loop to page results.
- Use `dbWriteTable` to write an entire data frame to a table, and `dbExecute` to execute a single insertion statement.
- Dates… why did it have to be dates?
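
The attribute, S3, and database points can be sketched as follows. The `temperature` class and the `readings` table are hypothetical, and the database part assumes the RSQLite package is installed so that an in-memory SQLite database can stand in for a real one.

```r
library(DBI)

# Numbers are doubles unless given a trailing L.
typeof(1)                                  # "double"
typeof(1L)                                 # "integer"

# A matrix is a vector with a dim attribute.
v <- 1:6
attr(v, "dim") <- c(2, 3)
is.matrix(v)                               # TRUE

# S3: constructor, validator, and user-facing helper for a hypothetical class.
new_temperature <- function(celsius) {
  structure(list(celsius = celsius), class = "temperature")
}
validate_temperature <- function(t) {
  if (!is.numeric(t$celsius) || any(t$celsius < -273.15)) {
    stop("celsius must be numeric and not below absolute zero")
  }
  t
}
temperature <- function(celsius) {
  validate_temperature(new_temperature(celsius))
}

# format(X) dispatches to format.temperature because X has class "temperature".
format.temperature <- function(x, ...) {
  paste0(x$celsius, " C")
}
format(temperature(21.5))                  # "21.5 C"

# DBI: write a data frame, insert one row, then run a parameterized query.
connection <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(connection, "readings",
             data.frame(site = c("A", "B"), depth = c(1.5, 3.0)))
dbExecute(connection, "INSERT INTO readings VALUES ('C', 4.5)")
deep <- dbGetQuery(
  connection,
  "SELECT site FROM readings WHERE depth > :minimum ORDER BY site",
  params = list(minimum = 2)
)
dbDisconnect(connection)
deep$site                                  # "B" "C"
```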