Where to Start Programming with Data?

A Beginner’s Guide and List of Resources for Social Scientists

I want to learn how to program with data. Can you point me to a few good reads for beginners? Should I learn R or Python? What’s the best way to get started? Where should I begin? Admittedly, I don’t have any data to back this up, but I feel the number of get-me-started-with-coding requests has increased ever since many other activities were put on hold for the foreseeable future.

Learning a programming language seems even more reasonable to many researchers now. I don’t know whether it’s a fear of missing out on all those hackathons or a slowdown in people’s schedules that lets them address the thing they have wanted to do for a long time.

Individual motivation may be as heterogeneous as personal starting points – in particular for academic researchers who have worked with data before. There is hardly ever a clean slate: some real experience here and some semi-knowledge there. But that’s fine. I hope this article helps you improve your starting point and points you to what’s next after getting started.

If you came to 🍒-pick, you’re welcome to. Find a list of useful links at the end of this post.

Crash Course – Lesson 1: Understand Your Ecosystem

Language Wars Are Utterly Useless

The two popular choices for social scientists who want to start programming with data are the R Language for Statistical Computing and Python. Both languages are good choices. Don’t buy into language wars. Further down the road you will learn to understand both languages. Start with the language your close peers use. Reading source code from more experienced developers is one of the best ways to learn. The closer others’ work is to what you want to do, the bigger the motivation to dig into it.

That being said, we stick to the R Language for Statistical Computing for this 101, as we have to pick one language here. R is easy to install and offers a plethora of great online resources. R is an interpreted language, which means it is well suited for interactive use. Talk to it, get a result back. Do something with this result, give it back to R, get another result back. Line by line. Just like a pocket calculator. Do not take this for granted. Working with a compiled language has a totally different feel. Also, R has one dominant IDE – that’s a text editor on steroids – that almost every researcher uses: the freely available R Studio Integrated Development Environment (IDE). Once you have downloaded R itself and the R Studio editor/IDE, you’re good to go.

R != R Studio & Other Notes for STATA and SPSS users…

My emphasis on the point that R is not R Studio is not meant to take anything away from how great R Studio is or how important its role is to the modern R experience. I do believe, though, that understanding this modularity at an early stage is important, because many social scientists looking to take their programming up a notch have worked with data and software before considering R. People who have almost solely worked with STATA, which is common in economics, or SPSS, which is common in, e.g., psychology, tend to believe there is one program, as opposed to a language that can be used from potentially many different editors. For most SPSS users it is very uncommon to use an editor other than the one inside the SPSS application. Also, SPSS and STATA are programs with extensive macro / scripting capabilities rather than full-fledged programming languages. As a result, STATA syntax is simpler and easier to remember than R syntax in the beginning. Yet, the syntax of these programs is super domain specific. Doing general tasks like string operations or keeping multiple datasets in memory with these programs feels like washing your dishes with your feet once you have seen a full-fledged programming language at work.

But domain specific programs like STATA undoubtedly have convenient implementations of econometric models in place. Sometimes it’s a good solution – at least for starters – to do all the data pre-processing and/or merging in R and then continue with your STATA models. You can even trigger your STATA commands from inside R. Also, realizing R is not R Studio is an important step towards seeing that not all R programs are interactive. Some may very well live on a server and be called automatically every day without human interaction.
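To make the non-interactive idea a bit more tangible, here is a minimal sketch of a script that could run without R Studio. The file name daily_job.R and the variable names are hypothetical, and the summary of a demo dataset is only a stand-in for whatever the job would really do.

```r
# Hypothetical file name: daily_job.R
# From a terminal or a scheduler it could be started with:  Rscript daily_job.R

# interactive() is FALSE when the script is started via Rscript
run_mode <- if (interactive()) "interactive session" else "batch run"

# a timestamped message that would end up in the log of the job
message(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), " - ", run_mode)

# stand-in for the actual work: summarize a column of a demo dataset
result <- summary(mtcars$mpg)
print(result)
```

The same file runs line by line in R Studio, too. Only the value of run_mode differs.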

Extension Packages from CRAN

The standard resource for R extension packages is the Comprehensive R Archive Network (CRAN). Last time I checked, CRAN had more than 15K packages from various fields of research. The CRAN Task Views are a great set of field specific, curated pages that provide an overview of the packages in a respective field. Task views can really help you decide which package to download. Apart from making the choice and the online research that comes along with it, downloading a package is usually just a matter of running:

# NOTE: the quotes are mandatory here
# NOTE: installation needs to run only once (unless you update the package)
# NOTE: the subsequent library call is needed to make the package
# available in a particular R session
install.packages("the_name_of_the_package")
library(the_name_of_the_package)
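Because installation is needed only once while the library call is needed in every session, scripts often guard the install step. The following is a minimal sketch of that pattern, not the only way to do it. The variable name pkg is my own choice, and "stats" is used purely as a stand-in because it ships with every R installation.

```r
# name of the package you need; "stats" ships with R and is only a stand-in
pkg <- "stats"

# requireNamespace() returns TRUE if the package is installed
# and FALSE (instead of throwing an error) if it is not
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE lets library() accept a name stored in a variable
library(pkg, character.only = TRUE)

# check that the package is attached in this session
is_attached <- paste0("package:", pkg) %in% search()
is_attached
## [1] TRUE
```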

And these are only the packages that made it through the CRAN review process…

Extension Packages from Github – Don’t Use {devtools}!

… there are many, many more around on the world’s leading open source code repository platform, Github. And guess what, there is an R package for almost everything, even for installing packages from Github. Often R projects on Github recommend using the {devtools} R package to install the package directly from Github. But I would NOT recommend using {devtools}, even if the developer of the package you are interested in suggests doing so. This is one trap a beginner could step into: waste time trying to install the package to install packages and get frustrated before she has even started. This is because {devtools} is an extension for package developers and has quite a few dependencies, which can make installation tricky. If you’re unlucky, particularly on Windows, installation of dependencies beyond R can turn into a problem. A better option is to use the lighter {remotes} package:

# make sure to install the remotes package from CRAN before.
install.packages("remotes")
# This will install the package called tstools from 
# my github account
remotes::install_github("mbannert/tstools")

Bioconductor – a Package Archive for Biostatistics

Bioconductor has its own community, which I am not overly familiar with, but I do know they have lots of great packages. Installing a package from Bioconductor is fairly easy as well. You will need to install the {BiocManager} package first.

# make sure to install the BiocManager package from CRAN before.
install.packages("BiocManager")
# list all available packages
BiocManager::available()
# install the A3 package
BiocManager::install("A3")

Note that Bioconductor packages can very well depend on CRAN packages, which will be installed automatically when dependencies are resolved.

Crash Course – Lesson 2: Getting Started

At the start of this section, I assume you have installed R and R Studio on your local machine or have access to R Studio Server running somewhere. I also assume you are familiar with R Studio’s panes and how to switch between the script window (Ctrl/Cmd+1) and the console (Ctrl/Cmd+2). I believe you know by now how to run code selected in the script window (Ctrl/Cmd+Enter) without leaving your keyboard for the mouse or touchpad.

Basic Data Structures

First, let’s get familiar with basic data structures in R. Everything is a vector; even scalars are vectors of length one. The c() function lets you combine elements into vectors. The assignment operator <- (press Alt + - in R Studio) assigns values to variables.

a <- c(1,2,3)
a
## [1] 1 2 3
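A short sketch of what ‘everything is a vector’ means in practice. The variable names x, doubled and combined are just illustrative choices.

```r
# a single number is a numeric vector of length one
x <- 5
length(x)
## [1] 1
is.vector(x)
## [1] TRUE

# operations are applied element by element (vectorized)
a <- c(1, 2, 3)
doubled <- a * 2
doubled
## [1] 2 4 6

# c() also combines existing vectors into a longer one
combined <- c(a, x)
combined
## [1] 1 2 3 5
```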

Matrices are basically two dimensional vectors. All elements of a matrix need to have the same data type, i.e., all numerics, all characters and so forth. If that is not the case, elements are coerced to the most general type among them, which is more often than not the character type.

m <- matrix(1:9,nrow = 3, ncol = 3)
m
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
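To see the coercion described above at work, here is a small sketch with two made-up matrices, mixed and num_lgl. The function typeof() reports the type shared by all elements.

```r
# mixing numbers and text: every element becomes a character string
mixed <- matrix(c(1, "a", 3, "b"), nrow = 2)
typeof(mixed)
## [1] "character"

# mixing numbers and logicals: TRUE / FALSE become 1 / 0
num_lgl <- matrix(c(1, TRUE, 3, FALSE), nrow = 2)
typeof(num_lgl)
## [1] "double"

# second row of the numeric matrix: TRUE was stored as 1, FALSE as 0
num_lgl[2, ]
## [1] 1 0
```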

Data.frames are similar to matrices inasmuch as they are typically 2-dimensional and all columns have to be of the same length. They lift the data type restriction, though: data types may vary across columns.

d <- data.frame(first_column = 1:3,
                second_column = LETTERS[1:3])
d
##   first_column second_column
## 1            1             A
## 2            2             B
## 3            3             C
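A quick way to check that column types really differ is to ask each column for its class. This is only a sketch: it uses a separate object d2 so that d from above stays untouched, and it sets stringsAsFactors = FALSE explicitly so the result does not depend on your R version.

```r
d2 <- data.frame(first_column = 1:3,
                 second_column = LETTERS[1:3],
                 stringsAsFactors = FALSE)

# add a third column of yet another type
d2$third_column <- c(TRUE, FALSE, TRUE)

# sapply() applies class() to every column and returns a named vector
column_types <- sapply(d2, class)
column_types

# rows and columns
dim(d2)
## [1] 3 3
```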

Finally, lists are even more flexible. They allow you to nest objects of different data types. Their elements can even contain functions or entire datasets. To illustrate this, let us put all of the objects created before into a list. And that’s not all: R and many of its extension packages ship with demo datasets to illustrate functionality, so we add one of those, too.

# load one of the most famous demo datasets
data(iris)
# create an empty list and add stuff
l <- list()
l$a_vector <- a
l$a_matrix <- m
l$a_data.frame <- d
# add the first 6 lines of the iris dataset
l$iris <- head(iris)
l
## $a_vector
## [1] 1 2 3
## 
## $a_matrix
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## $a_data.frame
##   first_column second_column
## 1            1             A
## 2            2             B
## 3            3             C
## 
## $iris
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
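The list above holds data only. As a sketch of the claim that list elements can also be functions or other lists, consider a separate, made-up list called toolbox:

```r
toolbox <- list(a_vector = c(1, 2, 3))

# store a function as a list element ...
toolbox$a_function <- mean

# ... and nest another list inside the list
toolbox$a_nested_list <- list(note = "lists can contain lists",
                              numbers = 4:6)

# call the stored function on the stored vector
stored_mean <- toolbox$a_function(toolbox$a_vector)
stored_mean
## [1] 2

# names() shows what the list contains
names(toolbox)
## [1] "a_vector"      "a_function"    "a_nested_list"
```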

So far you have seen the assignment operator <- which you should use to create new variables of any type and the 4 most important data types in base R: vector, matrix, data.frame and list. Along the way you might have noticed a few other things I will explain shortly.

Indexing [], [[]]

When programming with data, the ability to access subsets in a simple fashion is very important. In R, square brackets [] indicate indexing, as opposed to parentheses (), which hold the arguments of a function call, or curly braces {}, which delimit a function body or control structure.

Consider a vector with 4 elements:

b <- c(1,10,20,30)
b[3]
## [1] 20

The above code returns 20 because the index asked for the 3rd element of the vector. Unlike Python, but like Matlab, R starts counting at 1, not 0. Indexing works for matrices and data.frames, too. Just think [rows, columns]:

d[1,]
##   first_column second_column
## 1            1             A
d[,1]
## [1] 1 2 3

Leaving an index blank means ‘all rows’ and/or ‘all columns’. Lists have double square brackets in addition to single ones. [[1]] will return the first element of a list. A single [1] will return a list containing the first element of the initial list.

l[[1]]
## [1] 1 2 3
l[2]
## $a_matrix
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
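The difference between single and double brackets matters as soon as you want to compute with an element. A minimal sketch, using a small made-up list called example_list:

```r
example_list <- list(a_vector = c(1, 2, 3), a_word = "hello")

# double brackets extract the element itself
class(example_list[[1]])
## [1] "numeric"

# single brackets return a (shorter) list
class(example_list[1])
## [1] "list"

# so arithmetic works on the result of [[ ]] ...
sum(example_list[[1]])
## [1] 6

# ... while single brackets are handy to keep several elements at once
length(example_list[c(1, 2)])
## [1] 2
```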

We’ve just seen position based indexing. Two other popular forms of indexing are name based indexing and logical indexing.

# name based indexing
l["a_matrix"]
## $a_matrix
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

You will also quite often see the $ operator used with lists and data.frames. It is popular thanks to R Studio’s auto complete feature (hit the tab key after typing $ behind the name of a data.frame object).

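d$second_column
## [1] A B C
## Levels: A B C
# NOTE: this is the output of R versions before 4.0.0, which turned strings
# into factors by default. Newer versions print [1] "A" "B" "C"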
# logical indexing
b[c(TRUE, TRUE, FALSE, TRUE)]
## [1]  1 10 30

The latter is particularly relevant when you want to use control structures, create your own functions or use filters. There is a lot more to say about R basics, but I guess the above insights should enable you to move around, follow some R discussions and tackle some first challenges. For a more comprehensive discussion of data types, indexing and more, check Chapters 2 & 3 of the official Introduction to R.
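In practice you rarely type TRUE and FALSE by hand. A comparison creates the logical vector for you, which is how filters are typically written. A small sketch; keep and cars_6 are illustrative names and the thresholds are arbitrary:

```r
b <- c(1, 10, 20, 30)

# a comparison returns a logical vector ...
keep <- b > 5
keep
## [1] FALSE  TRUE  TRUE  TRUE

# ... that can be used as an index
b[keep]
## [1] 10 20 30

# the same idea filters rows of a data.frame:
# cars with 6 cylinders that make more than 20 miles per gallon
cars_6 <- mtcars[mtcars$cyl == 6 & mtcars$mpg > 20, c("mpg", "cyl")]
nrow(cars_6)
## [1] 3
```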

Functions

R is a functional language. For starters this means: think in functions. Use existing functions and learn how to create your own functions and apply them to your data. A function consists of a name, zero or more parameters, and a function body. A function is defined once and called often.

# function definition
name_of_the_f <- function(parameter_1 = 3,
                          another_param_2 = 5){
  # function body = what the function does
  result <- parameter_1 * another_param_2
  # the value of the last evaluated expression is returned automatically,
  # so this last line is
  # equivalent to
  # return(result)
  result
}

Call your function with different parameters:

# returns 20
name_of_the_f(2,10)
## [1] 20
# returns 5
name_of_the_f(1)
## [1] 5
# returns 30
name_of_the_f(another_param_2 = 10)
## [1] 30

Note how input parameters are sequentially mapped to the parameters of a function’s definition. When parameters are not specified, defaults are used. If there is neither an input parameter nor a default, the code will break. You can also explicitly assign values to parameters by name.
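The ‘no input and no default’ case is worth seeing once. Below is a sketch with a hypothetical function add_vat whose first parameter has no default. tryCatch() is used here only to capture the error so the example keeps running.

```r
# hypothetical function: 'price' has no default, 'vat_rate' does
add_vat <- function(price, vat_rate = 0.1) {
  price * (1 + vat_rate)
}

# the default vat_rate is used
add_vat(100)
## [1] 110

# named parameters can be given in any order
add_vat(vat_rate = 0.2, price = 100)
## [1] 120

# without a value for 'price' the call fails;
# tryCatch() turns the error into a message we can inspect
outcome <- tryCatch(add_vat(),
                    error = function(e) "error: price is missing")
outcome
## [1] "error: price is missing"
```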

In R it is very common to pass functions to another function. The apply function family is a good example of this approach. lapply stands for ‘list apply’ and applies one function to all elements of a list.

data(iris)
data(mtcars)
ll <- list(ds1 = iris,
           ds2 = mtcars)

# returns a list of summaries
lapply(ll, summary)
## $ds1
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##                 
## 
## $ds2
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
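The function you pass does not need to exist beforehand. You can define it on the fly, and sapply is a sibling of lapply that tries to simplify the result to a vector. A short sketch along the lines of the example above; n_rows, n_numeric and n_cols are illustrative names:

```r
ll <- list(ds1 = iris, ds2 = mtcars)

# pass an existing function: number of rows per dataset
n_rows <- lapply(ll, nrow)

# pass an anonymous function defined on the fly:
# count the numeric columns of each dataset
n_numeric <- lapply(ll, function(x) sum(sapply(x, is.numeric)))

# sapply() returns a named vector instead of a list
n_cols <- sapply(ll, ncol)
n_cols
```

Here iris has 5 columns, 4 of them numeric, while all 11 columns of mtcars are numeric.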

Now it’s your turn! Seriously. Often, newcomers who watch an experienced programmer say things like ‘How can she remember all of this code?’. They seem to suggest that good programmers just have a lot of capacity to memorize syntax. Admittedly, just like with a natural language, you need to learn a bit of vocabulary to move around. But there is a lot less to thoroughly remember than you might think. The stuff discussed above is already a large chunk of what you really need to be familiar with – at least for starters. Almost everything else can be looked up while you start to tackle basic real world data problems.

That being said, one of your first tasks could be to make yourself familiar with a few more basic functions that help you to move around. ?function_name opens the R help window showing a function’s documentation. Many functions come with examples at the end of their documentation. Running these examples line by line will greatly help you if you are more of an applied mind. Good functions to start with are:

head()
tail()
str()
ls()
sum()
mean()
summary()
lm()
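If you would like a starting point, the functions listed above can all be tried on one of the demo datasets. This is just one possible tour. The regression of mpg on wt is an arbitrary choice of mine to show the formula syntax of lm().

```r
data(mtcars)

head(mtcars, 3)       # first 3 rows
tail(mtcars, 3)       # last 3 rows
str(mtcars)           # compact overview of columns and their types

sum(mtcars$cyl)       # sum of a single column
avg_mpg <- mean(mtcars$mpg)
avg_mpg               # average miles per gallon
summary(mtcars$hp)    # min, quartiles, mean and max of a column

# linear model: miles per gallon explained by weight
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)             # intercept and slope

ls()                  # names of the objects in your current environment

# ?lm                 # uncomment to open the help page of lm()
```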

Crash Course – Lesson 3: Advanced Ecosystems Inside R

One thing about R that you should know sooner rather than later is that there are at least two major approaches besides base R that are ecosystems of their own: the tidyverse and data.table.

The tidyverse

The tidyverse (unofficially called the Hadleyverse after its initiator Hadley Wickham) is a collection of R packages that follow the same general idea: clean code, tidy data. Many claim that you should learn the tidyverse right away and leave base R alone. To me, the tidyverse is still an extension, and even if you choose to put all your money on the tidyverse, it won’t hurt to learn the minimal pieces of base R described above.

The tidyverse is documented and marketed very well including introductory resources, so I won’t go past a very minimal description here.

The most important packages of the tidyverse are {dplyr} and {ggplot2} and their respective imports (other packages they make use of). The {dplyr} package works with a modernized version of the data.frame called tibble, which comes from the {tibble} package and is stricter and prints more compactly than a base R data.frame. Tidyverse code is typically written around the pipe operator %>%. In coding, to pipe means to direct the result of one function call to the next.

library(dplyr)
mtcars %>% 
  filter(cyl == 6) %>% 
  select(mpg, cyl, gear)
##    mpg cyl gear
## 1 21.0   6    4
## 2 21.0   6    4
## 3 21.4   6    3
## 4 18.1   6    3
## 5 19.2   6    4
## 6 17.8   6    4
## 7 19.7   6    5
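To appreciate what the pipe buys you, it helps to see the same selection without it. The sketch below uses only base R’s subset(). The object names are illustrative and it is not meant as a performance comparison.

```r
# without a pipe, option 1: store intermediate results
six_cyl_rows <- subset(mtcars, cyl == 6)
step_by_step <- subset(six_cyl_rows, select = c(mpg, cyl, gear))

# without a pipe, option 2: nest the calls and read from the inside out
nested <- subset(subset(mtcars, cyl == 6), select = c(mpg, cyl, gear))

# both give the same 7 rows and 3 columns
identical(step_by_step, nested)
## [1] TRUE
dim(nested)
## [1] 7 3
```

The pipe lets you write such steps in reading order without naming intermediate objects. Since version 4.1.0, R also ships with a native pipe operator |> that works without loading any package.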

The second part of the tidyverse, which is equally well documented, gets a quick mention here as it is one of the most popular standouts of the R language: the {ggplot2} data visualization package. The {ggplot2} package is the R implementation of a concept called the grammar of graphics. The basic idea is to approach graphics layer by layer: an axis layer, a points layer on top, maybe another line layer on top of that and so forth.

library(ggplot2)
gg <- ggplot(mtcars, aes(x = mpg,
                         y = hp,
                         color = as.factor(cyl)))
gg + 
  geom_point() +
  theme_minimal() +
  scale_color_viridis_d()

data.table

Another very powerful ecosystem within the R world is Matt Dowle’s {data.table} package. It is among the fastest data.frame alternatives R has. The idea is to avoid copying an entire data.frame with potentially millions of rows when a single element is changed. {data.table} lets the much quicker C language do the job but provides an R interface to it. The syntax has an SQL-like approach in mind and thus feels intuitive to people who have worked with an SQL database before.

When you are used to working with base R, be aware: the {data.table} package can modify objects in place. {data.table} also ships with very fast .csv I/O functions (fread / fwrite), which can be a good reason to use data.table if you have to deal with large .csv files.

library(data.table)
input <- "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"

flights <- fread(input)
dim(flights)
## [1] 253316     11

# get all flights from JFK in June
ans <- flights[origin == "JFK" & month == 6L]
head(ans)
##    year month day dep_delay arr_delay carrier origin dest air_time distance
## 1: 2014     6   1        -9        -5      AA    JFK  LAX      324     2475
## 2: 2014     6   1       -10       -13      AA    JFK  LAX      329     2475
## 3: 2014     6   1        18        -1      AA    JFK  LAX      326     2475
## 4: 2014     6   1        -6       -16      AA    JFK  LAX      320     2475
## 5: 2014     6   1        -4       -45      AA    JFK  LAX      326     2475
## 6: 2014     6   1        -6       -23      AA    JFK  LAX      329     2475
##    hour
## 1:    8
## 2:   12
## 3:    7
## 4:   10
## 5:   18
## 6:   14
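The warning about in-place modification deserves a tiny example of its own, because it behaves differently from what base R users expect. A minimal sketch with a made-up table; it assumes the {data.table} package is installed.

```r
library(data.table)

dt <- data.table(id = 1:3, value = c(10, 20, 30))

# plain assignment does NOT copy a data.table:
# both names refer to the same table
same_dt <- dt

# copy() creates an independent copy
real_copy <- copy(dt)

# := adds a column by reference, i.e., in place
same_dt[, doubled := value * 2]

# the change shows up in dt as well ...
names(dt)
## [1] "id"      "value"   "doubled"

# ... but not in the explicit copy
names(real_copy)
## [1] "id"    "value"
```

If you want base-R-like behavior, call copy() before modifying.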

The official CRAN repository offers an introductory piece on data.table.

List of Resources

I’ve composed a list of links I wish had been around when I started learning how to program with data. It’s not a complete list, but I am confident there is no BS on that list.

Comprehensive, General Reads

  • Advanced R by Hadley Wickham: More fundamental than its title suggests. Well written documentation. Probably the right dose of comprehensive yet applied enough to extend the audience beyond R gurus.

  • How R Searches and Finds Stuff: This post was recommended to me by Martin Maechler years ago and I have never forgotten about it since. Martin is an R Core member who knew more about R ~20 years ago than most of us know now. The fact that he recommended this post knighted it for me. It may be very nerdy and advanced from a beginner’s point of view, yet it is insightful and inclusive if you want to read some deeper thoughts about the language.

  • R for Data Science: Applied, hands-on-the-problem book by Hadley Wickham.

  • ggplot2 documentation – Grammar of Graphics

  • Bookdown Collection of free R Books

Tasks and Puzzles

Community and News Digests

“I came for the software and stayed for community”.

The above quote by David Allen (David works for Revolution Analytics, which was acquired by Microsoft) summarizes one of the biggest strengths of the R Language: its one-of-a-kind community. The R community is incredibly inclusive, has nurtured diversity for years and is, on top of all that, competent.

Feedback?

Miss your favorite resource? What’s your approach to onboarding newcomers to programming with data? I’m curious. Let me know - hit me up on Twitter. chirp chirp.

Matt Bannert
gut checks stack. makes public data public. runs on rap & open source.

I am interested in data science devOps for official statistics, open source, time series, rstats, Python and SQL. I contribute to RAdwords, tstools, timeseriesdb and kofdata.
