Before we begin, you will need to install these packages:
install.packages("jsonlite")
install.packages("rvest")
install.packages("DBI")
install.packages("RSQLite")
install.packages("dbplyr")
install.packages("pollstR")Now we load a few R packages
Today we are going to talk about getting data, examples of common data formats, and useful tools to access data.
First let’s have a bit of a philosophical discussion about data.
As data analysts, this is what we wish data looked like whenever we start a project:
However, the reality is that data is rarely in that form. It comes in all types of "raw" formats that need to be transformed into a "clean" format.
For example, in the field of genomics, raw data looks something like this:
Or if you are interested in analyzing data from Twitter:
Or data from Electronic Healthcare Records (EHRs):
We all have our scary spreadsheet tales. Here is Jenny Bryan from RStudio and UBC actually asking for some of those spreadsheet tales on Twitter.
For example, this is an actual spreadsheet from Enron in 2001:
From https://simplystatistics.org/2016/07/20/relativity-raw-data/ raw data is defined as data…
…if you have done no processing, manipulation, coding, or analysis of the data. In other words, the file you received from the person before you is untouched. But it may not be the rawest version of the data. The person who gave you the raw data may have done some computations. They have a different “raw data set”.
Data lives anywhere and everywhere. Data might be stored simply in a .csv or .txt file. Data might be stored in an Excel or Google Spreadsheet. Data might be stored in large databases that require users to write special functions to interact with them and extract the data they are interested in.
For example, you may have heard of the terms mySQL or MongoDB.
From Wikipedia, MySQL is defined as an open-source relational database management system (RDBMS). Its name is a combination of "My", the name of co-founder Michael Widenius's daughter, and "SQL", the abbreviation for Structured Query Language.
From Wikipedia, MongoDB is defined as "a free and open-source cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with schemata."
So after reading that, we get the sense that there are multiple ways large databases can be structured, and multiple ways data can be formatted and interacted with. In addition, we see that database programs (e.g. MySQL and MongoDB) can interact with each other.
We will learn more about SQL and JSON in a bit.
A great article was published in PeerJ titled How to share data for collaboration, in which the authors describe a set of guidelines for sharing data:
We highlight the need to provide raw data to the statistician, the importance of consistent formatting, and the necessity of including all essential experimental information and pre-processing steps carried out to the statistician. With these guidelines we hope to avoid errors and delays in data analysis.
It’s a great paper that describes the information you should pass to a statistician to facilitate the most efficient and timely analysis. Specifically:
1. The raw data.
2. A tidy data set.
3. A code book describing each variable and its values in the tidy data set.
4. An explicit and exact recipe you used to go from 1 to 2 and 3.

First let's talk about a few important things before we download any data.
When you are starting a data analysis, you have already learned about the use of .Rproj files. When you open up a .Rproj file, RStudio changes the working directory (the location on your computer) to the folder containing the .Rproj file.
After opening up a .Rproj file, you can test this with the getwd() function.
When you open up someone else's R code or analysis, you might also see the setwd() function being used, which explicitly tells R to change the working directory to an absolute path (an exact location on that person's computer).
For example, say I want to clone a GitHub repo from Roger, which has 100 R script files, and in every one of those files at the top is something like this:
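setwd("C:\\Users\\Roger\\path\\only\\that\\Roger\\has")  # an absolute path that exists only on Roger's machine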
The problem is, if I want to use his code, I will need to go and hand-edit every single one of those paths (C:\Users\Roger\path\only\that\Roger\has) to the path that I want to use on my computer or wherever I saved the folder on my computer (e.g. /Users/Stephanie/Documents/path/only/I/have).
So instead of absolute paths:
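# absolute paths are specific to one computer
setwd("/Users/Stephanie/Documents/path/only/I/have")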
A better idea is to use relative paths:
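# a relative path, resolved from the project folder (hypothetical example)
setwd("../data")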
Within R, an even better idea is to use the here R package, which will recognize the top-level directory of a Git repo and supports building all paths relative to that. For more on project-oriented workflow suggestions, read this post from Jenny Bryan.
here package

In her post, she writes:
“I suggest organizing each data analysis into a project: a folder on your computer that holds all the files relevant to that particular piece of work.”
Instead of using setwd() at the top of your .R or .Rmd file, she suggests:
- Organize each logical project into a folder on your computer.
- Make sure the top-level folder advertises itself as such. This can be as simple as having an empty file named .here. Or, if you use RStudio and/or Git, those both leave characteristic files behind that will get the job done.
- Use the here() function from the here package to build the path when you read or write a file. Create paths relative to the top-level directory.
- Whenever you work on this project, launch the R process from the project's top-level folder. If you launch R from the shell, cd to the correct folder first.

Let's test this out. We can use getwd() to see our current working directory path and list the files available using list.files():
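getwd()
list.files()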
## [1] "/Users/shicks/Documents/github/teaching/jhu-advdatasci/2019/lectures"
## [1] "01-introduction-lecture.html"
## [2] "01-introduction-lecture.Rmd"
## [3] "03-elements-prinicples-success.pdf"
## [4] "04-gettingdata-api.html"
## [5] "04-gettingdata-api.Rmd"
## [6] "TypesOfQuestions.pdf"
OK so our current location is in the lectures sub-folder of the 2019 course repository. Let’s try using the here package.
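Loading here prints where it thinks the project root is; then we can list files relative to that root:

library(here)
list.files(here())
list.files(here("data"))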
## here() starts at /Users/shicks/Documents/github/teaching/jhu-advdatasci/2019
## [1] "_navbar.yml" "_site.yml"
## [3] "2019-advdatasci.Rproj" "additional.css"
## [5] "book" "data"
## [7] "discussion.html" "discussion.Rmd"
## [9] "homework.html" "homework.Rmd"
## [11] "homeworks" "imgs"
## [13] "index.html" "index.Rmd"
## [15] "lectures" "Makefile"
## [17] "projects.html" "projects.Rmd"
## [19] "R_functions.sh" "README.html"
## [21] "README.md" "resources.html"
## [23] "resources.Rmd" "site_libs"
## [25] "slide_functions.R" "styles.css"
## [27] "syllabus.html" "syllabus.Rmd"
## [1] "cameras.csv" "Chinook.sqlite"
Now we see that the here::here() function builds a relative path (relative to the folder containing the .Rproj file in our 2019 repository). We also see there is a cameras.csv file in the data folder. Let's read it into R with the readr package.
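For example (the object name df is an arbitrary choice):

df <- readr::read_csv(here("data", "cameras.csv"))
df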
## Parsed with column specification:
## cols(
## address = col_character(),
## direction = col_character(),
## street = col_character(),
## crossStreet = col_character(),
## intersection = col_character(),
## `Location 1` = col_character()
## )
## # A tibble: 80 x 6
## address direction street crossStreet intersection `Location 1`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 GARRISON BL… E/B "Garris… Wabash Ave "Garrison \n … (39.341209, …
## 2 HILLEN ST &… W/B "Hillen… Forrest St "Hillen \n & … (39.29686, -…
## 3 EDMONDSON A… E/B "Edmons… Woodbridge… "Edmonson\n … (39.293453, …
## 4 YORK RD & G… S/B "York R… Gitting Ave "York Rd \n &… (39.370493, …
## 5 RUSSELL ST … S/B "Russel… Hamburg St "Russell\n &… (39.279819, …
## 6 S MARTIN LU… S/B "MLK Jr… Pratt St "MLK Jr. Blvd… (39.286027, …
## 7 ORLEANS ST … E/B Orleans Linwood Ave Orleans & L… (39.295866, …
## 8 E NORTHERN … W/B "Northe… Springlake… "Northern Pkw… (39.364311, …
## 9 W COLD SPRI… E/B "Cold S… Roland Ave "Cold Spring\… (39.343906, …
## 10 E NORTHERN … W/B "Northe… York Road "Northern Pkw… (39.365146, …
## # … with 70 more rows
We can also ask for the full paths for specific files:
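here("data", "cameras.csv")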
## [1] "/Users/shicks/Documents/github/teaching/jhu-advdatasci/2019/data/cameras.csv"
If you want to download a file, one way is to use the file.exists(), dir.create() and list.files() functions:
- file.exists(here("my", "relative", "path")): logical test for whether the file exists
- dir.create(here("my", "relative", "path")): create a folder
- list.files(here("my", "relative", "path")): list the contents of a folder

Let's say we wanted to find out where all the fixed speed cameras are in Baltimore.
To do this, we can use the Open Baltimore API which has information on the locations of fixed speed cameras in Baltimore.
In case you aren’t familiar with fixed speed cameras, the website states:
Motorists who drive aggressively and exceed the posted speed limit by at least 12 miles per hour will receive $40 citations in the mail. These citations are not reported to insurance companies and no license points are assigned. Notification signs will be placed at all speed enforcement locations so that motorists will be aware that they are approaching a speed check zone. The goal of the program is to make the streets of Baltimore safer for everyone by changing aggressive driving behavior. In addition to the eight portable speed enforcement units, the city has retrofitted 50 red light camera locations with the automated speed enforcement technology.
When we go to the website, we see that the data can be provided to us as a .csv file. To download this data, we can do the following:
file_url <- paste0("https://data.baltimorecity.gov/api/",
"views/dz54-2aru/rows.csv?accessType=DOWNLOAD")
download.file(file_url,
destfile=here("data", "cameras.csv"))
list.files(here("data"))Alternatively, if we want to only download the file once each time we knit our reproducible report or homework or project, we can us wrap the code above into a !file.exists() function.
if(!file.exists(here("data", "cameras.csv"))){
file_url <- paste0("https://data.baltimorecity.gov/api/",
"views/dz54-2aru/rows.csv?accessType=DOWNLOAD")
download.file(file_url,
destfile=here("data", "cameras.csv"))
}
list.files(here("data"))## [1] "cameras.csv" "Chinook.sqlite"
From there, we can read in the cameras.csv like we have already learned how to do using the readr::read_csv() function:
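cameras <- readr::read_csv(here("data", "cameras.csv"))  # object name is arbitrary
cameras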
## Parsed with column specification:
## cols(
## address = col_character(),
## direction = col_character(),
## street = col_character(),
## crossStreet = col_character(),
## intersection = col_character(),
## `Location 1` = col_character()
## )
## # A tibble: 80 x 6
## address direction street crossStreet intersection `Location 1`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 GARRISON BL… E/B "Garris… Wabash Ave "Garrison \n … (39.341209, …
## 2 HILLEN ST &… W/B "Hillen… Forrest St "Hillen \n & … (39.29686, -…
## 3 EDMONDSON A… E/B "Edmons… Woodbridge… "Edmonson\n … (39.293453, …
## 4 YORK RD & G… S/B "York R… Gitting Ave "York Rd \n &… (39.370493, …
## 5 RUSSELL ST … S/B "Russel… Hamburg St "Russell\n &… (39.279819, …
## 6 S MARTIN LU… S/B "MLK Jr… Pratt St "MLK Jr. Blvd… (39.286027, …
## 7 ORLEANS ST … E/B Orleans Linwood Ave Orleans & L… (39.295866, …
## 8 E NORTHERN … W/B "Northe… Springlake… "Northern Pkw… (39.364311, …
## 9 W COLD SPRI… E/B "Cold S… Roland Ave "Cold Spring\… (39.343906, …
## 10 E NORTHERN … W/B "Northe… York Road "Northern Pkw… (39.365146, …
## # … with 70 more rows
jsonlite

JSON (or JavaScript Object Notation) is a file format that stores information in a human-readable, organized, and easy-to-access manner.
For example, here is what a JSON file looks like:
var stephanie = {
"age" : "33",
"hometown" : "Baltimore, MD",
"gender" : "female",
"cars" : {
"car1" : "Hyundai Elantra",
"car2" : "Toyota Rav4",
"car3" : "Honda CR-V"
}
}

Some features about JSON objects:

- Data are stored in name/value pairs (e.g. "age" : "33").
- Pairs are separated by commas.
- Curly braces {} hold objects.
- Square brackets [] hold arrays.

Let's say we want to use the GitHub API to find out how many of my GitHub repositories have open issues.
We will use the jsonlite R package and the fromJSON() function to convert from a JSON object to a data frame.
We will read in a JSON file located at https://api.github.com/users/stephaniehicks/repos
github_url = "https://api.github.com/users/stephaniehicks/repos"
library(jsonlite)
jsonData <- fromJSON(github_url)

The function fromJSON() has now converted the JSON file into a data frame with these names:
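names(jsonData)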
## [1] "id" "node_id" "name"
## [4] "full_name" "private" "owner"
## [7] "html_url" "description" "fork"
## [10] "url" "forks_url" "keys_url"
## [13] "collaborators_url" "teams_url" "hooks_url"
## [16] "issue_events_url" "events_url" "assignees_url"
## [19] "branches_url" "tags_url" "blobs_url"
## [22] "git_tags_url" "git_refs_url" "trees_url"
## [25] "statuses_url" "languages_url" "stargazers_url"
## [28] "contributors_url" "subscribers_url" "subscription_url"
## [31] "commits_url" "git_commits_url" "comments_url"
## [34] "issue_comment_url" "contents_url" "compare_url"
## [37] "merges_url" "archive_url" "downloads_url"
## [40] "issues_url" "pulls_url" "milestones_url"
## [43] "notifications_url" "labels_url" "releases_url"
## [46] "deployments_url" "created_at" "updated_at"
## [49] "pushed_at" "git_url" "ssh_url"
## [52] "clone_url" "svn_url" "homepage"
## [55] "size" "stargazers_count" "watchers_count"
## [58] "language" "has_issues" "has_projects"
## [61] "has_downloads" "has_wiki" "has_pages"
## [64] "forks_count" "mirror_url" "archived"
## [67] "disabled" "open_issues_count" "license"
## [70] "forks" "open_issues" "watchers"
## [73] "default_branch"
How many are private repos? How many have forks?
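We can tabulate the private and forks columns:

table(jsonData$private)
table(jsonData$forks)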
##
## FALSE
## 30
##
## 0 1 2 3 4 5 8
## 20 4 2 1 1 1 1
What’s the most popular language?
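Tabulating the language column:

table(jsonData$language)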
##
## HTML JavaScript R Ruby Shell TeX
## 7 1 11 2 1 2
To find out how many repos I have with open issues, we can just create a table:
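table(jsonData$open_issues_count)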
##
## 0 1
## 28 2
Whew! Not as many as I thought.
How many do you have?
Finally, I will leave you with a few other examples of using GitHub API:
- How long does it take to close issues in the dplyr package?

rvest

Do we want to purchase a book on Amazon?
Next we are going to learn about what to do if your data is on a website (XML or HTML) formatted to be read by humans instead of R.
We will use the (really powerful) rvest R package to do what is often called “scraping data from the web”.
Before we do that, we need to set up a few things, described below.
We’re going to be scraping this page: it just contains the (first page of) reviews of the ggplot2 book by Hadley Wickham.
We use the rvest package to download this page.
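Something like the following (the exact URL is an assumption; the ASIN 0387981403 appears in the page source below):

library(rvest)
url <- "https://www.amazon.com/product-reviews/0387981403/"  # assumed URL form for the reviews page
h <- read_html(url)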
Now h is an xml_document that contains the contents of the page:
## {xml_document}
## <html lang="en-us" class="a-no-js" data-19ax5a9jf="dingo">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body>\n<span id="cr-state-object" data-state='{"asin":"0387981403", ...
How can you actually pull the interesting information out? That’s where CSS selectors come in.
CSS selectors are a way to specify a subset of nodes (that is, units of content) on a web page (e.g., just the titles of reviews). CSS selectors are very powerful and not too challenging to master (here's a great tutorial). But honestly, you can get a lot done even with very little understanding, by using a tool called SelectorGadget.
Install SelectorGadget in your web browser. (If you use Chrome, you can use the Chrome extension; otherwise, drag the provided link into your bookmarks bar.) Here's a guide for how to use it with rvest to "point-and-click" your way to a working selector.
For example, if you just wanted the titles, you’ll end up with a selector that looks something like .a-text-bold span. You can pipe your HTML object along with that selector into the html_nodes function, to select just those nodes:
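h %>%
  html_nodes(".a-text-bold span")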
## {xml_nodeset (10)}
## [1] <span class="">Must-have reference for R graphics</span>
## [2] <span class="">Still a great package and highly worth learning - bu ...
## [3] <span class="">Excellent</span>
## [4] <span class="">Nice resource, but already out of date</span>
## [5] <span class="">The best guide to the best graphics (I think) out th ...
## [6] <span class="">Graphing in R</span>
## [7] <span class="">Excellent content, poor adaptation to kindle</span>
## [8] <span class="">Excellent R resource for the Kindle</span>
## [9] <span class="">Great book, outdated</span>
## [10] <span class="">Indispensable resource for ggplot2 users</span>
But you need the text from each of these, not the full tags. Pipe to the html_text function to pull these out:
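h %>%
  html_nodes(".a-text-bold span") %>%
  html_text()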
## [1] "Must-have reference for R graphics"
## [2] "Still a great package and highly worth learning - but the text is getting quite out of date."
## [3] "Excellent"
## [4] "Nice resource, but already out of date"
## [5] "The best guide to the best graphics (I think) out there."
## [6] "Graphing in R"
## [7] "Excellent content, poor adaptation to kindle"
## [8] "Excellent R resource for the Kindle"
## [9] "Great book, outdated"
## [10] "Indispensable resource for ggplot2 users"
Now we've extracted something useful! Similarly, let's grab the format (hardcover or paperback). Some experimentation with SelectorGadget shows the selector is .a-size-mini.a-color-secondary:
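h %>%
  html_nodes(".a-size-mini.a-color-secondary") %>%
  html_text()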
## [1] "Format: Paperback" "Format: Paperback"
## [3] "Format: Kindle Edition" "Format: Paperback"
## [5] "Format: Paperback" "Format: Kindle Edition"
## [7] "Format: Kindle Edition" "Format: Kindle Edition"
## [9] "Format: Paperback" "Format: Paperback"
Now, we may be annoyed that it always starts with Format:. Let’s introduce the stringr package.
formats <- h %>%
html_nodes(".a-size-mini.a-color-secondary") %>%
html_text() %>%
stringr::str_replace("Format: ", "")
formats

## [1] "Paperback"      "Paperback"      "Kindle Edition" "Paperback"
## [5] "Paperback" "Kindle Edition" "Kindle Edition" "Kindle Edition"
## [9] "Paperback" "Paperback"
We could do a similar exercise to extract the number of stars and whether or not someone found a review useful. This would help us decide if we were interested in purchasing the book!
SQLite database

Another important type of data you might interact with is a database (such as SQL or SQLite). There are several ways to query databases in R.
First, we will download a .sqlite database. This is a portable version of a SQL database. For our purposes, we will use the chinook sqlite database here. The database represents a “digital media store, including tables for artists, albums, media tracks, invoices and customers”.
From the Readme.md file:
Sample Data
Media related data was created using real data from an iTunes Library. It is possible for you to use your own iTunes Library to generate the SQL scripts, see instructions below. Customer and employee information was manually created using fictitious names, addresses that can be located on Google maps, and other well formatted data (phone, fax, email, etc.). Sales information is auto generated using random data for a four year period.
if(!file.exists(here("data", "Chinook.sqlite"))){
file_url <- paste0("https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite")
download.file(file_url,
destfile=here("data", "Chinook.sqlite"))
}
list.files(here("data"))## [1] "cameras.csv" "Chinook.sqlite"
The main workhorse packages that we will use are the DBI and dplyr packages. Let's look at the DBI::dbConnect() help file.

So we need a driver, and one example is RSQLite::SQLite(). Let's look at the help file.
Ok, so with RSQLite::SQLite() and DBI::dbConnect() we can connect to a SQLite database. Let's try that with the Chinook.sqlite file that we downloaded:
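library(DBI)
conn <- dbConnect(RSQLite::SQLite(), here("data", "Chinook.sqlite"))
conn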
## <SQLiteConnection>
## Path: /Users/shicks/Documents/github/teaching/jhu-advdatasci/2019/data/Chinook.sqlite
## Extensions: TRUE
So we have opened up a connection with the SQLite database. Next, we can see what tables are available in the database using the dbListTables() function:
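dbListTables(conn)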
## [1] "Album" "Artist" "Customer" "Employee"
## [5] "Genre" "Invoice" "InvoiceLine" "MediaType"
## [9] "Playlist" "PlaylistTrack" "Track"
From RStudio’s website, there are several ways to interact with SQL Databases. One of the simplest ways that we will use here is to leverage the dplyr framework.
"The
dplyrpackage now has a generalized SQL backend for talking to databases, and the newdbplyrpackage translates R code into database-specific variants. As of this writing, SQL variants are supported for the following databases: Oracle, Microsoft SQL Server, PostgreSQL, Amazon Redshift, Apache Hive, and Apache Impala. More will follow over time.
So if we want to query a SQL database with dplyr, the benefit of using dbplyr is:
"You can write your code in
dplyrsyntax, anddplyrwill translate your code into SQL. There are several benefits to writing queries indplyrsyntax: you can keep the same consistent language both for R objects and database tables, no knowledge of SQL or the specific SQL variant is required, and you can take advantage of the fact thatdplyruses lazy evaluation.
Let’s take a closer look at the conn database that we just connected to:
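library(dbplyr)
src_dbi(conn)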
##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
## src: sqlite 3.29.0 [/Users/shicks/Documents/github/teaching/jhu-advdatasci/2019/data/Chinook.sqlite]
## tbls: Album, Artist, Customer, Employee, Genre, Invoice, InvoiceLine,
## MediaType, Playlist, PlaylistTrack, Track
You can think of the multiple tables as similar to having multiple worksheets in a spreadsheet.
Let’s try interacting with one.
dplyr syntax

First, let's look at the first ten rows in the Album table:
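tbl(conn, "Album") %>%
  head(n = 10)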
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.29.0
## # [/Users/shicks/Documents/github/teaching/jhu-advdatasci/2019/data/Chinook.sqlite]
## AlbumId Title ArtistId
## <int> <chr> <int>
## 1 1 For Those About To Rock We Salute You 1
## 2 2 Balls to the Wall 2
## 3 3 Restless and Wild 2
## 4 4 Let There Be Rock 1
## 5 5 Big Ones 3
## 6 6 Jagged Little Pill 4
## 7 7 Facelift 5
## 8 8 Warner 25 Anos 6
## 9 9 Plays Metallica By Four Cellos 7
## 10 10 Audioslave 8
The output looks just like a data.frame that we are familiar with. But it's important to know that it's not really a data frame. For example, what happens if we use the dim() function?
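tbl(conn, "Album") %>%
  dim()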
## [1] NA 3
Interesting! We see that the number of rows returned is NA. This is because these functions work differently than they do on datasets in memory (e.g. data loaded into memory using read_csv()). Instead, dplyr communicates with the SQLite database without pulling the full table into R.
Let’s consider our example. If we were to use straight SQL, the following SQL query returns the first 10 rows from the Album table:
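SELECT * FROM `Album` LIMIT 10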
In the background, dplyr does the following: it translates our dplyr code into SQL, submits the query to the database, and only retrieves the results into R when we ask for them.
To better understand the dplyr code, we can use the show_query() function:
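tbl(conn, "Album") %>%
  head(n = 10) %>%
  show_query()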
## <SQL>
## SELECT *
## FROM `Album`
## LIMIT 10
This is nice because instead of having to write the SQL query ourselves, we can just use the dplyr and R syntax that we are used to.
However, the downside is that dplyr never gets to see the full Album table: it only sends our query to the database, waits for a response, and returns the result. On the other hand, this is exactly what lets us interact with datasets too large to fit in memory!
Many of the usual dplyr functions are available too:

- select()
- filter()
- summarize()

and many join functions.
Ok let’s try some of the functions out. First, let’s count how many albums each artist has made.
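A sketch using group_by() and summarize():

tbl(conn, "Album") %>%
  group_by(ArtistId) %>%
  summarize(n = n())   # n() translates to COUNT(*) in SQL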
## # Source: lazy query [?? x 2]
## # Database: sqlite 3.29.0
## # [/Users/shicks/Documents/github/teaching/jhu-advdatasci/2019/data/Chinook.sqlite]
## ArtistId n
## <int> <int>
## 1 1 2
## 2 2 2
## 3 3 1
## 4 4 1
## 5 5 1
## 6 6 2
## 7 7 1
## 8 8 3
## 9 9 1
## 10 10 1
Next, let’s plot it.
tbl(conn, "Album") %>%
group_by(ArtistId) %>%
summarize(n = count(ArtistId)) %>%
arrange(desc(n)) %>%
ggplot(aes(x = ArtistId, y = n)) +
geom_bar(stat = "identity")Let’s also extract the first letter from each album and plot the frequency of each letter.
tbl(conn, "Album") %>%
mutate(first_letter = str_sub(Title, end = 1)) %>%
ggplot(aes(first_letter)) +
geom_bar()The Huffington Post has an API which provides US opinion poll data on various political races and other non-political opinion polls.
There is an R package called pollstR which provides an easy user interface.
For example, the API has data on Trump's job approval.
Here we use the pollster_charts_polls() function:
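library(pollstR)
# object name is arbitrary; the chart slug comes from the chart's URL
trump_approval <- pollster_charts_polls("trump-job-approval")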
## Parsed with column specification:
## cols(
## Approve = col_double(),
## Disapprove = col_double(),
## Undecided = col_double(),
## poll_slug = col_character(),
## survey_house = col_character(),
## start_date = col_date(format = ""),
## end_date = col_date(format = ""),
## question_text = col_character(),
## sample_subpopulation = col_character(),
## observations = col_double(),
## margin_of_error = col_double(),
## mode = col_character(),
## partisanship = col_character(),
## partisan_affiliation = col_character()
## )
We can see what’s in the object:
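names(trump_approval)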
## [1] "content" "url" "response"
The url element links to the data itself:
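trump_approval$url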
## [1] "https://elections.huffingtonpost.com/pollster/api/v2/charts/trump-job-approval/pollster-chart-poll-questions.tsv"
The content contains the polling data:
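trump_approval$content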
## # A tibble: 970 x 14
## Approve Disapprove Undecided poll_slug survey_house start_date
## <dbl> <dbl> <dbl> <chr> <chr> <date>
## 1 49 50 NA rasmusse… Rasmussen 2018-12-03
## 2 43 48 8 yougov-e… YouGov/Econ… 2018-12-02
## 3 41 52 6 ipsos-re… Ipsos/Reute… 2018-11-28
## 4 48 50 NA rasmusse… Rasmussen 2018-11-28
## 5 40 56 NA gallup-2… Gallup 2018-11-26
## 6 40 52 9 yougov-e… YouGov/Econ… 2018-11-25
## 7 48 50 NA rasmusse… Rasmussen 2018-11-25
## 8 43 45 12 grinnell… Grinnell/Se… 2018-11-24
## 9 44 51 6 ipsos-re… Ipsos/Reute… 2018-11-21
## 10 38 60 NA gallup-2… Gallup 2018-11-18
## # … with 960 more rows, and 8 more variables: end_date <date>,
## # question_text <chr>, sample_subpopulation <chr>, observations <dbl>,
## # margin_of_error <dbl>, mode <chr>, partisanship <chr>,
## # partisan_affiliation <chr>
This might be useful if you were ever interested in using polling data.
In summary, useful packages for getting data into R include:

- jsonlite for JSON (e.g. the GitHub API)
- rvest to grab all the exact elements you want (e.g. book reviews)
- DBI, RSQLite, dbplyr for interacting with SQLite databases
- httr for tools to work with URLs and HTTP
- googlesheets to interact with Google Sheets in R
- googledrive (https://googledrive.tidyverse.org/) to interact with your Google Drive