Regular Expressions in R

What are Regular Expressions?

Regular expressions can be thought of as a combination of literals and metacharacters.
To draw an analogy with natural language, think of the literal text as forming the words of this language and the metacharacters as defining its grammar.
Regular expressions have a rich set of metacharacters.

Primary Metacharacters

The world of regular expressions is large, but there are a few key metacharacters that get used often.

Literals
Beginning/End of line
Character classes
Repetition
Parenthesized subexpression

Literals

Literals are just sequences of characters. They have no special meaning and are interpreted by R literally.

The word “brother” is a literal

"My brother also was killed."

fortune of a cursed elder brother, whom God confound. Jealousy, discord,

"What, is it you, reverend Father? You, the brother of the fair

brother, and cut my mother in pieces. A tall Bulgarian, six feet high,

brother, his saviour.

So is “Brother” (but not the same!).

journeying, the Holy Brotherhood entered the house; my lord the
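A minimal sketch of literal matching, assuming a small made-up vector `txt` that stands in for lines of the book (the vector is an illustration, not the actual data file). It shows that a literal is matched case-sensitively unless you ask otherwise.

```r
# Hypothetical sample lines standing in for the text being searched
txt <- c("My brother also was killed.",
         "journeying, the Holy Brotherhood entered the house",
         "The good old man smiled.")

# Literal matching is case sensitive by default
lower <- grepl("brother", txt)   # TRUE FALSE FALSE
upper <- grepl("Brother", txt)   # FALSE TRUE FALSE

# ignore.case = TRUE treats both spellings alike
either <- grepl("brother", txt, ignore.case = TRUE)   # TRUE TRUE FALSE
```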

Beginning/End of Line

^ matches at the beginning of a line
$ matches at the end of a line

^The

The lady then put a plump hand out from the bed, and Candide bathed it

The old woman spoke thus to Cunegonde:

The good old man smiled.

The skipper asked ten thousand piastres. Candide did not hesitate.

The Baron's lady weighed about three hundred and fifty pounds, and was

the$

arms. My dear Martin, yet once more Pangloss was right: all is for the

seen any one so beautiful as I, and that he never so much regretted the

the weak execrate the powerful, before whom they cringe; and the

plays with her is yet worse; and the play is still worse than the

    II. What became of Candide among the
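The two anchors can be tried on a few strings directly. This is a small sketch using a made-up vector `x` (an assumption for illustration); each element plays the role of one line.

```r
# Hypothetical lines; each vector element is treated as one line
x <- c("The good old man smiled.",
       "Pangloss was right: all is for the",
       "Candide bathed it")

# ^ anchors the pattern to the start of the string
starts_with_The <- grepl("^The", x)   # TRUE FALSE FALSE

# $ anchors the pattern to the end of the string
ends_with_the <- grepl("the$", x)     # FALSE TRUE FALSE
```

Note that "bathed" contains "the" but the third string still fails `the$`, because the match must sit at the very end.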

Character Classes

The square brackets [ and ] indicate classes of characters to search for.

[a-z]

may die of joy in her company."

"The Aga, who was a very gallant man, took his whole seraglio with him,

thousand times worse; the coolness of the magistrate and of the skipper

[A-Z]

"My name is Ivan. I was once Emperor of all the Russias, but was

"'My mission is done,' said this honest eunuch; 'I go to embark for

dreamed of Pangloss at every adventure told to him.

[0-9]

1.D.  The copyright laws of the place where you are located also govern

the real fact is I am a Manichean."[21]

[19] P. 78. The first English translator curiously gives "a tourene of

[Tt]he

volumes of theology, you may well imagine that neither I nor any one

black eunuchs and twenty soldiers. The Turks killed prodigious numbers

the harbour which could be sent to Buenos Ayres. The person to whom they

You can now combine character classes with beginning/end of line markers.

^[Tt]he

the family of my lady Baroness, and the fair Cunegonde. I swear to you

the miseries of poverty and slavery, had been ravished almost every day,

The conversation was long: it turned chiefly on their form of
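A short sketch of the combined pattern, again on a made-up vector `x` (assumed for illustration). It also shows one thing to watch for: `^[Tt]he` matches any line that merely starts with those three letters.

```r
# Hypothetical lines for illustration
x <- c("the family of my lady Baroness",
       "The conversation was long",
       "volumes of theology",
       "Then he left")

# Either case of the first letter, anchored at the start of the line
starts <- grepl("^[Tt]he", x)        # TRUE TRUE FALSE TRUE

# Adding a trailing space requires the whole word "the"/"The"
whole_word <- grepl("^[Tt]he ", x)   # TRUE TRUE FALSE FALSE
```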

Matching Anything

The . matches any single character. In the example below, 9.1 matches a 9, then any one character, then a 1.

9.1

Fairbanks, AK, 99712., but its volunteers and employees are scattered

[25] P. 109. Élie-Catherine Fréron was a French critic (1719-1776) who

[26] P. 111. Gabriel Gauchat (1709-1779), French ecclesiastical writer,
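To see what the dot does and does not match, here is a small sketch on a made-up vector `x` (an assumption for illustration). It also shows two standard ways to match a literal period instead: escaping it, or using `fixed = TRUE`.

```r
# Hypothetical strings for illustration
x <- c("Fairbanks, AK, 99712.", "(1709-1779)", "91", "9.1")

# . stands for exactly one character, so "91" (nothing in between) fails
any_char <- grepl("9.1", x)              # TRUE TRUE FALSE TRUE

# Escape the dot to match a literal period
literal  <- grepl("9\\.1", x)            # FALSE FALSE FALSE TRUE

# fixed = TRUE turns off regex interpretation entirely
fixed    <- grepl("9.1", x, fixed = TRUE) # FALSE FALSE FALSE TRUE
```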

Repetition

The + is used to indicate “repeat the immediately preceding symbol 1 or more times”.
The * is used to indicate “repeat the immediately preceding symbol 0 or more times”.
{m,n} is used to indicate “repeat the immediately preceding symbol between m and n times”; for example, [0-9]{4,6} matches a run of 4 to 6 digits.

[0-9]+

    | Notes [p. 170]; spelt Robek in the text [p. 53]) have      |

Release Date: November 27, 2006 [EBook #19942]

voluntary death, first printed in 1735.

2[0-9]*

[27] P. 112. Nicholas Charles Joseph Trublet (1697-1770) was a French

    | Page 172: rougish amended to roguish; crows amended to     |

[26] P. 111. Gabriel Gauchat (1709-1779), French ecclesiastical writer,

[0-9]{4,6}

uninteresting. Achmet III. (_b._ 1673, _d._ 1739) was dethroned in 1730.

Release Date: November 27, 2006 [EBook #19942]

Christian rites. In 1730 the "honours of sepulture" were refused to

he .* good

surprised at what he heard. Martin found there was a good deal of reason

Cacambo, and he loved his master, because his master was a very good

wish that he were here. Certainly, if all things are good, it is in El
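The repetition operators can be compared side by side. The sketch below uses one line quoted from the examples above plus a few made-up strings (`y`, `pages`), which are assumptions for illustration.

```r
x <- "uninteresting. Achmet III. (_b._ 1673, _d._ 1739) was dethroned in 1730."

# + : one or more digits; gregexpr() finds every run in the string
runs <- regmatches(x, gregexpr("[0-9]+", x))[[1]]   # "1673" "1739" "1730"

# * : zero or more, so a lone "2" already satisfies 2[0-9]*
pages <- c("Page 2", "Page 172", "Page 13")          # hypothetical strings
has_two <- grepl("2[0-9]*", pages)                   # TRUE TRUE FALSE

# {4,6} : between 4 and 6 digits; matching is greedy, so it takes up to 6
y <- c("p. 53", "EBook #19942", "1234567")           # hypothetical strings
four_to_six <- regmatches(y, regexpr("[0-9]{4,6}", y))  # "19942" "123456"
```

Elements without a match (here "p. 53") are dropped by regmatches() when it is used with regexpr().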

Parenthesized Subexpression

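() can be used to “capture” subexpressions; a back-reference such as \1 then matches the same text that the first captured subexpression matched.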

([0-9]+)\1

***** This file should be named 19942-8.txt or 19942-8.zip *****

Release Date: November 27, 2006 [EBook #19942]

[27] P. 112. Nicholas Charles Joseph Trublet (1697-1770) was a French

To match literal parentheses instead of capturing, escape them with a backslash.

\(.*\)

     you in writing (or by e-mail) within 30 days of receipt that s/he

skipper (were he even to rob him like the Surinam captain) to conduct

[35] P. 149. François Leopold Ragotsky (1676-1735).
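A brief sketch of how parentheses behave in R, using a made-up vector `x` (an assumption for illustration). Inside an R string each backslash has to be doubled, so the back-reference \1 is written "\\1" and an escaped parenthesis is written "\\(".

```r
# Hypothetical strings for illustration
x <- c("EBook #19942", "P. 112", "Trublet (1697-1770)", "P. 149")

# ([0-9]+)\1 : a run of digits immediately followed by the same run again
# ("99" in 19942, "11" in 112, "77" in 1770; nothing repeats in 149)
repeated <- grepl("([0-9]+)\\1", x)   # TRUE TRUE TRUE FALSE

# Captured groups can be reused in a replacement as \1, \2, ...
swapped <- sub("([0-9]+)-([0-9]+)", "\\2-\\1", x[3])   # "Trublet (1770-1697)"

# Escaped parentheses match the literal characters ( and )
has_parens <- grepl("\\(.*\\)", x)    # FALSE FALSE TRUE FALSE
```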

Regular Expression Functions

The primary R functions for dealing with regular expressions are

grep, grepl: Search for matches of a regular expression/pattern in a character vector; either return the indices into the character vector that match, the strings that happen to match, or a TRUE/FALSE vector indicating which elements match
regexpr, gregexpr: Search a character vector for regular expression matches and return the position in each string where the match begins and the length of the match
sub, gsub: Search a character vector for regular expression matches and replace that match with another string
regexec: Like regexpr(), but also reports the positions of any parenthesized sub-expressions. Easier to explain through demonstration.
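Before turning to real data, the functions listed above can be compared on a toy input. This is only a sketch; the vector `x` is made up for illustration.

```r
# Hypothetical character vector
x <- c("apple 12", "banana", "cherry 345 and 6")

idx  <- grep("[0-9]+", x)                 # indices of matching elements: 1 3
vals <- grep("[0-9]+", x, value = TRUE)   # the matching strings themselves
hit  <- grepl("[0-9]+", x)                # TRUE FALSE TRUE

# Start position of the first match in each string (-1 means no match)
pos  <- as.integer(regexpr("[0-9]+", x))  # 7 -1 8

# sub() replaces the first match only; gsub() replaces every match
once  <- sub("[0-9]+", "#", x)    # "apple #" "banana" "cherry # and 6"
every <- gsub("[0-9]+", "#", x)   # "apple #" "banana" "cherry # and #"
```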

grep

library(readr)
commit_log <- read_lines("../data/commit_logs_strip.txt.bz2")
head(commit_log)

[1] "commit 7f6ef08e80191712a5eb0d75c42931466e7bbe73"
[2] "Author: ZchMr <zc@.>"                           
[3] "Date:   Wed Oct 1 16:55:12 2014 -0400"          
[4] ""                                               
[5] "    date changes to pages/tickets"              
[6] ""

How many commits are there?

g <- grep("^commit", commit_log)
head(g)

[1]   1  64 179 208 246 275

length(g)

[1] 2384

Sometimes you want grep() to return the matching values instead of their indices.

g <- grep("^commit", commit_log, value = TRUE)
head(g)

[1] "commit 7f6ef08e80191712a5eb0d75c42931466e7bbe73"
[2] "commit 6fe5d43383d50c698993c9b46b33f08f4897c70f"
[3] "commit f24bed325a5a5a3989187edf704410d63e055efb"
[4] "commit 8b7272dff8ab78947cfcdc173673efccf43ed22a"
[5] "commit 45607de7dceedf44061f60024f9ddae99aeafb80"
[6] "commit ceb99122b6813e21067867f9cc16e62a2505adf2"

Who are the authors of these commits?

g <- grep("^Author", commit_log, value = TRUE, perl = TRUE)
head(g)

[1] "Author: ZchMr <zc@.>" "Author: ZchMr <zc@.>" "Author: DvdG <wh@.>" 
[4] "Author: DvdG <wh@.>"  "Author: DvdG <wh@.>"  "Author: DvdG <wh@.>"

length(unique(g))

[1] 18

grepl

By default, grep() returns the indices into the character vector where the regex pattern matches.

g <- grep("^Author", commit_log[1:100])
g

[1]  2 65

grepl() returns a logical vector indicating which element matches.

i <- grepl("^Author", commit_log[1:100])
i

  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
 [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[100] FALSE

Some limitations of grep():

The grep() function tells you which strings in a character vector match a certain pattern but it doesn’t tell you exactly where the match occurs or what the match is (for a more complicated regex).
The regexpr() function gives you the index into each string where the match begins and the length of the match for that string.
regexpr() only gives you the first match of the string (reading left to right). gregexpr() will give you all of the matches in a given string.
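The difference between regexpr() and gregexpr() is easy to see on a single string. The sketch below reuses one line from the earlier examples and pairs each function with regmatches() to pull out the matched text.

```r
x <- "[26] P. 111. Gabriel Gauchat (1709-1779), French ecclesiastical writer,"

# regexpr(): only the first match, reading left to right
first <- regmatches(x, regexpr("[0-9]+", x))          # "26"

# gregexpr(): every match; the result is a list with one element per string
all_matches <- regmatches(x, gregexpr("[0-9]+", x))[[1]]
# "26" "111" "1709" "1779"
```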

regexpr

How can we obtain the email addresses of the authors?

head(commit_log, 10)

 [1] "commit 7f6ef08e80191712a5eb0d75c42931466e7bbe73"                                                                     
 [2] "Author: ZchMr <zc@.>"                                                                                                
 [3] "Date:   Wed Oct 1 16:55:12 2014 -0400"                                                                               
 [4] ""                                                                                                                    
 [5] "    date changes to pages/tickets"                                                                                   
 [6] ""                                                                                                                    
 [7] "diff --git a/workbench/bts/boxoffice/src/views/review.blade.php b/workbench/bts/boxoffice/src/views/review.blade.php"
 [8] "index 0967709..6900b84 100644"                                                                                       
 [9] "--- a/workbench/bts/boxoffice/src/views/review.blade.php"                                                            
[10] "+++ b/workbench/bts/boxoffice/src/views/review.blade.php"

What if we use the regex <.*> and search for that?

We need to search the Author line for a pattern. We can first grep the Author lines and then search for a pattern.

author <- grep("^Author:", commit_log, value = TRUE, perl = TRUE)
head(author, 3)

[1] "Author: ZchMr <zc@.>" "Author: ZchMr <zc@.>" "Author: DvdG <wh@.>"

r <- regexpr("<.*>", author)
str(r)

 int [1:2384] 15 15 14 14 14 14 14 14 14 14 ...
 - attr(*, "match.length")= int [1:2384] 6 6 6 6 6 6 6 6 6 6 ...
 - attr(*, "index.type")= chr "chars"
 - attr(*, "useBytes")= logi TRUE

regexpr() returns a vector of integers indicating where the match starts in each string.
The attribute match.length indicates how long each match is.
If there’s no match, regexpr() returns -1 with a match.length of -1.
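A small sketch of the no-match case, assuming a two-element vector `x` modelled on the commit log lines (the vector itself is made up for illustration).

```r
# One line that contains <...> and one that does not
x <- c("Author: ZchMr <zc@.>", "Date:   Wed Oct 1 16:55:12 2014 -0400")

r <- regexpr("<.*>", x)

starts  <- as.integer(r)                        # 15 -1
lengths_of_match <- as.integer(attr(r, "match.length"))   # 6 -1

# regmatches() silently drops elements with no match,
# so the result can be shorter than the input
found <- regmatches(x, r)                       # "<zc@.>"
```

Because of that last point, it is worth checking `starts > 0` before lining the matches back up with the original vector.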

The obvious way to select out a match is to use the indices and the substr() function.

substr(author[1], 15, 15 + 6 - 1)

[1] "<zc@.>"

substr(author[3], 14, 14 + 6 - 1)

[1] "<wh@.>"

regmatches

We can also use the regmatches() function to just grab all of the matches at once.

r <- regexpr("<.*>", author)
m <- regmatches(author, r)
head(m)

[1] "<zc@.>" "<zc@.>" "<wh@.>" "<wh@.>" "<wh@.>" "<wh@.>"

sub/gsub

But we still don’t have actual email addresses. We need to remove the < and > characters. We can use the sub() function for that.

sub("<", "", m[1:5])

[1] "zc@.>" "zc@.>" "wh@.>" "wh@.>" "wh@.>"

sub(">", "", m[1:5])

[1] "<zc@." "<zc@." "<wh@." "<wh@." "<wh@."

But we want to remove both < and >!

We can use a regular expression in sub().

sub("<|>", "", m[1:5])

[1] "zc@.>" "zc@.>" "wh@.>" "wh@.>" "wh@.>"

gsub() substitutes all occurrences of the regex (g is for “global”).

gsub("<|>", "", m[1:5])

[1] "zc@." "zc@." "wh@." "wh@." "wh@."
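The same clean-up can be written in a couple of other ways. This is a sketch only, with a short made-up vector `m` standing in for the matches extracted earlier.

```r
# Hypothetical stand-in for the extracted matches
m <- c("<zc@.>", "<wh@.>")

# A character class is equivalent to the alternation <|>
first_only <- sub("[<>]", "", m)    # "zc@.>" "wh@.>"
both       <- gsub("[<>]", "", m)   # "zc@."  "wh@."

# Alternatively, capture what sits between the brackets and keep only that
captured <- sub("^<(.*)>$", "\\1", m)   # "zc@." "wh@."
```

The capture version only strips the brackets when they are at the two ends of the string, which can be safer if the text in between might itself contain < or >.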

regexec

The regexec() function can make the previous task a bit simpler by using parenthesized sub-expressions.

author[1]

[1] "Author: ZchMr <zc@.>"

We can capture the email address portion of the line with parentheses.

regexec("^Author: [^ ]+ <(.*)>", author[1])

[[1]]
[1]  1 16
attr(,"match.length")
[1] 20  4
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

r <- regexec("^Author: [^ ]+ <(.*)>", author[1])
regmatches(author[1], r)

[[1]]
[1] "Author: ZchMr <zc@.>" "zc@."
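The same idea extends to the whole vector of Author lines. The sketch below assumes a two-element vector `author` modelled on the log (made up here so the example runs on its own) and pulls out the captured group from each element.

```r
# Hypothetical stand-in for the Author lines
author <- c("Author: ZchMr <zc@.>", "Author: DvdG <wh@.>")

r <- regexec("^Author: [^ ]+ <(.*)>", author)
m <- regmatches(author, r)

# Each list element is c(full match, first captured group);
# element 2 is the text between < and >
emails <- vapply(m, function(el) el[2], character(1))   # "zc@." "wh@."
```

This assumes every element matches; for a non-matching line the list element is character(0) and `el[2]` would be NA.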

When were all of the commits made?

r <- regexec("^Date: +(.*)$", commit_log, perl = TRUE)
m <- regmatches(commit_log, r)
head(m)

[[1]]
character(0)

[[2]]
character(0)

[[3]]
[1] "Date:   Wed Oct 1 16:55:12 2014 -0400"
[2] "Wed Oct 1 16:55:12 2014 -0400"        

[[4]]
character(0)

[[5]]
character(0)

[[6]]
character(0)

Now we can subset the elements that match.

library(purrr)
u <- map_int(m, length) > 0
str(u)

 logi [1:7758154] FALSE FALSE TRUE FALSE FALSE FALSE ...

head(m[u])

[[1]]
[1] "Date:   Wed Oct 1 16:55:12 2014 -0400"
[2] "Wed Oct 1 16:55:12 2014 -0400"        

[[2]]
[1] "Date:   Wed Oct 1 16:14:22 2014 -0400"
[2] "Wed Oct 1 16:14:22 2014 -0400"        

[[3]]
[1] "Date:   Wed Oct 1 14:43:11 2014 -0400"
[2] "Wed Oct 1 14:43:11 2014 -0400"        

[[4]]
[1] "Date:   Wed Oct 1 14:42:44 2014 -0400"
[2] "Wed Oct 1 14:42:44 2014 -0400"        

[[5]]
[1] "Date:   Wed Oct 1 13:23:31 2014 -0400"
[2] "Wed Oct 1 13:23:31 2014 -0400"        

[[6]]
[1] "Date:   Wed Oct 1 13:12:39 2014 -0400"
[2] "Wed Oct 1 13:12:39 2014 -0400"
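The subsetting step above can also be done without purrr. This is a base R sketch, assuming a short made-up list `m` with the same shape as the regmatches() output; lengths() returns the length of each list element.

```r
# Hypothetical stand-in for the regmatches() result
m <- list(character(0),
          c("Date:   Wed Oct 1 16:55:12 2014 -0400",
            "Wed Oct 1 16:55:12 2014 -0400"),
          character(0))

# TRUE where regexec() found a match
u <- lengths(m) > 0                      # FALSE TRUE FALSE

# Keep only the captured date string (second element of each match)
date_strings <- vapply(m[u], function(el) el[2], character(1))
```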

Finally, we can parse the dates/times.

library(lubridate)
dates <- map_chr(m[u], 2) %>%
  parse_date_time("abd HMS Y z", tz = "America/New_York",
                  quiet = TRUE)
str(dates)

 POSIXct[1:2384], format: "2014-10-01 16:55:12" "2014-10-01 16:14:22" ...

head(dates)

[1] "2014-10-01 16:55:12 EDT" "2014-10-01 16:14:22 EDT"
[3] "2014-10-01 14:43:11 EDT" "2014-10-01 14:42:44 EDT"
[5] "2014-10-01 13:23:31 EDT" "2014-10-01 13:12:39 EDT"

Histogram

You can make a histogram of the dates.

hist(dates, "month", freq = TRUE)
rug(dates)

Summary