Primary Metacharacters
The world of regular expressions is large, but a few key metacharacters are used often.
Literals
Literals are just sequences of characters. They have no special meaning and are interpreted by R literally.
The word “brother” is a literal:
"My brother also was killed."
fortune of a cursed elder brother, whom God confound. Jealousy, discord,
"What, is it you, reverend Father? You, the brother of the fair
brother, and cut my mother in pieces. A tall Bulgarian, six feet high,
brother, his saviour.
So is “Brother”, but it is not the same literal, because matching is case-sensitive.
journeying, the Holy Brotherhood entered the house; my lord the
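Because literals are matched exactly, case matters. As a minimal sketch, assuming two of the lines above are stored in a made-up character vector x, base R's grepl() (described later in this section) shows how the two literals behave:

```r
# Two of the example lines above, stored in a small character vector
x <- c("My brother also was killed.",
       "journeying, the Holy Brotherhood entered the house; my lord the")

lower  <- grepl("brother", x)                      # TRUE FALSE
upper  <- grepl("Brother", x)                      # FALSE TRUE
either <- grepl("brother", x, ignore.case = TRUE)  # TRUE TRUE

print(lower)
print(upper)
print(either)
```

The ignore.case argument is one way to match both spellings; a character class such as [Bb]rother is another.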
Beginning/End of Line
^ indicates a match at the beginning of a line.
$ indicates a match at the end of a line.
^The
The lady then put a plump hand out from the bed, and Candide bathed it
The old woman spoke thus to Cunegonde:
The good old man smiled.
The skipper asked ten thousand piastres. Candide did not hesitate.
The Baron's lady weighed about three hundred and fifty pounds, and was
the$
arms. My dear Martin, yet once more Pangloss was right: all is for the
seen any one so beautiful as I, and that he never so much regretted the
the weak execrate the powerful, before whom they cringe; and the
plays with her is yet worse; and the play is still worse than the
II. What became of Candide among the
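The anchors can be tried on a small vector. This is only a sketch: the vector x is made up from lines shown above, and each element of the vector plays the role of one line.

```r
# Each element is treated as one "line" by the regex functions
x <- c("The good old man smiled.",
       "II. What became of Candide among the",
       "wish that he were here.")

starts_with_The <- grepl("^The", x)  # TRUE FALSE FALSE
ends_with_the   <- grepl("the$", x)  # FALSE TRUE FALSE

print(starts_with_The)
print(ends_with_the)
```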
Character Classes
The square brackets [ and ] indicate classes of characters to search for.
[a-z]
may die of joy in her company."
"The Aga, who was a very gallant man, took his whole seraglio with him,
thousand times worse; the coolness of the magistrate and of the skipper
[A-Z]
"My name is Ivan. I was once Emperor of all the Russias, but was
"'My mission is done,' said this honest eunuch; 'I go to embark for
dreamed of Pangloss at every adventure told to him.
[0-9]
1.D. The copyright laws of the place where you are located also govern
the real fact is I am a Manichean."[21]
[19] P. 78. The first English translator curiously gives "a tourene of
[Tt]he
volumes of theology, you may well imagine that neither I nor any one
black eunuchs and twenty soldiers. The Turks killed prodigious numbers
the harbour which could be sent to Buenos Ayres. The person to whom they
You can now combine character classes with beginning/end of line markers.
^[Tt]he
the family of my lady Baroness, and the fair Cunegonde. I swear to you
the miseries of poverty and slavery, had been ravished almost every day,
The conversation was long: it turned chiefly on their form of
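A short sketch of the difference the anchor makes, using a made-up vector x (the last element is invented to show that the pattern matches a prefix, not a whole word):

```r
x <- c("the family of my lady Baroness, and the fair Cunegonde.",
       "The conversation was long:",
       "black eunuchs and twenty soldiers. The Turks killed",
       "Then he left.")

anywhere <- grepl("[Tt]he", x)   # TRUE TRUE TRUE TRUE
at_start <- grepl("^[Tt]he", x)  # TRUE TRUE FALSE TRUE

print(anywhere)
print(at_start)
```

Note that "Then he left." matches ^[Tt]he as well, since the pattern only requires the line to begin with those three characters.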
Matching Anything
The . is used to match any single character.
9.1
Fairbanks, AK, 99712., but its volunteers and employees are scattered
[25] P. 109. Élie-Catherine Fréron was a French critic (1719-1776) who
[26] P. 111. Gabriel Gauchat (1709-1779), French ecclesiastical writer,
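To see what the . actually consumed, the matched text can be pulled out with regexpr() and regmatches(), both covered later in this section. This is a sketch on a made-up vector x; the last two elements are invented to contrast the metacharacter with an escaped, literal dot.

```r
x <- c("Fairbanks, AK, 99712.", "critic (1719-1776) who", "9.1", "91")

hits    <- grepl("9.1", x)                      # TRUE TRUE TRUE FALSE
matched <- regmatches(x, regexpr("9.1", x))     # "971" "9-1" "9.1"

# In an R string the backslash itself must be escaped, hence "\\."
literal <- grepl("9\\.1", x)                    # FALSE FALSE TRUE FALSE

print(hits)
print(matched)
print(literal)
```

The string "91" does not match 9.1, because the . has to consume exactly one character.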
Repetition
The + is used to indicate “repeat the immediately preceding symbol 1 or more times”.
The * is used to indicate “repeat the immediately preceding symbol 0 or more times”.
{} can be used to indicate a range of repetition; for example, {4,6} means “repeat the immediately preceding symbol between 4 and 6 times”.
[0-9]+
| Notes [p. 170]; spelt Robek in the text [p. 53]) have |
Release Date: November 27, 2006 [EBook #19942]
voluntary death, first printed in 1735.
2[0-9]*
[27] P. 112. Nicholas Charles Joseph Trublet (1697-1770) was a French
| Page 172: rougish amended to roguish; crows amended to |
[26] P. 111. Gabriel Gauchat (1709-1779), French ecclesiastical writer,
[0-9]{4,6}
uninteresting. Achmet III. (_b._ 1673, _d._ 1739) was dethroned in 1730.
Release Date: November 27, 2006 [EBook #19942]
Christian rites. In 1730 the "honours of sepulture" were refused to
he .* good
surprised at what he heard. Martin found there was a good deal of reason
Cacambo, and he loved his master, because his master was a very good
wish that he were here. Certainly, if all things are good, it is in El
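The repetition operators are easiest to compare by listing every match in one string. The sketch below uses gregexpr() with regmatches() (both introduced later in this section) on one of the lines above; the string y is invented to show that .* is greedy.

```r
x <- "Release Date: November 27, 2006 [EBook #19942]"

one_or_more <- regmatches(x, gregexpr("[0-9]+", x))[[1]]      # "27" "2006" "19942"
four_to_six <- regmatches(x, gregexpr("[0-9]{4,6}", x))[[1]]  # "2006" "19942"
two_then    <- regmatches(x, gregexpr("2[0-9]*", x))[[1]]     # "27" "2006" "2"

# .* is greedy: it runs to the LAST " good" in the string
y <- "he was good and he is good"
greedy <- regmatches(y, regexpr("he .* good", y))

print(one_or_more)
print(four_to_six)
print(two_then)
print(greedy)
```

The final "2" in the third result is the last digit of 19942: since * allows zero repetitions, a lone 2 is a complete match.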
Parenthesized Subexpression
() can be used to “capture” subexpressions, which can then be referred to later in the pattern as \1, \2, and so on.
([0-9]+)\1
***** This file should be named 19942-8.txt or 19942-8.zip *****
Release Date: November 27, 2006 [EBook #19942]
[27] P. 112. Nicholas Charles Joseph Trublet (1697-1770) was a French
\((.*)\)
you in writing (or by e-mail) within 30 days of receipt that s/he
skipper (were he even to rob him like the Surinam captain) to conduct
[35] P. 149. François Leopold Ragotsky (1676-1735).
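Both patterns can be tried in R, keeping in mind that every backslash has to be doubled inside an R string. The following is a sketch on a made-up two-element vector x built from lines above.

```r
x <- c("Release Date: November 27, 2006 [EBook #19942]",
       "you in writing (or by e-mail) within 30 days")

# ([0-9]+)\1 : a run of digits immediately followed by the same run again
repeated <- regmatches(x, regexpr("([0-9]+)\\1", x))   # "00" (from 2006)

# \((.*)\) : a literal "(" ... ")" pair with anything in between
in_parens <- regmatches(x, regexpr("\\((.*)\\)", x))   # "(or by e-mail)"

print(repeated)
print(in_parens)
```

Escaped parentheses match literal parenthesis characters, while unescaped ones only group and capture.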
Regular Expression Functions
The primary R functions for dealing with regular expressions are
grep(), grepl(): Search for matches of a regular expression/pattern in a character vector; return either the indices into the character vector that match, the strings that happen to match, or a TRUE/FALSE vector indicating which elements match.
regexpr(), gregexpr(): Search a character vector for regular expression matches and return the position in each string where the match begins and the length of the match.
sub(), gsub(): Search a character vector for regular expression matches and replace that match with another string.
regexec(): Easier to explain through demonstration.
grep
library(readr)
commit_log <- read_lines("../data/commit_logs_strip.txt.bz2")
head(commit_log)
[1] "commit 7f6ef08e80191712a5eb0d75c42931466e7bbe73"
[2] "Author: ZchMr <zc@.>"
[3] "Date: Wed Oct 1 16:55:12 2014 -0400"
[4] ""
[5] " date changes to pages/tickets"
[6] ""
How many commits are there?
g <- grep("^commit", commit_log)
head(g)
[1] 1 64 179 208 246 275
length(g)
[1] 2384
Sometimes you want grep() to return the value instead of the index.
g <- grep("^commit", commit_log, value = TRUE)
head(g)
[1] "commit 7f6ef08e80191712a5eb0d75c42931466e7bbe73"
[2] "commit 6fe5d43383d50c698993c9b46b33f08f4897c70f"
[3] "commit f24bed325a5a5a3989187edf704410d63e055efb"
[4] "commit 8b7272dff8ab78947cfcdc173673efccf43ed22a"
[5] "commit 45607de7dceedf44061f60024f9ddae99aeafb80"
[6] "commit ceb99122b6813e21067867f9cc16e62a2505adf2"
Who are the authors of these commits?
g <- grep("^Author", commit_log, value = TRUE, perl = TRUE)
head(g)
[1] "Author: ZchMr <zc@.>" "Author: ZchMr <zc@.>" "Author: DvdG <wh@.>"
[4] "Author: DvdG <wh@.>" "Author: DvdG <wh@.>" "Author: DvdG <wh@.>"
length(unique(g))
[1] 18
grepl
By default, grep() returns the indices into the character vector where the regex pattern matches.
g <- grep("^Author", commit_log[1:100])
g
[1] 2 65
grepl() returns a logical vector indicating which element matches.
i <- grepl("^Author", commit_log[1:100])
i
[1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[100] FALSE
Some limitations of grep():
The grep() function tells you which strings in a character vector match a certain pattern, but it doesn’t tell you exactly where the match occurs or what the match is (for a more complicated regex).
The regexpr() function gives you the index into each string where the match begins and the length of the match for that string.
regexpr() only gives you the first match in the string (reading left to right). gregexpr() will give you all of the matches in a given string.
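The differences between these functions can be seen side by side on a tiny example. The vector x below is invented for illustration (it is not taken from the commit log), and the second element deliberately contains two bracketed addresses.

```r
x <- c("commit abc123", "Author: A <a@x> <b@y>", "Date: today")

idx  <- grep("^Author", x)                # 2
val  <- grep("^Author", x, value = TRUE)  # "Author: A <a@x> <b@y>"
flag <- grepl("^Author", x)               # FALSE TRUE FALSE

# regexpr(): position and length of the FIRST match in each string
r      <- regexpr("<[^>]*>", x)
starts <- as.integer(r)                   # -1 11 -1
lens   <- attr(r, "match.length")         # -1  5 -1

# gregexpr(): ALL matches in each string, returned as a list
all_matches <- regmatches(x, gregexpr("<[^>]*>", x))[[2]]  # "<a@x>" "<b@y>"

print(idx); print(val); print(flag)
print(starts); print(lens)
print(all_matches)
```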
regexpr
How can we obtain the email addresses of the authors?
head(commit_log, 10)
[1] "commit 7f6ef08e80191712a5eb0d75c42931466e7bbe73"
[2] "Author: ZchMr <zc@.>"
[3] "Date: Wed Oct 1 16:55:12 2014 -0400"
[4] ""
[5] " date changes to pages/tickets"
[6] ""
[7] "diff --git a/workbench/bts/boxoffice/src/views/review.blade.php b/workbench/bts/boxoffice/src/views/review.blade.php"
[8] "index 0967709..6900b84 100644"
[9] "--- a/workbench/bts/boxoffice/src/views/review.blade.php"
[10] "+++ b/workbench/bts/boxoffice/src/views/review.blade.php"
What if we use the regex <(.*)> and search for that?
We need to search the Author line for a pattern. We can first grep the Author lines and then search for a pattern.
author <- grep("^Author:", commit_log, value = TRUE, perl = TRUE)
head(author, 3)
[1] "Author: ZchMr <zc@.>" "Author: ZchMr <zc@.>" "Author: DvdG <wh@.>"
r <- regexpr("<.*>", author)
str(r)
int [1:2384] 15 15 14 14 14 14 14 14 14 14 ...
- attr(*, "match.length")= int [1:2384] 6 6 6 6 6 6 6 6 6 6 ...
- attr(*, "index.type")= chr "chars"
- attr(*, "useBytes")= logi TRUE
regexpr() returns a vector of integers indicating where the match starts.
The attribute match.length indicates how long the match is.
If there’s no match, regexpr() returns -1 with a match.length of -1.
The obvious way to select out a match is to use the indices and the substr() function.
substr(author[1], 15, 15 + 6 - 1)
[1] "<zc@.>"
substr(author[3], 14, 14 + 6 - 1)
[1] "<wh@.>"
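The start positions and lengths do not have to be typed in by hand; they can be read off the object that regexpr() returns. A minimal sketch, assuming a small stand-in for the author vector (the third element is invented to show the no-match case):

```r
author <- c("Author: ZchMr <zc@.>", "Author: DvdG <wh@.>", "no address here")

r     <- regexpr("<.*>", author)
start <- as.integer(r)             # 15 14 -1
len   <- attr(r, "match.length")   #  6  6 -1

# Keep only the strings that matched, then cut out start .. start + len - 1
found  <- start > 0
emails <- substr(author[found], start[found], start[found] + len[found] - 1)

print(emails)  # "<zc@.>" "<wh@.>"
```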
regmatches
We can also use the regmatches() function to just grab all of the matches at once.
r <- regexpr("<.*>", author)
m <- regmatches(author, r)
head(m)
[1] "<zc@.>" "<zc@.>" "<wh@.>" "<wh@.>" "<wh@.>" "<wh@.>"
sub/gsub
But we still don’t have actual email addresses. We need to remove the < and > characters. We can use the sub() function for that.
sub("<", "", m[1:5])
[1] "zc@.>" "zc@.>" "wh@.>" "wh@.>" "wh@.>"
sub(">", "", m[1:5])
[1] "<zc@." "<zc@." "<wh@." "<wh@." "<wh@."
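But we want to remove both < and >! We can use a regular expression in sub(). However, sub() replaces only the first match in each string, so the > is left behind.
sub("<|>", "", m[1:5])
[1] "zc@.>" "zc@.>" "wh@.>" "wh@.>" "wh@.>"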
gsub() substitutes all occurrences of the regex (g is for “global”).
gsub("<|>", "", m[1:5])
[1] "zc@." "zc@." "wh@." "wh@." "wh@."
regexec
The regexec() function can make the previous task a bit simpler by using parenthesized sub-expressions.
author[1]
[1] "Author: ZchMr <zc@.>"
We can capture the email address portion of the line with parentheses.
regexec("^Author: [^ ]+ <(.*)>", author[1])
[[1]]
[1] 1 16
attr(,"match.length")
[1] 20 4
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
r <- regexec("^Author: [^ ]+ <(.*)>", author[1])
regmatches(author[1], r)
[[1]]
[1] "Author: ZchMr <zc@.>" "zc@."
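The same idea extends to a whole vector: regmatches() returns one list element per string, with the full match first and the captured group second, or character(0) when the string does not match. The sketch below uses a small made-up vector in place of the full author data, and base R's vapply() to pull out the second element.

```r
author <- c("Author: ZchMr <zc@.>", "Author: DvdG <wh@.>", "Date: Wed Oct 1")

r <- regexec("^Author: [^ ]+ <(.*)>", author)
m <- regmatches(author, r)

# Drop non-matching strings (length 0), then take the captured group
matched <- lengths(m) > 0
emails  <- vapply(m[matched], function(p) p[2], character(1))

print(emails)  # "zc@." "wh@."
```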
When were all of the commits made?
r <- regexec("^Date: +(.*)$", commit_log, perl = TRUE)
m <- regmatches(commit_log, r)
head(m)
[[1]]
character(0)
[[2]]
character(0)
[[3]]
[1] "Date: Wed Oct 1 16:55:12 2014 -0400"
[2] "Wed Oct 1 16:55:12 2014 -0400"
[[4]]
character(0)
[[5]]
character(0)
[[6]]
character(0)
Now we can subset the elements that match.
library(purrr)
u <- map_int(m, length) > 0
str(u)
logi [1:7758154] FALSE FALSE TRUE FALSE FALSE FALSE ...
head(m[u])
[[1]]
[1] "Date: Wed Oct 1 16:55:12 2014 -0400"
[2] "Wed Oct 1 16:55:12 2014 -0400"
[[2]]
[1] "Date: Wed Oct 1 16:14:22 2014 -0400"
[2] "Wed Oct 1 16:14:22 2014 -0400"
[[3]]
[1] "Date: Wed Oct 1 14:43:11 2014 -0400"
[2] "Wed Oct 1 14:43:11 2014 -0400"
[[4]]
[1] "Date: Wed Oct 1 14:42:44 2014 -0400"
[2] "Wed Oct 1 14:42:44 2014 -0400"
[[5]]
[1] "Date: Wed Oct 1 13:23:31 2014 -0400"
[2] "Wed Oct 1 13:23:31 2014 -0400"
[[6]]
[1] "Date: Wed Oct 1 13:12:39 2014 -0400"
[2] "Wed Oct 1 13:12:39 2014 -0400"
Finally, we can parse the dates/times.
library(lubridate)
dates <- map_chr(m[u], 2) %>%
parse_date_time("abd HMS Y z", tz = "America/New_York",
quiet = TRUE)
str(dates)
POSIXct[1:2384], format: "2014-10-01 16:55:12" "2014-10-01 16:14:22" ...
head(dates)
[1] "2014-10-01 16:55:12 EDT" "2014-10-01 16:14:22 EDT"
[3] "2014-10-01 14:43:11 EDT" "2014-10-01 14:42:44 EDT"
[5] "2014-10-01 13:23:31 EDT" "2014-10-01 13:12:39 EDT"
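The text uses lubridate for the parsing step. For comparison, here is a sketch of the same step in base R using strptime(), which is a swapped-in technique and not the method used above. It assumes the date strings look exactly like the captured group shown earlier, and it switches the time locale to "C" so that the English day and month abbreviations are recognised.

```r
# Two captured date strings, copied from the output above
stamp <- c("Wed Oct 1 16:55:12 2014 -0400",
           "Wed Oct 1 16:14:22 2014 -0400")

# %a and %b are locale-dependent; "C" gives the English abbreviations
invisible(Sys.setlocale("LC_TIME", "C"))

# %z reads the -0400 offset; the result is expressed in the tz given here
dates_utc <- as.POSIXct(strptime(stamp, "%a %b %d %H:%M:%S %Y %z", tz = "UTC"))

shown <- format(dates_utc, "%Y-%m-%d %H:%M:%S", tz = "UTC")
print(shown)  # "2014-10-01 20:55:12" "2014-10-01 20:14:22"
```

The times print four hours later than in the output above only because they are displayed in UTC instead of America/New_York; they refer to the same instants.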
Histogram
You can make a histogram of the dates, binned by month, and mark the individual commit times with a rug.
hist(dates, "month", freq = TRUE)
rug(dates)