N-Gram Tutorial in R

July 13, 2017

What are n-grams?

N-grams are used extensively in text mining and natural language processing tasks. An n-gram is basically a set of co-occurring words within a given window; when computing the n-grams of a text, you typically slide that window forward one word at a time. Take the sentence “The cow jumps over the moon”. If N=2 (known as a bigram), then the n-grams are:

  • the cow
  • cow jumps
  • jumps over
  • over the
  • the moon

So you have 5 n-grams in this case.
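This sliding window is easy to reproduce in base R before bringing in any packages: split the sentence into tokens, then paste each word together with its successor.

```r
# Generate bigrams for the example sentence with base R:
# pair every word (all but the last) with its successor (all but the first).
sentence <- "The cow jumps over the moon"
words <- strsplit(tolower(sentence), " ")[[1]]
bigrams <- paste(head(words, -1), tail(words, -1))
bigrams
# [1] "the cow"    "cow jumps"  "jumps over" "over the"   "the moon"
```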

A unigram is basically a list of all unique words in the corpus, which are called “tokens” or “terms” in machine-learning parlance. Since a word may appear more than once, its frequency should be stored as well. We’ll use special term matrices for this.

What are N-grams used for?

N-grams are used for a variety of different tasks. For example, when developing a language model, n-grams are used to build not just unigram models but also bigram and trigram models. These models can be used in a variety of tasks such as language translation, spelling correction, word breaking, and text summarization. N-grams are often used for simple word-prediction models, such as Google’s autocomplete. Unigrams can be used to find the most common terms in a body of text, which can be useful for summarization or historical analysis. See Google’s N-Gram Viewer for an example.

Coding – Start with the packages

We’ll need two packages for this tutorial (stemming additionally requires SnowballC, which tm calls internally). The tm package is mostly for cleaning text data and creating term-document matrices (TDMs). The RWeka package provides the n-gram tokenizer that we’ll plug into tm. Stemming is an optional step, and if you use your n-grams for, say, word prediction, stemming does more harm than good. Take the following example:

“A penny saved is a penny _____”

The model should offer the word “earned”, but if the n-gram is stemmed, it will offer “earn”, the stemmed root, which isn’t helpful if you want to output the exact idiom. Stemming makes your database smaller, which is an important consideration, but it reduces grammatical soundness. For this simple model, we also won’t break up contractions, so a term like “we’ll” will not be split into “we” and “will”.
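You can see the effect directly with the SnowballC package, which supplies the stemmer that tm’s stemDocument() calls under the hood:

```r
library(SnowballC)

# Stem each word of the idiom individually; none of the stemmed
# forms would let a prediction model complete the exact phrase.
wordStem(c("penny", "saved", "earned"), language = "english")
# [1] "penni" "save"  "earn"
```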

We’ll start by including the libraries, assuming the packages have already been installed.

library(tm)
library(RWeka)

Now we’ll be working with text. I’ll assume you’ve obtained a bunch of unstructured text, likely by scraping or by acquiring one of the many free samples of formal, informal, and semi-formal text sources online. I took some phrases I found on the web and structured them into a simple vector for easy input. In reality, converting “unstructured” text into arrays of sentences, known as segmentation, is surprisingly difficult to roll yourself, but packages like openNLP (in R) or NLTK (in Python) can segment.

The data

Here’s a very small sample of sentences put into a vector, then into a corpus object:

text <- c(
  "Spider-Man is a fictional superhero appearing in American comic books published by Marvel Comics.",
  "The character was created by writer-editor Stan Lee and writer-artist Steve Ditko, and first appeared in the comic Amazing Fantasy #15",
  "Lee and Ditko conceived the character as an orphan being raised by his Aunt May and Uncle Ben in New York City",
  "Spider-Man's creators gave him super strength and agility, the ability to cling to most surfaces",
  "The Spider-Man series broke ground by featuring Peter Parker, the high school student behind Spider-Man's secret identity",
  "Marvel has featured Spider-Man in several comic book series",
  "Spider-Man is one of the most popular and commercially successful superheroes",
  "As Marvel's flagship character and company mascot, he has appeared in countless forms of media, including comic strips",
  "Spider-Man has been well received as a superhero and comic book character",
  "ranked as one of the most popular comic book characters of all time, alongside DC Comics' most famous superheroes, Batman and Superman.",
  "Midtown High School student Peter Parker is a science-whiz orphan living with his Uncle Ben and Aunt May.",
  "he is bitten by a radioactive spider",
  "Through his knack for science, he develops a gadget that lets him fire adhesive webbing of his own design through , wrist-mounted barrels",
  "Despite his superpowers, Parker struggles to help his widowed aunt pay rent, is taunted by his peers",
  "Parker finds juggling his personal life and costumed adventures difficult.",
  "he meets roommate and best friend Harry Osborn, and girlfriend Gwen Stacy,[51] and Aunt May introduces him to Mary Jane Watson",
  "Working through his grief, Parker eventually develops tentative feelings toward Watson",
  "From 1984 to 1988, Spider-Man wore a black costume with a white spider design on his chest.",
  "The new costume originated in the Secret Wars limited series, where Spider-Man participates in a battle between Earth's major superheroes.",
  "Parker proposes to Watson a second time in The Amazing Spider-Man #290, with the wedding taking place in The Amazing Spider-Man Annual",
  "Following the \"reboot\", Parker's identity was no longer known to the general public",
  "Agonizing over his choices, always attempting to do right, he is nonetheless viewed with suspicion by the authorities",
  "Spider-Man's plight was to be misunderstood and persecuted by the very public that he swore to protect.",
  "In the first issue of The Amazing Spider-Man, J. Jonah Jameson, launches an editorial campaign against the \"Spider-Man menace.\"",
  "as early 1960s Marvel stories had often dealt with the Cold War and Communism",
  "From his high-school beginnings to his entry into college life, Spider-Man remained the superhero most relevant to young people.",
  "Fittingly, then, his comic book also contained some of the earliest references to the politics of young people.",
  "A bite from a radioactive spider triggers mutations in Peter Parker's body, granting him superpowers.",
  "Spider-Man has the ability to cling to walls, superhuman strength, a sixth sense (\"spider-sense\") that alerts him to danger",
  "With his talents, he sews his own costume to conceal his identity",
  "he constructs many devices that complement his powers, most notably mechanical web-shooters.",
  "After his parents died, Peter Parker was raised by his loving aunt, May Parker, and his uncle and father figure, Ben Parker.",
  "After Uncle Ben is murdered by a burglar, Aunt May is virtually Peter's only family, and she and Peter are very close.",
  "Eugene \"Flash\" Thompson is commonly depicted as Parker's high school bully, but in later comic issues he becomes a friend to Peter.",
  "Spider-Man has become Marvel's flagship character, and has often been used as the company mascot.",
  "Since 1962, hundreds of millions of comics featuring the character have been sold around the world.",
  "Spider-Man is the world's most profitable superhero.",
  "The culmination of nearly every superhero that came before him, Spider-Man is the hero of heroes",
  "Wizard magazine placed Spider-Man as the third greatest comic book character on their website.",
  "the title of the series was changed to Peter Parker: Spider-Man to establish that the original Spider-Man was being depicted."
)

Now, we want to create a simple corpus object. This is a special class, similar to a matrix, that carries metadata. It’s important to use VCorpus() here: with the lighter-weight Corpus() (a SimpleCorpus), the custom tokenizers we define later are ignored when the term matrices are built, and you’ll end up with just unigrams.

corpus <- VCorpus(VectorSource(text))

Cleaning the corpus

We’ll now standardize and clean our data, making values consistent and uniform for better instance matching:

  • Case. Names and other string data are usually converted to all upper or all lower case. Applying the base tolower() directly breaks VCorpus() (it returns bare character vectors instead of documents), so that step is commented out below; TermDocumentMatrix() lowercases terms by default anyway.
  • Contractions. In a more thorough pipeline, contractions like “don’t” are expanded to “do” and “not”; as noted earlier, we skip that here.
  • Stopwords. Words that occur frequently yet carry little meaning (such as “the”) are removed to reduce our dataset.
  • Numbers. Numbers are treated as individual words and are removed because there might be too many of them. Also, many kinds of numbers, like SSNs or comment IDs, are useless in most contexts.
  • Punctuation. Commas, periods, and the like are removed because they aren’t words and don’t greatly change the meaning of a sentence.
  • Whitespace. Finally, any series of whitespace is merged into one space. This is done last because stemming can leave extra spaces behind.

A more sophisticated model would keep compound nouns like “truck driver” or “blue jay” together, possibly with an underscore replacing the space — otherwise, they’re treated as independent tokens. Also, smoothing of the data can help reduce the effect of stopwords.
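One cheap way to approximate this, assuming you maintain a hand-made list of compound nouns, is a search-and-replace pass before tokenizing (the phrase list and function name below are made up for illustration):

```r
# Join known multi-word phrases with underscores so the tokenizer
# treats each one as a single token. The phrase list is hypothetical.
compounds <- c("truck driver", "blue jay")
protectCompounds <- function(x) {
  for (phrase in compounds) {
    x <- gsub(phrase, gsub(" ", "_", phrase), x, fixed = TRUE)
  }
  x
}

protectCompounds("the truck driver spotted a blue jay")
# [1] "the truck_driver spotted a blue_jay"
```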

Notice that I’m stemming the document. This may greatly increase the processing time for data cleaning, so try to do the tm_map() data reduction methods before this step.

corpus <- tm_map(corpus, removeNumbers, lazy = FALSE)
corpus <- tm_map(corpus, removePunctuation, lazy = FALSE)
#corpus <- tm_map(corpus, content_transformer(tolower), lazy = FALSE) #optional: TermDocumentMatrix() lowercases by default; plain tolower() would break VCorpus().
corpus <- tm_map(corpus, removeWords, stopwords("english"), lazy = FALSE)
corpus <- tm_map(corpus, stemDocument, language = "english") 
corpus <- tm_map(corpus, stripWhitespace, lazy = FALSE)

Lazy must be set to FALSE when working with VCorpus().

Tokenizing

We’ll make three functions that pass control parameters for Weka’s n-gram tokenizer to the TermDocumentMatrix() function. This allows us to define the length of our n-grams. The source will be the corpus and the length is determined by the minimum and maximum variables: (3, 3) will create a trigram. These functions will be called as parameters later.

unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

Creating the Matrices

We’ll next create a TermDocumentMatrix for each n-gram, passing the control parameters we just made:

unigramTDM = TermDocumentMatrix(corpus, control = list(tokenize = unigramTokenizer))
bigramTDM = TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))
trigramTDM = TermDocumentMatrix(corpus, control = list(tokenize = trigramTokenizer))

The TermDocumentMatrix() function takes a corpus and returns a TDM, which is a matrix that describes the frequency of terms occurring in a collection of documents. In such a matrix, rows correspond to terms and columns correspond to documents (a document-term matrix, or DTM, is simply the transpose).
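If the shape is hard to picture, here is a toy TDM built with base R only (the documents and vocabulary are made up), with terms as rows and documents as columns:

```r
# A miniature term-document matrix from scratch: count how often each
# term in the vocabulary occurs in each (already tokenized) document.
docs <- list(d1 = c("cow", "moon", "cow"), d2 = c("moon", "grass"))
terms <- sort(unique(unlist(docs)))
tdm <- sapply(docs, function(d) table(factor(d, levels = terms)))
tdm
#       d1 d2
# cow    2  0
# grass  0  1
# moon   1  1
```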

Let’s have a look at a sample of a TDM with the following function:

inspect(trigramTDM[20:30,1:10])

This gives us the output below:

                           Docs
Terms                       1 10 2 3 4 5 6 7 8 9
  attempt right nonetheless 0  0 0 0 0 0 0 0 0 0
  aunt may introduc         0  0 0 0 0 0 0 0 0 0
  aunt may parker           0  0 0 0 0 0 0 0 0 0
  aunt may uncle            0  0 0 1 0 0 0 0 0 0
  aunt may virtual          0  0 0 0 0 0 0 0 0 0
  aunt pay rent             0  0 0 0 0 0 0 0 0 0
  battl earth major         0  0 0 0 0 0 0 0 0 0
  becom friend peter        0  0 0 0 0 0 0 0 0 0
  becom marvel flagship     0  0 0 0 0 0 0 0 0 0
  behind spiderman secret   0  0 0 0 0 1 0 0 0 0

Notice how the stemming is somewhat overzealous. Rather than trying to form a proper root, it cheats and takes the longest root that can’t be inflected, even if it isn’t a word. Thus, “battles” and “battling” aren’t stemmed to “battle” but to “battl”, because the “e” can be affected by inflection (it’s deleted when the -ing suffix is attached), while the “l” can stay because it’s never affected by inflection. The same goes for “becom”, the root of “becoming” and “becomes”. Note that, with this simple stemmer, “became” would be treated as a separate word rather than recognized as an ablaut form.

Also notice that “Spider-Man” was changed to “spiderman”, with no hyphen or uppercase letters. This standardizes the word, as there may be many spellings, e.g., “Spiderman” or “Spider-man”. “Spider Man”, however, would be treated as two tokens in our simple model. But all is not lost: since bigrams and trigrams keep collocations, “spider” will likely be returned if “man” is given to the model.

Let’s get the most frequent terms in our unigram TDM. We’ll use findFreqTerms(), which takes a TDM and the minimum number of instances of a term.

findFreqTerms(unigramTDM, lowfreq=8)

Here is the output:

[1] "charact"   "comic"     "parker"    "peter"     "spiderman" "superhero"

That’s it! You now have three TDMs that you can use for analysis or modelling.
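As a taste of the word-prediction use case mentioned earlier, here is a minimal next-word model built from raw bigram counts in base R (the two toy documents and the function name are made up; in practice you would read the counts out of bigramTDM instead):

```r
# A toy next-word predictor: count bigrams per document (so no bigram
# spans a document boundary), then return the most frequent successor.
docs <- c("the cow jumps over the moon", "the cow eats grass")
tokens <- strsplit(docs, " ")
bigrams <- do.call(rbind, lapply(tokens, function(w) {
  data.frame(w1 = head(w, -1), w2 = tail(w, -1))
}))

predictNext <- function(word) {
  successors <- bigrams$w2[bigrams$w1 == word]
  if (length(successors) == 0) return(NA_character_)
  names(sort(table(successors), decreasing = TRUE))[1]
}

predictNext("the")  # "cow" follows "the" twice, "moon" only once
```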

Here’s the entire code:


library(tm)
library(RWeka)

text <- c(
  "Spider-Man is a fictional superhero appearing in American comic books published by Marvel Comics.",
  "The character was created by writer-editor Stan Lee and writer-artist Steve Ditko, and first appeared in the comic Amazing Fantasy #15",
  "Lee and Ditko conceived the character as an orphan being raised by his Aunt May and Uncle Ben in New York City",
  "Spider-Man's creators gave him super strength and agility, the ability to cling to most surfaces",
  "The Spider-Man series broke ground by featuring Peter Parker, the high school student behind Spider-Man's secret identity",
  "Marvel has featured Spider-Man in several comic book series",
  "Spider-Man is one of the most popular and commercially successful superheroes",
  "As Marvel's flagship character and company mascot, he has appeared in countless forms of media, including comic strips",
  "Spider-Man has been well received as a superhero and comic book character",
  "ranked as one of the most popular comic book characters of all time, alongside DC Comics' most famous superheroes, Batman and Superman.",
  "Midtown High School student Peter Parker is a science-whiz orphan living with his Uncle Ben and Aunt May.",
  "he is bitten by a radioactive spider",
  "Through his knack for science, he develops a gadget that lets him fire adhesive webbing of his own design through , wrist-mounted barrels",
  "Despite his superpowers, Parker struggles to help his widowed aunt pay rent, is taunted by his peers",
  "Parker finds juggling his personal life and costumed adventures difficult.",
  "he meets roommate and best friend Harry Osborn, and girlfriend Gwen Stacy,[51] and Aunt May introduces him to Mary Jane Watson",
  "Working through his grief, Parker eventually develops tentative feelings toward Watson",
  "From 1984 to 1988, Spider-Man wore a black costume with a white spider design on his chest.",
  "The new costume originated in the Secret Wars limited series, where Spider-Man participates in a battle between Earth's major superheroes.",
  "Parker proposes to Watson a second time in The Amazing Spider-Man #290, with the wedding taking place in The Amazing Spider-Man Annual",
  "Following the \"reboot\", Parker's identity was no longer known to the general public",
  "Agonizing over his choices, always attempting to do right, he is nonetheless viewed with suspicion by the authorities",
  "Spider-Man's plight was to be misunderstood and persecuted by the very public that he swore to protect.",
  "In the first issue of The Amazing Spider-Man, J. Jonah Jameson, launches an editorial campaign against the \"Spider-Man menace.\"",
  "as early 1960s Marvel stories had often dealt with the Cold War and Communism",
  "From his high-school beginnings to his entry into college life, Spider-Man remained the superhero most relevant to young people.",
  "Fittingly, then, his comic book also contained some of the earliest references to the politics of young people.",
  "A bite from a radioactive spider triggers mutations in Peter Parker's body, granting him superpowers.",
  "Spider-Man has the ability to cling to walls, superhuman strength, a sixth sense (\"spider-sense\") that alerts him to danger",
  "With his talents, he sews his own costume to conceal his identity",
  "he constructs many devices that complement his powers, most notably mechanical web-shooters.",
  "After his parents died, Peter Parker was raised by his loving aunt, May Parker, and his uncle and father figure, Ben Parker.",
  "After Uncle Ben is murdered by a burglar, Aunt May is virtually Peter's only family, and she and Peter are very close.",
  "Eugene \"Flash\" Thompson is commonly depicted as Parker's high school bully, but in later comic issues he becomes a friend to Peter.",
  "Spider-Man has become Marvel's flagship character, and has often been used as the company mascot.",
  "Since 1962, hundreds of millions of comics featuring the character have been sold around the world.",
  "Spider-Man is the world's most profitable superhero.",
  "The culmination of nearly every superhero that came before him, Spider-Man is the hero of heroes",
  "Wizard magazine placed Spider-Man as the third greatest comic book character on their website.",
  "the title of the series was changed to Peter Parker: Spider-Man to establish that the original Spider-Man was being depicted."
  )

corpus <- VCorpus(VectorSource(text)) #With this, biGrams are created as desired

corpus <- tm_map(corpus, removeNumbers, lazy = FALSE)
corpus <- tm_map(corpus, removePunctuation, lazy = FALSE)
#corpus <- tm_map(corpus, content_transformer(tolower), lazy = FALSE) #optional: TermDocumentMatrix() lowercases by default; plain tolower() would break VCorpus().
corpus <- tm_map(corpus, removeWords, stopwords("english"), lazy = FALSE)
corpus <- tm_map(corpus, stemDocument, language = "english") 
corpus <- tm_map(corpus, stripWhitespace, lazy = FALSE)

unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

unigramTDM = TermDocumentMatrix(corpus, control = list(tokenize = unigramTokenizer))
bigramTDM = TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))
trigramTDM = TermDocumentMatrix(corpus, control = list(tokenize = trigramTokenizer))

inspect(trigramTDM[20:30,1:10])
findFreqTerms(unigramTDM, lowfreq=8)

