N-Gram Tutorial in R

July 13, 2017

What are n-grams?

N-grams are used extensively in text mining and natural language processing tasks. An n-gram is basically a set of co-occurring words within a given window; to compute the n-grams, you typically slide that window forward one word at a time. For example, take the sentence “The cow jumps over the moon”. If N=2 (known as a bigram), the n-grams would be:

  • the cow
  • cow jumps
  • jumps over
  • over the
  • the moon

So you have 5 n-grams in this case.
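To make that concrete, here’s a quick sketch using RWeka’s NGramTokenizer (one of the packages we’ll load shortly) on that sentence; it should return the five bigrams above, with the original capitalization still intact:

    # Generate the bigrams for the example sentence.
    library(RWeka)

    sentence <- "The cow jumps over the moon"
    NGramTokenizer(sentence, Weka_control(min = 2, max = 2))
    # Expect the five bigrams listed above.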

A unigram model is basically a list of all the unique words in the corpus, which are called “tokens” or “terms” in machine learning parlance. Since a word may appear more than once, its frequency should be stored as well. We’ll use special term matrices for this.

What are N-grams used for?

N-grams are used for a variety of tasks. For example, when developing a language model, n-grams are used to build not just unigram models but also bigram and trigram models. These models can be used in tasks such as language translation, spelling correction, word breaking, and text summarization. N-grams are often used for simple word-prediction models, such as Google’s autocomplete. Unigrams can be used to find the most common terms in a body of text, which can be useful for summarization or historical analysis. See Google’s N-Gram Viewer for an example.

Coding – Start with the packages

We’ll need three packages for this tutorial. The tm package is mostly for cleaning text data and then creating term matrices such as document-term matrices (DTMs). The RWeka package provides the n-gram tokenizer that tm uses to build those matrices. The SnowballC package handles one extra cleaning step in particular: stemming. Stemming is an optional step, and if you use your n-grams for, say, word prediction, stemming does more harm than good. Take the following example:

“A penny saved is a penny _____”

The model should offer the word “earned”, but if the n-gram is stemmed, it will offer “earn” — the stemmed root — which isn’t helpful if you want to output the exact idiom. Stemming makes your database smaller, which is an important consideration, but reduces grammatical soundness. For this simple model, we won’t break up contractions, so a term like “we’ll” will not be broken up into “we” and “will”.
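If you want to see exactly what the stemmer does to individual words, SnowballC’s wordStem() (the stemmer that tm relies on) gives a quick preview; the words below are just examples:

    # Stem a few sample words.
    library(SnowballC)

    wordStem(c("earned", "saves", "battles", "battling"), language = "english")
    # Expect roots like "earn", "save", "battl", "battl" rather than full words.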

We’ll start by including the libraries, assuming the packages have already been installed.
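A minimal version of that setup, assuming the three packages named above are installed:

    # Load the three packages used throughout the tutorial.
    library(tm)        # cleaning text and building term matrices
    library(RWeka)     # n-gram tokenizer
    library(SnowballC) # stemming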

Now we’ll be working with text. I’ll assume you have obtained a bunch of unstructured text, likely by scraping or by acquiring one of the many free samples of formal, informal, and semi-formal text sources online. I took some phrases I found on the web and structured them into a simple vector for easy input. In reality, converting “unstructured” text into arrays of sentences, known as segmentation, is surprisingly difficult to roll yourself, but dedicated NLP libraries (such as NLTK in Python) can handle the segmentation for you.

The data

Here’s a very small sample of sentences put into a vector, which we’ll then turn into a corpus object:
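Since the original sample isn’t shown here, the vector below is a stand-in, chosen so the stemming examples later in the tutorial (battle/battling, becoming, Spider-Man) still line up:

    # Stand-in sample sentences; any small character vector works here.
    sentences <- c(
      "Spider-Man battles crime all over New York City.",
      "Peter Parker keeps battling to balance school with becoming Spider-Man.",
      "He becomes more confident after every battle.",
      "A penny saved is a penny earned."
    )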

Now we want to create a simple corpus object. This is a special class that holds our documents along with their metadata. It’s important to use VCorpus() here: the simple corpus produced by Corpus() ignores custom tokenizers, so you’d end up with just unigrams later on.
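Using the stand-in vector from above:

    # Wrap the character vector in a volatile corpus; each sentence becomes a document.
    corpus <- VCorpus(VectorSource(sentences))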

Cleaning the corpus

We’ll now standardize and clean our data, making values consistent and uniform for better matching:

  • Case: names and other string data would normally be converted to all upper or all lower case. VCorpus() seems to handle this on its own, and a plain lower-casing call interferes with VCorpus(), so I commented that step out.
  • Contractions: a fuller pipeline would expand contractions like “don’t” into “do” and “not”; as noted above, we skip that in this simple model.
  • Stopwords: words that occur frequently yet carry little meaning (“stopwords” such as “the”) are removed to reduce our dataset.
  • Numbers: numbers are treated as individual words and are removed because there might be too many of them. Also, many kinds of numbers, like SSNs or comment IDs, are useless in most contexts.
  • Punctuation: commas, periods, and the like are removed because they aren’t words and don’t greatly change the meaning of a sentence.
  • Whitespace: finally, any series of white spaces is merged into one. This is done last because stemming can leave extra spaces behind.

A more sophisticated model would keep compound nouns like “truck driver” or “blue jay” together, possibly with an underscore replacing the space — otherwise, they’re treated as independent tokens. Also, smoothing of the data can help reduce the effect of stopwords.

Notice that I’m stemming the document. This may greatly increase the processing time for data cleaning, so try to do the tm_map() data reduction methods before this step.

The lazy argument of tm_map() must be set to FALSE when working with VCorpus().
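A cleaning pass along these lines, using the standard tm transformations, covers the steps above:

    # Data-reduction steps first, then stemming, then whitespace cleanup.
    # A lower-casing step would go here, but it is commented out as discussed above:
    # corpus <- tm_map(corpus, content_transformer(tolower), lazy = FALSE)

    corpus <- tm_map(corpus, removeWords, stopwords("english"), lazy = FALSE)
    corpus <- tm_map(corpus, removeNumbers, lazy = FALSE)
    corpus <- tm_map(corpus, removePunctuation, lazy = FALSE)
    corpus <- tm_map(corpus, stemDocument, lazy = FALSE)    # uses SnowballC
    corpus <- tm_map(corpus, stripWhitespace, lazy = FALSE) # done last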

Tokenizing

We’ll make three functions that pass control parameters for Weka’s n-gram tokenizer to the TermDocumentMatrix() function. This allows us to define the length of our n-grams. The source will be the corpus, and the length is determined by the min and max parameters: (3, 3) will create a trigram. These functions will be passed as parameters later.
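A sketch of those three tokenizer functions (the names are arbitrary):

    # Each tokenizer fixes min and max to the same value, giving a single n-gram length.
    UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
    BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))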

Creating the Matrices

We’ll next create a TermDocumentMatrix for each n-gram, passing the control parameters we just made:
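For example, with the tokenizers sketched above:

    # One term-document matrix per n-gram length.
    tdm_unigram <- TermDocumentMatrix(corpus, control = list(tokenize = UnigramTokenizer))
    tdm_bigram  <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
    tdm_trigram <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))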

The TermDocumentMatrix() function takes a corpus and returns a TDM, which is a matrix describing the frequency of terms that occur in a collection of documents. In a term-document matrix, rows correspond to terms and columns correspond to documents in the collection (a document-term matrix is simply the transpose).

Let’s have a look at a sample of a TDM with the following function:
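tm’s inspect() does the job; here it’s applied to the unigram matrix from the sketch above:

    # Prints the matrix dimensions, sparsity, and a sample of term counts per document.
    inspect(tdm_unigram)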

The output lists the (stemmed) terms along with their counts in each document.

Notice how the stemming is somewhat overzealous. Rather than try to form a proper root, it cheats and takes the longest root that can’t be inflected, even if that root isn’t a word. Thus, “battles” and “battling” aren’t stemmed to “battle” but to “battl”, because the “e” can be affected by inflection (it’s deleted when the -ing suffix is added) while the “l” can stay, since it’s never affected by inflection. The same goes for “becom”, the root of “becoming” and “becomes”. Note that, with this simple stemmer, “became” would be treated as a separate word rather than being recognized as an ablaut form of the same verb.

Also notice that “Spider-Man” was changed to “spiderman”, with no hyphen or uppercase letters. This standardizes the word, since there may be many spellings, e.g. “Spiderman” or “Spider-man”. “Spider Man”, however, would be treated as two tokens in our simple model. But all is not lost: since bigrams and trigrams keep collocations, “man” will likely be suggested if “spider” is given to the model.

Let’s get the most frequent terms in our unigram TDM. We’ll use findFreqTerms(), which takes a TDM and the minimum number of instances of a term.
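For this tiny sample, a threshold of two instances is plenty:

    # Return every term that appears at least twice.
    findFreqTerms(tdm_unigram, lowfreq = 2)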

Here is the output:

That’s it! You now have three TDMs that you can use for analysis or modelling.

Here’s the entire code:
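The original listing isn’t reproduced here, but assembled from the sketches above it comes out roughly as follows:

    library(tm)
    library(RWeka)
    library(SnowballC)

    # Stand-in sample data
    sentences <- c(
      "Spider-Man battles crime all over New York City.",
      "Peter Parker keeps battling to balance school with becoming Spider-Man.",
      "He becomes more confident after every battle.",
      "A penny saved is a penny earned."
    )

    # Build the corpus
    corpus <- VCorpus(VectorSource(sentences))

    # Clean the corpus: data reduction first, then stemming, then whitespace
    corpus <- tm_map(corpus, removeWords, stopwords("english"), lazy = FALSE)
    corpus <- tm_map(corpus, removeNumbers, lazy = FALSE)
    corpus <- tm_map(corpus, removePunctuation, lazy = FALSE)
    corpus <- tm_map(corpus, stemDocument, lazy = FALSE)
    corpus <- tm_map(corpus, stripWhitespace, lazy = FALSE)

    # N-gram tokenizers
    UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
    BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

    # Term-document matrices
    tdm_unigram <- TermDocumentMatrix(corpus, control = list(tokenize = UnigramTokenizer))
    tdm_bigram  <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
    tdm_trigram <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))

    # Inspect and summarize
    inspect(tdm_unigram)
    findFreqTerms(tdm_unigram, lowfreq = 2)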

 

 
