Tips on cleaning English text data for analysis

April 22, 2019

Here’s some advice on how to clean natural text for data analysis. These suggestions are meant for English text. They’re listed in order of how useful I think they are, not the order in which you should apply them. For example, you would need to do safe reduction before deleting stop words.

1. Keep copies

Keep a copy of your original corpus and of every major change, with an annotation in the file or folder name. It’s an easy step that gives you a simple version control system. It’s also good to keep a log of major and minor mutations. If you don’t have the space for all the cleansed corpora, use offline backups or delete that version, but record the change in a log.
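Here’s a minimal sketch of that kind of lightweight versioning, assuming the corpus lives in a single file; the paths and annotation strings are only placeholders:

import shutil
from datetime import date
from pathlib import Path

def snapshot(corpus_path, annotation, log_path="cleaning_log.txt"):
    """Copy the corpus under an annotated name and record the change in a log."""
    src = Path(corpus_path)
    dst = src.with_name(f"{src.stem}_{date.today()}_{annotation}{src.suffix}")
    shutil.copy2(src, dst)  # the original file is left untouched
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{date.today()}\t{dst.name}\t{annotation}\n")
    return dst

# e.g. snapshot("corpus.txt", "lowercased_and_despaced")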

2. Extra-Linguistic Cleaning

This is the no-brainer. Training, testing, and error analysis will all go better when you work with only useful text. Extra-linguistic cleaning is basically tidying the data in ways that won’t harm the text and may even slightly improve it. Here’s a list of basics that I recommend for most text analysis projects:

  1. Delete extra white spaces, including double spaces, double new lines or line returns, etc. (A sketch covering several of these steps follows this list.)
  2. Delete double punctuation!!!! Should this be done iteratively or by using regex???? It’s up to you!!!
  3. Remove duplicate words. These won’t be common, but you might might find a few and the coding would be easily done.
  4. Convert numbers stored as words into actual numbers. E.g., “two” becomes “2”. This will reduce dimensionality, and a model can use that number as a numeric data type rather than text. If you do this, remember input vector normalization: most input values are between -10 and 10, so a date like 1996 would have to be dealt with. One option is to keep large numbers as text data types. Another is to use eras or centuries, such as “late twenty-first century” or “late modernity”. Removing large numbers may be the simplest option; “in 1996” would be deleted altogether.
  5. Change all text to lower case. If you are doing named-entity recognition or if proper nouns are useful to your algorithm, ignore this step.
  6. Remove unnecessary punctuation. Change exclamation points and ellipses to periods, and dashes and parentheses to commas (yes, this could create a run-on sentence). You might also consider changing question marks to periods. If you’re particularly zealous, you can move all prepositional phrases to the end of a sentence. E.g., “Up the hill, ran Jack and Jill” would become “Jack and Jill ran up the hill,” and the comma wouldn’t be needed. The same applies to embedded clauses, where “It was in the hall where Col. Mustard killed Professor Plum” becomes a simpler “Col. Mustard killed Professor Plum in the hall.” This is quite a task, but the rewards would be significant.
  7. Change abbreviations to words. “Mr.” should be “Mister”, “Sgt.” should be “Sergeant”, and “etc.” should be “etcetera”. This reduces sparsity and vocabulary size, but will slightly increase corpus size. Since sentences are usually delimited by a period, removing the periods that belong to abbreviations also makes segmenting a corpus into sentences much easier.
  8. Spell check. Correcting spelling will reduce dimensionality and sparsity, and make for a smaller, faster model. Most importantly, a recommender model won’t recommend typos! You’ll need to decide how distant a typo can be. The threshold can be measured by the Levenshtein distance: the number of edits (insertions, deletions, and substitutions) needed to turn the typo into an actual English word. For example, “decend” probably should be “descend”, but what would you do with “dcedn”? Changing “decend” to “descend” has a distance of 1 (one insertion), while changing “dcedn” to “descend” has a distance of 3 (two insertions and one switch) if you use the Damerau-Levenshtein variant, which counts swapping two adjacent letters as a single edit. Even better would be a ratio, where the longer the word, the more edits are allowed. (See the sketch after this list.)
  9. Delete all formatting. This includes bold text, italics, etc. You may want to keep paragraphs if you’re doing topic, sub-topic, subject, theme, or code analysis. Generally, in quantitative analysis, a document has a single topic and several sub-topics. A subject is much like a sub-topic. For example, if you’re writing a paper on single-term American Presidents (the topic), you can have a paragraph for each President (the subject or sub-topic). Themes and codes are usually used in qualitative analysis and beyond the scope of this document.
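Here’s a rough sketch of a few of these steps (1 through 3, 5, and the edit-distance threshold from step 8). The regular expressions and the threshold rule are illustrative starting points, not a complete cleaner:

import re

def basic_clean(text):
    text = text.lower()                               # step 5: lower-case everything
    text = re.sub(r"([.!?,;:])\1+", r"\1", text)      # step 2: collapse repeated punctuation
    text = re.sub(r"\b(\w+)(\s+\1)+\b", r"\1", text)  # step 3: drop duplicated words ("might might")
    text = re.sub(r"\s+", " ", text).strip()          # step 1: collapse extra whitespace
    return text

def levenshtein(a, b):
    # plain edit distance: insertions, deletions, and substitutions
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

print(basic_clean("The  dog barked!!!  He barked barked again."))  # -> "the dog barked! he barked again."
print(levenshtein("decend", "descend"))                            # -> 1

A reasonable spell-check rule is then to accept a correction only if its edit distance is below some fraction of the word’s length, which gives longer words more room for error.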

3. Delete quoted text

Quoted text often offers little information, and it frequently doesn’t obey grammatical rules. Here are a few reasons why quoted text should be ignored when analyzing a sentence for purposes such as syntactic parsing, word prediction, and many others.

Redundant information

John rolled down the window and told Mary, “I can’t do this anymore.” He drove off and she never saw him again.

The quoted part above is redundant when trying to understand what’s going on. The gist of the sentences is that John permanently left Mary, which can still be obtained with the quote removed. If you want the fundamental meaning, you have to ignore the noise.

Arbitrary Grammar

Quoted text is almost lawless. Imagine quoting a toddler or alien. The text in those quotes is almost useless for an English analysis. Consider the following paragraph:

George sat back and continued recounting his childhood dream. “I was in a field where a giant rabbit was running after me and I was terrified and I didn’t know what to do and I hid behind a tree and suddenly the rabbit popped out and that’s why I’m afraid of rabbits even today.”

The run-on sentence above is acceptable because that’s how George talks. However, syntactically parsing such a sentence is almost impossible, and semantics, at least in part, comes from syntax. Imagine building a prediction model based on word frequencies using the text above: the word “and” would skew the counts in a way that a simple algorithm, such as a basic regression model, would struggle to overcome.

Unique grammar

“Gramma ain’t never said ya been lyin’.”

Although this sentence is grammatical in some dialects, any model trained on the General American (GA) dialect, or most other English dialects, would be unreliable when parsing it. Many linguists argue that there’s no such thing as poor grammar; by the same token, a good model trained on GA would also perform poorly on perfectly proper Middle English.

Difficult cleaning

Mary was worried about John betting all of their money. “No,” said Mary. “That ‘spending money’ is our savings.”

It’s not particularly onerous, but to use the quoted material you would have to deal with the artifacts of removing the quotes: stray commas, dialog tags (“said Mary”), and quotes within quotes.
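If you do decide to drop quoted material, a crude first pass is to delete anything between double quotes (straight or curly) and then tidy the leftover punctuation. This is only a sketch; real dialog, with nested quotes and dangling dialog tags, needs more care:

import re

def strip_quotes(text):
    text = re.sub(r'["\u201c\u201d].*?["\u201c\u201d]', " ", text)  # drop straight- and curly-quoted spans
    text = re.sub(r"\s+([,.!?])", r"\1", text)                      # re-attach stranded punctuation
    return re.sub(r"\s+", " ", text).strip()

sample = 'John rolled down the window and told Mary, "I can\'t do this anymore." He drove off.'
print(strip_quotes(sample))
# -> "John rolled down the window and told Mary, He drove off."
#    (the stray comma before "He" is exactly the kind of artifact mentioned above)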

4. Stemming

Inflection is basically appending prefixes and suffixes, collectively called affixing. Sometimes affixing changes the word’s syntactic category, such as how adding -ment to the verb “govern” changes it to a noun. Sometimes English uses an ablaut (a vowel change) instead of an affix, such as “dove” rather than “dived.” Ablauts can be treated as exceptions in simple models, where you consider -ed to be the default past-tense ending. Removing inflection is referred to as “stemming.” We must be cautious not to significantly change the meaning of a word when we stem it. For example, changing “apples” to “apple” by removing the -s is useful for reasons explained later. However, stemming “consuming” down to “consume” turns a gerund (which acts as a noun) back into a verb, which changes its role in the sentence. That change isn’t as disastrous as it might seem at first. Suffice it to say that stemming simplifies words, which reduces the vocabulary you use. Consequently, you end up working with far less data with only minimal loss of meaning. If the part of speech (lexical category) isn’t changed, stemming is known as “safe reduction.”

Here are some advantages of stemming:

Reduced Dimensionality and Sparsity

Apply the phrase to John

John applied the phrase

John is applying the phrase

“Apply,” “applied,” and “applying” all have similar meanings. Since they all share the same root, after stemming them we end up with this:

Apply the phrase to John

John apply the phrase

John is apply the phrase

It may be hard to see, but there isn’t much of a semantic difference among the sentences. The three variants of “apply” have become one. In informal English, there are even more variants that can be safely reduced.
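Here is one way to do this reduction, assuming NLTK is installed: its Porter stemmer maps all of these forms to the same root, although the root it picks is “appli” rather than a dictionary word (a lemmatizer returns “apply” instead):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["apply", "applied", "applying", "applies"]:
    print(word, "->", stemmer.stem(word))   # every form becomes "appli"

# from nltk.stem import WordNetLemmatizer            # alternative that returns real words
# WordNetLemmatizer().lemmatize("applied", pos="v")  # -> "apply" (requires the WordNet data)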

Denser N-Gram Sets

We can use n-grams to reduce the dimensionality of our model; however, these n-grams must not be sparse or they defeat the purpose. Sparse data is basically data that isn’t seen often in the database. If you’re creating, say, trigrams, you would have far more counts of “the, relaxing, dog” than “the, relaxin’, dog”. In a corpus of 1 million words, the second trigram may occur only once. What good is a trigram that only occurs once when training a model? Reducing both trigrams to “the, relax, dog” will allow that trigram to be seen far more often in a corpus, assuming the corpus is also stemmed.
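Here’s a small sketch of how stemming densifies n-gram counts. The toy corpus and the crude suffix-stripping rule are only for illustration:

from collections import Counter

def trigrams(tokens):
    return zip(tokens, tokens[1:], tokens[2:])

def crude_stem(token):
    # toy rule only: drop a trailing apostrophe, then -ing/-in/-ed/-s
    token = token.rstrip("'\u2019")
    for suffix in ("ing", "in", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

corpus = "the relaxing dog slept , the relaxin' dog slept , the relaxed dog slept".split()
raw = Counter(trigrams(corpus))
stemmed = Counter(trigrams([crude_stem(t) for t in corpus]))
print(raw[("the", "relaxing", "dog")])   # 1  (sparse)
print(stemmed[("the", "relax", "dog")])  # 3  (denser after stemming)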

5. Removing Punctuation

Removing punctuation and special characters can greatly improve your model as long as it’s done consistently.  In some cases, it’s obvious that semantics are mostly retained when removing punctuation, such as “shut up!” versus “shut up” — the meanings are almost identical.

Changing “don’t” to “do not” creates a useful bigram. Changing “nothin’” to “nothing” can reduce sparsity. However, even grammatical punctuation can be deleted: periods after titles, all commas, quotes, hyphens, and upper-case letters can all be removed. This is no small task, however. If you want to retain semantic fidelity, it might be best to change “ain’t” to “is not” rather than “aint”. Such exceptions would need to be accounted for during cleaning. On the other hand, removing unnecessary punctuation may make things easier down the road. For example, a corpus is often segmented into sentences using the period, but what happens when it encounters “Mr. Smith”? Delimiting sentences by periods would create two sentences, but if you remove excess punctuation, “Mr Smith” no longer has that false delimiter. Hyphens also appear in nominal compounds, such as “key-stroke”, which are discussed below.

Lastly, it is useful to add boundary symbols, such as “<s>” at the start of a sentence and “</s>” at the end. This is very useful for lexical category tagging and for models that look for patterns, such as neural networks.
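Here’s a sketch of the punctuation pass described above, assuming you expand a handful of contractions and abbreviations explicitly and then wrap each sentence in boundary markers. The lookup tables are deliberately tiny; real ones would be much longer:

import re

CONTRACTIONS = {"don't": "do not", "ain't": "is not", "can't": "cannot"}   # illustrative only
ABBREVIATIONS = {"mr.": "mister", "sgt.": "sergeant", "etc.": "etcetera"}  # illustrative only

def normalize(sentence):
    tokens = sentence.lower().split()
    tokens = [ABBREVIATIONS.get(t, CONTRACTIONS.get(t, t)) for t in tokens]
    text = " ".join(tokens)
    text = re.sub(r"[!\u2026]+", ".", text)     # exclamation points and ellipses become periods
    text = re.sub(r"[()\u2014-]+", ",", text)   # dashes and parentheses become commas
    text = re.sub(r"[^\w\s.,]", "", text)       # drop any remaining special characters
    return "<s> " + text.strip() + " </s>"

print(normalize("Mr. Smith don't shout!"))
# -> <s> mister smith do not shout. </s>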

Creating Nominal Compounds

There are three kinds of nominal compounds:

Closed compounds: keyboard, notebook, mousepad

Hyphenated compounds: mother-in-law, merry-go-round

Open compounds: truck driver, school bus, living room

A nominal compound is treated like a single word, even if there’s a space between its parts. Tokenizing is when we divide sentences into words. If we tokenize on spaces, we end up breaking up nominal compounds, which can create vagueness. For example, “The truck driver drove quickly” can become the following trigrams:

“<s>, the, truck”

“the, truck, driver”

“truck, driver, drove”

“driver, drove, quickly”

“drove, quickly, </s>”

However, “truck driver” should be treated as a single word. An easy way to prevent nominal compounds from being broken up is to replace the space with an underscore, resulting in “The truck_driver drove quickly”. This would create the following trigrams:

“<s>, the, truck_driver”

“the, truck_driver, drove”

“truck_driver, drove, quickly”

“drove, quickly, </s>”

Not only are there fewer trigrams, but the nominal compound remains intact. But what about nominal phrases, such as “The enemy’s destruction of the city”? That is basically one big nominal phrase, and there’s a difference between a nominal phrase and a nominal compound: a nominal compound is a commonly used term that cannot be altered. For example, “truck fast driver” or “trucker driver” are not grammatical. Only expletives can be inserted into nominal compounds. Nominal phrases, on the other hand, can be greatly altered: “The enemy’s destruction of the city” can become “The enemy’s horrible destruction of the great city” without losing grammaticality.
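Here’s a small sketch of joining known open compounds before tokenizing. The list of compounds is hand-made for the example; in practice you might draw it from a gazetteer or from high-frequency bigram statistics:

COMPOUNDS = ["truck driver", "school bus", "living room"]  # illustrative list only

def join_compounds(text):
    for compound in COMPOUNDS:
        text = text.replace(compound, compound.replace(" ", "_"))
    return text

def trigrams(tokens):
    return list(zip(tokens, tokens[1:], tokens[2:]))

sentence = "the truck driver drove quickly"
tokens = ["<s>"] + join_compounds(sentence).split() + ["</s>"]
print(trigrams(tokens))
# [('<s>', 'the', 'truck_driver'), ('the', 'truck_driver', 'drove'),
#  ('truck_driver', 'drove', 'quickly'), ('drove', 'quickly', '</s>')]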

6. Removing Stop Words

Stop words are basically words that have little meaning and may skew your model. If you were to make a very simple frequency-based prediction model, the word “the” would be recommended far too often. TF–IDF matrices can discount words that are over-represented, but it’s a naive approach: sometimes very common words are necessary for semantics (“the koala” and “a koala” are completely different in predicate calculus and must be retained). Also, keep in mind that “a” and “the” may sometimes have similar meaning. For example, the following three sentences all have the same meaning:

The penguin is a flightless bird.

A penguin is a flightless bird.

Penguins are flightless birds.

Remove Words With Little Meaning

Imagine you’re an algorithm trying to analyze text, and you come across the following sentence:

John loves Susan, but Susan does not love John.

Is the “but” useful? If so, why can it be replaced with a semicolon? Conjunctions are not often useful in text analysis unless logic is being examined. Otherwise, just take them out! The model can compensate.

Jack and Jill ran up the hill.

In the sentence above, the word “the” seems important; they didn’t run up just any hill. However, on an abstract level, the fact that it could be any hill may be good enough. If I were to type “who is richest” into Google’s search engine, the first autocomplete suggestion, as of this writing, is “who is the richest man in the world”. Google filled in the “the” for me; it was totally unnecessary in my input. Removing determiners, conjunctions, and many other function words that may at first seem important can make your training vocabulary and corpus smaller and your model faster.

Remove Frequent Words

Frequent words skew any model that’s based on frequencies (counts). Some algorithms, like neural networks, can learn to ignore them by adjusting weights during training. Greedy algorithms, like decision trees, may fall apart completely unless frequent words are dealt with.

We’ve discussed words like “an”, “the”, “but”, and other words that by themselves have almost no meaning. But there are also words that are common, and have meaning, but are not usually useful in text analysis.

Here is a suggested list of stop words:

“a”, “about”, “above”, “after”, “again”, “against”, “all”, “am”, “an”, “and”, “any”, “are”, “aren’t”, “as”, “at”, “be”, “because”, “been”, “before”, “being”, “below”, “between”, “both”, “but”, “by”, “can’t”, “cannot”, “could”, “couldn’t”, “did”, “didn’t”, “do”, “does”, “doesn’t”, “doing”, “don’t”, “down”, “during”, “each”, “few”, “for”, “from”, “further”, “had”, “hadn’t”, “has”, “hasn’t”, “have”, “haven’t”, “having”, “he”, “he’d”, “he’ll”, “he’s”, “her”, “here”, “here’s”, “hers”, “herself”, “him”, “himself”, “his”, “how”, “how’s”, “i”, “i’d”, “i’ll”, “i’m”, “i’ve”, “if”, “in”, “into”, “is”, “isn’t”, “it”, “it’s”, “its”, “itself”, “let’s”, “me”, “more”, “most”, “mustn’t”, “my”, “myself”, “no”, “nor”, “not”, “of”, “off”, “on”, “once”, “only”, “or”, “other”, “ought”, “our”, “ours”, “ourselves”, “out”, “over”, “own”, “same”, “shan’t”, “she”, “she’d”, “she’ll”, “she’s”, “should”, “shouldn’t”, “so”, “some”, “such”, “than”, “that”, “that’s”, “the”, “their”, “theirs”, “them”, “themselves”, “then”, “there”, “there’s”, “these”, “they”, “they’d”, “they’ll”, “they’re”, “they’ve”, “this”, “those”, “through”, “to”, “too”, “under”, “until”, “up”, “very”, “was”, “wasn’t”, “we”, “we’d”, “we’ll”, “we’re”, “we’ve”, “were”, “weren’t”, “what”, “what’s”, “when”, “when’s”, “where”, “where’s”, “which”, “while”, “who”, “who’s”, “whom”, “why”, “why’s”, “with”, “won’t”, “would”, “wouldn’t”, “you”, “you’d”, “you’ll”, “you’re”, “you’ve”, “your”, “yours”, “yourself”, “yourselves”, “zero”

Words like “he” and “your” may seem too useful to discard, but if you look at the context in which they usually appear, you’ll see how often they add little meaning to the sentence. On an abstract level, there is little difference in meaning between “Jack took his umbrella” and “Jack took umbrella.”
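Filtering a tokenized sentence against a stop list is straightforward. The set below is just a slice of the list above; NLTK also ships a ready-made English stop list if you’d rather not maintain your own:

STOP_WORDS = {"a", "an", "the", "his", "her", "but", "and", "is", "was", "to", "of"}  # subset of the list above

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("Jack took his umbrella".split()))
# -> ['Jack', 'took', 'umbrella']

# from nltk.corpus import stopwords
# stopwords.words("english")   # ready-made alternative (requires the stopwords data)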

Remove adjectives and adverbs

This one is less common. You can see “very” in the list of stop words above, but, in fact, most adjectives are not useful in text analysis. However, we only need to remove certain kinds of adjectives. An obvious place to start is adjectives with overlapping meaning, as in “I just saw a tiny little frog.” In this case, deleting “little” would be prudent.

The adverbs in “Susan screamed loudly” and “John ran quickly” add little, because that’s the usual way to scream or run. Can one scream quietly? These are obvious examples of redundant information. However, what about adverbs that seem more essential? If somebody tells you that “horse number five runs fast,” and you’re betting on horse number six, that information is important.

Remove Prepositional phrases

I get it. Removing hard-earned data is like cutting off your arm or giving up your firstborn. But don’t hold on to sympathy stock. Prepositional phrases may make up more than half a sentence, yet they often contribute almost no meaning. Take the following sentence:

“Susan ate soup with a new spoon from the drawer.”

Like most prepositional phrases in English, the one above starts with a preposition (“with”). Removing “with” and everything after it greatly shortens the sentence, but it doesn’t take away much information. We already assumed she used a spoon, and that spoons come from drawers.

You’ll notice many prepositions (“until”, “before”, “after”, “by”) in the list of stop words above. Not only are these commonly removed when cleaning text, but you should also consider removing entire prepositional phrases. English is fairly consistent in starting prepositional phrases with a preposition, so you can basically delete the part of the sentence that begins with a preposition.

Again, it seems like we’re removing a great deal of information, but we’re usually not. Consider the following sentences with and without their prepositional phrase:

I ran the marathon until I quit.

I walked home along the sidewalk.

I played football on the football field.

Of course, I’m cheating somewhat with my examples. There will be instances where there will be important information in prepositional phrases, such as in the following sentence:

“I accidentally dropped your baby at the nursery a few minutes early.”

The best way to know whether removing prepositional phrases, or anything else, from your data is advantageous is simple: compare error rates in testing. If removing prepositional phrases increases your error, reconsider it. Keep in mind that deleting prepositional phrases may increase your error, but it’s the data that is becoming more general, not the algorithm. Overtraining your model can result in overfitting, but overly specific data can result in overfitting as well.
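One way to drop prepositional phrases automatically is to lean on a dependency parser rather than hand-written rules. Here’s a sketch assuming spaCy and its small English model are installed; it removes each preposition together with its subtree:

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model has been downloaded

def drop_prepositional_phrases(text):
    doc = nlp(text)
    to_drop = set()
    for token in doc:
        if token.dep_ == "prep":                   # a preposition heading a prepositional phrase
            to_drop.update(t.i for t in token.subtree)
    return " ".join(t.text for t in doc if t.i not in to_drop)

print(drop_prepositional_phrases("Susan ate soup with a new spoon from the drawer."))
# -> roughly "Susan ate soup ." (the exact output depends on the parse)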

If you’re cleaning data that goes into a graph database, keep in mind that prepositions and conjunctions are not only useful, but often become the edges that join triples. In other words, I would keep them in this case. For text prediction models, however, they may offer little value. Here are two examples of how triples can be adjoined:

(John) -[ate]-> (soup) -[at]-> (home)

(John) -[ate]-> (soup) -[and]-> (bread)

7. Safe Reduction

Safe reduction is the act of merging sentences. As an example, take the following sentences:

That big elephant was amazing!

That large elephant was amazing!

That big elephant was fantastic!

These can all be reduced to “That big elephant was amazing.” (Notice the exclamation point is changed to a period.) Since the meaning is retained, the reduction is considered “safe”. Sentence embeddings are a newer way to find similar sentences, but a simpler approach is to use the most common synonym: for every word, look for a synonym, and if there is a more common word with the same meaning, replace it. Iterating this over the sentences above, you’d end up with “That big elephant was amazing.”

Safe reduction can go even further, however. Sentences can be shortened or simplified by collapsing multi-word expressions, since some combinations of words have the same meaning as a single, more common word. Take the following bigrams:

“amazingly big”

“really big”

“incredibly big”

These can all be replaced with “very big”. However, there are single words that mean “very big”, such as “huge” or “enormous”. Since “huge” is more common, we could change the bigrams “very big”, “really big”, and “incredibly big” to the unigram “huge”.

Safe reduction is quite an undertaking, especially since many synonyms aren’t really interchangeable. For example, we can try switching “ate” with “consumed”, but the latter may be too formal for the given context. This is where word and sentence embeddings can help out because they account for formality. Of course, safe reduction should occur before you remove stop words, especially adjectives and adverbs.
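Here’s a rough sketch of frequency-based synonym replacement, using WordNet for candidate synonyms and the wordfreq package as the frequency source. Both are assumptions about your toolchain, and as noted above, blindly swapping synonyms can change register or meaning, so treat this as a starting point:

from nltk.corpus import wordnet as wn   # requires the WordNet data
from wordfreq import word_frequency     # assumed frequency source

def most_common_synonym(word):
    candidates = {word}                 # the original word is kept as a fallback
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            candidates.add(lemma.name().replace("_", " "))
    # pick whichever candidate is most frequent in English
    return max(candidates, key=lambda w: word_frequency(w, "en"))

for w in ["large", "fantastic", "big"]:
    print(w, "->", most_common_synonym(w))   # e.g. "large" may map to "big", depending on the frequency list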

Notes

Maintaining Semantics

Meaning is found in all morphological units. This includes affixes. If we remove affixes, we lose some meaning. What can be done about this problem? A more sophisticated model can both stem and maintain inflectional information like tense, case, and possession.

If we stem “Susan’s coat” to “Susan coat”, the relationship between Susan and her coat is lost. To account for this, we can add a few dimensions to nouns and verbs to indicate their role or form in the sentence. Used wisely, these dimensions can be fully dense. Here’s an example:

“John walked home” → John, walk [past], home

“John is walking home” → John, walk [present], home

“John will walk home” → John, walk [future], home

Notice that the tense feature is never empty. Also notice that all of the results are trigrams, which can be easily stored in triple store and graph databases.

Parsing sentences into triples can be tedious, involving deletions, stemming, etc., but it can have a dramatic impact on the size and speed of your model.

Databases like WordNet contain synonyms and senses. A sense is basically a group of words that all have a similar meaning but aren’t necessarily interchangeable, unlike synonyms. WordNet and similar databases can be used to map each word onto a sense.

Word embeddings are yet another alternative, where a fuzzy vector can represent the general meaning of a word, and similar words can be clustered together into a single vector. The advantage is that this vector can evolve as the model is trained, and can be language agnostic.

Senses each have an ID. These IDs can replace words in a sentence, which can both shrink sentences and simplify the production of word embeddings. You can take the sense ID of the root and keep inflectional information like tense in a matrix.

For example, the WordNet ID for “dog” is 02086723 and “walk” is 01908923. The word “the” has no entry, so we can turn “The dog walked” into:

[Noun:02086723[quantifier:Ø][determiner:demonstrative][number:1][case:nominative][person:third]], [Verb:01908923[tense:past]]

Quite a lot must be done here: slots are reserved for the features of nouns and verbs, and the nouns and verbs themselves are represented by a code. Tense can be obtained from inflection, person-verb agreement, etc. This would allow you to strip all inflection, turn it into a matrix, and use senses rather than words. Since a sense ID is unique, it can double as a hash. If the order is maintained in the matrices, this kind of data is ideal for word embedding algorithms. A JSON representation would be preferable.
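Here is one possible shape for such a record, with hypothetical field names that mirror the bracketed features above:

{
  "tokens": [
    {"sense": "02086723", "pos": "noun", "quantifier": null,
     "determiner": "demonstrative", "number": 1,
     "case": "nominative", "person": "third"},
    {"sense": "01908923", "pos": "verb", "tense": "past"}
  ]
}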

Specificity can combine determiners and quantifiers on a scale of 1 to 10. For example, “the boy” would have a specificity of 1, “a few boys” a specificity of 2, “some boys” 3, “half the boys” 5, “most boys” or “many of the boys” somewhere from 6 to 9, and “all boys”, “all the boys”, and “the boys” would all be 10.

Word embeddings are created by looking at a target word and its neighboring words, and repeating this for every word in the corpus. The meaning of the target word is obtained by its neighboring words. By extracting inflection, we may end up with a more accurate model. Here’s an example of a sentence without extracted inflection:

“The dog’s owner was running quickly to his keys”

And here it is with some extracted inflection:

The dog -’s own -er is[past] run[pres] quick -ly to he[poss] key[pl]

It’s quite a lot of morphological work to get to the second sentence, including undoing vowel changes (ablauts) and creating new tokens like “’s”, which are treated like words, but the latter sentence results in denser data that is easier to work with. Due to the nature of input vectors for models, a feature “slot” can be reserved for inflection where applicable. For example, a noun could have reserved slots for quantifier, determiner, number, case, and person, as in the example above, even when some of them are empty.

To take things further, we could convert inflection into relationships stored in graph databases. For example, “John’s dog” could become:

(dog) -[owned_by]-> (John)

This would also work:

(John) -[owner_of]-> (dog)

Both directions are valid, but the second may be more useful: a dog usually has only one owner, while a person owns many things, so “owner_of” will be reused far more often than “owned_by”. Again, we want to reduce sparsity. The above is pseudocode in the style of a graph query language. We can reuse the same relationship for different objects:

(John) -[owner_of]-> (dog)

(John) -[owner_of]-> (keys)
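As a minimal sketch, these triples can be kept as plain tuples before they ever reach a graph database, which makes it easy to see how reusing a relationship keeps the set of relation types small:

from collections import Counter

triples = [
    ("John", "owner_of", "dog"),
    ("John", "owner_of", "keys"),
    ("John", "ate", "soup"),
]
relation_counts = Counter(rel for _, rel, _ in triples)
print(relation_counts)   # Counter({'owner_of': 2, 'ate': 1}); reusing "owner_of" reduces sparsity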

Collectively, these triples can be turned into vector representations, and eventually sophisticated concepts can be quantified into ever-evolving vectors.
