March 22, 2020

Text Normalization for Natural Language Processing in Python

Text preprocessing is an important part of Natural Language Processing (NLP), and normalization of text is one step of preprocessing.

The goal of normalizing text is to group related tokens together, where tokens are usually the words in the text.

Depending on the text you are working with and the type of analysis you are doing, you might not need all of the normalization techniques in this post.

Normalization Techniques

In this post we will go over some of the common ways to normalize text.

Tokenization
Removing stopwords
Handling whitespace
Converting text to lowercase
Expanding contractions (don't -> do not)
Removing punctuation
Handling unicode characters - accented letters and some punctuation
Number words -> numeric
Stemming and/or Lemmatization
British vs. American English

Setup

We're going to use the Natural Language Toolkit (NLTK), as well as a few other packages that I will go over in the rest of the post.

First create a virtual environment for the project and install the packages we will be using.

pip install nltk
pip install unidecode
pip install pycontractions
pip install word2number
pip install gensim

These are the modules we will use in this post - go ahead and create a new file and import them.

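import re
import string
import nltk
import unidecode
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import gensim.downloader as api
from pycontractions import Contractions
from word2number import w2n

# One-time download of the NLTK data used in this post
# (newer NLTK releases may also ask for 'punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')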

For the examples in this post, I'm using text from this article in Vogue magazine.

text = "“Everything we’re doing is about going forward,” Phoebe Philo told Vogue in 2009, shortly before showing her debut Resort collection for Céline. Although the label had garnered headlines when it was revived by Michael Kors in the late ’90s, it was Philo who truly brought the till then somewhat somnambulant luxury house to the forefront. Critics credited her with pushing fashion in a new direction, toward a more spare, stripped-down kind of sophistication. What Céline now offered women was, as the magazine put it, “a grown-up and hip way to put themselves together.”"

Tokenization

Tokenization means splitting up strings of text into smaller pieces.

NLTK has a sentence tokenizer, as well as a word tokenizer.

First let's look at the sentence tokenizer.

You may have guessed that the sentence tokenizer will split a paragraph into sentences.

sentences = sent_tokenize(text)
print(sentences)

['“Everything we’re doing is about going forward,” Phoebe Philo told Vogue in 2009, shortly before showing her debut Resort collection for Céline.',
 'Although the label had garnered headlines when it was revived by Michael Kors in the late ’90s, it was Philo who truly brought the till then somewhat somnambulant luxury house to the forefront.',
 'Critics credited her with pushing fashion in a new direction, toward a more spare, stripped-down kind of sophistication.',
 'What Céline now offered women was, as the magazine put it, “a grown-up and hip way to put themselves together.”']

The result here is a list of four sentences.

Now let's use the word tokenizer to split the first sentence into word tokens.

test_sentence = sentences[0]
words = word_tokenize(test_sentence)
print(words)

['“', 
'Everything', 
'we', 
'’', 
're', 
'doing', 
'is', 
'about', 
'going', 
'forward', 
',', 
'”', 
'Phoebe', 
'Philo', 
'told', 
'Vogue', 
'in', 
'2009', 
',', 
'shortly', 
'before', 
'showing', 
'her', 
'debut', 
'Resort', 
'collection', 
'for', 
'Céline', 
'.']

If you're a Python developer you've probably used the split() operation to split up a string of text into words.

words = test_sentence.split()
print(words)

['“Everything',
 'we’re',
 'doing',
 'is',
 'about',
 'going',
 'forward,”',
 'Phoebe',
 'Philo',
 'told',
 'Vogue',
 'in',
 '2009,',
 'shortly',
 'before',
 'showing',
 'her',
 'debut',
 'Resort',
 'collection',
 'for',
 'Céline.']

Can you spot the difference between the result using split() and using the NLTK tokenizer?

Using split(), the string is broken up in words based on whitespace, and the punctuation is grouped in with the words instead of broken up as its own token.

Here, the token 'Céline.' (with the period on the end) is treated as a different token from 'Céline' without it, but we want these tokens to be grouped together in any analysis because they refer to the same entity.

This could mess up your results in ways you might not immediately notice.
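To make the difference concrete, here is a minimal sketch of a regex-based tokenizer that splits punctuation into its own tokens. It is not how NLTK's word_tokenize works internally, and the function name simple_tokenize is made up for this example.

```python
import re

def simple_tokenize(sentence):
    # \w+ matches a run of letters or digits (accented letters included);
    # [^\w\s] matches any single character that is neither a word
    # character nor whitespace, i.e. a punctuation mark or symbol.
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = simple_tokenize("showing her debut Resort collection for Céline.")
print(tokens)
# ['showing', 'her', 'debut', 'Resort', 'collection', 'for', 'Céline', '.']
```

With the period split off, 'Céline' ends up as the same token wherever it appears in a sentence.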

Removing stopwords

Stopwords are common words such as 'the' or 'a' in a language that might not be useful features when building NLP models.

Here I'm using a list of English stopwords from NLTK.

stop_words = set(stopwords.words('english'))
filtered = [word for word in word_tokenize(test_sentence) if word not in stop_words]
print(filtered)

['“', 'Everything', '’', 'going', 'forward', ',', '”', 'Phoebe', 'Philo', 'told', 'Vogue', '2009', ',', 'shortly', 'showing', 'debut', 'Resort', 'collection', 'Céline', '.']

Punctuation tokens are not stopwords, so they stay in the list.
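One thing to watch for: the NLTK stopword list is all lowercase, so a capitalized token such as 'The' at the start of a sentence would not match. Here is a small sketch that compares the lowercased token instead. The STOP_WORDS set below is a tiny made-up sample so the example runs on its own; in practice you would pass in the NLTK list.

```python
# Tiny illustrative stopword set; swap in set(stopwords.words('english')).
STOP_WORDS = {"the", "a", "is", "in", "we", "for", "her"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    # Compare the lowercased token so 'The' and 'the' are treated alike,
    # but keep the surviving tokens in their original form.
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["The", "label", "is", "in", "the", "news"]
print(remove_stopwords(tokens))
# ['label', 'news']
```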

Whitespace

You can use the strip() method on strings in Python to remove leading and trailing whitespace.

But sometimes raw text data can look pretty awful in the middle as well.

Maybe you've scraped data from the internet and it came from a website with clunky design where the developers used extra spaces to position text on the page.

The text sample I've been using didn't have any obvious problems with extra spaces, but if it did it might look like this:

test_sentence = '“Everything  we’re doing  is about going      forward,” Phoebe Philo  told    Vogue  in 2009,    shortly before showing her debut Resort collection for Céline.'

One way to remove the extra whitespace is with a regular expression that replaces every run of one or more spaces in the text string with a single space.

test_sentence = re.sub(' +',' ', test_sentence)
print(test_sentence)

“Everything we’re doing is about going forward,” Phoebe Philo told Vogue in 2009, shortly before showing her debut Resort collection for Céline.
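Scraped text often contains tabs and newlines as well as spaces. A variation on the same idea, sketched below, uses \s+ to match any run of whitespace and then strips the ends. The function name collapse_whitespace is just for illustration.

```python
import re

def collapse_whitespace(text):
    # \s+ matches spaces, tabs and newlines; strip() trims both ends.
    return re.sub(r"\s+", " ", text).strip()

messy = "  Phoebe Philo \t told\n\nVogue   in 2009  "
print(collapse_whitespace(messy))
# Phoebe Philo told Vogue in 2009
```

Keep in mind that this also flattens paragraph breaks, so split the text into paragraphs first if you need them.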

Converting to lowercase

This is probably an obvious one, but converting everything to lowercase is an easy way to standardize text data.

test_sentence = test_sentence.lower()
print(test_sentence)

“everything we’re doing is about going forward,” phoebe philo told vogue in 2009, shortly before showing her debut resort collection for céline.

Expanding contractions

One way to do this is with regular expressions.

For example in this sentence, we would want to expand we're.

pattern = r'we[\’\']re'
replacement = 'we are'
test_sentence = re.sub(pattern,replacement,test_sentence)
print(test_sentence)

“Everything we are doing is about going forward,” Phoebe Philo told Vogue in 2009, shortly before showing her debut Resort collection for Céline.

That worked for this sentence, but you will likely have many other contractions, so you would need to write a regular expression for all of them.

There are also a lot of cases that require contextual knowledge of the contraction to pick the right expansion.
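Before reaching for a library, here is a rough sketch of the lookup-table approach: one regular expression built from a dictionary of contractions. The CONTRACTIONS table and the expand_contractions name are made up for this example, and the table is far from complete.

```python
import re

# Small hand-picked table for illustration only.
CONTRACTIONS = {
    "we're": "we are",
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",  # could also mean "it has"; a lookup table cannot tell
}

# One alternation built from the table's keys, matched on word
# boundaries and ignoring case.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    # Treat the curly apostrophe like a straight one before the lookup.
    text = text.replace("’", "'")
    # The replacement comes from the table, so capitalization is lost.
    return PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("We’re sure it's fine, but don't wait."))
# we are sure it is fine, but do not wait.
```

The "it's" entry shows the limitation: the table always picks one expansion, which is wrong whenever the other meaning was intended.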

There is a Python library pycontractions that uses semantic vector models such as Word2Vec, GloVe, FastText, or others, to determine the correct expansion.

#downloads the model we will use
model = api.load("glove-twitter-25")
cont = Contractions(kv_model=model)
text = list(cont.expand_texts([test_sentence],precise=True))
print(text)

['“Everything we are doing is about going forward” Phoebe Philo told Vogue in 2009 shortly before showing her debut Resort collection for Céline.']

Removing punctuation

Removing punctuation can be done with the built-in string module in Python.

punctuation_table = str.maketrans('','',string.punctuation)
test_sentence = test_sentence.translate(punctuation_table)
print(test_sentence)

“Everything we’re doing is about going forward” Phoebe Philo told Vogue in 2009 shortly before showing her debut Resort collection for Céline

Notice that in this sentence, not all of the punctuation has been removed.

string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

These are the punctuation marks that should have been removed.

The period at the end was removed, but not the quotation marks.

Why?

The quotation marks in this sentence are unicode characters, which we will deal with next.
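Another option, not used in the rest of this post, is to remove punctuation by its unicode category with the standard library's unicodedata module, which catches curly quotes as well. This is a sketch under that assumption; the function name strip_punctuation is made up.

```python
import unicodedata

def strip_punctuation(text):
    # Every unicode punctuation category starts with 'P'
    # (for example Po for ',' and '.', Pi and Pf for curly quotes).
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

print(strip_punctuation("“Everything we’re doing is about going forward,” Phoebe Philo told Vogue."))
# Everything were doing is about going forward Phoebe Philo told Vogue
```

Note that "we’re" turns into "were" here, which is one reason to expand contractions before removing punctuation. Characters such as $ or + are classed as symbols rather than punctuation, so unlike string.punctuation this does not remove them.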

Accented characters and other unicode issues

In addition to the quotation marks, the é in Céline is also a non-ASCII unicode character.

Céline and Celine would be considered separate tokens, and the name is written both ways in different fashion articles, even though both spellings refer to the same designer.

Removing accents helps to normalize the words in your text data.

With the punctuation that didn't get removed before, we were trying to remove something that did not exist!

In this sentence, the “ mark is not the same as the " mark, which was in string.punctuation.

'"' in test_sentence

False

We can convert these unicode characters to their closest ASCII equivalents with the unidecode module.

The module takes a unicode string and tries to represent it in ASCII characters. Read more here.

test_sentence = unidecode.unidecode(test_sentence)
print(test_sentence)

"Everything we're doing is about going forward," Phoebe Philo told Vogue in 2009, shortly before showing her debut Resort collection for Celine.

And check the quotation marks.

'"' in test_sentence

True
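If you only need to remove accents and would rather not add a dependency, the standard library can do that part. The sketch below decomposes each accented letter and drops the accent marks; the function name strip_accents is made up. Unlike unidecode, it leaves curly quotes alone.

```python
import unicodedata

def strip_accents(text):
    # NFKD splits 'é' into 'e' plus a combining acute accent (U+0301).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Céline"))
# Celine
print(strip_accents("“Céline”"))
# “Celine”
```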

Convert number words to numeric

You might want to convert numbers from words to numeric.

Or you might want to remove them altogether.

It depends on your project and goals.

Earlier we imported a Python package, word2number, that will convert number words to numbers.

text = "five hundred twenty five thousand six hundred"
print(w2n.word_to_num(text))

525600
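To give a feel for what such a conversion involves, here is a minimal sketch of a number-word parser. It is not how word2number is implemented; the function name words_to_number and the word tables are made up for this example, and it only handles simple cardinal numbers.

```python
SMALL = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000}

def words_to_number(text):
    total = 0    # groups already closed by "thousand" or "million"
    current = 0  # the group currently being read
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in SMALL:
            current += SMALL[word]
        elif word == "hundred":
            # A bare "hundred" counts as one hundred.
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
    return total + current

print(words_to_number("five hundred twenty five thousand six hundred"))
# 525600
print(words_to_number("twenty-one"))
# 21
```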

Stemming and Lemmatization

Lemmatization converts a word to its base form, removing grammatical inflection.

We imported the WordNetLemmatizer from NLTK earlier, but there are several options for lemmatizers in NLTK.

lemmatizer = WordNetLemmatizer()
d = lemmatizer.lemmatize('dresses')
print(d)

dress

The lemmatizer takes a word, and an optional part of speech - if you do not specify the part of speech, the default is noun.

Stemming is similar to lemmatization, but it mainly chops off suffixes using a fixed set of rules, while the lemmatizer takes into account parts of speech and is more sophisticated in determining the base form of a word.

stemmer = PorterStemmer()
d = stemmer.stem('dresses')
print(d)

dress

An example where you can see the difference is the word 'better', which is an adjective, so pass pos='a' to the lemmatizer.

better_lemmatized = lemmatizer.lemmatize('better', pos='a')
better_stemmed = stemmer.stem('better')

print(better_lemmatized)
print(better_stemmed)

good
better

The lemmatizer returns the actual root of this word, 'good'.

You can find the NLTK docs for lemmatizing and stemming here.

British vs. American English

One other thing to consider is whether your text has a mix of British and American spellings.

I haven't found a package for converting words from one spelling to the other, but there is a comprehensive list of spelling differences.

You could make a dictionary out of these to convert words in your text.
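A minimal sketch of that dictionary approach might look like this. The BRITISH_TO_AMERICAN table holds only a few hand-picked pairs and the function name americanize is made up; a real table would be built from a full spelling list.

```python
# Illustrative subset only.
BRITISH_TO_AMERICAN = {
    "colour": "color",
    "centre": "center",
    "organise": "organize",
    "grey": "gray",
}

def americanize(tokens):
    # Look up the lowercased token; tokens not in the table pass through.
    # Matched tokens come back lowercase, so run this after lowercasing.
    return [BRITISH_TO_AMERICAN.get(tok.lower(), tok) for tok in tokens]

print(americanize(["the", "colour", "grey"]))
# ['the', 'color', 'gray']
```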

Putting it all together

I've put most of these techniques together in a class to quickly demonstrate, which you can find here.

First it operates on the entire text block

removes extra whitespace within text
converts unicode to ascii
converts to lowercase
expands contractions
tokenizes sentences

Then on each sentence

tokenizes words
removes punctuation
removes leading or trailing whitespace
lemmatizes words
removes stopwords

You could also just do the operations on the whole text block, but sometimes you might want to keep sentences separate.
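To show how the steps chain together, here is a rough standard-library stand-in for that kind of pipeline. It is not the TextNormalizer class mentioned above: the function names and the STOP_WORDS set are made up, the sentence splitting is naive, and contraction expansion and lemmatization are left out.

```python
import re
import string
import unicodedata

STOP_WORDS = {"the", "a", "is", "in", "it", "was", "for", "her", "to", "by"}  # illustrative

def to_ascii(text):
    # Map curly quotes by hand, then drop accents via NFKD decomposition.
    text = text.translate(str.maketrans({"“": '"', "”": '"', "’": "'", "‘": "'"}))
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def normalize_text(text):
    # Whole-text steps: whitespace, ASCII conversion, lowercase.
    text = re.sub(r"\s+", " ", text).strip()
    text = to_ascii(text).lower()
    # Naive sentence split: break after '.', '!' or '?' followed by a space.
    sentences = re.split(r"(?<=[.!?]) ", text)
    table = str.maketrans("", "", string.punctuation)
    normalized = []
    for sentence in sentences:
        # Per-sentence steps: remove punctuation, tokenize, drop stopwords.
        tokens = sentence.translate(table).split()
        normalized.append(" ".join(t for t in tokens if t not in STOP_WORDS))
    return normalized

print(normalize_text("It was  Philo who revived Céline. Critics credited her."))
# ['philo who revived celine', 'critics credited']
```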

normalized = TextNormalizer().normalize_text(text)
for sentence in normalized:
    print(sentence)

everything going forward phoebe philo told vogue 2009 shortly showing debut resort collection celine
although label garnered headline revived michael kor late 90 philo truly brought till somewhat somnambulant luxury house forefront
critic credited pushing fashion new direction toward spare strippeddown kind sophistication
celine offered woman magazine put grownup hip way put together

The one issue here is that lemmatization changed Michael Kors' name to Kor.

So it's always important to check your data as you process it so that you can make adjustments.
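One possible adjustment is to keep a set of protected tokens that skip lemmatization. The sketch below takes the lemmatizing function as a parameter, so you could pass in WordNetLemmatizer().lemmatize; the naive_lemmatize stand-in and all of the names here are made up so the example runs on its own.

```python
def lemmatize_tokens(tokens, lemmatize, protected=frozenset()):
    # Tokens listed in `protected` are passed through untouched.
    return [tok if tok in protected else lemmatize(tok) for tok in tokens]

def naive_lemmatize(word):
    # Stand-in for a real lemmatizer: strip one trailing 's'.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

tokens = ["michael", "kors", "headlines"]
print(lemmatize_tokens(tokens, naive_lemmatize))
# ['michael', 'kor', 'headline']
print(lemmatize_tokens(tokens, naive_lemmatize, protected={"kors"}))
# ['michael', 'kors', 'headline']
```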

Lemmatization might not even be necessary, depending on your project.

You could try training your NLP model without doing that type of processing and see how it performs.

Thanks for reading!

If you have any questions or comments, write them below or reach out on Twitter @LVNGD.


About Me

Hi, I'm Christina!

I'm a Python developer and data enthusiast, and mostly blog about things I've done or learned related to both of those.
I'm also available for consulting projects.
Reach out to me below.
