Coreference resolution in Python with Spacy + NeuralCoref

Rihanna coreference resolution paragraph graphic

Inspiration credit: the text in this graphic, as well as in another example in this post, is from this article from WhoWhatWear.

Coreference resolution is a task in Natural Language Processing (NLP) where the aim is to group together linguistic expressions that refer to the same entity.

I thought Naomi's dress looked great on her and wanted to find out where she bought it.

There are two entities in this sentence:

  1. Naomi
  2. Naomi's dress

Naomi, her and she all refer to a single entity.

Naomi's dress and it both refer to another entity.

Our brains will instantly recognize this, but a computer will not.
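Written out by hand, the grouping for this sentence could look like the plain Python data below. This is only a sketch: the `clusters` dictionary and the `entity_of` helper are names made up for this illustration, not part of Spacy or NeuralCoref.

```python
# Hand-written clusters for the example sentence (not model output).
# Each key is the main mention; each value lists every expression referring to it.
clusters = {
    "Naomi": ["Naomi", "her", "she"],
    "Naomi's dress": ["Naomi's dress", "it"],
}


def entity_of(expression, clusters):
    """Return the main mention whose cluster contains the expression, or None."""
    for main, mentions in clusters.items():
        if expression in mentions:
            return main
    return None


print(entity_of("she", clusters))
print(entity_of("it", clusters))
```

A coreference resolver has to produce this kind of grouping automatically from raw text.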


With just one sentence it might not seem too difficult to figure out which entity each expression refers to, but in a lot of cases you have paragraphs of information with references to various entities scattered throughout.

In this review of applications of coreference resolution in clinical medicine, there is an example of medical notes for a patient where coreference resolution helped to establish that the patient used Tylenol to treat his shoulder pain.


Spacy + NeuralCoref

NeuralCoref is a pipeline extension for Spacy to resolve coreferences, and is straightforward to use.

You will need to install Spacy if you don't have it already.

Setup + Installation

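First create a virtual environment for the project. The mkvirtualenv command below comes from virtualenvwrapper; any other virtual environment tool works just as well.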

mkvirtualenv corefres

Install Spacy, along with an English model.

pip install spacy
python -m spacy download en_core_web_sm

Installing NeuralCoref is pretty straightforward, but the method depends on your Spacy version: if it is greater than 2.1.0, you will need to install from source.

If your version is 2.1.0 or lower, you can just install it with pip like this.

pip install neuralcoref

Later when we try to import the package, if you get an error about binary incompatibility you will need to come back and reinstall it.

pip uninstall neuralcoref
pip install neuralcoref --no-binary neuralcoref

Installing NeuralCoref from source

If your Spacy version is greater than 2.1.0, you will need to install from source.

git clone https://github.com/huggingface/neuralcoref.git
cd neuralcoref
pip install -r requirements.txt
pip install -e .
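If you are not sure which case applies to you, a small version check can help. The sketch below assumes the 2.1.0 cutoff described above; `needs_source_install` is a hypothetical helper written for this post, not something Spacy or NeuralCoref provides.

```python
import re


def needs_source_install(spacy_version):
    """Return True if the version string is newer than 2.1.0."""
    # Keep the first three numeric parts, e.g. "2.3.5" -> [2, 3, 5].
    parts = [int(p) for p in re.findall(r"\d+", spacy_version)[:3]]
    # Pad short versions such as "2.1" to three parts.
    parts += [0] * (3 - len(parts))
    return tuple(parts) > (2, 1, 0)


# In a real environment, pass spacy.__version__ instead of a literal string.
for version in ["2.0.18", "2.1.0", "2.3.5"]:
    print(version, needs_source_install(version))
```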

Resolving coreferences

Now we're ready to resolve some coreferences.

import spacy
import neuralcoref

If you got the error related to binary incompatibility, go back and reinstall with the --no-binary flag.

Load the English model - you can use other English models as well.

nlp = spacy.load('en_core_web_sm')

Add neuralcoref to the pipeline.

neuralcoref.add_to_pipe(nlp)

The text I'm using for this example is from an article from WhoWhatWear.

text = "Rihanna is basically master of the fashion universe right now, so we're naturally going to pay attention to what trends she is and isn't wearing whenever she steps out of the door (or black SUV). She's having quite the epic week, first presenting her Savage x Fenty lingerie runway show then hosting her annual Diamond Ball charity event last night. Rihanna was decked out in Givenchy for the big event, but upon arrival at the venue, she wore a T-shirt, diamonds (naturally), and a scarf, leather pants, and heels in fall's biggest color trend: pistachio green."

Now we can use Spacy as usual, with neuralcoref as part of the pipeline.

doc = nlp(text)

Passing the text to the model starts a sequence of steps. The document is tokenized first, and then it goes through the processing pipeline, which runs a tagger, a parser, an entity recognizer and, since we added it to the pipeline, coreference resolution.
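To make the order of operations concrete, here is a toy pipeline in plain Python. It is not how Spacy is implemented: `tokenize`, `tag_lengths`, `mark_pronouns` and `run_pipeline` are hypothetical stand-ins, and the "document" is just a dictionary.

```python
def tokenize(text):
    """Toy tokenizer: split on whitespace and wrap the tokens in a dict."""
    return {"text": text, "tokens": text.split()}


def tag_lengths(doc):
    """Stand-in for a tagger: annotate each token with its length."""
    doc["lengths"] = [len(token) for token in doc["tokens"]]
    return doc


def mark_pronouns(doc):
    """Stand-in for a later component that relies on the tokens."""
    pronouns = {"she", "her", "he", "him", "it"}
    doc["pronouns"] = [t for t in doc["tokens"] if t.lower() in pronouns]
    return doc


def run_pipeline(text, components):
    """Tokenize first, then pass the document through each component in order."""
    doc = tokenize(text)
    for component in components:
        doc = component(doc)
    return doc


toy_doc = run_pipeline("Naomi said she liked it", [tag_lengths, mark_pronouns])
print(toy_doc["pronouns"])
```

In Spacy itself, nlp.pipe_names lists the names of the pipeline components in the order they run, which is a quick way to confirm that neuralcoref was added.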

Check if there were resolved coreferences

We can check if there were any resolved coreferences in the text.

doc._.has_coref

This is True if any coreferences were resolved in the text.

Look at each cluster of coreferences

The clusters are the groups of references to an original entity.

In this text example there is only one cluster for Rihanna.

The clusters can be found in doc._.coref_clusters.

for cluster in doc._.coref_clusters:
    for reference in cluster:
        # each reference is a Span object in Spacy
        print(reference)
        # index of the first token of this reference in the doc
        print(reference.start)
        # index of the token just after this reference (the end is exclusive)
        print(reference.end)

Each reference here is a Span object in Spacy, and you can get the start and end token indices in the original doc for each of these references.
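The start and end values of a Span are token positions, not character offsets, and the end is exclusive, just like a Python slice. The sketch below mimics that with a plain list of tokens; `mention_text` is a made-up helper for illustration only.

```python
# Toy token list standing in for a Spacy Doc.
tokens = ["The", "red", "dress", "looked", "great", "and", "it", "was", "cheap"]

# (start, end) pairs in token positions; end is exclusive.
mentions = [(0, 3), (6, 7)]


def mention_text(tokens, start, end):
    """Join the tokens covered by the half-open range [start, end)."""
    return " ".join(tokens[start:end])


for start, end in mentions:
    print(start, end, mention_text(tokens, start, end))
```

If you need character positions instead, a Span also has start_char and end_char attributes.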

Text with resolved coreferences

In this text we only had one cluster for the main entity Rihanna, and then the coreferences that refer to Rihanna are she, her, etc.

So you might want to replace the coreferences in the text with the original entity.

Luckily, NeuralCoref has already done this for us!

resolved_doc = doc._.coref_resolved
print(resolved_doc)

Rihanna is basically master of the fashion universe right now, so we're naturally going to pay attention to what trends Rihanna is and isn't wearing whenever Rihanna steps out of the door (or black SUV). Rihanna's having quite the epic week, first presenting Rihanna Savage x Fenty lingerie runway show then hosting Rihanna annual Diamond Ball charity event last night. Rihanna was decked out in Givenchy for the big event, but upon arrival at the venue, Rihanna wore a T-shirt, diamonds (naturally), and a scarf, leather pants, and heels in fall's biggest color trend: pistachio green.

It takes each cluster of coreferences and replaces the coreferences with the main entity.

So she and her have been replaced with Rihanna in this text.
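To get a feel for what this replacement step involves, here is a minimal sketch in plain Python. It is not NeuralCoref's actual implementation: `resolve` is a hypothetical function, the clusters are written by hand, and mentions are assumed not to overlap.

```python
def resolve(tokens, clusters):
    """Replace every non-main mention with the text of its cluster's main mention.

    tokens: list of strings.
    clusters: list of dicts with a "main" (start, end) pair and a list of
    "mentions" as (start, end) pairs. The end index is exclusive.
    """
    replacements = {}
    for cluster in clusters:
        main_start, main_end = cluster["main"]
        main_text = " ".join(tokens[main_start:main_end])
        for start, end in cluster["mentions"]:
            if (start, end) != cluster["main"]:
                replacements[start] = (end, main_text)

    output = []
    i = 0
    while i < len(tokens):
        if i in replacements:
            end, main_text = replacements[i]
            output.append(main_text)
            i = end  # skip over the tokens of the replaced mention
        else:
            output.append(tokens[i])
            i += 1
    return " ".join(output)


toy_tokens = ["Rihanna", "wore", "a", "scarf", ".", "She", "loved", "it", "."]
toy_clusters = [
    {"main": (0, 1), "mentions": [(0, 1), (5, 6)]},
    {"main": (2, 4), "mentions": [(2, 4), (7, 8)]},
]
print(resolve(toy_tokens, toy_clusters))
```

Joining with spaces leaves a gap before punctuation, which the real library avoids by keeping the original whitespace of each token.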

Rare words or names

If your text has rare words or names, the coreference resolutions might not initially turn out quite right.

You can provide a conversion dictionary to NeuralCoref that will help to resolve coreferences related to the rare word.

Consider the following text.

Saoirse has a dog. She enjoys going running with him.

text = "Saoirse has a dog. She enjoys going running with him."
doc = nlp(text)

print(doc._.coref_clusters)
print(doc._.coref_resolved)

[Saoirse: [Saoirse, She, him]]
Saoirse has a dog. Saoirse enjoys going running with Saoirse.

The pronoun him is meant to refer to the dog, but instead it was grouped with Saoirse.

The conversion dictionary will have the rare name as a key and then the value(s) will be more common words that could replace the rare word.

conv_dict = {'Saoirse': ['woman']}

NeuralCoref uses word embeddings to resolve coreferences. With a conversion dictionary, it uses the average of the embeddings of the common words you provide in place of the embedding of the rare name.
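The averaging itself is simple arithmetic. The sketch below uses made-up three-dimensional vectors (real embeddings have many more dimensions), and `average_vectors` and `toy_embeddings` are hypothetical names for this illustration, not NeuralCoref internals.

```python
def average_vectors(vectors):
    """Element-wise mean of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


# Made-up vectors for two common words.
toy_embeddings = {
    "woman": [0.25, 0.75, 0.5],
    "girl": [0.75, 0.25, 0.0],
}

# Same shape as the conversion dictionary, here with two common words.
toy_conv_dict = {"Saoirse": ["woman", "girl"]}

replacement = average_vectors(
    [toy_embeddings[word] for word in toy_conv_dict["Saoirse"]]
)
print(replacement)
```

With a single common word, as in the conversion dictionary above, the average is just that word's embedding.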

So now we want to remove NeuralCoref from the pipeline and then add it back with the conversion dictionary.

nlp.remove_pipe("neuralcoref")
neuralcoref.add_to_pipe(nlp, conv_dict=conv_dict)

Now try resolving coreferences again.

doc = nlp(text)
print(doc._.coref_clusters)
print(doc._.coref_resolved)

[Saoirse: [Saoirse, She], a dog: [a dog, him]]
Saoirse has a dog. Saoirse enjoys going running with a dog.

Now it identifies two clusters, Saoirse and a dog, and correctly assigns each reference to its cluster.

Sources

Check out the docs for more options on how you can tweak the settings for the coreference resolution model.

If you want to read more about the neural network behind NeuralCoref, check out this blog post from HuggingFace.

Thanks for reading!

Let me know if you have questions or comments by writing them here or reaching out on Twitter @LVNGD.
