Coreference resolution in Python with Spacy + NeuralCoref

Rihanna coreference resolution paragraph graphic

Inspiration credit: the text in this graphic, as well as in another example in this post, is from this article from WhoWhatWear.

Coreference resolution is a task in Natural Language Processing (NLP) where the aim is to group together linguistic expressions that refer to the same entity.

I thought Naomi's dress looked great on her and wanted to find out where she bought it

I thought Naomi's dress looked great on her and wanted to find out where she bought it.

There are two entities in this sentence:

  1. Naomi
  2. Naomi's dress

Naomi, her and she all refer to a single entity.

Naomi's dress and it both refer to another entity.

Our brains will instantly recognize this, but a computer will not.


With just one sentence it might not seem too difficult to figure out which entity each expression refers to, but in a lot of cases you have paragraphs of information with references to various entities scattered throughout.

In this review of applications of coreference resolution in clinical medicine, there is an example of medical notes for a patient where coreference resolution helped to establish that the patient used Tylenol to treat his shoulder pain.


Spacy + NeuralCoref

NeuralCoref is a pipeline extension for Spacy to resolve coreferences, and is straightforward to use.

You will need to install Spacy if you don't have it already.

Setup + Installation

First create a virtual environment for the project.

mkvirtualenv corefres

Install Spacy, along with an English model.

pip install spacy
python -m spacy download en_core_web_sm

Installing NeuralCoref is pretty straightforward, but if your Spacy installation version is > 2.1.0, you will need to install from source.

If your version is less than 2.1.0, you can just install like this.

pip install neuralcoref

Later when we try to import the package, if you get an error about binary incompatibility you will need to come back and reinstall it.

pip uninstall neuralcoref
pip install neuralcoref --no-binary neuralcoref

Installing NeuralCoref from source

If your Spacy version is greater than 2.1.0, you will need to install from source.

git clone https://github.com/huggingface/neuralcoref.git
cd neuralcoref
pip install -r requirements.txt
pip install -e .

Resolving coreferences

Now we're ready to resolve some coreferences.

import spacy
import neuralcoref

If you got the error related to binary incompatibility, go back and reinstall with the --no-binary flag.

Load the English model - you can use other English models as well.

nlp = spacy.load('en_core_web_sm')

Add neuralcoref to the pipeline.

neuralcoref.add_to_pipe(nlp)

The text I'm using for this example is from an article from WhoWhatWear.

Text illustration

text = "Rihanna is basically master of the fashion universe right now, so we're naturally going to pay attention to what trends she is and isn't wearing whenever she steps out of the door (or black SUV). She's having quite the epic week, first presenting her Savage x Fenty lingerie runway show then hosting her annual Diamond Ball charity event last night. Rihanna was decked out in Givenchy for the big event, but upon arrival at the venue, she wore a T-shirt, diamonds (naturally), and a scarf, leather pants, and heels in fall's biggest color trend: pistachio green."

Now we can use Spacy as usual, with neuralcoref as part of the pipeline.

doc = nlp(text)

Pass the text to the model, which initiates a number of steps, first tokenizing the document and then starting the processing pipeline which processes the document with a tagger, a parser, an entity recognizer, and coreference resolution, since we added it to the pipeline.

Check if there were resolved coreferences

We can check if there were any resolved coreferences in the text.

doc._.has_coref

This returns True if there were coreference resolutions.

Look at each cluster of coreferences

The clusters are the groups of references to an original entity.

In this text example there is only one cluster for Rihanna.

The clusters can be found in doc._.coref_clusters.

for cluster in doc._.coref_clusters:
    for reference in cluster:
    #each of these is a Span object in Spacy
        print(reference)
        #starting index of this reference in the text
        print(reference.start) 
        #ending index of this reference in the text
        print(reference.end)

Each reference here is a Span object in Spacy, and you get can the start and end indices in the original doc for each of these references.

Text with resolved coreferences

In this text we only had one cluster for the main entity Rihanna, and then the coreferences that refer to Rihanna are she, her, etc.

So you might want to replace the coreferences in the text with the original entity.

Luckily. NeuralCoref has already done this for us!

resolved_doc = doc._.coref_resolved
print(resolved_doc)

Rihanna is basically master of the fashion universe right now, so we're naturally going to pay attention to what trends Rihanna is and isn't wearing whenever Rihanna steps out of the door (or black SUV). Rihanna's having quite the epic week, first presenting Rihanna Savage x Fenty lingerie runway show then hosting Rihanna annual Diamond Ball charity event last night. Rihanna was decked out in Givenchy for the big event, but upon arrival at the venue, Rihanna wore a T-shirt, diamonds (naturally), and a scarf, leather pants, and heels in fall's biggest color trend: pistachio green.

rihanna_coref_after.png

It takes each cluster of coreferences and replaces the coreferences with the main entity.

So she and her have been replaced with Rihanna in this text.

Rare words or names

If your text has rare words or names, the coreference resolutions might not initially turn out quite right.

You can provide a conversion dictionary to NeuralCoref that will help to resolve coreferences related to the rare word.

Consider the following text.

Saoirse has a dog. She enjoys going running with him.

doc = nlp(text)

print(doc._.coref_clusters)
print(doc._.coref_resolved)

[Saoirse: [Saoirse, She, him]]
Saoirse has a dog. Saoirse enjoys going running with Saoirse.

The reference to him is meant to refer to the dog, but instead it was grouped with Saoirse.

The conversion dictionary will have the rare name as a key and then the value(s) will be more common words that could replace the rare word.

conv_dict = {'Saoirse': ['woman']}

NeuralCoref uses word embeddings to resolve coreferences, and will use an average of the embeddings for the common words provided, instead of the embedding for the rare name to resolve coreferences.

So now we want to remove NeuralCoref from the pipeline and then add it back with the conversion dictionary.

nlp.remove_pipe("neuralcoref")
neuralcoref.add_to_pipe(nlp, conv_dict=conv_dict)

Now try resolving coreferences again.

doc = nlp(text)
print(doc._.coref_clusters)
print(doc._.coref_resolved)

[Saoirse: [Saoirse, She], a dog: [a dog, him]]
Saoirse has a dog. Saoirse enjoys going running with a dog.

Now it identified two clusters, Saoirse and a dog, and correctly associated the references to each cluster.

Sources

Check out the docs for more options on how you can tweak the settings for the coreference resolution model.

If you want to read more about the neural network behind NeuralCoref, check out this blog post from HuggingFace.

Thanks for reading!

Let me know if you have questions or comments by writing them here or reaching out on Twitter @LVNGD.

blog comments powered by Disqus

Recent Posts

mortonzcurve.png
Computing Morton Codes with a WebGPU Compute Shader
May 29, 2024

Starting out with general purpose computing on the GPU, we are going to write a WebGPU compute shader to compute Morton Codes from an array of 3-D coordinates. This is the first step to detecting collisions between pairs of points.

Read More
webgpuCollide.png
WebGPU: Building a Particle Simulation with Collision Detection
May 13, 2024

In this post, I am dipping my toes into the world of compute shaders in WebGPU. This is the first of a series on building a particle simulation with collision detection using the GPU.

Read More
abstract_tree.png
Solving the Lowest Common Ancestor Problem in Python
May 9, 2023

Finding the Lowest Common Ancestor of a pair of nodes in a tree can be helpful in a variety of problems in areas such as information retrieval, where it is used with suffix trees for string matching. Read on for the basics of this in Python.

Read More
Get the latest posts as soon as they come out!