May 6, 2020
Coreference resolution in Python with Spacy + NeuralCoref
I thought Naomi's dress looked great on her and wanted to find out where she bought it.
There are two entities in this sentence:
- Naomi
- Naomi's dress
Naomi, her and she all refer to a single entity.
Naomi's dress and it both refer to another entity.
Our brains will instantly recognize this, but a computer will not.
With just one sentence it might not seem too difficult to figure out which entity each expression refers to, but in a lot of cases you have paragraphs of information with references to various entities scattered throughout.
In this review of applications of coreference resolution in clinical medicine, there is an example of medical notes for a patient where coreference resolution helped to establish that the patient used Tylenol to treat his shoulder pain.
Spacy + NeuralCoref
NeuralCoref is a pipeline extension for Spacy to resolve coreferences, and is straightforward to use.
You will need to install Spacy if you don't have it already.
Setup + Installation
First create a virtual environment for the project.
Install Spacy, along with an English model.
pip install spacy
python -m spacy download en_core_web_sm
Installing NeuralCoref is pretty straightforward, but if your Spacy installation version is > 2.1.0, you will need to install from source.
If your version is less than 2.1.0, you can just install like this.
Later when we try to import the package, if you get an error about binary incompatibility you will need to come back and reinstall it.
pip uninstall neuralcoref
pip install neuralcoref --no-binary neuralcoref
Installing NeuralCoref from source
If your Spacy version is greater than 2.1.0, you will need to install from source.
git clone https://github.com/huggingface/neuralcoref.git
cd neuralcoref
pip install -r requirements.txt
pip install -e .
Resolving coreferences
Now we're ready to resolve some coreferences.
import spacy
import neuralcoref
If you got the error related to binary incompatibility, go back and reinstall with the --no-binary
flag.
Load the English model - you can use other English models as well.
nlp = spacy.load('en_core_web_sm')
Add neuralcoref to the pipeline.
neuralcoref.add_to_pipe(nlp)
The text I'm using for this example is from an article from WhoWhatWear.
text = "Rihanna is basically master of the fashion universe right now, so we're naturally going to pay attention to what trends she is and isn't wearing whenever she steps out of the door (or black SUV). She's having quite the epic week, first presenting her Savage x Fenty lingerie runway show then hosting her annual Diamond Ball charity event last night. Rihanna was decked out in Givenchy for the big event, but upon arrival at the venue, she wore a T-shirt, diamonds (naturally), and a scarf, leather pants, and heels in fall's biggest color trend: pistachio green."
Now we can use Spacy as usual, with neuralcoref as part of the pipeline.
Pass the text to the model, which initiates a number of steps, first tokenizing the document and then starting the processing pipeline which processes the document with a tagger, a parser, an entity recognizer, and coreference resolution, since we added it to the pipeline.
Check if there were resolved coreferences
We can check if there were any resolved coreferences in the text.
This returns True if there were coreference resolutions.
Look at each cluster of coreferences
The clusters are the groups of references to an original entity.
In this text example there is only one cluster for Rihanna.
The clusters can be found in doc._.coref_clusters
.
for cluster in doc._.coref_clusters:
for reference in cluster:
#each of these is a Span object in Spacy
print(reference)
#starting index of this reference in the text
print(reference.start)
#ending index of this reference in the text
print(reference.end)
Each reference here is a Span object in Spacy, and you get can the start and end indices in the original doc for each of these references.
Text with resolved coreferences
In this text we only had one cluster for the main entity Rihanna, and then the coreferences that refer to Rihanna are she, her, etc.
So you might want to replace the coreferences in the text with the original entity.
Luckily. NeuralCoref has already done this for us!
resolved_doc = doc._.coref_resolved
print(resolved_doc)
Rihanna is basically master of the fashion universe right now, so we're naturally going to pay attention to what trends Rihanna is and isn't wearing whenever Rihanna steps out of the door (or black SUV). Rihanna's having quite the epic week, first presenting Rihanna Savage x Fenty lingerie runway show then hosting Rihanna annual Diamond Ball charity event last night. Rihanna was decked out in Givenchy for the big event, but upon arrival at the venue, Rihanna wore a T-shirt, diamonds (naturally), and a scarf, leather pants, and heels in fall's biggest color trend: pistachio green.
It takes each cluster of coreferences and replaces the coreferences with the main entity.
So she and her have been replaced with Rihanna in this text.
Rare words or names
If your text has rare words or names, the coreference resolutions might not initially turn out quite right.
You can provide a conversion dictionary to NeuralCoref that will help to resolve coreferences related to the rare word.
Consider the following text.
Saoirse has a dog. She enjoys going running with him.
doc = nlp(text)
print(doc._.coref_clusters)
print(doc._.coref_resolved)
[Saoirse: [Saoirse, She, him]]
Saoirse has a dog. Saoirse enjoys going running with Saoirse.
The reference to him is meant to refer to the dog, but instead it was grouped with Saoirse.
The conversion dictionary will have the rare name as a key and then the value(s) will be more common words that could replace the rare word.
conv_dict = {'Saoirse': ['woman']}
NeuralCoref uses word embeddings to resolve coreferences, and will use an average of the embeddings for the common words provided, instead of the embedding for the rare name to resolve coreferences.
So now we want to remove NeuralCoref from the pipeline and then add it back with the conversion dictionary.
nlp.remove_pipe("neuralcoref")
neuralcoref.add_to_pipe(nlp, conv_dict=conv_dict)
Now try resolving coreferences again.
doc = nlp(text)
print(doc._.coref_clusters)
print(doc._.coref_resolved)
[Saoirse: [Saoirse, She], a dog: [a dog, him]]
Saoirse has a dog. Saoirse enjoys going running with a dog.
Now it identified two clusters, Saoirse and a dog, and correctly associated the references to each cluster.
Sources
Check out the docs for more options on how you can tweak the settings for the coreference resolution model.
If you want to read more about the neural network behind NeuralCoref, check out this blog post from HuggingFace.
Thanks for reading!
Let me know if you have questions or comments by writing them here or reaching out on Twitter @LVNGD.