Building a custom Scikit-learn Transformer using GloVe vectors from Spacy as features

Using word vector features with Scikit-learn (featuring Spacy)

If you're working with text data and building a Natural Language Processing (NLP) model, one important task you will be confronted with is extracting features from the text.

This usually means transforming the text into a numerical format that machine learning algorithms can understand.

In this post

  1. We will look at representing text with GloVe word vectors, and how to easily get the vectors using Spacy.
  2. Then we will go over how to create a dataset using word vectors as features, formatted for use with Scikit-learn algorithms.
  3. Finally, we will take the code from step 2 and write a custom Scikit-learn transformer class to transform raw text samples into a word vector feature matrix.

You might be familiar with using word counts or TF-IDF to extract features from text data.

Scikit-learn provides classes out-of-the-box for both of these to transform text data samples into a feature matrix that can be fed to machine learning algorithms.

These transformers can be easily incorporated into a Scikit-learn pipeline, which might look something like this:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

text_clf = Pipeline([
            ('vect', TfidfVectorizer()),
            ('clf', SGDClassifier()),
            ])

You can substitute any other classification algorithm for SGDClassifier.

Then you can just pass the list of raw text documents, along with corresponding training labels, to the pipeline.
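For example, with a couple of hypothetical toy documents (substitute your own training data):

train_docs = ["a tiny toy example", "another toy document"]
train_labels = [0, 1]

text_clf.fit(train_docs, train_labels)
print(text_clf.predict(["a new document to classify"]))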

Using word counts as features can often be useful in training a machine learning model, but sometimes word counts do not provide enough information.

With certain problems like translation from one language to another, information about the meaning of words and their context is helpful.

Enter word vectors


In this post we will look at representing text documents with word vectors, which are vectors of numbers that represent the meaning of a word.

Then we will write a custom Scikit-learn transformer class for the word vector features - similar to TfidfVectorizer or CountVectorizer - which can be plugged into a pipeline.

What are word vectors?

Word vectors, or word embeddings, are vectors of numbers that provide information about the meaning of a word, as well as its context.

You can get the semantic similarity of two words by comparing their word vectors.
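For example, Spacy exposes this comparison directly through a similarity method (a minimal sketch; downloading the en_core_web_lg model is covered below):

import spacy

nlp = spacy.load("en_core_web_lg")

# similarity compares the two word vectors (cosine similarity)
hat = nlp("hat")
cap = nlp("cap")
print(hat.similarity(cap))  # values closer to 1.0 mean more similar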

Even if you're not familiar with word vectors, you may have heard of a couple of popular algorithms for obtaining them, such as word2vec and GloVe.

There are pre-trained models that you can download to access word vectors, and if you are using Spacy, GloVe vectors are made available in the larger models.

Accessing word vectors in Spacy

With Spacy you can easily get vectors of words, as well as sentences.

I'm assuming at least some familiarity with Spacy in this post.

Note that the small Spacy models (those ending in sm, such as en_core_web_sm) do not include built-in vectors, so you will need one of the larger models to use them.

python -m spacy download en_core_web_lg

Vectors are made available in Spacy Token, Doc and Span objects.

import spacy

nlp = spacy.load("en_core_web_lg")

The vector is a one-dimensional Numpy array of 32-bit floats.

For example, take the word hat.

First you could check if the word has a vector.

hat = nlp("hat")
hat.has_vector 

True

If it has a vector, you can retrieve it from the vector attribute.

hat.vector

array([ 0.25681  , -0.35552  , -0.18733  , -0.16592  , -0.68094  ,
        0.60802  ,  0.16501  ,  0.17907  ,  0.17855  ,  1.2894   ,
       -0.46481  , -0.22667  ,  0.035198 , -0.45087  ,  0.71845  ,
                ...
       -0.94376  , -0.10265  ,  0.4415   ,  0.37775  , -0.24274  ,
       -0.42695  ,  0.18544  ,  0.16044  , -0.63395  , -0.074032 ,
       -0.038969 ,  0.30813  , -0.069243 ,  0.13493  ,  0.37585  ],
      dtype=float32)

The full vector has 300 dimensions.

hat.vector.shape

(300,)

A vector for a sentence is similar and has the same shape.

sent = nlp("He wore a red shirt with gray pants.").vector

array([ 8.16512257e-02, -8.81854445e-02, -1.21790558e-01, -7.65599236e-02,
        8.34635943e-02,  5.33326678e-02, -1.63263362e-02, -3.44585180e-01,
       -1.27936229e-01,  1.74646115e+00, -1.88558996e-01,  6.99177757e-02,
                ...
        1.32453769e-01, -1.40210897e-01, -5.84307760e-02,  3.93804982e-02,
        1.89477772e-01, -1.38648778e-01, -1.60174996e-01,  2.84267794e-02,
        2.16686666e-01,  1.05772227e-01,  1.48718446e-01,  9.56766680e-02],
      dtype=float32)

The sentence vector has the same shape as a word vector because it is the average of the vectors of the tokens in the sentence.
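You can verify this with a quick check (a sketch, assuming nlp is the en_core_web_lg model loaded above):

import numpy as np

doc = nlp("He wore a red shirt with gray pants.")
token_vectors = np.array([token.vector for token in doc])

# the Doc vector is the mean of the token vectors
print(np.allclose(doc.vector, token_vectors.mean(axis=0)))  # True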

Formatting the input data for Scikit-learn

Ultimately the goal is to turn a list of text samples into a feature matrix, where there is a row for each text sample, and a column for each feature.

Each vector starts out one-dimensional with shape (300,), but we want it to be a row of shape (1, 300) so that the samples can be stacked into a feature matrix.

So the first step is to reshape the word vector.

sent = sent.reshape(1,-1)
sent.shape

(1, 300)

Then the rows are all concatenated together to create the full feature matrix.

Let's look at an example

Say you have a corpus like the one below, with the goal of classifying the sentences as either talking about some item of clothing or not.

corpus = [
    "I went outside yesterday and picked some flowers.",
    "She wore a red hat with a dress to the party.",
    "I think he was wearing athletic clothes and sneakers of some sort.",
    "I took my dog for a walk at the park.",
    "I found a hot pink hat on sale over the weekend.",
    "The dog has brown fur with white spots."
]

labels = [0,1,1,0,1,0]

Training labels - two classes

  • 0 if not talking about clothing.
  • 1 if talking about clothing.

Turning the data into a feature matrix

In just a few steps, we can create the feature matrix from these data samples.

  1. Get the vector of each sentence from Spacy.
  2. Reshape each vector.
  3. Concatenate the sentence vectors together with numpy.concatenate.

import numpy as np

data_list = [nlp(doc).vector.reshape(1,-1) for doc in corpus]
data = np.concatenate(data_list)

array([[ 0.08162278,  0.15696655, -0.32472467, ...,  0.01618122,
         0.01810523,  0.2212121 ],
       [ 0.1315948 , -0.0819225 , -0.08803785, ..., -0.01854067,
         0.09653309,  0.1096675 ],
       [ 0.07139538,  0.09503647, -0.14292692, ...,  0.01818248,
         0.10714766,  0.07863422],
       [ 0.14246173,  0.18372808, -0.18847175, ...,  0.174818  ,
        -0.07943812,  0.20305632],
       [ 0.08148216,  0.09574908, -0.13909541, ..., -0.10646044,
        -0.03817916,  0.22827934],
       [-0.09829144, -0.02671766, -0.07231866, ..., -0.00786566,
         0.00078378,  0.12298879]], dtype=float32)

At this point the data is in the correct input format for many Scikit-learn algorithms.
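For instance, you could fit a classifier on it directly (a minimal sketch using the toy corpus and labels above):

from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
clf.fit(data, labels)
print(clf.predict(data[:2]))  # predictions for the first two samples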

Now we will package up this code into a reusable class that can be used in a pipeline.

Writing a Scikit-learn transformer class

We can write a custom transformer class to be used just like the Scikit-learn TfidfVectorizer or CountVectorizer that we saw earlier.

WordVectorTransformer class

import numpy as np
import spacy
from sklearn.base import BaseEstimator, TransformerMixin

class WordVectorTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, model="en_core_web_lg"):
        # store only the model name; the model itself is loaded in transform
        self.model = model

    def fit(self, X, y=None):
        # nothing to learn from the data, so fit is a no-op
        return self

    def transform(self, X):
        nlp = spacy.load(self.model)
        # one row per document: each Doc vector reshaped from (300,) to (1, 300)
        return np.concatenate([nlp(doc).vector.reshape(1, -1) for doc in X])

  • The class inherits from the Scikit-learn base classes BaseEstimator and TransformerMixin, which you can read about in the Scikit-learn docs.
  • It needs a fit and a transform method.

This transformer stores the name of the Spacy model that we're using, and the transform method is essentially the code from earlier that creates the feature matrix from the raw text samples.

One important thing to keep in mind is that the parameters you pass to __init__ should be stored as-is and not altered.

In this case, I just passed the name of the Spacy model to be used, en_core_web_lg, and the model itself is loaded in the transform method.

At first (before reading the docs more thoroughly...) I tried to load the model in __init__ and assign it to self.model, but that won't work if you are using GridSearchCV with multiprocessing.

This has to do with cloning: Scikit-learn clones estimators by re-creating them from their __init__ parameters, so those parameters need to be simple, picklable values rather than a loaded Spacy model.

You can read the coding guidelines for properly building Scikit-learn components in the Scikit-learn developer documentation.
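To illustrate, here is a minimal sketch of what cloning does with the transformer's parameters:

from sklearn.base import clone

wvt = WordVectorTransformer()
wvt_clone = clone(wvt)  # re-created from its __init__ parameters
print(wvt_clone.model)  # 'en_core_web_lg'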

So now the transformer is ready to use.

transformer = WordVectorTransformer()
transformer.fit_transform(corpus)
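For the six-sentence corpus above, fit_transform returns the same 6 x 300 feature matrix that we built by hand earlier.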

Using the WordVectorTransformer in a Scikit-learn Pipeline

The transformer can also be used in a pipeline.

text_clf = Pipeline([
            ('vect', WordVectorTransformer()),
            ('clf', SGDClassifier()),
            ])

This is the exact same pipeline as we saw earlier in the post, only with WordVectorTransformer instead of TfidfVectorizer.

text_clf.fit(corpus,labels)

Call .fit() with the data samples and labels, as above, and then go about your training and testing process as usual.
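Since the Spacy model is loaded in transform rather than __init__, the pipeline also plays nicely with GridSearchCV (a sketch; the parameter grid here is just an illustration):

from sklearn.model_selection import GridSearchCV

param_grid = {"clf__alpha": [1e-4, 1e-3]}  # hypothetical values to try
search = GridSearchCV(text_clf, param_grid, cv=2)
search.fit(corpus, labels)
print(search.best_params_)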

Thanks for reading!

Let me know if you have questions or comments!

Write them below or feel free to reach out on Twitter @LVNGD.

