Building a custom Scikit-learn Transformer using GloVe vectors from Spacy as features


If you're working with text data and building a Natural Language Processing (NLP) model, one important task you will be confronted with is extracting features from the text.

This usually means transforming the text into a numerical format that machine learning algorithms can understand.

In this post

  1. We will look at representing text with GloVe word vectors, and how to easily get the vectors using Spacy.
  2. Then we will go over how to create a dataset using word vectors as features, that is formatted for use with Scikit-learn algorithms.
  3. Finally, we will take the code from step 2 and write a custom Scikit-learn transformer class to transform raw text samples into a word vector feature matrix.

You might be familiar with using word counts or TF-IDF to extract features from text data.

Scikit-learn provides classes out-of-the-box for both of these to transform text data samples into a feature matrix that can be fed to machine learning algorithms.

These transformers can be easily incorporated into a Scikit-learn pipeline, which might look something like this:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

text_clf = Pipeline([
            ('vect', TfidfVectorizer()),
            ('clf', SGDClassifier()),
])

Substitute any other chosen algorithm for SGDClassifier.

Then you can just pass the list of raw text documents, along with corresponding training labels, to the pipeline.
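As a minimal sketch of that step, the snippet below fits the TF-IDF pipeline on a tiny made-up corpus. The documents, the labels, and the random_state value are illustrative assumptions, not data from this post.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# Toy documents and labels, made up for illustration (1 = clothing, 0 = not).
docs = [
    "She wore a red hat to the party.",
    "I took my dog for a walk.",
    "He bought new sneakers and a jacket.",
    "The park was full of flowers.",
]
y = [1, 0, 1, 0]

text_clf = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SGDClassifier(random_state=0)),
])

# The pipeline takes the raw strings; the vectorizer builds the
# feature matrix internally before the classifier sees the data.
text_clf.fit(docs, y)
predictions = text_clf.predict(docs)
```

The point is only that raw strings go in and predictions come out; four sentences are far too few to train anything meaningful.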

Using word counts as features can often be useful in training a machine learning model, but sometimes word counts do not provide enough information.

With certain problems like translation from one language to another, information about the meaning of words and their context is helpful.

Enter word vectors


In this post we will look at representing text documents with word vectors, which are vectors of numbers that represent the meaning of a word.

Then we will write a custom Scikit-learn transformer class for the word vector features - similar to TfidfVectorizer or CountVectorizer - which can be plugged into a pipeline.

What are word vectors?

Word vectors, or word embeddings, are vectors of numbers that provide information about the meaning of a word, as well as its context.

You can get the semantic similarity of two words by comparing their word vectors.
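A common way to compare two vectors is cosine similarity. The sketch below uses made-up 4-dimensional vectors as stand-ins for real word vectors, so the names and values are assumptions for illustration only.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 4-dimensional vectors standing in for real 300-dimensional ones.
vec_hat = np.array([0.9, 0.1, 0.3, 0.0])
vec_cap = np.array([0.8, 0.2, 0.4, 0.1])
vec_dog = np.array([0.0, 0.9, 0.1, 0.7])

hat_cap = cosine_similarity(vec_hat, vec_cap)
hat_dog = cosine_similarity(vec_hat, vec_dog)

# The two "headwear" vectors point in nearly the same direction,
# so their similarity is much higher than hat vs. dog.
print(hat_cap > hat_dog)  # True
```

Spacy's Token, Span and Doc objects also have a similarity method, so in practice you rarely need to write this by hand.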

Even if you're not familiar with word vectors, you may have heard of a couple of popular algorithms for obtaining them, such as Word2Vec and GloVe.

There are pre-trained models that you can download to access word vectors, and if you are using Spacy, GloVe vectors are made available in the larger models.

Accessing word vectors in Spacy

With Spacy you can easily get vectors of words, as well as sentences.

I'm assuming at least some familiarity with Spacy in this post.

Note that a small Spacy model (one ending in sm, such as en_core_web_sm) does not have built-in vectors, so you will need a larger model to use them.

python -m spacy download en_core_web_lg

Vectors are made available in Spacy Token, Doc and Span objects.

import spacy

nlp = spacy.load("en_core_web_lg")

Whether it belongs to a word or a sentence, the vector will be a one-dimensional Numpy array of float numbers.

For example, take the word hat.

First you could check if the word has a vector.

hat = nlp("hat")
hat.has_vector


If it has a vector, you can retrieve it from the vector attribute.

hat.vector


array([ 0.25681  , -0.35552  , -0.18733  , -0.16592  , -0.68094  ,
        0.60802  ,  0.16501  ,  0.17907  ,  0.17855  ,  1.2894   ,
       -0.46481  , -0.22667  ,  0.035198 , -0.45087  ,  0.71845  ,
       -0.94376  , -0.10265  ,  0.4415   ,  0.37775  , -0.24274  ,
       -0.42695  ,  0.18544  ,  0.16044  , -0.63395  , -0.074032 ,
       -0.038969 ,  0.30813  , -0.069243 ,  0.13493  ,  0.37585  ],

The full vector has 300 dimensions.



A vector for a sentence is similar and has the same shape.

sent = nlp("He wore a red shirt with gray pants.").vector

array([ 8.16512257e-02, -8.81854445e-02, -1.21790558e-01, -7.65599236e-02,
        8.34635943e-02,  5.33326678e-02, -1.63263362e-02, -3.44585180e-01,
       -1.27936229e-01,  1.74646115e+00, -1.88558996e-01,  6.99177757e-02,
        1.32453769e-01, -1.40210897e-01, -5.84307760e-02,  3.93804982e-02,
        1.89477772e-01, -1.38648778e-01, -1.60174996e-01,  2.84267794e-02,
        2.16686666e-01,  1.05772227e-01,  1.48718446e-01,  9.56766680e-02],

The sentence vector is the same shape as the word vector because it is made up of the average of the word vectors over each word in the sentence.
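To see why averaging preserves the shape, here is a small sketch with made-up 5-dimensional vectors in place of real 300-dimensional ones. The values are assumptions chosen so the arithmetic is easy to check.

```python
import numpy as np

# Three made-up 5-dimensional "word vectors" for a three-word sentence.
word_vectors = np.array([
    [1.0, 0.0, 2.0, 4.0, 1.0],
    [3.0, 2.0, 0.0, 0.0, 1.0],
    [2.0, 4.0, 1.0, 2.0, 1.0],
])

# Averaging over the word axis collapses the rows into a single vector.
sentence_vector = word_vectors.mean(axis=0)

print(word_vectors[0].shape)   # (5,)
print(sentence_vector.shape)   # (5,)
print(sentence_vector)         # [2. 2. 1. 2. 1.]
```

However many words the sentence has, the average always has one value per dimension.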

Formatting the input data for Scikit-learn

Ultimately the goal is to turn a list of text samples into a feature matrix, where there is a row for each text sample, and a column for each feature.

A vector from Spacy is initially a one-dimensional array of shape (300,), but we want each sample to be a single row of shape (1, 300), with one column per feature.

So the first step is to reshape the word vector.

sent = sent.reshape(1,-1)


Then the rows are all concatenated together to create the full feature matrix.
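Here is a small sketch of those two steps using stand-in arrays, so the shapes are easy to follow. The 4-value vectors are an assumption for illustration; real vectors would have 300 values each.

```python
import numpy as np

# Stand-ins for three sentence vectors; real ones would have 300 values each.
vectors = [np.arange(4, dtype=np.float32) + i for i in range(3)]
print(vectors[0].shape)   # (4,)

# reshape(1, -1) turns each vector into a single row; -1 lets NumPy infer the width.
rows = [v.reshape(1, -1) for v in vectors]
print(rows[0].shape)      # (1, 4)

# Concatenating along the default axis 0 stacks the rows into a matrix.
features = np.concatenate(rows)
print(features.shape)     # (3, 4)

# np.vstack on the original 1-D vectors builds the same matrix in one call.
stacked = np.vstack(vectors)
print(np.array_equal(features, stacked))  # True
```

If you prefer, np.vstack lets you skip the explicit reshape, since it treats each one-dimensional array as a row.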

Let's look at an example

Say you have a corpus like the one below, with the goal of classifying the sentences as either talking about some item of clothing or not.

corpus = [
"I went outside yesterday and picked some flowers.",
"She wore a red hat with a dress to the party.",
"I think he was wearing athletic clothes and sneakers of some sort.",
"I took my dog for a walk at the park.",
"I found a hot pink hat on sale over the weekend.",
"The dog has brown fur with white spots."
]

labels = [0,1,1,0,1,0]

Training labels - two classes

  • 0 if not talking about clothing.
  • 1 if talking about clothing.

Turning the data into a feature matrix

In just a few steps, we can create the feature matrix from these data samples.

  1. Get the vector of each sentence from Spacy.
  2. Reshape each vector.
  3. Concatenate the sentence vectors all together with numpy.concatenate.

import numpy as np

data_list = [nlp(doc).vector.reshape(1,-1) for doc in corpus]
data = np.concatenate(data_list)

array([[ 0.08162278,  0.15696655, -0.32472467, ...,  0.01618122,
         0.01810523,  0.2212121 ],
       [ 0.1315948 , -0.0819225 , -0.08803785, ..., -0.01854067,
         0.09653309,  0.1096675 ],
       [ 0.07139538,  0.09503647, -0.14292692, ...,  0.01818248,
         0.10714766,  0.07863422],
       [ 0.14246173,  0.18372808, -0.18847175, ...,  0.174818  ,
        -0.07943812,  0.20305632],
       [ 0.08148216,  0.09574908, -0.13909541, ..., -0.10646044,
        -0.03817916,  0.22827934],
       [-0.09829144, -0.02671766, -0.07231866, ..., -0.00786566,
         0.00078378,  0.12298879]], dtype=float32)

At this point the data is in the correct input format for many Scikit-learn algorithms.

Now we will package up this code into a reusable class that can be used in a pipeline.

Writing a Scikit-learn transformer class

We can write a custom transformer class to be used just as Scikit-learn's TfidfVectorizer or CountVectorizer that we saw earlier.

WordVectorTransformer class

import numpy as np
import spacy
from sklearn.base import BaseEstimator, TransformerMixin

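class WordVectorTransformer(TransformerMixin,BaseEstimator):
    def __init__(self, model="en_core_web_lg"):
        # Store only the model name; the model itself is loaded in transform.
        self.model = model

    def fit(self,X,y=None):
        # Nothing to learn from the data, so fit just returns the transformer.
        return self

    def transform(self,X):
        # Load the Spacy model, then build one row per document.
        nlp = spacy.load(self.model)
        return np.concatenate([nlp(doc).vector.reshape(1,-1) for doc in X])

  • The class inherits from a couple of Scikit-learn base classes, BaseEstimator and TransformerMixin, which you can read about in the Scikit-learn docs.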
  • It needs a fit and a transform method.

This transformer initializes the Spacy model that we're using, and then I have pretty much copied and pasted the code from earlier to create the feature matrix from the raw text samples.

One important thing to keep in mind is that the parameters you pass to __init__ should be stored as they are, not altered or changed.

In this case, I just passed the name of the Spacy model to be used, en_core_web_lg, and then the model is actually loaded in the transform method.

At first (before reading the docs more thoroughly...) I tried to load the model in __init__ and assigned that to self.model, but that won't work if you are using GridSearchCV with multiprocessing.

This has to do with cloning.
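To make the cloning step concrete, here is a minimal sketch using a hypothetical stand-in class (ModelNameTransformer is not part of this post's code or of Scikit-learn). It only stores the model name, which is enough to show what clone carries over.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone

class ModelNameTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in that only stores its constructor argument."""

    def __init__(self, model="en_core_web_lg"):
        # Store the argument verbatim under the same name; do no other work here.
        self.model = model

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

original = ModelNameTransformer(model="en_core_web_md")

# clone() reads the parameters with get_params() and calls __init__ again
# with them, so only what __init__ received is carried over to the copy.
cloned = clone(original)

print(original.get_params())   # {'model': 'en_core_web_md'}
print(cloned is original)      # False
print(cloned.model)            # en_core_web_md
```

Since the copy is rebuilt from the constructor parameters, keeping those parameters small and simple, like a model name string, keeps copying cheap and predictable.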

The Scikit-learn developer guide has coding guidelines for building Scikit-learn components properly.

So now the transformer is ready to use.

transformer = WordVectorTransformer()

Using the WordVectorTransformer in a Scikit-learn Pipeline

The transformer can also be used in a pipeline.

text_clf = Pipeline([
            ('vect', WordVectorTransformer()),
            ('clf', SGDClassifier()),
])

This is the exact same pipeline as we saw earlier in the post, only with WordVectorTransformer instead of TfidfVectorizer.

Then call .fit() with the data samples and labels, and otherwise go about your training and testing process as usual.

text_clf.fit(corpus,labels)

Thanks for reading!

Let me know if you have questions or comments!

Write them below or feel free to reach out on Twitter @LVNGD.
