In this post we will look at representing text documents with word vectors, which are vectors of numbers that represent the meaning of a word.
Then we will write a custom Scikit-learn transformer class for the word vector features - similar to TfidfVectorizer or CountVectorizer - which can be plugged into a pipeline.
What are word vectors?
Word vectors, or word embeddings, are vectors of numbers that provide information about the meaning of a word, as well as its context.
You can get the semantic similarity of two words by comparing their word vectors.
Even if you're not familiar with word vectors, you may have heard of a couple of popular algorithms for computing them, such as word2vec or GloVe.
There are pre-trained models that you can download to access word vectors, and if you are using Spacy, GloVe vectors are made available in the larger models.
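For example, once a model with vectors is loaded, Spacy's similarity method scores two pieces of text by the cosine similarity of their vectors (we'll set up the model itself in the next section):
import spacy

nlp = spacy.load("en_core_web_lg")
# Related words score higher than unrelated ones.
nlp("hat").similarity(nlp("cap"))       # relatively high
nlp("hat").similarity(nlp("asteroid"))  # much lower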
Accessing word vectors in Spacy
With Spacy you can easily get vectors of words, as well as sentences.
I'm assuming at least some familiarity with Spacy in this post.
Note that a small Spacy model - ending in sm, such as en_core_web_sm - will not have built-in vectors, so you will need a larger model to use them:
python -m spacy download en_core_web_lg
Vectors are made available in Spacy Token, Doc and Span objects.
import spacy
nlp = spacy.load("en_core_web_lg")
The vector will be a one-dimensional Numpy array of floats.
For example, take the word hat.
First you could check if the word has a vector.
hat = nlp("hat")
hat.has_vector
True
If it has a vector, you can retrieve it from the vector attribute.
hat.vector
array([ 0.25681 , -0.35552 , -0.18733 , -0.16592 , -0.68094 ,
0.60802 , 0.16501 , 0.17907 , 0.17855 , 1.2894 ,
-0.46481 , -0.22667 , 0.035198 , -0.45087 , 0.71845 ,
...
-0.94376 , -0.10265 , 0.4415 , 0.37775 , -0.24274 ,
-0.42695 , 0.18544 , 0.16044 , -0.63395 , -0.074032 ,
-0.038969 , 0.30813 , -0.069243 , 0.13493 , 0.37585 ],
dtype=float32)
The full vector has 300 dimensions.
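You can confirm this from the shape attribute:
hat.vector.shape
(300,)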
A vector for a sentence is similar and has the same shape.
sent = nlp("He wore a red shirt with gray pants.").vector
array([ 8.16512257e-02, -8.81854445e-02, -1.21790558e-01, -7.65599236e-02,
8.34635943e-02, 5.33326678e-02, -1.63263362e-02, -3.44585180e-01,
-1.27936229e-01, 1.74646115e+00, -1.88558996e-01, 6.99177757e-02,
...
1.32453769e-01, -1.40210897e-01, -5.84307760e-02, 3.93804982e-02,
1.89477772e-01, -1.38648778e-01, -1.60174996e-01, 2.84267794e-02,
2.16686666e-01, 1.05772227e-01, 1.48718446e-01, 9.56766680e-02],
dtype=float32)
The sentence vector is the same shape as a word vector because it is the average of the word vectors of the individual tokens in the sentence.
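You can sanity-check this by averaging the token vectors yourself. A quick sketch, relying on the fact that a Doc's vector defaults to the average of its token vectors:
import numpy as np

doc = nlp("He wore a red shirt with gray pants.")
token_average = np.mean([token.vector for token in doc], axis=0)
np.allclose(doc.vector, token_average)  # should be True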
Formatting the input data for Scikit-learn
Ultimately the goal is to turn a list of text samples into a feature matrix, where there is a row for each text sample, and a column for each feature.
A sentence vector comes back as a flat array of shape (300,), but we want each sample to be a single row of shape (1, 300) in the feature matrix. So the first step is to reshape the vector.
sent.shape
(300,)
sent = sent.reshape(1,-1)
sent.shape
(1, 300)
Then the rows are all concatenated together to create the full feature matrix.
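For instance, two reshaped sentence vectors stack into a 2 x 300 matrix (these two sentences are just made up for illustration):
import numpy as np

a = nlp("A red hat.").vector.reshape(1, -1)
b = nlp("A blue coat.").vector.reshape(1, -1)
np.concatenate([a, b]).shape
(2, 300)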
Let's look at an example
Say you have a corpus like the one below, with the goal of classifying the sentences as either talking about some item of clothing or not.
corpus = [
"I went outside yesterday and picked some flowers.",
"She wore a red hat with a dress to the party.",
"I think he was wearing athletic clothes and sneakers of some sort.",
"I took my dog for a walk at the park.",
"I found a hot pink hat on sale over the weekend.",
"The dog has brown fur with white spots."
]
labels = [0,1,1,0,1,0]
The training labels mark two classes:
- 0 if not talking about clothing.
- 1 if talking about clothing.
Turning the data into a feature matrix
In just a few steps, we can create the feature matrix from these data samples.
- Get the vector of each sentence from Spacy.
- Reshape each vector.
- Concatenate the sentence vectors together with numpy.concatenate.
import numpy as np
data_list = [nlp(doc).vector.reshape(1,-1) for doc in corpus]
data = np.concatenate(data_list)
array([[ 0.08162278, 0.15696655, -0.32472467, ..., 0.01618122,
0.01810523, 0.2212121 ],
[ 0.1315948 , -0.0819225 , -0.08803785, ..., -0.01854067,
0.09653309, 0.1096675 ],
[ 0.07139538, 0.09503647, -0.14292692, ..., 0.01818248,
0.10714766, 0.07863422],
[ 0.14246173, 0.18372808, -0.18847175, ..., 0.174818 ,
-0.07943812, 0.20305632],
[ 0.08148216, 0.09574908, -0.13909541, ..., -0.10646044,
-0.03817916, 0.22827934],
[-0.09829144, -0.02671766, -0.07231866, ..., -0.00786566,
0.00078378, 0.12298879]], dtype=float32)
At this point the data is in the correct input format for many Scikit-learn algorithms.
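For example, you could fit a classifier on this matrix directly. A minimal sketch using LogisticRegression, though any Scikit-learn classifier that accepts a dense feature matrix would work:
from sklearn.linear_model import LogisticRegression

# Fit on the 6 x 300 feature matrix and the labels from above.
clf = LogisticRegression(max_iter=1000)
clf.fit(data, labels)
clf.predict(data)  # predictions on the training samples, just to show it runs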
Now we will package up this code into a reusable class that can be used in a pipeline.
Writing a Scikit-learn transformer class
We can write a custom transformer class that can be used just like the TfidfVectorizer or CountVectorizer from Scikit-learn that we saw earlier.
WordVectorTransformer class
import numpy as np
import spacy
from sklearn.base import BaseEstimator, TransformerMixin

class WordVectorTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, model="en_core_web_lg"):
        # Store the model name only - the model itself is loaded in transform.
        self.model = model

    def fit(self, X, y=None):
        # Nothing to learn from the data, so fit is a no-op.
        return self

    def transform(self, X):
        # Load the Spacy model and stack one reshaped vector per document.
        nlp = spacy.load(self.model)
        return np.concatenate([nlp(doc).vector.reshape(1, -1) for doc in X])
- The class inherits from a couple of Scikit-learn base classes, which you can read about in the docs.
- It needs a fit and a transform method.
The transformer stores the name of the Spacy model that we're using, and then I have pretty much copied and pasted the code from earlier into the transform method to create the feature matrix from the raw text samples.
One important thing to keep in mind is that the parameters you pass to __init__ should not be altered or changed before being assigned. In this case, I just passed the name of the Spacy model to be used, en_core_web_lg, and the model itself is actually loaded in the transform method.
At first (before reading the docs more thoroughly...) I tried to load the model in __init__ and assign that to self.model, but that won't work if you are using GridSearchCV with multiprocessing. This has to do with cloning: Scikit-learn clones an estimator by reading the parameters that were passed to __init__ and constructing a fresh copy, so each parameter must be stored unchanged on an attribute of the same name. You can read the coding guidelines for properly building Scikit-learn components in the docs.
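A quick sketch of what cloning relies on - get_params reads the values stored by __init__, and clone uses them to rebuild the estimator:
from sklearn.base import clone

transformer = WordVectorTransformer()
transformer.get_params()   # {'model': 'en_core_web_lg'}
clone(transformer)         # reconstructs the transformer from those parameters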
So now the transformer is ready to use.
transformer = WordVectorTransformer()
transformer.fit_transform(corpus)
Using the WordVectorTransformer in a Scikit-learn Pipeline
The transformer can also be used in a pipeline.
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier

text_clf = Pipeline([
    ('vect', WordVectorTransformer()),
    ('clf', SGDClassifier()),
])
This is the exact same pipeline as we saw earlier in the post, only with WordVectorTransformer instead of TfidfVectorizer.
Then call .fit() with the data samples and labels, and otherwise go about your training and testing process as usual.
text_clf.fit(corpus, labels)
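Because the model name is stored unmodified in __init__, the pipeline also plays nicely with GridSearchCV, multiprocessing included. A minimal sketch, with parameter values that are purely illustrative:
from sklearn.model_selection import GridSearchCV

param_grid = {"clf__alpha": [1e-4, 1e-3]}  # illustrative values for SGDClassifier's alpha
search = GridSearchCV(text_clf, param_grid, cv=2, n_jobs=-1)
search.fit(corpus, labels)
search.best_params_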
Thanks for reading!
Let me know if you have questions or comments!
Write them below or feel free to reach out on Twitter @LVNGD.