Overview
Text classification is a common problem that we have been dealing with since the beginning of text itself. Putting things into groups is a way to try to make sense of the chaotic world around us.
We use text classification in a lot of ways.
- Categorizing books - history, politics, cookbooks, fiction vs. non-fiction, etc.
- Spam filters - classifying an email as spam or not spam.
- When Gmail labels your emails as important or not.
- Sentiment analysis - is the tone of this Tweet positive or negative?
The list goes on.
Today I will be creating a text classifier with Python and Scikit-learn.
The Problem
I collected a bunch of fashion articles from various sources like blogs and magazines around the Internet.
I wanted articles about fashion trends, designers, influencers, what people are wearing etc.
I collected the types of articles I wanted, but in the mix there were also articles about other things like beauty and makeup, which were not what I wanted for this project. These articles were tagged similarly to articles about fashion trends (for example, in the 'style' section you might find makeup articles), so it was hard to filter them out during collection.
The Goal
The goal was to sort the articles into two categories: fashion articles and non-fashion articles.
So I started experimenting with building a classifier to filter out the non-fashion articles, and it worked out pretty well.
I can use the classifier when I collect new articles to automatically filter out the non-fashion related ones.
Setup
I'm just working in a regular Python virtualenv, with a few packages installed.
pip install scikit-learn
pip install pandas
pip install nltk
Installing scikit-learn also installs NumPy.
The Data
I scraped a bunch of fashion articles from several websites and blogs.
There are about 2600 fashion articles and about 2600 non-fashion articles.
Text Corpus
This collection of scraped articles makes up my text corpus.
I knew that I was going to be training the model using supervised learning, so I needed a labeled dataset.
For this project I manually went through and labeled the articles as fashion or non.
Training and Testing data
I split my data into two datasets for training and testing.
The split is 80% for training and 20% for testing.
My training data is currently in a Pandas dataframe, train_data, and my testing data is in another Pandas dataframe, test_data.
The example row below shows the url of the article, the date, the content - my text data - and the label of fashion or non that make up the dataset.
For this classification task I'm only concerned with the content and fashion columns in the dataframe.
url                   date        content                    fashion
https://www.vogue...  2018-12-13  Temperley London. From...  fashion
I'm not going to go into how I split up the training and testing datasets for now. That might be a topic for another post.
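For readers who want a starting point anyway, here is a minimal sketch of one common way to get an 80/20 split, using scikit-learn's train_test_split. It is not necessarily how the datasets in this post were produced. The tiny dataframe, the stratify option and the fixed random_state are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the labeled corpus (the real one has thousands of rows)
articles = pd.DataFrame({
    "content": [f"article text {i}" for i in range(10)],
    "fashion": ["fashion", "non"] * 5,
})

# Hold out 20% for testing; stratify keeps the fashion/non balance in both parts
train_data, test_data = train_test_split(
    articles,
    test_size=0.2,
    stratify=articles["fashion"],
    random_state=42,  # assumption: fixed seed so the split is reproducible
)

print(len(train_data), len(test_data))
```

Both results are dataframes with the same columns as the input, so the rest of the steps work on them unchanged.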
Feature Extraction
A classification algorithm can't process the raw text data, so first the text needs to be converted into a numerical form.
How can we extract numerical features from the data?
Intuitively, it would seem that we can mostly figure out whether or not an article is fashion-related based on the kinds of words you find in the article.
If the article talks about blazers and midi-dresses and stilettos you can be pretty sure it has something to do with fashion.
So I will be extracting features based on the words in the articles.
TF-IDF
TF-IDF stands for term frequency - inverse document frequency.
What is TF-IDF?
It's a statistic that calculates how important a word is to a document within a corpus or group of documents.
A document just refers to a text data sample - in this case each article is a document.
The importance of a word increases proportionally to the number of times it appears in a document, the term frequency, but then is offset by how often it appears in the entire corpus, the inverse document frequency.
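To make that definition concrete, here is a rough sketch of the textbook formula on a made-up three-document corpus. The documents and the tf_idf helper are hypothetical. scikit-learn's TfidfVectorizer uses a smoothed variant of the formula and normalizes each row by default, so its numbers will differ from these.

```python
import math

# Toy corpus: each document is already split into words
documents = [
    ["silk", "dress", "silk", "blazer"],
    ["lipstick", "blazer"],
    ["lipstick", "mascara"],
]

def tf_idf(term, document, corpus):
    # Term frequency: share of the document's words that are this term
    tf = document.count(term) / len(document)
    # Inverse document frequency: rarer across the corpus means a larger weight
    # (assumes the term appears in at least one document)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

silk_score = tf_idf("silk", documents[0], documents)
blazer_score = tf_idf("blazer", documents[0], documents)
print(silk_score, blazer_score)
```

In the first document, 'silk' scores higher than 'blazer': it appears twice there and in no other document, while 'blazer' appears once and also shows up elsewhere in the corpus.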
No need to calculate all of that ourselves.
I used the TF-IDF vectorizer from scikit-learn, which will generate a matrix of TF-IDF features.
Preparing the text data for the TF-IDF vectorizer
First I will take the 'content' column from my dataframe train_data, which has the article text data.
import pandas as pd
...
training_data = train_data.content.astype('str')
This gives me a Pandas Series of text documents, which I can loop over like a list.
I'm also setting aside the corresponding labels from the train_data dataframe.
labels = train_data['fashion']
Normalizing the text
Before applying TF-IDF vectorization to the text data, I will normalize the text.
I'm going to loop through each document and apply the following techniques to normalize the text.
- remove punctuation
- remove extra whitespace
- convert the text to lowercase
- remove stop words
Stop words are common words like 'a' or 'the' that are not important for this project's goal of classifying whether an article is fashion-related or not.
I used NLTK's list of stop words, which has to be downloaded once with nltk.download('stopwords').
There are many other things you could do to normalize the data, depending on your situation and goals.
I've written the function normalize_documents to normalize the list of documents.
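import re
from nltk.corpus import stopwords

def normalize_documents(documents_list):
    # build the punctuation table and stop word set once, outside the loop
    punct_to_remove = '‘’”“—!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~'
    punctuation_table = str.maketrans('', '', punct_to_remove)
    stop_words = set(stopwords.words('english'))
    processed_documents = []
    for document in documents_list:
        document = re.sub(' +', ' ', document)  # removes extra whitespace
        processed = []
        for word in document.lower().split():  # lowercases, then splits into word tokens
            stripped = word.translate(punctuation_table)
            # skip stop words (checked with and without punctuation) and empty tokens
            if stripped and word not in stop_words and stripped not in stop_words:
                processed.append(stripped)
        # TfidfVectorizer expects one string per document, so re-join the tokens
        processed_documents.append(' '.join(processed))
    return processed_documents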
So now I can just pass in my list of training documents, training_data.
processed_documents = normalize_documents(training_data)
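To see what these steps do to a single sentence, here is a self-contained sketch. The sentence is made up, and the tiny hard-coded stop word set is only a stand-in for NLTK's English list, so the example runs without downloading anything.

```python
# Tiny hard-coded stop word set, a stand-in for NLTK's English list
stop_words = {"the", "a", "is", "with", "and"}
punctuation_table = str.maketrans("", "", "!\"#$%&'()*+,./:;<=>?@[\\]^_`{|}~")

sentence = "The  blazer is PAIRED with a midi-dress, and stilettos!"

tokens = []
for word in sentence.lower().split():  # split() also collapses repeated spaces
    word = word.translate(punctuation_table)  # strip punctuation characters
    if word and word not in stop_words:
        tokens.append(word)

normalized = " ".join(tokens)
print(normalized)  # blazer paired midi-dress stilettos
```

The hyphen is deliberately left out of the punctuation table, so terms like 'midi-dress' survive as a single token.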
Vectorize the data.
Now the list of processed documents is ready to be used to create the TF-IDF feature matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1,3))
tf_idf_vector = vectorizer.fit_transform(processed_documents)
First initialize the TfidfVectorizer class and then call its fit_transform method, passing in the processed documents.
The output is a sparse matrix of TF-IDF features, which will be passed to the classifier, along with the list of corresponding labels.
tf_idf_vector
<7945x79853 sparse matrix of type '<class 'numpy.float64'>'
with 1944124 stored elements in Compressed Sparse Row format>
I did some experimenting and found that specifying the ngram_range parameter improved the model. I performed a Grid Search to do this, which I will talk about later.
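As a small illustration of what ngram_range does, the sketch below vectorizes two made-up documents with ngram_range=(1, 2), chosen over (1, 3) only to keep the output short. The documents are hypothetical; the vocabulary_ attribute maps each n-gram to its column in the matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy documents, already normalized
docs = ["red silk dress", "silk blazer"]

# ngram_range=(1, 2) keeps single words and two-word phrases
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(docs)

print(matrix.shape)                    # (documents, distinct n-grams)
print(sorted(vectorizer.vocabulary_))  # the n-grams that became columns
```

Each extra n-gram length adds columns, which is why the feature matrix for a real corpus gets wide quickly.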
Support Vector Machine (SVM)
Support vector machines are supervised learning models that are often used for classification.
I'm using the SGDClassifier from scikit-learn.
It's a linear classifier that uses stochastic gradient descent learning, and fits a linear SVM by default by using the hinge loss function.
Set up the SGDClassifier
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier()
I experimented with tuning different parameters by doing a Grid Search but ended up sticking with the default parameters for the SGDClassifier.
A few of the default parameters.
- hinge loss function - this implements linear SVM
- 'l2' regularization term - the standard for linear SVM models
- alpha 0.0001
- shuffle=True - shuffles the training data after each epoch (one pass over the data)
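Defaults can change between releases, so it may be worth checking them against whichever scikit-learn version is installed. A short sketch using get_params(), which scikit-learn estimators provide:

```python
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
params = clf.get_params()

# Print the defaults discussed above for the installed version
for name in ("loss", "penalty", "alpha", "shuffle"):
    print(name, "=", params[name])
```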
Train the classifier.
Pass the TF-IDF feature matrix, along with the labels, to the classifier's fit method.
clf.fit(tf_idf_vector,labels)
The fit method takes two arrays. The first is the training data samples that we converted into the tf_idf_vector feature matrix, and the second is the labels or target values, labels.
The labels are the 'correct answers' for the classifier to learn from.
Using a Pipeline
It's much easier to put all of this together with a pipeline and feed in the data once, rather than manually having to perform the steps of vectorization and fitting the model like we did above.
Scikit-learn has a Pipeline class to achieve this.
from sklearn.pipeline import Pipeline
All you have to do is create a Pipeline instance, and add the TF-IDF vectorizer and SGDClassifier configurations to it.
text_clf = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SGDClassifier()),
])
I'm using the same list of processed documents from before, along with the corresponding labels.
And pass these to the pipeline's fit method.
text_clf.fit(processed_documents,labels)
Great! The classifier has been trained and now we need to see how well it works.
Testing the classifier
Now it's time to use the testing data I mentioned earlier, which is in the Pandas dataframe test_data, to see how the model performs.
Testing follows pretty much the same process as training did.
First, the test data needs to be processed and vectorized in the same way as the training data.
As before, first I need to take the 'content' column from test_data, which has the article text data, and also the labels.
testing_data = test_data.content.astype('str')
test_labels = test_data['fashion']
Normalize the documents in testing_data just like I did with the training data earlier.
processed_test_documents = normalize_documents(testing_data)
Using the pipeline text_clf from before, pass the processed documents to its predict method instead of fit.
predicted = text_clf.predict(processed_test_documents)
This returns the predicted labels for each test data sample, in predicted.
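The whole fit-then-predict flow can be tried end to end on a toy corpus. In the sketch below the documents and labels are made up, and random_state is fixed only so repeated runs behave the same; with this little data the predictions themselves mean very little.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical, already-normalized toy documents and labels
train_docs = [
    "silk blazer runway trend", "designer midi-dress stilettos",
    "denim jacket street style", "lipstick mascara tutorial",
    "skincare serum routine", "foundation concealer review",
]
train_labels = ["fashion", "fashion", "fashion", "non", "non", "non"]

toy_clf = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", SGDClassifier(random_state=0)),  # assumption: seed fixed for repeatability
])
toy_clf.fit(train_docs, train_labels)

# predict() vectorizes with the vocabulary learned in fit(), then classifies
predicted = toy_clf.predict(["runway blazer trend", "mascara review"])
print(predicted)
```

The point to notice is that predict never refits the vectorizer: new documents are mapped onto the columns learned from the training data, which is what keeps training and testing consistent.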
How did the model do?
Let's look at some metrics.
Scikit-learn has a handy module, metrics, which has a variety of reports on prediction error.
from sklearn import metrics
Accuracy Score
How accurate are the predictions of this model?
metrics.accuracy_score(test_labels, predicted)
0.9317073170731708
The accuracy score is 93.2%, which is not bad.
Later we will look at some ways to improve this.
Classification Report
This report shows the quality of the predictions made by the model.
print(metrics.classification_report(test_labels, predicted))
              precision    recall  f1-score   support
     fashion       0.93      0.93      0.93       521
         non       0.93      0.93      0.93       504
   micro avg       0.93      0.93      0.93      1025
   macro avg       0.93      0.93      0.93      1025
weighted avg       0.93      0.93      0.93      1025
There are two classes: 'fashion' and 'non'.
The 'fashion' class refers to articles that are fashion articles, and 'non' refers to articles that are not fashion articles.
The precision, recall and f1-score show how well the model performed at correctly predicting each class.
Statistical Errors
First let's talk about true and false positives, and true and false negatives.
- A true positive is when the positive case is correctly predicted.
- A false positive is when the model incorrectly predicts the positive case.
- A true negative is when the model correctly predicts the negative case.
- A false negative is when the model incorrectly predicts the negative case.
Precision
- Shows how accurate the model was at making positive predictions for each class.
- It is the ratio of true positives to all positives, true and false.
- precision = true positives / (true positives + false positives)
Recall
- Shows how many of the positive cases the model got correct.
- It is calculated by dividing the true positives by the true positives and false negatives added together.
- recall = true positives / (true positives + false negatives)
F1-score
- The F1-score is another accuracy score that takes into account the precision and recall.
- "It is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0." - taken from wikipedia.
Confusion Matrix
metrics.confusion_matrix(test_labels, predicted)
[[487 34]
[ 36 468]]
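In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes, in sorted label order, so the first row and column here are 'fashion'. The row sums, 521 and 504, match the support column in the classification report. As a worked check, the sketch below recomputes the metrics by hand with 'fashion' as the positive class, using the formulas from the previous sections.

```python
# Confusion matrix from above: rows are true labels, columns are predictions,
# in the order ['fashion', 'non']
matrix = [[487, 34],
          [36, 468]]

# Treat 'fashion' as the positive class
tp, fn = matrix[0]  # fashion articles predicted as fashion / as non
fp, tn = matrix[1]  # non-fashion articles predicted as fashion / as non

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```

All three per-class metrics round to 0.93, and the accuracy matches the 0.9317 reported earlier.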
Improving the model
Trying out different classifiers.
The SGDClassifier is far from the only option for this type of project.
For text classification you could also try a Naive Bayes classifier.
from sklearn.naive_bayes import MultinomialNB
You can just pop this right in the pipeline from earlier in place of the SGDClassifier.
text_clf = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1,3))),
    ('clf', MultinomialNB()),
])
There are other options as well if you explore the docs.
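One way to try several classifiers is to loop over them and build the same pipeline around each. This is only a sketch: the toy documents, the candidates dictionary and the fixed random_state are assumptions for illustration, and on real data you would compare the metrics from the earlier sections rather than a single prediction.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the processed documents and labels
toy_docs = ["silk blazer trend", "designer dress runway", "lipstick mascara", "skincare serum"]
toy_labels = ["fashion", "fashion", "non", "non"]

candidates = {
    "sgd": SGDClassifier(random_state=0),
    "naive_bayes": MultinomialNB(),
}

fitted = {}
for name, classifier in candidates.items():
    # Same vectorizer settings for every candidate so only the classifier differs
    pipeline = Pipeline([("vect", TfidfVectorizer()), ("clf", classifier)])
    fitted[name] = pipeline.fit(toy_docs, toy_labels)
    print(name, fitted[name].predict(["runway trend"]))
```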
Conduct a Grid Search to find the best combination of parameters.
Earlier I mentioned conducting a Grid Search to tune the various parameters of the SGDClassifier, as well as the TfidfVectorizer.
from sklearn.model_selection import GridSearchCV
The GridSearchCV instance will conduct an exhaustive search of all of the combinations of parameters and return the best one.
This can be a time- and memory-consuming process. The grid below has 3 x 2 x 2 x 3 x 3 = 108 combinations, and each one is fitted once per cross-validation fold.
Peruse the docs for the SGDClassifier and the docs for the TfidfVectorizer to decide which parameters you want to look at.
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'vect__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
    'vect__min_df': (4, 5, 6),
    'vect__max_df': (0.6, 0.7, 0.8)
}
The parameters I want to evaluate are for the TfidfVectorizer, which is vect in the pipeline, and the SGDClassifier, which is clf in the pipeline. Each key is the step name, two underscores, and then the parameter name.
Initialize a GridSearchCV instance that takes the Pipeline text_clf and the parameters.
gs_clf = GridSearchCV(text_clf, parameters,n_jobs=-1)
The parameter n_jobs is the number of jobs to run in parallel. When set to -1 it will use all available processors on the machine.
Fit it using the same training data, processed_documents and labels, that were used to train the model earlier.
gridsearch_clf = gs_clf.fit(processed_documents,labels)
This can run for a long time if you are comparing a lot of parameters, which is why running the jobs in parallel can help speed things up.
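To get a feel for the mechanics before committing to a long run, a grid search can be tried on a toy problem first. Everything in this sketch is an assumption made for illustration: the documents, the two-parameter grid, cv=2 (chosen only because the data is tiny) and the fixed random_state.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: four documents per class
toy_docs = [
    "silk blazer trend", "designer dress runway", "denim jacket style", "stilettos midi dress",
    "lipstick mascara", "skincare serum", "foundation concealer", "eyeshadow palette",
]
toy_labels = ["fashion"] * 4 + ["non"] * 4

pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Step name + double underscore + parameter name routes each value to the right step
small_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-3, 1e-4],
}

# cv=2 only because the toy data is tiny
search = GridSearchCV(pipeline, small_grid, cv=2)
search.fit(toy_docs, toy_labels)

print(len(search.cv_results_["params"]))  # number of combinations tried
print(search.best_params_)
```

The number of combinations is the product of the list lengths in the grid, so every extra parameter multiplies the running time.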
Best Score
gridsearch_clf.best_score_
0.9556126342277382
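The best score from the grid search is 95.6%, up from 93.2%. Keep in mind that best_score_ is the mean cross-validated score on the training data, so the two numbers aren't measured on the same data.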
Best Parameters
The best parameters are in the gridsearch_clf.best_params_ output.
for param_name in sorted(parameters.keys()):
    print("{}: {}".format(param_name, gridsearch_clf.best_params_[param_name]))
These are parameter values that improved the overall accuracy score of the model.
clf__alpha: 0.001
vect__max_df: 0.7
vect__min_df: 4
vect__ngram_range: (1, 1)
vect__use_idf: True
Saving the model
Once you've trained a model and are happy with how it performs, you will want to save it for later use.
Saving
You can use pickle to save the model.
I will save the model text_clf from earlier.
import pickle
with open('model_name.pickle','wb') as model_file:
    # dump to the open file handle, model_file
    pickle.dump(text_clf, model_file)
Loading the saved model
Later when you want to use it you can load it like this.
with open('model_name.pickle','rb') as read:
    model = pickle.load(read)
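A quick way to confirm that a saved model survives the round trip is to compare predictions before and after loading. The sketch below uses a hypothetical toy pipeline in place of text_clf and a temporary directory so it leaves no file behind. Two caveats: only unpickle files you trust, and a pickled model is best loaded with the same scikit-learn version that saved it.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy pipeline standing in for text_clf
toy_clf = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", SGDClassifier(random_state=0)),
])
toy_clf.fit(["silk blazer trend", "lipstick mascara"], ["fashion", "non"])

sample = ["blazer trend"]
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model_name.pickle"
    with open(model_path, "wb") as model_file:
        pickle.dump(toy_clf, model_file)
    with open(model_path, "rb") as model_file:
        restored = pickle.load(model_file)

# The restored pipeline should give the same predictions as the original
same = list(restored.predict(sample)) == list(toy_clf.predict(sample))
print(same)
```

Because the whole pipeline is pickled, the fitted vectorizer travels with the classifier, so new text only needs the same normalization step before calling predict.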
And now you're ready to use the model!
Thanks for reading!
That's it for today. If you have any questions, comments or suggestions, please write them below or reach out to me on Twitter @LVNGD.