Dec. 21, 2019

Classifying Fashion Articles with Python and Scikit-learn

Overview

Text classification is a common problem that we have been dealing with since the beginning of text itself. Putting things into groups is a way to try to make sense of the chaotic world around us.

We use text classification in a lot of ways.

Categorizing books - history, politics, cookbooks, fiction vs. non-fiction, etc.
Spam filters - classifying an email as spam or not spam.
Priority inboxes - Gmail labeling your emails as important or not.
Sentiment analysis - is the tone of this Tweet positive or negative?

The list goes on.

Today I will be creating a text classifier with Python and Scikit-learn.

The Problem

I collected a bunch of fashion articles from various sources like blogs and magazines around the Internet.

I wanted articles about fashion trends, designers, influencers, what people are wearing etc.

I collected the types of articles I wanted, but in the mix there were also articles about other things like beauty and makeup, which were not what I wanted for this project. These articles were tagged similarly to articles about fashion trends (for example, in the 'style' section you might find makeup articles), so it was hard to filter them out during collection.

The Goal

The goal was to sort the articles into two categories: fashion articles and non-fashion articles.

So I started experimenting with building a classifier to filter out the non-fashion articles, and it worked out pretty well.

I can use the classifier when I collect new articles to automatically filter out the non-fashion related ones.

Setup

I'm just working in a regular Python virtualenv.

mkvirtualenv classenv

Using several packages.

pip install scikit-learn
pip install pandas
pip install nltk

Installing scikit-learn also installs NumPy.

The Data

I scraped a bunch of fashion articles from several websites and blogs.

There are about 2600 fashion articles and about 2600 non-fashion articles.

Text Corpus

This collection of scraped articles makes up my text corpus.

I knew that I was going to be training the model using supervised learning, so I needed a labeled dataset.

For this project I manually went through and labeled the articles as fashion or non.

Training and Testing data

I split my data into two datasets for training and testing.

The split is 80% for training and 20% for testing.

My training data is currently in a Pandas dataframe, train_data, and my testing data is in another Pandas dataframe, test_data.

The example row below shows the url of the article, the date, the content - my text data - and the label of fashion or non that make up the dataset.

For this classification task I'm only concerned with the content and fashion columns in the dataframe.

url                   date        content                    fashion
https://www.vogue...  2018-12-13  Temperley London. From...  fashion

I'm not going to go into how I split up the training and testing datasets for now. That might be a topic for another post.
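For anyone who wants a starting point in the meantime, here is a minimal sketch of one common way to get an 80/20 split, using scikit-learn's train_test_split. This is an assumption for illustration and not necessarily how the split for this project was made. The tiny articles dataframe and the random_state value are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the labeled corpus: 10 rows, 5 per class.
articles = pd.DataFrame({
    'content': ['article text {}'.format(i) for i in range(10)],
    'fashion': ['fashion', 'non'] * 5,
})

# test_size=0.2 holds out 20% of the rows for testing.
# stratify keeps the fashion/non ratio similar in both splits.
train_data, test_data = train_test_split(
    articles,
    test_size=0.2,
    stratify=articles['fashion'],
    random_state=42,
)

print(len(train_data), len(test_data))
```

Fixing random_state makes the split repeatable, which helps when comparing models against the same test set.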

Feature Extraction

A classification algorithm can't process the raw text data, so first the text needs to be converted into a numerical form.

How can we extract numerical features from the data?

Intuitively, it would seem that we can mostly figure out whether or not an article is fashion-related based on the kinds of words you find in the article.

If the article talks about blazers and midi-dresses and stilettos you can be pretty sure it has something to do with fashion.

So I will be extracting features based on the words in the articles.

TF-IDF

TF-IDF stands for term frequency - inverse document frequency.

What is TF-IDF?

It's a statistic that calculates how important a word is to a document within a corpus or group of documents.

A document just refers to a text data sample - in this case each article is a document.

The importance of a word increases proportionally to the number of times it appears in a document, the term frequency, but then is offset by how often it appears in the entire corpus, the inverse document frequency.
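To make that concrete, here is a toy sketch of the textbook form of the calculation, tf * log(N / df), on three made-up documents. It is only an illustration of the idea: scikit-learn's TfidfVectorizer uses a smoothed variant of the formula and normalizes each row by default, so its numbers will differ. The function names are made up for this example.

```python
import math

# Three tiny made-up "documents", already split into words.
docs = [
    ['blazer', 'dress', 'runway'],
    ['dress', 'lipstick', 'mascara'],
    ['blazer', 'blazer', 'dress'],
]

def tf(term, doc):
    # Term frequency: share of the document's words that are this term.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: terms found in fewer documents score higher.
    # Assumes the term appears in at least one document.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# 'dress' appears in every document, so it carries no weight.
print(tf_idf('dress', docs[2], docs))   # 0.0
# 'blazer' is frequent in the third document and missing from one other.
print(round(tf_idf('blazer', docs[2], docs), 3))   # 0.27
```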

No need to calculate all of that ourselves.

I used the TF-IDF vectorizer from scikit-learn, which will generate a matrix of TF-IDF features.

Preparing the text data for the TF-IDF vectorizer

First I will take the 'content' column from my dataframe train_data, which has the article text data.

import pandas as pd
...

training_data = train_data.content.astype('str')

This gives me a pandas Series of text documents, which I can loop over like a list.

I'm also setting aside the corresponding labels from the train_data dataframe.

labels = train_data['fashion']

Normalizing the text

Before applying TF-IDF vectorization to the text data, I will normalize the text.

I'm going to loop through each document and apply the following techniques to normalize the text.

remove punctuation
remove extra whitespace
convert the text to lowercase
remove stop words

Stopwords are common words like 'a' or 'the' that are not important for this project's goal of classifying whether an article is fashion-related or not.

I used NLTK's list of stopwords.

There are many other things you could do to normalize the data, depending on your situation and goals.

I've written the function normalize_documents to normalize the list of documents.

import re
from nltk.corpus import stopwords

# The stop word list needs a one-time nltk.download('stopwords')
def normalize_documents(documents_list):
    punct_to_remove = '‘’”“—!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~'
    punctuation_table = str.maketrans('','',punct_to_remove)
    stop_words = set(stopwords.words('english'))
    processed_documents = []
    for document in documents_list:
        document = re.sub(' +',' ', document) #removes extra whitespace
        words = document.lower().split() #lowercases and splits document into word tokens
        #drops stop words, then strips punctuation from the remaining words
        processed = [word.translate(punctuation_table) for word in words if word not in stop_words]
        #joins the tokens back into one string, which is what TfidfVectorizer expects
        processed_documents.append(' '.join(processed))
    return processed_documents

So now I can just pass in my list of training documents training_data.

processed_documents = normalize_documents(training_data)
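To see what the normalization steps do to a single piece of text, here is a small self-contained sketch. It uses a tiny hand-written stop word set as a stand-in for NLTK's list so that it runs without downloading anything, and it keeps hyphens so that words like midi-dress survive. The names normalize_text and SAMPLE_STOP_WORDS are made up for this illustration.

```python
import string

# Stand-in for NLTK's English stop word list (assumption, much shorter).
SAMPLE_STOP_WORDS = {'a', 'the', 'is', 'and', 'with', 'this'}

# ASCII punctuation to strip, keeping the hyphen.
PUNCT_TABLE = str.maketrans('', '', string.punctuation.replace('-', ''))

def normalize_text(text, stop_words=SAMPLE_STOP_WORDS):
    # Lowercase, then strip punctuation.
    text = text.lower().translate(PUNCT_TABLE)
    # split() with no arguments also collapses runs of whitespace.
    words = [word for word in text.split() if word not in stop_words]
    return ' '.join(words)

print(normalize_text('This  blazer is PERFECT with a midi-dress!'))
# blazer perfect midi-dress
```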

Vectorize the data.

Now the list of processed documents is ready to be used to create the TF-IDF feature matrix.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1,3))
tf_idf_vector = vectorizer.fit_transform(processed_documents)

First initialize the TfidfVectorizer class and then call its fit_transform method, passing in the processed documents.

The output is a sparse matrix of TF-IDF features, which will be passed to the classifier, along with the list of corresponding labels.

tf_idf_vector

<7945x79853 sparse matrix of type '<class 'numpy.float64'>'
    with 1944124 stored elements in Compressed Sparse Row format>

I did some experimenting and found that specifying the ngram_range parameter improved the model. I performed a Grid Search to do this, which I will talk about later.
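For context, ngram_range=(1,3) tells the vectorizer to build features from single words, pairs of consecutive words, and triples of consecutive words. The sketch below shows the idea in plain Python. It is an illustration only, not scikit-learn's implementation, and word_ngrams is a made-up helper name.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    # Collect every run of n consecutive words, for each n in the range.
    low, high = ngram_range
    ngrams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams

print(word_ngrams(['little', 'black', 'dress']))
# ['little', 'black', 'dress', 'little black', 'black dress', 'little black dress']
```

Longer n-grams let the model pick up on phrases, at the cost of a much larger feature matrix.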

Support Vector Machine (SVM)

Support vector machines are supervised learning models that are often used for classification.

I'm using the SGDClassifier from scikit-learn.

It's a linear classifier that uses stochastic gradient descent learning, and fits a linear SVM by default by using the hinge loss function.
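As a rough illustration of what the hinge loss measures, here is the standard formula, max(0, 1 - y * score), in plain Python. It assumes the two classes are encoded as +1 and -1 and that score is the raw output of the linear model. The function name and the example scores are made up.

```python
def hinge_loss(y_true, score):
    # y_true is +1 or -1; score is the raw output of the linear model.
    return max(0.0, 1.0 - y_true * score)

print(hinge_loss(+1, 2.5))   # 0.0  correct and outside the margin
print(hinge_loss(+1, 0.4))   # 0.6  correct but inside the margin
print(hinge_loss(-1, 0.4))   # 1.4  wrong side of the boundary
```

The loss is zero only when a sample is classified correctly with some room to spare, which is what pushes the model toward a wide margin between the classes.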

Set up the SGDClassifier

from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()

I experimented with tuning different parameters by doing a Grid Search but ended up sticking with the default parameters for the SGDClassifier.

A few of the default parameters.

hinge loss function - this implements linear SVM
'l2' regularization term - the standard for linear SVM models
alpha=0.0001
shuffle=True - shuffles the training data after each epoch

Train the classifier.

Pass the TF-IDF feature matrix, along with the labels, to the classifier's fit method.

clf.fit(tf_idf_vector,labels)

The fit method takes two arrays. The first is the training samples, which we converted into the tf_idf_vector feature matrix, and the second is the target values, labels.

The labels are the 'correct answers' for the classifier to learn from.

Using a Pipeline

It's much easier to put all of this together with a pipeline and feed in the data once, rather than manually having to perform the steps of vectorization and fitting the model like we did above.

Scikit-learn has a Pipeline class to achieve this.

from sklearn.pipeline import Pipeline

All you have to do is create a Pipeline instance, and add the TF-IDF vectorizer and SGDClassifier configurations to it.

text_clf = Pipeline([
            ('vect', TfidfVectorizer()),
            ('clf', SGDClassifier()),
            ])

I'm using the same list of processed documents from before, along with the corresponding labels.

And pass these to the pipeline's fit method.

text_clf.fit(processed_documents,labels)

Great! The classifier has been trained and now we need to see how well it works.

Testing the classifier

Now it's time to use the testing data I mentioned earlier that is in the Pandas dataframe test_data to see how the model performs.

Testing follows pretty much the same process as training.

First, the test data needs to be processed and vectorized in the same way as the training data.

As before, first I need to take the 'content' column from test_data, which has the article text data, and also the labels.

testing_data = test_data.content.astype('str')
test_labels = test_data['fashion']

Normalize the documents in testing_data just like I did with the training data earlier.

processed_test_documents = normalize_documents(testing_data)

Using the pipeline text_clf from before, pass the processed documents to its predict method instead of fit.

predicted = text_clf.predict(processed_test_documents)

This returns the predicted label for each test sample, stored in predicted.
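Here is a tiny self-contained sketch of the same fit-then-predict flow, to show the shape of what predict returns. The documents and labels are made up, and random_state is set only so the run is repeatable. With so little data the predictions themselves don't mean much.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Made-up, already-normalized training documents and labels.
toy_documents = [
    'blazer midi dress runway',
    'stilettos designer collection runway',
    'lipstick mascara foundation',
    'moisturizer serum skincare routine',
]
toy_labels = ['fashion', 'fashion', 'non', 'non']

toy_clf = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SGDClassifier(random_state=0)),
])
toy_clf.fit(toy_documents, toy_labels)

# predict takes a list of documents and returns one label per document.
toy_predicted = toy_clf.predict(['designer blazer', 'serum lipstick'])
print(toy_predicted)
```

Because the vectorizer lives inside the pipeline, predict applies the vocabulary learned during fit to the new documents automatically.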

How did the model do?

Let's look at some metrics.

Scikit-learn has a handy module, metrics, which has a variety of reports on prediction error.

from sklearn import metrics

Accuracy Score

How accurate are the predictions of this model?

metrics.accuracy_score(test_labels,predicted)

0.9317073170731708

The accuracy score is 93.2%, which is not bad.

Later we will look at some ways to improve this.

Classification Report

This report shows the quality of the predictions made by the model.

print(metrics.classification_report(test_labels, predicted))

              precision    recall  f1-score   support

     fashion       0.93      0.93      0.93       521
         non       0.93      0.93      0.93       504

   micro avg       0.93      0.93      0.93      1025
   macro avg       0.93      0.93      0.93      1025
weighted avg       0.93      0.93      0.93      1025

There are two classes: 'fashion' and 'non'.

The 'fashion' class refers to articles that are fashion articles, and 'non' refers to articles that are not fashion articles.

The precision, recall and f1-score show how well the model performed at correctly predicting each class.

Statistical Errors

First let's talk about true and false positives, and true and false negatives.

A true positive is when the positive case is correctly predicted.
A false positive is when the model incorrectly predicts the positive case.
A true negative is when the model correctly predicts the negative case.
A false negative is when the model incorrectly predicts the negative case.

Precision

Shows how accurate the model was at making positive predictions for each class.
It is the ratio of true positives to all positives, true and false.
precision = true positives / (true positives + false positives)

Recall

Shows how many of the positive cases the model got correct.
It is calculated by dividing the true positives by the true positives and false negatives added together.
recall = true positives / (true positives + false negatives)

F1-score

The F1-score is another accuracy measure that takes into account both the precision and the recall.
"It is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0." - taken from Wikipedia.

Confusion Matrix

metrics.confusion_matrix(test_labels,predicted)

[[487  34]
 [ 36 468]]
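The numbers in the classification report can be recomputed from this matrix. The sketch below assumes scikit-learn's convention that rows are the true labels and columns are the predicted labels, with the labels in sorted order ('fashion', then 'non'), and it treats 'fashion' as the positive class. The row sums, 521 and 504, match the support column in the report above.

```python
# Confusion matrix from above: rows are true labels, columns are predictions,
# in the order ['fashion', 'non'].
true_pos, false_neg = 487, 34    # articles that really are 'fashion'
false_pos, true_neg = 36, 468    # articles that really are 'non'

total = true_pos + false_neg + false_pos + true_neg
accuracy = (true_pos + true_neg) / total
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.932 0.931 0.935 0.933
```

So the model mislabeled 34 fashion articles as non-fashion and 36 non-fashion articles as fashion.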

Improving the model

Trying out different classifiers.

The SGDClassifier is far from the only option for this type of project.

For text classification you could also try a naive bayes classifier.

from sklearn.naive_bayes import MultinomialNB

You can just pop this right in the pipeline from earlier in place of the SGDClassifier.

text_clf = Pipeline([
            ('vect', TfidfVectorizer(ngram_range=(1,3))),
            ('clf', MultinomialNB()),
            ])

There are other options as well if you explore the docs.

Conduct a Grid Search to find the best combination of parameters.

Earlier I mentioned conducting a Grid Search to tune the various parameters of the SGDClassifier, as well as the TfidfVectorizer.

from sklearn.model_selection import GridSearchCV

The GridSearchCV instance will conduct an exhaustive search of all of the combinations of parameters and return the best one.

This can be a time and memory-consuming process.

Peruse the docs for the SGDClassifier, and the docs for the TfidfVectorizer, to decide which parameters you want to look at.

parameters = {
            'vect__ngram_range': [(1, 1), (1, 2), (1,3)],
            'vect__use_idf': (True, False),
            'clf__alpha': (1e-2, 1e-3),
            'vect__min_df': (4,5,6),
            'vect__max_df': (0.6,0.7,0.8)
            }

The parameters I want to evaluate belong to the TfidfVectorizer, which is vect in the pipeline, and the SGDClassifier, which is clf in the pipeline. Each key is the step name, two underscores, then the parameter name.

Initialize a GridSearchCV instance that takes the Pipeline text_clf and the parameters.

gs_clf = GridSearchCV(text_clf, parameters,n_jobs=-1)

The parameter n_jobs is the number of jobs to run in parallel. When set to -1 it will use all available processors on the machine.

I'm using the same training data, processed_documents and labels, that I used to train the model earlier.

gridsearch_clf = gs_clf.fit(processed_documents,labels)

This can run for a long time if you are comparing a lot of parameters, which is why running the jobs in parallel can help speed things up.
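To get a feel for how quickly the work adds up, this sketch counts the combinations in the parameters grid from above. The number of cross-validation folds is an assumption: the default for GridSearchCV depends on the scikit-learn version, so check the cv parameter in the docs for yours.

```python
from itertools import product

parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'vect__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
    'vect__min_df': (4, 5, 6),
    'vect__max_df': (0.6, 0.7, 0.8),
}

# Every combination of one value per parameter is one candidate model.
combinations = list(product(*parameters.values()))
print(len(combinations))   # 108

# Assumed number of cross-validation folds (version-dependent default).
cv_folds = 5
print(len(combinations) * cv_folds)   # 540
```

Each candidate is fit once per fold, so adding even one more value to a parameter multiplies the total number of fits.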

Best Score

gridsearch_clf.best_score_

0.9556126342277382

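The best score from the grid search is 95.6%. Note that best_score_ is the mean cross-validated score on the training data, so it isn't directly comparable to the 93.2% accuracy measured on the test set earlier.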

Best Parameters

The best parameters are in the gridsearch_clf.best_params_ attribute.

for param_name in sorted(parameters.keys()):
    print("{}: {}".format(param_name, gridsearch_clf.best_params_[param_name]))

These are parameter values that improved the overall accuracy score of the model.

clf__alpha: 0.001
vect__max_df: 0.7
vect__min_df: 4
vect__ngram_range: (1, 1)
vect__use_idf: True

Saving the model

Once you've trained a model and are happy with how it performs, you will want to save it for later use.

Saving

You can use pickle to save the model.

I will save the model text_clf from earlier.

import pickle
with open('model_name.pickle','wb') as model_file:
    pickle.dump(text_clf,model_file)

Loading the saved model

Later when you want to use it you can load it like this.

with open('model_name.pickle','rb') as read:
    model = pickle.load(read)

And now you're ready to use the model!
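Here is a small round-trip sketch showing that a loaded pipeline predicts the same labels as the one that was saved. It uses a tiny made-up pipeline as a stand-in for text_clf and writes to a temporary directory. Keep in mind that pickle files should only be loaded from sources you trust, and ideally with the same scikit-learn version that saved them.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Tiny made-up pipeline standing in for text_clf.
toy_clf = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SGDClassifier(random_state=0)),
])
toy_clf.fit(
    ['blazer dress runway', 'lipstick mascara serum'],
    ['fashion', 'non'],
)

# Save to a temporary directory, then load it back.
model_path = os.path.join(tempfile.mkdtemp(), 'model_name.pickle')
with open(model_path, 'wb') as model_file:
    pickle.dump(toy_clf, model_file)
with open(model_path, 'rb') as model_file:
    loaded_clf = pickle.load(model_file)

new_documents = ['runway blazer', 'mascara']
# The loaded pipeline gives the same predictions as the original.
same = list(loaded_clf.predict(new_documents)) == list(toy_clf.predict(new_documents))
print(same)   # True
```

New documents should go through the same normalization as the training data before being passed to predict.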

Thanks for reading!

That's it for today. If you have any questions or comments or suggestions, please write them below or reach out to me on twitter @LVNGD.


About Me

Hi, I'm Christina!

I'm a Python developer and data enthusiast, and mostly blog about things I've done or learned related to both of those.
I'm also available for consulting projects.
Reach out to me below.