5. Vectorization#

This chapter introduces vectorization, a technique for encoding qualitative data (like words) into numeric values. We will use a data structure, the document-term matrix, to work with vectorized texts, discuss weighting strategies for managing high-frequency tokens, and train a classification model to distinguish style.

5.1. Preliminaries#

We will need the following libraries:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA
import seaborn as sns
import matplotlib.pyplot as plt

Corpus documents are stored in a DataFrame alongside other metadata.

corpus = pd.read_parquet("data/datasets/james_chapters.parquet")
print(corpus.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 563 entries, 0 to 562
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   novel      563 non-null    object
 1   year       563 non-null    int64 
 2   directory  563 non-null    object
 3   file       563 non-null    object
 4   chapter    563 non-null    int64 
 5   style      563 non-null    int64 
 6   hoover     563 non-null    int64 
 7   tokens     563 non-null    object
 8   masked     563 non-null    object
dtypes: int64(4), object(5)
memory usage: 39.7+ KB
None

Novels are divided into their component chapters. Use .groupby() to count how many chapters there are for each novel.

grouped = corpus.groupby("novel")
chapters_per_novel = grouped["chapter"].count()
chapters_per_novel.to_frame(name = "chapters")
chapters
novel
Ambassadors 36
Awkward Age 32
Bostonians 42
Confidence 30
Golden Bowl 16
Ivory Tower 13
Outcry 20
Portrait Of A Lady 55
Princess Casamassima 49
Reverberator 15
Roderick Hudson 13
Sacred Found 14
Spoils Poynton 22
The American 30
The Europeans 12
Tragic Muse 51
Washington Square 35
Watch And Ward 11
What Maisie Knew 31
Wings Of The Dove 36

The style and hoover columns contain labels. The first demarcates early James from late James, using the publication of What Maisie Knew in 1897 as the dividing point.

style_counts = grouped[["year", "style"]].value_counts()
style_counts = style_counts.to_frame(name = "chapters")
style_counts.sort_values("style")
chapters
novel year style
Reverberator 1888 0 15
Watch And Ward 1871 0 11
Bostonians 1886 0 42
Confidence 1879 0 30
Washington Square 1880 0 35
Tragic Muse 1890 0 51
The Europeans 1878 0 12
Portrait Of A Lady 1881 0 55
Princess Casamassima 1886 0 49
The American 1877 0 30
Roderick Hudson 1875 0 13
Spoils Poynton 1897 1 22
Ambassadors 1903 1 36
What Maisie Knew 1897 1 31
Outcry 1911 1 20
Ivory Tower 1917 1 13
Golden Bowl 1904 1 16
Awkward Age 1899 1 32
Sacred Found 1901 1 14
Wings Of The Dove 1902 1 36

The second uses Hoover’s grouping of James’s novels into four distinct phases.

hoover_counts = grouped[["year", "hoover"]].value_counts()
hoover_counts = hoover_counts.to_frame(name = "chapters")
hoover_counts.sort_values("hoover")
chapters
novel year hoover
Confidence 1879 0 30
Watch And Ward 1871 0 11
Washington Square 1880 0 35
Portrait Of A Lady 1881 0 55
Roderick Hudson 1875 0 13
The Europeans 1878 0 12
The American 1877 0 30
Reverberator 1888 1 15
Bostonians 1886 1 42
Tragic Muse 1890 1 51
Princess Casamassima 1886 1 49
Awkward Age 1899 2 32
What Maisie Knew 1897 2 31
Spoils Poynton 1897 2 22
Ambassadors 1903 3 36
Outcry 1911 3 20
Ivory Tower 1917 3 13
Golden Bowl 1904 3 16
Sacred Found 1901 3 14
Wings Of The Dove 1902 3 36

Tokens for each chapter are stored as strings in tokens and masked. Chapters have been tokenized with nltk.word_tokenize(). Why masked? That column has had its proper noun tokens masked out with PN. You will see why later on.

See masking code
import nltk

def mask_proper_nouns(string, mask = "PN"):
    """Mask proper nouns in a string.

    Parameters
    ----------
    string : str
        String to mask
    mask : str
        Masking label

    Returns
    -------
    masked : str
        The masked string
    """
    # First, split the string into tokens
    tokens = string.split()

    # Then assign part-of-speech tags to those tokens. The output of this
    # tagger is a list of tuples, where the first element is the token and the
    # second is the tag
    tagged = nltk.pos_tag(tokens)

    # Create a new list to hold the output
    masked = []
    for (token, tag) in tagged:
        # If the tag is "NNP", replace it with our mask
        token = mask if tag == "NNP" else token
        # Add the token to the output list
        masked.append(token)

    # Join the list and return
    masked = " ".join(masked)

    return masked
    

5.2. The Document-Term Matrix#

So far we have worked with lists of tokens. That works for some tasks, but to compare documents with one another, it would be better to represent our corpus as a two-dimensional array, or matrix. In this matrix, each row is a document and each column is a token; cells record the number of times that token appears in a document. The resultant matrix is known as the document-term matrix, or DTM.

It isn't difficult to convert lists of tokens into a DTM by hand, but scikit-learn can also do it for us. Unless you have a reason to build the matrix manually, rely on the library.

See function to create a document-term matrix by hand
def make_dtm(docs):
    """Make a document-term matrix.

    Parameters
    ----------
    docs : list[str]
        A list of strings

    Returns
    -------
    dtm, vocabulary : tuple[list, set]
        The document-term matrix and the corpus vocabulary
    """
    # Split the documents into tokens
    docs = [doc.split() for doc in docs]

    # Get the unique set of tokens for all documents in the corpus
    vocab = set()
    for doc in docs:
        # A set union (`|=`) adds any new tokens from the current document to
        # the running set of all tokens
        vocab |= set(doc)

    # Create a list of m dictionaries, where m is the number of corpus
    # documents. Each dictionary will have every token in the vocabulary (key),
    # which is initially assigned to 0 (value)
    counts = [dict.fromkeys(vocab, 0) for doc in docs]

    # Roll through each document
    for idx, doc in enumerate(docs):
        # For each token in a document...
        for tok in doc:
            # Access the document counts, then access the token stored in the
            # dictionary. Increment the corresponding count by 1 
            counts[idx][tok] += 1

    # Extract the values from each dictionary
    dtm = [[count for count in doc.values()] for doc in counts]

    # Return the DTM and the vocabulary
    return dtm, vocab
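
As a quick check, the hand-rolled function can be called on a toy corpus of two short strings. This is only a sketch (the variable names are illustrative and the column order depends on the token set, so no output is shown):

toy_docs = ["the cat sat", "the dog sat on the mat"]
toy_dtm, toy_vocab = make_dtm(toy_docs)
# Inspect the vocabulary and the raw count rows
print(sorted(toy_vocab))
print(toy_dtm)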

Many classes in scikit-learn share the same use pattern: first, you initialize the class (optionally setting parameters) and assign the resulting object to a variable; then you fit it on your data. The CountVectorizer, which makes a DTM, works just this way. It will even tokenize strings while it fits, though watch out: its default tokenization pattern is simple, so it's often best to do this step yourself.

Below, we initialize the CountVectorizer and set the following parameters:

  • token_pattern: a regex pattern for which tokens to keep (here: any token made up of three or more alphabetic characters)

  • stop_words: remove function words in English

  • strip_accents: normalize accents to ASCII

cv_parameters = {
    "token_pattern": r"\b[a-zA-Z]{3,}\b",
    "stop_words": "english",
    "strip_accents": "ascii"
}

count_vectorizer = CountVectorizer(**cv_parameters)
count_vectorizer.fit(corpus["tokens"])
CountVectorizer(stop_words='english', strip_accents='ascii',
                token_pattern='\\b[a-zA-Z]{3,}\\b')

With the vectorizer fitted, transform the data you fitted it on.

dtm = count_vectorizer.transform(corpus["tokens"])

DTMs are sparse. That is, they are mostly made up of zeros.

dtm
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 504302 stored elements and shape (563, 27822)>

This sparsity is significant. Comparing documents with each other requires taking into account all unique tokens in the corpus, not just those in a particular document. This means we must count the number of times a token appears in a document even if that count is zero. What those zero counts also mean is that the documents in a DTM are not strictly those documents that are in the corpus. They are potential texts: possible distributions of tokens across the corpus.
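
To put a number on this, we can compute the proportion of zero cells directly from the sparse matrix before converting it. A quick sketch; with the counts shown above, roughly 97 percent of the cells are zero:

# Proportion of DTM cells that hold a zero count
n_cells = dtm.shape[0] * dtm.shape[1]
sparsity = 1 - dtm.nnz / n_cells
print(f"Proportion of zero cells: {sparsity:.2%}")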

The output of CountVectorizer is optimized for keeping the memory footprint of a DTM low. But for a small corpus like this, use .toarray() to convert the matrix into a NumPy array.

dtm = dtm.toarray()

Now, wrap this as a DataFrame and set the column names with the output of the vectorizer’s .get_feature_names_out() method.

dtm = pd.DataFrame(
    dtm, columns = count_vectorizer.get_feature_names_out()
)
dtm.head()
aback abandon abandoned abandoning abandonment abandons abase abased abasement abash ... zero zest zigzags zola zone zones zoo zoological zouaves zurich
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 27822 columns

This DTM is indexed in the same order as the corpus documents. But for readability’s sake, set the index to the novel and chapter columns of our corpus DataFrame. Be sure to change the .names attribute of the index, or your index will conflict with possible column values in the DTM.

dtm.index = pd.MultiIndex.from_arrays(
    [corpus["novel"], corpus["chapter"]],
    names = ["novel_name", "chapter_num"]
)
dtm.head()
aback abandon abandoned abandoning abandonment abandons abase abased abasement abash ... zero zest zigzags zola zone zones zoo zoological zouaves zurich
novel_name chapter_num
Watch And Ward 1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 27822 columns

5.2.1. Document-term matrix analysis#

Numeric operations across the DTM now work the same as they would for any other DataFrame. Here, total tokens per novel:

chapter_token_count = np.sum(dtm, axis = 1).to_frame(name = "token_count")
chapter_token_count.groupby("novel_name").sum()
token_count
novel_name
Ambassadors 56551
Awkward Age 48957
Bostonians 63102
Confidence 29888
Golden Bowl 74098
Ivory Tower 23593
Outcry 21398
Portrait Of A Lady 86844
Princess Casamassima 79966
Reverberator 21043
Roderick Hudson 54094
Sacred Found 25920
Spoils Poynton 26696
The American 53494
The Europeans 24071
Tragic Muse 81287
Washington Square 24444
Watch And Ward 25583
What Maisie Knew 36369
Wings Of The Dove 66442

On average, which three chapters are the longest across all of James’s novels?

chapter_avg_tokens = chapter_token_count.groupby("chapter_num").mean()
chapter_avg_tokens.sort_values("token_count", ascending = False).head(3)
token_count
chapter_num
5 2237.000000
46 2223.666667
14 2081.687500

What is the average chapter length?

chapter_avg_tokens.mean()
token_count    1610.572704
dtype: float64

Top ten chapters with the highest type counts:

num_types = (dtm > 0).sum(axis = 1).to_frame(name = "num_types")
num_types.nlargest(10, "num_types")
num_types
novel_name chapter_num
Golden Bowl 14 3991
5 3990
9 3565
11 3129
Roderick Hudson 3 2468
1 2242
11 2242
Golden Bowl 7 2240
Roderick Hudson 10 2174
Golden Bowl 10 2085

Most frequent word in each novel:

token_freq = dtm.groupby("novel_name").sum()
token_freq.idxmax(axis = 1).to_frame(name = "most_frequent_token")
most_frequent_token
novel_name
Ambassadors strether
Awkward Age mrs
Bostonians verena
Confidence bernard
Golden Bowl maggie
Ivory Tower gray
Outcry lord
Portrait Of A Lady isabel
Princess Casamassima hyacinth
Reverberator francie
Roderick Hudson rowland
Sacred Found mrs
Spoils Poynton fleda
The American newman
The Europeans said
Tragic Muse nick
Washington Square catherine
Watch And Ward roger
What Maisie Knew mrs
Wings Of The Dove kate

Nearly all of these are proper names (or titles attached to them), which often happens with fiction. In one way, this is valuable information: if you were modeling a corpus with different kinds of documents, you might use the frequency of names to distinguish fiction. But we only have James's novels, and the high frequency of names can make it difficult to identify similarities across documents.

5.2.2. Using masked tokens#

This is where the text stored in masked comes in. That text masks over proper nouns and treats them all like the same token. More, due to the way we’ve currently configured our DTM generation, those masks will be dropped because they’re only two characters long. That’s perfectly fine for our purposes. But it again underscores the fact that documents in the DTM are not documents as they are in the corpus. Indeed, through these preprocessing decisions we have already constructed a model of our corpus.

Time to rebuild the DTM with text in masked.

count_vectorizer = CountVectorizer(**cv_parameters)
count_vectorizer.fit(corpus["masked"])
dtm = count_vectorizer.transform(corpus["masked"])
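
A quick check confirms the point above: because the mask is only two characters long (and CountVectorizer lowercases by default), it never enters the vocabulary. The following should print False:

# The two-character mask is filtered out by our token_pattern
print("pn" in count_vectorizer.vocabulary_)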

Convert to a DataFrame.

dtm = pd.DataFrame(
    dtm.toarray(), columns = count_vectorizer.get_feature_names_out()
)
dtm.index = pd.MultiIndex.from_arrays(
    [corpus["novel"], corpus["chapter"]],
    names = ["novel_name", "chapter_num"]
)

We won't step through all of the above metrics again, but we will look at the most frequent tokens once more to confirm that masking made a difference.

token_freq = dtm.groupby("novel_name").sum()
token_freq.idxmax(axis = 1).to_frame(name = "most_frequent_token")
most_frequent_token
novel_name
Ambassadors little
Awkward Age know
Bostonians said
Confidence said
Golden Bowl little
Ivory Tower said
Outcry said
Portrait Of A Lady said
Princess Casamassima said
Reverberator said
Roderick Hudson said
Sacred Found little
Spoils Poynton said
The American said
The Europeans said
Tragic Muse said
Washington Square said
Watch And Ward said
What Maisie Knew little
Wings Of The Dove said

Names are gone but the output looks even worse. There's no differentiation among the most frequent tokens in each novel, even after removing common function words with the stop word list. Given the nature of Zipfian distributions, this shouldn't be surprising.
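
We can make this concrete by checking how widespread a token like "said" is, that is, the share of chapters that contain it at least once. A quick sketch (the exact figure is not reported here):

# Document frequency of "said": the share of chapters containing it at least once
said_share = (dtm["said"] > 0).mean()
print(f'Share of chapters containing "said": {said_share:.2%}')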

One way to control for this would be to remove tokens from consideration when building the DTM using some cutoff metric (see the sketch below). That would work reasonably well, but it may remove valuable information from the documents. Consider, for example, that James's penchant for extended psychological description could be usefully contrasted with chapters that contain more dialogue; removing "said" would make that difficult. More, setting the cutoff point could take a fair bit of back and forth. A better strategy would be to re-weight token counts by some method so that frequent tokens have less impact in aggregate analyses like the above.
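
For reference, a frequency cutoff of this kind is easy to express with CountVectorizer's min_df and max_df parameters. This sketch is not used in what follows, and the thresholds are arbitrary:

# Drop tokens appearing in more than 90% of chapters or in fewer than 5 chapters
cutoff_vectorizer = CountVectorizer(**cv_parameters, max_df = 0.9, min_df = 5)
cutoff_dtm = cutoff_vectorizer.fit_transform(corpus["masked"])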

5.3. Weighting with TF–IDF#

This is where TF–IDF, or term frequency–inverse document frequency, comes in. It re-weights tokens according to their specificity in a document. Tokens that frequently appear in many documents will have low TF–IDF scores, while those that are less frequent, or appear frequently in only a few documents, will have high TF–IDF scores.

Scores are the product of two statistics: term frequency and inverse document frequency. There are several variations for calculating both but generally they work like so:

Term frequency is the relative frequency of a token \(t\) in a document \(d\).

\[ TF(t, d) = \frac{f_{t,d}}{\sum_{i=1}^n f_{i,d}} \]

Where:

  • \(f_{t,d}\) is the frequency of token \(t\) in document \(d\)

  • \(\sum_{i=1}^nf_{i,d}\) is the sum of all token frequencies in document \(d\)

In code, that looks like the following:

TF = dtm.div(dtm.sum(axis = 1), axis = 0)

Inverse document frequency measures how common or rare a token is.

\[ IDF(t, D) = \log\left(\frac{N}{|\{d \in D : t \in d\}|}\right) \]

Where:

  • \(N\) is the total number of documents in a corpus \(D\)

  • For each document \(d\) in \(D\), we count which ones contain token \(t\)

The code for this calculation is below. Note that we add one to the document frequencies (and to the document count) to avoid zero-division errors, as if one extra document contained every token; adding one outside the logarithm ensures that tokens appearing in every document do not zero out completely.

N = len(dtm)
DF = (dtm > 0).sum(axis = 0)
IDF = np.log((1 + N) / (1 + DF)) + 1

The product of these two statistics is TF–IDF.

\[ TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D) \]

Or, in code:

TFIDF = TF.multiply(IDF, axis = 1)

Don't want to go through all those steps? Use TfidfVectorizer. But note that scikit-learn sets some defaults for smoothing and normalizing TF–IDF scores that can make its results slightly different from your own calculations.
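
The defaults in question can be spelled out explicitly when initializing the vectorizer. The values below are scikit-learn's defaults, shown only for reference; the variable name is arbitrary and this object is not used later:

tfidf_defaults = TfidfVectorizer(
    **cv_parameters,
    norm = "l2",           # scale each document vector to unit length
    smooth_idf = True,     # add one to document frequencies
    sublinear_tf = False   # use raw term frequencies rather than 1 + log(tf)
)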

Fitting TfidfVectorizer works with the same use pattern.

tfidf_vectorizer = TfidfVectorizer(**cv_parameters)
tfidf_vectorizer.fit(corpus["masked"])
tfidf = tfidf_vectorizer.transform(corpus["masked"])

Convert to a DataFrame:

tfidf = pd.DataFrame(
    tfidf.toarray(), columns = tfidf_vectorizer.get_feature_names_out()
)
tfidf.index = pd.MultiIndex.from_arrays(
    [corpus["novel"], corpus["chapter"]],
    names = ["novel_name", "chapter_num"]
) 

And now, finally, the highest scoring tokens for every novel. Again, these are the most specific tokens.

max_tfidf_per_novel = tfidf.groupby("novel_name").max()
max_tfidf_per_novel.idxmax(axis = 1).to_frame(name = "top_token")
top_token
novel_name
Ambassadors little
Awkward Age know
Bostonians policeman
Confidence vivian
Golden Bowl bowl
Ivory Tower uncle
Outcry outcry
Portrait Of A Lady dance
Princess Casamassima fiddler
Reverberator germain
Roderick Hudson prince
Sacred Found grow
Spoils Poynton negotiation
The American marquis
The Europeans said
Tragic Muse boat
Washington Square father
Watch And Ward cousin
What Maisie Knew ladyship
Wings Of The Dove marian

Top-scoring tokens for each chapter in What Maisie Knew. Use an empty slice (slice(None)) to select all entries in the second level of the DataFrame's MultiIndex.

maisie = tfidf.loc[("What Maisie Knew", slice(None))]
maisie_max = pd.DataFrame({
    "token": maisie.idxmax(axis = 1),
    "tfidf": maisie.max(axis = 1)
})
maisie_max
token tfidf
chapter_num
1 nurse 0.186544
2 lies 0.204308
3 papa 0.285493
4 diadem 0.219282
5 brougham 0.173980
6 governess 0.239340
7 papa 0.196547
8 papa 0.238696
9 ladyship 0.364359
10 mamma 0.275737
11 ladyship 0.293165
12 ladyship 0.303566
13 child 0.153145
14 child 0.155951
15 papa 0.156633
16 mother 0.276220
17 squared 0.147945
18 papa 0.141638
19 father 0.161065
20 ladyship 0.194401
21 ladyship 0.209321
22 fishwife 0.134534
23 ladyship 0.143543
24 afraid 0.141128
25 pays 0.159513
26 sands 0.128149
27 come 0.149019
28 divorce 0.212882
29 salon 0.216161
30 waiter 0.223491
31 pupil 0.163950

5.4. Document Classification#

Each document in the weighted DTM is now a feature vector: a sequence of values that encode information about token distributions. These vectors allow us to estimate how likely particular token distributions are for particular groups of documents, which is the basis for classification.

5.4.1. The Multinomial Naive Bayes classifier#

We use a Multinomial Naive Bayes model to classify documents according to their assigned label, or class. The model trains by calculating the prior probability of each class and the likelihood of each token given that class. To label a document, it combines these quantities into a posterior probability for every class and selects the class with the highest posterior.

  • Naive Bayes: assumes conditionally independent features

  • Multinomial distribution: models probabilities of counts across categories

  • Prior probability: the probability of an event before observing new data

  • Posterior probability: the probability of an event after observing new data

  • Argmax (maximum a posteriori decision): predicts the class with the highest posterior probability

The formula for our classifier is as follows:

\[ P(C_k|x) \propto P(C_k) \prod_{i=1}^n P(x_i|C_k) \]

Where:

  • \(P(C_k)\): prior probability of class \(C_k\)

  • \(P(x_i|C_k)\): likelihood of feature \(x_i\) given class \(C_k\)

  • \(P(C_k|x)\): posterior probability of class \(C_k\) given feature vector \(x\)
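
As a concrete reading of the prior term in this formula, the \(P(C_k)\) values the classifier estimates are simply the class proportions in its training labels. We can preview them from the full label column (a sketch; the fitted model uses only the training split):

# Class proportions: an approximation of the priors the classifier will estimate
print(corpus["hoover"].value_counts(normalize = True).sort_index())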

5.4.2. Training a classifier#

No need to do this math ourselves; scikit-learn can do it. But first, we split our data and their corresponding labels into training and testing datasets. The model will train on the former, and we will validate that model on the latter (which is data it hasn’t yet seen).

X_train, X_test, y_train, y_test = train_test_split(
    tfidf, corpus["hoover"], test_size = 0.3, random_state = 357
)
print(f"Train set size: {len(X_train)}\nTest set size: {len(X_test)}")
Train set size: 394
Test set size: 169

Train the model using the same initialization/fitting pattern from before.

classifier = MultinomialNB(alpha = 0.005)
classifier.fit(X_train, y_train)
MultinomialNB(alpha=0.005)

5.4.3. Model diagnostics#

Use the .score() method to return the mean accuracy for all labels given test data. This is the number of correct predictions divided by the total number of test documents.

accuracy = classifier.score(X_test, y_test)
print(f"Model accuracy: {accuracy:.4f}")
Model accuracy: 0.9645

Generate a classification report to get a class-by-class summary of the classifier's performance. This requires you to make predictions on the test set, which you then compare against the true labels.

preds = classifier.predict(X_test)

Now make and print the report.

periods = ["1871-81", "1886-90", "1896-99", "1901-17"]
report = classification_report(y_test, preds, target_names = periods)
print(report)
              precision    recall  f1-score   support

     1871-81       1.00      1.00      1.00        58
     1886-90       0.98      1.00      0.99        51
     1896-99       1.00      0.81      0.89        26
     1901-17       0.87      0.97      0.92        34

    accuracy                           0.96       169
   macro avg       0.96      0.94      0.95       169
weighted avg       0.97      0.96      0.96       169

The above scores describe trade-offs between the following kinds of predictions:

  • True positive (TP): correctly predicts the class of interest

  • True negative (TN): correctly predicts that a document does not belong to the class of interest

  • False positive (FP): predicts the class of interest when the true label is another class

  • False negative (FN): predicts another class when the true label is the class of interest

Here is a breakdown of score types:

  • Precision: the accuracy of positive predictions, \(P = \frac{TP}{TP + FP}\)

  • Recall: the ability to find all relevant instances, \(R = \frac{TP}{TP + FN}\)

  • F1: a balance of precision and recall, \(F1 = 2\times \frac{P\times R}{P + R}\)

The weighted average of these scores weights each class by its proportion of the test set; the macro average reports an unweighted mean across classes. The support for each class is the number of test documents with that label.
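
If you want to see the raw counts behind these scores, a confusion matrix lays them out class by class. This is a supplementary sketch; confusion_matrix is not imported in the preliminaries above:

from sklearn.metrics import confusion_matrix

# Rows are true labels, columns are predicted labels
confusion = confusion_matrix(y_test, preds)
print(pd.DataFrame(confusion, index = periods, columns = periods))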

This model does extremely well. In fact, it may be a touch overfitted: too closely matched with its training data and therefore incapable of generalizing beyond that data. For our purposes this is less of a concern because the corpus and analysis are both constrained, but you might be suspicious of high-scoring results like this in other cases.
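
One routine way to probe that suspicion is cross-validation, which refits the model on several different train/test partitions and averages the scores. This is a supplementary sketch, not part of the analysis above, and it uses an import not listed in the preliminaries:

from sklearn.model_selection import cross_val_score

# Five-fold cross-validation of the same model on the full TF-IDF matrix
scores = cross_val_score(
    MultinomialNB(alpha = 0.005), tfidf, corpus["hoover"], cv = 5
)
print(f"Mean accuracy across folds: {scores.mean():.4f}")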

5.4.4. Top tokens per class#

Recall that the classifier makes its decisions based on the posterior probability of each class given a feature vector. That means certain tokens in the corpus are especially likely to appear under each class. What are they?

First, extract the feature names, the class labels, and the log probabilities for each feature.

feature_names = tfidf_vectorizer.get_feature_names_out()
class_labels = classifier.classes_
log_probs = classifier.feature_log_prob_

Now iterate through every class and extract the top_n tokens (most probable tokens).

top_n = 25
for idx, label in enumerate(class_labels):
    # Get the probabilities and sort them. Sorting is in ascending order, so
    # flip the array
    probs = log_probs[idx]
    sorted_probs = np.argsort(probs)[::-1]

    # The above array contains the indexes that would sort the probabilities.
    # Take the `top_n` indices, then get the corresponding tokens by indexing
    # `feature_names`
    top_probs = sorted_probs[:top_n]
    top_tokens = feature_names[top_probs]

    # Print the results
    print(f"Top tokens for {label}:")
    print("\n".join(top_tokens), end = "\n\n")
Top tokens for 0:
said
little
don
know
say
think
good
young
like
great
man
come
time
moment
father
asked
looked
make
girl
went
shall
looking
eyes
vivian
old

Top tokens for 1:
said
little
know
like
don
young
come
say
man
didn
time
good
think
great
way
make
moment
old
girl
people
want
went
lady
thought
mother

Top tokens for 2:
know
said
little
don
say
mother
time
child
just
like
come
way
quite
mean
think
friend
make
really
things
moment
dear
good
looked
old
thing

Top tokens for 3:
said
know
little
time
don
quite
come
moment
just
really
way
mean
say
like
friend
question
make
man
things
fact
didn
thing
good
think
want

These are pretty general. Even with TF–IDF weighting, tokens that are common across fiction persist. But we can compute the difference between the log probabilities for one class and the mean log probabilities of all other classes. That will give us more distinctive tokens for each class.

for idx, label in enumerate(class_labels):
    # Remove the current class's log probabilities, then calculate their mean
    other_classes = np.delete(log_probs, idx, axis = 0)
    mean_log_probs = np.mean(other_classes, axis = 0)

    # Find the difference between this mean and the current class's log
    # probabilities
    difference = log_probs[idx] - mean_log_probs

    # Sort as before
    sorted_probs = np.argsort(difference)[::-1]
    top_probs = sorted_probs[:top_n]
    top_tokens = feature_names[top_probs]

    # And print
    print(f"Distinctive tokens for {label}:")
    print("\n".join(top_tokens), end = "\n\n")
Distinctive tokens for 0:
vivian
osmond
marquis
honor
favor
recognized
humor
townsend
color
colored
parlor
catherine
monsieur
ardor
countess
evers
chateau
confectioner
neighbors
neighboring
isabel
misfortunes
dishonor
rowland
bruises

Distinctive tokens for 1:
burrage
agnes
bookbinder
univers
abbey
olive
fiddler
farrinder
proberts
embassy
dressmaker
millicent
poppa
tarrant
plebeian
canvases
electors
stile
verena
boulevard
mediocrity
intonations
heroes
comedian
daresay

Distinctive tokens for 2:
straighteners
farange
maltese
fleda
negotiation
diadem
stepfather
owen
rolls
avez
hug
banister
vanderbank
schoolroom
pelisse
pupil
curate
miser
maisie
wix
profitably
frightening
tishy
texts
overmore

Distinctive tokens for 3:
strether
crimble
milly
lowder
connections
chad
pococks
insistent
clearance
pounce
enhance
sagely
entresol
murmurous
psychologic
embroider
reaffirmed
intensified
clues
cessation
tortoise
densher
disclaimed
showily
hillside

5.5. Visualization#

Let’s visualize our documents in a scatterplot so we can inspect the corpus at scale.

5.5.1. Dimensionality reduction#

To do this, we’ll need to transform our TF–IDF vectors into simplified representations. Right now, these vectors are extremely high dimensional:

_, num_dimensions = tfidf.shape
print(f"Number of dimensions: {num_dimensions:,}")
Number of dimensions: 26,053

This number far exceeds the two or three dimensions we can show in a plot.

We use principal component analysis, or PCA, to reduce the dimensionality of our vectors so we can plot them. PCA identifies axes (principal components) that maximize variance in data and then projects that data onto the components. This reduces the number of features in the data but retains important information about each vector. Take a look at Margaret Fleck’s lecture notes if you’d like to see how this process works in detail.

pca = PCA(0.95, random_state = 357)
pca.fit(tfidf)
PCA(n_components=0.95, random_state=357)

The PCA reducer's .explained_variance_ratio_ attribute contains the proportion of the total variance captured by each principal component. Their sum should come to at least the proportion we set above.

exp_variance = np.sum(pca.explained_variance_ratio_)
print(f"Explained variance: {exp_variance:.2f}")
Explained variance: 0.95
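
Relatedly, the number of components PCA kept to reach that threshold is stored in the .n_components_ attribute. The exact count is not shown in the original output, so it isn't reported here:

# How many principal components were needed to reach the variance threshold
print(f"Components kept: {pca.n_components_}")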

Slice out the first \(k\) components to see how much of the variance they explain together. scikit-learn sorts components by explained variance, so the first ones always capture the most.

k = 25
exp_variance = np.sum(pca.explained_variance_ratio_[:k])
print(f"Explained variance of {k} components: {exp_variance:.2f}")
Explained variance of 25 components: 0.15

The first two components do not explain very much variance, but they will be enough for visualization.

k = 2
exp_variance = np.sum(pca.explained_variance_ratio_[:k])
print(f"Explained variance of {k} components: {exp_variance:.2f}")
Explained variance of 2 components: 0.04

5.5.2. Plotting documents#

To plot, transform the TF–IDF scores and format the reduced data as a DataFrame.

reduced = pca.transform(tfidf)
vis_data = pd.DataFrame(reduced[:, 0:2], columns = ["x", "y"])
vis_data["label_idx"] = corpus["hoover"].copy()
vis_data["label"] = vis_data["label_idx"].map(lambda x: periods[x])

Create a plot.

plt.figure(figsize = (10, 10))
g = sns.scatterplot(
    x = "x",
    y = "y",
    hue = "label",
    palette = "tab10",
    alpha = 0.8,
    data = vis_data,
)
g.set(title = "James chapters", xlabel = "Dim. 1", ylabel = "Dim. 2")
plt.show()
[Figure: scatterplot of James chapters projected onto the first two principal components, colored by period label]

There isn't perfect separation here. Might some of the overlapping points be mis-classified documents? Let's run predictions across all documents and re-plot with those predictions marked.

all_preds = classifier.predict(tfidf)

Where are labels incorrect?

vis_data["incorrect"] = np.where(
    all_preds == vis_data["label_idx"], False, True
)

Re-plot.

plt.figure(figsize = (10, 10))
g = sns.scatterplot(
    x = "x",
    y = "y",
    hue = "label",
    style = "incorrect",
    size = "incorrect",
    sizes = (300, 35),
    palette = "tab10",
    data = vis_data,
    legend = "full"
)
g.set(title = "James chapters", xlabel = "Dim. 1", ylabel = "Dim. 2")
plt.show()
[Figure: the same scatterplot, with mis-classified chapters shown as enlarged markers]

It does indeed seem that mis-classified documents sit right along the border between two classes. Keep in mind, though, that dimensionality reduction often introduces visual distortions, so plots like this can sometimes be misleading.

Finally, which documents are these?

idx = vis_data[vis_data["incorrect"]].index
model_pred = all_preds[idx]
corpus.loc[idx, ["novel", "chapter", "hoover"]].assign(model_pred = model_pred)
novel chapter hoover model_pred
355 Spoils Poynton 13 2 3
384 What Maisie Knew 20 2 3
400 Awkward Age 5 2 3
422 Awkward Age 27 2 3
424 Awkward Age 29 2 3
562 Ivory Tower 13 3 1