10. Cross Entropy#

This chapter uses toy data to demonstrate how machine learners use cross entropy as a metric for evaluating model fit.

10.1. Preliminaries#

We need the following libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

10.2. Data Generation#

We’re using toy data for this chapter. Below, we randomly sample 100 values from each of three normal distributions with different means (loc) and standard deviations (scale). We’ll use these values to simulate bigram transitions in text data.

corpus = np.random.normal(loc = 5, scale = 1.0, size = 100)
A = np.random.normal(loc = 6.8, scale = 1.3, size = 100)
B = np.random.normal(loc = 4.5, scale = 0.3, size = 100)
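
These draws are random, so your exact numbers (and the cross-entropy values reported later in the chapter) will differ from the ones shown here. If you want reproducible samples, seed NumPy before running the cell above; the seed value below is arbitrary.

# Optional: fix NumPy's global seed so repeated runs produce the same samples
np.random.seed(357)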

Let’s plot these raw values. First, we format them into a long-form DataFrame.

df = pd.DataFrame({"corpus": corpus, "A": A, "B": B})
vis = (
    df
    .stack()
    .to_frame("value")
    .reset_index(level = 1)
    .rename(columns = {"level_1": "distribution"})
)
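
If the stack and reset_index chain looks opaque, pandas’ melt method produces an equivalent long format for plotting; this is just an alternative, not the approach used in the rest of the chapter.

# Same result: one row per sample, with columns "distribution" and "value"
vis_alt = df.melt(var_name = "distribution", value_name = "value")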

Time to plot.

plt.figure(figsize = (6, 4))
g = sns.kdeplot(x = "value", hue = "distribution", data = vis)
g.set(
    title = "Raw values",
    xlabel = "Value",
    ylabel = "Est. frequency of samples w/ value"
)
plt.show()

10.3. Using Probabilities#

Now, we’ll convert these raw values into probabilities. The values above are continuous, so exact matches between samples are vanishingly rare. To get usable probabilities, we “bin” the values: samples that are close together are counted in the same bin, and each bin’s share of the samples becomes its probability.
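
To see what binning does, here is a tiny made-up example: three values near 5 land in the first of two bins and one value lands in the second, so the bins get probabilities 0.75 and 0.25.

# Toy illustration of np.histogram binning (values are made up)
tiny = np.array([4.9, 5.05, 5.1, 6.2])
hist, edges = np.histogram(tiny, bins = 2, density = True)
print(hist * (edges[1] - edges[0]))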

def to_probability(samples, n_bins = 10):
    """Discretize samples from a distribution into bins and calculate their
    probabilities.

    Parameters
    ----------
    samples : np.ndarray
        Distribution samples
    n_bins : int
        Number of bins

    Returns
    -------
    probs : np.ndarray
        Bin probabilities
    """
    # Discretize the values into bins
    hist, edges = np.histogram(samples, bins = n_bins, density = True)

    # Calculate the bin width. np.histogram creates equal-width bins, and every
    # sample that falls between a bin's edges is counted in that bin
    width = edges[1] - edges[0]

    # Calculate the probabilities
    probs = hist * width
    probs /= probs.sum()

    return probs
probs = df.apply(to_probability)
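
Each column of probs should now sum to 1. A quick check, not part of the pipeline, just a verification:

# Every column should be a valid probability distribution
assert np.allclose(probs.sum(), 1.0)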

For the purposes of demonstration, we’ll label these probabilities as bigrams; that is, each value represents the probability of a particular bigram sequence. The first column, corpus, holds probabilities estimated from some corpus data. The other two, A and B, hold the probabilities that two language models assign to the same bigrams.

probs.index = [f"bigram-{i}" for i in range(len(probs))]
probs
          corpus     A     B
bigram-0    0.01  0.08  0.02
bigram-1    0.04  0.09  0.02
bigram-2    0.06  0.14  0.05
bigram-3    0.14  0.21  0.06
bigram-4    0.21  0.16  0.12
bigram-5    0.20  0.16  0.19
bigram-6    0.14  0.09  0.16
bigram-7    0.14  0.04  0.16
bigram-8    0.05  0.01  0.15
bigram-9    0.01  0.02  0.07

Below, we plot the probabilities. Pay special attention to the shape of the curves. Does A or B seem to fit corpus better? Whichever one does, we could say it is the better model of corpus.

First, formatting.

vis = (
    probs
    .stack()
    .to_frame("prob")
    .reset_index(level = 1)
    .rename(columns = {"level_1": "model"})
)

And plot.

plt.figure(figsize = (6, 4))
g = sns.kdeplot(x = "prob", hue = "model", data = vis)
g.set(
    title = "Probability values",
    xlabel = "Value",
    ylabel = "Est. frequency of bigrams w/ value"
)
plt.show()

10.4. Cross Entropy#

But how can we know for sure whether one model is better than another? That’s where cross entropy comes in. It tells us how well A or B fits corpus by providing a loss metric. When language models train, they try to minimize this metric by updating their weights at each step of the training process.
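
Formally, if P is the corpus distribution over bigrams and Q is a model’s distribution, the function below computes

$$
H(P, Q) = -\sum_{w} P(w) \log Q(w)
$$

The better Q matches P, the lower this value.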

def calculate_cross_entropy(Pw, Qw):
    """Calculate the cross-entropy of distribution against another.

    Parameters
    ----------
    Pw : np.ndarray
        True values of the distribution
    Qw : np.ndarray
        Predicted distribution

    Returns
    -------
    cross_entropy : float
        The cross entropy
    """
    # Ensure we have no 0 values
    Qw = np.clip(Qw, 1e-10, 1.0)

    # Unlike in the n-gram modeling chapter, we use the natural log here; we
    # are not comparing this value against information measured in bits
    # elsewhere, so the base of the logarithm only changes the units
    log_Qw = np.log(Qw)
    Sigma = np.sum(Pw * log_Qw)
    cross_entropy = -Sigma

    return cross_entropy
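
If you have SciPy installed (it is not imported in the preliminaries, so treat this as an optional aside), you can cross-check the function through the identity that cross entropy equals the entropy of P plus the KL divergence of P from Q, assuming Q has no zero-probability bins.

from scipy.stats import entropy

def cross_entropy_check(Pw, Qw):
    # entropy(Pw) is the Shannon entropy of P; entropy(Pw, Qw) is the
    # KL divergence of P from Q. Both use the natural log by default,
    # matching calculate_cross_entropy above
    return entropy(Pw) + entropy(Pw, Qw)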

How well do A and B fit corpus?

A_ce = calculate_cross_entropy(probs["corpus"], probs["A"])
B_ce = calculate_cross_entropy(probs["corpus"], probs["B"])
print("A:", A_ce)
print("B:", B_ce)
A: 2.2665247071118846
B: 2.1811955804605505

Which is better, A or B? Lower cross entropy means a better fit, so we pick the model with the smaller value.

choices = ["A", "B"]
idx = np.argmin([A_ce, B_ce])
print(choices[idx])
B