3. Data Analysis in Python

This chapter will show you the basics of analyzing data in Python. We will load text files into memory, align them with corresponding metadata, and produce information about their contents. Also covered: preparing text for numerical operations and graphing data.

3.1. Preliminaries

The packages we’ll need today will help us load text files (pathlib), split them into discrete tokens (nltk), analyze those tokens (numpy, pandas), and plot the results (seaborn, matplotlib).

from pathlib import Path
import re

import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

3.2. Loading a Corpus

The obituaries are stored in individual plain text files at the location below. We wrap this file path in a Path object to make interacting with our computers’ file systems more streamlined.

datadir = Path("data/texts/nyt/obituaries")

Use a glob pattern to retrieve paths to .txt files. The output of the .glob() method is a generator, so convert it to a list.

paths = list(datadir.glob("*.txt"))
print(paths[:5])
[PosixPath('data/texts/nyt/obituaries/289.txt'), PosixPath('data/texts/nyt/obituaries/262.txt'), PosixPath('data/texts/nyt/obituaries/276.txt'), PosixPath('data/texts/nyt/obituaries/060.txt'), PosixPath('data/texts/nyt/obituaries/074.txt')]
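The order in which .glob() yields paths depends on the file system, which is why the sample above is not in numerical order. If a stable order matters, one option is to sort the paths. The sketch below is only an illustration: it uses a temporary directory with made-up file names rather than the obituary corpus, so it runs anywhere.

```python
from pathlib import Path
import tempfile

# Build a throwaway directory with hypothetical file names for illustration
with tempfile.TemporaryDirectory() as tmp:
    tmpdir = Path(tmp)
    for fname in ["002.txt", "000.txt", "001.txt", "notes.md"]:
        (tmpdir / fname).write_text("placeholder", encoding="utf-8")

    # sorted() consumes the generator and returns a list ordered by path;
    # the glob pattern leaves out the non-.txt file
    sorted_names = [path.name for path in sorted(tmpdir.glob("*.txt"))]

print(sorted_names)
```

Sorting is optional here, since later in the chapter the metadata file determines the document order.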

As we did in the last chapter, we can load one of these files. Note the slight difference in syntax when using Path.

random_path = np.random.choice(paths)
with random_path.open("r") as fin:
    doc = fin.read()
    print(doc)
November 16, 1978

 OBITUARY

 Margaret Mead Is Dead of Cancer at 76

 By ALDEN WHITMAN

 Margaret Mead, the anthropologist, author, lecturer and social critic, died yesterday at New York Hospital after a yearlong battle with cancer. She was 76 years old.

 Dr. Mead, who was curator emeritus of the department of anthropology at the American Museum of Natural History, had known that she had cancer but remained active at her work until she entered the hospital on Oct.3, according to a museum spokesman.

 President Carter mourned her death, saying in a statement that she had "brought the humane insights of cultural anthropology to a public of millions." There were other tributes from Kurt Waldheim, Secretary-General of the United Nations; from

 Mayor Koch; the Smithsonian Institution; Edward J. Lehman, the executive director of the American Anthropological Association, and Faye Wattleton, president of the Planned Parenthood Federation. "I'm in the middle of several different things," Dr. Mead said offhandedly a few years ago and reeled off to an inquiring friend a dozen projects that she was pursuing simultaneously. She was not boasting. She was just stating a fact of her life that had been true since early childhood. The slight but sturdy Dr. Mead was possessed of virtually boundless energy, an unquenchable curiosity, a tenacious memory and a genius for organizing her time.

 She often gave the impression of being ubiquitous because she was rarely at rest in any one place for very long and because she could not permit a moment to pass unutilized. In all this she had a zest that even in her 70's confounded friends and colleagues of lesser verve.

 The American Museum of Natural History, with which she was associated for most of her professional life, once drew up a list of subjects in which she was "a specialist." The list read: "Education and culture; relationship between character structure and social forms; personality and culture; cultural aspects of problems of nutrition; mental health; family life; ecology; ekistics; transnational relations; national character; cultural change, and cultural building."

 The museum might well have added "et cetera," for Dr. Mead was not only an anthropologist and ethnologist of the first rank but also something of a national oracle on other subjects ranging from atomic politics to feminism. She took on (and dismissed with disdain) Dr. Edward Teller, the hydrogen bomb advocate, and she was once described as "a general among the foot soldiers of modern feminism." Insofar as anyone can be a polymath,

 Dr. Mead was widely regarded as one.

 Headed Science Association

 One evidence of her formidable powers was her election, at the age of 72, to the presidency of the American Association for the Advancement of Science. She was the second woman to head this group, one of the ranking organizations of the country's scientific community. Her stature as a scientist has been assured for many years, albeit somewhat grudgingly because she was a woman in a male-dominated discipline.

 For those who saw Dr. Mead in middle age and on, she was a robust, 5-foot-2-inch figure who carried a forked walking stick (she broke her ankle years ago). Her head was topped with fluffy, slightly curly hair cut in bangs, and her feet were shod in plain leather sandals. Her voice was melodious, and her face, with its rimless glasses, was pleasant and open.

 Although she could be lacerating, she was more often gentle and witty. She believed civilized mankind to be often ill-informed and pigheaded, yet she usually displayed great compassion for its individual members.

 From the publication of her first book, "Coming of Age in Samoa," in 1928, in which she described the values of adolescent lovemaking in Samoan society, Dr. Mead's name became associated with sexual theory. A good deal of her subsequent writing contended that sexual repression worked against healthy maturation of the young and against successful marriages. 'Eclectic Circuitry'

 Her anthropological studies also covered other topics and were generally highly regarded as making her an expert in the sociocultural life of primitive peoples. Some, though, were reserved about the seeming contradictory nature of her material. "She illustrates the principle of eclectic circuitry," one critic said.

 The number of Dr. Mead's scientific and popular lectures was staggering--110 in one sample 12-month period--and each was different. Her popular lectures, delivered usually to overflow crowds, were sometimes on rather esoteric subjects. "Acculturation

 Among the Iatmul Tribe of New Guinea" was one of them. ("For years I have been able to guarantee audiences a good address by using words that aren't in home dictionaries," she once said.)

 Sometimes she got her audiences mixed up. Once, for example, she spoke learnedly on sex deviations among the Tchambuli to a group of theologians. They took it in good part, as did a men's luncheon club whose members applauded her talk on cultural stability in the South Seas.

 Over the years, Dr. Mead lectured, sometimes for no fee, on such subjects as air pollution, hunger, mental hygiene, sex, women's careers, population control, primitive art, the family, nutrition, city planning, military service, tribal customs, alcoholism, child development, architecture, drugs and civil liberties. No matter what the topic, she did her homework. After one talk on tribal customs, a questioner asked about consumption of betel nuts in the Admiralty Islands. There was a ready and long response, as if betel nut problems were her life work.

 An Active Advocate

 Dr. Mead's fellow anthropologists were often uneasy about her. "You wonder what she'll take off on next," one said some years ago. "We know what Dr. Blank will say--he's probably already distributed his paper. But we're never sure about Margaret Mead."

 Not only was Dr. Mead unpredictable; sometimes she also did not abide by the rules of behavior that most scientists set for themselves. Anthropology, essentially the study of adaptation, should refrain from influencing the events it observes and interprets, most scientists believe. But Dr. Mead, according to her critics was not only a student of adaptation but also an active advocate of many specific changes in modern society.

 The critics, however, almost universally admired her as a person, however much they were distressed by her as a scholar-activist. They thought she was too scattershot and sometime self-contradictory. "But then," one critic said, "we do owe a lot to Margaret for putting us on the map."

 Some social scientists thought that Dr. Mead was lacking in introspection on the human relations of her field work in the South Seas. "The remarkable thing about Margaret is that she's always been interested in the psychological end of anthropology and is, in fact, one of the leading contributors to the field," a critic said. "But her first love and primary interest is the study of culture, and she never gets to the person in the full sense." 'Oh, Piffle'

 To this and other criticisms, Dr. Mead's usual reaction was, "Oh, piffle." It was said with noticeable spunk, tinged with disdain.

 Spunkiness was, indeed, among Margaret Mead's earliest traits. Born in Philadelphia on Dec. 16, 1901, she was the daughter of Edward and Emily Fogg Mead. Her father, who taught economics at the University of Pennsylvania, had hoped for a son and once told his daughter, "It's a pity you aren't a boy; you'd have gone far."

 She determined to go to college and did, to De Pauw University, from which she went to Barnard College to get her Bachelor of Arts degree in 1923.

 At Barnard, the young student met Franz Boas, a magnetic man who was one of the world's ranking anthropologists. He became her mentor, and she became one of his four graduate students at Columbia, where she took her M.A. in 1924 and her Ph.D. in 1929. "Franz Boas had to plan--much as if he were a general," Dr. Mead recalled, "with only a handful of troops to save a whole country." Dr. Boas thought she ought to work among American Indians, his area of interest, but she wanted to investigate Polynesia.

 Spunkiness and Guile

 Her spunkiness won out, assisted by a bit of guile. She suggested to Dr. Boas that he was trying to manipulate her and suggested to her father that her mentor was trying to control his daughter. Dr. Boas gave in, and her father gave her $1,000 for a world trip.

 By this time, Dr. Mead was married to Luther S. Cressman, a young seminarian who often joked unhumorously of having to make an appointment to see his wife. They parted temporarily when she went to Samoa in 1926.

 On shipboard, there was a love affair with Reo F. Fortune, a New Zealand anthropologist, to whom she was married after a brief reconciliation with Dr. Cressman. Meanwhile, she did the field work for and wrote "Coming of Age in Samoa." From the start, it was enormously popular, especially among young people, some of whom were influenced by it to become anthropologists.

 The scientific question underlying "Coming of Age in Samoa" was whether "the disturbances which vex our adolescents [are] due to the nature of adolescence itself or the civilization." Her findings suggest that the answer was the civilization.

 The easygoing ways in Samoa minimized conflict and the incidence of neurotic personalities due to guilt feelings.

 Two Daring Chapters

 The book was descriptive rather than statistical. It also included two chapters that daringly applied her findings to modern society, in which she proposed that straitlaced sex attitudes might be relaxed without "accepting promiscuity."

 The book has often been attacked in scientific circles as too subjective and lacking the data for verifiable behavior. However, her conclusions were based on detailed observation, and if she did not conduct anthropometric tests or produce statistical surveys she did convey her subjects graphically. A typical sentence read, "Her grandmother is very old; the muscles in her neck are stringy like uncooked pork."

 Dr. Mead settled down with the people she was studying. She ate their wild boar, wild pigeon and dried fish; helped to care for ill children, and gained the confidence of her informants. At one time she built a wall-less house so she could observe everything around her.

 She possessed a trait unusual in anthropologists of her time, an ability to shed her Western preconceptions. She would sit on the ground for hours without moving as she watched tribal peoples. "She knows how to use her eyes, how to see," said

 Ken Heyman, a fellow scientist. "She has an uncanny perception for different cultural styles."

 Books Showed Intuition

 This finely attuned intuition was evident in her books on the seven cultures she studied-- Samoan, Manua, Arapesh, Mundugumor, Tchambuli, Iatmul and Balinese. Out of these inquiries came, in addition to "Coming of Age in Samoa," "Growing

 Up in New Guinea," "Sex and Temperament in Three Primitive Societies," "Balinese Character" and "New Lives for Old." Some of her most extensive studies were done with the Manua--she visited them several times--and they spoke her name as "Makrit Mit."

 Dr. Mead's association with tribal peoples was the subject of a notable New Yorker cartoon that depicted a tribal chief handing out books to boys about to be initiated into adolescence. "Rather than go into the details," he was saying. "I'm simply going to present each of you with a copy of this excellent book by Margaret Mead."

 The idea behind the cartoon was not far fetched, because the Iatmul peoples once met her at their dock singing "My Darling Clementine" and then carried her off to their village.

 Generalizing from her investigations, Dr. Mead said that each culture had its own distinct psychological profile. "Each society," ranquil. Dr. Mead and her husband Dr. Fortune, met Gregory Bateson, a British anthropologist, in New Guinea. There was a personal crisis among the three as a result of which there was a divorce, and Dr. Mead and Dr. Bateson were married. They had a daughter, Catherine. They were divorced after about fifteen years. "The Bateson years were probably the richest of her life, " a friend of Dr. Mead said, noting that she and her husband were "perfect partners in mind and temperament. " Recalling the union in her memoir, "Blackberry Winter, "

 Dr. Mead was wistful about her marriage and its years in Bali, saying: "I think it is a good thing to have such a model once [as Mr. Bateson] ing the union in her memoir, "Blackberry Winter, " Dr. Mead was wistful about her marriage and its years in Bali, saying: "I think it is a good thing to have such a model once [as Mr. Bateson] even if the model includes the kind of extra intensity in which a lifetime is condensed into a few short years. "

 In another recollection, she seemed to fault herself, saying "American women are good mothers, but they make poor wives; Americans are very poor at being attentive to anybody else. "

 Nevertheless, in their Bali years the couple took and annotated 25,000 photographs. This work, which was done in 1936-38, had a large impact on other anthropologists.

 Turned Dry: An Anthropologist Looks at America, " issued in 1942. The book dealt with American character outlined against the background of the seven other cultures she had studied. It increased the demand for her lectures and gave her the chance to speak out on current issues.

 One of the issues that she tackled was male-female relationships, her thoughts on which she gathered into "Male and Female: A Study of Sexes in a Changing World, " published in 1949. "A vast. turbulent book, " Rebecca West said of it. Among its observations was, "Differences in sex as they are known today are based on the bringing up by the mother--she is always pushing the female toward similarity and the male toward difference. "

 In more recent years, Dr. Mead became an outspoken leader of the feminist movement. Indeed, she felt it her duty to improve people's understanding of themselves and especially women's understanding of themselves. She liked to talk, often with scorching humor, about what she saw as the follies of conventional ways of loving, working, birthing, housing and aging. This sense of mission appeared to many to account for Dr. Mead's restless zeal. "She wanted to be a mother to the world, " a friend said. Taught at Columbia

 In addition to her post at the American Museum of Natural History, she was also adjunct professor of anthropology at Columbia and taught tho the world, " a friend said. Taught at Columbia

 In addition to her post at the American Museum of Natural History, she was also adjunct professor of anthropology at Columbia and taught the subject at Fordham.

 In addition to her daughter, Mary Catherine Bateson Kassarjian, dean of social sciences at Raza Shah Civar University in Iran, Dr. Mead is survived by a granddaughter, Sevanne, and a sister, Elizabeth Mead Steig of Cambridge, Mass.

 Funeral services will be private and burial will be in Buckingham, Pa. A memorial service will be held at 2 P.M. tomorrow in St. Paul's Chapel, Columbia University.

 

It will make our lives easier to define a function that loads all files at once. That way we only have to call that function, rather than rewriting the loading code every time we want files. Here is what the load_corpus() function below does:

  1. Steps through each path in paths

  2. Opens the file, reads its contents, and appends them to a list

def load_corpus(paths):
    """Load a corpus from paths.

    Parameters
    ----------
    paths : list[Path]
        A list of paths

    Returns
    -------
    corpus : list[str]
        The corpus
    """
    # Initialize an empty list to store the corpus
    corpus = []
    
    # March through each path, open the file, and load it
    for path in paths:
        with path.open("r") as fin:
            doc = fin.read()
            # Then add the file to the list
            corpus.append(doc)

    # Return the result: a list of strings, where each string is the contents
    # of a file
    return corpus

With this function defined, we load our files.

corpus = load_corpus(paths)
print("Size of the corpus:", len(corpus), "files.")
Size of the corpus: 379 files.
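For comparison, the same loading logic can be written more compactly with Path.read_text(), which opens, reads, and closes a file in one call. This is a sketch rather than a replacement for load_corpus(): the name load_corpus_compact is hypothetical, the UTF-8 encoding is an assumption about the files, and the demonstration uses two made-up documents in a temporary directory.

```python
from pathlib import Path
import tempfile


def load_corpus_compact(paths, encoding="utf-8"):
    """Hypothetical compact variant of load_corpus()."""
    # Path.read_text() opens the file, reads it, and closes it in one call
    return [path.read_text(encoding=encoding) for path in paths]


# Demonstrate on two made-up files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = []
    for i, content in enumerate(["first document", "second document"]):
        path = Path(tmp) / f"{i:03d}.txt"
        path.write_text(content, encoding="utf-8")
        demo_paths.append(path)

    demo_corpus = load_corpus_compact(demo_paths)

print(len(demo_corpus), "files loaded")
```

Stating the encoding explicitly keeps the result from depending on the platform default.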

3.3. Working with Tabular Data

Note, however, that the file names do not tell us whose obituary each file contains:

print(random_path.name)
252.txt

Were we to run analyses on these documents, we would have no guide telling us which data is about which document. This is where metadata comes in. Often when working with text data, you will find information about the contents of a corpus stored separately from the data itself. Part of your workflow will require aligning corpus contents with this metadata.

3.3.1. Loading tabular data

In our case, metadata is stored in a comma-separated values (CSV) file, a plain text format for tabular data. Tabular data arranges information in columns and rows, just like a spreadsheet. The pandas package helps us work with this kind of data. When we load it into Python, we create a DataFrame. Just as with a spreadsheet, a DataFrame has columns and rows. But it also offers far more functionality for working with its contents.
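To make the format concrete, here is a minimal sketch that parses a tiny, made-up CSV from a string instead of a file. The rows are invented for illustration and are not taken from the obituary metadata.

```python
import io

import pandas as pd

# A small made-up CSV: the first line holds column names, the rest are rows
csv_text = "name,year,file\nPerson A,1900,000.txt\nPerson B,1950,001.txt\n"

# read_csv() accepts file-like objects as well as file paths
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)
print(toy.columns.tolist())
```

Each comma-separated field becomes a cell, and pandas infers a datatype for every column (here, integers for year).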

Below, we load our metadata.

manifest = pd.read_csv("data/texts/nyt/metadata.csv")

Here is a high-level overview of the metadata. It shows the columns and their names, the number of observations in each column that contain values, and the datatype of these columns.

manifest.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 379 entries, 0 to 378
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    379 non-null    object
 1   year    379 non-null    int64 
 2   file    379 non-null    object
dtypes: int64(1), object(2)
memory usage: 9.0+ KB

Just want to know the columns? Use the .columns attribute.

manifest.columns
Index(['name', 'year', 'file'], dtype='object')

Use the .head() method to look at the first few rows of the data.

manifest.head()
   name             year  file
0  Ada Lovelace     1852  000.txt
1  Robert E Lee     1870  001.txt
2  Andrew Johnson   1875  002.txt
3  Bedford Forrest  1877  003.txt
4  Lucretia Mott    1880  004.txt

Note the file column. Values stored there correspond to the names of files in the data directory. If we use file as a guide to construct a list of paths, the order of the files in corpus will be the same as the order in the DataFrame.

3.3.2. Indexing by column

Accessing values in file requires us to index the DataFrame. In pandas, we use bracket notation in conjunction with a column’s name to index that column.

manifest["file"]
0      000.txt
1      001.txt
2      002.txt
3      003.txt
4      004.txt
        ...   
374    374.txt
375    375.txt
376    376.txt
377    377.txt
378    378.txt
Name: file, Length: 379, dtype: object

There are several ways to index rows, which we discuss below. The simplest is to select a column and then index into it as you would a list.

manifest["file"][10]
'010.txt'

The above hints at what we do next: we use a list comprehension to iterate through each value in file and combine it with the data directory path. Note this time that we do not need to use a glob pattern, since we build the final path directly from the value in file using the / operator.

ordered_paths = [datadir / fname for fname in manifest["file"]]
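To see what the / operator produces, here is a small sketch. The directory and file names are hypothetical and nothing is read from disk.

```python
from pathlib import Path

# Hypothetical directory and file names, used only for illustration
demo_dir = Path("some/hypothetical/dir")
demo_files = ["000.txt", "001.txt"]

# Path / str joins the two into a new Path object
demo_paths = [demo_dir / fname for fname in demo_files]

print(demo_paths[0].as_posix())
print(demo_paths[0].name)
```

The joined object is still a Path, so attributes like .name and methods like .open() remain available.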

Now, when we load our corpus, it will be aligned to the metadata’s order.

corpus = load_corpus(ordered_paths)
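Building paths from the metadata assumes that every listed file exists on disk. A quick check can confirm this before loading. The sketch below runs on a made-up manifest and a temporary directory in which one listed file is deliberately absent. The names toy_manifest and missing are hypothetical.

```python
from pathlib import Path
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    toy_dir = Path(tmp)
    # Only two of the three listed files actually exist on disk
    for fname in ["000.txt", "001.txt"]:
        (toy_dir / fname).write_text("placeholder", encoding="utf-8")

    toy_manifest = pd.DataFrame({"file": ["000.txt", "001.txt", "002.txt"]})
    toy_paths = [toy_dir / fname for fname in toy_manifest["file"]]

    # Collect the names of any files the manifest lists but the disk lacks
    missing = [path.name for path in toy_paths if not path.exists()]

print(missing)
```

An empty list would mean the metadata and the directory agree.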

From here, we could go about our analysis. But that would involve jumping across two different objects, manifest and corpus. This is a pain, so we will create a new column in our metadata sheet and assign the contents of our corpus to it.

manifest["text"] = corpus.copy()

Under the hood, every column in a DataFrame is a Series. A DataFrame, in other words, is a collection of Series objects. The latter have much of the same functionality as the former, but DataFrames provide us with the ability to do more faceted indexing and global analyses.

For example, now that our corpus contents are stored in the text column, we can index that data alongside other information in the DataFrame. To do so, use a list of column names.

manifest[["name", "text"]].head()
   name             text
0  Ada Lovelace     A gifted mathematician who is now recognized a...
1  Robert E Lee     October 13, 1870\n\n OBITUARY\n\n Gen. Robert ...
2  Andrew Johnson   August 1, 1875\n\n OBITUARY\n\n Andrew Johnson...
3  Bedford Forrest  October 30, 1877\n\n OBITUARY\n\n Death of Gen...
4  Lucretia Mott    November 12, 1880\n\n OBITUARY\n\n Lucretia Mo...
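The difference between single and double brackets is worth a quick check. In the sketch below, which uses a made-up two-row DataFrame, a single column name returns a Series, while a list of names returns a DataFrame, even when the list holds only one name.

```python
import pandas as pd

# A toy DataFrame with invented values
toy = pd.DataFrame({"name": ["Person A", "Person B"], "year": [1900, 1950]})

single = toy["year"]    # one column name -> Series
double = toy[["year"]]  # list of column names -> DataFrame

print(type(single).__name__, type(double).__name__)
print(single.mean())
```

Both objects support many of the same methods, such as .mean() and .head().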

3.3.3. Indexing by row

Indexing by rows is more complicated than indexing by columns. This is because a DataFrame index serves three important roles:

  1. As metadata that provides more context about a dataset

  2. As a method of data alignment

  3. As a convenience function for subsetting data

Use the .index attribute to access the values of a DataFrame index. These values can be numbers, strings, dates, or other values.

manifest.index
RangeIndex(start=0, stop=379, step=1)

Like tuples, indexes are immutable. But you can replace the index of a DataFrame with a different one. Below, we set the index to the name column, using inplace=True so we do not need to reassign the DataFrame back to the same variable.

manifest.set_index("name", inplace=True)
manifest.head()
                 year  file     text
name
Ada Lovelace     1852  000.txt  A gifted mathematician who is now recognized a...
Robert E Lee     1870  001.txt  October 13, 1870\n\n OBITUARY\n\n Gen. Robert ...
Andrew Johnson   1875  002.txt  August 1, 1875\n\n OBITUARY\n\n Andrew Johnson...
Bedford Forrest  1877  003.txt  October 30, 1877\n\n OBITUARY\n\n Death of Gen...
Lucretia Mott    1880  004.txt  November 12, 1880\n\n OBITUARY\n\n Lucretia Mo...
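For reference, set_index() can also be used without inplace=True, in which case it returns a new DataFrame and leaves the original untouched. The sketch below uses a made-up DataFrame and also shows reset_index(), which moves the index back into an ordinary column.

```python
import pandas as pd

# A toy DataFrame with invented values
toy = pd.DataFrame({"name": ["Person A", "Person B"], "year": [1900, 1950]})

# Without inplace=True, set_index() returns a new DataFrame
indexed = toy.set_index("name")

# reset_index() moves the index back into an ordinary column
restored = indexed.reset_index()

print(indexed.index.tolist())
print(restored.columns.tolist())
```

Which style to use is largely a matter of preference. Reassignment makes it easier to keep the original DataFrame around.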

There are three ways to index by row:

  1. By integer position

  2. By label/name

  3. By a condition
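As a compact preview, the three approaches can be sketched on a toy DataFrame. The labels and years below are made up for illustration.

```python
import pandas as pd

# A toy DataFrame indexed by invented labels
toy = pd.DataFrame(
    {"year": [1900, 1950, 1975]},
    index=["Person A", "Person B", "Person C"],
)

by_position = toy.iloc[0]               # first row, whatever its label
by_label = toy.loc["Person B"]          # row whose label is "Person B"
by_condition = toy[toy["year"] > 1925]  # rows where the condition holds

print(by_position["year"], by_label["year"])
print(by_condition.index.tolist())
```

The first two return a single row as a Series. The condition returns a DataFrame containing every matching row.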

Indexing by integer position works with the .iloc property.

manifest.iloc[45]
year                                                 1922
file                                              045.txt
text    August 3, 1922\n\n OBITUARY\n\n Dr. Bell, Inve...
Name: Alexander Graham Bell, dtype: object

Use a sequence of values to return multiple rows:

manifest.iloc[[2, 4, 6, 8, 10]]
                year  file     text
name
Andrew Johnson  1875  002.txt  August 1, 1875\n\n OBITUARY\n\n Andrew Johnson...
Lucretia Mott   1880  004.txt  November 12, 1880\n\n OBITUARY\n\n Lucretia Mo...
Ulysses Grant   1885  006.txt  July 24, 1885\n\n OBITUARY\n\n The Career of a...
Emma Lazarus    1887  008.txt  November 20, 1887\n\n OBITUARY\n\n Emma Lazaru...
P T Barnum      1891  010.txt  April 8, 1891\n\n OBITUARY\n\n The Great Showm...

Or send a slice. Here, the first five rows:

manifest.iloc[0:5]
                 year  file     text
name
Ada Lovelace     1852  000.txt  A gifted mathematician who is now recognized a...
Robert E Lee     1870  001.txt  October 13, 1870\n\n OBITUARY\n\n Gen. Robert ...
Andrew Johnson   1875  002.txt  August 1, 1875\n\n OBITUARY\n\n Andrew Johnson...
Bedford Forrest  1877  003.txt  October 30, 1877\n\n OBITUARY\n\n Death of Gen...
Lucretia Mott    1880  004.txt  November 12, 1880\n\n OBITUARY\n\n Lucretia Mo...

Here, every tenth row:

manifest.iloc[::10]
                    year  file     text
name
Ada Lovelace        1852  000.txt  A gifted mathematician who is now recognized a...
P T Barnum          1891  010.txt  April 8, 1891\n\n OBITUARY\n\n The Great Showm...
James M N Whistler  1903  020.txt  July 18, 1903\n\n OBITUARY\n\n James M'N. Whis...
Joseph Pulitzer     1911  030.txt  Monday, October 30, 1911\n\n OBITUARY\n\n Jose...
C J Walker          1919  040.txt  May 26, 1919\n\n OBITUARY\n\n Wealthiest Negre...
Marie Curie         1929  050.txt  PARIS, July 4.--Mme. Marie Curie, whose work a...
Florenz Ziegfeld    1932  060.txt  July 23, 1932\n\n OBITUARY\n\n Florenz Ziegfel...
John W Heisman      1936  070.txt  October 4, 1936\n\n OBITUARY\n\n John W. Heism...
Howard Carter       1939  080.txt  March 3, 1939\n\n OBITUARY\n\n Howard Carter, ...
Alfred E Smith      1944  090.txt  October 4, 1944\n\n OBITUARY\n\n Alfred E. Smi...
Lord Keynes         1946  100.txt  April 22, 1946\n\n OBITUARY\n\n Lord Keynes Di...
Babe Ruth           1948  110.txt  August 17, 1948\n\n OBITUARY\n\n Babe Ruth, Ba...
Charles Spaulding   1952  120.txt  August 2, 1952\n\n OBITUARY\n\n Ex-Slave's Son...
Henri Matisse       1954  130.txt  November 4, 1954\n\n OBITUARY\n\n Art World Mo...
Charles Merrill     1956  140.txt  October 7, 1956\n\n OBITUARY\n\n Charles Merri...
John Dulles         1959  150.txt  May 25, 1959\n\n OBITUARY\n\n Dulles Formulate...
Carl G Jung         1961  160.txt  June 7, 1961\n\n OBITUARY\n\n Dr. Carl G. Jung...
Sean O Casey        1964  170.txt  September 19, 1964\n\n OBITUARY\n\n Sean O'Cas...
Albert Schweitzer   1965  180.txt  September 6, 1965\n\n OBITUARY\n\n Albert Schw...
Langston Hughes     1967  190.txt  May 23, 1967\n\n OBITUARY\n\n Langston Hughes,...
Madhubala           1969  200.txt  A Bollywood legend whose tragic life mirrored ...
Coco Chanel         1971  210.txt  January 11, 1971\n\n OBITUARY\n\n Chanel, the ...
Mahalia Jackson     1972  220.txt  January 28, 1972\n\n OBITUARY\n\n Mahalia Jack...
Nancy Mitford       1973  230.txt  July 1, 1973\n\n OBITUARY\n\n Nancy Mitford, A...
Chiang Kai shek     1975  240.txt  April 6, 1975\n\n OBITUARY\n\n The Life of Chi...
Maria Callas        1977  250.txt  September 17, 1977\n\n OBITUARY\n\n Maria Call...
Jesse Owens         1980  260.txt  April 1, 1980\n\n OBITUARY\n\n Jesse Owens Die...
Arthur Rubinstein   1982  270.txt  December 21, 1982\n\n OBITUARY\n\n Arthur Rubi...
Ansel Adams         1984  280.txt  April 24, 1984\n\n OBITUARY\n\n Ansel Adams, P...
Georgia O Keeffe    1986  290.txt  March 7, 1986\n\n OBITUARY\n\n Georgia O' Keef...
James Baldwin       1987  300.txt  December 2, 1987\n\n OBITUARY\n\n James Baldwi...
Andrei A Gromyko    1989  310.txt  July 4, 1989\n\n OBITUARY\n\n Andrei A. Gromyk...
Erte                1990  320.txt  April 22, 1990\n\n OBITUARY\n\n Erte, a Master...
John Cage           1992  330.txt  August 13, 1992\n\n OBITUARY\n\n John Cage, 79...
Dizzy Gillespie     1993  340.txt  January 7, 1993\n\n OBITUARY\n\n Dizzy Gillesp...
Jacqueline Kennedy  1994  350.txt  May 20, 1994\n\n OBITUARY\n\n Death of a First...
Deng Xiaoping       1997  360.txt  February 20, 1997\n\n OBITUARY\n\n Deng Xiaopi...
Fred W Friendly     1998  370.txt  March 5, 1998\n\n OBITUARY\n\n Fred W. Friendl...

Indexing by label works with .loc.

name = "John Dewey"
manifest.loc[name]
year                                                 1952
file                                              118.txt
text    June 2, 1952\n\n OBITUARY\n\n Dr. John Dewey D...
Name: John Dewey, dtype: object

Use it in conjunction with a column name to access the value in a cell.

print(manifest.loc[name, "text"])
June 2, 1952

 OBITUARY

 Dr. John Dewey Dead at 92; Philosopher a Noted Liberal

 By THE NEW YORK TIMES

 Dr. John Dewey, the philosopher from whose teachings has grown the school of progressive education and "learning by doing," died of pneumonia in his home, 1158 Fifth Avenue, at 7 o'clock last night. He was 92 years old.

 His wife, the former Mrs. Roberta Lowit Grant, who was with him when he died, said he had been ill for twenty-six hours. He had broken a hip last November, and had been confined to the apartment, except for occasional trips to the roof for sunning.

 The widow said Dr. Dewey had been carrying on various projects at home to the last, and had outlined several works. She had no idea how near to possible publication any of them might be.

 Surviving also are two adopted children, Adrienne, 12, and John, 9. Five other children of his first marriage also survive--Frederick A. Dewey of New York, Mrs. Evelyn Smith of Kansas City, Mo.; Mrs. Lucy A. Brandaur of Syracuse, N.Y.; Miss Jane U. Dewey of Baltimore and Sabino L. Dewey of Huntington, L.I., the last also having been adopted.

 Mrs. Dewey said the funeral service would be held at the Community Church of New York, 40 East Thirty-fifth Street, on Wednesday at 1 P.M.

 As a philosopher--and he was acknowledged by many as America's foremost philosopher of his time--Dr. Dewey was not content to bring forth theories; he came forward to emphasize his ideas of liberalism, and, with the courage of a crusader, was willing to lend his name and reputation to causes that were frowned upon by staid society.

 He was too big a man to be sneered at as an "armchair Bolshevist." His convictions were those of an essentially honest man, and although he might well have sat back to criticize the general order of things, he took an active part in the attempt to create a third political party, to lend his voice and influence to help the down-trodden, to do away with oppression in this country and elsewhere, and to strive for a finer universal education.

 In his quest for betterment he met--and was prepared to meet--not only opposition but defeat. Some of his plans were quixotic and much too good for this world, but he never wavered in a cause that he considered just and he commanded the respect of all who opposed him.

 As the champion of an ideal and liberal democracy, Dr. Dewey saw the good as well as the bad in countries where the masses were groping for new social systems. He visited Russia, China and Turkey; saw for himself, and maintained his views in the face of public opinion in this country. He condemned hasty judgment of the affairs of other peoples and pointed to the flaws at home in no uncertain terms.

 Dr. Dewey had become attached to liberalism in his student days at the University of Vermont and at Johns Hopkins, where he came under the influence of Coleridge, Emerson and T. H. Green, but what finally emancipated him from the cumbersome and academic systems of transcendentalism was his discovery in 1891 of William James' "Psychology." In this work, according to Prof. Herbert W. Schneider of Columbia, he not only found the "instrumental theory of concepts" on which Dewey's logic was based, but also experienced that contagious mental "loosening up" with which James influenced his generation and which made him the father of American philosophy.

 Noted for Educational Reform

 Dr. Dewey's principal achievement was perhaps his educational reform. He was the chief prophet of progressive education. After twenty years that movement--"learning by doing"--had become a major factor in American education in the late

 Thirties, and in 1941 the New York State Department of Education approved a six-year experiment in schools embodying the Dewey philosophy.

 But progressive education was long the center of controversy among educators, and in the early Forties criticism was becoming more outspoken. The revolt against Dewey and pragmatism in education was strongest in Chicago, the scene of his first and greatest triumphs. At the University of Chicago, where Dr. Dewey was head of the Department of Philosophy and for two years director of the School of Education, President Robert Hutchins has sponsored a system of "education for freedom" which seeks to separate the teaching of the "intellectual" from the "practical" arts. Both Dr. Hutchins and Dr. Nicholas Murray Butler, long president of Columbia University, sharply attacked progressive education in 1944.

 In a birthday interview that year Dr. Dewey dismissed as "a childish point of view" the criticism by Dr. Butler, in an address at the opening of the university, that progressive education, "a most reactionary philosophy," has led to undisciplined youth.

 And replying to Dr. Hutchins' attacks, he said: "President Hutchins calls for liberal education of a small, elite group and vocational education for the masses. I cannot think of any idea more completely reactionary and more fatal to the whole democratic outlook."

 While Professor of Philosophy at the University of Michigan in 1893 Dewey wrote: "If I were asked to name the most needed of all reforms in the spirit of education I should say: 'Cease conceiving of education as mere preparation for later life, and make of it the full meaning of the present life.' And to add that only in this case does it become truly a preparation for later life is not the paradox it seems. An activity which does not have worth enough to be carried on for its own sake cannot be very effective as a preparation for something else if the new spirit in education forms the habit of requiring that every act be an outlet of the whole self, and it provides the instruments of such complete functioning."

 Later in life Professor Dewey devoted much time and thought to reform of government. He declared that the "control of government must be redeemed from the special interests which have usurped it and restored to the people." Unless this were done, he warned, political democracy would be doomed.

 Championed New Thought

 He referred to the major political parties as "the errand boys of big business," and he championed new thought, actively through his connections with the People's Lobby, of which he was the president, and more indirectly by his writings.

 During 1946 Dr. Dewey participated with labor leaders in conferences at Chicago and Detroit designed to lay the groundwork for a third, or People's, party for 1948. At the Detroit conference, a National Educational Committee was formed. Leaders at the conferences were from the Congress of Industrial Organizations, the American Federation of Labor and the farmers' unions.

 John Dewey was born at Burlington, Vt., on Oct. 20, 1859, son of Archibald S. Dewey and Lucina A. Rich Dewey. His father was a merchant who traced his ancestry to 1640. His mother was the daughter of a prosperous Vermont farmer of Cape Cod ancestry.

 He studied in common schools and later attended the University of Vermont, being graduated in 1879. He then taught school at Oil City, Pa., and subsequently in general country schools in Vermont. One year he spent studying philosophy with Prof. H. A.

 P. Torrey of the University of Vermont.

 After this he went to Johns Hopkins, where he studied philosophy and psychology under Prof. G. S. Morris, and in 1884 he received his Ph.D. degree.

 That year he was appointed instructor and assistant Professor of Philosophy at the University of Michigan and remained as such until 1888, when he went to the University of Minnesota as Professor of Philosophy. After a year he returned to the University of Michigan where he remained for five years.

 From 1894 to 1904 Professor Dewey was head of the Department of Philosophy at the University of Chicago and for two years he was director of the School of Education of the same institution. In 1904 he was appointed Professor of Philosophy at Columbia

 University. Besides his regular work there Dr. Dewey taught at Teachers College.

 He retired with the title of Professor Emeritus on July 1, 1930.

 In 1886 his first work, "Psychology," was published.

 This was followed by "Liebnitz," "Critical Theory of Ethics," "Study of Ethics," "School and Society," "Studies in Logical Theory," "How to Think," "Influence of Darwin on Philosophy and

 Other Essays," "German Philosophy and Politics," "Democracy and Education," "Reconstruction in Philosophy," "Human Nature and Conduct," "Experience and Nature," "The Public and Its Problems," "The Quest for Certainty" and "Individualism, Old and New."

 Others were "Philosophy and Civilization," "Art as Experience," "Liberalism and Social Action," "Logic: The Theory of Inquiry" and "Culture and Freedom."

 In reviewing Dr. Dewey's "Problems of Men," published in June, 1946, Dr. Alvin Johnson, president emeritus of the New School for Social Research, said Dr. Dewey struck "straight at reactionary philosophers." In replying to his philosophical and educational critics, Dr. Johnson said that Dewey concluded: "Philosophy counts for next to nothing in the present world-wide crisis of human affairs and should count for less.

 It needs a thorough house-cleaning and the final, definitive abandonment of most of its traditional values. Those values are class values. They were established in a time when the masses of mankind lived in slavery, or near-slavery, and when a little body of the elect could occupy themselves with speculations on the divine and the absolute. The present world belongs to a democracy. And the democracy cannot waste time on recondite speculations that have nothing to do with life."

 Professor Dewey had a small but enthusiastic following of Socialists, moderate radicals and thinkers. There was a so-called "Dewey group," which was popularly known as a gathering of liberal-minded men and women. Among those who became his disciples, or rather associates in thought, were such men as Walter Lippmann, Charles A. Beard, Sinclair Lewis, Morris Hillquit, Oswald Garison Villard and Norman Thomas. Their influence was felt more especially at times when political graft and abuse became over-oppressive. Others who associated themselves with Dr. Dewey were Rabbi Stephen S. Wise and John Haynes Holmes. Dr. Dewey was a supporter of the

 Civil Liberties Union and was chairman of the League for Independent Political Action. One of the articles of Professor Dewey's political creed was "vote for the man rather than for the party." His faith was embodied in articles that he wrote for The New Republic, Philosophical Review, Journal of Philosophy, Monist and International Journal of Ethics.

 Lectured in Tokyo

 In 1919 Dr. Dewey delivered a series of lectures at the Imperial University at Tokyo. These were later published as "Reconstruction in Philosophy." That same year he was invited by former Chinese students in this country to lecture in China on the subjects of education and philosophy. He stayed in China for two years, making his headquarters at Peiping, but he traveled through all the provinces from Mukden to Canton.

 Dr. Dewey went to Turkey in 1924 to make a report on the new republican government schools. In 1926 he was in Mexico, where he lectured at the Summer School of the University of Mexico. In 1928 he was one of the delegates of American educators who visited children's institutions at Leningrad and Moscow at the invitation of the Soviet Government. Dr. Dewey was highly impressed with the educational experiments in new Russia, and voiced enthusiasm upon his return to this country. He was somewhat criticized for his views, however, and there were some who maintained that he was naive, though his sincerity was never questioned.

 In 1937 he experienced one of the stormiest episodes in his life. He went to Mexico as the head of a commission to investigate the validity of charges made by the Soviet Government against Leon Trotsky, who was living in Mexico. Trotsky had been sentenced to death by a Russian court for plotting to overthrow the Moscow Government, but Dr. Dewey insisted that Trotsky never had a chance to defend himself. "Now," he said, "it is up to him to present his case. I am neither a Trotskyite nor a Stalinist. I don't accept the Moscow evidence as conclusive till I hear the other side."

 The commission announced that it found Trotsky was innocent of the terrorism and fascist conspiracy with which he had been charged.

 An avowed anti-Communist, Dr. Dewey had his views as to the ideal balance between the State and the individual. What was needed, he explained, was an authority capable of directing and utilizing changes for a kind of individual freedom unlike that which the unconstrained economic liberty had produced and justified.

 Dr. Dewey believed that if democracy were to survive in this country it would require it would require a tremendous reorganization of instruction and administration in the schools. Democracy, he maintained, "cannot go forward unless the intelligence of the mass of people is educated to understand the social realities of their own time."

 Professor Dewey constantly urged the cultivation of independent thinking, and he deplored what he termed the "empty imitation" in this country of thought in Europe. He was often heard in public debate on matters of social or political significance, and he cheerfully agreed to act as chairman whenever he thought something of consequence might result from such disputations.

 Predicted War by Hitler

 As early as 1933 Dr. Dewey voiced his fear of what the future might bring if Hitler remained unchecked in Germany. Just before he sailed for Europe that year he predicted that Hitler would be headed for war as soon as he felt strong enough. A year later he asserted that Hitler and Hitlerism were "by all odds the greatest threat to world peace today." In 1936 he was one of a group of eighteen philosophers who refused to participate in the celebration of a philosophical institution in Berlin. At that time he also was wary of Japan, warning that a secret agreement existed between Japan and Germany.

 He called for the United States to take action against Japan, urging that a boycott be put into effect till such time as the Nipponese forces left China. The Chinese Government conferred on him the Order of the Jade for the contributions he had made to the education and leadership of China.

 He was honorary life president of the National Education Association, a member of the National Academy of Sciences, the American Psychological Association (president, 1899-1900), American Philosophical Association (president, 1905-06), and corresponding member of the Institut de France.

 In 1938 Dr. Dewey was voted by the Aristogenic Society as one of the ten "greatest Americans." This honor included the recording of every phase of his life for future generations.

 Active In Teachers Guild

 Professor Dewey took a sharp interest in the internal affairs of this city and the nation. In 1936 he was a leader in the movement to obtain a new city charter. He was active in organizations such as the New York Teachers Guild, the League for Industrial

 Democracy, the International League for Academic Freedom and the Committee for Cultural Freedom. He also assisted in the founding of a University-in-Exile for famous scholars driven out of their native countries.

 In 1944 he aided in the organization of a Council for a Democratic Germany, and was on and Educators-for-Roosevelt Committee formed to promote the re-election of Franklin D. Roosevelt. After the war he joined with other leaders in petitions to President

 Truman for the release of conscientious objectors still being held.

 Dr. Dewey was extremely courteous and mild of manner. He was a scholar who ventured to descend into the maelstrom of political strife and who took his blows, unfair as they were in many cases, with a smile and a shrug of the shoulder. But he was persistent and his enthusiasm was infectious. "I see no hope," he said once, "for sanity and reality in American life except through the agency of a new party."

 He sowed the seeds, but he never saw its fruits. When a bust of John Dewey, modeled by Jacob Epstein, was unveiled at Columbia in 1928, Dr. William Heard Kilpatrick, Professor of Education, said: "Dr. Dewey is America's greatest living philosopher and must be included among the greatest thinkers of all times. He has in the minds of many changed almost our whole conception of what philosophy is, delivering us from the old puzzles that have formed the stock in trade of the traditional philosophy. He is chiefly responsible for our thinking of intelligence as primarily instrumental. His philosophy has common sense, acceptability and a social bearing which distinguishes it in degree from all other philosophies."

 But perhaps the best description of all can be found in an editorial written to mark his eightieth birthday: "there are countless school children today and yesterday whose lives have been influenced in a constructive way by this one man who never shouted, and whose formally stated philosophy often is a stiff dose for more subtle minds. One thinks of him as refining into gold the rough ore of our tumultuous pioneer experience. He is yankeeism at its best--shrewd, wise, humane."

 Dr. Dewey retired from teaching at Columbia in 1930, when he was 70, but he went on writing and lecturing, publishing more than 300 books, essays and articles. By the time he was 90, his published works must have totaled 1,000.

 Opposed Loyalty Oaths

 In recent years, he had lived in a large apartment overlooking Central Park at Fifth Avenue and Ninety-seventh Street. He spent his winters in Florida. He never lost interest in public affairs, often speaking and writing on questions of the day. He opposed teachers' loyalty oaths, but came to believe that known Communists should not be permitted to teach children. He defended the United States action in Korea. For these and other anti-Soviet views, he won the criticism of Pravda.

 To the end he lent his name to the causes for which he believed, even when he could not be present in person. This year, he was a member of the Congress for Cultural Freedom, which, among other activities, sponsored a festival of Western music, art and literature in Paris dedicated to victims of Nazi, Soviet and Franco Spanish tyrannies. Last month he was elected an honorary vice chairman of the Liberal party of New York State.

 Dr. Dewey's ninetieth birthday was celebrated in many universities and by cultural societies in the United States and abroad. He was honored, mostly in absentia, by testimonial dinners and meetings for weeks in the fall of 1949. He did attend one large testimonial dinner at the Commodore Hotel, when admirers presented to him $90,000 to be used on worthy educational projects of his choosing. Among notables who sent felicitations were President

 Truman and Prime Minister Attlee of England.

 Honored by Yale in 1951

 Professor Dewey maintained surprisingly good health for such an old man. Even a serious operation in 1951 did not incapacitate him, and he recovered sufficiently to accept in person the honorary degree of Doctor of Literature at Yale's June commencement.

 At the age of 92, Dr. Dewey looked twenty years younger. His bushy hair and mustache were white, but they had been so for decades. His eyes were still keen, his mind alert, and his physical strength sufficient for him to take walks and typewrite his own scripts and letters.

 Dr. Dewey married twice. In 1886, Alice Chipman, one of his students at the University of Michigan, became his bride. There were born to them six children, two of whom died in childhood. A seventh child was adopted. Mrs. Dewey died in 1927.

 When he was 87, the philosopher married Mrs. Grant, a widow who lived in San Francisco, on December 11, 1946. Not quite half his age, Mrs. Grant came from an Oil City, Pa., glass manufacturing family which had been friends with Dr. Dewey before she was born.

 She had been a director of educational travel for the Cunard Steamship Company. Dr. Dewey had arranged years before to turn over the bulk of his assets to his first wife and their children, and after his second marriage the Deweys lived largely on his wife's inheritance and some later royalties.

 

Or send it a sequence of labels:

names = ["John Dewey", "Lucille Ball"]
manifest.loc[names]
              year     file  text
name
John Dewey    1952  118.txt  June 2, 1952\n\n OBITUARY\n\n Dr. John Dewey D...
Lucille Ball  1989  306.txt  April 27, 1989\n\n OBITUARY\n\n Lucille Ball, ...
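
As a minimal sketch of label-based selection, the snippet below builds a small stand-in for the manifest (the `toy` DataFrame and its three rows are made up for illustration, not read from the corpus) and passes `.loc` a list of labels. It shows that rows come back in the order the labels are given, and that a column name can follow the row labels.

```python
import pandas as pd

# Hypothetical stand-in for the manifest, indexed by name
toy = pd.DataFrame(
    {"year": [1852, 1952, 1989], "file": ["000.txt", "118.txt", "306.txt"]},
    index=pd.Index(["Ada Lovelace", "John Dewey", "Lucille Ball"], name="name"),
)

# Rows are returned in the order of the requested labels,
# not in the order they are stored in the DataFrame
subset = toy.loc[["Lucille Ball", "Ada Lovelace"]]
print(subset)

# A second argument narrows the selection to one column
years = toy.loc[["Lucille Ball", "Ada Lovelace"], "year"]
print(years.tolist())
```

Requesting a label that is not in the index raises a KeyError, so it is worth checking names against the index before selecting.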

Finally, there is indexing by condition. This works by evaluating a condition against every row and returning a sequence of Boolean values, one per row; only the rows marked True are kept. It is the most flexible method of indexing in Pandas.

Below, we find all obituaries whose names start with “S”. The .str attribute of a string index gives access to string methods like .startswith().

manifest.index.str.startswith("S")
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False,  True, False, False, False,
       False, False, False, False,  True, False, False,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False,  True, False, False, False, False, False,  True,
       False, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False,  True,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False])

See the Boolean values? Let’s assign the output of the above to a mask variable, with which we will index the DataFrame.

mask = manifest.index.str.startswith("S")
manifest.loc[mask]
                   year     file  text
name
Stephen Crane      1900  014.txt  June 6, 1900\n\n OBITUARY\n\n Stephen Crane De...
Susan B Anthony    1906  022.txt  March 13, 1906\n\n OBITUARY\n\n Miss Susan B. ...
Sarah Orne Jewett  1909  025.txt  June 25, 1909\n\n OBITUARY\n\n Sarah Orne Jewe...
Scott Fitzgerald   1940  081.txt  December 23, 1940\n\n OBITUARY\n\n Scott Fitzg...
Sergei Eisenstein  1948  108.txt  February 12, 1948\n\n OBITUARY\n\n Sergei Eise...
Sam Rayburn        1961  159.txt  November 17, 1961\n\n OBITUARY\n\n Rayburn Is ...
Sylvia Plath       1963  164.txt  A postwar poet unafraid to confront her own de...
Sean O Casey       1964  170.txt  September 19, 1964\n\n OBITUARY\n\n Sean O'Cas...
Shirley Jackson    1965  181.txt  August 10, 1965\n\n OBITUARY\n\n Shirley Jacks...
Sonja Henie        1969  204.txt  October 13, 1969\n\n OBITUARY\n\n Sonja Henie,...
Sylvia Plath       1974  232.txt  January 13, 1974\n\n REVIEW\n\n Her Poetry, No...
Stan Kenton        1979  258.txt  August 27, 1979\n\n OBITUARY\n\n Stan Kenton, ...
Satchel Paige      1982  273.txt  June 9, 1982\n\n OBITUARY\n\n Satchel Paige, B...
Samuel Beckett     1989  316.txt  December 27, 1989\n\n OBITUARY\n\n Samuel Beck...
Sammy Davis Jr     1990  318.txt  May 17, 1990\n\n OBITUARY\n\n Sammy Davis Jr. ...
Shirley Booth      1992  332.txt  October 21, 1992\n\n OBITUARY\n\n Shirley Boot...
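
The masking pattern is easier to follow on a handful of rows. The sketch below uses a made-up three-row DataFrame (`toy` is an assumption for illustration, not the real manifest) to show that the mask holds one Boolean per row and that `.loc` keeps only the rows where it is True.

```python
import pandas as pd

# Hypothetical three-row stand-in for the manifest
toy = pd.DataFrame(
    {"year": [1900, 1952, 1963]},
    index=pd.Index(["Stephen Crane", "John Dewey", "Sylvia Plath"], name="name"),
)

# One Boolean per row: True where the name starts with "S"
mask = toy.index.str.startswith("S")
print(mask)

# True counts as 1, so summing the mask gives the number of matches
n_matches = int(mask.sum())
print(n_matches)

# Indexing with the mask keeps only the rows marked True
selected = toy.loc[mask]
print(selected)
```

Summing a mask before indexing with it is a cheap way to check how many rows a condition will return.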

What makes indexing by condition so powerful is that it generalizes to other data in the DataFrame, not just the index. Let’s reset the index and explore this a little.

manifest.reset_index(inplace=True)
manifest.head()
   name             year     file  text
0  Ada Lovelace     1852  000.txt  A gifted mathematician who is now recognized a...
1  Robert E Lee     1870  001.txt  October 13, 1870\n\n OBITUARY\n\n Gen. Robert ...
2  Andrew Johnson   1875  002.txt  August 1, 1875\n\n OBITUARY\n\n Andrew Johnson...
3  Bedford Forrest  1877  003.txt  October 30, 1877\n\n OBITUARY\n\n Death of Gen...
4  Lucretia Mott    1880  004.txt  November 12, 1880\n\n OBITUARY\n\n Lucretia Mo...
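
With name now an ordinary column, a condition can be written against any column, not only string ones. The sketch below is an illustration on a made-up DataFrame (`toy` and its values are assumptions): a numeric comparison on the year column produces a Boolean Series, which then selects rows in the same way as before.

```python
import pandas as pd

# Hypothetical stand-in for the manifest after reset_index()
toy = pd.DataFrame({
    "name": ["Ada Lovelace", "John Dewey", "Lucille Ball"],
    "year": [1852, 1952, 1989],
})

# Comparing a column to a value yields a Boolean Series, one entry per row
after_1900 = toy["year"] > 1900
print(after_1900.tolist())

# The Series works as a mask; the second argument picks the column to return
names = toy.loc[after_1900, "name"].tolist()
print(names)
```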

Below, we find obituaries that contain the string “musician”. Because .str.contains() matches substrings, this also returns documents that only use the plural, “musicians”.

manifest.loc[manifest["text"].str.contains("musician"), "name"]
18         Benjamin Harrison
21     Emily Warren Roebling
51                   Balfour
59         John Philip Sousa
96               Jerome Kern
98               Bela Bartok
104      Fiorello La Guardia
110                Babe Ruth
144                W C Handy
145           Billie Holiday
151          Boris Pasternak
180        Albert Schweitzer
199          Coleman Hawkins
216          Louis Armstrong
218          Igor Stravinsky
220          Mahalia Jackson
228            Pablo Picasso
231              Earl Warren
250             Maria Callas
255           Arthur Fiedler
258              Stan Kenton
259          Richard Rodgers
270        Arthur Rubinstein
271          Thelonious Monk
276             Muddy Waters
280              Ansel Adams
285              Count Basie
291            Benny Goodman
302           Andres Segovie
312        Vladimir Horowitz
319        Leonard Bernstein
323              Frank Capra
325              Miles Davis
326            Martha Graham
330                John Cage
340          Dizzy Gillespie
350       Jacqueline Kennedy
371            Frank Sinatra
375           Pierre Trudeau
Name: name, dtype: object
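
To make the substring behavior concrete, here is a small sketch on three made-up sentences (the `texts` Series is an assumption for illustration). It compares a plain substring match with a regular expression that uses word boundaries, one possible way to exclude the plural if that is not wanted.

```python
import pandas as pd

# Hypothetical mini-corpus
texts = pd.Series(
    ["She was a musician.", "A band of musicians played.", "He painted."],
    index=["a", "b", "c"],
)

# Plain substring match: the plural also counts as a hit
substring = texts.str.contains("musician", regex=False)
print(substring.tolist())

# Word boundaries (\b) restrict the match to the singular form
singular = texts.str.contains(r"\bmusician\b")
print(singular.tolist())
```

By default .str.contains() treats its pattern as a regular expression and is case-sensitive; passing case=False makes the match case-insensitive.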

Here is something more complicated: documents that contain the string “Bird” and whose file names come after 009.txt, that is, file names that do not start with “00”.

mask = (
    manifest["text"].str.contains("Bird")
    & ~manifest["file"].str.startswith("00")
)
manifest.loc[mask, "name"]
175    David O Selznick
181     Shirley Jackson
224      Lyndon Johnson
261    Alfred Hitchcock
275          Earl Hines
281        Ethel Merman
340     Dizzy Gillespie
348       Jessica Tandy
377    Charles M Schulz
Name: name, dtype: object
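
Combining conditions follows a few fixed rules: `&` means "and", `|` means "or", `~` negates a mask, and each comparison should sit in its own parentheses or variable because these operators bind more tightly than comparisons such as `==`. A minimal sketch on a made-up DataFrame (`toy` and its rows are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the manifest
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "file": ["004.txt", "175.txt", "181.txt", "009.txt"],
    "text": ["Bird song", "Bird in hand", "No match", "Bird"],
})

# Build each condition separately so the logic is easy to read
has_bird = toy["text"].str.contains("Bird")
early_file = toy["file"].str.startswith("00")

# "and" with a negation: contains "Bird" but is not an early file
both = toy.loc[has_bird & ~early_file, "name"].tolist()
print(both)

# "or": contains "Bird" or is an early file
either = toy.loc[has_bird | early_file, "name"].tolist()
print(either)
```

Python's own `and`, `or` and `not` do not work element-wise on a Series and raise an error when used this way, which is why the symbolic operators are needed.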

3.4. Preparing for Data Analysis#

With the basics of indexing done, we will prepare to analyze the corpus. This will involve two steps. First, we preprocess the raw text in the text column of our DataFrame. Then, we define some plotting functions to graph the results of our analysis.

3.4.1. Preprocessing#

As we saw in the last chapter, operations like counting require texts to be processed in special ways. This includes changing the case of texts, breaking texts into lists of tokens, and so forth.

Previously, we used a simple heuristic to tokenize text: split the text stream on whitespace characters.

example = manifest.loc[manifest["name"] == "Ada Lovelace", "text"].item()
print(example.split())
['A', 'gifted', 'mathematician', 'who', 'is', 'now', 'recognized', 'as', 'the', 'first', 'computer', 'programmer.', 'By', 'CLAIRE', 'CAIN', 'MILLER', 'A', 'century', 'before', 'the', 'dawn', 'of', 'the', 'computer', 'age,', 'Ada', 'Lovelace', 'imagined', 'the', 'modern-day,', 'general-purpose', 'computer.', 'It', 'could', 'be', 'programmed', 'to', 'follow', 'instructions,', 'she', 'wrote', 'in', '1843.', 'It', 'could', 'not', 'just', 'calculate', 'but', 'also', 'create,', 'as', 'it', '“weaves', 'algebraic', 'patterns', 'just', 'as', 'the', 'Jacquard', 'loom', 'weaves', 'flowers', 'and', 'leaves.”', 'The', 'computer', 'she', 'was', 'writing', 'about,', 'the', 'British', 'inventor', 'Charles', 'Babbage’s', 'Analytical', 'Engine,', 'was', 'never', 'built.', 'But', 'her', 'writings', 'about', 'computing', 'have', 'earned', 'Lovelace', '—', 'who', 'died', 'of', 'uterine', 'cancer', 'in', '1852', 'at', '36', '—', 'recognition', 'as', 'the', 'first', 'computer', 'programmer.', 'The', 'program', 'she', 'wrote', 'for', 'the', 'Analytical', 'Engine', 'was', 'to', 'calculate', 'the', 'seventh', 'Bernoulli', 'number.', '(Bernoulli', 'numbers,', 'named', 'after', 'the', 'Swiss', 'mathematician', 'Jacob', 'Bernoulli,', 'are', 'used', 'in', 'many', 'different', 'areas', 'of', 'mathematics.)', 'But', 'her', 'deeper', 'influence', 'was', 'to', 'see', 'the', 'potential', 'of', 'computing.', 'The', 'machines', 'could', 'go', 'beyond', 'calculating', 'numbers,', 'she', 'said,', 'to', 'understand', 'symbols', 'and', 'be', 'used', 'to', 'create', 'music', 'or', 'art.', '“This', 'insight', 'would', 'become', 'the', 'core', 'concept', 'of', 'the', 'digital', 'age,”', 'Walter', 'Isaacson', 'wrote', 'in', 'his', 'book', '“The', 'Innovators.”', '“Any', 'piece', 'of', 'content,', 'data', 'or', 'information', '—', 'music,', 'text,', 'pictures,', 'numbers,', 'symbols,', 'sounds,', 'video', '—', 'could', 'be', 'expressed', 'in', 'digital', 'form', 'and', 'manipulated', 'by', 'machines.”', 'She', 
'also', 'explored', 'the', 'ramifications', 'of', 'what', 'a', 'computer', 'could', 'do,', 'writing', 'about', 'the', 'responsibility', 'placed', 'on', 'the', 'person', 'programming', 'the', 'machine,', 'and', 'raising', 'and', 'then', 'dismissing', 'the', 'notion', 'that', 'computers', 'could', 'someday', 'think', 'and', 'create', 'on', 'their', 'own', '—', 'what', 'we', 'now', 'call', 'artificial', 'intelligence.', '“The', 'Analytical', 'Engine', 'has', 'no', 'pretensions', 'whatever', 'to', 'originate', 'any', 'thing,”', 'she', 'wrote.', '“It', 'can', 'do', 'whatever', 'we', 'know', 'how', 'to', 'order', 'it', 'to', 'perform.”', 'Lovelace,', 'a', 'British', 'socialite', 'who', 'was', 'the', 'daughter', 'of', 'Lord', 'Byron,', 'the', 'Romantic', 'poet,', 'had', 'a', 'gift', 'for', 'combining', 'art', 'and', 'science,', 'one', 'of', 'her', 'biographers,', 'Betty', 'Alexandra', 'Toole,', 'has', 'written.', 'She', 'thought', 'of', 'math', 'and', 'logic', 'as', 'creative', 'and', 'imaginative,', 'and', 'called', 'it', '“poetical', 'science.”', 'Math', '“constitutes', 'the', 'language', 'through', 'which', 'alone', 'we', 'can', 'adequately', 'express', 'the', 'great', 'facts', 'of', 'the', 'natural', 'world,”', 'Lovelace', 'wrote.', 'Her', 'work,', 'which', 'was', 'rediscovered', 'in', 'the', 'mid-20th', 'century,', 'inspired', 'the', 'Defense', 'Department', 'to', 'name', 'a', 'programming', 'language', 'after', 'her', 'and', 'each', 'October', 'Ada', 'Lovelace', 'Day', 'signifies', 'a', 'celebration', 'of', 'women', 'in', 'technology.', 'Lovelace', 'lived', 'when', 'women', 'were', 'not', 'considered', 'to', 'be', 'prominent', 'scientific', 'thinkers,', 'and', 'her', 'skills', 'were', 'often', 'described', 'as', 'masculine.', '“With', 'an', 'understanding', 'thoroughly', 'masculine', 'in', 'solidity,', 'grasp', 'and', 'firmness,', 'Lady', 'Lovelace', 'had', 'all', 'the', 'delicacies', 'of', 'the', 'most', 'refined', 'female', 'character,”', 'said', 'an', 'obituary', 
'in', 'The', 'London', 'Examiner.', 'Babbage,', 'who', 'called', 'her', 'the', '“enchantress', 'of', 'numbers,”', 'once', 'wrote', 'that', 'she', '“has', 'thrown', 'her', 'magical', 'spell', 'around', 'the', 'most', 'abstract', 'of', 'Sciences', 'and', 'has', 'grasped', 'it', 'with', 'a', 'force', 'which', 'few', 'masculine', 'intellects', '(in', 'our', 'own', 'country', 'at', 'least)', 'could', 'have', 'exerted', 'over', 'it.”', 'Augusta', 'Ada', 'Byron', 'was', 'born', 'on', 'Dec.', '10,', '1815,', 'in', 'London,', 'to', 'Lord', 'Byron', 'and', 'Annabella', 'Milbanke.', 'Her', 'parents', 'separated', 'when', 'she', 'was', 'an', 'infant,', 'and', 'her', 'father', 'died', 'when', 'she', 'was', '8.', 'Her', 'mother', '—', 'whom', 'Lord', 'Byron', 'called', 'the', '“princess', 'of', 'parallelograms”', 'and,', 'after', 'their', 'falling', 'out,', 'a', '“mathematical', 'Medea”', '—', 'was', 'a', 'social', 'reformer', 'from', 'a', 'wealthy', 'family', 'who', 'had', 'a', 'deep', 'interest', 'in', 'mathematics.', 'An', 'etching', 'from', 'a', 'portrait', 'of', 'Lovelace', 'as', 'a', 'child.', 'She', 'is', 'said', 'to', 'have', 'had', 'a', 'gift', 'for', 'combining', 'art', 'and', 'science.', 'Smith', 'Collection/Gado/Getty', 'Images', 'Lovelace', 'showed', 'a', 'passion', 'for', 'math', 'and', 'mechanics', 'from', 'a', 'young', 'age,', 'encouraged', 'by', 'her', 'mother.', 'Because', 'of', 'her', 'class,', 'she', 'had', 'access', 'to', 'private', 'tutors', 'and', 'to', 'intellectuals', 'in', 'British', 'scientific', 'and', 'literary', 'society.', 'She', 'was', 'insatiably', 'curious', 'and', 'surrounded', 'herself', 'with', 'big', 'thinkers', 'of', 'the', 'day,', 'including', 'Mary', 'Somerville,', 'a', 'scientist', 'and', 'writer.', 'It', 'was', 'Somerville', 'who', 'introduced', 'Lovelace', 'to', 'Babbage', 'when', 'she', 'was', '17,', 'at', 'a', 'salon', 'he', 'hosted', 'soon', 'after', 'she', 'made', 'her', 'society', 'debut.', 'He', 'showed', 'her', 'a', 'two-foot', 
'high,', 'brass', 'mechanical', 'calculator', 'he', 'had', 'built,', 'and', 'it', 'gripped', 'her', 'imagination.', 'They', 'began', 'a', 'correspondence', 'about', 'math', 'and', 'science', 'that', 'lasted', 'almost', 'two', 'decades.', 'She', 'also', 'met', 'her', 'husband,', 'William', 'King,', 'through', 'Somerville.', 'They', 'married', 'in', '1835,', 'when', 'she', 'was', '19.', 'He', 'soon', 'became', 'an', 'earl,', 'and', 'she', 'became', 'the', 'Countess', 'of', 'Lovelace.', 'By', '1839,', 'she', 'had', 'given', 'birth', 'to', 'two', 'sons', 'and', 'a', 'daughter.', 'She', 'was', 'determined,', 'however,', 'not', 'to', 'let', 'her', 'family', 'life', 'slow', 'her', 'work.', 'The', 'year', 'she', 'was', 'married,', 'she', 'wrote', 'to', 'Somerville:', '“I', 'now', 'read', 'Mathematics', 'every', 'day', 'and', 'am', 'occupied', 'in', 'Trigonometry', 'and', 'in', 'preliminaries', 'to', 'Cubic', 'and', 'Biquadratic', 'Equations.', 'So', 'you', 'see', 'that', 'matrimony', 'has', 'by', 'no', 'means', 'lessened', 'my', 'taste', 'for', 'these', 'pursuits,', 'nor', 'my', 'determination', 'to', 'carry', 'them', 'on.”', 'In', '1840,', 'Lovelace', 'asked', 'Augustus', 'De', 'Morgan,', 'a', 'math', 'professor', 'in', 'London,', 'to', 'tutor', 'her.', 'Through', 'exchanging', 'letters,', 'he', 'taught', 'her', 'university-level', 'math.', 'He', 'later', 'wrote', 'to', 'her', 'mother', 'that', 'if', 'a', 'young', 'male', 'student', 'had', 'shown', 'her', 'skill,', '“they', 'would', 'have', 'certainly', 'made', 'him', 'an', 'original', 'mathematical', 'investigator,', 'perhaps', 'of', 'first-rate', 'eminence.”', 'It', 'was', 'in', '1843,', 'when', 'she', 'was', '27,', 'that', 'Lovelace', 'wrote', 'her', 'most', 'lasting', 'contribution', 'to', 'computer', 'science.', 'She', 'published', 'her', 'translation', 'of', 'an', 'academic', 'paper', 'about', 'the', 'Babbage', 'Analytical', 'Engine', 'and', 'added', 'a', 'section,', 'nearly', 'three', 'times', 'the', 'length', 
'of', 'the', 'paper,', 'titled,', '“Notes.”', 'Here,', 'she', 'described', 'how', 'the', 'computer', 'would', 'work,', 'imagined', 'its', 'potential', 'and', 'wrote', 'the', 'first', 'program.', 'Researchers', 'have', 'come', 'to', 'see', 'it', 'as', '“an', 'extraordinary', 'document,”', 'said', 'Ursula', 'Martin,', 'a', 'computer', 'scientist', 'at', 'the', 'University', 'of', 'Oxford', 'who', 'has', 'studied', 'Lovelace’s', 'life', 'and', 'work.', '“She’s', 'talking', 'about', 'the', 'abstract', 'principles', 'of', 'computation,', 'how', 'you', 'could', 'program', 'it,', 'and', 'big', 'ideas', 'like', 'maybe', 'it', 'could', 'compose', 'music,', 'maybe', 'it', 'could', 'think.”', 'Lovelace', 'died', 'less', 'than', 'a', 'decade', 'later,', 'on', 'Nov.', '27,', '1852.', 'In', 'the', '“Notes,”', 'she', 'imagined', 'a', 'future', 'in', 'which', 'computers', 'could', 'do', 'more', 'powerful', 'and', 'faster', 'analysis', 'than', 'humans.', '“A', 'new,', 'a', 'vast', 'and', 'a', 'powerful', 'language', 'is', 'developed', 'for', 'the', 'future', 'use', 'of', 'analysis,”', 'she', 'wrote,', '“in', 'which', 'to', 'wield', 'its', 'truths', 'so', 'that', 'these', 'may', 'become', 'of', 'more', 'speedy', 'and', 'accurate', 'practical', 'application', 'for', 'the', 'purposes', 'of', 'mankind.”', 'Claire', 'Cain', 'Miller', 'writes', 'about', 'gender', 'for', 'The', 'Upshot.', 'She', 'first', 'learned', 'about', 'Ada', 'Lovelace', 'while', 'covering', 'the', 'tech', 'industry,', 'where', 'women', 'are', 'severely', 'underrepresented.']

The problem with this approach is that it cannot separate punctuation that is directly attached to the preceding characters, as in the case of periods, commas, and so on. To get around this, we use a more sophisticated tokenizer from the nltk package, which is based on a series of regexes.

print(nltk.word_tokenize(example))
['A', 'gifted', 'mathematician', 'who', 'is', 'now', 'recognized', 'as', 'the', 'first', 'computer', 'programmer', '.', 'By', 'CLAIRE', 'CAIN', 'MILLER', 'A', 'century', 'before', 'the', 'dawn', 'of', 'the', 'computer', 'age', ',', 'Ada', 'Lovelace', 'imagined', 'the', 'modern-day', ',', 'general-purpose', 'computer', '.', 'It', 'could', 'be', 'programmed', 'to', 'follow', 'instructions', ',', 'she', 'wrote', 'in', '1843', '.', 'It', 'could', 'not', 'just', 'calculate', 'but', 'also', 'create', ',', 'as', 'it', '“', 'weaves', 'algebraic', 'patterns', 'just', 'as', 'the', 'Jacquard', 'loom', 'weaves', 'flowers', 'and', 'leaves.', '”', 'The', 'computer', 'she', 'was', 'writing', 'about', ',', 'the', 'British', 'inventor', 'Charles', 'Babbage', '’', 's', 'Analytical', 'Engine', ',', 'was', 'never', 'built', '.', 'But', 'her', 'writings', 'about', 'computing', 'have', 'earned', 'Lovelace', '—', 'who', 'died', 'of', 'uterine', 'cancer', 'in', '1852', 'at', '36', '—', 'recognition', 'as', 'the', 'first', 'computer', 'programmer', '.', 'The', 'program', 'she', 'wrote', 'for', 'the', 'Analytical', 'Engine', 'was', 'to', 'calculate', 'the', 'seventh', 'Bernoulli', 'number', '.', '(', 'Bernoulli', 'numbers', ',', 'named', 'after', 'the', 'Swiss', 'mathematician', 'Jacob', 'Bernoulli', ',', 'are', 'used', 'in', 'many', 'different', 'areas', 'of', 'mathematics', '.', ')', 'But', 'her', 'deeper', 'influence', 'was', 'to', 'see', 'the', 'potential', 'of', 'computing', '.', 'The', 'machines', 'could', 'go', 'beyond', 'calculating', 'numbers', ',', 'she', 'said', ',', 'to', 'understand', 'symbols', 'and', 'be', 'used', 'to', 'create', 'music', 'or', 'art', '.', '“', 'This', 'insight', 'would', 'become', 'the', 'core', 'concept', 'of', 'the', 'digital', 'age', ',', '”', 'Walter', 'Isaacson', 'wrote', 'in', 'his', 'book', '“', 'The', 'Innovators.', '”', '“', 'Any', 'piece', 'of', 'content', ',', 'data', 'or', 'information', '—', 'music', ',', 'text', ',', 'pictures', ',', 'numbers', 
',', 'symbols', ',', 'sounds', ',', 'video', '—', 'could', 'be', 'expressed', 'in', 'digital', 'form', 'and', 'manipulated', 'by', 'machines.', '”', 'She', 'also', 'explored', 'the', 'ramifications', 'of', 'what', 'a', 'computer', 'could', 'do', ',', 'writing', 'about', 'the', 'responsibility', 'placed', 'on', 'the', 'person', 'programming', 'the', 'machine', ',', 'and', 'raising', 'and', 'then', 'dismissing', 'the', 'notion', 'that', 'computers', 'could', 'someday', 'think', 'and', 'create', 'on', 'their', 'own', '—', 'what', 'we', 'now', 'call', 'artificial', 'intelligence', '.', '“', 'The', 'Analytical', 'Engine', 'has', 'no', 'pretensions', 'whatever', 'to', 'originate', 'any', 'thing', ',', '”', 'she', 'wrote', '.', '“', 'It', 'can', 'do', 'whatever', 'we', 'know', 'how', 'to', 'order', 'it', 'to', 'perform.', '”', 'Lovelace', ',', 'a', 'British', 'socialite', 'who', 'was', 'the', 'daughter', 'of', 'Lord', 'Byron', ',', 'the', 'Romantic', 'poet', ',', 'had', 'a', 'gift', 'for', 'combining', 'art', 'and', 'science', ',', 'one', 'of', 'her', 'biographers', ',', 'Betty', 'Alexandra', 'Toole', ',', 'has', 'written', '.', 'She', 'thought', 'of', 'math', 'and', 'logic', 'as', 'creative', 'and', 'imaginative', ',', 'and', 'called', 'it', '“', 'poetical', 'science.', '”', 'Math', '“', 'constitutes', 'the', 'language', 'through', 'which', 'alone', 'we', 'can', 'adequately', 'express', 'the', 'great', 'facts', 'of', 'the', 'natural', 'world', ',', '”', 'Lovelace', 'wrote', '.', 'Her', 'work', ',', 'which', 'was', 'rediscovered', 'in', 'the', 'mid-20th', 'century', ',', 'inspired', 'the', 'Defense', 'Department', 'to', 'name', 'a', 'programming', 'language', 'after', 'her', 'and', 'each', 'October', 'Ada', 'Lovelace', 'Day', 'signifies', 'a', 'celebration', 'of', 'women', 'in', 'technology', '.', 'Lovelace', 'lived', 'when', 'women', 'were', 'not', 'considered', 'to', 'be', 'prominent', 'scientific', 'thinkers', ',', 'and', 'her', 'skills', 'were', 'often', 'described', 
'as', 'masculine', '.', '“', 'With', 'an', 'understanding', 'thoroughly', 'masculine', 'in', 'solidity', ',', 'grasp', 'and', 'firmness', ',', 'Lady', 'Lovelace', 'had', 'all', 'the', 'delicacies', 'of', 'the', 'most', 'refined', 'female', 'character', ',', '”', 'said', 'an', 'obituary', 'in', 'The', 'London', 'Examiner', '.', 'Babbage', ',', 'who', 'called', 'her', 'the', '“', 'enchantress', 'of', 'numbers', ',', '”', 'once', 'wrote', 'that', 'she', '“', 'has', 'thrown', 'her', 'magical', 'spell', 'around', 'the', 'most', 'abstract', 'of', 'Sciences', 'and', 'has', 'grasped', 'it', 'with', 'a', 'force', 'which', 'few', 'masculine', 'intellects', '(', 'in', 'our', 'own', 'country', 'at', 'least', ')', 'could', 'have', 'exerted', 'over', 'it.', '”', 'Augusta', 'Ada', 'Byron', 'was', 'born', 'on', 'Dec.', '10', ',', '1815', ',', 'in', 'London', ',', 'to', 'Lord', 'Byron', 'and', 'Annabella', 'Milbanke', '.', 'Her', 'parents', 'separated', 'when', 'she', 'was', 'an', 'infant', ',', 'and', 'her', 'father', 'died', 'when', 'she', 'was', '8', '.', 'Her', 'mother', '—', 'whom', 'Lord', 'Byron', 'called', 'the', '“', 'princess', 'of', 'parallelograms', '”', 'and', ',', 'after', 'their', 'falling', 'out', ',', 'a', '“', 'mathematical', 'Medea', '”', '—', 'was', 'a', 'social', 'reformer', 'from', 'a', 'wealthy', 'family', 'who', 'had', 'a', 'deep', 'interest', 'in', 'mathematics', '.', 'An', 'etching', 'from', 'a', 'portrait', 'of', 'Lovelace', 'as', 'a', 'child', '.', 'She', 'is', 'said', 'to', 'have', 'had', 'a', 'gift', 'for', 'combining', 'art', 'and', 'science', '.', 'Smith', 'Collection/Gado/Getty', 'Images', 'Lovelace', 'showed', 'a', 'passion', 'for', 'math', 'and', 'mechanics', 'from', 'a', 'young', 'age', ',', 'encouraged', 'by', 'her', 'mother', '.', 'Because', 'of', 'her', 'class', ',', 'she', 'had', 'access', 'to', 'private', 'tutors', 'and', 'to', 'intellectuals', 'in', 'British', 'scientific', 'and', 'literary', 'society', '.', 'She', 'was', 'insatiably', 
'curious', 'and', 'surrounded', 'herself', 'with', 'big', 'thinkers', 'of', 'the', 'day', ',', 'including', 'Mary', 'Somerville', ',', 'a', 'scientist', 'and', 'writer', '.', 'It', 'was', 'Somerville', 'who', 'introduced', 'Lovelace', 'to', 'Babbage', 'when', 'she', 'was', '17', ',', 'at', 'a', 'salon', 'he', 'hosted', 'soon', 'after', 'she', 'made', 'her', 'society', 'debut', '.', 'He', 'showed', 'her', 'a', 'two-foot', 'high', ',', 'brass', 'mechanical', 'calculator', 'he', 'had', 'built', ',', 'and', 'it', 'gripped', 'her', 'imagination', '.', 'They', 'began', 'a', 'correspondence', 'about', 'math', 'and', 'science', 'that', 'lasted', 'almost', 'two', 'decades', '.', 'She', 'also', 'met', 'her', 'husband', ',', 'William', 'King', ',', 'through', 'Somerville', '.', 'They', 'married', 'in', '1835', ',', 'when', 'she', 'was', '19', '.', 'He', 'soon', 'became', 'an', 'earl', ',', 'and', 'she', 'became', 'the', 'Countess', 'of', 'Lovelace', '.', 'By', '1839', ',', 'she', 'had', 'given', 'birth', 'to', 'two', 'sons', 'and', 'a', 'daughter', '.', 'She', 'was', 'determined', ',', 'however', ',', 'not', 'to', 'let', 'her', 'family', 'life', 'slow', 'her', 'work', '.', 'The', 'year', 'she', 'was', 'married', ',', 'she', 'wrote', 'to', 'Somerville', ':', '“', 'I', 'now', 'read', 'Mathematics', 'every', 'day', 'and', 'am', 'occupied', 'in', 'Trigonometry', 'and', 'in', 'preliminaries', 'to', 'Cubic', 'and', 'Biquadratic', 'Equations', '.', 'So', 'you', 'see', 'that', 'matrimony', 'has', 'by', 'no', 'means', 'lessened', 'my', 'taste', 'for', 'these', 'pursuits', ',', 'nor', 'my', 'determination', 'to', 'carry', 'them', 'on.', '”', 'In', '1840', ',', 'Lovelace', 'asked', 'Augustus', 'De', 'Morgan', ',', 'a', 'math', 'professor', 'in', 'London', ',', 'to', 'tutor', 'her', '.', 'Through', 'exchanging', 'letters', ',', 'he', 'taught', 'her', 'university-level', 'math', '.', 'He', 'later', 'wrote', 'to', 'her', 'mother', 'that', 'if', 'a', 'young', 'male', 'student', 'had', 
'shown', 'her', 'skill', ',', '“', 'they', 'would', 'have', 'certainly', 'made', 'him', 'an', 'original', 'mathematical', 'investigator', ',', 'perhaps', 'of', 'first-rate', 'eminence.', '”', 'It', 'was', 'in', '1843', ',', 'when', 'she', 'was', '27', ',', 'that', 'Lovelace', 'wrote', 'her', 'most', 'lasting', 'contribution', 'to', 'computer', 'science', '.', 'She', 'published', 'her', 'translation', 'of', 'an', 'academic', 'paper', 'about', 'the', 'Babbage', 'Analytical', 'Engine', 'and', 'added', 'a', 'section', ',', 'nearly', 'three', 'times', 'the', 'length', 'of', 'the', 'paper', ',', 'titled', ',', '“', 'Notes.', '”', 'Here', ',', 'she', 'described', 'how', 'the', 'computer', 'would', 'work', ',', 'imagined', 'its', 'potential', 'and', 'wrote', 'the', 'first', 'program', '.', 'Researchers', 'have', 'come', 'to', 'see', 'it', 'as', '“', 'an', 'extraordinary', 'document', ',', '”', 'said', 'Ursula', 'Martin', ',', 'a', 'computer', 'scientist', 'at', 'the', 'University', 'of', 'Oxford', 'who', 'has', 'studied', 'Lovelace', '’', 's', 'life', 'and', 'work', '.', '“', 'She', '’', 's', 'talking', 'about', 'the', 'abstract', 'principles', 'of', 'computation', ',', 'how', 'you', 'could', 'program', 'it', ',', 'and', 'big', 'ideas', 'like', 'maybe', 'it', 'could', 'compose', 'music', ',', 'maybe', 'it', 'could', 'think.', '”', 'Lovelace', 'died', 'less', 'than', 'a', 'decade', 'later', ',', 'on', 'Nov.', '27', ',', '1852', '.', 'In', 'the', '“', 'Notes', ',', '”', 'she', 'imagined', 'a', 'future', 'in', 'which', 'computers', 'could', 'do', 'more', 'powerful', 'and', 'faster', 'analysis', 'than', 'humans', '.', '“', 'A', 'new', ',', 'a', 'vast', 'and', 'a', 'powerful', 'language', 'is', 'developed', 'for', 'the', 'future', 'use', 'of', 'analysis', ',', '”', 'she', 'wrote', ',', '“', 'in', 'which', 'to', 'wield', 'its', 'truths', 'so', 'that', 'these', 'may', 'become', 'of', 'more', 'speedy', 'and', 'accurate', 'practical', 'application', 'for', 'the', 'purposes', 'of', 
'mankind.', '”', 'Claire', 'Cain', 'Miller', 'writes', 'about', 'gender', 'for', 'The', 'Upshot', '.', 'She', 'first', 'learned', 'about', 'Ada', 'Lovelace', 'while', 'covering', 'the', 'tech', 'industry', ',', 'where', 'women', 'are', 'severely', 'underrepresented', '.']
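To make the contrast concrete, here is a minimal sketch that compares whitespace splitting with a single regex pattern. The pattern is an illustrative stand-in chosen for this example; it is not the set of rules that nltk.word_tokenize applies, and it will handle some cases (hyphenated words, for instance) differently.

```python
import re

sentence = "It could be programmed to follow instructions, she wrote in 1843."

# Whitespace splitting leaves punctuation glued to the neighboring word
by_whitespace = sentence.split()

# A simplified pattern: either a run of word characters, or one character
# that is neither a word character nor whitespace (a punctuation mark)
by_regex = re.findall(r"\w+|[^\w\s]", sentence)

print(by_whitespace[-4:])
print(by_regex[-6:])
```

In the first list the final token still carries its period; in the second, the comma and the period are tokens of their own.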

Tip

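You need to download this tokenizer the first time you use it, so don’t fret if you get an error. Just follow the instructions in the error message to download the file. Typically that means running the line below, though newer NLTK releases may ask for a differently named resource (such as punkt_tab); use whichever name the message gives: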

nltk.download("punkt")

Below, we incorporate this tokenizer into a preprocessing function that performs the following steps:

  1. Change the string to lowercase

  2. Tokenize the string into a list of tokens

  3. Optionally, create multi-gram token sequences (more on this next week)

def preprocess(doc, ngram = 1):
    """Preprocess a document.

    Parameters
    ----------
    doc : str
        The document to preprocess
    ngram : int
        Size of the n-grams to create (1 keeps single tokens)
    
    Returns
    -------
    tokens : list
        Tokenized document
    """
    # First, change the case of the words to lowercase
    doc = doc.lower()

    # Tokenize the string. Optionally, make 2-gram (or more) sequences from
    # those tokens
    tokens = nltk.word_tokenize(doc)
    if ngram > 1:
        tokens = list(nltk.ngrams(tokens, ngram))
    
    return tokens
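To illustrate the optional third step, the sketch below builds bigrams from an already tokenized sentence using only built-in Python. The helper name make_ngrams is hypothetical; it is meant to mirror the tuples that nltk.ngrams yields, not to replace it.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # Offset copies of the list, zipped together, line up each token with
    # its n - 1 successors; zip stops at the shortest copy
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["the", "first", "computer", "programmer"]
bigrams = make_ngrams(tokens, 2)
print(bigrams)
```

Note that the results are tuples rather than strings, so any later step that expects strings (a regex substitution, for example) would likely need adjusting when ngram is greater than 1.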

With our function defined, we preprocess the corpus documents.

cleaned = [preprocess(doc) for doc in manifest["text"]]

Then we get rid of punctuation and numbers with a regex substitution. Note that this is a two-step process: first we remove anything that isn’t an alphabetic character, then we filter out empty strings in the sublists.

cleaned = [[re.sub(r"[^a-zA-Z]", "", tok) for tok in doc] for doc in cleaned]
cleaned = [[tok for tok in doc if tok] for doc in cleaned]

print(cleaned[0])
['a', 'gifted', 'mathematician', 'who', 'is', 'now', 'recognized', 'as', 'the', 'first', 'computer', 'programmer', 'by', 'claire', 'cain', 'miller', 'a', 'century', 'before', 'the', 'dawn', 'of', 'the', 'computer', 'age', 'ada', 'lovelace', 'imagined', 'the', 'modernday', 'generalpurpose', 'computer', 'it', 'could', 'be', 'programmed', 'to', 'follow', 'instructions', 'she', 'wrote', 'in', 'it', 'could', 'not', 'just', 'calculate', 'but', 'also', 'create', 'as', 'it', 'weaves', 'algebraic', 'patterns', 'just', 'as', 'the', 'jacquard', 'loom', 'weaves', 'flowers', 'and', 'leaves', 'the', 'computer', 'she', 'was', 'writing', 'about', 'the', 'british', 'inventor', 'charles', 'babbage', 's', 'analytical', 'engine', 'was', 'never', 'built', 'but', 'her', 'writings', 'about', 'computing', 'have', 'earned', 'lovelace', 'who', 'died', 'of', 'uterine', 'cancer', 'in', 'at', 'recognition', 'as', 'the', 'first', 'computer', 'programmer', 'the', 'program', 'she', 'wrote', 'for', 'the', 'analytical', 'engine', 'was', 'to', 'calculate', 'the', 'seventh', 'bernoulli', 'number', 'bernoulli', 'numbers', 'named', 'after', 'the', 'swiss', 'mathematician', 'jacob', 'bernoulli', 'are', 'used', 'in', 'many', 'different', 'areas', 'of', 'mathematics', 'but', 'her', 'deeper', 'influence', 'was', 'to', 'see', 'the', 'potential', 'of', 'computing', 'the', 'machines', 'could', 'go', 'beyond', 'calculating', 'numbers', 'she', 'said', 'to', 'understand', 'symbols', 'and', 'be', 'used', 'to', 'create', 'music', 'or', 'art', 'this', 'insight', 'would', 'become', 'the', 'core', 'concept', 'of', 'the', 'digital', 'age', 'walter', 'isaacson', 'wrote', 'in', 'his', 'book', 'the', 'innovators', 'any', 'piece', 'of', 'content', 'data', 'or', 'information', 'music', 'text', 'pictures', 'numbers', 'symbols', 'sounds', 'video', 'could', 'be', 'expressed', 'in', 'digital', 'form', 'and', 'manipulated', 'by', 'machines', 'she', 'also', 'explored', 'the', 'ramifications', 'of', 'what', 'a', 'computer', 
'could', 'do', 'writing', 'about', 'the', 'responsibility', 'placed', 'on', 'the', 'person', 'programming', 'the', 'machine', 'and', 'raising', 'and', 'then', 'dismissing', 'the', 'notion', 'that', 'computers', 'could', 'someday', 'think', 'and', 'create', 'on', 'their', 'own', 'what', 'we', 'now', 'call', 'artificial', 'intelligence', 'the', 'analytical', 'engine', 'has', 'no', 'pretensions', 'whatever', 'to', 'originate', 'any', 'thing', 'she', 'wrote', 'it', 'can', 'do', 'whatever', 'we', 'know', 'how', 'to', 'order', 'it', 'to', 'perform', 'lovelace', 'a', 'british', 'socialite', 'who', 'was', 'the', 'daughter', 'of', 'lord', 'byron', 'the', 'romantic', 'poet', 'had', 'a', 'gift', 'for', 'combining', 'art', 'and', 'science', 'one', 'of', 'her', 'biographers', 'betty', 'alexandra', 'toole', 'has', 'written', 'she', 'thought', 'of', 'math', 'and', 'logic', 'as', 'creative', 'and', 'imaginative', 'and', 'called', 'it', 'poetical', 'science', 'math', 'constitutes', 'the', 'language', 'through', 'which', 'alone', 'we', 'can', 'adequately', 'express', 'the', 'great', 'facts', 'of', 'the', 'natural', 'world', 'lovelace', 'wrote', 'her', 'work', 'which', 'was', 'rediscovered', 'in', 'the', 'midth', 'century', 'inspired', 'the', 'defense', 'department', 'to', 'name', 'a', 'programming', 'language', 'after', 'her', 'and', 'each', 'october', 'ada', 'lovelace', 'day', 'signifies', 'a', 'celebration', 'of', 'women', 'in', 'technology', 'lovelace', 'lived', 'when', 'women', 'were', 'not', 'considered', 'to', 'be', 'prominent', 'scientific', 'thinkers', 'and', 'her', 'skills', 'were', 'often', 'described', 'as', 'masculine', 'with', 'an', 'understanding', 'thoroughly', 'masculine', 'in', 'solidity', 'grasp', 'and', 'firmness', 'lady', 'lovelace', 'had', 'all', 'the', 'delicacies', 'of', 'the', 'most', 'refined', 'female', 'character', 'said', 'an', 'obituary', 'in', 'the', 'london', 'examiner', 'babbage', 'who', 'called', 'her', 'the', 'enchantress', 'of', 'numbers', 'once', 
'wrote', 'that', 'she', 'has', 'thrown', 'her', 'magical', 'spell', 'around', 'the', 'most', 'abstract', 'of', 'sciences', 'and', 'has', 'grasped', 'it', 'with', 'a', 'force', 'which', 'few', 'masculine', 'intellects', 'in', 'our', 'own', 'country', 'at', 'least', 'could', 'have', 'exerted', 'over', 'it', 'augusta', 'ada', 'byron', 'was', 'born', 'on', 'dec', 'in', 'london', 'to', 'lord', 'byron', 'and', 'annabella', 'milbanke', 'her', 'parents', 'separated', 'when', 'she', 'was', 'an', 'infant', 'and', 'her', 'father', 'died', 'when', 'she', 'was', 'her', 'mother', 'whom', 'lord', 'byron', 'called', 'the', 'princess', 'of', 'parallelograms', 'and', 'after', 'their', 'falling', 'out', 'a', 'mathematical', 'medea', 'was', 'a', 'social', 'reformer', 'from', 'a', 'wealthy', 'family', 'who', 'had', 'a', 'deep', 'interest', 'in', 'mathematics', 'an', 'etching', 'from', 'a', 'portrait', 'of', 'lovelace', 'as', 'a', 'child', 'she', 'is', 'said', 'to', 'have', 'had', 'a', 'gift', 'for', 'combining', 'art', 'and', 'science', 'smith', 'collectiongadogetty', 'images', 'lovelace', 'showed', 'a', 'passion', 'for', 'math', 'and', 'mechanics', 'from', 'a', 'young', 'age', 'encouraged', 'by', 'her', 'mother', 'because', 'of', 'her', 'class', 'she', 'had', 'access', 'to', 'private', 'tutors', 'and', 'to', 'intellectuals', 'in', 'british', 'scientific', 'and', 'literary', 'society', 'she', 'was', 'insatiably', 'curious', 'and', 'surrounded', 'herself', 'with', 'big', 'thinkers', 'of', 'the', 'day', 'including', 'mary', 'somerville', 'a', 'scientist', 'and', 'writer', 'it', 'was', 'somerville', 'who', 'introduced', 'lovelace', 'to', 'babbage', 'when', 'she', 'was', 'at', 'a', 'salon', 'he', 'hosted', 'soon', 'after', 'she', 'made', 'her', 'society', 'debut', 'he', 'showed', 'her', 'a', 'twofoot', 'high', 'brass', 'mechanical', 'calculator', 'he', 'had', 'built', 'and', 'it', 'gripped', 'her', 'imagination', 'they', 'began', 'a', 'correspondence', 'about', 'math', 'and', 'science', 
'that', 'lasted', 'almost', 'two', 'decades', 'she', 'also', 'met', 'her', 'husband', 'william', 'king', 'through', 'somerville', 'they', 'married', 'in', 'when', 'she', 'was', 'he', 'soon', 'became', 'an', 'earl', 'and', 'she', 'became', 'the', 'countess', 'of', 'lovelace', 'by', 'she', 'had', 'given', 'birth', 'to', 'two', 'sons', 'and', 'a', 'daughter', 'she', 'was', 'determined', 'however', 'not', 'to', 'let', 'her', 'family', 'life', 'slow', 'her', 'work', 'the', 'year', 'she', 'was', 'married', 'she', 'wrote', 'to', 'somerville', 'i', 'now', 'read', 'mathematics', 'every', 'day', 'and', 'am', 'occupied', 'in', 'trigonometry', 'and', 'in', 'preliminaries', 'to', 'cubic', 'and', 'biquadratic', 'equations', 'so', 'you', 'see', 'that', 'matrimony', 'has', 'by', 'no', 'means', 'lessened', 'my', 'taste', 'for', 'these', 'pursuits', 'nor', 'my', 'determination', 'to', 'carry', 'them', 'on', 'in', 'lovelace', 'asked', 'augustus', 'de', 'morgan', 'a', 'math', 'professor', 'in', 'london', 'to', 'tutor', 'her', 'through', 'exchanging', 'letters', 'he', 'taught', 'her', 'universitylevel', 'math', 'he', 'later', 'wrote', 'to', 'her', 'mother', 'that', 'if', 'a', 'young', 'male', 'student', 'had', 'shown', 'her', 'skill', 'they', 'would', 'have', 'certainly', 'made', 'him', 'an', 'original', 'mathematical', 'investigator', 'perhaps', 'of', 'firstrate', 'eminence', 'it', 'was', 'in', 'when', 'she', 'was', 'that', 'lovelace', 'wrote', 'her', 'most', 'lasting', 'contribution', 'to', 'computer', 'science', 'she', 'published', 'her', 'translation', 'of', 'an', 'academic', 'paper', 'about', 'the', 'babbage', 'analytical', 'engine', 'and', 'added', 'a', 'section', 'nearly', 'three', 'times', 'the', 'length', 'of', 'the', 'paper', 'titled', 'notes', 'here', 'she', 'described', 'how', 'the', 'computer', 'would', 'work', 'imagined', 'its', 'potential', 'and', 'wrote', 'the', 'first', 'program', 'researchers', 'have', 'come', 'to', 'see', 'it', 'as', 'an', 'extraordinary', 
'document', 'said', 'ursula', 'martin', 'a', 'computer', 'scientist', 'at', 'the', 'university', 'of', 'oxford', 'who', 'has', 'studied', 'lovelace', 's', 'life', 'and', 'work', 'she', 's', 'talking', 'about', 'the', 'abstract', 'principles', 'of', 'computation', 'how', 'you', 'could', 'program', 'it', 'and', 'big', 'ideas', 'like', 'maybe', 'it', 'could', 'compose', 'music', 'maybe', 'it', 'could', 'think', 'lovelace', 'died', 'less', 'than', 'a', 'decade', 'later', 'on', 'nov', 'in', 'the', 'notes', 'she', 'imagined', 'a', 'future', 'in', 'which', 'computers', 'could', 'do', 'more', 'powerful', 'and', 'faster', 'analysis', 'than', 'humans', 'a', 'new', 'a', 'vast', 'and', 'a', 'powerful', 'language', 'is', 'developed', 'for', 'the', 'future', 'use', 'of', 'analysis', 'she', 'wrote', 'in', 'which', 'to', 'wield', 'its', 'truths', 'so', 'that', 'these', 'may', 'become', 'of', 'more', 'speedy', 'and', 'accurate', 'practical', 'application', 'for', 'the', 'purposes', 'of', 'mankind', 'claire', 'cain', 'miller', 'writes', 'about', 'gender', 'for', 'the', 'upshot', 'she', 'first', 'learned', 'about', 'ada', 'lovelace', 'while', 'covering', 'the', 'tech', 'industry', 'where', 'women', 'are', 'severely', 'underrepresented']
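The two cleaning steps can be traced on a handful of tokens. This is a small sketch with a hand-picked token list, using the same regex as above.

```python
import re

tokens = ["modern-day", ",", "1843", "babbage", "’", "s", "mid-20th"]

# Step 1: delete every character that is not a letter; tokens made up only
# of punctuation or digits become empty strings
stripped = [re.sub(r"[^a-zA-Z]", "", tok) for tok in tokens]

# Step 2: drop the empty strings
kept = [tok for tok in stripped if tok]

print(stripped)
print(kept)
```

The sketch also shows the side effects visible in the output above: hyphenated words are fused ("modernday"), digits inside a token vanish ("midth"), and the possessive "s" survives as a token of its own.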

Finally, we assign the cleaned token lists to a new column in our DataFrame.

manifest["tokens"] = cleaned.copy()

3.4.2. Plotting#

A last step before analyzing these tokens: defining a function to plot our results with a histogram. We use seaborn for this. It has a simple interface that integrates directly with DataFrames.

def plot_metrics(data, variable, title = "", xlabel = "", figsize = (15, 5)):
    """Plot metrics with a histogram.

    Parameters
    ----------
    data : pd.DataFrame
        The data to plot
    variable : str
        Which variable to plot
    title : str
        Plot title
    xlabel : str
        Label of the X axis
    figsize : tuple
        Size of the figure
    """
    # First, check whether the variable we want to plot is in the DataFrame
    if variable not in data.columns:
        raise ValueError(f"{variable} not in data")

    # Create a figure with a plot in it, then add labels
    plt.figure(figsize = figsize)
    g = sns.histplot(data = data, x = variable)
    g.set(title = title, xlabel = xlabel, ylabel = "Count")
    plt.show()

3.5. Data Analysis#

Time to look at our data. Many of these operations will rely on the .apply() method. You can think of this method like a for loop: it applies some function to every element along an axis in the DataFrame. Axis 0 runs down the rows and axis 1 runs across the columns. This can feel somewhat backwards: setting axis = 0 applies a function to each column (that is, to all rows under that column), while axis = 1 applies a function to each row (to all columns across that row). When .apply() is called on a single column, as in most of the examples below, the function is simply applied to each value in that column.
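A toy example may help fix the axis convention. The DataFrame and its column names below are made up for illustration.

```python
import pandas as pd

# A toy DataFrame; the column names are invented for this example
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis = 0: the function receives each column, giving one result per column
per_column = df.apply(lambda col: col.sum(), axis = 0)

# axis = 1: the function receives each row, giving one result per row
per_row = df.apply(lambda row: row.sum(), axis = 1)

# On a single column (a Series) there is no axis to choose: the function is
# applied to each element in turn
lengths = pd.Series([["a", "b"], ["c"]]).apply(len)

print(per_column.to_dict())
print(per_row.tolist())
print(lengths.tolist())
```

The last call is the pattern used throughout this section: each element of the column is a list, and the function runs once per list.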

3.5.1. Document metrics#

First, some simple document metrics. Below, we calculate the number of tokens in a document, as expressed in this notation:

\[ T(i) = \sum_{j=1}^{m_i}1 \]

Where:

  • \(T(i)\) is the total number of tokens for the \(i\)-th document

  • \(m_i\) is the number of tokens in the token list for document \(i\)

  • Each token \(j\) in document \(i\) is counted once, indicated by \(1\)

In code, len() will handle this easily.

manifest["num_tokens"] = manifest["tokens"].apply(len)
plot_metrics(manifest, "num_tokens", title = "Token counts", xlabel = "Tokens")
[Figure: histogram titled “Token counts”; x-axis “Tokens,” y-axis “Count”]

The number of types is the number of unique tokens in a document. We calculate it with:

\[ K(i) = \sum_{j \in J}1 \]

Where:

  • \(K(i)\) is the total number of types for the \(i\)-th document

  • \(J\) is the set of unique tokens (types) in document \(i\), and \(j \in J\) ranges over its members

  • Each type \(j\) in \(J\) is counted once, indicated by \(1\)

To implement this in code, we take advantage of a feature of the .apply() method: its output can be passed directly to another .apply() call, a pattern known as chaining.

manifest["num_types"] = manifest["tokens"].apply(np.unique).apply(len)
plot_metrics(manifest, "num_types", title = "Type counts", xlabel = "Types")
[Figure: histogram titled “Type counts”; x-axis “Types,” y-axis “Count”]

The type-token ratio is a measure of lexical diversity.

\[ TTR(i) = \frac{K(i)}{T(i)} \]

In other words, for document \(i\) it is the number of types \(K(i)\) divided by the number of tokens \(T(i)\).

manifest["ttr"] = manifest["num_types"] / manifest["num_tokens"]
plot_metrics(manifest, "ttr", title = "Type-token ratio", xlabel = "TTR")
[Figure: histogram titled “Type-token ratio”; x-axis “TTR,” y-axis “Count”]
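All three metrics can be worked through by hand on a single short document. This is a minimal sketch with an invented six-token sentence; it uses Python’s built-in set() where the code above uses np.unique, on the assumption that only the number of unique items matters.

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

num_tokens = len(tokens)        # T(i): every token counts once
num_types = len(set(tokens))    # K(i): every distinct token counts once
ttr = num_types / num_tokens    # TTR(i) = K(i) / T(i)

print(num_tokens, num_types, round(ttr, 3))
```

Here “the” appears twice, so six tokens yield five types and a ratio of about 0.833.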

Use the .nlargest() method to find the document with the highest type-token ratio.

manifest.nlargest(n = 1, columns = "ttr")
name year file text tokens num_tokens num_types ttr
39 Hilaire G E Degas 1917 039.txt September 28, 1917\n\n OBITUARY\n\n Hilaire G.... [september, obituary, hilaire, g, e, degas, no... 167 111 0.664671

And .nsmallest() will return the lowest one:

manifest.nsmallest(n = 1, columns = "ttr")
name year file text tokens num_tokens num_types ttr
6 Ulysses Grant 1885 006.txt July 24, 1885\n\n OBITUARY\n\n The Career of a... [july, obituary, the, career, of, a, soldier, ... 40800 5514 0.135147

Finally, a global view of these three metrics using .describe():

manifest[["num_tokens", "num_types", "ttr"]].describe()
         num_tokens    num_types         ttr
count    379.000000   379.000000  379.000000
mean    2491.387863   848.970976    0.395118
std     2796.214753   541.151464    0.070147
min      167.000000    58.000000    0.135147
25%     1133.000000   488.000000    0.349747
50%     1864.000000   732.000000    0.389706
75%     2972.500000  1038.500000    0.439554
max    40800.000000  5514.000000    0.664671

3.5.2. Token metrics#

Now, tokens. Next week we will use a special data structure, the document-term matrix, to make working with token data easier, but base functionality in pandas will suffice for now. Using .explode() breaks token lists into individual rows, one token per row.

manifest = manifest.explode("tokens")

That greatly lengthens the DataFrame. You will often hear data scientists speak of long and wide data. These terms refer to tabular data that has many observations relative to variables (long) or vice versa (wide).

The .shape attribute stores information about the number of rows and columns.

num_rows, num_cols = manifest.shape
print(f"DataFrame dimensions: ({num_rows:,} x {num_cols})")
DataFrame dimensions: (944,236 x 8)
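To see what .explode() does to the other columns, consider a toy DataFrame with two documents. The names and tokens are invented for illustration.

```python
import pandas as pd

# Toy data; names and tokens are made up for this example
toy = pd.DataFrame({
    "name": ["doc_a", "doc_b"],
    "tokens": [["the", "cat"], ["the", "dog", "barked"]],
})

exploded = toy.explode("tokens")

print(toy.shape, exploded.shape)
print(exploded["name"].tolist())
print(exploded.index.tolist())
```

Each token gets its own row, the values in the other columns are repeated alongside it, and the original index label is kept, so every row can still be traced back to its document.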

Use .value_counts() to count observations in a column. We assign the result to a new variable, convert it to a DataFrame, and then use .sort_values() to sort the rows by count in descending order.

token_freq = manifest["tokens"].value_counts()
token_freq = pd.DataFrame(token_freq).reset_index()
token_freq.sort_values(by = "count", ascending = False, inplace = True)

Tokens with the highest frequency:

token_freq.head(10)
  tokens  count
0    the  60573
1     of  34239
2    and  28429
3     in  27792
4      a  24382
5     to  22995
6    was  17603
7     he  17387
8    his  14777
9   that   9563

And the lowest:

token_freq.tail(10)
              tokens  count
28288  coarservoiced      1
28289    greengrocer      1
28290         whelan      1
28291        naughty      1
28292     reemerging      1
28293          strop      1
28294         layout      1
28295         lodger      1
28296         ripper      1
38809       mentored      1

In fact, many tokens occur only once in the data. We refer to these as hapax legomena (Greek for “said only once”). How many are there?

hapaxes = token_freq[token_freq["count"] == 1]
print(f"Number of hapaxes: {len(hapaxes):,}")
Number of hapaxes: 15,779
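The same idea can be sketched without pandas. The example below uses collections.Counter from the standard library as a stand-in for .value_counts(), on an invented list of tokens.

```python
from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]

# Counter tallies each distinct token, much as .value_counts() does
counts = Counter(tokens)

# Hapaxes are the tokens whose tally is exactly one
hapaxes = sorted(tok for tok, n in counts.items() if n == 1)

print(counts.most_common(2))
print(hapaxes)
```

Of the five types here, three occur only once, which hints at how large a share of a vocabulary hapaxes can make up.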

A broader look at token counts will situate hapaxes. Below, we plot the 1,000 most frequent tokens.

N = 1000

plt.figure(figsize = (15, 5))
g = sns.scatterplot(data = token_freq[:N], x = "tokens", y = "count")
g.set(xlabel = "Tokens", ylabel = "Token counts", title = f"Top {N:,} Tokens")
plt.xticks(rotation = 90, ticks = range(0, N, 25))
plt.show()
[Figure: scatter plot titled “Top 1,000 Tokens”; x-axis “Tokens,” y-axis “Token counts”]

Even in the top 1,000 tokens, it’s evident that there is an extremely long tail in the count data. Moreover, even at the highest counts there are big jumps between the most frequent token, the second, the third, and so on.

Plotting a larger sample will show the same pattern. Below, we sample 10,000 tokens randomly.

N = 10_000
sampled = token_freq.sample(N, replace = False)
sampled.sort_values("count", ascending = False, inplace = True)

Now we plot on a line plot.

plt.figure(figsize = (15, 5))
g = sns.lineplot(data = sampled, x = "tokens", y = "count")
g.set(xlabel = "Tokens", ylabel = "Count", title = f"{N:,} Sampled Tokens")
plt.xticks(rotation = 90, ticks = range(0, N, 500))
plt.show()
[Figure: line plot titled “10,000 Sampled Tokens”; x-axis “Tokens,” y-axis “Count”]

What is this telling us? Our token distribution is approximately Zipfian: the frequency of the token at rank \(n\) is inversely proportional to \(n\). Put another way, in an ideal Zipfian distribution the most common token occurs twice as often as the second most common token, three times as often as the third most common token, and so on.
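As a rough check, we can compare the ten highest counts reported earlier in this section against an idealized Zipfian curve. This is a sketch under a simplifying assumption: the ideal count at rank n is taken to be the top count divided by n, with no fitted exponent.

```python
# Counts of the ten most frequent tokens, copied from the token_freq.head(10)
# output earlier in this section
top_counts = [60573, 34239, 28429, 27792, 24382, 22995, 17603, 17387, 14777, 9563]

# Idealized Zipf: the token at rank n occurs (count of the top token) / n times
ideal = [top_counts[0] / rank for rank in range(1, len(top_counts) + 1)]

for rank, (observed, expected) in enumerate(zip(top_counts, ideal), start = 1):
    print(f"rank {rank:>2}: observed {observed:>6,}  ideal {expected:>9,.1f}")
```

The observed counts fall off with rank as expected but do not sit exactly on the ideal curve (the second token occurs 34,239 times against an ideal of 30,286.5), which is why the distribution is best described as approximately Zipfian.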

Importantly, the most frequent tokens in this data are function words: words like “and,” “the,” etc. These words are the very sinew of language, and yet they’re so redundant and so context-dependent that it’s difficult to get a sense of what they mean. A great number of language models start with this very problem, including Claude Shannon’s mathematical theory of communication, the subject of our next chapter.