3. Data Analysis in Python
This chapter will show you the basics of analyzing data in Python. We will load text files into memory, align them with corresponding metadata, and produce information about their contents. Also covered: preparing text for numerical operations and graphing data.
Data: Melanie Walsh’s corpus of ~380 obituaries from the New York Times
Credits: Portions of this chapter are adapted from the UC Davis DataLab’s Python Basics and Natural Language Processing for Data Science
3.1. Preliminaries
The packages we’ll need today will help us load text files (pathlib), process them into discrete tokens (nltk), conduct data analysis about those tokens (numpy, pandas), and plot the results (seaborn, matplotlib).
from pathlib import Path
import re
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
3.2. Loading a Corpus
The obituaries are stored in individual plain text files at the location below. We wrap this file path in a Path object to make interacting with our computers’ file systems more streamlined.
datadir = Path("data/texts/nyt/obituaries")
Use a glob pattern to retrieve paths to .txt files. The output of the .glob() method is a generator, so convert it to a list.
paths = list(datadir.glob("*.txt"))
print(paths[:5])
[PosixPath('data/texts/nyt/obituaries/289.txt'), PosixPath('data/texts/nyt/obituaries/262.txt'), PosixPath('data/texts/nyt/obituaries/276.txt'), PosixPath('data/texts/nyt/obituaries/060.txt'), PosixPath('data/texts/nyt/obituaries/074.txt')]
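The paths printed above are not in alphabetical order: .glob() yields entries in whatever order the file system returns them. If a predictable order matters, one option is to sort the results. The following is a minimal sketch that uses a temporary directory and made-up file names rather than the obituary corpus:

```python
from pathlib import Path
import tempfile

# Build a throwaway directory with a few hypothetical files
with tempfile.TemporaryDirectory() as tmp:
    tmpdir = Path(tmp)
    for fname in ["289.txt", "060.txt", "262.txt", "notes.md"]:
        (tmpdir / fname).write_text("placeholder", encoding="utf-8")

    # .glob() returns a generator; sorted() consumes it and returns a list
    # of paths in sorted order. The pattern skips the .md file
    sorted_paths = sorted(tmpdir.glob("*.txt"))
    sorted_names = [path.name for path in sorted_paths]

print(sorted_names)
```

Sorting is optional here, since later in the chapter the metadata file determines the order of the corpus.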
As we did in the last chapter, we can load one of these files. Note the slight difference in syntax when using Path.
random_path = np.random.choice(paths)
with random_path.open("r") as fin:
    doc = fin.read()
print(doc)
November 16, 1978
OBITUARY
Margaret Mead Is Dead of Cancer at 76
By ALDEN WHITMAN
Margaret Mead, the anthropologist, author, lecturer and social critic, died yesterday at New York Hospital after a yearlong battle with cancer. She was 76 years old.
Dr. Mead, who was curator emeritus of the department of anthropology at the American Museum of Natural History, had known that she had cancer but remained active at her work until she entered the hospital on Oct.3, according to a museum spokesman.
President Carter mourned her death, saying in a statement that she had "brought the humane insights of cultural anthropology to a public of millions." There were other tributes from Kurt Waldheim, Secretary-General of the United Nations; from
Mayor Koch; the Smithsonian Institution; Edward J. Lehman, the executive director of the American Anthropological Association, and Faye Wattleton, president of the Planned Parenthood Federation. "I'm in the middle of several different things," Dr. Mead said offhandedly a few years ago and reeled off to an inquiring friend a dozen projects that she was pursuing simultaneously. She was not boasting. She was just stating a fact of her life that had been true since early childhood. The slight but sturdy Dr. Mead was possessed of virtually boundless energy, an unquenchable curiosity, a tenacious memory and a genius for organizing her time.
She often gave the impression of being ubiquitous because she was rarely at rest in any one place for very long and because she could not permit a moment to pass unutilized. In all this she had a zest that even in her 70's confounded friends and colleagues of lesser verve.
The American Museum of Natural History, with which she was associated for most of her professional life, once drew up a list of subjects in which she was "a specialist." The list read: "Education and culture; relationship between character structure and social forms; personality and culture; cultural aspects of problems of nutrition; mental health; family life; ecology; ekistics; transnational relations; national character; cultural change, and cultural building."
The museum might well have added "et cetera," for Dr. Mead was not only an anthropologist and ethnologist of the first rank but also something of a national oracle on other subjects ranging from atomic politics to feminism. She took on (and dismissed with disdain) Dr. Edward Teller, the hydrogen bomb advocate, and she was once described as "a general among the foot soldiers of modern feminism." Insofar as anyone can be a polymath,
Dr. Mead was widely regarded as one.
Headed Science Association
One evidence of her formidable powers was her election, at the age of 72, to the presidency of the American Association for the Advancement of Science. She was the second woman to head this group, one of the ranking organizations of the country's scientific community. Her stature as a scientist has been assured for many years, albeit somewhat grudgingly because she was a woman in a male-dominated discipline.
For those who saw Dr. Mead in middle age and on, she was a robust, 5-foot-2-inch figure who carried a forked walking stick (she broke her ankle years ago). Her head was topped with fluffy, slightly curly hair cut in bangs, and her feet were shod in plain leather sandals. Her voice was melodious, and her face, with its rimless glasses, was pleasant and open.
Although she could be lacerating, she was more often gentle and witty. She believed civilized mankind to be often ill-informed and pigheaded, yet she usually displayed great compassion for its individual members.
From the publication of her first book, "Coming of Age in Samoa," in 1928, in which she described the values of adolescent lovemaking in Samoan society, Dr. Mead's name became associated with sexual theory. A good deal of her subsequent writing contended that sexual repression worked against healthy maturation of the young and against successful marriages. 'Eclectic Circuitry'
Her anthropological studies also covered other topics and were generally highly regarded as making her an expert in the sociocultural life of primitive peoples. Some, though, were reserved about the seeming contradictory nature of her material. "She illustrates the principle of eclectic circuitry," one critic said.
The number of Dr. Mead's scientific and popular lectures was staggering--110 in one sample 12-month period--and each was different. Her popular lectures, delivered usually to overflow crowds, were sometimes on rather esoteric subjects. "Acculturation
Among the Iatmul Tribe of New Guinea" was one of them. ("For years I have been able to guarantee audiences a good address by using words that aren't in home dictionaries," she once said.)
Sometimes she got her audiences mixed up. Once, for example, she spoke learnedly on sex deviations among the Tchambuli to a group of theologians. They took it in good part, as did a men's luncheon club whose members applauded her talk on cultural stability in the South Seas.
Over the years, Dr. Mead lectured, sometimes for no fee, on such subjects as air pollution, hunger, mental hygiene, sex, women's careers, population control, primitive art, the family, nutrition, city planning, military service, tribal customs, alcoholism, child development, architecture, drugs and civil liberties. No matter what the topic, she did her homework. After one talk on tribal customs, a questioner asked about consumption of betel nuts in the Admiralty Islands. There was a ready and long response, as if betel nut problems were her life work.
An Active Advocate
Dr. Mead's fellow anthropologists were often uneasy about her. "You wonder what she'll take off on next," one said some years ago. "We know what Dr. Blank will say--he's probably already distributed his paper. But we're never sure about Margaret Mead."
Not only was Dr. Mead unpredictable; sometimes she also did not abide by the rules of behavior that most scientists set for themselves. Anthropology, essentially the study of adaptation, should refrain from influencing the events it observes and interprets, most scientists believe. But Dr. Mead, according to her critics was not only a student of adaptation but also an active advocate of many specific changes in modern society.
The critics, however, almost universally admired her as a person, however much they were distressed by her as a scholar-activist. They thought she was too scattershot and sometime self-contradictory. "But then," one critic said, "we do owe a lot to Margaret for putting us on the map."
Some social scientists thought that Dr. Mead was lacking in introspection on the human relations of her field work in the South Seas. "The remarkable thing about Margaret is that she's always been interested in the psychological end of anthropology and is, in fact, one of the leading contributors to the field," a critic said. "But her first love and primary interest is the study of culture, and she never gets to the person in the full sense." 'Oh, Piffle'
To this and other criticisms, Dr. Mead's usual reaction was, "Oh, piffle." It was said with noticeable spunk, tinged with disdain.
Spunkiness was, indeed, among Margaret Mead's earliest traits. Born in Philadelphia on Dec. 16, 1901, she was the daughter of Edward and Emily Fogg Mead. Her father, who taught economics at the University of Pennsylvania, had hoped for a son and once told his daughter, "It's a pity you aren't a boy; you'd have gone far."
She determined to go to college and did, to De Pauw University, from which she went to Barnard College to get her Bachelor of Arts degree in 1923.
At Barnard, the young student met Franz Boas, a magnetic man who was one of the world's ranking anthropologists. He became her mentor, and she became one of his four graduate students at Columbia, where she took her M.A. in 1924 and her Ph.D. in 1929. "Franz Boas had to plan--much as if he were a general," Dr. Mead recalled, "with only a handful of troops to save a whole country." Dr. Boas thought she ought to work among American Indians, his area of interest, but she wanted to investigate Polynesia.
Spunkiness and Guile
Her spunkiness won out, assisted by a bit of guile. She suggested to Dr. Boas that he was trying to manipulate her and suggested to her father that her mentor was trying to control his daughter. Dr. Boas gave in, and her father gave her $1,000 for a world trip.
By this time, Dr. Mead was married to Luther S. Cressman, a young seminarian who often joked unhumorously of having to make an appointment to see his wife. They parted temporarily when she went to Samoa in 1926.
On shipboard, there was a love affair with Reo F. Fortune, a New Zealand anthropologist, to whom she was married after a brief reconciliation with Dr. Cressman. Meanwhile, she did the field work for and wrote "Coming of Age in Samoa." From the start, it was enormously popular, especially among young people, some of whom were influenced by it to become anthropologists.
The scientific question underlying "Coming of Age in Samoa" was whether "the disturbances which vex our adolescents [are] due to the nature of adolescence itself or the civilization." Her findings suggest that the answer was the civilization.
The easygoing ways in Samoa minimized conflict and the incidence of neurotic personalities due to guilt feelings.
Two Daring Chapters
The book was descriptive rather than statistical. It also included two chapters that daringly applied her findings to modern society, in which she proposed that straitlaced sex attitudes might be relaxed without "accepting promiscuity."
The book has often been attacked in scientific circles as too subjective and lacking the data for verifiable behavior. However, her conclusions were based on detailed observation, and if she did not conduct anthropometric tests or produce statistical surveys she did convey her subjects graphically. A typical sentence read, "Her grandmother is very old; the muscles in her neck are stringy like uncooked pork."
Dr. Mead settled down with the people she was studying. She ate their wild boar, wild pigeon and dried fish; helped to care for ill children, and gained the confidence of her informants. At one time she built a wall-less house so she could observe everything around her.
She possessed a trait unusual in anthropologists of her time, an ability to shed her Western preconceptions. She would sit on the ground for hours without moving as she watched tribal peoples. "She knows how to use her eyes, how to see," said
Ken Heyman, a fellow scientist. "She has an uncanny perception for different cultural styles."
Books Showed Intuition
This finely attuned intuition was evident in her books on the seven cultures she studied-- Samoan, Manua, Arapesh, Mundugumor, Tchambuli, Iatmul and Balinese. Out of these inquiries came, in addition to "Coming of Age in Samoa," "Growing
Up in New Guinea," "Sex and Temperament in Three Primitive Societies," "Balinese Character" and "New Lives for Old." Some of her most extensive studies were done with the Manua--she visited them several times--and they spoke her name as "Makrit Mit."
Dr. Mead's association with tribal peoples was the subject of a notable New Yorker cartoon that depicted a tribal chief handing out books to boys about to be initiated into adolescence. "Rather than go into the details," he was saying. "I'm simply going to present each of you with a copy of this excellent book by Margaret Mead."
The idea behind the cartoon was not far fetched, because the Iatmul peoples once met her at their dock singing "My Darling Clementine" and then carried her off to their village.
Generalizing from her investigations, Dr. Mead said that each culture had its own distinct psychological profile. "Each society," ranquil. Dr. Mead and her husband Dr. Fortune, met Gregory Bateson, a British anthropologist, in New Guinea. There was a personal crisis among the three as a result of which there was a divorce, and Dr. Mead and Dr. Bateson were married. They had a daughter, Catherine. They were divorced after about fifteen years. "The Bateson years were probably the richest of her life, " a friend of Dr. Mead said, noting that she and her husband were "perfect partners in mind and temperament. " Recalling the union in her memoir, "Blackberry Winter, "
Dr. Mead was wistful about her marriage and its years in Bali, saying: "I think it is a good thing to have such a model once [as Mr. Bateson] even if the model includes the kind of extra intensity in which a lifetime is condensed into a few short years. "
In another recollection, she seemed to fault herself, saying "American women are good mothers, but they make poor wives; Americans are very poor at being attentive to anybody else. "
Nevertheless, in their Bali years the couple took and annotated 25,000 photographs. This work, which was done in 1936-38, had a large impact on other anthropologists.
Turned Dry: An Anthropologist Looks at America, " issued in 1942. The book dealt with American character outlined against the background of the seven other cultures she had studied. It increased the demand for her lectures and gave her the chance to speak out on current issues.
One of the issues that she tackled was male-female relationships, her thoughts on which she gathered into "Male and Female: A Study of Sexes in a Changing World, " published in 1949. "A vast. turbulent book, " Rebecca West said of it. Among its observations was, "Differences in sex as they are known today are based on the bringing up by the mother--she is always pushing the female toward similarity and the male toward difference. "
In more recent years, Dr. Mead became an outspoken leader of the feminist movement. Indeed, she felt it her duty to improve people's understanding of themselves and especially women's understanding of themselves. She liked to talk, often with scorching humor, about what she saw as the follies of conventional ways of loving, working, birthing, housing and aging. This sense of mission appeared to many to account for Dr. Mead's restless zeal. "She wanted to be a mother to the world, " a friend said. Taught at Columbia
In addition to her post at the American Museum of Natural History, she was also adjunct professor of anthropology at Columbia and taught the subject at Fordham.
In addition to her daughter, Mary Catherine Bateson Kassarjian, dean of social sciences at Raza Shah Civar University in Iran, Dr. Mead is survived by a granddaughter, Sevanne, and a sister, Elizabeth Mead Steig of Cambridge, Mass.
Funeral services will be private and burial will be in Buckingham, Pa. A memorial service will be held at 2 P.M. tomorrow in St. Paul's Chapel, Columbia University.
It will make our lives easier to define a function that loads all files at once. That way we only have to call that function, rather than rewriting the loading code every time we want files. The load_corpus() function below does the following:
Steps through each path in paths
Opens the file, reads its contents, and appends them to a list
def load_corpus(paths):
    """Load a corpus from paths.

    Parameters
    ----------
    paths : list[Path]
        A list of paths

    Returns
    -------
    corpus : list[str]
        The corpus
    """
    # Initialize an empty list to store the corpus
    corpus = []

    # March through each path, open the file, and load it
    for path in paths:
        with path.open("r") as fin:
            doc = fin.read()

        # Then add the file to the list
        corpus.append(doc)

    # Return the result: a list of strings, where each string is the contents
    # of a file
    return corpus
With this function defined, we load our files.
corpus = load_corpus(paths)
print("Size of the corpus:", len(corpus), "files.")
Size of the corpus: 379 files.
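Path objects also provide a .read_text() method, which opens, reads, and closes a file in a single call. The sketch below shows a more compact loader built on it. The name load_corpus_compact() is hypothetical, the toy files are made up, and the UTF-8 encoding is an assumption, since the chapter does not state how the obituaries are encoded.

```python
from pathlib import Path
import tempfile


def load_corpus_compact(paths, encoding="utf-8"):
    """Read every file in paths and return the contents as a list of strings."""
    return [path.read_text(encoding=encoding) for path in paths]


# Try it on two throwaway files instead of the real corpus
with tempfile.TemporaryDirectory() as tmp:
    tmpdir = Path(tmp)
    toy_paths = [tmpdir / "000.txt", tmpdir / "001.txt"]
    toy_paths[0].write_text("First document.", encoding="utf-8")
    toy_paths[1].write_text("Second document.", encoding="utf-8")

    toy_corpus = load_corpus_compact(toy_paths)

print("Size of the toy corpus:", len(toy_corpus), "files.")
```

Passing the encoding explicitly avoids relying on the platform default, which can differ between operating systems.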
3.3. Working with Tabular Data
Note, however, that the file names do not tell us whom each obituary is about:
print(random_path.name)
252.txt
Were we to run analyses on these documents, we would have no guide telling us which data is about which document. This is where metadata comes in. Often when working with text data, you will find information about the contents of a corpus stored separately from the data itself. Part of your workflow will require aligning corpus contents with this metadata.
3.3.1. Loading tabular data
In our case, metadata is stored in a comma-separated values (CSV) file, a plain text format for tabular data. Tabular data arranges information in columns and rows, just like a spreadsheet. The pandas package helps us work with this kind of data. When we load it into Python, we create a DataFrame. Just as with a spreadsheet, a DataFrame has columns and rows. But it also offers a huge amount of functionality for working with its contents.
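To make the link between CSV text and a DataFrame concrete, here is a small sketch that parses a CSV string held in memory. The rows are invented for illustration; only the column names mirror the metadata file used in this chapter.

```python
import io

import pandas as pd

# A CSV file is plain text: a header row, then one row per observation
csv_text = (
    "name,year,file\n"
    "Person A,1900,000.txt\n"
    "Person B,1950,001.txt\n"
)

# pd.read_csv() accepts a file path or any file-like object, so a StringIO
# buffer stands in for a file on disk
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)
print(list(toy.columns))
```

Each comma-separated field becomes a cell, and the header row supplies the column names.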
Below, we load our metadata.
manifest = pd.read_csv("data/texts/nyt/metadata.csv")
Here is a high-level overview of the metadata. It shows the columns and their names, the number of observations in each column that contain values, and the datatype of these columns.
manifest.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 379 entries, 0 to 378
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    379 non-null    object
 1   year    379 non-null    int64
 2   file    379 non-null    object
dtypes: int64(1), object(2)
memory usage: 9.0+ KB
Just want to know the columns? Use the .columns attribute.
manifest.columns
Index(['name', 'year', 'file'], dtype='object')
Use the .head() method to look at the first few rows of the data.
manifest.head()
| | name | year | file |
---|---|---|---|
0 | Ada Lovelace | 1852 | 000.txt |
1 | Robert E Lee | 1870 | 001.txt |
2 | Andrew Johnson | 1875 | 002.txt |
3 | Bedford Forrest | 1877 | 003.txt |
4 | Lucretia Mott | 1880 | 004.txt |
Note the file column. Values stored there correspond to the names of files in the data directory. If we use file as a guide to construct a list of paths, the order of the files in corpus will be the same as the order in the DataFrame.
3.3.2. Indexing by column
Accessing values in file requires us to index the DataFrame. In pandas, we use bracket notation in conjunction with a column’s name to index that column.
manifest["file"]
0 000.txt
1 001.txt
2 002.txt
3 003.txt
4 004.txt
...
374 374.txt
375 375.txt
376 376.txt
377 377.txt
378 378.txt
Name: file, Length: 379, dtype: object
There are several ways to index rows, which we discuss below. But the simplest involves treating an indexed column like a list.
manifest["file"][10]
'010.txt'
The above hints at what we do next: we use a list comprehension to iterate through each value in file and combine it with the data directory path. Note this time that we do not need to use a glob pattern, since we build the final path directly from the value in file using the / operator.
ordered_paths = [datadir / fname for fname in manifest["file"]]
Now, when we load our corpus, it will be aligned to the metadata’s order.
corpus = load_corpus(ordered_paths)
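This alignment depends on every value in file pointing at a file that actually exists. One way to check before loading is sketched below, using a toy manifest and a temporary directory. All names and files here are made up, and one file is left out on purpose.

```python
from pathlib import Path
import tempfile

import pandas as pd

# A hypothetical three-row manifest
toy_manifest = pd.DataFrame({
    "name": ["Person A", "Person B", "Person C"],
    "file": ["000.txt", "001.txt", "002.txt"],
})

with tempfile.TemporaryDirectory() as tmp:
    toy_dir = Path(tmp)

    # Write only two of the three files so that one is missing
    (toy_dir / "000.txt").write_text("Text A", encoding="utf-8")
    (toy_dir / "001.txt").write_text("Text B", encoding="utf-8")

    # Build paths in manifest order, then collect any that do not exist
    toy_paths = [toy_dir / fname for fname in toy_manifest["file"]]
    missing = [path.name for path in toy_paths if not path.exists()]

print("Missing files:", missing)
```

An empty list means every row of the manifest has a matching file, so a corpus loaded from those paths lines up with the manifest row for row.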
From here, we could go about our analysis. But that would involve jumping across two different objects, manifest and corpus. This is a pain, so we will create a new column in our metadata sheet and assign the contents of our corpus to it.
manifest["text"] = corpus.copy()
Under the hood, every column in a DataFrame is a Series. A DataFrame, in other words, is a collection of Series objects. The latter have much of the same functionality as the former, but DataFrames provide us with the ability to do more faceted indexing and global analyses.
For example, now that our corpus contents are stored in the text column, we can index that data alongside other information in the DataFrame. To do so, use a list of column names.
manifest[["name", "text"]].head()
| | name | text |
---|---|---|
0 | Ada Lovelace | A gifted mathematician who is now recognized a... |
1 | Robert E Lee | October 13, 1870\n\n OBITUARY\n\n Gen. Robert ... |
2 | Andrew Johnson | August 1, 1875\n\n OBITUARY\n\n Andrew Johnson... |
3 | Bedford Forrest | October 30, 1877\n\n OBITUARY\n\n Death of Gen... |
4 | Lucretia Mott | November 12, 1880\n\n OBITUARY\n\n Lucretia Mo... |
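The difference between a Series and a DataFrame shows up in the brackets used for indexing. A short sketch on a made-up, two-row DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Person A", "Person B"],
    "year": [1900, 1950],
})

# A single column name returns a Series
one_column = toy["year"]

# A list of column names returns a DataFrame, even if the list has one item
two_columns = toy[["name", "year"]]
still_a_frame = toy[["year"]]

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(type(still_a_frame).__name__)

# Series and DataFrames share many methods, such as .mean()
print(one_column.mean())
```

Keeping track of which object you have matters because some methods exist on only one of the two, and the same method can return differently shaped results.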
3.3.3. Indexing by row
Indexing by rows is more complicated than indexing by columns. This is because a DataFrame index serves three important roles:
As metadata that provides more context about a dataset
As a method of data alignment
As a convenience function for subsetting data
Use the .index attribute to access the values of a DataFrame index. These values can be numbers, strings, dates, or other values.
manifest.index
RangeIndex(start=0, stop=379, step=1)
Like tuples, indexes are immutable. But you can replace the index of a DataFrame with a new one. Below, we set the index to name, using inplace = True so we do not need to reassign the DataFrame back to the same variable.
manifest.set_index("name", inplace = True)
manifest.head()
name | year | file | text |
---|---|---|---|
Ada Lovelace | 1852 | 000.txt | A gifted mathematician who is now recognized a... |
Robert E Lee | 1870 | 001.txt | October 13, 1870\n\n OBITUARY\n\n Gen. Robert ... |
Andrew Johnson | 1875 | 002.txt | August 1, 1875\n\n OBITUARY\n\n Andrew Johnson... |
Bedford Forrest | 1877 | 003.txt | October 30, 1877\n\n OBITUARY\n\n Death of Gen... |
Lucretia Mott | 1880 | 004.txt | November 12, 1880\n\n OBITUARY\n\n Lucretia Mo... |
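For comparison, here is a sketch of what happens without inplace = True, along with a way to undo the change using .reset_index(). The DataFrame is a made-up stand-in for manifest.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Person A", "Person B"],
    "year": [1900, 1950],
})

# Without inplace=True, set_index() returns a new DataFrame and leaves the
# original untouched
indexed = toy.set_index("name")
print(list(toy.columns))      # "name" is still a column here
print(list(indexed.columns))  # here it has moved into the index

# reset_index() moves the index back into a regular column
restored = indexed.reset_index()
print(list(restored.columns))
```

Reassigning the result, as in toy = toy.set_index("name"), has the same effect as inplace = True.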
There are three ways to index by row:
By integer position
By label/name
By a condition
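As a quick preview, the three approaches can be sketched side by side on a toy DataFrame. The names and years below are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"year": [1900, 1950, 1975]},
    index=["Person A", "Person B", "Person C"],
)

# By integer position: the first row, whatever its label is
by_position = toy.iloc[0]

# By label: the row whose index value is "Person B"
by_label = toy.loc["Person B"]

# By condition: a Boolean Series selects the rows where it is True
by_condition = toy[toy["year"] > 1925]

print(by_position["year"])
print(by_label["year"])
print(list(by_condition.index))
```

The first two return a single row as a Series, while the condition returns a DataFrame that may contain any number of rows.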
Indexing by integer position works with the .iloc property.
manifest.iloc[45]
year 1922
file 045.txt
text August 3, 1922\n\n OBITUARY\n\n Dr. Bell, Inve...
Name: Alexander Graham Bell, dtype: object
Use a sequence of values to return multiple rows:
manifest.iloc[[2, 4, 6, 8, 10]]
name | year | file | text |
---|---|---|---|
Andrew Johnson | 1875 | 002.txt | August 1, 1875\n\n OBITUARY\n\n Andrew Johnson... |
Lucretia Mott | 1880 | 004.txt | November 12, 1880\n\n OBITUARY\n\n Lucretia Mo... |
Ulysses Grant | 1885 | 006.txt | July 24, 1885\n\n OBITUARY\n\n The Career of a... |
Emma Lazarus | 1887 | 008.txt | November 20, 1887\n\n OBITUARY\n\n Emma Lazaru... |
P T Barnum | 1891 | 010.txt | April 8, 1891\n\n OBITUARY\n\n The Great Showm... |
Or send a slice. Here, the first five rows:
manifest.iloc[0:5]
name | year | file | text |
---|---|---|---|
Ada Lovelace | 1852 | 000.txt | A gifted mathematician who is now recognized a... |
Robert E Lee | 1870 | 001.txt | October 13, 1870\n\n OBITUARY\n\n Gen. Robert ... |
Andrew Johnson | 1875 | 002.txt | August 1, 1875\n\n OBITUARY\n\n Andrew Johnson... |
Bedford Forrest | 1877 | 003.txt | October 30, 1877\n\n OBITUARY\n\n Death of Gen... |
Lucretia Mott | 1880 | 004.txt | November 12, 1880\n\n OBITUARY\n\n Lucretia Mo... |
Here, every tenth row:
manifest.iloc[::10]
name | year | file | text |
---|---|---|---|
Ada Lovelace | 1852 | 000.txt | A gifted mathematician who is now recognized a... |
P T Barnum | 1891 | 010.txt | April 8, 1891\n\n OBITUARY\n\n The Great Showm... |
James M N Whistler | 1903 | 020.txt | July 18, 1903\n\n OBITUARY\n\n James M'N. Whis... |
Joseph Pulitzer | 1911 | 030.txt | Monday, October 30, 1911\n\n OBITUARY\n\n Jose... |
C J Walker | 1919 | 040.txt | May 26, 1919\n\n OBITUARY\n\n Wealthiest Negre... |
Marie Curie | 1929 | 050.txt | PARIS, July 4.--Mme. Marie Curie, whose work a... |
Florenz Ziegfeld | 1932 | 060.txt | July 23, 1932\n\n OBITUARY\n\n Florenz Ziegfel... |
John W Heisman | 1936 | 070.txt | October 4, 1936\n\n OBITUARY\n\n John W. Heism... |
Howard Carter | 1939 | 080.txt | March 3, 1939\n\n OBITUARY\n\n Howard Carter, ... |
Alfred E Smith | 1944 | 090.txt | October 4, 1944\n\n OBITUARY\n\n Alfred E. Smi... |
Lord Keynes | 1946 | 100.txt | April 22, 1946\n\n OBITUARY\n\n Lord Keynes Di... |
Babe Ruth | 1948 | 110.txt | August 17, 1948\n\n OBITUARY\n\n Babe Ruth, Ba... |
Charles Spaulding | 1952 | 120.txt | August 2, 1952\n\n OBITUARY\n\n Ex-Slave's Son... |
Henri Matisse | 1954 | 130.txt | November 4, 1954\n\n OBITUARY\n\n Art World Mo... |
Charles Merrill | 1956 | 140.txt | October 7, 1956\n\n OBITUARY\n\n Charles Merri... |
John Dulles | 1959 | 150.txt | May 25, 1959\n\n OBITUARY\n\n Dulles Formulate... |
Carl G Jung | 1961 | 160.txt | June 7, 1961\n\n OBITUARY\n\n Dr. Carl G. Jung... |
Sean O Casey | 1964 | 170.txt | September 19, 1964\n\n OBITUARY\n\n Sean O'Cas... |
Albert Schweitzer | 1965 | 180.txt | September 6, 1965\n\n OBITUARY\n\n Albert Schw... |
Langston Hughes | 1967 | 190.txt | May 23, 1967\n\n OBITUARY\n\n Langston Hughes,... |
Madhubala | 1969 | 200.txt | A Bollywood legend whose tragic life mirrored ... |
Coco Chanel | 1971 | 210.txt | January 11, 1971\n\n OBITUARY\n\n Chanel, the ... |
Mahalia Jackson | 1972 | 220.txt | January 28, 1972\n\n OBITUARY\n\n Mahalia Jack... |
Nancy Mitford | 1973 | 230.txt | July 1, 1973\n\n OBITUARY\n\n Nancy Mitford, A... |
Chiang Kai shek | 1975 | 240.txt | April 6, 1975\n\n OBITUARY\n\n The Life of Chi... |
Maria Callas | 1977 | 250.txt | September 17, 1977\n\n OBITUARY\n\n Maria Call... |
Jesse Owens | 1980 | 260.txt | April 1, 1980\n\n OBITUARY\n\n Jesse Owens Die... |
Arthur Rubinstein | 1982 | 270.txt | December 21, 1982\n\n OBITUARY\n\n Arthur Rubi... |
Ansel Adams | 1984 | 280.txt | April 24, 1984\n\n OBITUARY\n\n Ansel Adams, P... |
Georgia O Keeffe | 1986 | 290.txt | March 7, 1986\n\n OBITUARY\n\n Georgia O' Keef... |
James Baldwin | 1987 | 300.txt | December 2, 1987\n\n OBITUARY\n\n James Baldwi... |
Andrei A Gromyko | 1989 | 310.txt | July 4, 1989\n\n OBITUARY\n\n Andrei A. Gromyk... |
Erte | 1990 | 320.txt | April 22, 1990\n\n OBITUARY\n\n Erte, a Master... |
John Cage | 1992 | 330.txt | August 13, 1992\n\n OBITUARY\n\n John Cage, 79... |
Dizzy Gillespie | 1993 | 340.txt | January 7, 1993\n\n OBITUARY\n\n Dizzy Gillesp... |
Jacqueline Kennedy | 1994 | 350.txt | May 20, 1994\n\n OBITUARY\n\n Death of a First... |
Deng Xiaoping | 1997 | 360.txt | February 20, 1997\n\n OBITUARY\n\n Deng Xiaopi... |
Fred W Friendly | 1998 | 370.txt | March 5, 1998\n\n OBITUARY\n\n Fred W. Friendl... |
Indexing by label works with .loc.
name = "John Dewey"
manifest.loc[name]
year 1952
file 118.txt
text June 2, 1952\n\n OBITUARY\n\n Dr. John Dewey D...
Name: John Dewey, dtype: object
Use it in conjunction with a column name to access the value in a cell.
print(manifest.loc[name, "text"])
June 2, 1952
OBITUARY
Dr. John Dewey Dead at 92; Philosopher a Noted Liberal
By THE NEW YORK TIMES
Dr. John Dewey, the philosopher from whose teachings has grown the school of progressive education and "learning by doing," died of pneumonia in his home, 1158 Fifth Avenue, at 7 o'clock last night. He was 92 years old.
His wife, the former Mrs. Roberta Lowit Grant, who was with him when he died, said he had been ill for twenty-six hours. He had broken a hip last November, and had been confined to the apartment, except for occasional trips to the roof for sunning.
The widow said Dr. Dewey had been carrying on various projects at home to the last, and had outlined several works. She had no idea how near to possible publication any of them might be.
Surviving also are two adopted children, Adrienne, 12, and John, 9. Five other children of his first marriage also survive--Frederick A. Dewey of New York, Mrs. Evelyn Smith of Kansas City, Mo.; Mrs. Lucy A. Brandaur of Syracuse, N.Y.; Miss Jane U. Dewey of Baltimore and Sabino L. Dewey of Huntington, L.I., the last also having been adopted.
Mrs. Dewey said the funeral service would be held at the Community Church of New York, 40 East Thirty-fifth Street, on Wednesday at 1 P.M.
As a philosopher--and he was acknowledged by many as America's foremost philosopher of his time--Dr. Dewey was not content to bring forth theories; he came forward to emphasize his ideas of liberalism, and, with the courage of a crusader, was willing to lend his name and reputation to causes that were frowned upon by staid society.
He was too big a man to be sneered at as an "armchair Bolshevist." His convictions were those of an essentially honest man, and although he might well have sat back to criticize the general order of things, he took an active part in the attempt to create a third political party, to lend his voice and influence to help the down-trodden, to do away with oppression in this country and elsewhere, and to strive for a finer universal education.
In his quest for betterment he met--and was prepared to meet--not only opposition but defeat. Some of his plans were quixotic and much too good for this world, but he never wavered in a cause that he considered just and he commanded the respect of all who opposed him.
As the champion of an ideal and liberal democracy, Dr. Dewey saw the good as well as the bad in countries where the masses were groping for new social systems. He visited Russia, China and Turkey; saw for himself, and maintained his views in the face of public opinion in this country. He condemned hasty judgment of the affairs of other peoples and pointed to the flaws at home in no uncertain terms.
Dr. Dewey had become attached to liberalism in his student days at the University of Vermont and at Johns Hopkins, where he came under the influence of Coleridge, Emerson and T. H. Green, but what finally emancipated him from the cumbersome and academic systems of transcendentalism was his discovery in 1891 of William James' "Psychology." In this work, according to Prof. Herbert W. Schneider of Columbia, he not only found the "instrumental theory of concepts" on which Dewey's logic was based, but also experienced that contagious mental "loosening up" with which James influenced his generation and which made him the father of American philosophy.
Noted for Educational Reform
Dr. Dewey's principal achievement was perhaps his educational reform. He was the chief prophet of progressive education. After twenty years that movement--"learning by doing"--had become a major factor in American education in the late
Thirties, and in 1941 the New York State Department of Education approved a six-year experiment in schools embodying the Dewey philosophy.
But progressive education was long the center of controversy among educators, and in the early Forties criticism was becoming more outspoken. The revolt against Dewey and pragmatism in education was strongest in Chicago, the scene of his first and greatest triumphs. At the University of Chicago, where Dr. Dewey was head of the Department of Philosophy and for two years director of the School of Education, President Robert Hutchins has sponsored a system of "education for freedom" which seeks to separate the teaching of the "intellectual" from the "practical" arts. Both Dr. Hutchins and Dr. Nicholas Murray Butler, long president of Columbia University, sharply attacked progressive education in 1944.
In a birthday interview that year Dr. Dewey dismissed as "a childish point of view" the criticism by Dr. Butler, in an address at the opening of the university, that progressive education, "a most reactionary philosophy," has led to undisciplined youth.
And replying to Dr. Hutchins' attacks, he said: "President Hutchins calls for liberal education of a small, elite group and vocational education for the masses. I cannot think of any idea more completely reactionary and more fatal to the whole democratic outlook."
While Professor of Philosophy at the University of Michigan in 1893 Dewey wrote: "If I were asked to name the most needed of all reforms in the spirit of education I should say: 'Cease conceiving of education as mere preparation for later life, and make of it the full meaning of the present life.' And to add that only in this case does it become truly a preparation for later life is not the paradox it seems. An activity which does not have worth enough to be carried on for its own sake cannot be very effective as a preparation for something else if the new spirit in education forms the habit of requiring that every act be an outlet of the whole self, and it provides the instruments of such complete functioning."
Later in life Professor Dewey devoted much time and thought to reform of government. He declared that the "control of government must be redeemed from the special interests which have usurped it and restored to the people." Unless this were done, he warned, political democracy would be doomed.
Championed New Thought
He referred to the major political parties as "the errand boys of big business," and he championed new thought, actively through his connections with the People's Lobby, of which he was the president, and more indirectly by his writings.
During 1946 Dr. Dewey participated with labor leaders in conferences at Chicago and Detroit designed to lay the groundwork for a third, or People's, party for 1948. At the Detroit conference, a National Educational Committee was formed. Leaders at the conferences were from the Congress of Industrial Organizations, the American Federation of Labor and the farmers' unions.
John Dewey was born at Burlington, Vt., on Oct. 20, 1859, son of Archibald S. Dewey and Lucina A. Rich Dewey. His father was a merchant who traced his ancestry to 1640. His mother was the daughter of a prosperous Vermont farmer of Cape Cod ancestry.
He studied in common schools and later attended the University of Vermont, being graduated in 1879. He then taught school at Oil City, Pa., and subsequently in general country schools in Vermont. One year he spent studying philosophy with Prof. H. A.
P. Torrey of the University of Vermont.
After this he went to Johns Hopkins, where he studied philosophy and psychology under Prof. G. S. Morris, and in 1884 he received his Ph.D. degree.
That year he was appointed instructor and assistant Professor of Philosophy at the University of Michigan and remained as such until 1888, when he went to the University of Minnesota as Professor of Philosophy. After a year he returned to the University of Michigan where he remained for five years.
From 1894 to 1904 Professor Dewey was head of the Department of Philosophy at the University of Chicago and for two years he was director of the School of Education of the same institution. In 1904 he was appointed Professor of Philosophy at Columbia
University. Besides his regular work there Dr. Dewey taught at Teachers College.
He retired with the title of Professor Emeritus on July 1, 1930.
In 1886 his first work, "Psychology," was published.
This was followed by "Liebnitz," "Critical Theory of Ethics," "Study of Ethics," "School and Society," "Studies in Logical Theory," "How to Think," "Influence of Darwin on Philosophy and
Other Essays," "German Philosophy and Politics," "Democracy and Education," "Reconstruction in Philosophy," "Human Nature and Conduct," "Experience and Nature," "The Public and Its Problems," "The Quest for Certainty" and "Individualism, Old and New."
Others were "Philosophy and Civilization," "Art as Experience," "Liberalism and Social Action," "Logic: The Theory of Inquiry" and "Culture and Freedom."
In reviewing Dr. Dewey's "Problems of Men," published in June, 1946, Dr. Alvin Johnson, president emeritus of the New School for Social Research, said Dr. Dewey struck "straight at reactionary philosophers." In replying to his philosophical and educational critics, Dr. Johnson said that Dewey concluded: "Philosophy counts for next to nothing in the present world-wide crisis of human affairs and should count for less.
It needs a thorough house-cleaning and the final, definitive abandonment of most of its traditional values. Those values are class values. They were established in a time when the masses of mankind lived in slavery, or near-slavery, and when a little body of the elect could occupy themselves with speculations on the divine and the absolute. The present world belongs to a democracy. And the democracy cannot waste time on recondite speculations that have nothing to do with life."
Professor Dewey had a small but enthusiastic following of Socialists, moderate radicals and thinkers. There was a so-called "Dewey group," which was popularly known as a gathering of liberal-minded men and women. Among those who became his disciples, or rather associates in thought, were such men as Walter Lippmann, Charles A. Beard, Sinclair Lewis, Morris Hillquit, Oswald Garison Villard and Norman Thomas. Their influence was felt more especially at times when political graft and abuse became over-oppressive. Others who associated themselves with Dr. Dewey were Rabbi Stephen S. Wise and John Haynes Holmes. Dr. Dewey was a supporter of the
Civil Liberties Union and was chairman of the League for Independent Political Action. One of the articles of Professor Dewey's political creed was "vote for the man rather than for the party." His faith was embodied in articles that he wrote for The New Republic, Philosophical Review, Journal of Philosophy, Monist and International Journal of Ethics.
Lectured in Tokyo
In 1919 Dr. Dewey delivered a series of lectures at the Imperial University at Tokyo. These were later published as "Reconstruction in Philosophy." That same year he was invited by former Chinese students in this country to lecture in China on the subjects of education and philosophy. He stayed in China for two years, making his headquarters at Peiping, but he traveled through all the provinces from Mukden to Canton.
Dr. Dewey went to Turkey in 1924 to make a report on the new republican government schools. In 1926 he was in Mexico, where he lectured at the Summer School of the University of Mexico. In 1928 he was one of the delegates of American educators who visited children's institutions at Leningrad and Moscow at the invitation of the Soviet Government. Dr. Dewey was highly impressed with the educational experiments in new Russia, and voiced enthusiasm upon his return to this country. He was somewhat criticized for his views, however, and there were some who maintained that he was naive, though his sincerity was never questioned.
In 1937 he experienced one of the stormiest episodes in his life. He went to Mexico as the head of a commission to investigate the validity of charges made by the Soviet Government against Leon Trotsky, who was living in Mexico. Trotsky had been sentenced to death by a Russian court for plotting to overthrow the Moscow Government, but Dr. Dewey insisted that Trotsky never had a chance to defend himself. "Now," he said, "it is up to him to present his case. I am neither a Trotskyite nor a Stalinist. I don't accept the Moscow evidence as conclusive till I hear the other side."
The commission announced that it found Trotsky was innocent of the terrorism and fascist conspiracy with which he had been charged.
An avowed anti-Communist, Dr. Dewey had his views as to the ideal balance between the State and the individual. What was needed, he explained, was an authority capable of directing and utilizing changes for a kind of individual freedom unlike that which the unconstrained economic liberty had produced and justified.
Dr. Dewey believed that if democracy were to survive in this country it would require it would require a tremendous reorganization of instruction and administration in the schools. Democracy, he maintained, "cannot go forward unless the intelligence of the mass of people is educated to understand the social realities of their own time."
Professor Dewey constantly urged the cultivation of independent thinking, and he deplored what he termed the "empty imitation" in this country of thought in Europe. He was often heard in public debate on matters of social or political significance, and he cheerfully agreed to act as chairman whenever he thought something of consequence might result from such disputations.
Predicted War by Hitler
As early as 1933 Dr. Dewey voiced his fear of what the future might bring if Hitler remained unchecked in Germany. Just before he sailed for Europe that year he predicted that Hitler would be headed for war as soon as he felt strong enough. A year later he asserted that Hitler and Hitlerism were "by all odds the greatest threat to world peace today." In 1936 he was one of a group of eighteen philosophers who refused to participate in the celebration of a philosophical institution in Berlin. At that time he also was wary of Japan, warning that a secret agreement existed between Japan and Germany.
He called for the United States to take action against Japan, urging that a boycott be put into effect till such time as the Nipponese forces left China. The Chinese Government conferred on him the Order of the Jade for the contributions he had made to the education and leadership of China.
He was honorary life president of the National Education Association, a member of the National Academy of Sciences, the American Psychological Association (president, 1899-1900), American Philosophical Association (president, 1905-06), and corresponding member of the Institut de France.
In 1938 Dr. Dewey was voted by the Aristogenic Society as one of the ten "greatest Americans." This honor included the recording of every phase of his life for future generations.
Active In Teachers Guild
Professor Dewey took a sharp interest in the internal affairs of this city and the nation. In 1936 he was a leader in the movement to obtain a new city charter. He was active in organizations such as the New York Teachers Guild, the League for Industrial
Democracy, the International League for Academic Freedom and the Committee for Cultural Freedom. He also assisted in the founding of a University-in-Exile for famous scholars driven out of their native countries.
In 1944 he aided in the organization of a Council for a Democratic Germany, and was on and Educators-for-Roosevelt Committee formed to promote the re-election of Franklin D. Roosevelt. After the war he joined with other leaders in petitions to President
Truman for the release of conscientious objectors still being held.
Dr. Dewey was extremely courteous and mild of manner. He was a scholar who ventured to descend into the maelstrom of political strife and who took his blows, unfair as they were in many cases, with a smile and a shrug of the shoulder. But he was persistent and his enthusiasm was infectious. "I see no hope," he said once, "for sanity and reality in American life except through the agency of a new party."
He sowed the seeds, but he never saw its fruits. When a bust of John Dewey, modeled by Jacob Epstein, was unveiled at Columbia in 1928, Dr. William Heard Kilpatrick, Professor of Education, said: "Dr. Dewey is America's greatest living philosopher and must be included among the greatest thinkers of all times. He has in the minds of many changed almost our whole conception of what philosophy is, delivering us from the old puzzles that have formed the stock in trade of the traditional philosophy. He is chiefly responsible for our thinking of intelligence as primarily instrumental. His philosophy has common sense, acceptability and a social bearing which distinguishes it in degree from all other philosophies."
But perhaps the best description of all can be found in an editorial written to mark his eightieth birthday: "there are countless school children today and yesterday whose lives have been influenced in a constructive way by this one man who never shouted, and whose formally stated philosophy often is a stiff dose for more subtle minds. One thinks of him as refining into gold the rough ore of our tumultuous pioneer experience. He is yankeeism at its best--shrewd, wise, humane."
Dr. Dewey retired from teaching at Columbia in 1930, when he was 70, but he went on writing and lecturing, publishing more than 300 books, essays and articles. By the time he was 90, his published works must have totaled 1,000.
Opposed Loyalty Oaths
In recent years, he had lived in a large apartment overlooking Central Park at Fifth Avenue and Ninety-seventh Street. He spent his winters in Florida. He never lost interest in public affairs, often speaking and writing on questions of the day. He opposed teachers' loyalty oaths, but came to believe that known Communists should not be permitted to teach children. He defended the United States action in Korea. For these and other anti-Soviet views, he won the criticism of Pravda.
To the end he lent his name to the causes for which he believed, even when he could not be present in person. This year, he was a member of the Congress for Cultural Freedom, which, among other activities, sponsored a festival of Western music, art and literature in Paris dedicated to victims of Nazi, Soviet and Franco Spanish tyrannies. Last month he was elected an honorary vice chairman of the Liberal party of New York State.
Dr. Dewey's ninetieth birthday was celebrated in many universities and by cultural societies in the United States and abroad. He was honored, mostly in absentia, by testimonial dinners and meetings for weeks in the fall of 1949. He did attend one large testimonial dinner at the Commodore Hotel, when admirers presented to him $90,000 to be used on worthy educational projects of his choosing. Among notables who sent felicitations were President
Truman and Prime Minister Attlee of England.
Honored by Yale in 1951
Professor Dewey maintained surprisingly good health for such an old man. Even a serious operation in 1951 did not incapacitate him, and he recovered sufficiently to accept in person the honorary degree of Doctor of Literature at Yale's June commencement.
At the age of 92, Dr. Dewey looked twenty years younger. His bushy hair and mustache were white, but they had been so for decades. His eyes were still keen, his mind alert, and his physical strength sufficient for him to take walks and typewrite his own scripts and letters.
Dr. Dewey married twice. In 1886, Alice Chipman, one of his students at the University of Michigan, became his bride. There were born to them six children, two of whom died in childhood. A seventh child was adopted. Mrs. Dewey died in 1927.
When he was 87, the philosopher married Mrs. Grant, a widow who lived in San Francisco, on December 11, 1946. Not quite half his age, Mrs. Grant came from an Oil City, Pa., glass manufacturing family which had been friends with Dr. Dewey before she was born.
She had been a director of educational travel for the Cunard Steamship Company. Dr. Dewey had arranged years before to turn over the bulk of his assets to his first wife and their children, and after his second marriage the Deweys lived largely on his wife's inheritance and some later royalties.
Or send it a sequence of labels:
names = ["John Dewey", "Lucille Ball"]
manifest.loc[names]
name | year | file | text |
---|---|---|---|
John Dewey | 1952 | 118.txt | June 2, 1952\n\n OBITUARY\n\n Dr. John Dewey D... |
Lucille Ball | 1989 | 306.txt | April 27, 1989\n\n OBITUARY\n\n Lucille Ball, ... |
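Label-based selection has two behaviors worth knowing: rows come back in the order the labels were requested, and a label that appears more than once in the index returns every matching row. The sketch below shows both on a toy DataFrame (`toy` is a hypothetical stand-in for `manifest`, holding only a few names and years from the corpus).

```python
import pandas as pd

# Toy stand-in for the manifest (hypothetical; only a few rows, one column).
toy = pd.DataFrame(
    {"year": [1952, 1989, 1963, 1974]},
    index=pd.Index(
        ["John Dewey", "Lucille Ball", "Sylvia Plath", "Sylvia Plath"],
        name="name",
    ),
)

# A list of labels returns rows in the order the labels were requested.
subset = toy.loc[["Lucille Ball", "John Dewey"]]
print(subset)

# A label that occurs twice in the index returns both rows.
plath = toy.loc[["Sylvia Plath"]]
print(plath)

# The index is therefore not guaranteed to identify a single row.
print(toy.index.is_unique)
```

If one row per name is required, it may be safer to check `index.is_unique` before relying on `.loc` with a single label.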
Finally, there is indexing by condition. This works by evaluating a condition for every row and returning a sequence of Boolean values (a NumPy array or a pandas Series); the rows marked True are the ones selected. It is by far the most flexible method of indexing in pandas.
Below, we find all obituaries with names that start with “S”. Use the .str attribute of an index of strings to accomplish this.
manifest.index.str.startswith("S")
array([False, False, False, False, False, False, False, False, False,
False, False, False, False, False, True, False, False, False,
False, False, False, False, True, False, False, True, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
True, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
True, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, True, False, False,
False, False, True, False, False, False, False, False, True,
False, False, False, False, False, False, False, False, False,
False, True, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, True, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, True, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, True, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, True, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, True, False, True, False, False, False, False, False,
False, False, False, False, False, False, False, False, True,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False])
See the Boolean values? Let’s assign the output of the above to a mask variable, with which we will index the DataFrame.
mask = manifest.index.str.startswith("S")
manifest.loc[mask]
name | year | file | text |
---|---|---|---|
Stephen Crane | 1900 | 014.txt | June 6, 1900\n\n OBITUARY\n\n Stephen Crane De... |
Susan B Anthony | 1906 | 022.txt | March 13, 1906\n\n OBITUARY\n\n Miss Susan B. ... |
Sarah Orne Jewett | 1909 | 025.txt | June 25, 1909\n\n OBITUARY\n\n Sarah Orne Jewe... |
Scott Fitzgerald | 1940 | 081.txt | December 23, 1940\n\n OBITUARY\n\n Scott Fitzg... |
Sergei Eisenstein | 1948 | 108.txt | February 12, 1948\n\n OBITUARY\n\n Sergei Eise... |
Sam Rayburn | 1961 | 159.txt | November 17, 1961\n\n OBITUARY\n\n Rayburn Is ... |
Sylvia Plath | 1963 | 164.txt | A postwar poet unafraid to confront her own de... |
Sean O Casey | 1964 | 170.txt | September 19, 1964\n\n OBITUARY\n\n Sean O'Cas... |
Shirley Jackson | 1965 | 181.txt | August 10, 1965\n\n OBITUARY\n\n Shirley Jacks... |
Sonja Henie | 1969 | 204.txt | October 13, 1969\n\n OBITUARY\n\n Sonja Henie,... |
Sylvia Plath | 1974 | 232.txt | January 13, 1974\n\n REVIEW\n\n Her Poetry, No... |
Stan Kenton | 1979 | 258.txt | August 27, 1979\n\n OBITUARY\n\n Stan Kenton, ... |
Satchel Paige | 1982 | 273.txt | June 9, 1982\n\n OBITUARY\n\n Satchel Paige, B... |
Samuel Beckett | 1989 | 316.txt | December 27, 1989\n\n OBITUARY\n\n Samuel Beck... |
Sammy Davis Jr | 1990 | 318.txt | May 17, 1990\n\n OBITUARY\n\n Sammy Davis Jr. ... |
Shirley Booth | 1992 | 332.txt | October 21, 1992\n\n OBITUARY\n\n Shirley Boot... |
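The mechanics of a Boolean mask are easier to see on a handful of rows. The following is a minimal sketch on a toy DataFrame (`toy` is hypothetical, built from three names in the corpus): the mask holds one Boolean per row, and `.loc` keeps only the rows where it is True.

```python
import pandas as pd

# Toy stand-in for the manifest (hypothetical; three rows only).
toy = pd.DataFrame(
    {"year": [1900, 1906, 1852]},
    index=pd.Index(
        ["Stephen Crane", "Susan B Anthony", "Ada Lovelace"], name="name"
    ),
)

# One Boolean per row: does the name start with "S"?
mask = toy.index.str.startswith("S")
print(mask)

# True counts as 1, so summing the mask counts the matching rows.
n_matches = int(mask.sum())
print(n_matches)

# Indexing with the mask keeps only the rows marked True.
selected = toy.loc[mask]
print(selected)
```

Summing a mask before indexing is a quick way to check how many rows a condition will return.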
What makes indexing by condition so powerful is that it generalizes to other data in the DataFrame, not just the index. Let’s reset the index and explore this a little.
manifest.reset_index(inplace = True)
manifest.head()
 | name | year | file | text |
---|---|---|---|---|
0 | Ada Lovelace | 1852 | 000.txt | A gifted mathematician who is now recognized a... |
1 | Robert E Lee | 1870 | 001.txt | October 13, 1870\n\n OBITUARY\n\n Gen. Robert ... |
2 | Andrew Johnson | 1875 | 002.txt | August 1, 1875\n\n OBITUARY\n\n Andrew Johnson... |
3 | Bedford Forrest | 1877 | 003.txt | October 30, 1877\n\n OBITUARY\n\n Death of Gen... |
4 | Lucretia Mott | 1880 | 004.txt | November 12, 1880\n\n OBITUARY\n\n Lucretia Mo... |
Below, we find obituaries that contain the string “musician”. Because this is a substring match, it returns obituaries containing the plural “musicians” as well.
manifest.loc[manifest["text"].str.contains("musician"), "name"]
18 Benjamin Harrison
21 Emily Warren Roebling
51 Balfour
59 John Philip Sousa
96 Jerome Kern
98 Bela Bartok
104 Fiorello La Guardia
110 Babe Ruth
144 W C Handy
145 Billie Holiday
151 Boris Pasternak
180 Albert Schweitzer
199 Coleman Hawkins
216 Louis Armstrong
218 Igor Stravinsky
220 Mahalia Jackson
228 Pablo Picasso
231 Earl Warren
250 Maria Callas
255 Arthur Fiedler
258 Stan Kenton
259 Richard Rodgers
270 Arthur Rubinstein
271 Thelonious Monk
276 Muddy Waters
280 Ansel Adams
285 Count Basie
291 Benny Goodman
302 Andres Segovie
312 Vladimir Horowitz
319 Leonard Bernstein
323 Frank Capra
325 Miles Davis
326 Martha Graham
330 John Cage
340 Dizzy Gillespie
350 Jacqueline Kennedy
371 Frank Sinatra
375 Pierre Trudeau
Name: name, dtype: object
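By default, `.str.contains()` does a case-sensitive match and treats the pattern as a regular expression. The sketch below, on a few invented sentences (not corpus text), shows how the `case` parameter and a regex word boundary change which rows match.

```python
import pandas as pd

# Invented example sentences, not taken from the corpus.
texts = pd.Series(
    [
        "He was a musician.",
        "Many musicians attended.",
        "Musician and composer.",
        "A painter.",
    ]
)

# Default: case-sensitive substring match, so "Musician" is missed.
default = texts.str.contains("musician")

# case=False ignores capitalization.
any_case = texts.str.contains("musician", case=False)

# Word boundaries (\b) keep the singular form only.
singular = texts.str.contains(r"\bmusician\b")

print(pd.DataFrame({"default": default, "any_case": any_case, "singular": singular}))
```

Which variant is appropriate depends on the question being asked; a broad substring match casts a wide net, while word boundaries reduce accidental matches.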
Here, something more complicated: documents that contain the string “Bird” and whose file names do not start with “00” (that is, every file after 009.txt).
mask = (
    manifest["text"].str.contains("Bird")
    & ~manifest["file"].str.startswith("00")
)
manifest.loc[mask, "name"]
175 David O Selznick
181 Shirley Jackson
224 Lyndon Johnson
261 Alfred Hitchcock
275 Earl Hines
281 Ethel Merman
340 Dizzy Gillespie
348 Jessica Tandy
377 Charles M Schulz
Name: name, dtype: object
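Conditions combine with `&` (and), `|` (or), and `~` (not). Because these operators bind more tightly than comparisons such as `==` or `>`, conditions written as comparisons need their own parentheses. The sketch below uses a toy DataFrame (`toy` and its contents are invented for illustration) and names each condition before combining them.

```python
import pandas as pd

# Invented rows for illustration; not corpus data.
toy = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "file": ["003.txt", "175.txt", "224.txt", "009.txt"],
        "text": ["Bird calls", "The Birds", "no match", "Bird"],
    }
)

# Name each condition so the combined mask reads like a sentence.
has_bird = toy["text"].str.contains("Bird")
early = toy["file"].str.startswith("00")

# Rows that mention "Bird" and are NOT among the early files.
both = toy.loc[has_bird & ~early, "name"]

# Rows that mention "Bird" OR are among the early files.
either = toy.loc[has_bird | early, "name"]

print(both.tolist())
print(either.tolist())
```

Naming the intermediate masks also makes it easy to inspect each condition on its own when a result looks wrong.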
3.4. Preparing for Data Analysis#
With the basics of indexing done, we will prepare to analyze the corpus. This will involve two steps. First, we preprocess the raw text in the text column of our DataFrame. Then, we define some plotting functions to graph the results of our analysis.
3.4.1. Preprocessing#
As we saw in the last chapter, operations like counting require texts to be processed in special ways. This includes changing the case of texts, breaking texts into lists of tokens, and so forth.
Previously, we used a simple heuristic to tokenize text: split the text stream on whitespace characters.
example = manifest.loc[manifest["name"] == "Ada Lovelace", "text"].item()
print(example.split())
['A', 'gifted', 'mathematician', 'who', 'is', 'now', 'recognized', 'as', 'the', 'first', 'computer', 'programmer.', 'By', 'CLAIRE', 'CAIN', 'MILLER', 'A', 'century', 'before', 'the', 'dawn', 'of', 'the', 'computer', 'age,', 'Ada', 'Lovelace', 'imagined', 'the', 'modern-day,', 'general-purpose', 'computer.', 'It', 'could', 'be', 'programmed', 'to', 'follow', 'instructions,', 'she', 'wrote', 'in', '1843.', 'It', 'could', 'not', 'just', 'calculate', 'but', 'also', 'create,', 'as', 'it', '“weaves', 'algebraic', 'patterns', 'just', 'as', 'the', 'Jacquard', 'loom', 'weaves', 'flowers', 'and', 'leaves.”', 'The', 'computer', 'she', 'was', 'writing', 'about,', 'the', 'British', 'inventor', 'Charles', 'Babbage’s', 'Analytical', 'Engine,', 'was', 'never', 'built.', 'But', 'her', 'writings', 'about', 'computing', 'have', 'earned', 'Lovelace', '—', 'who', 'died', 'of', 'uterine', 'cancer', 'in', '1852', 'at', '36', '—', 'recognition', 'as', 'the', 'first', 'computer', 'programmer.', 'The', 'program', 'she', 'wrote', 'for', 'the', 'Analytical', 'Engine', 'was', 'to', 'calculate', 'the', 'seventh', 'Bernoulli', 'number.', '(Bernoulli', 'numbers,', 'named', 'after', 'the', 'Swiss', 'mathematician', 'Jacob', 'Bernoulli,', 'are', 'used', 'in', 'many', 'different', 'areas', 'of', 'mathematics.)', 'But', 'her', 'deeper', 'influence', 'was', 'to', 'see', 'the', 'potential', 'of', 'computing.', 'The', 'machines', 'could', 'go', 'beyond', 'calculating', 'numbers,', 'she', 'said,', 'to', 'understand', 'symbols', 'and', 'be', 'used', 'to', 'create', 'music', 'or', 'art.', '“This', 'insight', 'would', 'become', 'the', 'core', 'concept', 'of', 'the', 'digital', 'age,”', 'Walter', 'Isaacson', 'wrote', 'in', 'his', 'book', '“The', 'Innovators.”', '“Any', 'piece', 'of', 'content,', 'data', 'or', 'information', '—', 'music,', 'text,', 'pictures,', 'numbers,', 'symbols,', 'sounds,', 'video', '—', 'could', 'be', 'expressed', 'in', 'digital', 'form', 'and', 'manipulated', 'by', 'machines.”', 'She', 
'also', 'explored', 'the', 'ramifications', 'of', 'what', 'a', 'computer', 'could', 'do,', 'writing', 'about', 'the', 'responsibility', 'placed', 'on', 'the', 'person', 'programming', 'the', 'machine,', 'and', 'raising', 'and', 'then', 'dismissing', 'the', 'notion', 'that', 'computers', 'could', 'someday', 'think', 'and', 'create', 'on', 'their', 'own', '—', 'what', 'we', 'now', 'call', 'artificial', 'intelligence.', '“The', 'Analytical', 'Engine', 'has', 'no', 'pretensions', 'whatever', 'to', 'originate', 'any', 'thing,”', 'she', 'wrote.', '“It', 'can', 'do', 'whatever', 'we', 'know', 'how', 'to', 'order', 'it', 'to', 'perform.”', 'Lovelace,', 'a', 'British', 'socialite', 'who', 'was', 'the', 'daughter', 'of', 'Lord', 'Byron,', 'the', 'Romantic', 'poet,', 'had', 'a', 'gift', 'for', 'combining', 'art', 'and', 'science,', 'one', 'of', 'her', 'biographers,', 'Betty', 'Alexandra', 'Toole,', 'has', 'written.', 'She', 'thought', 'of', 'math', 'and', 'logic', 'as', 'creative', 'and', 'imaginative,', 'and', 'called', 'it', '“poetical', 'science.”', 'Math', '“constitutes', 'the', 'language', 'through', 'which', 'alone', 'we', 'can', 'adequately', 'express', 'the', 'great', 'facts', 'of', 'the', 'natural', 'world,”', 'Lovelace', 'wrote.', 'Her', 'work,', 'which', 'was', 'rediscovered', 'in', 'the', 'mid-20th', 'century,', 'inspired', 'the', 'Defense', 'Department', 'to', 'name', 'a', 'programming', 'language', 'after', 'her', 'and', 'each', 'October', 'Ada', 'Lovelace', 'Day', 'signifies', 'a', 'celebration', 'of', 'women', 'in', 'technology.', 'Lovelace', 'lived', 'when', 'women', 'were', 'not', 'considered', 'to', 'be', 'prominent', 'scientific', 'thinkers,', 'and', 'her', 'skills', 'were', 'often', 'described', 'as', 'masculine.', '“With', 'an', 'understanding', 'thoroughly', 'masculine', 'in', 'solidity,', 'grasp', 'and', 'firmness,', 'Lady', 'Lovelace', 'had', 'all', 'the', 'delicacies', 'of', 'the', 'most', 'refined', 'female', 'character,”', 'said', 'an', 'obituary', 
'in', 'The', 'London', 'Examiner.', 'Babbage,', 'who', 'called', 'her', 'the', '“enchantress', 'of', 'numbers,”', 'once', 'wrote', 'that', 'she', '“has', 'thrown', 'her', 'magical', 'spell', 'around', 'the', 'most', 'abstract', 'of', 'Sciences', 'and', 'has', 'grasped', 'it', 'with', 'a', 'force', 'which', 'few', 'masculine', 'intellects', '(in', 'our', 'own', 'country', 'at', 'least)', 'could', 'have', 'exerted', 'over', 'it.”', 'Augusta', 'Ada', 'Byron', 'was', 'born', 'on', 'Dec.', '10,', '1815,', 'in', 'London,', 'to', 'Lord', 'Byron', 'and', 'Annabella', 'Milbanke.', 'Her', 'parents', 'separated', 'when', 'she', 'was', 'an', 'infant,', 'and', 'her', 'father', 'died', 'when', 'she', 'was', '8.', 'Her', 'mother', '—', 'whom', 'Lord', 'Byron', 'called', 'the', '“princess', 'of', 'parallelograms”', 'and,', 'after', 'their', 'falling', 'out,', 'a', '“mathematical', 'Medea”', '—', 'was', 'a', 'social', 'reformer', 'from', 'a', 'wealthy', 'family', 'who', 'had', 'a', 'deep', 'interest', 'in', 'mathematics.', 'An', 'etching', 'from', 'a', 'portrait', 'of', 'Lovelace', 'as', 'a', 'child.', 'She', 'is', 'said', 'to', 'have', 'had', 'a', 'gift', 'for', 'combining', 'art', 'and', 'science.', 'Smith', 'Collection/Gado/Getty', 'Images', 'Lovelace', 'showed', 'a', 'passion', 'for', 'math', 'and', 'mechanics', 'from', 'a', 'young', 'age,', 'encouraged', 'by', 'her', 'mother.', 'Because', 'of', 'her', 'class,', 'she', 'had', 'access', 'to', 'private', 'tutors', 'and', 'to', 'intellectuals', 'in', 'British', 'scientific', 'and', 'literary', 'society.', 'She', 'was', 'insatiably', 'curious', 'and', 'surrounded', 'herself', 'with', 'big', 'thinkers', 'of', 'the', 'day,', 'including', 'Mary', 'Somerville,', 'a', 'scientist', 'and', 'writer.', 'It', 'was', 'Somerville', 'who', 'introduced', 'Lovelace', 'to', 'Babbage', 'when', 'she', 'was', '17,', 'at', 'a', 'salon', 'he', 'hosted', 'soon', 'after', 'she', 'made', 'her', 'society', 'debut.', 'He', 'showed', 'her', 'a', 'two-foot', 
'high,', 'brass', 'mechanical', 'calculator', 'he', 'had', 'built,', 'and', 'it', 'gripped', 'her', 'imagination.', 'They', 'began', 'a', 'correspondence', 'about', 'math', 'and', 'science', 'that', 'lasted', 'almost', 'two', 'decades.', 'She', 'also', 'met', 'her', 'husband,', 'William', 'King,', 'through', 'Somerville.', 'They', 'married', 'in', '1835,', 'when', 'she', 'was', '19.', 'He', 'soon', 'became', 'an', 'earl,', 'and', 'she', 'became', 'the', 'Countess', 'of', 'Lovelace.', 'By', '1839,', 'she', 'had', 'given', 'birth', 'to', 'two', 'sons', 'and', 'a', 'daughter.', 'She', 'was', 'determined,', 'however,', 'not', 'to', 'let', 'her', 'family', 'life', 'slow', 'her', 'work.', 'The', 'year', 'she', 'was', 'married,', 'she', 'wrote', 'to', 'Somerville:', '“I', 'now', 'read', 'Mathematics', 'every', 'day', 'and', 'am', 'occupied', 'in', 'Trigonometry', 'and', 'in', 'preliminaries', 'to', 'Cubic', 'and', 'Biquadratic', 'Equations.', 'So', 'you', 'see', 'that', 'matrimony', 'has', 'by', 'no', 'means', 'lessened', 'my', 'taste', 'for', 'these', 'pursuits,', 'nor', 'my', 'determination', 'to', 'carry', 'them', 'on.”', 'In', '1840,', 'Lovelace', 'asked', 'Augustus', 'De', 'Morgan,', 'a', 'math', 'professor', 'in', 'London,', 'to', 'tutor', 'her.', 'Through', 'exchanging', 'letters,', 'he', 'taught', 'her', 'university-level', 'math.', 'He', 'later', 'wrote', 'to', 'her', 'mother', 'that', 'if', 'a', 'young', 'male', 'student', 'had', 'shown', 'her', 'skill,', '“they', 'would', 'have', 'certainly', 'made', 'him', 'an', 'original', 'mathematical', 'investigator,', 'perhaps', 'of', 'first-rate', 'eminence.”', 'It', 'was', 'in', '1843,', 'when', 'she', 'was', '27,', 'that', 'Lovelace', 'wrote', 'her', 'most', 'lasting', 'contribution', 'to', 'computer', 'science.', 'She', 'published', 'her', 'translation', 'of', 'an', 'academic', 'paper', 'about', 'the', 'Babbage', 'Analytical', 'Engine', 'and', 'added', 'a', 'section,', 'nearly', 'three', 'times', 'the', 'length', 
'of', 'the', 'paper,', 'titled,', '“Notes.”', 'Here,', 'she', 'described', 'how', 'the', 'computer', 'would', 'work,', 'imagined', 'its', 'potential', 'and', 'wrote', 'the', 'first', 'program.', 'Researchers', 'have', 'come', 'to', 'see', 'it', 'as', '“an', 'extraordinary', 'document,”', 'said', 'Ursula', 'Martin,', 'a', 'computer', 'scientist', 'at', 'the', 'University', 'of', 'Oxford', 'who', 'has', 'studied', 'Lovelace’s', 'life', 'and', 'work.', '“She’s', 'talking', 'about', 'the', 'abstract', 'principles', 'of', 'computation,', 'how', 'you', 'could', 'program', 'it,', 'and', 'big', 'ideas', 'like', 'maybe', 'it', 'could', 'compose', 'music,', 'maybe', 'it', 'could', 'think.”', 'Lovelace', 'died', 'less', 'than', 'a', 'decade', 'later,', 'on', 'Nov.', '27,', '1852.', 'In', 'the', '“Notes,”', 'she', 'imagined', 'a', 'future', 'in', 'which', 'computers', 'could', 'do', 'more', 'powerful', 'and', 'faster', 'analysis', 'than', 'humans.', '“A', 'new,', 'a', 'vast', 'and', 'a', 'powerful', 'language', 'is', 'developed', 'for', 'the', 'future', 'use', 'of', 'analysis,”', 'she', 'wrote,', '“in', 'which', 'to', 'wield', 'its', 'truths', 'so', 'that', 'these', 'may', 'become', 'of', 'more', 'speedy', 'and', 'accurate', 'practical', 'application', 'for', 'the', 'purposes', 'of', 'mankind.”', 'Claire', 'Cain', 'Miller', 'writes', 'about', 'gender', 'for', 'The', 'Upshot.', 'She', 'first', 'learned', 'about', 'Ada', 'Lovelace', 'while', 'covering', 'the', 'tech', 'industry,', 'where', 'women', 'are', 'severely', 'underrepresented.']
The problem with this approach is that it cannot separate punctuation that is
directly attached to the preceding characters, such as periods and commas.
To get around this, we use a more sophisticated tokenizer from the nltk
package, which is based on a series of regexes.
print(nltk.word_tokenize(example))
['A', 'gifted', 'mathematician', 'who', 'is', 'now', 'recognized', 'as', 'the', 'first', 'computer', 'programmer', '.', 'By', 'CLAIRE', 'CAIN', 'MILLER', 'A', 'century', 'before', 'the', 'dawn', 'of', 'the', 'computer', 'age', ',', 'Ada', 'Lovelace', 'imagined', 'the', 'modern-day', ',', 'general-purpose', 'computer', '.', 'It', 'could', 'be', 'programmed', 'to', 'follow', 'instructions', ',', 'she', 'wrote', 'in', '1843', '.', 'It', 'could', 'not', 'just', 'calculate', 'but', 'also', 'create', ',', 'as', 'it', '“', 'weaves', 'algebraic', 'patterns', 'just', 'as', 'the', 'Jacquard', 'loom', 'weaves', 'flowers', 'and', 'leaves.', '”', 'The', 'computer', 'she', 'was', 'writing', 'about', ',', 'the', 'British', 'inventor', 'Charles', 'Babbage', '’', 's', 'Analytical', 'Engine', ',', 'was', 'never', 'built', '.', 'But', 'her', 'writings', 'about', 'computing', 'have', 'earned', 'Lovelace', '—', 'who', 'died', 'of', 'uterine', 'cancer', 'in', '1852', 'at', '36', '—', 'recognition', 'as', 'the', 'first', 'computer', 'programmer', '.', 'The', 'program', 'she', 'wrote', 'for', 'the', 'Analytical', 'Engine', 'was', 'to', 'calculate', 'the', 'seventh', 'Bernoulli', 'number', '.', '(', 'Bernoulli', 'numbers', ',', 'named', 'after', 'the', 'Swiss', 'mathematician', 'Jacob', 'Bernoulli', ',', 'are', 'used', 'in', 'many', 'different', 'areas', 'of', 'mathematics', '.', ')', 'But', 'her', 'deeper', 'influence', 'was', 'to', 'see', 'the', 'potential', 'of', 'computing', '.', 'The', 'machines', 'could', 'go', 'beyond', 'calculating', 'numbers', ',', 'she', 'said', ',', 'to', 'understand', 'symbols', 'and', 'be', 'used', 'to', 'create', 'music', 'or', 'art', '.', '“', 'This', 'insight', 'would', 'become', 'the', 'core', 'concept', 'of', 'the', 'digital', 'age', ',', '”', 'Walter', 'Isaacson', 'wrote', 'in', 'his', 'book', '“', 'The', 'Innovators.', '”', '“', 'Any', 'piece', 'of', 'content', ',', 'data', 'or', 'information', '—', 'music', ',', 'text', ',', 'pictures', ',', 'numbers', 
',', 'symbols', ',', 'sounds', ',', 'video', '—', 'could', 'be', 'expressed', 'in', 'digital', 'form', 'and', 'manipulated', 'by', 'machines.', '”', 'She', 'also', 'explored', 'the', 'ramifications', 'of', 'what', 'a', 'computer', 'could', 'do', ',', 'writing', 'about', 'the', 'responsibility', 'placed', 'on', 'the', 'person', 'programming', 'the', 'machine', ',', 'and', 'raising', 'and', 'then', 'dismissing', 'the', 'notion', 'that', 'computers', 'could', 'someday', 'think', 'and', 'create', 'on', 'their', 'own', '—', 'what', 'we', 'now', 'call', 'artificial', 'intelligence', '.', '“', 'The', 'Analytical', 'Engine', 'has', 'no', 'pretensions', 'whatever', 'to', 'originate', 'any', 'thing', ',', '”', 'she', 'wrote', '.', '“', 'It', 'can', 'do', 'whatever', 'we', 'know', 'how', 'to', 'order', 'it', 'to', 'perform.', '”', 'Lovelace', ',', 'a', 'British', 'socialite', 'who', 'was', 'the', 'daughter', 'of', 'Lord', 'Byron', ',', 'the', 'Romantic', 'poet', ',', 'had', 'a', 'gift', 'for', 'combining', 'art', 'and', 'science', ',', 'one', 'of', 'her', 'biographers', ',', 'Betty', 'Alexandra', 'Toole', ',', 'has', 'written', '.', 'She', 'thought', 'of', 'math', 'and', 'logic', 'as', 'creative', 'and', 'imaginative', ',', 'and', 'called', 'it', '“', 'poetical', 'science.', '”', 'Math', '“', 'constitutes', 'the', 'language', 'through', 'which', 'alone', 'we', 'can', 'adequately', 'express', 'the', 'great', 'facts', 'of', 'the', 'natural', 'world', ',', '”', 'Lovelace', 'wrote', '.', 'Her', 'work', ',', 'which', 'was', 'rediscovered', 'in', 'the', 'mid-20th', 'century', ',', 'inspired', 'the', 'Defense', 'Department', 'to', 'name', 'a', 'programming', 'language', 'after', 'her', 'and', 'each', 'October', 'Ada', 'Lovelace', 'Day', 'signifies', 'a', 'celebration', 'of', 'women', 'in', 'technology', '.', 'Lovelace', 'lived', 'when', 'women', 'were', 'not', 'considered', 'to', 'be', 'prominent', 'scientific', 'thinkers', ',', 'and', 'her', 'skills', 'were', 'often', 'described', 
'as', 'masculine', '.', '“', 'With', 'an', 'understanding', 'thoroughly', 'masculine', 'in', 'solidity', ',', 'grasp', 'and', 'firmness', ',', 'Lady', 'Lovelace', 'had', 'all', 'the', 'delicacies', 'of', 'the', 'most', 'refined', 'female', 'character', ',', '”', 'said', 'an', 'obituary', 'in', 'The', 'London', 'Examiner', '.', 'Babbage', ',', 'who', 'called', 'her', 'the', '“', 'enchantress', 'of', 'numbers', ',', '”', 'once', 'wrote', 'that', 'she', '“', 'has', 'thrown', 'her', 'magical', 'spell', 'around', 'the', 'most', 'abstract', 'of', 'Sciences', 'and', 'has', 'grasped', 'it', 'with', 'a', 'force', 'which', 'few', 'masculine', 'intellects', '(', 'in', 'our', 'own', 'country', 'at', 'least', ')', 'could', 'have', 'exerted', 'over', 'it.', '”', 'Augusta', 'Ada', 'Byron', 'was', 'born', 'on', 'Dec.', '10', ',', '1815', ',', 'in', 'London', ',', 'to', 'Lord', 'Byron', 'and', 'Annabella', 'Milbanke', '.', 'Her', 'parents', 'separated', 'when', 'she', 'was', 'an', 'infant', ',', 'and', 'her', 'father', 'died', 'when', 'she', 'was', '8', '.', 'Her', 'mother', '—', 'whom', 'Lord', 'Byron', 'called', 'the', '“', 'princess', 'of', 'parallelograms', '”', 'and', ',', 'after', 'their', 'falling', 'out', ',', 'a', '“', 'mathematical', 'Medea', '”', '—', 'was', 'a', 'social', 'reformer', 'from', 'a', 'wealthy', 'family', 'who', 'had', 'a', 'deep', 'interest', 'in', 'mathematics', '.', 'An', 'etching', 'from', 'a', 'portrait', 'of', 'Lovelace', 'as', 'a', 'child', '.', 'She', 'is', 'said', 'to', 'have', 'had', 'a', 'gift', 'for', 'combining', 'art', 'and', 'science', '.', 'Smith', 'Collection/Gado/Getty', 'Images', 'Lovelace', 'showed', 'a', 'passion', 'for', 'math', 'and', 'mechanics', 'from', 'a', 'young', 'age', ',', 'encouraged', 'by', 'her', 'mother', '.', 'Because', 'of', 'her', 'class', ',', 'she', 'had', 'access', 'to', 'private', 'tutors', 'and', 'to', 'intellectuals', 'in', 'British', 'scientific', 'and', 'literary', 'society', '.', 'She', 'was', 'insatiably', 
'curious', 'and', 'surrounded', 'herself', 'with', 'big', 'thinkers', 'of', 'the', 'day', ',', 'including', 'Mary', 'Somerville', ',', 'a', 'scientist', 'and', 'writer', '.', 'It', 'was', 'Somerville', 'who', 'introduced', 'Lovelace', 'to', 'Babbage', 'when', 'she', 'was', '17', ',', 'at', 'a', 'salon', 'he', 'hosted', 'soon', 'after', 'she', 'made', 'her', 'society', 'debut', '.', 'He', 'showed', 'her', 'a', 'two-foot', 'high', ',', 'brass', 'mechanical', 'calculator', 'he', 'had', 'built', ',', 'and', 'it', 'gripped', 'her', 'imagination', '.', 'They', 'began', 'a', 'correspondence', 'about', 'math', 'and', 'science', 'that', 'lasted', 'almost', 'two', 'decades', '.', 'She', 'also', 'met', 'her', 'husband', ',', 'William', 'King', ',', 'through', 'Somerville', '.', 'They', 'married', 'in', '1835', ',', 'when', 'she', 'was', '19', '.', 'He', 'soon', 'became', 'an', 'earl', ',', 'and', 'she', 'became', 'the', 'Countess', 'of', 'Lovelace', '.', 'By', '1839', ',', 'she', 'had', 'given', 'birth', 'to', 'two', 'sons', 'and', 'a', 'daughter', '.', 'She', 'was', 'determined', ',', 'however', ',', 'not', 'to', 'let', 'her', 'family', 'life', 'slow', 'her', 'work', '.', 'The', 'year', 'she', 'was', 'married', ',', 'she', 'wrote', 'to', 'Somerville', ':', '“', 'I', 'now', 'read', 'Mathematics', 'every', 'day', 'and', 'am', 'occupied', 'in', 'Trigonometry', 'and', 'in', 'preliminaries', 'to', 'Cubic', 'and', 'Biquadratic', 'Equations', '.', 'So', 'you', 'see', 'that', 'matrimony', 'has', 'by', 'no', 'means', 'lessened', 'my', 'taste', 'for', 'these', 'pursuits', ',', 'nor', 'my', 'determination', 'to', 'carry', 'them', 'on.', '”', 'In', '1840', ',', 'Lovelace', 'asked', 'Augustus', 'De', 'Morgan', ',', 'a', 'math', 'professor', 'in', 'London', ',', 'to', 'tutor', 'her', '.', 'Through', 'exchanging', 'letters', ',', 'he', 'taught', 'her', 'university-level', 'math', '.', 'He', 'later', 'wrote', 'to', 'her', 'mother', 'that', 'if', 'a', 'young', 'male', 'student', 'had', 
'shown', 'her', 'skill', ',', '“', 'they', 'would', 'have', 'certainly', 'made', 'him', 'an', 'original', 'mathematical', 'investigator', ',', 'perhaps', 'of', 'first-rate', 'eminence.', '”', 'It', 'was', 'in', '1843', ',', 'when', 'she', 'was', '27', ',', 'that', 'Lovelace', 'wrote', 'her', 'most', 'lasting', 'contribution', 'to', 'computer', 'science', '.', 'She', 'published', 'her', 'translation', 'of', 'an', 'academic', 'paper', 'about', 'the', 'Babbage', 'Analytical', 'Engine', 'and', 'added', 'a', 'section', ',', 'nearly', 'three', 'times', 'the', 'length', 'of', 'the', 'paper', ',', 'titled', ',', '“', 'Notes.', '”', 'Here', ',', 'she', 'described', 'how', 'the', 'computer', 'would', 'work', ',', 'imagined', 'its', 'potential', 'and', 'wrote', 'the', 'first', 'program', '.', 'Researchers', 'have', 'come', 'to', 'see', 'it', 'as', '“', 'an', 'extraordinary', 'document', ',', '”', 'said', 'Ursula', 'Martin', ',', 'a', 'computer', 'scientist', 'at', 'the', 'University', 'of', 'Oxford', 'who', 'has', 'studied', 'Lovelace', '’', 's', 'life', 'and', 'work', '.', '“', 'She', '’', 's', 'talking', 'about', 'the', 'abstract', 'principles', 'of', 'computation', ',', 'how', 'you', 'could', 'program', 'it', ',', 'and', 'big', 'ideas', 'like', 'maybe', 'it', 'could', 'compose', 'music', ',', 'maybe', 'it', 'could', 'think.', '”', 'Lovelace', 'died', 'less', 'than', 'a', 'decade', 'later', ',', 'on', 'Nov.', '27', ',', '1852', '.', 'In', 'the', '“', 'Notes', ',', '”', 'she', 'imagined', 'a', 'future', 'in', 'which', 'computers', 'could', 'do', 'more', 'powerful', 'and', 'faster', 'analysis', 'than', 'humans', '.', '“', 'A', 'new', ',', 'a', 'vast', 'and', 'a', 'powerful', 'language', 'is', 'developed', 'for', 'the', 'future', 'use', 'of', 'analysis', ',', '”', 'she', 'wrote', ',', '“', 'in', 'which', 'to', 'wield', 'its', 'truths', 'so', 'that', 'these', 'may', 'become', 'of', 'more', 'speedy', 'and', 'accurate', 'practical', 'application', 'for', 'the', 'purposes', 'of', 
'mankind.', '”', 'Claire', 'Cain', 'Miller', 'writes', 'about', 'gender', 'for', 'The', 'Upshot', '.', 'She', 'first', 'learned', 'about', 'Ada', 'Lovelace', 'while', 'covering', 'the', 'tech', 'industry', ',', 'where', 'women', 'are', 'severely', 'underrepresented', '.']
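To make the difference concrete, here is a minimal sketch that compares a whitespace split with a toy regex tokenizer on one short sentence. The pattern below is an illustrative assumption built with the standard re module; it is far simpler than the rules nltk actually applies and is only meant to show the idea of peeling punctuation away from words.

```python
import re

sentence = "She wrote the first program, in 1843."

# Splitting on whitespace leaves punctuation glued to the neighboring word
whitespace_tokens = sentence.split()

# Toy pattern: a run of word characters, or any single non-space symbol
toy_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(toy_tokens)
```

In the second list the comma and the period become tokens of their own, which is what lets us filter them out later.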
Tip
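You need to download this tokenizer's data the first time you use it, so don’t fret if you get an error. Just follow the instructions in the error message to download the file. The usual command is shown below; recent NLTK releases may ask for "punkt_tab" instead of "punkt":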
nltk.download("punkt")
Below, we incorporate this tokenizer into a preprocessing function that performs the following steps:
Change the string to lowercase
Tokenize the string into lists of tokens
Optionally, create multi-gram token sequences (more on this next week)
def preprocess(doc, ngram = 1):
    """Preprocess a document.

    Parameters
    ----------
    doc : str
        The document to preprocess
    ngram : int
        Size of the token sequences to build (1 keeps single tokens)

    Returns
    -------
    tokens : list
        Tokenized document
    """
    # First, change the case of the words to lowercase
    doc = doc.lower()

    # Tokenize the string. Optionally, make 2-gram (or more) sequences from
    # those tokens
    tokens = nltk.word_tokenize(doc)
    if ngram > 1:
        tokens = list(nltk.ngrams(tokens, ngram))

    return tokens
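To see what the ngram option produces, here is a minimal pure-Python sketch that mimics the sliding-window behavior of nltk.ngrams on a hand-written token list. The helper name make_ngrams and the sample tokens are illustrative assumptions, not part of the chapter's code.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # Zip the list against copies of itself shifted by 1, 2, ..., n - 1
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["ada", "lovelace", "imagined", "the", "computer"]
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A list of five tokens yields four 2-grams and three 3-grams, so calling preprocess() with ngram greater than 1 returns tuples rather than single strings.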
With our function defined, we preprocess the corpus documents.
cleaned = [preprocess(doc) for doc in manifest["text"]]
Then we get rid of punctuation and numbers with a regex substitution. Note that this is a two-step process: first we remove anything that isn’t an ASCII letter, then we filter out the empty strings left behind in the sublists.
cleaned = [[re.sub(r"[^a-zA-Z]", "", tok) for tok in doc] for doc in cleaned]
cleaned = [[tok for tok in doc if tok] for doc in cleaned]
print(cleaned[0])
['a', 'gifted', 'mathematician', 'who', 'is', 'now', 'recognized', 'as', 'the', 'first', 'computer', 'programmer', 'by', 'claire', 'cain', 'miller', 'a', 'century', 'before', 'the', 'dawn', 'of', 'the', 'computer', 'age', 'ada', 'lovelace', 'imagined', 'the', 'modernday', 'generalpurpose', 'computer', 'it', 'could', 'be', 'programmed', 'to', 'follow', 'instructions', 'she', 'wrote', 'in', 'it', 'could', 'not', 'just', 'calculate', 'but', 'also', 'create', 'as', 'it', 'weaves', 'algebraic', 'patterns', 'just', 'as', 'the', 'jacquard', 'loom', 'weaves', 'flowers', 'and', 'leaves', 'the', 'computer', 'she', 'was', 'writing', 'about', 'the', 'british', 'inventor', 'charles', 'babbage', 's', 'analytical', 'engine', 'was', 'never', 'built', 'but', 'her', 'writings', 'about', 'computing', 'have', 'earned', 'lovelace', 'who', 'died', 'of', 'uterine', 'cancer', 'in', 'at', 'recognition', 'as', 'the', 'first', 'computer', 'programmer', 'the', 'program', 'she', 'wrote', 'for', 'the', 'analytical', 'engine', 'was', 'to', 'calculate', 'the', 'seventh', 'bernoulli', 'number', 'bernoulli', 'numbers', 'named', 'after', 'the', 'swiss', 'mathematician', 'jacob', 'bernoulli', 'are', 'used', 'in', 'many', 'different', 'areas', 'of', 'mathematics', 'but', 'her', 'deeper', 'influence', 'was', 'to', 'see', 'the', 'potential', 'of', 'computing', 'the', 'machines', 'could', 'go', 'beyond', 'calculating', 'numbers', 'she', 'said', 'to', 'understand', 'symbols', 'and', 'be', 'used', 'to', 'create', 'music', 'or', 'art', 'this', 'insight', 'would', 'become', 'the', 'core', 'concept', 'of', 'the', 'digital', 'age', 'walter', 'isaacson', 'wrote', 'in', 'his', 'book', 'the', 'innovators', 'any', 'piece', 'of', 'content', 'data', 'or', 'information', 'music', 'text', 'pictures', 'numbers', 'symbols', 'sounds', 'video', 'could', 'be', 'expressed', 'in', 'digital', 'form', 'and', 'manipulated', 'by', 'machines', 'she', 'also', 'explored', 'the', 'ramifications', 'of', 'what', 'a', 'computer', 
'could', 'do', 'writing', 'about', 'the', 'responsibility', 'placed', 'on', 'the', 'person', 'programming', 'the', 'machine', 'and', 'raising', 'and', 'then', 'dismissing', 'the', 'notion', 'that', 'computers', 'could', 'someday', 'think', 'and', 'create', 'on', 'their', 'own', 'what', 'we', 'now', 'call', 'artificial', 'intelligence', 'the', 'analytical', 'engine', 'has', 'no', 'pretensions', 'whatever', 'to', 'originate', 'any', 'thing', 'she', 'wrote', 'it', 'can', 'do', 'whatever', 'we', 'know', 'how', 'to', 'order', 'it', 'to', 'perform', 'lovelace', 'a', 'british', 'socialite', 'who', 'was', 'the', 'daughter', 'of', 'lord', 'byron', 'the', 'romantic', 'poet', 'had', 'a', 'gift', 'for', 'combining', 'art', 'and', 'science', 'one', 'of', 'her', 'biographers', 'betty', 'alexandra', 'toole', 'has', 'written', 'she', 'thought', 'of', 'math', 'and', 'logic', 'as', 'creative', 'and', 'imaginative', 'and', 'called', 'it', 'poetical', 'science', 'math', 'constitutes', 'the', 'language', 'through', 'which', 'alone', 'we', 'can', 'adequately', 'express', 'the', 'great', 'facts', 'of', 'the', 'natural', 'world', 'lovelace', 'wrote', 'her', 'work', 'which', 'was', 'rediscovered', 'in', 'the', 'midth', 'century', 'inspired', 'the', 'defense', 'department', 'to', 'name', 'a', 'programming', 'language', 'after', 'her', 'and', 'each', 'october', 'ada', 'lovelace', 'day', 'signifies', 'a', 'celebration', 'of', 'women', 'in', 'technology', 'lovelace', 'lived', 'when', 'women', 'were', 'not', 'considered', 'to', 'be', 'prominent', 'scientific', 'thinkers', 'and', 'her', 'skills', 'were', 'often', 'described', 'as', 'masculine', 'with', 'an', 'understanding', 'thoroughly', 'masculine', 'in', 'solidity', 'grasp', 'and', 'firmness', 'lady', 'lovelace', 'had', 'all', 'the', 'delicacies', 'of', 'the', 'most', 'refined', 'female', 'character', 'said', 'an', 'obituary', 'in', 'the', 'london', 'examiner', 'babbage', 'who', 'called', 'her', 'the', 'enchantress', 'of', 'numbers', 'once', 
'wrote', 'that', 'she', 'has', 'thrown', 'her', 'magical', 'spell', 'around', 'the', 'most', 'abstract', 'of', 'sciences', 'and', 'has', 'grasped', 'it', 'with', 'a', 'force', 'which', 'few', 'masculine', 'intellects', 'in', 'our', 'own', 'country', 'at', 'least', 'could', 'have', 'exerted', 'over', 'it', 'augusta', 'ada', 'byron', 'was', 'born', 'on', 'dec', 'in', 'london', 'to', 'lord', 'byron', 'and', 'annabella', 'milbanke', 'her', 'parents', 'separated', 'when', 'she', 'was', 'an', 'infant', 'and', 'her', 'father', 'died', 'when', 'she', 'was', 'her', 'mother', 'whom', 'lord', 'byron', 'called', 'the', 'princess', 'of', 'parallelograms', 'and', 'after', 'their', 'falling', 'out', 'a', 'mathematical', 'medea', 'was', 'a', 'social', 'reformer', 'from', 'a', 'wealthy', 'family', 'who', 'had', 'a', 'deep', 'interest', 'in', 'mathematics', 'an', 'etching', 'from', 'a', 'portrait', 'of', 'lovelace', 'as', 'a', 'child', 'she', 'is', 'said', 'to', 'have', 'had', 'a', 'gift', 'for', 'combining', 'art', 'and', 'science', 'smith', 'collectiongadogetty', 'images', 'lovelace', 'showed', 'a', 'passion', 'for', 'math', 'and', 'mechanics', 'from', 'a', 'young', 'age', 'encouraged', 'by', 'her', 'mother', 'because', 'of', 'her', 'class', 'she', 'had', 'access', 'to', 'private', 'tutors', 'and', 'to', 'intellectuals', 'in', 'british', 'scientific', 'and', 'literary', 'society', 'she', 'was', 'insatiably', 'curious', 'and', 'surrounded', 'herself', 'with', 'big', 'thinkers', 'of', 'the', 'day', 'including', 'mary', 'somerville', 'a', 'scientist', 'and', 'writer', 'it', 'was', 'somerville', 'who', 'introduced', 'lovelace', 'to', 'babbage', 'when', 'she', 'was', 'at', 'a', 'salon', 'he', 'hosted', 'soon', 'after', 'she', 'made', 'her', 'society', 'debut', 'he', 'showed', 'her', 'a', 'twofoot', 'high', 'brass', 'mechanical', 'calculator', 'he', 'had', 'built', 'and', 'it', 'gripped', 'her', 'imagination', 'they', 'began', 'a', 'correspondence', 'about', 'math', 'and', 'science', 
'that', 'lasted', 'almost', 'two', 'decades', 'she', 'also', 'met', 'her', 'husband', 'william', 'king', 'through', 'somerville', 'they', 'married', 'in', 'when', 'she', 'was', 'he', 'soon', 'became', 'an', 'earl', 'and', 'she', 'became', 'the', 'countess', 'of', 'lovelace', 'by', 'she', 'had', 'given', 'birth', 'to', 'two', 'sons', 'and', 'a', 'daughter', 'she', 'was', 'determined', 'however', 'not', 'to', 'let', 'her', 'family', 'life', 'slow', 'her', 'work', 'the', 'year', 'she', 'was', 'married', 'she', 'wrote', 'to', 'somerville', 'i', 'now', 'read', 'mathematics', 'every', 'day', 'and', 'am', 'occupied', 'in', 'trigonometry', 'and', 'in', 'preliminaries', 'to', 'cubic', 'and', 'biquadratic', 'equations', 'so', 'you', 'see', 'that', 'matrimony', 'has', 'by', 'no', 'means', 'lessened', 'my', 'taste', 'for', 'these', 'pursuits', 'nor', 'my', 'determination', 'to', 'carry', 'them', 'on', 'in', 'lovelace', 'asked', 'augustus', 'de', 'morgan', 'a', 'math', 'professor', 'in', 'london', 'to', 'tutor', 'her', 'through', 'exchanging', 'letters', 'he', 'taught', 'her', 'universitylevel', 'math', 'he', 'later', 'wrote', 'to', 'her', 'mother', 'that', 'if', 'a', 'young', 'male', 'student', 'had', 'shown', 'her', 'skill', 'they', 'would', 'have', 'certainly', 'made', 'him', 'an', 'original', 'mathematical', 'investigator', 'perhaps', 'of', 'firstrate', 'eminence', 'it', 'was', 'in', 'when', 'she', 'was', 'that', 'lovelace', 'wrote', 'her', 'most', 'lasting', 'contribution', 'to', 'computer', 'science', 'she', 'published', 'her', 'translation', 'of', 'an', 'academic', 'paper', 'about', 'the', 'babbage', 'analytical', 'engine', 'and', 'added', 'a', 'section', 'nearly', 'three', 'times', 'the', 'length', 'of', 'the', 'paper', 'titled', 'notes', 'here', 'she', 'described', 'how', 'the', 'computer', 'would', 'work', 'imagined', 'its', 'potential', 'and', 'wrote', 'the', 'first', 'program', 'researchers', 'have', 'come', 'to', 'see', 'it', 'as', 'an', 'extraordinary', 
'document', 'said', 'ursula', 'martin', 'a', 'computer', 'scientist', 'at', 'the', 'university', 'of', 'oxford', 'who', 'has', 'studied', 'lovelace', 's', 'life', 'and', 'work', 'she', 's', 'talking', 'about', 'the', 'abstract', 'principles', 'of', 'computation', 'how', 'you', 'could', 'program', 'it', 'and', 'big', 'ideas', 'like', 'maybe', 'it', 'could', 'compose', 'music', 'maybe', 'it', 'could', 'think', 'lovelace', 'died', 'less', 'than', 'a', 'decade', 'later', 'on', 'nov', 'in', 'the', 'notes', 'she', 'imagined', 'a', 'future', 'in', 'which', 'computers', 'could', 'do', 'more', 'powerful', 'and', 'faster', 'analysis', 'than', 'humans', 'a', 'new', 'a', 'vast', 'and', 'a', 'powerful', 'language', 'is', 'developed', 'for', 'the', 'future', 'use', 'of', 'analysis', 'she', 'wrote', 'in', 'which', 'to', 'wield', 'its', 'truths', 'so', 'that', 'these', 'may', 'become', 'of', 'more', 'speedy', 'and', 'accurate', 'practical', 'application', 'for', 'the', 'purposes', 'of', 'mankind', 'claire', 'cain', 'miller', 'writes', 'about', 'gender', 'for', 'the', 'upshot', 'she', 'first', 'learned', 'about', 'ada', 'lovelace', 'while', 'covering', 'the', 'tech', 'industry', 'where', 'women', 'are', 'severely', 'underrepresented']
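The two steps are easier to follow on a handful of tokens. The sketch below applies the same substitution and filter to a short hand-picked list; the sample tokens are chosen for illustration, though each one appears in the tokenized output above.

```python
import re

sample = ["mid-20th", "century", ",", "1843", "lovelace", "’", "s"]

# Step 1: delete every character that is not an ASCII letter
stripped = [re.sub(r"[^a-zA-Z]", "", tok) for tok in sample]

# Step 2: drop the tokens that became empty strings
kept = [tok for tok in stripped if tok]

print(stripped)
print(kept)
```

Notice the side effects: "mid-20th" collapses to "midth", the year disappears entirely, and the possessive "s" survives as a token of its own. These are the same artifacts visible in the cleaned output above.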
Finally, we assign the cleaned tokens to a new column in our DataFrame.
manifest["tokens"] = cleaned.copy()
3.4.2. Plotting#
A last step before analyzing these tokens: defining a function to plot our
results with a histogram. We use seaborn for this. It has a simple interface
that integrates directly with DataFrames.
def plot_metrics(data, variable, title = "", xlabel = "", figsize = (15, 5)):
    """Plot metrics with a histogram.

    Parameters
    ----------
    data : pd.DataFrame
        The data to plot
    variable : str
        Which variable to plot
    title : str
        Plot title
    xlabel : str
        Label of the X axis
    figsize : tuple
        Size of the figure
    """
    # First, check whether the variable we want to plot is in the DataFrame
    if variable not in data.columns:
        raise ValueError(f"{variable} not in data")

    # Create a figure with a plot in it, then add labels
    plt.figure(figsize = figsize)
    g = sns.histplot(data = data, x = variable)
    g.set(title = title, xlabel = xlabel, ylabel = "Count")
    plt.show()
3.5. Data Analysis#
Time to look at our data. Many of these operations will rely on the .apply()
method. You can think of this method like a for loop: it applies some
function to every element along an axis in the DataFrame. Axis 0 is the row
(index) axis, while 1 is the column axis. This can feel somewhat backwards:
setting axis = 0 runs the function down the rows, producing one result per
column, while axis = 1 runs it across the columns, producing one result per
row. On a single column (a Series), .apply() takes no axis and simply calls
the function on each value.
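A small example may help fix the direction of each axis. The sketch below builds a toy DataFrame (the column names and values are made up for illustration) and applies the same function along each axis, then shows the element-wise behavior on a single column of lists.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def value_range(series):
    """Difference between the largest and smallest value."""
    return series.max() - series.min()

# axis = 0: the function receives each column, so we get one result per column
per_column = toy.apply(value_range, axis = 0)

# axis = 1: the function receives each row, so we get one result per row
per_row = toy.apply(value_range, axis = 1)

# On a Series, .apply() calls the function on each value in turn
lengths = pd.Series([["a", "b"], ["c"]]).apply(len)

print(per_column)
print(per_row)
print(lengths)
```

The last call is the pattern used in the rest of this section: a column of token lists goes in, and one number per document comes out.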
3.5.1. Document metrics#
First, some simple document metrics. Below, we calculate the number of tokens in a document, as expressed in this notation:
\[T(i) = \sum_{j=1}^{m_i} 1\]
Where:
\(T(i)\) is the total number of tokens for the \(i\)-th document
\(m_i\) is the length of the token list for document \(i\)
Each token \(j\) in document \(i\) is counted once, indicated by \(1\)
In code, len() will handle this easily.
manifest["num_tokens"] = manifest["tokens"].apply(len)
plot_metrics(manifest, "num_tokens", title = "Token counts", xlabel = "Tokens")

The number of types is the number of unique tokens in a document. We calculate it with:
\[K(i) = \sum_{j \in J} 1\]
Where:
\(K(i)\) is the total number of types for the \(i\)-th document
\(j \in J\) represents each token \(j\) in \(J\), the set of unique tokens in document \(i\)
Each token \(j\) in \(J\) is counted once, indicated by \(1\)
To implement this in code, we take advantage of a feature of the .apply()
method: its outputs can be directed to another .apply() call, or chained.
manifest["num_types"] = manifest["tokens"].apply(np.unique).apply(len)
plot_metrics(manifest, "num_types", title = "Type counts", xlabel = "Types")

The type-token ratio is a measure of lexical diversity:
\[\text{TTR}(i) = \frac{K(i)}{T(i)}\]
In other words, for document \(i\) it is the number of types \(K(i)\) divided by the number of tokens \(T(i)\).
manifest["ttr"] = manifest["num_types"] / manifest["num_tokens"]
plot_metrics(manifest, "ttr", title = "Type-token ratio", xlabel = "TTR")
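The three metrics can be checked by hand on a tiny document. The sketch below uses a made-up seven-token list and plain Python, with set() standing in for np.unique; the variable names mirror the DataFrame columns but are otherwise illustrative.

```python
tokens = ["the", "computer", "she", "wrote", "about", "the", "computer"]

# T(i): every token counts once
num_tokens = len(tokens)

# K(i): every distinct token counts once
num_types = len(set(tokens))

# Type-token ratio: types divided by tokens
ttr = num_types / num_tokens

print(num_tokens, num_types, round(ttr, 3))
```

Because "the" and "computer" each appear twice, the seven tokens contain only five types. Longer documents repeat common words more often, so the ratio tends to fall as documents grow, which is worth keeping in mind when comparing texts of very different lengths.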

Use the .nlargest() method to find the document with the highest type-token
ratio.
manifest.nlargest(n = 1, columns = "ttr")
| | name | year | file | text | tokens | num_tokens | num_types | ttr |
---|---|---|---|---|---|---|---|---|
39 | Hilaire G E Degas | 1917 | 039.txt | September 28, 1917\n\n OBITUARY\n\n Hilaire G.... | [september, obituary, hilaire, g, e, degas, no... | 167 | 111 | 0.664671 |
And .nsmallest() will return the lowest one:
manifest.nsmallest(n = 1, columns = "ttr")
| | name | year | file | text | tokens | num_tokens | num_types | ttr |
---|---|---|---|---|---|---|---|---|
6 | Ulysses Grant | 1885 | 006.txt | July 24, 1885\n\n OBITUARY\n\n The Career of a... | [july, obituary, the, career, of, a, soldier, ... | 40800 | 5514 | 0.135147 |
Finally, a global view of these three metrics using .describe():
manifest[["num_tokens", "num_types", "ttr"]].describe()
| | num_tokens | num_types | ttr |
---|---|---|---|
count | 379.000000 | 379.000000 | 379.000000 |
mean | 2491.387863 | 848.970976 | 0.395118 |
std | 2796.214753 | 541.151464 | 0.070147 |
min | 167.000000 | 58.000000 | 0.135147 |
25% | 1133.000000 | 488.000000 | 0.349747 |
50% | 1864.000000 | 732.000000 | 0.389706 |
75% | 2972.500000 | 1038.500000 | 0.439554 |
max | 40800.000000 | 5514.000000 | 0.664671 |
3.5.2. Token metrics#
Now, tokens. Next week we will use a special data structure, the
document-term matrix, to make working with token data easier, but base
functionality in pandas will suffice for now. Using .explode() breaks token
lists into individual rows.
manifest = manifest.explode("tokens")
That greatly lengthens the DataFrame. You will often hear data scientists speak of long and wide data. These terms describe tabular data that has many observations relative to variables (long) or many variables relative to observations (wide).
The .shape attribute stores information about the number of rows and columns.
num_rows, num_cols = manifest.shape
print(f"DataFrame dimensions: ({num_rows:,} x {num_cols})")
DataFrame dimensions: (944,236 x 8)
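To see what .explode() does to the shape, here is a minimal sketch on a toy DataFrame with two documents. The names and tokens are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["doc_a", "doc_b"],
    "tokens": [["a", "gifted", "mathematician"], ["a", "soldier"]],
})

# Each list element gets its own row; the other columns are repeated
long_toy = toy.explode("tokens")

print(toy.shape)
print(long_toy.shape)
print(long_toy)
```

Two rows become five, one per token, and the original index labels repeat so that every token can still be traced back to its document.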
Use .value_counts() to count observations in a column. We assign the result
to a new variable, convert it to a DataFrame, and then use .sort_values() to
put the counts in descending order.
token_freq = manifest["tokens"].value_counts()
token_freq = pd.DataFrame(token_freq).reset_index()
token_freq.sort_values(by = "count", ascending = False, inplace = True)
Tokens with the highest frequency:
token_freq.head(10)
| | tokens | count |
---|---|---|
0 | the | 60573 |
1 | of | 34239 |
2 | and | 28429 |
3 | in | 27792 |
4 | a | 24382 |
5 | to | 22995 |
6 | was | 17603 |
7 | he | 17387 |
8 | his | 14777 |
9 | that | 9563 |
And the lowest:
token_freq.tail(10)
| | tokens | count |
---|---|---|
28288 | coarservoiced | 1 |
28289 | greengrocer | 1 |
28290 | whelan | 1 |
28291 | naughty | 1 |
28292 | reemerging | 1 |
28293 | strop | 1 |
28294 | layout | 1 |
28295 | lodger | 1 |
28296 | ripper | 1 |
38809 | mentored | 1 |
In fact, many tokens occur only once in the data. We refer to these as hapax legomena (Greek for “only said once”). How many are there?
hapaxes = token_freq[token_freq["count"] == 1]
print(f"Number of hapaxes: {len(hapaxes):,}")
Number of hapaxes: 15,779
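The same counting logic can be sketched without pandas. The example below uses collections.Counter from the standard library on a short invented token list; it illustrates the idea and is not a replacement for the DataFrame workflow above.

```python
from collections import Counter

tokens = ["the", "engine", "was", "never", "built",
          "but", "the", "engine", "was", "imagined"]

# Map each distinct token to the number of times it occurs
counts = Counter(tokens)

# Hapaxes are the tokens whose count is exactly one
hapaxes = [tok for tok, n in counts.items() if n == 1]

print(counts)
print(hapaxes)
```

Four of the seven types in this toy list are hapaxes. In the obituary corpus the share is similarly large: a big portion of the vocabulary appears just once.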
A broader look at token counts will situate hapaxes. Below, we plot the 1,000 most frequent tokens.
N = 1000
plt.figure(figsize = (15, 5))
g = sns.scatterplot(data = token_freq[:N], x = "tokens", y = "count")
g.set(xlabel = "Tokens", ylabel = "Token counts", title = f"Top {N:,} Tokens")
plt.xticks(rotation = 90, ticks = range(0, N, 25))
plt.show()

Even in the top 1,000 tokens, it’s evident that there is an extremely long tail in the count data. Moreover, even at the highest counts there are big jumps between the most frequent token, the second one, the third, and so on.
Plotting a larger sample will show the same pattern. Below, we sample 10,000 tokens randomly.
N = 10_000
sampled = token_freq.sample(N, replace = False)
sampled.sort_values("count", ascending = False, inplace = True)
Now we draw the sampled counts as a line plot.
plt.figure(figsize = (15, 5))
g = sns.lineplot(data = sampled, x = "tokens", y = "count")
g.set(xlabel = "Tokens", ylabel = "Count", title = f"{N:,} Sampled Tokens")
plt.xticks(rotation = 90, ticks = range(0, N, 500))
plt.show()

What is this telling us? Our token distribution is roughly Zipfian: the frequency of the \(n\)-th most common token is approximately inversely proportional to its rank \(n\). Put another way, the most common token in the data occurs about twice as often as the next most common token, about three times as often as the third most common token, and so on.
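A quick arithmetic check makes this concrete. The sketch below takes the five highest counts from the frequency table above and compares each with the count an idealized Zipfian curve would predict, namely the top count divided by the rank. The variable names are illustrative, and the 1/rank prediction is the textbook idealization rather than a model fitted to this corpus.

```python
# Top five (token, count) pairs copied from the frequency table above
top_counts = [("the", 60573), ("of", 34239), ("and", 28429),
              ("in", 27792), ("a", 24382)]

top = top_counts[0][1]
comparison = []
for rank, (token, count) in enumerate(top_counts, start = 1):
    # Idealized Zipf: the count at a given rank is the top count / rank
    predicted = top / rank
    comparison.append((rank, token, count, predicted, count / predicted))

for rank, token, count, predicted, ratio in comparison:
    print(f"{rank} {token:<4} observed={count:>6,} "
          f"predicted={predicted:>9,.1f} ratio={ratio:.2f}")
```

Every observed count after the first sits above the idealized prediction, so the head of this distribution falls off more slowly than a strict 1/rank curve. The overall shape, a steep drop followed by a very long tail, is what makes the distribution Zipf-like.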
Importantly, the most frequent tokens in this data are function words: words like “and,” “the,” etc. These words are the very sinew of language, and yet they’re so redundant and so context-dependent that it’s difficult to get a sense of what they mean. A great number of language models start with this very problem, including Claude Shannon’s mathematical theory of communication, the subject of our next chapter.