2. Python Basics#
This chapter overviews key functionality and concepts in Python. We will learn how to load files into Python, store data in various data structures, and iterate through that data. Additionally, we will discuss regular expressions for string matching and write our own functions.
Data: A plain text version of Mary Shelley’s Frankenstein
Credits: Portions of this chapter are adapted from the UC Davis DataLab’s Python Basics
2.1. Preliminaries#
To do our work, we will import an entire module and a single object from another module.
import re
from collections import Counter
Recall from the last chapter that importing an entire module gives us access to all of its functionality. We specify which part of that module we want to use with dot (.) notation.
re.findall
<function re.findall(pattern, string, flags=0)>
To use the object we imported, initialize it by calling it and assigning the result to a variable:
c = Counter()
c
Counter()
2.2. Loading Data#
Loading a file into your Python environment requires writing a path to the file’s location. Below, we assign a path to Mary Shelley’s Frankenstein, which currently sits in data/, a subdirectory of our current working directory.
path = "data/texts/shelley_frankenstein.txt"
Use open to open a connection to the file. This function requires you to specify a value for the mode argument. We use r because we are working with plain text data; rb would be for binary data.
fin = open(path, mode = "r")
With the connection established, read the data and assign it to a new variable:
frankenstein = fin.read()
Finally, close the file connection:
fin.close()
You can accomplish these operations with fewer lines of code using the with open pattern. There’s no need to close the file connection once you’ve read the data: using this pattern, Python will do it for you.
with open(path, mode = "r") as fin:
    frankenstein = fin.read()
Python represents plain text files as streams of characters. That is, every keystroke you would use to type out a text corresponds to a character. Calling len, or length, on our data makes this apparent:
len(frankenstein)
418917
The Penguin edition of Frankenstein is about 220 pages. Assuming each page has 350 words, that would put the book’s word count in the neighborhood of 77,000 words, far fewer than the number above. We see this larger number because Python is counting characters, not words.
2.3. Data Structures#
Usually, however, we want to work with words. This requires us to change how Python represents our text data. And, because the language has no inherent concept of what a word is, it falls on us to define how to make words out of characters. This process is called tokenization. Tokenizing a text means breaking its continuous sequence of characters into separate substrings, or tokens. There are many different ways to do this, but for now we start with a very simple approach: we break the character sequence along whitespace characters, such as the space itself, \n (for newlines), and \t (for tabs).
Use the .split() method to break frankenstein along whitespace characters:
tokens = frankenstein.split()
2.3.1. Lists#
The result of .split() is a list, a general-purpose, one-dimensional container for storing data. Lists are probably the most common data structure in Python. They make very few assumptions about the kind of data they store, and they store this data in an ordered manner. That is, lists have a first element, a second element, and so on, up to the full length of the list.
len(tokens)
74975
To select an element, or a group of elements, from a list, you index the list. The square brackets [ ] are Python’s index operator. Use them in conjunction with the index position of the element(s) you want to select. The index position is simply a number that corresponds to where in the list an element is located.
tokens[42]
'is'
Python uses zero-based indexing. That means the positions of elements are counted from 0, not 1.
tokens[0]
'Letter'
Use the colon : to select multiple elements.
tokens[10:20]
['17—.',
'You',
'will',
'rejoice',
'to',
'hear',
'that',
'no',
'disaster',
'has']
Omitting the starting position takes all elements in the list up to your index position:
tokens[:10]
['Letter',
'1',
'_To',
'Mrs.',
'Saville,',
'England._',
'St.',
'Petersburgh,',
'Dec.',
'11th,']
While leaving off the ending position takes all elements from an index position to the end of the list:
tokens[74970:]
['lost', 'in', 'darkness', 'and', 'distance.']
Alternatively, count backwards from the end of a list with a negative number:
tokens[-5:]
['lost', 'in', 'darkness', 'and', 'distance.']
Add another colon to take every n-th element in your selection. Below, we take every second element from index positions 100-200.
tokens[100:200:2]
['This',
'which',
'travelled',
'the',
'towards',
'I',
'advancing,',
'me',
'foretaste',
'those',
'climes.',
'by',
'wind',
'promise,',
'daydreams',
'more',
'and',
'I',
'in',
'to',
'persuaded',
'the',
'is',
'seat',
'frost',
'desolation;',
'ever',
'itself',
'my',
'as',
'region',
'beauty',
'delight.',
'Margaret,',
'sun',
'for',
'visible,',
'broad',
'just',
'the',
'and',
'a',
'splendour.',
'with',
'leave,',
'sister,',
'will',
'some',
'in',
'navigators—there']
Leave both the start and end positions unspecified to take every n-th element across the whole list:
tokens[::1000]
['Letter',
'stagecoach.',
'he',
'shape',
'promised',
'music.',
'plain',
'known.',
'sweet',
'reasoning,',
'time',
'repulsive',
'disciple;',
'yet',
'charnel-houses',
'filled',
'little',
'hardly',
'your',
'commenced',
'niece,',
'narrower',
'knew',
'being,',
'interpretation',
'hardly',
'true',
'often',
'me.',
'to',
'hell',
'But',
'as',
'earth.',
'for',
'more',
'13',
'days',
'various',
'and',
'the',
'‘Accursed',
'the',
'beast',
'the',
'flesh',
'place,',
'to',
'like',
'all',
'standing',
'irksome',
'of',
'miserable',
'trembling',
'ocean;',
'mounted',
'the',
'witnesses.',
'me.',
'that',
'her.',
'cousin',
'lessons',
'its',
'the',
'listened',
'which',
'even',
'misery.',
'them',
'terrible',
'continue',
'midnight;',
'loathing']
You can also use [ ] to create a list manually. Here’s an empty list:
[]
[]
And here is one with all kinds of data types—and another list! Lists can contain lists.
l = [8, "x", False, ["a", "b", "c"]]
To index an element in this sublist, you’ll need to select the index position of the sublist, then select the one for the element you want.
l[3][1]
'b'
You can set the element of a list by assigning a value at that index:
l[2] = True
l
[8, 'x', True, ['a', 'b', 'c']]
Assigning elements of a container is not without complication. Below, we use list(), another way of creating a list, to break a character string into individual pieces. We assign the output of this to x. Then, we create a new variable, y, from x.
x = list("abc")
y = x
y
['a', 'b', 'c']
Assigning a new value to an index position in x will propagate the change to y.
x[2] = "d"
print(x, y)
['a', 'b', 'd'] ['a', 'b', 'd']
Why did this happen? When you create a list and assign it to a variable, the variable points, or refers, to the location of the list in your computer’s memory. If you create a second variable from the first, both variables refer to the same list. As a result, operations called on one variable will affect the other, and vice versa.
When in doubt, use .copy() to prevent this.
x = list("abc")
y = x.copy()
x[2] = "d"
print(x, y)
['a', 'b', 'd'] ['a', 'b', 'c']
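You can verify this sharing directly with the is operator, which checks whether two variables refer to the same object (as opposed to ==, which compares contents). A small sketch:

```python
x = list("abc")
y = x         # y refers to the same list object as x
z = x.copy()  # z is a new list with equal contents

print(x is y)  # True: one object, two names
print(x is z)  # False: a distinct copy
print(x == z)  # True: contents match
```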
2.3.2. Tuples#
References can be confusing. If you know that the elements of a container shouldn’t change, you can also avoid the problem above by creating a tuple. Like a list, a tuple is a one-dimensional container for general data storage. The key difference is that tuples are immutable: once you create a tuple, you can alter neither it nor its elements.
Make a tuple by enclosing comma-separated values in parentheses ( ).
tup = (1, 2, 3)
tup
(1, 2, 3)
Alternatively, convert a list to a tuple using tuple().
x = list("abc")
x = tuple(x)
x
('a', 'b', 'c')
You will get an error if you attempt to change this tuple.
x[2] = "d"
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[30], line 1
----> 1 x[2] = "d"
TypeError: 'tuple' object does not support item assignment
2.3.3. Sets#
Unlike lists and tuples, sets cannot contain multiple instances of the same element; they only hold unique elements. Create them using curly brackets { } or set().
set_a = {"a", "b", "c"}
set_b = set("aabc")
set_a == set_b
True
Sets are useful containers for keeping track of features in your data. For example, converting our list of tokens to a set will automatically prune out all repeated tokens. The result will be a set of what in NLP are called types: the unique elements in a document. In effect, it is the vocabulary of the document.
types = set(tokens)
print("Number of types:", len(types))
Number of types: 11590
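Together, these two counts support a common corpus statistic, the type–token ratio, a rough measure of lexical diversity. This is an aside rather than something this chapter develops; here is a minimal sketch using a stand-in sentence instead of the full novel:

```python
# stand-in tokens; with the novel you would use the `tokens` list instead
toks = "the cat sat on the mat and the cat slept".split()

num_tokens = len(toks)        # total tokens: 10
num_types = len(set(toks))    # unique types: 7
ttr = num_types / num_tokens  # type-token ratio, between 0 and 1

print(round(ttr, 2))  # 0.7
```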
Sets also offer additional functionality for performing comparisons. We won’t touch on this too much in the following chapters, but it’s useful to know about. Given the following two sentences from Frankenstein, for example:
a = "I am surrounded by mountains of ice which admit of no escape and threaten every moment to crush my vessel."
b = "This ice is not made of such stuff as your hearts may be; it is mutable and cannot withstand you if you say that it shall not."
We split them into tokens and convert both to sets:
a = set(a.split())
b = set(b.split())
Now, we find their intersection. This is where the two sentences’ vocabularies overlap:
a.intersection(b)
{'and', 'ice', 'of'}
We can also find their difference, that is, the tokens in the first set that do not appear in the second:
a.difference(b)
{'I',
'admit',
'am',
'by',
'crush',
'escape',
'every',
'moment',
'mountains',
'my',
'no',
'surrounded',
'threaten',
'to',
'vessel.',
'which'}
Finally, we can build a new set that combines our two sets:
c = a.union(b)
c
{'I',
'This',
'admit',
'am',
'and',
'as',
'be;',
'by',
'cannot',
'crush',
'escape',
'every',
'hearts',
'ice',
'if',
'is',
'it',
'made',
'may',
'moment',
'mountains',
'mutable',
'my',
'no',
'not',
'not.',
'of',
'say',
'shall',
'stuff',
'such',
'surrounded',
'that',
'threaten',
'to',
'vessel.',
'which',
'withstand',
'you',
'your'}
The downside of sets, however, is that they are unordered. This means they cannot be indexed.
c[5]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[38], line 1
----> 1 c[5]
TypeError: 'set' object is not subscriptable
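If you do need positional access, one workaround (not shown above) is to impose an order yourself by converting the set to a sorted list:

```python
c = {"ice", "and", "of", "mountains"}  # a small stand-in set

# sorted() accepts any iterable, including sets, and returns a list
ordered = sorted(c)
print(ordered)     # ['and', 'ice', 'mountains', 'of']
print(ordered[0])  # 'and'
```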
2.3.4. Dictionaries#
Finally, there are dictionaries. Like sets, dictionaries store unique elements, but they associate those elements with a particular value. These can be individual values, like numbers, or containers, like lists, tuples, and so on. Every element in a dictionary is therefore a key–value pair. This makes dictionaries powerful data structures for associating values in your data with metadata of one kind or another.
Create a dictionary with curly brackets { } and colons : separating the key–value pairs.
counts = {"x": 4, "y": 1, "z": 6}
counts
{'x': 4, 'y': 1, 'z': 6}
Alternatively, use dict():
counts = dict(x = 4, y = 1, z = 6)
counts
{'x': 4, 'y': 1, 'z': 6}
Unlike sets, dictionaries can be indexed by their keys. This returns the value stored at a particular key.
counts["y"]
1
The .get() method also works.
counts.get("y")
1
You can set a default value to control for cases when the dictionary doesn’t have a requested key.
counts.get("a", None)
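That default makes .get() handy for counting things by hand, which is essentially what the Counter imported at the start of the chapter automates. A sketch with a stand-in token list:

```python
toks = ["ship", "sea", "ship", "ice"]  # stand-in tokens

freq = {}
for tok in toks:
    # unseen tokens fall back to 0, so a first occurrence yields 1
    freq[tok] = freq.get(tok, 0) + 1

print(freq)  # {'ship': 2, 'sea': 1, 'ice': 1}
```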
Assign a new value to a key to update it.
counts["z"] = counts["z"] - 1
counts
{'x': 4, 'y': 1, 'z': 5}
Or use the .update() method in conjunction with the curly bracket and colon syntax. Note that this is an in-place operation: you do not need to reassign the result to a variable.
counts.update({"x": 10})
counts
{'x': 10, 'y': 1, 'z': 5}
Either method also enables you to add new keys to a dictionary.
counts["w"] = 7
counts
{'x': 10, 'y': 1, 'z': 5, 'w': 7}
Using .pop() removes a key–value pair.
counts.pop("w")
counts
{'x': 10, 'y': 1, 'z': 5}
The .keys() method returns all keys in a dictionary. Functionally, this is a set.
counts.keys()
dict_keys(['x', 'y', 'z'])
Alternatively, the .values() method returns a dictionary’s values.
counts.values()
dict_values([10, 1, 5])
At the beginning of the chapter we imported a Counter object. This is a special kind of dictionary: it counts its input and stores the results as key–value pairs.
A Counter can work on characters:
Counter(frankenstein)
Counter({' ': 68672,
'e': 43982,
't': 28282,
'a': 25362,
'o': 23768,
'n': 23223,
'i': 20389,
's': 20132,
'r': 19603,
'h': 18909,
'd': 16233,
'l': 12164,
'm': 9958,
'u': 9880,
'c': 8490,
'f': 8188,
'y': 7442,
'\n': 7315,
'w': 7088,
'p': 5602,
'g': 5474,
',': 4956,
'b': 4528,
'v': 3681,
'I': 3091,
'.': 2920,
'k': 1592,
';': 971,
'x': 649,
'T': 550,
'“': 480,
'A': 351,
'j': 346,
'q': 313,
'”': 293,
'H': 287,
'M': 276,
'W': 274,
'S': 272,
'!': 238,
'?': 220,
'B': 219,
'z': 211,
'E': 190,
'C': 153,
'F': 151,
'’': 144,
'Y': 134,
'—': 124,
'-': 123,
'O': 107,
'D': 92,
'G': 90,
'_': 84,
'N': 77,
'L': 72,
'P': 69,
'J': 66,
':': 48,
'‘': 43,
'R': 39,
'V': 36,
'1': 35,
'K': 24,
'æ': 21,
'7': 16,
'U': 16,
'2': 15,
'(': 15,
')': 15,
'3': 6,
'ê': 6,
'8': 5,
'4': 4,
'5': 4,
'9': 4,
'[': 3,
']': 3,
'6': 3,
'ô': 2,
'0': 2,
'é': 1,
'è': 1})
But it also works on containers like lists, which makes Counters highly useful for our purposes. Below we calculate the token counts in Frankenstein…
token_freq = Counter(tokens)
…and use the .most_common() method to get the ten most frequent tokens in the novel:
top_ten = token_freq.most_common(10)
top_ten
[('the', 3897),
('and', 2903),
('I', 2719),
('of', 2634),
('to', 2072),
('my', 1631),
('a', 1338),
('in', 1071),
('was', 992),
('that', 974)]
2.4. Iteration#
The output of .most_common() is a list of tuples. You’ll see patterns like this frequently: data structures wrapping other data structures. While we could work with this list as we would any list, indexing it to retrieve tuples, which we could then index again, that would be inefficient for many operations. Moreover, it might require us to know in advance which elements are at what index positions. This information is not always easily available, especially when writing general-purpose code.
It would be better to work with our data in a more programmatic fashion. We can do this with the above containers because they are all iterables: that is, they enable us to step through each of their elements and do things like perform checks, run calculations, or even move elements to other parts of our code. This is called iterating through our data; each step is one iteration.
2.4.1. For-loops#
The standard method for advancing through an iterable is a for-loop. Even if you’ve never written a line of code before, you’ve probably heard of them. A for-loop begins with the for keyword, followed by:
A placeholder variable, which will be automatically assigned an element at the beginning of each iteration
The in keyword
An object with elements (an iterable)
A colon :
Code in the body of the loop must be indented. Four spaces of indentation is standard.
Below, we iterate through each tuple in top_ten. At the start of each iteration, a tuple is assigned to tup; we then print this tuple.
for tup in top_ten:
    print(tup)
('the', 3897)
('and', 2903)
('I', 2719)
('of', 2634)
('to', 2072)
('my', 1631)
('a', 1338)
('in', 1071)
('was', 992)
('that', 974)
For-loops can be nested inside of for-loops. Let’s re-implement the above with two for-loops.
for tup in top_ten:
    for part in tup:
        print(part)
    print("\n")
the
3897
and
2903
I
2719
of
2634
to
2072
my
1631
a
1338
in
1071
was
992
that
974
See how the outer print statement only triggers once the inner for-loop has finished? Every iteration of the first for-loop kicks off the second for-loop anew.
Within the indented portion of a for-loop you can perform checks and computations. In every iteration below, we assign the token in the tuple to a variable tok and its value to val. Then, we check whether val is even. If it is, we print tok and val.
for tup in top_ten:
    tok, val = tup[0], tup[1]
    if val % 2 == 0:
        print(tok, val)
of 2634
to 2072
a 1338
was 992
that 974
Oftentimes you want to save the result of a check. The easiest way to do this is by creating a new, empty list and using .append() to add elements to it.
is_even = []
for tup in top_ten:
    val = tup[1]
    if val % 2 == 0:
        is_even.append(tup)
print(is_even)
[('of', 2634), ('to', 2072), ('a', 1338), ('was', 992), ('that', 974)]
Other data structures are iterable in Python. In addition to lists, you’ll find yourself iterating through dictionaries with some frequency. Use .keys() or .values() to iterate, respectively, through the keys and values of a dictionary. Or, use .items() to iterate through both at the same time. Note that .items() requires using two placeholder variables separated by a comma.
for key, value in counts.items():
    print(key, "->", value)
x -> 10
y -> 1
z -> 5
Below, we divide every count in token_freq by the total number of tokens in Frankenstein to express counts as relative frequencies, using a new Counter to store our results.
num_tokens = token_freq.total()
percentages = Counter()
for token, count in token_freq.items():
    percent = count / num_tokens
    percentages[token] = percent
Here is the equivalent of top_ten, but with relative frequencies:
percentages.most_common(10)
[('the', 0.05197732577525842),
('and', 0.038719573191063686),
('I', 0.03626542180726909),
('of', 0.035131710570190065),
('to', 0.027635878626208737),
('my', 0.021753917972657553),
('a', 0.01784594864954985),
('in', 0.014284761587195731),
('was', 0.013231077025675225),
('that', 0.012990996998999667)]
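A quick sanity check on this kind of normalization: the values should sum to (roughly) 1. Since we don’t reload the novel here, this sketch uses a toy Counter and sum(...) in place of .total():

```python
from collections import Counter

# a toy Counter standing in for the novel's token_freq
token_freq = Counter(["the", "the", "ice", "sea"])
num_tokens = sum(token_freq.values())  # the same value .total() would give

proportions = Counter()
for token, count in token_freq.items():
    proportions[token] = count / num_tokens

# the relative frequencies of all tokens should sum to 1
print(sum(proportions.values()))
```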
2.4.2. Comprehensions#
Comprehensions are idiomatic to Python. They allow you to perform operations across an iterable without needing to pre-allocate an empty copy to store the results. This makes them both concise and efficient. You will most frequently see comprehensions used in the context of lists (i.e. “list comprehensions”), but you can also use them for dictionaries and sets.
The syntax for a comprehension includes the keywords for and in, just like a for-loop. The difference is that, in a list comprehension, the repeated code comes before the for keyword rather than after it, and the entire expression is enclosed in square brackets [ ].
Below, we use the .istitle() method to find capitalized tokens in Frankenstein. This method returns a Boolean value, so the resultant list will contain True and False values that specify capitalization at a certain index.
is_title = [token.istitle() for token in tokens]
is_title[:10]
[True, False, True, True, True, True, True, True, True, False]
That should be straightforward enough, but we don’t know which tokens these values reference. With comprehensions, an easy way around this is to use an if statement embedded in the comprehension. Put that statement and a conditional check at the end of the comprehension to filter a list.
is_title = [token for token in tokens if token.istitle()]
is_title[:10]
['Letter',
'_To',
'Mrs.',
'Saville,',
'England._',
'St.',
'Petersburgh,',
'Dec.',
'You',
'I']
Comprehensions become particularly powerful when you use them to manipulate each element in an iterable. Below, we change all tokens to their lowercase variants using .lower().
lowercase = [token.lower() for token in tokens]
lowercase[:10]
['letter',
'1',
'_to',
'mrs.',
'saville,',
'england._',
'st.',
'petersburgh,',
'dec.',
'11th,']
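As mentioned at the start of this section, comprehensions also exist for sets and dictionaries; only the brackets and, for dictionaries, a key: value expression change. A small sketch with stand-in tokens:

```python
toks = ["The", "sea", "the", "Ice", "sea"]  # stand-in tokens

# set comprehension: the unique lowercased types
types = {tok.lower() for tok in toks}

# dictionary comprehension: map each type to its character length
lengths = {tok: len(tok) for tok in types}

print(types)
print(lengths)
```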
2.4.3. While-loops#
While-loops continue iterating for as long as a condition holds. Whereas a for-loop iterates through your data once, a while-loop can iterate indefinitely. That means you need to specify an exit condition to break out of your while-loop; otherwise your code will get trapped, and eventually your computer will kill the process.
The syntax for a while-loop is quite simple: start it with the while keyword and a condition. Below, we increment a counter to print the first ten tokens in Frankenstein.
current_index = 0
while current_index < 10:
    print(tokens[current_index])
    current_index += 1
Letter
1
_To
Mrs.
Saville,
England._
St.
Petersburgh,
Dec.
11th,
Note that we must specify, and then manually increment, the counter. If we didn’t, the loop would have no reference telling it when it should break.
Here is a more open-ended loop. We set the condition to True, keeping the loop running until we reach an exit condition. Then, for each iteration, we index our list of tokens and check whether the token at that index matches the one we’re looking for. If it does, the code prints that index position and stops the iteration with a break statement. If it doesn’t, we increment the counter and try again.
find_first = "Frankenstein"
current_index = 0
while True:
    token = tokens[current_index]
    if token == find_first:
        print("The first occurrence of", find_first, "is at", current_index)
        break
    current_index += 1
The first occurrence of Frankenstein is at 18961
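For this specific search, lists also offer a shortcut: the .index() method returns the position of the first matching element, raising a ValueError when there is none. A sketch on a stand-in list:

```python
toks = ["the", "ship", "sailed", "ship"]  # stand-in tokens

# .index() reports only the first match, like the while-loop above
first = toks.index("ship")
print(first)  # 1
```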
2.5. Regular Expressions#
You have likely noticed by now that our tokenization strategy has created some strange tokens. Most notably, punctuation sticks to words because there was no whitespace to separate them. This means that, for our Counter, the following two tokens are counted separately, even though they’re the same word:
variants = ["ship,", "ship."]
for tok in variants:
    print(tok, "->", token_freq[tok])
ship, -> 2
ship. -> 2
We can handle this in a number of ways. Many rely on writing out regular expressions, or regexes. Regexes are special sequences of characters that represent patterns for matching in text; these sequences consist of regular old characters, or literals, and metacharacters, special characters that stand for whole classes of literals. Regexes work as a search mechanism, and they become highly useful in text processing for their ability to find variants like the tokens above.
2.5.1. Literals#
The following regex will match on the string “ship”:
ship = r"ship"
Note how we prepend our string with r. That tells Python to treat the string as a raw string, so backslashes are not interpreted as escape characters, which matters for regexes. Using findall() from the re module will return a list of all matches on this regex:
re.findall(ship, frankenstein)
['ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship',
'ship']
When you use literals like this, Python will match only on the exact sequence. But that’s a problem for us, because there’s no way to know whether the above output refers to “ship” and any following punctuation, or if our regex has also matched on words that contain “ship,” like “relationship” and “shipment.”
The latter will most certainly be the case. We’ll see this if we search with finditer(). It finds all matches and also returns where they start and end in the character sequence (each object returned is a Match). Below, we use those start/end positions to glimpse the context of matches.
found = re.finditer(ship, frankenstein)
for match in found:
    # Get the match text
    span = match.group()
    # Get its start and end, then offset both
    start = match.start() - 2
    end = match.end() + 2
    # Ensure our expanded start/end locations don't overshoot the string
    if start < 0:
        start = 0
    if end > len(frankenstein):
        end = len(frankenstein)
    print(span, "->", frankenstein[start:end])
ship -> rdship.
ship -> a ship t
ship -> e
ship f
ship -> d ship:
ship -> e ship o
ship -> r ship.
ship -> ndship.
ship -> ndship a
ship -> orship i
ship -> onship,
ship -> ndship t
ship -> rdship,
ship -> ndship,
ship -> onships
ship -> ndship?
ship -> onship w
ship -> e shippi
ship -> ndship w
ship -> rdships
ship -> e ship,
ship -> ndship o
ship -> rdships.
ship -> r ship.
ship -> rdships
ship -> rdships
ship -> r ship,
ship -> rdships.
ship -> owship,
2.5.2. Metacharacters#
Controlling for cases where our regex returns more than what we want requires metacharacters.
The . metacharacter stands for any character except a newline \n.
re.findall(r"ship.", frankenstein)
['ship.',
'ship ',
'ship ',
'ship:',
'ship ',
'ship.',
'ship.',
'ship ',
'ship ',
'ship,',
'ship ',
'ship,',
'ship,',
'ships',
'ship?',
'ship ',
'shipp',
'ship ',
'ships',
'ship,',
'ship ',
'ships',
'ship.',
'ships',
'ships',
'ship,',
'ships',
'ship,']
If you want the literal period ., you need to escape it with the backslash character \.
re.findall(r"ship\.", frankenstein)
['ship.', 'ship.', 'ship.', 'ship.']
Note that this won’t work:
re.findall(r"\", frankenstein)
Cell In[71], line 1
re.findall(r"\", frankenstein)
^
SyntaxError: unterminated string literal (detected at line 1)
Instead, escape the escape character:
re.findall(r"\\", frankenstein)
[]
There are no such characters in this text, however.
Use + as a repetition operator to match one or more occurrences of the preceding character. If we use it with ., it returns strings that run to the ends of lines (since . matches everything except a newline).
re.findall(r"ship.+", frankenstein)
['ship. I',
'ship there, which can easily be done by paying the',
'ship for his gentleness and the mildness of his discipline. This',
'ship: I have never believed it to be',
'ship on all sides, scarcely leaving her the sea-room in which',
'ship. We, however, lay to until the',
'ship. You have hope, and the world before you, and have no cause for',
'ship and',
'ship in his attachment to my mother, differing wholly from the',
'ship, and',
'ship to one among them. Henry',
'ship, and even danger for',
'ship, nor the beauty of earth, nor of',
'ships which',
'ship? I resolved, at least, not to despair, but in every way',
'ship with them. Yet even thus I',
'shipping for London. During this',
'ship was of that',
'ships',
'ship, but he escaped, I know not how.',
'ship of the villagers',
'ships. During the day I was',
'ship. I had determined, if you were going southwards,',
'ships into a death which I still dread, for my task is unfulfilled.',
'ships that I have undergone?',
'ship, brought to me a',
'ships.',
'ship, and I was still spurned. Was there no']
Related to + are ? and *. The first means “match zero or one”, while the second means “match zero or more”. An example of * is below. Note we change our regex slightly to demonstrate the zero matching.
re.findall(r"ship*", frankenstein)
['ship',
'ship',
'shi',
'ship',
'ship',
'shi',
'ship',
'ship',
'shi',
'shi',
'shi',
'ship',
'shi',
'ship',
'ship',
'ship',
'ship',
'ship',
'shi',
'shi',
'shi',
'shi',
'shi',
'shi',
'shi',
'shi',
'shi',
'shi',
'ship',
'shi',
'shi',
'shi',
'shi',
'shi',
'shi',
'ship',
'ship',
'shi',
'shi',
'shi',
'shi',
'shi',
'shi',
'ship',
'shi',
'shi',
'shipp',
'shi',
'shi',
'ship',
'shi',
'shi',
'shi',
'shi',
'shi',
'shi',
'shi',
'shi',
'shi',
'shi',
'ship',
'ship',
'ship',
'ship',
'shi',
'ship',
'ship',
'ship',
'ship',
'ship',
'shi',
'shi',
'ship']
Use curly brackets { } in conjunction with numbers to specify how many repetitions you want. Here is “match three to five”:
re.findall(r"ship.{3,5}", frankenstein)
['ship. I',
'ship ther',
'ship for ',
'ship: I h',
'ship on a',
'ship. We,',
'ship. You',
'ship and',
'ship in h',
'ship, and',
'ship to o',
'ship, and',
'ship, nor',
'ships whi',
'ship? I r',
'ship with',
'shipping ',
'ship was ',
'ship, but',
'ship of t',
'ships. Du',
'ship. I h',
'ships int',
'ships tha',
'ship, bro',
'ship, and']
Want to constrain your search to particular characters? Parentheses ( ) specify groups of characters, including metacharacters. Use them in conjunction with the “or” operator | to get two (or more) variants of a string.
re.findall(r"(ship\.|ship,)", frankenstein)
['ship.',
'ship.',
'ship.',
'ship,',
'ship,',
'ship,',
'ship,',
'ship.',
'ship,',
'ship,']
Or, use square brackets [ ] to specify literals following an “or” logic, e.g. “character X or character Y or…”.
re.findall(r"ship[.,]", frankenstein)
['ship.',
'ship.',
'ship.',
'ship,',
'ship,',
'ship,',
'ship,',
'ship.',
'ship,',
'ship,']
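The bracket expression above captures punctuation variants, but it still wouldn’t prevent matches inside longer words like “relationship.” The word-boundary class \b, which this chapter doesn’t otherwise cover, matches the empty position between a word character and a non-word character, and solves exactly that problem. A small sketch on a stand-in string:

```python
import re

text = "The ship, our friendship, and a shipment."

# without boundaries, "ship" also matches inside longer words
print(re.findall(r"ship", text))      # ['ship', 'ship', 'ship']

# \b on both sides restricts matches to the standalone word
print(re.findall(r"\bship\b", text))  # ['ship']
```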
Note that literals are case-sensitive.
re.findall(r"[Ss]everal", frankenstein)
['several',
'several',
'Several',
'several',
'several',
'several',
'several',
'several',
'several',
'several',
'several',
'several',
'several',
'several',
'several',
'Several',
'several',
'several',
'several',
'Several',
'several',
'Several',
'several',
'several',
'several',
'several',
'several',
'several',
'several',
'Several',
'several',
'several',
'several',
'several',
'Several',
'several',
'several',
'several',
'several',
'several',
'Several',
'several',
'several',
'Several',
'several',
'several',
'several',
'several',
'several',
'several',
'several']
Including a space character is valid here:
re.findall(r"ship[., ]", frankenstein)
['ship.',
'ship ',
'ship ',
'ship ',
'ship.',
'ship.',
'ship ',
'ship ',
'ship,',
'ship ',
'ship,',
'ship,',
'ship ',
'ship ',
'ship,',
'ship ',
'ship.',
'ship,',
'ship,']
But you can also use \s. This specifies a character class: a whole type of character (in this case, whitespace).
re.findall(r"ship[.,\s]", frankenstein)
['ship.',
'ship ',
'ship ',
'ship ',
'ship.',
'ship.',
'ship ',
'ship ',
'ship,',
'ship ',
'ship,',
'ship,',
'ship ',
'ship ',
'ship,',
'ship ',
'ship.',
'ship,',
'ship,']
Below, we find all whitespace sequences (single spaces, but also runs of spaces and newlines) in the novel:
re.findall(r"\s+", frankenstein)
[' ',
 '\n\n',
 ' ',
 ' ',
 ' ',
 '\n\n\n',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n\n\n',
 ...]
' ',
' ',
' ',
'\n',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
'\n',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
'\n',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
'\n',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
'\n\n',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
'\n',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
'\n',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
...]
There are also character classes for digits, \d, and word characters, \w,
which covers alphanumerics plus the underscore. Here is an example with
digits, which you could use to find chapter breaks:
re.findall(r"Chapter \d+", frankenstein)
['Chapter 1',
'Chapter 2',
'Chapter 3',
'Chapter 4',
'Chapter 5',
'Chapter 6',
'Chapter 7',
'Chapter 8',
'Chapter 9',
'Chapter 10',
'Chapter 11',
'Chapter 12',
'Chapter 13',
'Chapter 14',
'Chapter 15',
'Chapter 16',
'Chapter 17',
'Chapter 18',
'Chapter 19',
'Chapter 20',
'Chapter 21',
'Chapter 22',
'Chapter 23',
'Chapter 24']
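Since findall() returns a list, passing the result to len() counts the
matches. Below is a minimal sketch on a stand-in snippet; running the same
call on the full novel returns 24 matches.

```python
import re

# A stand-in snippet instead of the full novel
sample = "Chapter 1\nsome text\nChapter 2\nmore text\nChapter 10\nthe end\n"
chapters = re.findall(r"Chapter \d+", sample)
print(len(chapters))  # 3
```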
Using \w, the pattern below matches runs of word characters followed by a
newline.
re.findall(r"\w+\n", frankenstein)
['1\n',
'_\n',
'the\n',
'evil\n',
'assure\n',
'success\n',
'of\n',
'which\n',
'this\n',
'towards\n',
...]
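The pattern is easier to see on a short string: \w+\n grabs the final word of
each line along with its trailing newline. Here is a sketch on a made-up
snippet:

```python
import re

sample = "first line\nsecond line\n"
# Each match is the last word of a line plus the newline character
print(re.findall(r"\w+\n", sample))  # ['line\n', 'line\n']
```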
Inside a character class, the caret ^ negates the class: the pattern matches
any character not listed after it. (Outside a character class, ^ instead
anchors a match to the start of a string.) Below, we select characters that
are neither word characters nor whitespace.
re.findall(r"[^\w\s]+", frankenstein)
['.',
',',
'.',
'.',
',',
'.',
',',
'—.',
'.',
',',
...]
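The caret's two roles are easy to confuse, so here is a small sketch on a
made-up string contrasting the anchor use with the character-class negation:

```python
import re

lines = "Letter 1\nTo Mrs. Saville, England\n"

# Outside a character class, ^ anchors the match to the start of the string
print(re.findall(r"^Letter", lines))            # ['Letter']

# With re.MULTILINE, ^ anchors to the start of every line
print(re.findall(r"^To", lines, re.MULTILINE))  # ['To']

# Inside [...], ^ negates the class instead
print(re.findall(r"[^\w\s]+", lines))           # ['.', ',']
```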
The sub() function substitutes regex matches with another sequence. Using the
same pattern as above, we can remove all punctuation. Note that we also need
to tack on the underscore as an extra alternative: \w counts it as a word
character, so [^\w\s] alone would leave underscores in place.
cleaned = re.sub(r"[^\w\s]+|_", " ", frankenstein)
This is one way of getting around those variants from above.
token_freq = Counter(cleaned.split())
variants = ["ship,", "ship.", "ship"]
for tok in variants:
    print(tok, "->", token_freq.get(tok, None))
ship, -> None
ship. -> None
ship -> 8
We actually scooped up even more tokens from this substitution pattern. An even
better picture of our counts would emerge if we changed our text to lowercase
so that the Counter
can count case variants together.
cleaned = cleaned.lower()
token_freq = Counter(cleaned.split())
print("Unique tokens after substitution and case change:", len(token_freq))
Unique tokens after substitution and case change: 7003
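With punctuation stripped and everything lowercased, a Counter also gives a
quick view of the top tokens via its most_common() method. A sketch on a
stand-in sentence rather than the full novel:

```python
from collections import Counter

sample = "The monster saw the lake and the mountains"
token_freq = Counter(sample.lower().split())
# most_common(n) returns the n highest-count (token, count) pairs
print(token_freq.most_common(1))  # [('the', 3)]
```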
We’ll leave off text preprocessing for now but will pick it up in the next chapter. We’ve covered most of the core regex constructs, though there are a few more advanced ones that you may find useful. See this cheatsheet for an extensive overview.
2.6. Functions#
So far we have relied on external functions, but we can also write our own. Writing functions greatly reduces redundancy in code, because you can reuse them as many times as you want, in whatever contexts you want. Moreover, functions keep your code organized. In complex processes, it often helps to break your code into individual steps and associate each with a function. That also makes rewriting code much easier later on.
Before we write a function, here is a review of vocabulary associated with them:
The placeholder variables for inputs are parameters
Arguments are the values assigned to parameters during a call
To call a function means using it to compute something
The body is the code inside a function
A function’s scope is the local context in which it runs code
The return value is the output of a function
In Python, a function begins with the def
keyword, followed by:
The name of the function
A list of parameters surrounded by parentheses
A colon
:
There is no practical limit to the number of parameters. Code in the body of
the function should be indented according to the same conventions for loops and
conditionals (four spaces). To return a result from the function, use the
return
keyword.
Here is a very simple function that returns True
/False
depending on whether
text
starts with character
.
def starts_with(text, character):
    first_char = text[0]
    return first_char == character
Call the function by writing out its name and supplying it with arguments.
starts_with("Book", "B")
True
Here are some more examples:
starts_with("natural language processing", "w")
False
to_test = [("token", "t"), ("Character", "c")]
for testing_pair in to_test:
    text = testing_pair[0]
    character = testing_pair[1]
    text_starts_with_char = starts_with(text, character)
    print(text, "starts with", character, "is", text_starts_with_char)
token starts with t is True
Character starts with c is False
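As a small design note, the loop above can be tightened with tuple unpacking,
which assigns both elements of each pair directly in the for statement. A
sketch, with starts_with() redefined so the block is self-contained:

```python
def starts_with(text, character):
    first_char = text[0]
    return first_char == character

to_test = [("token", "t"), ("Character", "c")]
# Unpack each (text, character) pair in the loop header
for text, character in to_test:
    print(text, "starts with", character, "is", starts_with(text, character))
```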
When we speak of a function’s scope, we are talking about the local context for that function. Typically, variables inside the function or its arguments are not the same as the ones you set elsewhere in your code, even if the names of those variables match.
For instance, starts_with()
creates a new variable, first_char
. If we have
another variable with the same name outside this function, it won’t be
overwritten when we call starts_with()
.
first_char = "5"
print("Value of first_char:", first_char)
starts_with("Book", "B")
print("Value of first_char:", first_char)
Value of first_char: 5
Value of first_char: 5
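The flip side of scope is that a function can read a global variable it never
assigns, while assigning to a name inside a function always creates a new
local. A minimal sketch with hypothetical names:

```python
greeting = "Hello"

def read_global():
    # No assignment to greeting here, so Python reads the global
    return greeting + ", Walton"

def shadow_global():
    # Assignment creates a local greeting; the global is untouched
    greeting = "Farewell"
    return greeting

print(read_global())    # Hello, Walton
print(shadow_global())  # Farewell
print(greeting)         # Hello
```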
You’ll often find yourself transforming code you’ve already written into a function. The function below re-implements the for-loop we wrote above to print regex matches and their context. It has three parameters:
The regex match is match
The text where we found the match is string
Our offset is the number of characters to extract on either side of the match
def show_match_context(match, string, offset):
    # Get the match text
    span = match.group()
    # Get its start and end, then offset both
    start = match.start() - offset
    end = match.end() + offset
    # Ensure our expanded start/end locations don't overshoot the string
    if start < 0:
        start = 0
    if end > len(string):
        end = len(string)
    # Print the results
    print(span, "->", string[start:end])
With our function defined, we can call it once.
match = re.search(r"lightning", frankenstein)
show_match_context(match, frankenstein, 5)
lightning -> the lightning play
Or as many times as we please.
found = re.finditer(r"lightning", frankenstein)
for match in found:
    show_match_context(match, frankenstein, 5)
lightning -> the lightning play
lightning -> s of lightning dazz
lightning -> h of lightning
illu
lightning -> llid
lightnings tha
lightning -> s of lightning,
plu
It’s somewhat annoying to have to write out the offset every time we call the function. To circumvent this, you can specify a default value for a parameter. Your function will use that if you do not supply an argument for that parameter.
def show_match_context(match, string, offset=5):
    # Get the match text
    span = match.group()
    # Get its start and end, then offset both
    start = match.start() - offset
    end = match.end() + offset
    # Ensure our expanded start/end locations don't overshoot the string
    if start < 0:
        start = 0
    if end > len(string):
        end = len(string)
    # Print the results
    print(span, "->", string[start:end])
With no argument supplied:
match = re.search(r"lightning", frankenstein)
show_match_context(match, frankenstein)
lightning -> the lightning play
Supplying an argument:
match = re.search(r"lightning", frankenstein)
show_match_context(match, frankenstein, 15)
lightning -> yage I saw the lightning playing on the
Recall that the output of finditer()
is a special Match
object, which has
properties that extend beyond typical strings. If you forget this and try to
call your function on an object it doesn’t expect, you’ll run into an error:
show_match_context("lightning", frankenstein)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[99], line 1
----> 1 show_match_context("lightning", frankenstein)
Cell In[96], line 3, in show_match_context(match, string, offset)
1 def show_match_context(match, string, offset = 5):
2 # Get the match text
----> 3 span = match.group()
5 # Get its start and end, then offset both
6 start = match.start() - offset
AttributeError: 'str' object has no attribute 'group'
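One way to fail faster with a clearer message is to check the argument's type
at the top of the function. This is a sketch of that idea, not part of the
chapter's original function; note that isinstance() accepts re.Match on
Python 3.8 and later:

```python
import re

def show_match_context(match, string, offset=5):
    # Fail early with a clearer message than the AttributeError above
    if not isinstance(match, re.Match):
        raise TypeError("match must be an re.Match, got " + type(match).__name__)
    start = max(match.start() - offset, 0)
    end = min(match.end() + offset, len(string))
    print(match.group(), "->", string[start:end])
```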
The more code you write, the harder it is to keep this sort of thing in your
mind. That’s why it’s helpful to document your function with a docstring.
Docstrings are descriptions of what your function does and what kinds of
parameters it expects. They go on the first line of your function’s body and
are surrounded by triple quotes """
.
def starts_with(text, character):
    """Determine whether a string starts with a character."""
    first_char = text[0]
    return first_char == character
Once you’ve written a docstring, you can use help()
in a Python console or
?
in a Jupyter Notebook to display this information.
help(starts_with)
Help on function starts_with in module __main__:

starts_with(text, character)
    Determine whether a string starts with a character.
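Behind the scenes, help() reads the string stored on the function's __doc__
attribute, which you can also inspect directly:

```python
def starts_with(text, character):
    """Determine whether a string starts with a character."""
    first_char = text[0]
    return first_char == character

# help() displays the contents of __doc__
print(starts_with.__doc__)
```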
There are several styles for writing docstrings, but the NumPy conventions are good ones. They specify docstrings like so:
def func(x, y):
    """A short summary description of a function ending with a period.

    A longer description if necessary.

    Parameters
    ----------
    x : x's datatype
        A description of what x is
    y : y's datatype
        A description of what y is

    Returns
    -------
    value : value's datatype
        A description of what value is (only supply if the function returns a
        value)
    """
Let’s document show_match_context()
with a docstring.
def show_match_context(match, string, offset=5):
    """Print a regex match's surrounding characters.

    Parameters
    ----------
    match : re.Match
        A regex match from re.search() or re.finditer()
    string : str
        The string in which the match was found
    offset : int
        The number of surrounding characters to the left/right of match
    """
    # Get the match text
    span = match.group()
    # Get its start and end, then offset both
    start = match.start() - offset
    end = match.end() + offset
    # Ensure our expanded start/end locations don't overshoot the string
    if start < 0:
        start = 0
    if end > len(string):
        end = len(string)
    # Print the results
    print(span, "->", string[start:end])
Now our function is fully documented and ready for later use.
help(show_match_context)
Help on function show_match_context in module __main__:

show_match_context(match, string, offset=5)
    Print a regex match's surrounding characters.

    Parameters
    ----------
    match : re.Match
        A regex match from re.search() or re.finditer()
    string : str
        The string in which the match was found
    offset : int
        The number of surrounding characters to the left/right of match