2. Python Basics#

This chapter overviews key functionality and concepts in Python. We will learn how to load files into Python, store data in various data structures, and iterate through that data. Additionally, we will discuss regular expressions for string matching and write our own functions.

  • Data: A plain text version of Mary Shelley’s Frankenstein

  • Credits: Portions of this chapter are adapted from the UC Davis DataLab’s Python Basics

2.1. Preliminaries#

To do our work, we will import an entire module and a single object from another module.

import re
from collections import Counter

Recall from the last chapter that importing an entire module gives us access to all of its functionality. We specify which part of that module we want to use with dot . notation.

re.findall
<function re.findall(pattern, string, flags=0)>

To use the object we imported, create an instance of it by calling it and assigning the result to a variable:

c = Counter()
c
Counter()
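To preview what a Counter does once it has data to count, here is a small sketch using a toy string (not part of our corpus):

```python
from collections import Counter

# Count the characters in a toy string (a stand-in for real data)
c = Counter("mississippi")
c.most_common(2)  # [('i', 4), ('s', 4)]
```

We will return to Counter in earnest later in the chapter, once we have tokens to count.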

2.2. Loading Data#

Loading a file into your Python environment requires writing a path to the file’s location. Below, we assign a path to Mary Shelley’s Frankenstein, which currently sits in data/texts/, a subdirectory of our current working directory.

path = "data/texts/shelley_frankenstein.txt"

Use open to open a connection to the file. This function requires you to specify a value for the mode argument. We use "r" because we are working with plain text data; "rb" would be for binary data.

fin = open(path, mode = "r") 

With the connection established, read the data and assign it to a new variable:

frankenstein = fin.read()

Finally, close the file connection:

fin.close()

You can accomplish these operations with fewer lines of code using the with open pattern. With this pattern, there’s no need to close the file connection once you’ve read the data: Python will do it for you.

with open(path, mode = "r") as fin:
    frankenstein = fin.read()
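If you want to confirm that the with pattern really closes the connection, here is a small self-contained sketch. It writes a throwaway file first, so the filename example.txt is hypothetical, not part of our data:

```python
# Write a tiny throwaway file so the example is self-contained
with open("example.txt", mode="w") as fout:
    fout.write("hello")

# Read it back using the with-open pattern
with open("example.txt", mode="r") as fin:
    data = fin.read()

# The file object still exists, but Python has closed the connection
fin.closed  # True
```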

Python represents plain text files as streams of characters. That is, every keystroke you would use to type out a text corresponds to a character. Calling len, or length, on our data makes this apparent:

len(frankenstein)
418917

The Penguin edition of Frankenstein is about 220 pages. Assuming each page has 350 words, that would put the book’s word count in the neighborhood of 77,000 words, far fewer than the number above. We see the larger number because Python is counting characters, not words.

2.3. Data Structures#

Usually, however, we want to work with words. This requires us to change how Python represents our text data. And, because Python has no inherent concept of what a word is, it falls on us to define how to make words out of characters. This process is called tokenization. Tokenizing a text means breaking its continuous sequence of characters into separate substrings, or tokens. There are many ways to do this, but for now we start with a very simple approach: we break the character sequence on whitespace characters. These include the space character itself, but also \n (for new lines), \t (for tabs), and so on.

Use the .split() method to break frankenstein along whitespace characters:

tokens = frankenstein.split()
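To see what .split() does on a small scale, consider this toy sentence (not drawn from the novel). With no arguments, .split() breaks on any run of whitespace, tabs included:

```python
# .split() with no arguments breaks on any run of whitespace
sentence = "It was on a dreary\tnight of November"
sentence.split()  # ['It', 'was', 'on', 'a', 'dreary', 'night', 'of', 'November']
```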

2.3.1. Lists#

The result of .split() is a list, a general-purpose, one-dimensional container for storing data. Lists are probably the most common data structure in Python. They make very few assumptions about the kind of data they store, and they store this data in an ordered manner. That is, lists have a first element, a second element, and so on, up to the full length of the list.

len(tokens)
74975

To select an element, or a group of elements, from a list, you index the list. The square brackets [ ] are Python’s index operator. Use them in conjunction with the index position of the element(s) you want to select. The index position is simply a number that corresponds to where in the list an element is located.

tokens[42]
'is'

Python uses zero-based indexing. That means the positions of elements are counted from 0, not 1.

tokens[0]
'Letter'

Use the colon : to select multiple elements.

tokens[10:20]
['17—.',
 'You',
 'will',
 'rejoice',
 'to',
 'hear',
 'that',
 'no',
 'disaster',
 'has']

Leaving off the starting position takes all elements of the list up to your index position:

tokens[:10]
['Letter',
 '1',
 '_To',
 'Mrs.',
 'Saville,',
 'England._',
 'St.',
 'Petersburgh,',
 'Dec.',
 '11th,']

While leaving off the ending position takes all elements from an index position to the end of the list:

tokens[74970:]
['lost', 'in', 'darkness', 'and', 'distance.']

Alternatively, count backwards from the end of a list with a negative number:

tokens[-5:]
['lost', 'in', 'darkness', 'and', 'distance.']

Add another colon to take every n-th element in your selection. Below, we take every second element from index positions 100-200.

tokens[100:200:2]
['This',
 'which',
 'travelled',
 'the',
 'towards',
 'I',
 'advancing,',
 'me',
 'foretaste',
 'those',
 'climes.',
 'by',
 'wind',
 'promise,',
 'daydreams',
 'more',
 'and',
 'I',
 'in',
 'to',
 'persuaded',
 'the',
 'is',
 'seat',
 'frost',
 'desolation;',
 'ever',
 'itself',
 'my',
 'as',
 'region',
 'beauty',
 'delight.',
 'Margaret,',
 'sun',
 'for',
 'visible,',
 'broad',
 'just',
 'the',
 'and',
 'a',
 'splendour.',
 'with',
 'leave,',
 'sister,',
 'will',
 'some',
 'in',
 'navigators—there']

Leave the first selection unspecified to take every n-th element across the whole list:

tokens[::1000]
['Letter',
 'stagecoach.',
 'he',
 'shape',
 'promised',
 'music.',
 'plain',
 'known.',
 'sweet',
 'reasoning,',
 'time',
 'repulsive',
 'disciple;',
 'yet',
 'charnel-houses',
 'filled',
 'little',
 'hardly',
 'your',
 'commenced',
 'niece,',
 'narrower',
 'knew',
 'being,',
 'interpretation',
 'hardly',
 'true',
 'often',
 'me.',
 'to',
 'hell',
 'But',
 'as',
 'earth.',
 'for',
 'more',
 '13',
 'days',
 'various',
 'and',
 'the',
 '‘Accursed',
 'the',
 'beast',
 'the',
 'flesh',
 'place,',
 'to',
 'like',
 'all',
 'standing',
 'irksome',
 'of',
 'miserable',
 'trembling',
 'ocean;',
 'mounted',
 'the',
 'witnesses.',
 'me.',
 'that',
 'her.',
 'cousin',
 'lessons',
 'its',
 'the',
 'listened',
 'which',
 'even',
 'misery.',
 'them',
 'terrible',
 'continue',
 'midnight;',
 'loathing']

You can also use [ ] to create a list manually. Here’s an empty list:

[]
[]

And here is one with all kinds of data types—and another list! Lists can contain lists.

l = [8, "x", False, ["a", "b", "c"]]

To index an element in this sublist, first select the index position of the sublist, then the position of the element you want within it.

l[3][1]
'b'

You can set the element of a list by assigning a value at that index:

l[2] = True
l
[8, 'x', True, ['a', 'b', 'c']]

Assigning elements of a container is not without complication. Below, we use list()—another method of creating a list—to break a character string into individual pieces. We assign the output of this to x. Then, we create a new variable, y, from x.

x = list("abc")
y = x
y
['a', 'b', 'c']

Assigning a new value to an index position in x will propagate the change to y.

x[2] = "d"
print(x, y)
['a', 'b', 'd'] ['a', 'b', 'd']

Why did this happen? When you create a list and assign it to a variable, the variable points, or refers, to the location of the list in your computer’s memory. Creating a second variable from the first does not copy the data; both variables refer to the same list. As a result, operations called on one variable will affect the other, and vice versa.
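One way to check whether two variables refer to the same object is the is operator. A small sketch:

```python
x = list("abc")
y = x          # y refers to the same list object as x
z = x.copy()   # z is a new list with the same elements

print(x is y)  # True: one object, two names
print(x is z)  # False: a distinct copy
```

Note that z == x would still be True, because == compares elements, not identity.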

When in doubt, use .copy() to prevent this.

x = list("abc")
y = x.copy()
x[2] = "d"
print(x, y)
['a', 'b', 'd'] ['a', 'b', 'c']

2.3.2. Tuples#

References can be confusing. If you know that the elements of a container shouldn’t change, you can also avoid the problem above by creating a tuple. Like a list, a tuple is a one-dimensional container for general data storage. The key difference is that tuples are immutable: once you create a tuple, you can alter neither the tuple itself nor its elements.

Make a tuple by enclosing comma-separated values in parentheses ( ).

tup = (1, 2, 3)
tup
(1, 2, 3)

Alternatively, convert a list to a tuple using tuple().

x = list("abc")
x = tuple(x)
x
('a', 'b', 'c')

You will get an error if you attempt to change this tuple.

x[2] = "d"
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[30], line 1
----> 1 x[2] = "d"

TypeError: 'tuple' object does not support item assignment

2.3.3. Sets#

Unlike lists and tuples, sets cannot contain multiple instances of the same element. They only have unique elements. Create them using curly brackets { } or set().

set_a = {"a", "b", "c"}
set_b = set("aabc")
set_a == set_b
True

Sets are useful containers for keeping track of features in your data. For example, converting our list of tokens to a set will automatically prune out all repeated tokens. The result will be a set of what in NLP are called types: the unique elements in a document. In effect, it is the vocabulary of the document.

types = set(tokens)
print("Number of types:", len(types))
Number of types: 11590

Sets also offer additional functionality for performing comparisons. We won’t touch on this too much in the following chapters, but it’s useful to know about. Given the following two sentences from Frankenstein, for example:

a = "I am surrounded by mountains of ice which admit of no escape and threaten every moment to crush my vessel."
b = "This ice is not made of such stuff as your hearts may be; it is mutable and cannot withstand you if you say that it shall not." 

We split them into tokens and convert both to sets:

a = set(a.split())
b = set(b.split())

Now, we find their intersection. This is where the two sentences’ vocabularies overlap:

a.intersection(b)
{'and', 'ice', 'of'}

We can also find their difference, or the set of tokens that do not overlap:

a.difference(b)
{'I',
 'admit',
 'am',
 'by',
 'crush',
 'escape',
 'every',
 'moment',
 'mountains',
 'my',
 'no',
 'surrounded',
 'threaten',
 'to',
 'vessel.',
 'which'}

Finally, we can build a new set that combines our two sets:

c = a.union(b)
c
{'I',
 'This',
 'admit',
 'am',
 'and',
 'as',
 'be;',
 'by',
 'cannot',
 'crush',
 'escape',
 'every',
 'hearts',
 'ice',
 'if',
 'is',
 'it',
 'made',
 'may',
 'moment',
 'mountains',
 'mutable',
 'my',
 'no',
 'not',
 'not.',
 'of',
 'say',
 'shall',
 'stuff',
 'such',
 'surrounded',
 'that',
 'threaten',
 'to',
 'vessel.',
 'which',
 'withstand',
 'you',
 'your'}

The downside of sets, however, is that they are unordered. This means they cannot be indexed.

c[5]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[38], line 1
----> 1 c[5]

TypeError: 'set' object is not subscriptable

2.3.4. Dictionaries#

Finally, there are dictionaries. Like sets, dictionaries store unique elements, but they associate those elements with a particular value. These can be individual values, like numbers, or containers, like lists, tuples, and so on. Every element in a dictionary is therefore a key–value pair. This makes dictionaries powerful data structures for associating values in your data with metadata of one kind or another.

Create a dictionary with curly brackets { } and colons : that separate the key–value pairs.

counts = {"x": 4, "y": 1, "z": 6}
counts
{'x': 4, 'y': 1, 'z': 6}

Alternatively, use dict():

counts = dict(x = 4, y = 1, z = 6)
counts
{'x': 4, 'y': 1, 'z': 6}

Unlike sets, dictionaries can be indexed by their keys. This returns the value stored at a particular key.

counts["y"]
1

The .get() method also works.

counts.get("y")
1

You can set a default value to handle cases where the dictionary doesn’t have a requested key.

counts.get("a", None)

Assign a new value to a key to update it.

counts["z"] = counts["z"] - 1
counts
{'x': 4, 'y': 1, 'z': 5}

Or use the .update() method in conjunction with the curly bracket and colon syntax. Note that this is an in-place operation: you do not need to reassign the result to a variable.

counts.update({"x": 10})
counts
{'x': 10, 'y': 1, 'z': 5}

Either method also enables you to add new keys to a dictionary.

counts["w"] = 7
counts
{'x': 10, 'y': 1, 'z': 5, 'w': 7}

Using .pop() removes a key–value pair.

counts.pop("w")
counts
{'x': 10, 'y': 1, 'z': 5}

The .keys() method returns all keys in a dictionary. Functionally, this is a set.

counts.keys()
dict_keys(['x', 'y', 'z'])

Alternatively, the .values() method returns a dictionary’s values.

counts.values()
dict_values([10, 1, 5])

At the beginning of the chapter we imported a Counter object. This is a special kind of dictionary. It counts its input and stores the results as key–value pairs.

A Counter can work on characters:

Counter(frankenstein)
Counter({' ': 68672,
         'e': 43982,
         't': 28282,
         'a': 25362,
         'o': 23768,
         'n': 23223,
         'i': 20389,
         's': 20132,
         'r': 19603,
         'h': 18909,
         'd': 16233,
         'l': 12164,
         'm': 9958,
         'u': 9880,
         'c': 8490,
         'f': 8188,
         'y': 7442,
         '\n': 7315,
         'w': 7088,
         'p': 5602,
         'g': 5474,
         ',': 4956,
         'b': 4528,
         'v': 3681,
         'I': 3091,
         '.': 2920,
         'k': 1592,
         ';': 971,
         'x': 649,
         'T': 550,
         '“': 480,
         'A': 351,
         'j': 346,
         'q': 313,
         '”': 293,
         'H': 287,
         'M': 276,
         'W': 274,
         'S': 272,
         '!': 238,
         '?': 220,
         'B': 219,
         'z': 211,
         'E': 190,
         'C': 153,
         'F': 151,
         '’': 144,
         'Y': 134,
         '—': 124,
         '-': 123,
         'O': 107,
         'D': 92,
         'G': 90,
         '_': 84,
         'N': 77,
         'L': 72,
         'P': 69,
         'J': 66,
         ':': 48,
         '‘': 43,
         'R': 39,
         'V': 36,
         '1': 35,
         'K': 24,
         'æ': 21,
         '7': 16,
         'U': 16,
         '2': 15,
         '(': 15,
         ')': 15,
         '3': 6,
         'ê': 6,
         '8': 5,
         '4': 4,
         '5': 4,
         '9': 4,
         '[': 3,
         ']': 3,
         '6': 3,
         'ô': 2,
         '0': 2,
         'é': 1,
         'è': 1})

But it also works on containers like lists, which makes it highly useful for our purposes. Below, we calculate the token counts in Frankenstein…

token_freq = Counter(tokens)

…and use the .most_common() method to get the top-10 most frequent tokens in the novel:

top_ten = token_freq.most_common(10)
top_ten
[('the', 3897),
 ('and', 2903),
 ('I', 2719),
 ('of', 2634),
 ('to', 2072),
 ('my', 1631),
 ('a', 1338),
 ('in', 1071),
 ('was', 992),
 ('that', 974)]

2.4. Iteration#

The output of .most_common() is a list of tuples. You’ll see patterns like this frequently: data structures wrapping other data structures. While we could work with this list as we would any other, indexing it to retrieve tuples and then indexing those tuples in turn, that would be inefficient for many operations. Moreover, it might require us to know in advance which elements are at which index positions. This information is not always easily available, especially when writing general-purpose code.

It would be better to work with our data in a more programmatic fashion. We can do this with the above containers because they are all iterables: that is, they enable us to step through each of their elements and do things like perform checks, run calculations, or even move elements to other parts of our code. This is called iterating through our data; each step is one iteration.

2.4.1. For-loops#

The standard method for advancing through an iterable is a for-loop. Even if you’ve never written a line of code before, you’ve probably heard of them. A for-loop begins with the for keyword, followed by:

  • A placeholder variable, which will be automatically assigned to an element at the beginning of each iteration

  • The in keyword

  • An object with elements

  • A colon :

Code in the body of the loop must be indented. Four spaces of indentation is standard.

Below, we iterate through each tuple in top_ten. At the start of each iteration, a tuple is assigned to tup; we then print this tuple.

for tup in top_ten:
    print(tup)
('the', 3897)
('and', 2903)
('I', 2719)
('of', 2634)
('to', 2072)
('my', 1631)
('a', 1338)
('in', 1071)
('was', 992)
('that', 974)

For-loops can be nested inside of for-loops. Let’s re-implement the above with two for-loops.

for tup in top_ten:
    for part in tup:
        print(part)
    print("\n")
the
3897


and
2903


I
2719


of
2634


to
2072


my
1631


a
1338


in
1071


was
992


that
974

See how the outer print statement only triggers once the inner for-loop has finished? Every iteration of the first for-loop kicks off the second for-loop anew.

Within the indented portion of a for-loop you can perform checks and computations. In every iteration below, we assign the token in the tuple to a variable tok and its value to val. Then, we check whether val is even. If it is, we print tok and val.

for tup in top_ten:
    tok, val = tup[0], tup[1]
    if val % 2 == 0:
        print(tok, val)
of 2634
to 2072
a 1338
was 992
that 974

Oftentimes you want to save the result of a check. The easiest way to do this is by creating a new, empty list and using .append() to add elements to it.

is_even = []
for tup in top_ten:
    val = tup[1]
    if val % 2 == 0:
        is_even.append(tup)

print(is_even)
[('of', 2634), ('to', 2072), ('a', 1338), ('was', 992), ('that', 974)]

Other data structures are iterable in Python. In addition to lists, you’ll find yourself iterating through dictionaries with some frequency. Use .keys() or .values() to iterate, respectively, through the keys and values of a dictionary. Or, use .items() to iterate through both at the same time. Note that .items() requires using two placeholder variables separated by a comma ,.

for key, value in counts.items():
    print(key, "->", value)
x -> 10
y -> 1
z -> 5

Below, we divide every count in token_freq by the total number of tokens in Frankenstein to express counts as percentages, using a new Counter to store our results.

num_tokens = token_freq.total()
percentages = Counter()

for token, count in token_freq.items():
    percent = count / num_tokens
    percentages[token] = percent

Here is the equivalent of top_ten, but with percentages:

percentages.most_common(10)
[('the', 0.05197732577525842),
 ('and', 0.038719573191063686),
 ('I', 0.03626542180726909),
 ('of', 0.035131710570190065),
 ('to', 0.027635878626208737),
 ('my', 0.021753917972657553),
 ('a', 0.01784594864954985),
 ('in', 0.014284761587195731),
 ('was', 0.013231077025675225),
 ('that', 0.012990996998999667)]

2.4.2. Comprehensions#

Comprehensions are idiomatic to Python. They allow you to perform operations across an iterable without needing to create an empty container in advance to store the results. This makes them both concise and efficient. You will most frequently see comprehensions used in the context of lists (i.e. “list comprehensions”), but you can also use them for dictionaries and sets.

The syntax for a comprehension includes the keywords for and in, just like a for-loop. The difference is that, in a list comprehension, the repeated code comes before the for keyword rather than after it, and the entire expression is enclosed in square brackets [ ].

Below, we use the .istitle() method to find capitalized tokens in Frankenstein. This method returns a Boolean value, so the resulting list will contain True and False values indicating whether the token at each index position is capitalized.

is_title = [token.istitle() for token in tokens]
is_title[:10]
[True, False, True, True, True, True, True, True, True, False]

That should be straightforward enough, but we don’t know which tokens these values reference. With comprehensions, an easy way around this is to use an if statement embedded in the comprehension. Put that statement and a conditional check at the end of the comprehension to filter a list.

is_title = [token for token in tokens if token.istitle()]
is_title[:10]
['Letter',
 '_To',
 'Mrs.',
 'Saville,',
 'England._',
 'St.',
 'Petersburgh,',
 'Dec.',
 'You',
 'I']

Comprehensions become particularly powerful when you use them to manipulate each element in an iterable. Below, we change all tokens to their lowercase variants using .lower().

lowercase = [token.lower() for token in tokens]
lowercase[:10]
['letter',
 '1',
 '_to',
 'mrs.',
 'saville,',
 'england._',
 'st.',
 'petersburgh,',
 'dec.',
 '11th,']
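The same syntax extends to dictionaries and sets: swap the square brackets for curly brackets. A small sketch on a toy token list (not the novel):

```python
toy = ["The", "sea", "the", "ship"]

# Dictionary comprehension: map each token to its length
lengths = {tok: len(tok) for tok in toy}

# Set comprehension: collect the unique lowercase tokens
uniq = {tok.lower() for tok in toy}

print(lengths)  # {'The': 3, 'sea': 3, 'the': 3, 'ship': 4}
print(uniq)
```

We do not print an expected value for uniq because sets are unordered, so its display order may vary.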

2.4.3. While-loops#

While-loops continue iterating until a condition is met. Whereas a for-loop makes a single pass through your data, a while-loop iterates indefinitely. That means you need to specify an exit condition to break out of your while-loop; otherwise your code will be trapped in an infinite loop, and you will have to interrupt the process manually.

The syntax for a while-loop is quite simple: start it with the while keyword and a condition. Below, we increment a counter to print the first ten tokens in Frankenstein.

current_index = 0
while current_index < 10:
    print(tokens[current_index])

    current_index += 1
Letter
1
_To
Mrs.
Saville,
England._
St.
Petersburgh,
Dec.
11th,

Note that we must specify, and then manually increment, the counter. If we didn’t, the loop would have no reference telling it when it should break.

Here is a more open-ended loop. We set the condition to True, keeping the loop running until we reach an exit condition. Then, for each iteration, we index our list of tokens and check whether the token at that index matches the one we’re looking for. If it does, the code prints that index position and stops the iteration with a break statement. If it doesn’t, we increment the counter and try again.

find_first = "Frankenstein"
current_index = 0
while True:
    token = tokens[current_index]
    if token == find_first:
        print("The first occurrence of", find_first, "is at", current_index)
        break

    current_index += 1
The first occurrence of Frankenstein is at 18961
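For this particular task, lists offer a built-in shortcut: the .index() method returns the position of the first matching element. A sketch on a toy list:

```python
toy = ["We", "are", "unfashioned", "creatures"]

# .index() returns the position of the first match
toy.index("unfashioned")  # 2
```

Unlike the while-loop version, .index() raises a ValueError if the element is absent, so the loop remains useful when you want custom handling for misses.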

2.5. Regular Expressions#

You have likely noticed by now that our tokenization strategy has created some strange tokens. Most notably, punctuation sticks to words because there was no whitespace to separate them. This means that, for our Counter, the following two tokens are counted separately, even though they’re the same word:

variants = ["ship,", "ship."]
for tok in variants:
    print(tok, "->", token_freq[tok])
ship, -> 2
ship. -> 2

We can handle this in a number of ways. Many rely on writing out regular expressions, or regexes. Regexes are special sequences of characters that represent patterns for matching in text. These sequences consist of ordinary characters, or literals, and metacharacters, special characters that stand for whole classes of literals. Regexes work as a search mechanism, and they are highly useful in text processing for their ability to find variants like the tokens above.

2.5.1. Literals#

The following regex will match on the string “ship”:

ship = r"ship"

Note how we prepend our string with r. That marks it as a raw string, which keeps Python from processing backslash escapes; regexes use backslashes heavily, so this is the standard way to write them. Using findall() from the re module will return a list of all matches for this regex:

re.findall(ship, frankenstein)
['ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship']

When you use literals like this, Python will match only on the exact sequence. But that’s a problem for us, because there’s no way to know whether the above output refers to “ship” and any following punctuation, or if our regex has also matched on words that contain “ship,” like “relationship” and “shipment.”

The latter will most certainly be the case. We’ll see this if we search with finditer(). It finds all matches and records where each one starts and ends in the character sequence (each result is a Match object). Below, we use those start/end positions to glimpse the context of matches.

found = re.finditer(ship, frankenstein)
for match in found:
    # Get the match text
    span = match.group()

    # Get its start and end, then offset both
    start = match.start() - 2
    end = match.end() + 2

    # Ensure our expanded start/end locations don't overshoot the string
    if start < 0:
        start = 0
    if end > len(frankenstein):
        end = len(frankenstein)

    print(span, "->", frankenstein[start:end])
ship -> rdship. 
ship -> a ship t
ship -> e
ship f
ship -> d ship: 
ship -> e ship o
ship -> r ship. 
ship -> ndship. 
ship -> ndship a
ship -> orship i
ship -> onship, 
ship -> ndship t
ship -> rdship, 
ship -> ndship, 
ship -> onships 
ship -> ndship? 
ship -> onship w
ship -> e shippi
ship -> ndship w
ship -> rdships

ship -> e ship, 
ship -> ndship o
ship -> rdships.
ship -> r ship. 
ship -> rdships 
ship -> rdships 
ship -> r ship, 
ship -> rdships.
ship -> owship, 

2.5.2. Metacharacters#

Controlling for cases where our regex returns more than what we want requires metacharacters.

The . metacharacter stands for any character except a newline \n.

re.findall(r"ship.", frankenstein)
['ship.',
 'ship ',
 'ship ',
 'ship:',
 'ship ',
 'ship.',
 'ship.',
 'ship ',
 'ship ',
 'ship,',
 'ship ',
 'ship,',
 'ship,',
 'ships',
 'ship?',
 'ship ',
 'shipp',
 'ship ',
 'ships',
 'ship,',
 'ship ',
 'ships',
 'ship.',
 'ships',
 'ships',
 'ship,',
 'ships',
 'ship,']

If you want to match a literal period ., you need to escape it with a backslash: \.

re.findall(r"ship\.", frankenstein)
['ship.', 'ship.', 'ship.', 'ship.']

Note that this won’t work:

re.findall(r"\", frankenstein)
  Cell In[71], line 1
    re.findall(r"\", frankenstein)
               ^
SyntaxError: unterminated string literal (detected at line 1)

Instead, escape the escape character:

re.findall(r"\\", frankenstein)
[]

There are no such characters in this text, however.

Use + as a repetition operator to match one or more instances of the preceding character. Paired with ., which matches anything except a newline, it returns strings that run to the ends of lines.

re.findall(r"ship.+", frankenstein)
['ship. I',
 'ship there, which can easily be done by paying the',
 'ship for his gentleness and the mildness of his discipline. This',
 'ship: I have never believed it to be',
 'ship on all sides, scarcely leaving her the sea-room in which',
 'ship. We, however, lay to until the',
 'ship. You have hope, and the world before you, and have no cause for',
 'ship and',
 'ship in his attachment to my mother, differing wholly from the',
 'ship, and',
 'ship to one among them. Henry',
 'ship, and even danger for',
 'ship, nor the beauty of earth, nor of',
 'ships which',
 'ship? I resolved, at least, not to despair, but in every way',
 'ship with them. Yet even thus I',
 'shipping for London. During this',
 'ship was of that',
 'ships',
 'ship, but he escaped, I know not how.',
 'ship of the villagers',
 'ships. During the day I was',
 'ship. I had determined, if you were going southwards,',
 'ships into a death which I still dread, for my task is unfulfilled.',
 'ships that I have undergone?',
 'ship, brought to me a',
 'ships.',
 'ship, and I was still spurned. Was there no']

Related to + are ? and *. The first means “match zero or one”, while the second means “match zero or more”. An example of * is below. Note that we change our regex slightly to demonstrate zero matching.

re.findall(r"ship*", frankenstein)
['ship',
 'ship',
 'shi',
 'ship',
 'ship',
 'shi',
 'ship',
 'ship',
 'shi',
 'shi',
 'shi',
 'ship',
 'shi',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'shi',
 'shi',
 'shi',
 'shi',
 'shi',
 'shi',
 'shi',
 'shi',
 'shi',
 'shi',
 'ship',
 'shi',
 'shi',
 'shi',
 'shi',
 'shi',
 'shi',
 'ship',
 'ship',
 'shi',
 'shi',
 'shi',
 'shi',
 'shi',
 'shi',
 'ship',
 'shi',
 'shi',
 'shipp',
 'shi',
 'shi',
 'ship',
 'shi',
 'shi',
 'shi',
 'shi',
 'shi',
 'shi',
 'shi',
 'shi',
 'shi',
 'shi',
 'ship',
 'ship',
 'ship',
 'ship',
 'shi',
 'ship',
 'ship',
 'ship',
 'ship',
 'ship',
 'shi',
 'shi',
 'ship']
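Since only * is demonstrated above, here is a small sketch of ? on a toy string (not the novel):

```python
import re

text = "ship ships shipped"

# '?' matches zero or one of the preceding character
re.findall(r"ships?", text)  # ['ship', 'ships', 'ship']
```

The final match comes from “shipped”: the pattern matches “ship” and then accepts zero occurrences of “s”.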

Use curly brackets { } in conjunction with numbers to specify a limit for how many repetitions you want. Here is “match three to five”:

re.findall(r"ship.{3,5}", frankenstein)
['ship. I',
 'ship ther',
 'ship for ',
 'ship: I h',
 'ship on a',
 'ship. We,',
 'ship. You',
 'ship and',
 'ship in h',
 'ship, and',
 'ship to o',
 'ship, and',
 'ship, nor',
 'ships whi',
 'ship? I r',
 'ship with',
 'shipping ',
 'ship was ',
 'ship, but',
 'ship of t',
 'ships. Du',
 'ship. I h',
 'ships int',
 'ships tha',
 'ship, bro',
 'ship, and']

Want to constrain your search to particular characters? Parentheses ( ) specify groups of characters, including metacharacters. Use them in conjunction with the “or” operator | to get two (or more) variants of a string.

re.findall(r"(ship\.|ship,)", frankenstein)
['ship.',
 'ship.',
 'ship.',
 'ship,',
 'ship,',
 'ship,',
 'ship,',
 'ship.',
 'ship,',
 'ship,']

Or, use square brackets [ ] to specify literals following an “or” logic, e.g. “character X or character Y or…”.

re.findall(r"ship[.,]", frankenstein)
['ship.',
 'ship.',
 'ship.',
 'ship,',
 'ship,',
 'ship,',
 'ship,',
 'ship.',
 'ship,',
 'ship,']

Note that literals are case-sensitive.

re.findall(r"[Ss]everal", frankenstein)
['several',
 'several',
 'Several',
 'several',
 'several',
 'several',
 'several',
 'several',
 'several',
 'several',
 'several',
 'several',
 'several',
 'several',
 'several',
 'Several',
 'several',
 'several',
 'several',
 'Several',
 'several',
 'Several',
 'several',
 'several',
 'several',
 'several',
 'several',
 'several',
 'several',
 'Several',
 'several',
 'several',
 'several',
 'several',
 'Several',
 'several',
 'several',
 'several',
 'several',
 'several',
 'Several',
 'several',
 'several',
 'Several',
 'several',
 'several',
 'several',
 'several',
 'several',
 'several',
 'several']
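An alternative to bracketed literals like [Ss] is the flags argument: re.IGNORECASE makes the whole pattern case-insensitive. A sketch on a toy string (not the novel):

```python
import re

text = "Several ships carried several sails"

# flags=re.IGNORECASE matches regardless of case
re.findall(r"several", text, flags=re.IGNORECASE)  # ['Several', 'several']
```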

Including a space character is valid here:

re.findall(r"ship[., ]", frankenstein)
['ship.',
 'ship ',
 'ship ',
 'ship ',
 'ship.',
 'ship.',
 'ship ',
 'ship ',
 'ship,',
 'ship ',
 'ship,',
 'ship,',
 'ship ',
 'ship ',
 'ship,',
 'ship ',
 'ship.',
 'ship,',
 'ship,']

But you can also use \s. This is a character class: a shorthand that stands for a whole type of character (in this case, whitespace).

re.findall(r"ship[.,\s]", frankenstein)
['ship.',
 'ship ',
 'ship ',
 'ship ',
 'ship.',
 'ship.',
 'ship ',
 'ship ',
 'ship,',
 'ship ',
 'ship,',
 'ship,',
 'ship ',
 'ship ',
 'ship,',
 'ship ',
 'ship.',
 'ship,',
 'ship,']
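A related shorthand is \b, which matches a word boundary: the edge between a word character and anything else. It offers one way to rule out matches inside longer words like “relationship”. A sketch on a toy passage (not from the novel):

```python
import re

passage = "The ship sank. Their friendship did not."

# \b requires a word boundary on each side of "ship",
# so the "ship" inside "friendship" is not matched
re.findall(r"\bship\b", passage)  # ['ship']
```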

Below, we find all whitespace sequences (including runs of multiple whitespace characters) in the novel:

re.findall(r"\s+", frankenstein)
[' ',
 '\n\n',
 ' ',
 ' ',
 ' ',
 '\n\n\n',
 ...]
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ...]

There are also character classes for digits \d and alphanumeric characters \w. Here is an example with digits, which you could use to find chapter breaks:

re.findall(r"Chapter \d+", frankenstein)
['Chapter 1',
 'Chapter 2',
 'Chapter 3',
 'Chapter 4',
 'Chapter 5',
 'Chapter 6',
 'Chapter 7',
 'Chapter 8',
 'Chapter 9',
 'Chapter 10',
 'Chapter 11',
 'Chapter 12',
 'Chapter 13',
 'Chapter 14',
 'Chapter 15',
 'Chapter 16',
 'Chapter 17',
 'Chapter 18',
 'Chapter 19',
 'Chapter 20',
 'Chapter 21',
 'Chapter 22',
 'Chapter 23',
 'Chapter 24']

Using \w, the pattern below matches one or more alphanumeric characters followed by a newline.

re.findall(r"\w+\n", frankenstein)
['1\n',
 '_\n',
 'the\n',
 'evil\n',
 'assure\n',
 'success\n',
 'of\n',
 'which\n',
 'this\n',
 'towards\n',
 'fervent\n',
 'of\n',
 'the\n',
 'ever\n',
 'a\n',
 'put\n',
 'in\n',
 'habitable\n',
 'the\n',
 'undiscovered\n',
 'I\n',
 'may\n',
 'this\n',
 'I\n',
 'world\n',
 'by\n',
 'to\n',
 'this\n',
 'little\n',
 'his\n',
 'you\n',
 'all\n',
 'pole\n',
 'are\n',
 'at\n',
 'my\n',
 'me\n',
 'as\n',
 'intellectual\n',
 'I\n',
 'have\n',
 'Ocean\n',
 'a\n',
 'the\n',
 'study\n',
 'which\n',
 'injunction\n',
 'poets\n',
 'also\n',
 'the\n',
 'well\n',
 'my\n',
 'I\n',
 'this\n',
 'I\n',
 'often\n',
 'my\n',
 'those\n',
 'derive\n',
 'an\n',
 'I\n',
 'second\n',
 'greatest\n',
 'to\n',
 'encouraging\n',
 'is\n',
 'am\n',
 'which\n',
 'spirits\n',
 'fly\n',
 'in\n',
 'The\n',
 'have\n',
 'the\n',
 'exercise\n',
 'no\n',
 'and\n',
 'my\n',
 'the\n',
 'necessary\n',
 'to\n',
 'how\n',
 'your\n',
 'Walton\n',
 '2\n',
 '_\n',
 'a\n',
 'have\n',
 'certainly\n',
 'the\n',
 'no\n',
 'there\n',
 'no\n',
 'thoughts\n',
 'of\n',
 'whose\n',
 'I\n',
 'yet\n',
 'whose\n',
 'a\n',
 'execution\n',
 'me\n',
 'wild\n',
 'own\n',
 'its\n',
 'the\n',
 'native\n',
 'many\n',
 'my\n',
 'painters\n',
 'sense\n',
 'to\n',
 'the\n',
 'Yet\n',
 'these\n',
 'courage\n',
 'phrase\n',
 'an\n',
 'of\n',
 'assist\n',
 'the\n',
 'This\n',
 'made\n',
 'years\n',
 'the\n',
 'to\n',
 'be\n',
 'kindliness\n',
 'felt\n',
 'heard\n',
 'the\n',
 'loved\n',
 'considerable\n',
 'saw\n',
 'in\n',
 'friend\n',
 'his\n',
 'he\n',
 'his\n',
 'young\n',
 'old\n',
 'returned\n',
 'her\n',
 'is\n',
 'kind\n',
 'conduct\n',
 'which\n',
 'can\n',
 'am\n',
 'voyage\n',
 'The\n',
 'it\n',
 'sail\n',
 'me\n',
 'the\n',
 'my\n',
 'of\n',
 'which\n',
 'the\n',
 'not\n',
 'and\n',
 'I\n',
 'my\n',
 'that\n',
 'something\n',
 'practically\n',
 'and\n',
 'belief\n',
 'out\n',
 'unvisited\n',
 'after\n',
 'of\n',
 'to\n',
 'to\n',
 'when\n',
 'Walton\n',
 '3\n',
 '_\n',
 'advanced\n',
 'on\n',
 'not\n',
 'good\n',
 'the\n',
 'dangers\n',
 'We\n',
 'of\n',
 'desire\n',
 'not\n',
 'a\n',
 'are\n',
 'and\n',
 'as\n',
 'I\n',
 'stars\n',
 'not\n',
 'the\n',
 'must\n',
 '4\n',
 '_\n',
 'forbear\n',
 'before\n',
 'closed\n',
 'which\n',
 'we\n',
 'out\n',
 'to\n',
 'to\n',
 'suddenly\n',
 'own\n',
 'by\n',
 'a\n',
 'progress\n',
 'the\n',
 'that\n',
 'by\n',
 'the\n',
 'before\n',
 'the\n',
 'which\n',
 'to\n',
 'and\n',
 'apparently\n',
 'we\n',
 'large\n',
 'human\n',
 'of\n',
 'the\n',
 'perish\n',
 'a\n',
 'addressed\n',
 'have\n',
 'not\n',
 'I\n',
 'the\n',
 'for\n',
 'were\n',
 'and\n',
 'attempted\n',
 'fresh\n',
 'and\n',
 'to\n',
 'we\n',
 'the\n',
 'often\n',
 'he\n',
 'and\n',
 'more\n',
 'of\n',
 'anyone\n',
 'most\n',
 'with\n',
 'he\n',
 'his\n',
 'off\n',
 'not\n',
 'body\n',
 'ice\n',
 'and\n',
 'we\n',
 'of\n',
 'had\n',
 'good\n',
 'to\n',
 'have\n',
 'the\n',
 'answer\n',
 'near\n',
 'safety\n',
 'the\n',
 'for\n',
 'in\n',
 'instant\n',
 'the\n',
 'very\n',
 'all\n',
 'communication\n',
 'his\n',
 'must\n',
 'wreck\n',
 'friend\n',
 'been\n',
 'brother\n',
 'my\n',
 'so\n',
 'poignant\n',
 'and\n',
 'although\n',
 'he\n',
 'frequently\n',
 'without\n',
 'my\n',
 'taken\n',
 'the\n',
 'soul\n',
 'would\n',
 'my\n',
 'for\n',
 'should\n',
 'a\n',
 'I\n',
 'before\n',
 'trickle\n',
 'I\n',
 'you\n',
 'the\n',
 'weakened\n',
 'were\n',
 'despise\n',
 'of\n',
 'asked\n',
 'it\n',
 'a\n',
 'than\n',
 'could\n',
 'are\n',
 'than\n',
 'to\n',
 'most\n',
 'respecting\n',
 'for\n',
 'life\n',
 'settled\n',
 'presently\n',
 'he\n',
 'sight\n',
 'of\n',
 'he\n',
 'he\n',
 'a\n',
 'divine\n',
 'and\n',
 'therefore\n',
 'to\n',
 'I\n',
 'that\n',
 'I\n',
 'failing\n',
 'unequalled\n',
 'a\n',
 'Captain\n',
 'had\n',
 'with\n',
 'for\n',
 'the\n',
 'mine\n',
 'be\n',
 'same\n',
 'me\n',
 'one\n',
 'you\n',
 'usually\n',
 'might\n',
 'things\n',
 'would\n',
 'powers\n',
 'series\n',
 'offered\n',
 'by\n',
 'hear\n',
 'strong\n',
 'expressed\n',
 'is\n',
 'I\n',
 'my\n',
 'my\n',
 'is\n',
 'I\n',
 'have\n',
 'to\n',
 'during\n',
 'This\n',
 'who\n',
 'and\n',
 'my\n',
 'me\n',
 'in\n',
 'soul\n',
 'which\n',
 '1\n',
 'most\n',
 'years\n',
 'public\n',
 'who\n',
 'public\n',
 'the\n',
 'his\n',
 'a\n',
 'cannot\n',
 'a\n',
 'numerous\n',
 'a\n',
 'poverty\n',
 'been\n',
 'his\n',
 'in\n',
 'and\n',
 'conduct\n',
 'in\n',
 'begin\n',
 'ten\n',
 'the\n',
 'Beaufort\n',
 'but\n',
 'in\n',
 'a\n',
 'for\n',
 'end\n',
 'saw\n',
 'that\n',
 'Beaufort\n',
 'support\n',
 'and\n',
 'to\n',
 'time\n',
 'subsistence\n',
 'leaving\n',
 'knelt\n',
 'the\n',
 'who\n',
 'he\n',
 'a\n',
 'but\n',
 'devoted\n',
 'mind\n',
 'love\n',
 'the\n',
 'set\n',
 'and\n',
 'the\n',
 'her\n',
 'recompensing\n',
 'grace\n',
 'wishes\n',
 'is\n',
 'her\n',
 'and\n',
 'hitherto\n',
 'During\n',
 'had\n',
 'after\n',
 'change\n',
 'born\n',
 'remained\n',
 'each\n',
 'very\n',
 'and\n',
 'my\n',
 'something\n',
 'on\n',
 'in\n',
 'fulfilled\n',
 'owed\n',
 'spirit\n',
 'during\n',
 'but\n',
 'a\n',
 'five\n',
 'they\n',
 'benevolent\n',
 'my\n',
 'a\n',
 'been\n',
 'the\n',
 'vale\n',
 'number\n',
 'worst\n',
 'to\n',
 'far\n',
 'were\n',
 'Her\n',
 'her\n',
 'was\n',
 'of\n',
 'behold\n',
 'and\n',
 'was\n',
 'a\n',
 'with\n',
 'been\n',
 'their\n',
 'glory\n',
 'exerted\n',
 'its\n',
 'Austria\n',
 'and\n',
 'rude\n',
 'of\n',
 'seemed\n',
 'lighter\n',
 'his\n',
 'their\n',
 'seemed\n',
 'poverty\n',
 'They\n',
 'Lavenza\n',
 'than\n',
 'and\n',
 'reverential\n',
 'my\n',
 'to\n',
 'my\n',
 'she\n',
 'childish\n',
 'Elizabeth\n',
 'on\n',
 'other\n',
 'body\n',
 'than\n',
 '2\n',
 'in\n',
 'of\n',
 'and\n',
 'us\n',
 'concentrated\n',
 'intense\n',
 'Swiss\n',
 'of\n',
 'the\n',
 'their\n',
 'the\n',
 'gave\n',
 'native\n',
 'a\n',
 'the\n',
 'my\n',
 'was\n',
 'united\n',
 'Henry\n',
 'singular\n',
 'for\n',
 'He\n',
 'and\n',
 'into\n',
 'of\n',
 'chivalrous\n',
 'hands\n',
 'My\n',
 'to\n',
 'delights\n',
 'distinctly\n',
 'assisted\n',
 'some\n',
 'pursuits\n',
 'things\n',
 'states\n',
 'earth\n',
 'of\n',
 'man\n',
 'moral\n',
 'was\n',
 'the\n',
 'soul\n',
 'of\n',
 'was\n',
 'become\n',
 'that\n',
 'And\n',
 'Yet\n',
 'his\n',
 'for\n',
 'of\n',
 'soaring\n',
 'of\n',
 'which\n',
 'would\n',
 'my\n',
 'almost\n',
 'torrent\n',
 'my\n',
 'went\n',
 'the\n',
 'I\n',
 'it\n',
 'wonderful\n',
 'new\n',
 'my\n',
 'my\n',
 'waste\n',
 'me\n',
 'modern\n',
 'powers\n',
 'while\n',
 'I\n',
 'my\n',
 'my\n',
 'never\n',
 'glance\n',
 'was\n',
 'greatest\n',
 'this\n',
 'and\n',
 'me\n',
 'always\n',
 'of\n',
 'modern\n',
 'picking\n',
 'his\n',
 'acquainted\n',
 'same\n',
 'acquainted\n',
 'little\n',
 'immortal\n',
 'causes\n',
 'I\n',
 'keep\n',
 'and\n',
 'knew\n',
 'their\n',
 'eighteenth\n',
 'of\n',
 'favourite\n',
 'a\n',
 'greatest\n',
 'elixir\n',
 'an\n',
 'could\n',
 'but\n',
 'a\n',
 'which\n',
 'I\n',
 'a\n',
 'was\n',
 'thousand\n',
 'of\n',
 'childish\n',
 'near\n',
 'It\n',
 'once\n',
 'an\n',
 'so\n',
 'nothing\n',
 'found\n',
 'the\n',
 'beheld\n',
 'of\n',
 'natural\n',
 'on\n',
 'of\n',
 'by\n',
 'my\n',
 'ever\n',
 'grew\n',
 'perhaps\n',
 'former\n',
 'deformed\n',
 'a\n',
 'of\n',
 'the\n',
 'as\n',
 'ligaments\n',
 'me\n',
 'the\n',
 'effort\n',
 'even\n',
 'was\n',
 'which\n',
 'tormenting\n',
 'with\n',
 'and\n',
 '3\n',
 'I\n',
 'had\n',
 'it\n',
 'made\n',
 'My\n',
 'day\n',
 'life\n',
 'was\n',
 'to\n',
 'first\n',
 'her\n',
 'She\n',
 'malignity\n',
 'this\n',
 'mother\n',
 'the\n',
 'her\n',
 'desert\n',
 'My\n',
 'were\n',
 'the\n',
 'to\n',
 'happy\n',
 'are\n',
 'to\n',
 'rent\n',
 'the\n',
 'so\n',
 'day\n',
 'departed\n',
 'been\n',
 'ear\n',
 'of\n',
 'the\n',
 'has\n',
 'I\n',
 'at\n',
 'and\n',
 'a\n',
 'still\n',
 'the\n',
 'the\n',
 'of\n',
 'of\n',
 'was\n',
 'above\n',
 'and\n',
 'call\n',
 'last\n',
 'permit\n',
 'His\n',
 'the\n',
 'misfortune\n',
 'when\n',
 'a\n',
 'details\n',
 'nor\n',
 'we\n',
 'the\n',
 'the\n',
 'father\n',
 'to\n',
 'last\n',
 'in\n',
 'by\n',
 'mutual\n',
 'I\n',
 'hitherto\n',
 'invincible\n',
 'and\n',
 'myself\n',
 'as\n',
 'I\n',
 'had\n',
 'to\n',
 'my\n',
 'the\n',
 'was\n',
 'to\n',
 'evil\n',
 'me\n',
 's\n',
 'He\n',
 'He\n',
 'branches\n',
 'and\n',
 'principal\n',
 'he\n',
 'with\n',
 'utterly\n',
 'systems\n',
 'you\n',
 'they\n',
 'scientific\n',
 'dear\n',
 'books\n',
 'and\n',
 'following\n',
 'natural\n',
 'fellow\n',
 'he\n',
 'long\n',
 'I\n',
 'any\n',
 'a\n',
 'in\n',
 'a\n',
 'come\n',
 'been\n',
 'natural\n',
 'my\n',
 'the\n',
 'the\n',
 'sought\n',
 'now\n',
 'limit\n',
 'in\n',
 'of\n',
 'my\n',
 'becoming\n',
 'new\n',
 'information\n',
 'I\n',
 'deliver\n',
 'lecturing\n',
 'very\n',
 'an\n',
 'his\n',
 'person\n',
 'and\n',
 'pronouncing\n',
 'took\n',
 'of\n',
 'he\n',
 'I\n',
 'masters\n',
 'that\n',
 'seem\n',
 'or\n',
 'recesses\n',
 'the\n',
 'of\n',
 'even\n',
 'of\n',
 'soul\n',
 'were\n',
 'was\n',
 'of\n',
 'steps\n',
 'and\n',
 'of\n',
 'I\n',
 'to\n',
 'a\n',
 'His\n',
 'in\n',
 'I\n',
 'had\n',
 'little\n',
 'Cornelius\n',
 'had\n',
 'zeal\n',
 'their\n',
 'names\n',
 'a\n',
 'The\n',
 'ever\n',
 'I\n',
 'presumption\n',
 'my\n',
 'measured\n',
 'his\n',
 'have\n',
 'intended\n',
 'to\n',
 'a\n',
 'of\n',
 'the\n',
 'that\n',
 'not\n',
 'sorry\n',
 'your\n',
 'petty\n',
 'natural\n',
 'his\n',
 'and\n',
 'in\n',
 'of\n',
 '4\n',
 'the\n',
 'the\n',
 'the\n',
 'sense\n',
 'repulsive\n',
 'In\n',
 'by\n',
 'and\n',
 'ways\n',
 'abstruse\n',
 'at\n',
 'and\n',
 'the\n',
 'progress\n',
 'and\n',
 'Waldman\n',
 'years\n',
 'was\n',
 'I\n',
 'conceive\n',
 'as\n',
 'in\n',
 'must\n',
 'who\n',
 'was\n',
 'two\n',
 'chemical\n',
 'the\n',
 'well\n',
 'as\n',
 'my\n',
 'thought\n',
 'incident\n',
 'was\n',
 'with\n',
 'a\n',
 'becoming\n',
 'our\n',
 'determined\n',
 'of\n',
 'been\n',
 'this\n',
 'the\n',
 'became\n',
 'I\n',
 'my\n',
 'ever\n',
 'feared\n',
 'and\n',
 'of\n',
 'become\n',
 'of\n',
 'and\n',
 'most\n',
 'the\n',
 'of\n',
 'worm\n',
 'and\n',
 'change\n',
 'this\n',
 'and\n',
 'immensity\n',
 'so\n',
 'same\n',
 'a\n',
 'not\n',
 'is\n',
 'the\n',
 'of\n',
 'of\n',
 'bestowing\n',
 'discovery\n',
 'in\n',
 'the\n',
 'so\n',
 'been\n',
 'creation\n',
 'it\n',
 'a\n',
 'them\n',
 'already\n',
 'dead\n',
 'seemingly\n',
 'eyes\n',
 'with\n',
 'end\n',
 'that\n',
 'my\n',
 'of\n',
 'town\n',
 'nature\n',
 'hesitated\n',
 'to\n',
 'of\n',
 'inconceivable\n',
 'the\n',
 'my\n',
 'to\n',
 'wonderful\n',
 'appeared\n',
 'should\n',
 'my\n',
 'be\n',
 'takes\n',
 'present\n',
 'Nor\n',
 'any\n',
 'I\n',
 'parts\n',
 'first\n',
 'having\n',
 'successfully\n',
 'like\n',
 'death\n',
 'and\n',
 'bless\n',
 'would\n',
 'his\n',
 'these\n',
 'lifeless\n',
 'undertaking\n',
 'my\n',
 'very\n',
 'the\n',
 'alone\n',
 'moon\n',
 'breathless\n',
 'conceive\n',
 'damps\n',
 'lifeless\n',
 'but\n',
 'seemed\n',
 'was\n',
 'renewed\n',
 'had\n',
 'and\n',
 'human\n',
 'from\n',
 'The\n',
 'I\n',
 'in\n',
 'fields\n',
 'luxuriant\n',
 'the\n',
 'also\n',
 'had\n',
 'I\n',
 'are\n',
 'shall\n',
 'any\n',
 'duties\n',
 ...]

Inside square brackets, the caret ^ negates a character set: the class matches any character not listed. (Outside of brackets, ^ instead anchors a match to the start of a string.) Below, we select characters that are neither alphanumeric nor whitespace.

re.findall(r"[^\w\s]+", frankenstein)
['.',
 ',',
 '.',
 '.',
 ',',
 '.',
 ',',
 '—.',
 '.',
 ',',
 '.',
 ',',
 ',',
 ',',
 '.',
 '?',
 ',',
 ',',
 '.',
 ',',
 '.',
 ';',
 '.',
 ',',
 ',',
 ',',
 '.',
 '—',
 ',',
 ',',
 '—',
 ';',
 ',',
 ',',
 '.',
 ',',
 '.',
 '?',
 '.',
 ',',
 '.',
 ',',
 ',',
 ',',
 '.',
 ',',
 ',',
 ',',
 ',',
 ';',
 ',',
 ',',
 ',',
 '.',
 ',',
 ',',
 '—',
 '.',
 '.',
 '.',
 '’',
 '.',
 ',',
 '.',
 ',',
 ',',
 ',',
 '’',
 '.',
 ',',
 ',',
 '.',
 ';',
 '.',
 '.',
 ',',
 '.',
 '.',
 ',',
 ',',
 '.',
 '.',
 '-',
 ';',
 ',',
 ',',
 ',',
 ';',
 ',',
 ',',
 '.',
 '-',
 ',',
 '.',
 ',',
 '.',
 ',',
 ',',
 '?',
 ',',
 '.',
 ',',
 '!',
 ';',
 ',',
 '.',
 ',',
 ':',
 ',',
 ',',
 '.',
 '.',
 ';',
 ',',
 ',',
 ',',
 '.',
 ',',
 '—',
 ',',
 ',',
 '.',
 '-',
 '.',
 '.',
 ';',
 ',',
 ',',
 '-',
 '.',
 ';',
 '?',
 ',',
 ',',
 '?',
 ',',
 ',',
 ',',
 ',',
 '.',
 ',',
 ',',
 '.',
 ',',
 ',',
 '.',
 ',',
 ',',
 '.',
 ',',
 '.',
 '.',
 ',',
 '.',
 ',',
 ',',
 '—.',
 ',',
 '!',
 '.',
 ';',
 '.',
 ',',
 ',',
 ',',
 ':',
 ',',
 ';',
 ',',
 '.',
 ',',
 ';',
 '.',
 ',',
 '.',
 ',',
 ',',
 '.',
 ',',
 ',',
 ',',
 ',',
 '.',
 '!',
 '.',
 '-',
 ':',
 '’',
 '.',
 ';',
 '.',
 '-',
 '.',
 ',',
 '(',
 ')',
 ';',
 ',',
 '.',
 ',',
 ';',
 ',',
 ',',
 '.',
 ',',
 ',',
 '.',
 ',',
 ',',
 ';',
 ',',
 ',',
 ',',
 '.',
 ',',
 ',',
 ',',
 '.',
 ';',
 ',',
 '.',
 '.',
 ',',
 '-',
 ',',
 '.',
 ',',
 ',',
 ':',
 ',',
 ',',
 '.',
 ',',
 '.',
 ',',
 ',',
 '.',
 ',',
 '-',
 ',',
 '.',
 ';',
 ',',
 ',',
 ',',
 ',',
 ',',
 '.',
 ',',
 ',',
 '.',
 ',',
 ';',
 ',',
 '-',
 ',',
 '’',
 '.',
 ',',
 ',',
 ',',
 ',',
 ',',
 '.',
 '“',
 '!”',
 '.',
 ';',
 ':',
 ',',
 ',',
 ',',
 ',',
 '.',
 ',',
 ',',
 '.',
 ',',
 '.',
 ',',
 ',',
 ',',
 '.',
 ':',
 '.',
 '.',
 ',',
 ',',
 '.',
 ',',
 '“',
 ',”',
 ';',
 '“',
 '.”',
 ',',
 '.',
 ',',
 ',',
 '.',
 '.',
 '—',
 ',',
 '—',
 ',',
 ',',
 ',',
 ',',
 '.',
 '.',
 ',',
 ',',
 '?',
 ',',
 '.',
 ':',
 '.',
 '.',
 ',',
 '.',
 ',',
 '.',
 ',',
 '.',
 ',',
 '—.',
 ',',
 '—',
 '.',
 ';',
 ',',
 ',',
 ',',
 '.',
 ',',
 ',',
 ':',
 ',',
 ',',
 ',',
 '.',
 ';',
 ',',
 ',',
 ',',
 ',',
 '.',
 '.',
 ',',
 '.',
 ',',
 '.',
 ',',
 ',',
 '.',
 ',',
 ',',
 '.',
 '.',
 '?',
 ',',
 ',',
 '.',
 '?',
 '?',
 '.',
 '.',
 '!',
 '.',
 '.',
 '.',
 ',',
 '.',
 ',',
 '—.',
 ',',
 '.',
 '(',
 ')',
 ',',
 ',',
 '-',
 '.',
 ',',
 '.',
 ',',
 '.',
 '’',
 ',',
 ',',
 ',',
 ',',
 '.',
 ',',
 ',',
 '.',
 ',',
 ',',
 ',',
 ';',
 ',',
 ',',
 '.',
 '.',
 '.',
 ',',
 ',',
 ';',
 ',',
 ',',
 '.',
 ',',
 ',',
 ',',
 ',',
 '.',
 ',',
 '.',
 ',',
 ',',
 ',',
 '.',
 '.',
 ',',
 ',',
 ',',
 ',',
 '.',
 ',',
 ',',
 ',',
 ',',
 '.',
 ';',
 '.',
 ',',
 ',',
 ',',
 '.',
 ',',
 '“',
 ',',
 '.”',
 ',',
 ',',
 '.',
 '“',
 ',”',
 ',',
 '“',
 '?”',
 '.',
 ',',
 ',',
 '.',
 '.',
 '!',
 ',',
 ',',
 '.',
 ',',
 '.',
 '.',
 ',',
 '.',
 '.',
 '.',
 ',',
 '.',
 ',',
 '.',
 ',',
 '.',
 ':',
 ',',
 ',',
 ',',
 ',',
 ',',
 ',',
 '.',
 ',',
 ',',
 '.',
 ',',
 ';',
 ',',
 '.',
 ',',
 ',',
 '.',
 ',',
 ',',
 '“',
 '.”',
 '“',
 '?”',
 '“',
 '.”',
 '“',
 ',',
 ',',
 ',',
 '.”',
 '’',
 ',',
 ',',
 ',',
 '.',
 ',',
 ',',
 ',',
 '“',
 ',',
 ',',
 ',',
 ';',
 '.”',
 '“',
 ';',
 '.”',
 '“',
 ';',
 '.”',
 '.',
 ',',
 ',',
 ';',
 '.',
 '.',
 ';',
 ',',
 '.',
 '.',
 '.',
 '.',
 ',',
 '.',
 ',',
 ',',
 '.',
 ',',
 '.',
 ',',
 ',',
 ';',
 ',',
 ',',
 '.',
 ',',
 '.',
 ',',
 '—.',
 '.',
 '.',
 '?',
 ',',
 ';',
 ',',
 ',',
 ',',
 '.',
 ',',
 '.',
 ',',
 ',',
 '.',
 ',',
 '.',
 '.',
 ',',
 ',',
 ',',
 ',',
 ',',
 ',',
 '.',
 '’',
 ',',
 '.',
 ',',
 '’',
 '.',
 ';',
 ',',
 ';',
 '.',
 ';',
 ',',
 ':',
 '“',
 '!',
 '?',
 '?',
 ';',
 ',',
 '!”',
 ',',
 ',',
 ';',
 ',',
 '.',
 ',',
 ';',
 ',',
 '.',
 '.',
 ',',
 '.',
 ',',
 ',',
 '.',
 '“',
 ',”',
 ';',
 '“',
 ',',
 ',',
 ',',
 ',',
 '—',
 '—',
 '.',
 ',',
 ',',
 ',',
 ',',
 '.',
 ',',
 ',',
 '.',
 '—',
 '.”',
 ',',
 '.',
 '.',
 ',',
 '.',
 ',',
 ',',
 '.',
 ':',
 ',',
 ',',
 ',',
 '.',
 '?',
 '.',
 ',',
 ';',
 '.',
 '.',
 ',',
 '-',
 ',',
 ',',
 ';',
 '-',
 '.',
 ',',
 '—.',
 ',',
 '“',
 ',',
 ',',
 '.',
 ',',
 '.',
 ',',
 ';',
 ',',
 '.',
 ';',
 ',',
 ',',
 ',',
 ',',
 '.',
 '.',
 ',',
 ';',
 '-',
 ';',
 '.”',
 ',',
 '.',
 ',',
 '.',
 '.',
 '“',
 ',”',
 ',',
 '“',
 ',',
 ';',
 '.',
 ',',
 '.',
 ',”',
 ',',
 ';',
 '“',
 ',',
 ',',
 ';',
 ';',
 ',',
 '.”',
 '.',
 '.',
 ',',
 ',',
 ',',
 ',',
 '.',
 ',',
 '.',
 ';',
 ',',
 ',',
 '—',
 '!',
 ',',
 ',',
 '-',
 ';',
 ';',
 ',',
 '.',
 ',',
 '—',
 '!',
 ',',
 '.',
 ',',
 '.',
 '.',
 ';',
 ',',
 '.',
 ',',
 '.',
 ',',
 ',',
 ',',
 ',',
 '.',
 ',',
 ',',
 '.',
 ',',
 ',',
 ',',
 ',',
 '.',
 '.',
 '.',
 ',',
 '.',
 ',',
 '.',
 ',',
 ',',
 '.',
 ',',
 '.',
 ',',
 ',',
 '’',
 '.',
 ',',
 ',',
 ';',
 ',',
 ',',
 '.',
 ',',
 '.',
 ',',
 '.',
 ';',
 '.',
 '.',
 ';',
 ';',
 ';',
 ',',
 '.',
 ',',
 '’',
 ',',
 '.',
 ',',
 ';',
 '.',
 '.',
 ',',
 '.',
 '’',
 '.',
 '-',
 '.',
 ',',
 ',',
 ',',
 ',',
 ',',
 '.',
 '.',
 ',',
 ',',
 '.',
 ',',
 ',',
 '.',
 ';',
 ',',
 ',',
 '.',
 '.',
 ',',
 ',',
 ',',
 '.',
 '.',
 ',',
 '.',
 '’',
 '’',
 '.',
 ',',
 '—',
 ',',
 ',',
 ',',
 ',',
 '.',
 ',',
 ',',
 ',',
 ',',
 '-',
 ',',
 '.',
 '.',
 ',',
 '.',
 ',',
 ',',
 '.',
 '.',
 ',',
 ',',
 ';',
 ',',
 '—',
 ',',
 '—',
 '.',
 ',',
 '-',
 '.',
 ',',
 ',',
 ',',
 ',',
 '.',
 ',',
 ',',
 ',',
 '.',
 '.',
 '.',
 '-',
 ',',
 ';',
 '.',
 ',',
 ',',
 '.',
 ',',
 ',',
 ',',
 '-',
 ',',
 '.',
 ',',
 ',',
 '.',
 ',',
 '.',
 '.',
 ':',
 '.',
 ',',
 '.',
 '—',
 ',',
 '.',
 '.',
 '.',
 ';',
 '.',
 ',',
 '-',
 '.',
 ',',
 '—',
 '.',
 '.',
 '.',
 '.',
 ',',
 '.',
 ',',
 '’',
 '—',
 '—',
 '.',
 '.',
 ',',
 ',',
 '.',
 ',',
 ',',
 '“',
 '—',
 '.”',
 ',',
 ',',
 ',',
 ',',
 ',',
 '—',
 ',',
 ',',
 '.',
 '.',
 '.',
 ',',
 '—',
 ',',
 '.',
 ';',
 '.',
 '.',
 ',',
 '.',
 ';',
 ',',
 ',',
 '.',
 ';',
 '—',
 ',',
 ',',
 ',',
 ',',
 '—',
 '.',
 ',',
 '.',
 '.',
 ',',
 ',',
 ',',
 ',',
 '.',
 ',',
 ',',
 '.',
 ',',
 ',',
 ',',
 '.',
 ',',
 '.',
 '.',
 ',',
 ',',
 '-',
 ';',
 '.',
 '.',
 '.',
 ',',
 ',',
 '.',
 '.',
 '.',
 ',',
 ',',
 ',',
 '.',
 '.',
 '.',
 ',',
 '.',
 ',',
 '.',
 ...]

The sub() function substitutes regex matches with another sequence. If we use the same pattern as above, we can remove all punctuation. Note that we also need to tack on the underscore character with an alternation, because \w counts underscores as word characters, so the negated class alone would leave them in place.

cleaned = re.sub(r"[^\w\s]+|_", " ", frankenstein)

This is one way of collapsing the punctuation variants we saw above into a single token.

token_freq = Counter(cleaned.split())

variants = ["ship,", "ship.", "ship"]
for tok in variants:
    print(tok, "->", token_freq.get(tok, None))
ship, -> None
ship. -> None
ship -> 8

We actually scooped up even more tokens from this substitution pattern. An even better picture of our counts would emerge if we changed our text to lowercase so that the Counter can count case variants together.

cleaned = cleaned.lower()
token_freq = Counter(cleaned.split())

print("Unique tokens after substitution and case change:", len(token_freq))
Unique tokens after substitution and case change: 7003
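A Counter can also rank tokens by frequency with its most_common() method. Here is a minimal, self-contained sketch on a toy string (standing in for the novel) that applies the same cleaning steps:

```python
import re
from collections import Counter

# A toy text standing in for the novel
text = "The ship sailed. The ship, the crew, and the sea."

# Same cleaning steps as above: strip punctuation, lowercase, split
cleaned = re.sub(r"[^\w\s]+|_", " ", text).lower()
token_freq = Counter(cleaned.split())

# Rank tokens by frequency
print(token_freq.most_common(2))  # [('the', 4), ('ship', 2)]
```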

We’ll leave off on text preprocessing for now but will pick it up in the next chapter. We’ve covered most of the main regexes, though there are a few more advanced ones that you may find useful. See this cheatsheet for an extensive overview.
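Two of those additional features are worth a quick sketch before moving on: outside of square brackets, ^ and $ anchor a match to the start and end of a string, and the pipe | expresses alternation between two patterns.

```python
import re

# ^ and $ anchor the match to the start and end of the string
print(re.findall(r"^Chapter \d+$", "Chapter 1"))  # ['Chapter 1']

# Alternation with | matches either of two patterns
print(re.findall(r"ship|boat", "The ship passed a boat."))  # ['ship', 'boat']
```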

2.6. Functions#

So far we have relied on external functions, but we can also write our own. Writing functions greatly reduces redundancy in code, because you can reuse them as many times as you want, in whatever contexts you want. Moreover, functions keep your code organized. In complex processes, it often helps to break your code up into individual steps and associate each with a function. That also makes rewriting code much easier later on.

Before we write a function, here is a review of vocabulary associated with them:

  • The placeholder variables for inputs are parameters

  • Arguments are the values assigned to parameters during a call

  • To call a function means using it to compute something

  • The body is the code inside a function

  • A function’s scope is the local context in which it runs code

  • The return value is the output of a function

In Python, a function begins with the def keyword, followed by:

  • The name of the function

  • A list of parameters surrounded by parentheses

  • A colon :

There is no practical limit to the number of parameters. Code in the body of the function should be indented according to the same conventions for loops and conditionals (four spaces). To return a result from the function, use the return keyword.

Here is a very simple function that returns True/False depending on whether text starts with character.

def starts_with(text, character):
    first_char = text[0]
    return first_char == character

Call the function by writing out its name and supplying it with arguments.

starts_with("Book", "B")
True

Here are some more examples:

starts_with("natural language processing", "w")
False
to_test = [("token", "t"), ("Character", "c")]
for testing_pair in to_test:
    text = testing_pair[0]
    character = testing_pair[1]
    text_starts_with_char = starts_with(text, character)

    print(text, "starts with", character, "is", text_starts_with_char)
token starts with t is True
Character starts with c is False
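As an aside, Python strings already provide a built-in startswith() method that performs this same check, and unlike our starts_with() it also accepts multi-character prefixes:

```python
# str.startswith is the built-in equivalent of our starts_with function
print("token".startswith("t"))        # True
print("Character".startswith("c"))    # False

# Unlike our version, it also accepts multi-character prefixes
print("natural language processing".startswith("natural"))  # True
```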

When we speak of a function’s scope, we are talking about the local context for that function. Typically, variables inside the function or its arguments are not the same as the ones you set elsewhere in your code, even if the names of those variables match.

For instance, starts_with() creates a new variable, first_char. If we have another variable with the same name outside this function, it won’t be overwritten when we call starts_with().

first_char = "5"

print("Value of first_char:", first_char)
starts_with("Book", "B")
print("Value of first_char:", first_char)
Value of first_char: 5
Value of first_char: 5

You’ll often find yourself transforming code you’ve already written into a function. The function below re-implements the for-loop we wrote above to print regex matches and their context. It has three parameters:

  1. The regex match is match

  2. The text where we found the match is string

  3. Our offset is the number of characters to extract on either side of the match

def show_match_context(match, string, offset):
    # Get the match text
    span = match.group()

    # Get its start and end, then offset both
    start = match.start() - offset
    end = match.end() + offset

    # Ensure our expanded start/end locations don't overshoot the string
    if start < 0:
        start = 0
    if end > len(string):
        end = len(string)

    # Print the results
    print(span, "->", string[start:end])

With our function defined, we can call it once.

match = re.search(r"lightning", frankenstein)
show_match_context(match, frankenstein, 5)
lightning ->  the lightning play

Or as many times as we please.

found = re.finditer(r"lightning", frankenstein)
for match in found:
    show_match_context(match, frankenstein, 5)
lightning ->  the lightning play
lightning -> s of lightning dazz
lightning -> h of lightning
illu
lightning -> llid
lightnings tha
lightning -> s of lightning,
plu

It’s somewhat annoying to have to write out the offset every time we call the function. To circumvent this, you can specify a default value for a parameter. Your function will use that if you do not supply an argument for that parameter.

def show_match_context(match, string, offset = 5):
    # Get the match text
    span = match.group()

    # Get its start and end, then offset both
    start = match.start() - offset
    end = match.end() + offset

    # Ensure our expanded start/end locations don't overshoot the string
    if start < 0:
        start = 0
    if end > len(string):
        end = len(string)

    # Print the results
    print(span, "->", string[start:end])

With no argument supplied:

match = re.search(r"lightning", frankenstein)
show_match_context(match, frankenstein)
lightning ->  the lightning play

Supplying an argument:

match = re.search(r"lightning", frankenstein)
show_match_context(match, frankenstein, 15)
lightning -> yage I saw the lightning playing on the

Recall that re.search() returns a special Match object, and re.finditer() yields one Match per hit; these objects have properties that extend beyond typical strings. If you forget this and try to call your function on an object it doesn’t expect, you’ll run into an error:

show_match_context("lightning", frankenstein)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[99], line 1
----> 1 show_match_context("lightning", frankenstein)

Cell In[96], line 3, in show_match_context(match, string, offset)
      1 def show_match_context(match, string, offset = 5):
      2     # Get the match text
----> 3     span = match.group()
      5     # Get its start and end, then offset both
      6     start = match.start() - offset

AttributeError: 'str' object has no attribute 'group'
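One way to surface this mistake earlier, with a clearer message, is to check the argument's type at the top of the function. The variant below, show_match_context_checked(), is a hypothetical sketch of this pattern, not part of the chapter's function:

```python
import re

def show_match_context_checked(match, string, offset=5):
    # Hypothetical guard: fail fast with a clearer message for non-Match input
    if not isinstance(match, re.Match):
        raise TypeError("match must be an re.Match, got " + type(match).__name__)
    span = match.group()
    # Offset the start/end, clamping them to the bounds of the string
    start = max(match.start() - offset, 0)
    end = min(match.end() + offset, len(string))
    print(span, "->", string[start:end])

# A Match works as before; a plain string now raises TypeError immediately
m = re.search(r"cat", "a big cat sat")
show_match_context_checked(m, "a big cat sat", 2)
```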

The more code you write, the harder it is to keep this sort of thing in your mind. That’s why it’s helpful to document your function with a docstring. Docstrings are descriptions of what your function does and what kinds of parameters it expects. They go on the first line of your function’s body and are surrounded by triple quotes """.

def starts_with(text, character):
    """Determine whether a string starts with a character."""
    first_char = text[0]
    return first_char == character

Once you’ve written a docstring, you can use help() in a Python console or ? in a Jupyter Notebook to display this information.

help(starts_with)
Help on function starts_with in module __main__:

starts_with(text, character)
    Determine whether a string starts with a character.

There are several styles for writing docstrings, but the NumPy conventions are good ones. They specify docstrings like so:

def func(x, y):
    """A short summary description of a function ending with a period.
    
    A longer description if necessary.

    Parameters
    ----------
    x : x's datatype
        A description of what x is
    y : y's datatype
        A description of what y is

    Returns
    -------
    value : value's datatype
        A description of what value is (only supply if the function returns a
        value)
    """

Let’s document show_match_context() with a docstring.

def show_match_context(match, string, offset = 5):
    """Print a regex match's surrounding characters.

    Parameters
    ----------
    match : re.Match
        A regex match from re.search() or re.finditer()
    string : str
        The string in which the match was found
    offset : int
        The number of surrounding characters to the left/right of match
    """
    # Get the match text
    span = match.group()

    # Get its start and end, then offset both
    start = match.start() - offset
    end = match.end() + offset

    # Ensure our expanded start/end locations don't overshoot the string
    if start < 0:
        start = 0
    if end > len(string):
        end = len(string)

    # Print the results
    print(span, "->", string[start:end])

Now our function is fully documented and ready for later use.

help(show_match_context)
Help on function show_match_context in module __main__:

show_match_context(match, string, offset=5)
    Print a regex match's surrounding characters.

    Parameters
    ----------
    match : re.Match
        A regex match from re.search() or re.finditer()
    string : str
        The string in which the match was found
    offset : int
        The number of surrounding characters to the left/right of match