Python: RegEx Module

It has been said that string handling is only complete with RegEx. Python therefore ships the re module to perform complex string manipulations with precision and efficiency. A general overview of RegEx has been discussed elsewhere on this blog. A 3rd-party library can be used in conjunction with the re module to improve code readability.

The authoritative source for reviewing this module is its own source code. Although the library currently consists of fewer than 400 lines, it packs a handful of intuitive functions. Quoting from the horse’s mouth:

This module exports the following functions:
match Match a regular expression pattern to the beginning of a string.
fullmatch Match a regular expression pattern to all of a string.
search Search a string for the presence of a pattern.
sub Substitute occurrences of a pattern found in a string.
subn Same as sub, but also return the number of substitutions made.
split Split a string by the occurrences of a pattern.
findall Find all occurrences of a pattern in a string.
finditer Return an iterator yielding a Match object for each match.
compile Compile a pattern into a Pattern object.
purge Clear the regular expression cache.
escape Backslash all non-alphanumerics in a string.

Let’s explore the usage of each of these functions:

Compile(pattern[, flags])

Converts a RegEx pattern string into a pattern object whose methods (match, search, sub, and so on) are reached via ‘dot notation.’ This keeps code brief, because the pattern no longer has to be repeated in every call to an re function.
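As a minimal sketch of the reuse described above, a pattern compiled once (here with an optional flag) can be applied to many strings. The pattern and the sample strings are made up for illustration:

```python
import re

# Compile once; re.IGNORECASE is an optional flag (hypothetical example pattern)
word = re.compile(r"fish", re.IGNORECASE)

samples = ["Gone Fishing", "FISH tacos", "Taco Bell"]
# Reuse the compiled object instead of repeating the pattern in every call
hits = [s for s in samples if word.search(s)]
print(hits)  # ['Gone Fishing', 'FISH tacos']
```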

Match(pattern, string[, flags])

Matches the pattern only at the beginning of a string. The match() method of a compiled pattern also accepts an optional start position; the default is index 0, the first character of the string.

import re

# compile() turns the pattern string "i" into a reusable pattern object
pattern = re.compile("i")
pattern.match("fish")     # result: None, because "f" does not match "i"
pattern.match("fish", 1)  # result: matches "i" at index 1

Fullmatch(pattern, string[, flags])

Unlike match(), which only anchors at the start of the string, fullmatch() succeeds only if the entire string matches the pattern.

import regex  # third-party module; the partial= option is not available in re

# American phone number
phoneNumPattern = regex.compile(r'\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4}')
phoneNumPattern.fullmatch('(714)', partial=True)  # result: partial match, because more digits could still complete the number
phoneNumPattern.fullmatch('(714) 555-5555-6', partial=True)  # result: no match because of the trailing '-6'
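For comparison, here is a small sketch using only the standard re module, which has no partial option. The three-digit pattern is an assumption chosen to keep the contrast between match() and fullmatch() easy to see:

```python
import re

digits = re.compile(r"\d{3}")

at_start = digits.match("123abc")   # matches "123"; the trailing "abc" is ignored
whole = digits.fullmatch("123abc")  # None: "abc" is left over
exact = digits.fullmatch("123")     # matches: the whole string is consumed

print(at_start.group(), whole, exact.group())  # 123 None 123
```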

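Search(pattern, string[, flags])

Scans through a string and returns a match object for the first location where the pattern matches, or None if nothing matches.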

t = regex.search(r"\w+", "Taco Bell")  # result: matches "Taco"
t.string  # result: "Taco Bell"
t.detach_string()  # regex module only: releases the string from memory while keeping previously processed results; worthwhile when the string is a large document
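The same kind of lookup can be sketched with the standard re module. The sample string (with two leading spaces) is an assumption, added to show that search() scans forward while match() only looks at the start:

```python
import re

text = "  Taco Bell"
m = re.search(r"\w+", text)
if m:  # search() returns None when nothing matches
    print(m.group(), m.span())  # Taco (2, 6)

# match() only tries position 0, where there is a space
anchored = re.match(r"\w+", text)
print(anchored)  # None
```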

Sub(pattern, repl, string[, count, flags])

Substitutes matches found in a string. This is effectively a ‘replace’ command.

someString = "Gone Fishing"
re.sub(r"\s", "... ", someString)  # returns "Gone... Fishing"; someString itself is unchanged
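Two further options of sub() are worth a short sketch: backreferences in the replacement and the count argument. The strings below are hypothetical:

```python
import re

# \1 and \2 refer to the first and second capturing groups
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Bell, Taco")
print(swapped)  # Taco Bell

# count=1 limits the call to a single replacement
first_only = re.sub(r"\s", "_", "a b c", count=1)
print(first_only)  # a_b c
```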

Subn(pattern, repl, string[, count, flags])

Performs sub() and also returns the number of substitutions made.

# substitute whitespace with dots '... '
re.subn(r"\s", "... ", someString)  # returns ("Gone... Fishing", 1): the new string and 1 substitution
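Because subn() returns a tuple, it unpacks naturally into two names. A minimal sketch with a made-up three-word string:

```python
import re

new_text, count = re.subn(r"\s", "... ", "Gone Fishing Again")
print(new_text)  # Gone... Fishing... Again
print(count)     # 2
```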

Split(pattern, string[, maxsplit, flags])

Splits a string by the occurrences of a pattern. This function takes four arguments; the last two fall back to their default values when omitted.

# Signature: re.split(pattern, string, maxsplit=0, flags=0)
someString = "Gone Fishing"
re.split(r"\s", someString)  # result: ['Gone', 'Fishing']
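The optional maxsplit argument, and the effect of a capturing group in the pattern, can be sketched as follows (sample strings are assumptions):

```python
import re

# maxsplit=1 stops after the first split
head_tail = re.split(r"\s", "Gone Fishing Again", maxsplit=1)
print(head_tail)  # ['Gone', 'Fishing Again']

# A capturing group keeps the separators in the result
with_seps = re.split(r"(-)", "a-b-c")
print(with_seps)  # ['a', '-', 'b', '-', 'c']
```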

Findall(pattern, string[, flags])

“If one or more capturing groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.”

# Signature: re.findall(pattern, string, flags=0); flags defaults to 0 (no flags set)
someString = "Gone Fishing"
# find every run of four consecutive word characters in someString
re.findall(r"\w{4}", someString)  # result: ['Gone', 'Fish']
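The quoted rule about capturing groups is easier to see side by side. This is a small sketch with a hypothetical input string, showing how the shape of the result changes with the number of groups:

```python
import re

text = "ann@home bob@work"

no_groups = re.findall(r"\w+@\w+", text)       # whole matches
one_group = re.findall(r"(\w+)@\w+", text)     # only the group's text
two_groups = re.findall(r"(\w+)@(\w+)", text)  # one tuple per match

print(no_groups)   # ['ann@home', 'bob@work']
print(one_group)   # ['ann', 'bob']
print(two_groups)  # [('ann', 'home'), ('bob', 'work')]
```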

Finditer(pattern, string[, flags])

“Return an iterator over all non-overlapping matches in the string. For each match, the iterator returns a Match object.”

# Signature: re.finditer(pattern, string, flags=0)

The following script reads a webpage, runs it through finditer(), and prints two items for each match: the index where the match starts and the matched text:

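import re
import urllib.request

url = ''  # the page URL is missing here; supply one before running
webpage = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')
# No whitespace or extra '@' inside the address; TLD limited to alphanumerics
emailPattern = re.compile(r'[^@\s]+@[^@\s]+\.[a-zA-Z0-9]+')
for match in emailPattern.finditer(webpage):
    # print the start index and the matched text
    print("%s: %s" % (match.start(), match.group()))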

Purge()

Clears the regular expression cache. Since modern computers usually have plenty of RAM, this function is rarely needed.


Escape(pattern)

Returns a string with special characters backslashed; this is useful if you want to match an arbitrary literal string that may contain regular expression metacharacters.


import re
re.escape('') #Result: 'https\\:\\/\\/\\.kimconnect\\.com'
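A short sketch of why escaping matters: the literal below (a made-up string) contains `+`, `(`, `)` and `.`, which are all metacharacters. Used as a raw pattern it fails to find itself, while the escaped version matches literally:

```python
import re

literal = "1+1=2 (approx.)"  # hypothetical string containing metacharacters
haystack = "so 1+1=2 (approx.) holds"

unescaped = re.search(literal, haystack)           # '+' and '()' are interpreted as regex syntax
escaped = re.search(re.escape(literal), haystack)  # every character is matched literally

print(unescaped)        # None
print(escaped.group())  # 1+1=2 (approx.)
```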
