Python: RegEx Module

Regular expressions are often regarded as essential to serious string processing. Hence, Python ships the re module to perform complex string manipulations with precision and efficiency. A general overview of RegEx has been discussed in this blog here. A 3rd-party library, regex (https://pypi.org/project/regex), is backwards-compatible with the re module and adds extra functionality; a few of the examples below use it.

The authoritative source for this module is https://github.com/python/cpython/blob/3.7/Lib/re.py. Although the file is currently fewer than 400 lines long, it packs a handful of intuitive functions. Quoting from the horse’s mouth:

This module exports the following functions:
match Match a regular expression pattern to the beginning of a string.
fullmatch Match a regular expression pattern to all of a string.
search Search a string for the presence of a pattern.
sub Substitute occurrences of a pattern found in a string.
subn Same as sub, but also return the number of substitutions made.
split Split a string by the occurrences of a pattern.
findall Find all occurrences of a pattern in a string.
finditer Return an iterator yielding a Match object for each match.
compile Compile a pattern into a Pattern object.
purge Clear the regular expression cache.
escape Backslash all non-alphanumerics in a string.

Let’s explore the usage of each of these functions:

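Compile(pattern[, flags])

Converts a RegEx pattern string into a Pattern object whose methods (match, search, sub, and so on) are reached via the ‘dot notation.’ The optional second argument is a set of flags such as re.IGNORECASE, not an index. Compiling once keeps code brief because the pattern does not have to be repeated in every call to an re function.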
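
As a minimal sketch of that idea, the snippet below compiles one pattern and reuses it, then shows the equivalent module-level call for comparison. The variable names and the sample pattern are illustrative assumptions, not taken from the re source.

```python
import re

# Compile once; the Pattern object exposes the same operations
# as the module-level functions (match, search, findall, sub, ...).
word = re.compile(r"[a-z]+", re.IGNORECASE)  # second argument is flags

first = word.search("Gone Fishing").group()   # first word found
all_words = word.findall("Gone Fishing")      # every word found

# Equivalent module-level call: pattern and flags are passed each time.
same_words = re.findall(r"[a-z]+", "Gone Fishing", re.IGNORECASE)

print(first, all_words, same_words == all_words)
```

Both forms give the same results; the compiled form simply avoids restating the pattern and flags.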

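Match(pattern, string[, flags])

Matches the pattern only at the beginning of a string. The module-level re.match() accepts optional flags as its third argument. The match() method of a compiled pattern instead accepts an optional start position (default 0, the first character of the string), and the match must begin exactly at that position.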

import re
pattern = re.compile("i") # compile the one-character pattern "i" into a Pattern object named pattern
pattern.match("fish") # result: None (no match) because the string starts with "f", not "i"
pattern.match("fish", 1) # result: matched "i"; matching begins at index 1
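
To make the difference between match() and search() concrete, here is a small sketch using only the standard re module. The variable names are illustrative assumptions.

```python
import re

pattern = re.compile("i")

at_start = pattern.match("fish")      # None: position 0 holds "f"
at_pos_1 = pattern.match("fish", 1)   # Match: matching begins at index 1
anywhere = pattern.search("fish")     # Match: search() scans forward

# The module-level re.match() has no position argument;
# its third parameter is flags.
module_level = re.match("i", "fish")  # None

print(at_start, at_pos_1.span(), anywhere.start(), module_level)
```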
Fullmatch()

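Differs from match() in that fullmatch() succeeds only if the entire string matches the regex pattern, not just its beginning. The partial=True argument used below is a feature of the 3rd-party regex module; the standard re module does not accept it.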

import regex # 3rd-party module (pip install regex); partial matching is not available in re
phoneNumPattern = regex.compile(r'\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4}') #Explain: American phone number
phoneNumPattern.fullmatch('(714) 555', partial=True) #Result: a partial match, because the text is a valid beginning of a phone number
phoneNumPattern.fullmatch('(714) 555-5555-6', partial=True) #Result: no match because of the trailing '-6'
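
For readers without the regex package, the following sketch shows the same match() versus fullmatch() contrast with the standard re module. The simplified NNN-NNN-NNNN pattern is a hypothetical stand-in for the longer phone pattern above.

```python
import re

# Hypothetical simplified pattern: NNN-NNN-NNNN only.
phone = re.compile(r"\d{3}-\d{3}-\d{4}")

text = "714-555-5555-6"
prefix_ok = phone.match(text) is not None               # a prefix matches
whole_ok = phone.fullmatch(text) is not None            # "-6" is left over
exact_ok = phone.fullmatch("714-555-5555") is not None  # whole string matches

print(prefix_ok, whole_ok, exact_ok)
```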
Search()

Scans through a string and returns a Match object for the first location where the pattern matches, or None if there is no match anywhere

import regex # 3rd-party module; detach_string() below is not available in re
t = regex.search(r"\w+", "Taco Bell")
t.group() #Result: Taco
t.string #Result: Taco Bell
t.detach_string() #Drop the match object's reference to the searched string while keeping the captured results. This helps free memory when the string is a large document
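
The same lookup works with the standard re module, minus detach_string(). This sketch also shows the None check that search() calls usually need; the variable names are illustrative assumptions.

```python
import re

m = re.search(r"\w+", "Taco Bell")
first_word = m.group()   # text of the first match
position = m.span()      # (start, end) indexes of that match

# search() returns None when nothing matches, so guard before using it.
missing = re.search(r"\d+", "Taco Bell")
digits = missing.group() if missing else None

print(first_word, position, digits)
```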
Sub()

Substitutes matches found in a string. This is effectively a ‘replace’ command

import re
someString = "Gone Fishing"
re.sub(r"\s", "... ", someString) #Result: returns "Gone... Fishing"; someString itself is unchanged
Subn()

Performs sub() and also returns the number of substitutions made, as a (new string, count) tuple.

# substitute whitespace characters with '... '
re.subn(r"\s", "... ", someString) #Result: ("Gone... Fishing", 1), i.e. the new string and 1 substitution
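
Beyond plain text replacement, sub() and subn() also accept backreferences, a count limit, and a function as the replacement. The sketch below illustrates these under assumed variable names; the sample patterns are not from the original post.

```python
import re

some_string = "Gone Fishing"

# Backreferences (\1, \2) reuse captured groups: swap the two words.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", some_string)

# count limits how many matches are replaced.
limited = re.sub(r"[aeiou]", "*", some_string, count=2)

# A function can compute each replacement from its Match object.
shouted = re.sub(r"\w+", lambda m: m.group().upper(), some_string)

# subn() returns the new string together with the substitution count.
masked, count = re.subn(r"[aeiou]", "*", some_string)

print(swapped, limited, shouted, masked, count)
```

In every case the original string is left untouched, since Python strings are immutable.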
Split()

Splits a string by the occurrences of a pattern and returns the pieces as a list. This function takes 4 arguments; the last 2 (maxsplit and flags) fall back to their default values when not specified.

# Signature: split(pattern, string, maxsplit=0, flags=0)
someString = "Gone Fishing"
re.split(r"\s", someString) #Result: ["Gone", "Fishing"]
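
Here is a short sketch of the optional maxsplit argument, plus the effect of a capturing group in the pattern. The sample strings and names are illustrative assumptions.

```python
import re

line = "a, b;c ,d"

# Split on a comma or semicolon with optional surrounding whitespace.
parts = re.split(r"\s*[,;]\s*", line)

# maxsplit=1 stops after the first split; the rest stays in one piece.
first_two = re.split(r"\s*[,;]\s*", line, maxsplit=1)

# A capturing group keeps the separators in the result list.
with_seps = re.split(r"([,;])", "a,b;c")

print(parts, first_two, with_seps)
```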
Findall()

“If one or more capturing groups are present in the pattern, return
a list of groups; this will be a list of tuples if the pattern
has more than one group.”

# Takes 3 arguments; the 3rd, flags, defaults to 0 (no flags set)
# Signature: findall(pattern, string, flags=0)
someString = "Gone Fishing"
# find every non-overlapping run of four word characters in someString
re.findall(r"\w{4}", someString) #Result: ["Gone", "Fish"]
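
The quoted docstring says the return shape depends on how many capturing groups the pattern has. A minimal sketch of the three cases, using made-up sample text:

```python
import re

text = "x=1, y=22, z=333"

# No groups: a list of whole matches.
whole = re.findall(r"\w=\d+", text)

# One group: a list of that group's text only.
values = re.findall(r"\w=(\d+)", text)

# Two groups: a list of tuples, one tuple per match.
pairs = re.findall(r"(\w)=(\d+)", text)

print(whole, values, pairs)
```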
Finditer()

“Return an iterator over all non-overlapping matches in the
string. For each match, the iterator returns a Match object.”

# Function takes 3 arguments
# Signature: finditer(pattern, string, flags=0)

The following script reads a webpage, puts it through the finditer() function, and prints each result as two fields: the start position of the match and the matched text.

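import re
import urllib.request # Python 3 counterpart of urllib2

# read() returns bytes; decode to str so a str pattern can search it
webpage = urllib.request.urlopen('https://kimconnect.com/professional-it-credentials/').read().decode('utf-8', errors='replace')
# Pattern: no whitespace or extra "@" around the "@", TLD limited to alphanumerics, no end-of-string anchor so addresses anywhere in the page are found
emailPattern = re.compile(r'[^@\s]+@[^@\s]+\.[a-zA-Z0-9]+')
for match in emailPattern.finditer(webpage):
    # group(0) is the whole match; this pattern defines no group 1
    print("%s: %s" % (match.start(), match.group(0)))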

Sample Output:

85: admin@kimconnect.com
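
To try finditer() without a network call, the same loop can be run on an in-memory string. This is a sketch: the sample text and its example.com addresses are made up for illustration.

```python
import re

# Made-up sample text standing in for a downloaded page.
text = "Contact admin@example.com or sales@example.org for details."
email = re.compile(r"[^@\s]+@[^@\s]+\.[a-zA-Z0-9]+")

# Each iteration yields a Match object; collect (start, text) pairs.
hits = [(m.start(), m.group()) for m in email.finditer(text)]

for start, address in hits:
    print("%s: %s" % (start, address))
```

Because finditer() is lazy, matches are produced one at a time, which suits large inputs better than building the full list that findall() returns.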
Purge()

Clears the regular expression cache. The re module keeps recently compiled patterns in a cache of bounded size, so calling this function is rarely necessary.

Escape()

Returns a string with special characters backslashed; this is useful if you want to match an arbitrary literal string that may contain regular expression metacharacters.

Example:

import re
re.escape('https://kimconnect.com') #Result on Python 3.7+: 'https://kimconnect\\.com' (older versions also backslash ':' and '/')
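
A typical use is searching for user-supplied text that happens to contain metacharacters. The sketch below uses a made-up literal and compares the escaped and unescaped searches.

```python
import re

literal = "1+1=2 (really)"   # contains the metacharacters + ( )
haystack = "Proof: 1+1=2 (really)."

# Escaped: every metacharacter is matched literally.
found = re.search(re.escape(literal), haystack) is not None

# Unescaped: "+" means repetition and "(...)" is a group,
# so the pattern no longer describes the literal text.
unescaped = re.search(literal, haystack) is not None

print(found, unescaped)
```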
