Python Lecture 16: Mastering Regular Expressions for Powerful Text Processing
Welcome to a lecture that will give you superpowers for working with text! Regular expressions (regex) are a powerful language for describing and matching text patterns. They're used everywhere: validating user input (email addresses, phone numbers), searching through documents, extracting data from logs, cleaning text, and much more. While regex syntax can look intimidating at first, understanding it opens up incredibly efficient solutions to text processing problems that would otherwise require complex code.
Think about common programming tasks: validating that an email address is properly formatted, extracting all phone numbers from a document, finding all dates in text, replacing multiple spaces with single spaces, checking if a password meets complexity requirements. Without regex, these require lengthy string operations with many conditionals. With regex, each becomes a single pattern that clearly expresses what you're looking for. Mastering regex is a force multiplier for text processing.
By the end of this comprehensive lecture, you'll understand regex syntax, know how to use Python's re module, be able to create patterns for real-world tasks, and understand when regex is the right tool. We'll build your knowledge systematically with detailed explanations and practical examples. Regular expressions are challenging but incredibly rewarding to learn. Let's demystify regex together!
Understanding Regular Expressions - What They Really Are
A regular expression is a sequence of characters that defines a search pattern. It's like a specialized language for describing text patterns. Instead of searching for exact strings ("find 'python'"), you describe patterns ("find any word starting with 'p'", "find any valid email address", "find three digits followed by a hyphen").
Why Regex Exists: String methods like find() and replace() work with literal text. But what if you need to find "any number between 1 and 100" or "any email address" or "any word with 3-5 letters"? You'd need complex code with loops and conditions. Regex provides a concise way to express these patterns. A single regex pattern can often replace dozens of lines of such code.
The Trade-off: Regex is powerful but has a steep learning curve. The syntax is terse and cryptic - a pattern like \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b looks like gibberish at first. But this compactness is intentional - regex packs a lot of meaning into few characters. Once you learn the syntax, you can read and write patterns efficiently. The investment in learning pays dividends for the rest of your programming career.
Python's re Module: Python's built-in 're' module provides regex functionality. You import it, define patterns as strings (often raw strings with r""), and use functions like search(), match(), findall(), and sub() to work with text. Understanding the module's functions is as important as understanding regex syntax itself.
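To make the "gibberish" pattern above less intimidating, here is a minimal sketch that spells it out piece by piece using the re.VERBOSE flag, which lets a pattern span several lines with comments. The re.IGNORECASE flag is an assumption added here so that [A-Z] also matches lowercase letters; the names EMAIL_LIKE and sample are illustrative only.

```python
import re

# The email-like pattern from the text, one piece per line.
# re.VERBOSE ignores whitespace and '#' comments inside the pattern.
# re.IGNORECASE is an assumption added so [A-Z] also matches lowercase.
EMAIL_LIKE = re.compile(
    r"""
    \b                 # word boundary
    [A-Z0-9._%+-]+     # local part: letters, digits, and a few symbols
    @                  # literal at-sign
    [A-Z0-9.-]+        # domain labels
    \.                 # literal dot
    [A-Z]{2,}          # top-level domain: two or more letters
    \b                 # word boundary
    """,
    re.IGNORECASE | re.VERBOSE,
)

sample = "Write to Support@Example.com today."
found = EMAIL_LIKE.search(sample)
print(found.group() if found else "no match")
```

Reading the pattern this way shows that each fragment is simple on its own; the compact one-line form is just the same pieces written without spacing.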
Basic Regex Syntax - Building Blocks of Patterns
Let's start with the fundamental building blocks. Regex patterns are built from literal characters (which match themselves) and special metacharacters (which have special meanings).
Literal Characters: Most characters match themselves. The pattern "cat" matches the literal string "cat" in text. Letters, numbers, and most punctuation work this way. This is your starting point - if you just want to find exact text, use literals.
Metacharacters: Certain characters have special meanings: . (any character), * (zero or more), + (one or more), ? (zero or one), ^ (start of string), $ (end of string), [] (character class), () (grouping), | (or), {} (repetition), \ (escape). These are the power of regex but also what makes it cryptic. Each has a specific, precise meaning.
import re
# Literal matching
text = "The cat sat on the mat"
pattern = "cat"
result = re.search(pattern, text)
if result:
    print(f"Found '{pattern}' at position {result.start()}")
# Dot (.) matches any single character
pattern = "c.t" # Matches cat, cot, cut, c@t, etc.
matches = re.findall(pattern, "cat cot cut c9t")
print(f"Pattern 'c.t' matches: {matches}")
# Asterisk (*) - zero or more of preceding
pattern = "co*l" # Matches cl, col, cool, coool, etc.
matches = re.findall(pattern, "cl col cool coool")
print(f"Pattern 'co*l' matches: {matches}")
# Plus (+) - one or more of preceding
pattern = "co+l" # Matches col, cool, coool (but not cl)
matches = re.findall(pattern, "cl col cool coool")
print(f"Pattern 'co+l' matches: {matches}")
# Question mark (?) - zero or one of preceding
pattern = "colou?r" # Matches color or colour
matches = re.findall(pattern, "color colour")
print(f"Pattern 'colou?r' matches: {matches}")
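The examples above cover the dot and the *, +, and ? quantifiers. The metacharacter list also mentions alternation (|), grouping with parentheses, and braces for repetition, so here is a small sketch of those three. The sample text and variable names are made up for illustration.

```python
import re

text = "grey gray groy; a cat, a dog, a cow"

# Alternation: match either alternative
pets = re.findall("cat|dog", text)
print(f"Pets: {pets}")

# Grouping limits the alternation to one part of the pattern.
# finditer + group() gives the full match; with a capture group,
# findall would return only the text captured by the group.
spellings = [m.group() for m in re.finditer("gr(a|e)y", text)]
print(f"Spellings: {spellings}")

# Braces: exactly two 'o' characters in a row
doubles = re.findall("o{2}", "go good goood")
print(f"Double o: {doubles}")
```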
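Raw Strings: Use raw strings (r"pattern") for regex patterns. Python uses the backslash for string escape sequences (\n, \t), and regex also uses backslashes extensively (\d, \w, \b). In a raw string Python leaves backslashes untouched, so the regex engine receives them exactly as written. This matters most for sequences that mean something in both languages: "\b" is a backspace character to Python but a word boundary to regex. Prefer r"\d+" over "\d+": the latter happens to work, but recent Python versions warn about unrecognized escapes such as \d in ordinary string literals.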
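A quick way to see why raw strings matter is to compare what Python actually passes to the regex engine. The sketch below also shows escaping a metacharacter with a backslash and the re.escape() helper from the standard library; the sample strings are illustrative.

```python
import re

# "\b" in a normal string is ONE character (backspace);
# r"\b" is TWO characters (backslash + b), which regex reads as a word boundary
print(len("\b"), len(r"\b"))

text = "the cat sat"
without_raw = re.findall("\bcat\b", text)   # pattern contains backspace characters
with_raw = re.findall(r"\bcat\b", text)     # pattern contains word boundaries
print(f"Without raw string: {without_raw}")
print(f"With raw string: {with_raw}")

# A backslash turns a metacharacter into a literal: \. matches only a real dot
dots = re.findall(r"3\.14", "3.14 3x14")
print(f"Literal dot: {dots}")

# re.escape() escapes arbitrary literal text for use inside a pattern
literal = "1+1=2"
found = re.search(re.escape(literal), "We know 1+1=2 holds")
print(f"Escaped literal found: {found is not None}")
```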
Character Classes - Matching Sets of Characters
Character classes let you specify a set of characters to match. This is more flexible than literal matching but more specific than "any character" (dot).
Basic Character Classes: Square brackets [] define a set. [abc] matches a, b, or c. [0-9] matches any digit. [a-z] matches any lowercase letter. [A-Z] matches any uppercase letter. You can combine ranges: [a-zA-Z0-9] matches any letter or digit.
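Predefined Character Classes: Regex provides shortcuts for common sets: \d matches digits, \w matches word characters (letters, digits, and underscore), \s matches whitespace (space, tab, newline). For ASCII text these behave like [0-9], [a-zA-Z0-9_], and [ \t\n\r\f\v]; with Python 3 string patterns they also match Unicode digits and letters unless the re.ASCII flag is used. Capital versions are negations: \D matches non-digits, \W matches non-word chars, \S matches non-whitespace.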
| Pattern | Description | Example Matches |
|---|---|---|
| [abc] | Matches a, b, or c | a, b, c |
| [a-z] | Matches any lowercase letter | a, m, z |
| [0-9] | Matches any digit | 0, 5, 9 |
| [^abc] | Matches anything except a, b, c | x, 1, @ |
| \d | Matches any digit | 0-9 |
| \w | Matches word characters | a-z, A-Z, 0-9, _ |
| \s | Matches whitespace | space, tab, newline |
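One detail the table glosses over: for Python 3 string patterns, \d and \w are Unicode-aware by default, so they are slightly broader than [0-9] and [a-zA-Z0-9_]. The following sketch, using made-up sample text, shows the difference and how the re.ASCII flag narrows the shortcuts.

```python
import re

# "\u0664\u0662" are the Arabic-Indic digits four and two
text = "Room 42, \u0664\u0662"

unicode_digits = re.findall(r"\d+", text)                 # Unicode-aware by default
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)   # restrict \d to 0-9
bracket_digits = re.findall(r"[0-9]+", text)              # explicit range, always ASCII
print(len(unicode_digits), len(ascii_digits), len(bracket_digits))

# The same applies to \w: accented letters count as word characters by default
word = "caf\u00e9_1"
print(re.findall(r"\w+", word))
print(re.findall(r"\w+", word, flags=re.ASCII))
```

If the input is known to be plain ASCII, the distinction does not matter; for validation of user input it is worth deciding explicitly which behavior is wanted.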
import re
# Character class - specific characters
text = "The cat, bat, and rat"
pattern = r"[cbr]at" # Matches cat, bat, or rat
matches = re.findall(pattern, text)
print(f"Animals: {matches}")
# Range character class
text = "Call 123-456-7890 or 987-654-3210"
pattern = r"[0-9]+" # Matches one or more digits
matches = re.findall(pattern, text)
print(f"Numbers: {matches}")
# Predefined character class - \d for digits
pattern = r"\d+" # Same as [0-9]+
matches = re.findall(pattern, text)
print(f"Numbers with \\d: {matches}")
# \w for word characters
text = "user_name123 and email@domain.com"
pattern = r"\w+" # Matches word sequences
matches = re.findall(pattern, text)
print(f"Words: {matches}")
# \s for whitespace
text = "Hello World\t\nPython"
pattern = r"\s+" # Matches whitespace sequences
matches = re.findall(pattern, text)
print(f"Whitespace found: {len(matches)} times")
# Negated character class [^...]
text = "abc123xyz"
pattern = r"[^0-9]+" # Matches non-digits
matches = re.findall(pattern, text)
print(f"Non-digits: {matches}")
Quantifiers - Specifying How Many
Quantifiers specify how many times the preceding element should match. They make patterns flexible - matching varying amounts of text.
Basic Quantifiers: * (zero or more), + (one or more), ? (zero or one), {n} (exactly n times), {n,} (n or more times), {n,m} (between n and m times). These follow the element they quantify: \d+ means "one or more digits", \w{3,5} means "3 to 5 word characters".
Greedy vs Non-Greedy: By default, quantifiers are greedy - they match as much text as possible. Adding ? after a quantifier makes it non-greedy (lazy) - it matches as little as possible. For example, .* matches everything up to the end of the line (the dot does not match a newline by default), while .*? stops at the first point where the rest of the pattern can match. This matters when extracting content between delimiters.
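As a small illustration of these ideas, the sketch below contrasts greedy and lazy matching on quoted text, shows the open-ended {n,} form, and demonstrates that the dot stops at a newline unless re.DOTALL is set. The sample strings are invented for the example.

```python
import re

line = 'say "hello" and "goodbye"'

# Greedy: from the first quote to the LAST quote
greedy_quotes = re.findall(r'".*"', line)
# Lazy: from each opening quote to the NEXT quote
lazy_quotes = re.findall(r'".*?"', line)
print(f"Greedy: {greedy_quotes}")
print(f"Lazy: {lazy_quotes}")

# {n,} means n or more: numbers with at least three digits
long_numbers = re.findall(r"\d{3,}", "7 42 100 2024")
print(f"3+ digits: {long_numbers}")

# The dot does not cross a newline unless re.DOTALL is given
two_lines = "abc\ndef"
print(repr(re.search(r"a.*", two_lines).group()))
print(repr(re.search(r"a.*", two_lines, flags=re.DOTALL).group()))
```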
import re
# Exact repetition {n}
text = "Call 123-456-7890"
pattern = r"\d{3}-\d{3}-\d{4}" # Phone number format
match = re.search(pattern, text)
if match:
    print(f"Phone found: {match.group()}")
# Range repetition {n,m}
text = "Passwords: abc, defgh, ijklmnop"
pattern = r"\b\w{5,8}\b" # Words with 5-8 characters
matches = re.findall(pattern, text)
print(f"5-8 char words: {matches}")
# One or more (+)
text = "Find all numbers: 5, 42, 100, 7"
pattern = r"\d+" # One or more digits
matches = re.findall(pattern, text)
print(f"Numbers: {matches}")
# Zero or more (*)
text = "Matches: ab, aab, aaab, b"
pattern = r"a*b" # Zero or more 'a' followed by 'b'
matches = re.findall(pattern, text)
print(f"Pattern 'a*b': {matches}")
# Greedy vs non-greedy
html = "<div>Content 1</div><div>Content 2</div>"
greedy = r"<div>.*</div>" # Matches entire string
non_greedy = r"<div>.*?</div>" # Matches each div separately
print(f"Greedy: {re.findall(greedy, html)}")
print(f"Non-greedy: {re.findall(non_greedy, html)}")
Anchors and Boundaries - Position Matching
Anchors match positions in text rather than characters. They're crucial for precise matching.
Start and End Anchors: ^ matches the start of string, $ matches the end. ^abc matches "abc" only at the beginning. xyz$ matches "xyz" only at the end. ^abc$ matches only if the entire string is exactly "abc". These prevent partial matches when you need exact matches. One subtlety: $ also matches just before a single trailing newline, so ^abc$ accepts "abc\n" as well; re.fullmatch() avoids this.
Word Boundaries: \b matches word boundaries (transition between \w and \W or start/end of string). \bcat\b matches "cat" as a whole word but not in "catalog" or "scat". This is essential for finding complete words, not substrings.
import re
# Start of string anchor (^)
text = "Python is great. Python rocks!"
pattern = r"^Python" # Matches only at start
match = re.search(pattern, text)
if match:
    print("Python found at start")
# End of string anchor ($)
pattern = r"rocks!$" # Matches only at end
match = re.search(pattern, text)
if match:
    print("Found 'rocks!' at end")
# Word boundary (\b)
text = "The cat sat in the catalog"
pattern = r"\bcat\b" # Matches 'cat' as whole word
matches = re.findall(pattern, text)
print(f"Whole word 'cat' found {len(matches)} time(s)")
# Without boundary
pattern = r"cat" # Matches 'cat' anywhere
matches = re.findall(pattern, text)
print(f"'cat' substring found {len(matches)} time(s)")
# Validating exact match
def validate_username(username):
    """Username must be 3-16 alphanumeric characters"""
    pattern = r"^[a-zA-Z0-9]{3,16}$"
    if re.match(pattern, username):
        return "Valid username"
    return "Invalid username"
print(validate_username("john_doe")) # Invalid (underscore)
print(validate_username("john123")) # Valid
print(validate_username("ab")) # Invalid (too short)
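Two anchor behaviors are easy to miss when validating input, so here is a short sketch of them. It assumes the same username rule as above and uses re.fullmatch() and the re.MULTILINE flag, both part of the standard re module; the sample log text is made up.

```python
import re

pattern = r"^[a-zA-Z0-9]{3,16}$"

# $ also matches just before a single trailing newline,
# so this input slips through match() with ^...$
trailing_newline_ok = bool(re.match(pattern, "john123\n"))

# fullmatch() requires the WHOLE string to match; no anchors needed
strict_rejects = re.fullmatch(r"[a-zA-Z0-9]{3,16}", "john123\n") is None
strict_accepts = re.fullmatch(r"[a-zA-Z0-9]{3,16}", "john123") is not None
print(trailing_newline_ok, strict_rejects, strict_accepts)

# By default ^ matches only at the start of the string;
# with re.MULTILINE it matches at the start of every line
log = "ERROR disk full\nINFO started\nERROR timeout"
first_only = re.findall(r"^ERROR", log)
every_line = re.findall(r"^ERROR", log, flags=re.MULTILINE)
print(len(first_only), len(every_line))
```

For validation, fullmatch() is usually the safer choice because the intent (the entire input must conform) is stated by the function rather than by anchors.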
Python's re Module Functions
Understanding regex syntax is half the battle. The other half is knowing Python's re module functions that apply patterns to text.
re.search(): Searches entire string for pattern, returns first match object (or None). Use when you need to find if pattern exists anywhere and get details about the match.
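re.match(): Checks if pattern matches at the start of string. Returns match object or None. It anchors only the start, so for validation either end the pattern with $ or use re.fullmatch(), which requires the whole string to match.
re.findall(): Returns a list of all non-overlapping matches. The items are strings when the pattern has no capture groups; with groups, findall returns the group contents instead (tuples if there is more than one group). Use when you need all occurrences - extracting all emails, phone numbers, etc.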
re.sub(): Replaces matches with new text. Returns modified string. Use for text cleanup and transformation - removing unwanted characters, standardizing formats.
import re
text = "Contact us at support@example.com or sales@example.com"
# re.search() - find first occurrence
pattern = r"\w+@\w+\.\w+"
match = re.search(pattern, text)
if match:
    print(f"First email found: {match.group()}")
    print(f"Position: {match.start()}-{match.end()}")
# re.findall() - find all occurrences
emails = re.findall(pattern, text)
print(f"All emails: {emails}")
# re.match() - check if starts with pattern
text2 = "python@example.com is my email"
if re.match(pattern, text2):
    print("Text starts with email")
else:
    print("Text doesn't start with email")
# re.sub() - replace matches
text3 = "Phone: 123-456-7890"
# Hide middle digits
pattern = r"(\d{3})-(\d{3})-(\d{4})"
result = re.sub(pattern, r"\1-***-\3", text3) # \1 and \3 refer to the first and third groups
print(f"Masked: {result}")
# Remove extra whitespace
text4 = "Too    many     spaces"
result = re.sub(r"\s+", " ", text4)
print(f"Cleaned: {result}")
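Beyond the four functions above, a few related tools from the same module come up often. The sketch below shows how findall() behaves once a pattern contains capture groups, how re.finditer() yields full match objects, and how re.compile() lets a pattern be reused. The variable names are illustrative.

```python
import re

text = "Contact us at support@example.com or sales@example.com"

# With capture groups, findall returns the groups, not the whole match
pairs = re.findall(r"(\w+)@(\w+\.\w+)", text)
print(f"(user, domain) pairs: {pairs}")

# finditer yields match objects, so the full match, groups,
# and positions are all available for every occurrence
for m in re.finditer(r"(\w+)@(\w+\.\w+)", text):
    print(m.group(0), m.group(1), m.span())

# compile() builds a reusable pattern object with the same methods
EMAIL = re.compile(r"\w+@\w+\.\w+")
hidden = EMAIL.sub("[hidden]", text)
print(hidden)
```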
Real-World Regex Applications
import re
def validate_email(email):
    """Validate email address format"""
    # Pattern breakdown:
    # ^[a-zA-Z0-9._%+-]+  - username part
    # @                   - literal @
    # [a-zA-Z0-9.-]+      - domain name
    # \.                  - literal dot
    # [a-zA-Z]{2,}$       - domain extension (2+ letters)
    pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    if re.match(pattern, email):
        return f"'{email}' is valid"
    return f"'{email}' is invalid"
# Test emails
emails = [
    "user@example.com",
    "test.email@domain.co.uk",
    "invalid.email",
    "@example.com",
    "user@.com"
]
for email in emails:
    print(validate_email(email))
import re
def extract_phone_numbers(text):
    """Extract and format US phone numbers"""
    # Patterns for different formats
    patterns = [
        r"\d{3}-\d{3}-\d{4}",        # 123-456-7890
        r"\(\d{3}\)\s*\d{3}-\d{4}",  # (123) 456-7890
        r"\d{3}\.\d{3}\.\d{4}",      # 123.456.7890
        r"\d{10}"                    # 1234567890
    ]
    phones = []
    for pattern in patterns:
        matches = re.findall(pattern, text)
        phones.extend(matches)
    # Format all to standard format
    formatted = []
    for phone in phones:
        # Remove all non-digits
        digits = re.sub(r"\D", "", phone)
        if len(digits) == 10:
            # Format as (123) 456-7890
            formatted_phone = f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
            formatted.append(formatted_phone)
    return formatted
# Test
text = """
Contact information:
Office: 123-456-7890
Mobile: (987) 654-3210
Fax: 555.123.4567
Direct: 8005551234
"""
numbers = extract_phone_numbers(text)
print("Found phone numbers:")
for num in numbers:
    print(f"  {num}")
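The function above runs each pattern separately, so results come back grouped by format rather than in the order they appear in the text. One possible variation, sketched here under the same four assumed formats, joins them into a single pattern with alternation and walks the text once. The names PHONE and extract_in_order are hypothetical, and the \b around the ten-digit form is an added assumption so that longer digit runs are not cut into phone numbers.

```python
import re

# One pattern, four alternatives (adjacent raw strings are concatenated)
PHONE = re.compile(
    r"\(\d{3}\)\s*\d{3}-\d{4}"   # (123) 456-7890
    r"|\d{3}-\d{3}-\d{4}"        # 123-456-7890
    r"|\d{3}\.\d{3}\.\d{4}"      # 123.456.7890
    r"|\b\d{10}\b"               # 1234567890 (whole digit run only)
)

def extract_in_order(text):
    """Return phone numbers in document order, formatted as (123) 456-7890."""
    results = []
    for m in PHONE.finditer(text):
        digits = re.sub(r"\D", "", m.group())  # keep digits only
        results.append(f"({digits[:3]}) {digits[3:6]}-{digits[6:]}")
    return results

sample = "Fax: 555.123.4567, Mobile: (987) 654-3210, Direct: 8005551234, ID 123456789012"
ordered = extract_in_order(sample)
for num in ordered:
    print(num)
```

The 12-digit ID in the sample is skipped because of the word boundaries; without them, its first ten digits would be reported as a phone number.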
Summary and Regex Mastery
Regular expressions are a powerful tool for text processing. You've learned:
✓ What regex is and why it's valuable
✓ Basic syntax: literals and metacharacters
✓ Character classes for matching sets
✓ Quantifiers for repetition
✓ Anchors and boundaries for position
✓ Python's re module functions
✓ Real-world validation and extraction patterns
Practice Regex: The only way to master regex is practice. Start simple and gradually tackle more complex patterns. Test patterns thoroughly with edge cases. Use online regex testers to experiment. Build a library of useful patterns for common tasks. Regex becomes intuitive with experience.
Practice Challenge: Create a log analyzer that uses regex to extract: timestamps, error levels (ERROR, WARNING, INFO), IP addresses, and user IDs from log files. Parse dates in various formats. Count occurrences of each error type. This combines everything you've learned!
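If you want a nudge to get started, here is one possible skeleton for the challenge. Everything about the log layout is an assumption made up for this sketch (timestamp, level, IP address, then user=NAME), and it uses named groups (?P<name>...) and a non-capturing group (?:...), which go slightly beyond the syntax covered in this lecture. Real log files will need a different pattern.

```python
import re
from collections import Counter

# Hypothetical log line layout; adapt the pattern to your actual logs
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>ERROR|WARNING|INFO) "
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"user=(?P<user>\w+)"
)

# Made-up sample data for illustration only
sample_log = """\
2024-01-15 10:30:00 ERROR 192.168.1.10 user=alice Disk full
2024-01-15 10:31:12 INFO 10.0.0.5 user=bob Login ok
2024-01-15 10:32:45 ERROR 192.168.1.10 user=alice Timeout
not a log line
"""

counts = Counter()
for line in sample_log.splitlines():
    m = LOG_LINE.match(line)
    if m:  # lines that do not fit the layout are skipped
        counts[m.group("level")] += 1
        print(m.group("timestamp"), m.group("ip"), m.group("user"))

print(dict(counts))
```

Parsing dates in several formats and handling malformed lines are left for you to add.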
