
Lesson 23: Regular Expressions

The power of pattern matching with precision.

Content

  1. Regular Expressions
  2. Basic Patterns
  3. Character Classes
  4. Quantifiers
  5. Anchors and Boundaries
  6. Grouping and Capturing
  7. Practice
  8. Homework

1. Regular Expressions

Regular expressions (regex or regexp) are a powerful tool for processing text.

They allow you to specify a pattern of text to search for in a string.

1.1 Use cases

Generally, we need regular expressions for string data in the following use cases:

| Use Case | Description | Notes/Examples |
| --- | --- | --- |
| Validation | Ensuring that strings match a specific format to ensure data integrity and consistency. | Validating formats like email addresses, phone numbers, URLs. |
| Search and Replace | Finding and replacing substrings within a larger text body. | Useful in code editing, data cleaning, and log processing. |
| Data Extraction | Extracting information from structured text for analysis or format migration. | Parsing data from log files, spreadsheets, or HTML documents. |
| Text Parsing | Splitting text into tokens or segments based on patterns. | Aids in analyzing complex strings or constructing parsers. |
| Complex Pattern Matching | Identifying patterns within text not easily described by standard string methods. | Identifying specific word sequences or characters with variable amounts of whitespace. |
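To make the use cases above a little more concrete, here is a minimal sketch that touches four of them on a single line of text. The sample log line and the variable names are invented for illustration; only the re functions themselves come from the standard library.

```python
import re

# Hypothetical log line, made up for this illustration
log_line = "2024-01-15  ERROR   disk usage at 91%"

# Validation: does the line start with a yyyy-mm-dd style date?
is_dated = re.match(r"\d{4}-\d{2}-\d{2}", log_line) is not None

# Search and replace: collapse every run of whitespace into one space
cleaned = re.sub(r"\s+", " ", log_line)

# Data extraction: capture the digits that precede the percent sign
percent = re.search(r"(\d+)%", log_line).group(1)

# Text parsing: split the line into tokens on any run of whitespace
tokens = re.split(r"\s+", log_line)

print(is_dated)  # True
print(cleaned)   # 2024-01-15 ERROR disk usage at 91%
print(percent)   # 91
print(tokens)    # ['2024-01-15', 'ERROR', 'disk', 'usage', 'at', '91%']
```

Each of these functions is covered in more detail later in the lesson, so there is no need to understand every pattern yet.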

1.2 Example

I want to scare you in advance and show a complicated regular expression used to validate an email address, so that you know what to expect from the lesson:

import re

# Email validation pattern
email_pattern = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"

email = "example@test.com"

# Check if the pattern matches the email
if re.match(email_pattern, email):
    print("Valid email address.")
else:
    print("Invalid email address.")

Regular expressions are an extremely important and useful tool to have in your programming arsenal. Don't be scared; we will find out how to work with regex very soon.
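If the email pattern looks intimidating, it may help to see it split into pieces. The sketch below rewrites the same pattern with the re.VERBOSE flag, which lets a pattern span several lines and carry comments. The list of test addresses is an assumption made for this illustration.

```python
import re

# The same email pattern as above, split into commented pieces.
# With re.VERBOSE, whitespace and "#" comments in the pattern are ignored.
email_pattern = re.compile(
    r"""
    ^                   # start of the string
    [a-zA-Z0-9_.+-]+    # username: letters, digits, and _ . + -
    @                   # the literal @ sign
    [a-zA-Z0-9-]+       # first part of the domain
    \.                  # a literal dot
    [a-zA-Z0-9-.]+      # rest of the domain, e.g. "com" or "co.uk"
    $                   # end of the string
    """,
    re.VERBOSE,
)

# Hypothetical addresses used only for this demonstration
addresses = ["example@test.com", "bademail@com", "another.email@test.co.uk"]
results = {addr: bool(email_pattern.match(addr)) for addr in addresses}

for addr, ok in results.items():
    print(addr, "->", "valid" if ok else "invalid")
```

Note that this pattern only checks the overall shape of an address; it is a teaching example rather than a complete email validator.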

2. Basic Patterns

Regular expressions use a combination of literal characters and special characters to define patterns for matching strings.

To work with regular expressions in Python, we need to import the re module. During this lesson, I will introduce some of its functionality and explain the details in inline comments within the examples.

2.1 Literal Characters

Literal characters are the simplest form of pattern matching in regular expressions.

They match exactly the characters specified in the regex pattern. For instance, the regex pattern cat will match the string "cat" in any larger string.

2.2 Example

import re

text = "The cat sat on the mat."
pattern = "cat"

# Finding "cat" in the text
match = re.search(pattern, text)
if match:
    print("Match found:", match.group())
else:
    print("No match found.")

Output

Match found: cat

Assignment 1:

Find words UK and Britain in the following poem.

Poem

poem = """
In the heart of the UK, where stories entwine,
Across the grand landscapes, where the sun dares to shine.
A tapestry rich, in green and in grey,
Where history whispers, in Britain, they say.
A land where the past and future remain.
"""

2.2 Special Characters and Sequences

Special characters and sequences carry specific instructions in a regex; they describe what the pattern itself should match.

| Character | Description | Example Pattern | Example Match |
| --- | --- | --- | --- |
| . (Dot) | Matches any single character except newline (\n). | a.c | Matches "abc", "arc", "a3c". |
| ^ (Caret) | Matches the start of a string. | ^Hello | Matches "Hello" at the beginning of a string. |
| $ (Dollar) | Matches the end of a string. | end$ | Matches "end" at the end of "This is the end". |
| * (Asterisk) | Matches 0 or more occurrences of the preceding element. | bo* | Matches "boooo" in "A ghost said boooo". |
| + (Plus) | Matches 1 or more occurrences of the preceding element. | a+ | Matches "a" in "cat" and "aaaaaa" in "caaaaaat". |
| ? (Question Mark) | Makes the preceding element optional, matching 0 or 1 occurrence. | colou?r | Matches both "color" and "colour". |
| {n} (Curly Brackets) | Matches exactly n occurrences of the preceding element. | a{2} | Matches only the first "aa" in "caaat". |
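As a quick sanity check, the patterns from the table can be run against short sample texts with re.findall. This is only a sketch: the sample texts are assembled from the example strings in the table, and the rows and results names are made up for the illustration.

```python
import re

# (pattern, sample text) pairs based on the table above
rows = [
    (r"a.c", "abc arc a3c"),
    (r"^Hello", "Hello world"),
    (r"end$", "This is the end"),
    (r"bo*", "A ghost said boooo"),
    (r"a+", "caaaaaat"),
    (r"colou?r", "color colour"),
    (r"a{2}", "caaat"),
]

results = {}
for pattern, text in rows:
    # re.findall returns every non-overlapping match, left to right
    results[pattern] = re.findall(pattern, text)
    print(f"{pattern:10} {results[pattern]}")
```

Running it shows, for instance, that a+ consumes the whole run of "a" characters in one match, while a{2} stops after exactly two.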

Let's use a regular expression to find all instances that match a pattern in a given text. We'll combine several special characters to demonstrate their usage.

IMPORTANT: Try not to move to the explanation parts for the following examples, and based on the table, identify what exactly this pattern wants to achieve.

2.3 Example

Pattern: ^T.*end$

Text: "The quick brown fox jumps over the lazy dog to reach the end"

import re

text = "The quick brown fox jumps over the lazy dog to reach the end"
pattern = "^T.*end$"

# Using re.findall to search for the pattern
matches = re.findall(pattern, text)
print("Matches found:", matches)

Output

Matches found: ['The quick brown fox jumps over the lazy dog to reach the end']

Explanation

This pattern seeks strings that start with "T", followed by any characters (including none), and end with "end".

2.4 Example

Pattern: ^a.*z$

Text: "A to z"

import re

text = "A to z"
pattern = "^a.*z$"

matches = re.findall(pattern, text, re.IGNORECASE)
print("Matches found:", matches)

Output

Matches found: ['A to z']

Explanation

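This regex pattern is designed to find strings that start with "a" and end with "z", with any characters in between. The re.IGNORECASE flag makes the match case-insensitive, which is why the uppercase "A" in the text still matches.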
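To see what re.IGNORECASE contributes in Example 2.4, the short sketch below runs the same pattern with and without the flag. The variable names are chosen only for this illustration.

```python
import re

text = "A to z"
pattern = r"^a.*z$"

# Without the flag, lowercase "a" in the pattern cannot match uppercase "A"
case_sensitive = re.findall(pattern, text)

# With the flag, letter case is ignored
case_insensitive = re.findall(pattern, text, re.IGNORECASE)

print(case_sensitive)    # []
print(case_insensitive)  # ['A to z']
```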

2.5 Example

Pattern: \d+

Text: "There are 123 apples"

import re

text = "There are 123 apples"
pattern = r"\d+"  # raw string, so the backslash reaches the regex engine unchanged

matches = re.findall(pattern, text)
print("Matches found:", matches)

Output

Matches found: ['123']

Explanation

This regex pattern aims to find one or more digits in the text.

2.6 Example

Pattern: f.?o

Text: "The quick brown fo jumps over the lazy dog to reach the end. ffo also matches."

import re

text = "The quick brown fo jumps over the lazy dog to reach the end. ffo also matches."
pattern = "f.?o"

matches = re.findall(pattern, text)
print("Matches found:", matches)

Output

Matches found: ['fo', 'ffo']

Explanation

The pattern f.?o matches an "f", followed by zero or one arbitrary character, followed by an "o". That is why it finds both "fo" and "ffo".

3. Character Classes

Character classes allow you to match specific sets of characters within a string. There are 2 types of character classes: predefined and custom.

Let's take a look at them:

3.1 Predefined

These are the built-in shorthand classes that regular expressions provide out of the box.

| Class | Description | Equivalent (for ASCII text) |
| --- | --- | --- |
| \d | Matches any digit | [0-9] |
| \D | Matches any non-digit character | [^0-9] |
| \s | Matches any whitespace character (including spaces, tabs, and line breaks) | [ \t\n\r\f\v] |
| \S | Matches any non-whitespace character | [^ \t\n\r\f\v] |
| \w | Matches any word character (letters, digits, and underscores) | [a-zA-Z0-9_] |
| \W | Matches any non-word character | [^a-zA-Z0-9_] |
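The uppercase classes are simply the complements of the lowercase ones. A small sketch, using a made-up sample string, shows what each of them picks up:

```python
import re

# Hypothetical sample text for this illustration
text = "Room 42, Floor 3"

non_digits = re.findall(r"\D+", text)  # runs of anything except digits
non_spaces = re.findall(r"\S+", text)  # runs of anything except whitespace
non_words = re.findall(r"\W+", text)   # runs of anything except word characters

print(non_digits)  # ['Room ', ', Floor ']
print(non_spaces)  # ['Room', '42,', 'Floor', '3']
print(non_words)   # [' ', ', ', ' ']
```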

3.2 Custom

| Class | Description |
| --- | --- |
| [abc] | Matches any one of the characters a, b, or c. |
| [^abc] | Matches any character that is not a, b, or c. |
| [a-z] | Matches any lowercase letter. |
| [A-Z] | Matches any uppercase letter. |
| [0-9] | Matches any digit (same as \d). |

To be honest, there is not much more theory to cover here, or anywhere else in this lesson. Let's dive into practice!

3.3 Example

Pattern: \d{2,4}

Text: "The year 2023 marks the event."

import re

text = "The year 2023 marks the event."
pattern = r"\d{2,4}"  # raw string keeps the backslash intact

matches = re.findall(pattern, text)
print("Matches found:", matches)

Output

Matches found: ['2023']

Explanation

This regex pattern aims to find sequences of digits where there are 2 to 4 digits in a row, matching years or other numerical data within that range.

3.4 Example

Pattern: [A-Za-z]+

Text: "Regex101 is a great tool!"

import re

text = "Regex101 is a great tool!"
pattern = "[A-Za-z]+"

matches = re.findall(pattern, text)
print("Matches found:", matches)

Output

Matches found: ['Regex', 'is', 'a', 'great', 'tool']

Explanation

The pattern [A-Za-z]+ matches one or more consecutive letters, capturing words in the sentence.

3.5 Example

Pattern: [^0-9\s]+

Text: "Error 404: Not Found."

import re

text = "Error 404: Not Found."
pattern = r"[^0-9\s]+"  # raw string keeps the backslash intact

matches = re.findall(pattern, text)
print("Matches found:", matches)

Output

Matches found: ['Error', ':', 'Not', 'Found.']

Explanation

Matches sequences of characters that are neither digits nor whitespace.

3.6 Example

Pattern: \d+

Text: "There are 15 apples in 4 baskets."

import re

text = "There are 15 apples in 4 baskets."
pattern = r"\d+"  # raw string keeps the backslash intact

matches = re.findall(pattern, text)
print("Matches found:", matches)

Output

Matches found: ['15', '4']

Explanation

Find one or more digits in the text, effectively extracting all numerical values.


3.7 Example

Pattern: \w+

Text: "Hello, world! Regex is amazing."

import re

text = "Hello, world! Regex is amazing."
pattern = r"\w+"  # raw string keeps the backslash intact

matches = re.findall(pattern, text)
print("Matches found:", matches)

Output

Matches found: ['Hello', 'world', 'Regex', 'is', 'amazing']

Explanation

Matches sequences of word characters (letters, digits, underscores), capturing words in the sentence.


3.8 Example

Pattern: [aeiou]

Text: "Regular expressions are powerful."

import re

text = "Regular expressions are powerful."
pattern = "[aeiou]"

matches = re.findall(pattern, text)
print("Matches found:", matches)

Output

Matches found: ['e', 'u', 'a', 'e', 'e', 'i', 'o', 'a', 'e', 'o', 'e', 'u']

Explanation

Matches any vowel in the text, targeting specific subsets of characters.

4. Quantifiers

Quantifiers in regular expressions define how many instances of a character, group, or character class must be present for a match to occur.

| Quantifier | Description | Example Pattern | Matches |
| --- | --- | --- | --- |
| * | Matches 0 or more occurrences of the preceding element. | a*b | "b", "ab", "aab", "aaab", ... |
| + | Matches 1 or more occurrences of the preceding element. | a+b | "ab", "aab", "aaab", ... |
| ? | Matches 0 or 1 occurrence of the preceding element, making it optional. | a?b | "b", "ab" |
| {n} | Matches exactly n occurrences of the preceding element. | a{2}b | "aab" |
| {n,} | Matches n or more occurrences of the preceding element. | a{2,}b | "aab", "aaab", "aaaab", ... |
| {n,m} | Matches between n and m occurrences of the preceding element, inclusive. | a{2,3}b | "aab", "aaab" |
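The table can be checked mechanically. The sketch below tests every example pattern against the same small set of candidate strings with re.fullmatch, which succeeds only when the whole string fits the pattern. The candidates list and the accepted dictionary are names invented for this illustration.

```python
import re

candidates = ["b", "ab", "aab", "aaab", "aaaab"]
patterns = [r"a*b", r"a+b", r"a?b", r"a{2}b", r"a{2,}b", r"a{2,3}b"]

accepted = {}
for pattern in patterns:
    # Keep only the candidates that the pattern matches from start to end
    accepted[pattern] = [s for s in candidates if re.fullmatch(pattern, s)]
    print(f"{pattern:10} {accepted[pattern]}")
```

Comparing the printed lists row by row with the "Matches" column is a quick way to get a feel for each quantifier.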

4.1 Greedy vs Lazy Quantification

Quantifiers are greedy by default, meaning they match as many occurrences of the pattern as possible.

To change this, we can make a quantifier lazy by appending a ? to it, so that it matches as few characters as the overall pattern needs to succeed.

I guess, let's dive into practice again?

4.2 Example

Pattern: a.*c

Text: "abcabc"

import re

text = "abcabc"
pattern = "a.*c"

matches = re.findall(pattern, text)
print("Matches found:", matches)

Output

Matches found: ['abcabc']

Explanation

The greedy quantifier .* matches as many characters as possible between a and c, capturing the entire string "abcabc".


4.3 Example

Pattern: a.*?c

Text: "abcabc"

import re

text = "abcabc"
pattern = "a.*?c"

matches = re.findall(pattern, text)
print("Matches found:", matches)

Output

Matches found: ['abc', 'abc']

Explanation

By making the quantifier lazy (.*?), the pattern matches as little as possible, resulting in two separate matches "abc" and "abc", instead of the entire string.
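A common place where the greedy and lazy behaviours differ visibly is text with repeated delimiters. The sketch below uses a made-up HTML-like string; it is meant only to illustrate the two quantifier modes, not as a way to parse real HTML.

```python
import re

# Hypothetical markup used only to contrast the two behaviours
html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the LAST ">" in the string
greedy = re.findall(r"<.*>", html)

# Lazy: .*? stops at the FIRST ">" it can reach
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```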


4.4 Example

Pattern: a{2,3}

Text: "aaaaa"

import re

text = "aaaaa"
pattern = "a{2,3}"

matches = re.findall(pattern, text)
print("Matches found:", matches)

Output

Matches found: ['aaa', 'aa']

Explanation

Matches between 2 and 3 occurrences of a. The greedy quantifier takes three characters first, and the remaining two form a second match. This example shows how to specify an exact count or a range of repetitions in the pattern itself.

5. Anchors and Boundaries

Anchors and boundaries do not match characters but rather match positions within the input text.

They are used to assert that the required match is at a particular position in the text.

5.1 Boundaries

The word boundary anchor \b is used to denote the boundaries of words. It allows a regular expression to specify that a given pattern must occur at the beginning or end of a word within the text.

  • \bWORD\b: Matches the exact word "WORD".

5.2 Example

Pattern: \bcat\b

Text: "The cat sat on the mat."

import re

text = "The cat sat on the mat."
pattern = r"\bcat\b"

matches = re.findall(pattern, text)
print("Matches found:", matches)

Output

Matches found: ['cat']

Explanation

This pattern uses word boundaries to find "cat" as a whole word in the text, ensuring it doesn't match substrings of larger words.
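The effect of \b is easier to see on a text where "cat" also appears inside longer words. The sentence below is invented for this illustration. The sketch also shows why the pattern is written as a raw string: in a normal Python string, "\b" is the backspace character, not a word boundary.

```python
import re

# Hypothetical sentence with "cat" inside longer words
text = "The cat tried to concatenate the catalog."

anywhere = re.findall(r"cat", text)        # also hits conCATenate and CATalog
whole_word = re.findall(r"\bcat\b", text)  # only the standalone word

# Without the r prefix, Python turns "\b" into a backspace character,
# so the regex engine never sees a word-boundary anchor
no_raw = re.findall("\bcat\b", text)

print(anywhere)    # ['cat', 'cat', 'cat']
print(whole_word)  # ['cat']
print(no_raw)      # []
```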

5.3 Start and End Anchors

The start-of-string anchor ^ and the end-of-string anchor $ match the beginning and end of the entire text, respectively.

  • ^Start: Matches "Start" at the beginning of the text.
  • End$: Matches "End" at the end of the text.

5.4 Example

Pattern: ^The

Text: "The start of the sentence."

import re

text = "The start of the sentence."
pattern = "^The"

matches = re.findall(pattern, text)
print("Matches found:", matches)

Output

Matches found: ['The']

Explanation

The pattern ^The matches the word "The" only if it appears at the start of the text.

5.5 Example

Pattern: sentence\.$

Text: "This is the end of the sentence."

import re

text = "This is the end of the sentence."
pattern = r"sentence\.$"

matches = re.findall(pattern, text)
print("Matches found:", matches)

Output

Matches found: ['sentence.']

Explanation

The pattern sentence\.$ matches "sentence." only if it is at the very end of the text. The dot is escaped as \. so that it matches a literal period; an unescaped . would match any character in that position.

6. Grouping and Capturing

Grouping and capturing allow you to treat multiple characters as a single unit, extract information from matches, and perform operations on captured groups.

6.1 Parentheses

Parentheses () are used in regular expressions to group parts of the pattern.

This can be useful for applying quantifiers to a sequence of characters or for isolating parts of a pattern for capturing or backreferencing.

6.2 Example

Pattern: (ab)+

Text: "ababab is a repetitive pattern."

import re

text = "ababab is a repetitive pattern."
pattern = r"(ab)+"

matches = re.findall(pattern, text)
print("Matches found:", matches)

Output

Matches found: ['ab']

Explanation

This pattern uses parentheses to group ab and apply the + quantifier, matching one or more occurrences of the "ab" sequence.

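IMPORTANT: When a pattern contains a capturing group, re.findall returns the text captured by the group rather than the whole match. Because the group is repeated here, only its last repetition, "ab", is reported, even though the whole match is "ababab".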
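If the whole match is what you are after, there are at least two ways to get it. The sketch below shows both: a non-capturing group (?:...), which groups without capturing, and re.finditer, which yields match objects whose group(0) is always the full match.

```python
import re

text = "ababab is a repetitive pattern."

# Capturing group: findall reports only the group's last repetition
captured = re.findall(r"(ab)+", text)

# Non-capturing group: findall reports the whole match
whole = re.findall(r"(?:ab)+", text)

# finditer: group(0) is the full match even with a capturing group
via_finditer = [m.group(0) for m in re.finditer(r"(ab)+", text)]

print(captured)      # ['ab']
print(whole)         # ['ababab']
print(via_finditer)  # ['ababab']
```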

6.4 Example

Capturing Groups: By default, groups created with parentheses not only group parts of the pattern but also capture the matched text for later use, such as extracting data.

Pattern: (ab)(cd)

Text: "abcdef"

import re

text = "abcdef"
pattern = r"(ab)(cd)"

match = re.search(pattern, text)
if match:
    print("First group:", match.group(1))
    print("Second group:", match.group(2))

Output

First group: ab
Second group: cd

Explanation

This pattern captures two groups: (ab) and (cd). The .group(1) and .group(2) methods are used to access these captured groups.
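Captured groups can also be referred to again inside the pattern or in a replacement string, which is the backreferencing mentioned in 6.1. As a minimal sketch, \1 below means "the same text that group 1 just captured". The sample sentence is made up for this illustration.

```python
import re

# Hypothetical sentence containing accidentally doubled words
text = "This is is a test of of repeated words."

# (\w+) captures a word; \1 requires the very same word to follow
pattern = r"\b(\w+) \1\b"

doubled = re.findall(pattern, text)

# In the replacement, \1 puts the captured word back once
fixed = re.sub(pattern, r"\1", text)

print(doubled)  # ['is', 'of']
print(fixed)    # This is a test of repeated words.
```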

6.5 Example

Non-Capturing Groups: If you want to use parentheses to group parts of your pattern without capturing the matched text, you can use non-capturing groups syntax (?:pattern).

Pattern: (ab)(?:cd)

Text: "abcdef"

import re

text = "abcdef"
pattern = r"(ab)(?:cd)"

match = re.search(pattern, text)
if match:
    print("First group:", match.group(1))
    # Second group is not captured

Output

First group: ab

Explanation

The pattern (ab)(?:cd) captures "ab" as the first group and uses (?:cd) to group "cd" without capturing it. This allows for grouping without affecting the numbering of other capturing groups.

In some cases we can also pass the following flags to the re functions:

| Flag | Modifier | Description | Example Use Case |
| --- | --- | --- | --- |
| Case Insensitivity | re.IGNORECASE or re.I | Makes the match case-insensitive, allowing patterns to match letters regardless of case. | Matching "cat", "Cat", "CAT", etc., with the same pattern. |
| Multiline | re.MULTILINE or re.M | Treats the start (^) and end ($) characters as working across multiple lines, allowing them to match at the start or end of any line within a string. | Matching patterns at the beginning or end of each line in a text block, rather than only at the very start or end. |
| Dot Matches All | re.DOTALL or re.S | Makes the dot (.) special character match all characters, including the newline character, which it does not match by default. | Matching across lines with a single pattern, where the newline character would typically stop the match. |
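A short sketch of the three flags in action may help. The three-line sample text and the variable names are assumptions made for this illustration; the flags themselves are the ones listed in the table.

```python
import re

# Hypothetical three-line text
text = "first line\nSecond line\nthird line"

# re.MULTILINE: ^ matches at the start of every line, not just the text
start_default = re.findall(r"^\w+", text)
start_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# re.IGNORECASE: "second" also matches "Second"
ignore_case = re.findall(r"second", text, re.IGNORECASE)

# re.DOTALL: the dot may cross newlines
dot_default = re.findall(r"first.*third", text)
dot_all = re.findall(r"first.*third", text, re.DOTALL)

# Flags can be combined with the | operator
combined = re.findall(r"^second.*line$", text, re.I | re.M)

print(start_default)    # ['first']
print(start_multiline)  # ['first', 'Second', 'third']
print(ignore_case)      # ['Second']
print(dot_default)      # []
print(dot_all)
print(combined)         # ['Second line']
```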

7. Practice

I have created a large_text.txt file in Intermediate/assets.

We will take a look at a couple of the objectives, and the rest will be included in the homework.

Objective #1:

Use regular expressions to identify and separate different sections of the text (e.g., Overview, Departments, Courses Offered, Notable Alumni, Contact Information).

Objective #2:

Find all instances of "University", "Department", and "Course" in the text, regardless of case.

Objective #3:

List all department names, assuming each is listed at the start of a new line following "Departments:".

Objective #4:

Extract the university's overview paragraph, including any newlines within it.

Objective #5:

Capture course names and categorize them as either "Introduction" or "Advanced".

Objective #6:

Identify and extract email addresses and phone numbers.

Objective #7:

Highlight all exact occurrences of the words "Regex", "Text", and "Pattern".

Example

"""
I decided to take [Objective #1, Objective #4 and Objective #6]
"""

import re

def read_file_content(filepath):
    """Reads and returns the content of the specified file."""
    with open(filepath, 'r', encoding='utf-8') as file:
        return file.read()


# Objective 1:
def split_sections(text):
    """Splits the text into sections based on headers.
    
    The pattern of regex matches headers followed by a newline
    """
    
    pattern = r"^(.*?):\n"
    sections = re.split(pattern, text, flags=re.MULTILINE)
    sections = dict(zip(sections[1::2], sections[2::2]))  
    
    for title, content in sections.items():
        print(f"Section: {title}\nContent: {content}\n\n")


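# Objective 4:
def extract_overview(text):
    """Extracts the overview paragraph from the text and prints it."""
    # re.DOTALL lets (.*?) span the newlines inside the paragraph;
    # the lookahead stops at a blank line or at the "Departments:" header
    pattern = r"University of Regex\n(.*?)(?=\n\n|\nDepartments:)"
    match = re.search(pattern, text, flags=re.DOTALL)
    if match:
        print("Overview:", match.group(1))
    else:
        print("Overview not found.")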


# Objective 6:
def extract_contact_info(text):
    """Extracts email addresses and phone numbers from the text."""
    # [A-Za-z] instead of [A-Z|a-z]: inside a class, | is a literal character
    email_pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
    # "+" followed by 12 digits, the length of the number in the sample file
    phone_pattern = r"\+\d{12}"
    
    emails = re.findall(email_pattern, text)
    phones = re.findall(phone_pattern, text)
    print("Emails:", emails)
    print("Phones:", phones)


text = read_file_content('assets/large_text.txt')

split_sections(text)
extract_overview(text)
extract_contact_info(text)

Output

Section: Departments
Content: - Computer Science
- Linguistics
- Data Science




Section: Courses Offered
Content: - Introduction to Regular Expressions
- Advanced Pattern Matching
- Text Processing with Python




Section: Notable Alumni
Content: - Jane Doe, Data Scientist at Regex Inc.
- John Smith, Software Engineer at Patterns Co.




Section: Contact Us
Content: Visit our website: www.universityofregex.edu
Email: info@universityofregex.edu
Phone: +447921419244
"""



Overview: Established in 2021, the University of Regex offers comprehensive courses on regular expressions, text processing, 
and pattern matching across various programming languages. 
Emails: ['info@universityofregex.edu']
Phones: ['+447921419244']
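The split_sections function relies on a detail of re.split that is easy to miss: when the pattern contains a capturing group, the captured text is kept in the result list, so headers and bodies alternate. The sketch below shows the mechanics on a tiny stand-in string; the sample text is hypothetical and is not the content of large_text.txt.

```python
import re

# Hypothetical miniature of the real file, just to show the mechanics
sample = (
    "Intro text\n"
    "Departments:\n"
    "- Math\n"
    "- Physics\n"
    "Contact Us:\n"
    "Email: a@b.edu\n"
)

# The header name is captured, so re.split keeps it in the result
parts = re.split(r"^(.*?):\n", sample, flags=re.MULTILINE)

# parts[0] is whatever precedes the first header;
# after that, headers sit at odd indexes and bodies at even indexes
sections = dict(zip(parts[1::2], parts[2::2]))

print(parts)
print(sections)
```

This also explains why text before the first header does not show up as a section: it lands in parts[0], which the slicing skips. A line such as "Email: a@b.edu" is not treated as a header because its colon is not immediately followed by a newline.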

Complete the remaining assignments as the first part of your homework, and think about where they could be applicable in projects you have already written. Happy Coding!

8. Homework

Task 1: Email Validator

Objective: Write a function to validate email addresses using a regular expression.

Requirements:

  • Your function should verify that the email addresses have the correct format: username@domain.com.
  • The username part should contain only letters, numbers, dots, hyphens, and underscores.
  • The domain should contain only letters and dots.
  • Validate a list of email addresses and print out whether each is valid or invalid.

import re

def validate_email(email):
    pattern = r''
    if re.match(pattern, email):
        return "Valid email address."
    else:
        return "Invalid email address."

emails = ["example@test.com", "bademail@com", "another.email@test.co.uk"]
for email in emails:
    print(validate_email(email))

Task 2: Extract Dates

Objective: Extract all dates from a text in the format dd/mm/yyyy.

Requirements:

  • Write a function that searches a given text for dates matching the specified format.
  • Ensure that your regex accounts for the variations in day and month (e.g., 01/01/2022 and 1/1/2022).

def extract_dates(text):
    pattern = r''
    dates = re.findall(pattern, text)
    return dates

text = "Important dates include 12/12/2022, 01/01/2023 and 31/12/2022."
print(extract_dates(text))

Task 3: Log File Analysis

Objective: Analyze a server log file to extract and count warning and error messages.

Requirements:

  • Assume the log entries for warnings start with "WARN" and errors with "ERROR".
  • Your function should return the count of each type of message.

log_entries = """
INFO Successful operation.
WARN System instability detected.
ERROR Unable to retrieve data.
WARN Check your network connection.
ERROR Disk full.
"""

def analyze_log(log):
    # Define regex here, btw it's a good case for using re.findall() ;)
    # Though it might be tricky, I believe in you
    warn_count = None
    error_count = None
    return {"Warnings": warn_count, "Errors": error_count}

print(analyze_log(log_entries))

Task 4: Password Strength Checker

Objective: Create a function to assess the strength of a password.

Requirements:

  • A strong password must have at least eight characters, include uppercase and lowercase letters, numbers, and at least one special character.
  • Use regular expressions to validate the password and return whether it's strong or not.

def check_password_strength(password):
    pattern = r'^'      # This is very hard.. 
    if re.match(pattern, password):
        return "Strong password."
    return "Weak password."

print(check_password_strength("StrongPass1$"))
print(check_password_strength("weakpass"))

Task 5: Phone Number Formatter

Objective: Write a function that formats a 10-digit phone number as (xxx) xxx-xxxx.

Requirements:

  • The input will be a string of 10 digits.
  • Your function should format it into the pattern described.

def format_phone_number(number):
    pattern = r''
    formatted = re.sub()
    return formatted

print(format_phone_number("1234567890"))

Task 6: Extract Hashtags

Objective: Extract all hashtags from a given text.

Requirements:

  • Hashtags start with # and contain alphanumeric characters without spaces.
  • Return all found hashtags in a list.

def extract_hashtags(text):
    pattern = r''
    hashtags = re.findall(pattern, text)
    return hashtags

text = "Loving the #sunshine and perfect weather in #California!"
print(extract_hashtags(text))