What is RegEx?
- Regular Expressions are a language of their own for matching patterns
- They are highly useful in text data processing
The official Python source defines a Regex in the following way:
An expression containing ‘meta’ characters and literals to identify and/or replace a pattern matching that expression
Meta Characters: these characters have a special meaning
Literals: these characters are the actual characters that are to be matched
Use Cases
- To search a string pattern
- To split a string based on a pattern
- To replace a part of the string
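The three use cases map onto three functions of Python's `re` module. The snippet below is a minimal sketch; the sample string of dates is made up purely for illustration.

```python
import re

# Hypothetical sample text, not taken from the post.
text = "2024-01-15, 2024-02-20; 2024-03-25"

# Search: find the first date-like pattern in the string.
first = re.search(r"\d{4}-\d{2}-\d{2}", text)
first_date = first.group(0) if first else None

# Split: break the string on a comma or semicolon followed by optional spaces.
parts = re.split(r"[,;]\s*", text)

# Replace: mask the year of every date.
masked = re.sub(r"\d{4}-", "YYYY-", text)

print(first_date)  # 2024-01-15
print(parts)       # ['2024-01-15', '2024-02-20', '2024-03-25']
print(masked)      # YYYY-01-15, YYYY-02-20; YYYY-03-25
```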
A Selected List of Advanced Regex Usages in Python
Case 1: Extract Username and Domain from Email
Key Concepts: Use of the group() method on the match object returned by re.search, and numbered captures using properly placed parentheses. Pattern: "(\w+)\.(\w+)@(\w+)\.\w+"
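Before the full example, here is a small sketch of how capture numbering works on a match object. The address `jane.doe@example.org` and the group names `first`, `last`, and `company` are hypothetical, chosen only for illustration; the named-group variant is an optional alternative, not something the example below relies on.

```python
import re

# Hypothetical sample address, used only for illustration.
sample = "jane.doe@example.org"
m = re.search(r"(\w+)\.(\w+)@(\w+)\.\w+", sample)

whole = m.group(0)   # group 0 is always the entire matched text
parts = m.groups()   # all numbered captures, in order, as a tuple
first = m.group(1)   # captures are numbered by their opening parenthesis

# The same pattern with named groups (names are illustrative assumptions).
named = re.search(r"(?P<first>\w+)\.(?P<last>\w+)@(?P<company>\w+)\.\w+", sample)
as_dict = named.groupdict()

print(whole)    # jane.doe@example.org
print(parts)    # ('jane', 'doe', 'example')
print(as_dict)  # {'first': 'jane', 'last': 'doe', 'company': 'example'}
```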
import re

email = "senthil.kumar@gutszen.com"
pattern = r"(\w+)\.(\w+)@(\w+)\.\w+"
match = re.search(pattern, email)
if match:
    # Groups are numbered left to right by their opening parenthesis.
    first_name = match.group(1)
    last_name = match.group(2)
    company = match.group(3)
    print(f"first_name: {first_name}")
    print(f"last_name: {last_name}")
    print(f"company: {company}")

first_name: senthil
last_name: kumar
company: gutszen

Case 2: A Regex Gotcha - An example where a raw string literal is needed
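As background for this case, the sketch below shows what Python actually stores for a string written with and without the `r` prefix. The variable names `plain` and `raw` are illustrative assumptions.

```python
import re

plain = "this\that"   # "\t" is interpreted as a single tab character
raw = r"this\that"    # the backslash and the "t" are kept as two characters

print(len(plain))     # 8
print(len(raw))       # 9
print("\t" in plain)  # True
print("\\" in raw)    # True

# In a regex, a literal backslash must itself be escaped, so the raw
# pattern r"\\" matches exactly one backslash in the text.
backslashes = re.findall(r"\\", raw)
print(len(backslashes))  # 1
```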
- In most cases, with or without a raw literal, the Python pattern works fine (stackoverflow comment)
- But not for the following example, where text is a raw string literal with a backslash (\) in it
text = r"Can you capture? this\that"
pattern = r"\w+\\\w+"  # \\ matches one literal backslash
matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")

Matched String: this\that

- What happens if I try the example below, where both text and pattern are devoid of the raw literal?
- Do notice the hat word at the end of the matches
text = "Can you capture? this\that"  # without r"", "\t" is read as a tab character
pattern = "\w+\\w+"  # without r"", "\\" collapses to one backslash, leaving \w+\w+
matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")

Matched String: Can
Matched String: you
Matched String: capture
Matched String: this
Matched String: hat

- What if I try the example below?
- Do notice the capture of this<tab_space>hat
text = "Can you capture? this\that"
pattern = r"\w+\t\w+"
matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")

Matched String: this hat

Case 3A: Importance of Greedy Operator !!
- Use of ? after a quantifier as a lazy (non-greedy) modifier; without it, the quantifier is greedy
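In Python's `re` module a quantifier such as `+` is greedy by default, and a trailing `?` makes it lazy. A minimal sketch on a made-up string shows the difference before the fuller example:

```python
import re

# Hypothetical sample string, used only for illustration.
s = "<a><b>"

greedy = re.findall(r"<.+>", s)   # .+ grabs as much as possible
lazy = re.findall(r"<.+?>", s)    # .+? stops at the first possible ">"

print(greedy)  # ['<a><b>']
print(lazy)    # ['<a>', '<b>']
```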
text = "She said, 'Hello', and he replied, 'Hi'"
pattern = "'(.+?)'"  # lazy: stop at the first closing quote
matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")

Matched String: Hello
Matched String: Hi

text = "She said, 'Hello', and he replied, 'Hi'"
pattern = "'(.+)'"  # greedy: run on to the last closing quote
matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")

Matched String: Hello', and he replied, 'Hi

Case 3B: Importance of Escaping Parentheses!!
- What if you want to capture text within parentheses?
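Unescaped parentheses in a pattern create a capture group rather than matching the `(` and `)` characters. The sketch below, on a made-up string, contrasts the two and also shows `re.escape`, which builds the escaped form for you.

```python
import re

# Hypothetical sample string, used only for illustration.
s = "(Hi)"

grouped = re.findall(r"(Hi)", s)     # parentheses act as a capture group
literal = re.findall(r"\(Hi\)", s)   # escaped parentheses match literally
escaped = re.escape("(Hi)")          # lets re add the backslashes

print(grouped)                  # ['Hi']
print(literal)                  # ['(Hi)']
print(escaped)                  # \(Hi\)
print(re.findall(escaped, s))   # ['(Hi)']
```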
text = "She said, (Hello), and he replied, (Hi)"
pattern = r"\((.+?)\)"  # \( and \) match literal parentheses; (.+?) is the capture
matches = re.findall(pattern, text)
for match in matches:
    print(f"Matched String: {match}")

Matched String: Hello
Matched String: Hi

Case 4: Splitting Sentences using Regex
- Use of the negated character class [^<patterns>] to match everything until we meet any of the <patterns>
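A negated character class matches any single character that is not listed inside the brackets. A minimal sketch with made-up inputs, before applying the idea to sentences:

```python
import re

# Hypothetical inputs, used only for illustration.
fields = re.findall(r"[^,]+", "red,green,blue")   # runs of non-comma characters
first_clause = re.match(r"[^.?!]+", "Wait! What?").group(0)

print(fields)        # ['red', 'green', 'blue']
print(first_clause)  # Wait
```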
text = "Hello! My Name is Senthil. How are you doing?"
pattern = r"([^.?!]+[.?!])"
sentences = re.findall(pattern, text)
for sentence in sentences:
    print(f"Sentence: {sentence.strip()}")

Sentence: Hello!
Sentence: My Name is Senthil.
Sentence: How are you doing?

Case 5: Extraction of different URL Formats
- Multiple Concepts: the OR operator |; ? for 0 or 1 match; [^/\s]+ means one or more characters other than / or whitespace
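Each of these pieces can be tried in isolation. The snippet below is a small sketch with made-up inputs; it also illustrates that an unescaped `.` matches any character, whereas `\.` matches only a literal dot.

```python
import re

# Hypothetical inputs, used only for illustration.
optional = re.findall(r"https?", "http and https")   # "s" is optional
either = re.findall(r"cat|dog", "cat, dog, bird")    # alternation
host = re.match(r"[^/\s]+", "example.com/path page").group(0)

dot_any = bool(re.fullmatch(r"www.", "wwwX"))    # "." matches any character
dot_lit = bool(re.fullmatch(r"www\.", "wwwX"))   # "\." needs a real dot

print(optional)          # ['http', 'https']
print(either)            # ['cat', 'dog']
print(host)              # example.com
print(dot_any, dot_lit)  # True False
```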
text = "Visit my website at https://www.example.com and check out www.blog.example.com or http://blogspot.com"
pattern = r"https?://[^/\s]+|www\.[^/\s]+"  # the dot after www is escaped so it matches only a literal "."
matches = re.findall(pattern, text)
for match in matches:
    print(f"URL: {match}")

URL: https://www.example.com
URL: www.blog.example.com
URL: http://blogspot.com

Where Regex Fails!
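One kind of failure meant here can be sketched with nested tags of the same name. The HTML string below is a made-up example, not the one used in the rest of this section: neither the lazy nor the greedy pattern can pair each opening tag with its own closing tag.

```python
import re

# Hypothetical nested markup, used only for illustration.
nested = "<div>outer<div>inner</div>tail</div>"

# Lazy: stops at the first </div>, which belongs to the inner div.
lazy = re.findall(r"<div>(.*?)</div>", nested)

# Greedy: runs to the last </div>, so the inner div is never seen on its own.
greedy = re.findall(r"<div>(.*)</div>", nested)

print(lazy)    # ['outer<div>inner']
print(greedy)  # ['outer<div>inner</div>tail']
```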
Best Solution - Use specific modules (avoid regex)
- In this HTML parsing case, use BeautifulSoup
from bs4 import BeautifulSoup

html = "<div><p>This is a paragraph</p><span>This is a span</span></div>"
soup = BeautifulSoup(html, 'html.parser')

def process_tags(element):
    # The root object is named "[document]"; skip it and print real tags only.
    if not element.name.startswith("["):
        print(f"Tag: {element.name}, Content: {element.get_text()}")
    for child in element.children:
        # Plain text nodes have no tag name, so only recurse into tags.
        if child.name:
            process_tags(child)

process_tags(soup)

Tag: div, Content: This is a paragraphThis is a span
Tag: p, Content: This is a paragraph
Tag: span, Content: This is a span

Insisting on a Regex Solution?
fetch_tags_pattern = r"<(\w+)>"
tag_matches = re.findall(fetch_tags_pattern, html)
for tag in tag_matches:
    # Lazily capture everything between this tag's opening and closing forms.
    tag_pattern = f"<({tag})>(.*?)</{tag}>"
    matches = re.findall(tag_pattern, html)
    for matched_tag, inner in matches:
        # Replace any nested tags with a space to keep only the text.
        content = re.sub('(<.*?>)', ' ', inner)
        print(f"Tag: {matched_tag}, Content: {content}")

Tag: div, Content: This is a paragraph This is a span
Tag: p, Content: This is a paragraph
Tag: span, Content: This is a span

Conclusion
- I am sure that, over the years you have worked with Regex, you have collected many better examples
- As with Bash scripting, the key with Regex is to know when to stop trying it and use some other solution
- Thank you for reading this blog piece. Happy Regexing!