Analyzing text data is one of the most common tasks for anyone who works in natural language processing, machine learning, and related areas. We need to find patterns, search for specific strings, replace one character with another, and perform many similar tasks. This article discusses how you can use regular expressions for text analysis in Python. It covers regex functions, patterns and character classes, and quantifiers.
Regex Functions For Text Analysis in Python
Python provides us with the re module for implementing and using regular expressions. The re module contains various functions that help you in text analysis in Python. Let us discuss the functions one by one.
The match() Function
The re.match() function is used to check if a string starts with a certain pattern or not. The re.match() function takes a pattern as its first input argument and an input string as its second input argument. After execution, it returns None if the input string doesn’t start with the given pattern.
If the input string starts with the given pattern, the match() function returns a match object that contains the span of the pattern in the string. You can observe this in the following example.
import re
text="ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA"
pattern="ACAA"
match_obj=re.match(pattern,text)
print("The text is:",text)
print("The pattern is:",pattern)
print("Match object is",match_obj)
Output:
The text is: ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA
The pattern is: ACAA
Match object is <re.Match object; span=(0, 4), match='ACAA'>
Here, you can observe that the match object contains the matched string along with its position in the original string.
You can also retrieve the matched text from the match object using the group() method. When invoked on a match object without arguments, the group() method returns the substring that matched the pattern, as shown in the following example.
import re
text="ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA"
pattern="ACAA"
match_obj=re.match(pattern,text)
print("The text is:",text)
print("The pattern is:",pattern)
print("Text in match object is",match_obj.group())
print("Span of text is:",match_obj.span())
Output:
The text is: ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA
The pattern is: ACAA
Text in match object is ACAA
Span of text is: (0, 4)
In this example, we have also printed the position of the matched substring using the span() method. The span() method, when invoked on a match object, returns a tuple containing the start and end index of the matched substring in the original string. The end index is exclusive, so a span of (0, 4) covers the characters at indices 0 to 3.
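Because match() returns None when the string doesn't start with the pattern, calling group() or span() directly on the result raises an AttributeError in that case. The following is a minimal sketch of guarding against this. It uses a shortened version of the text above, and the helper name describe_match is only illustrative.

```python
import re

def describe_match(pattern, text):
    """Return (matched_text, span) if text starts with pattern, else None."""
    match_obj = re.match(pattern, text)
    if match_obj is None:
        # No match at the start of the string, so there is nothing to unpack.
        return None
    return match_obj.group(), match_obj.span()

text = "ACAAAABCBCB"
print("ACAA:", describe_match("ACAA", text))
print("BCB:", describe_match("BCB", text))
```

Checking for None before using the match object keeps the program from failing on inputs that don't start with the pattern.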
The search() Function
The search() function is used to check if a pattern exists anywhere in an input string or not. The search() function takes a pattern as its first input argument and a string as its second input argument. After execution, it returns None if the given pattern isn’t present anywhere in the string.
If the pattern is present in the string, the search() function returns a match object containing the position of the first occurrence of the pattern in the string. You can observe this in the following example.
import re
text="ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA"
pattern="AA"
print("The text is:",text)
print("The pattern is:",pattern)
output=re.search(pattern,text)
print("The output is:",output)
Output:
The text is: ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA
The pattern is: AA
The output is: <re.Match object; span=(2, 4), match='AA'>
You can see that the search() function is more general than the match() function: it looks for the pattern anywhere in the string rather than only at the start. However, it only returns the first occurrence of the given pattern in the string. To find all the occurrences of a given pattern in a string, we can use the findall() function.
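To make the difference between the two functions concrete, the following sketch runs match() and search() with the same pattern on a shortened version of the text. The pattern "BCB" is chosen here only because it occurs in the middle of the string and not at the start.

```python
import re

text = "ACAAAABCBCB"
pattern = "BCB"

# match() only looks at the start of the string.
match_result = re.match(pattern, text)
# search() scans the whole string and stops at the first occurrence.
search_result = re.search(pattern, text)

print("match():", match_result)
print("search():", search_result)
```

Here, match() returns None, while search() returns a match object for the first "BCB" it finds.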
The findall() Function
The findall() function in the re module is used to find all the substrings in a string that follow a given pattern. The findall() function takes a pattern as its first input argument and a string as its second input argument. After execution, it returns an empty list if the given pattern isn’t present anywhere in the string.
If the input string contains substrings that follow the given pattern, the findall() function returns a list containing all the substrings that match the given pattern. You can observe this in the following example.
import re
text="ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA"
pattern="AA"
output=re.findall(pattern,text)
print("The text is:",text)
print("The pattern is:",pattern)
print("The output is:",output)
Output:
The text is: ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA
The pattern is: AA
The output is: ['AA', 'AA', 'AA', 'AA', 'AA']
Here, the findall() function has returned a list of all the matched substrings in the original string.
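The matches returned by findall() don't overlap: once a substring has been matched, the scan resumes after it. The following small sketch, using a made-up string of five A's, shows the effect and also how len() gives the number of matches.

```python
import re

text = "AAAAA"
output = re.findall("AA", text)
count = len(output)

# Five A's contain only two non-overlapping "AA" substrings;
# the last A is left over.
print("The output is:", output)
print("Number of matches:", count)
```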
The finditer() Function
The findall() function returns a list of all the substrings that match a given pattern. If we want to access the match objects for the substrings to find their span, we cannot do it using the findall() function. For this, you can use the finditer() function.
The finditer() function takes a regex pattern as its first input argument and a string as its second input argument. After execution, it returns an iterator that yields match objects for non-overlapping pattern matches. You can observe this in the following example.
import re
text="ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA"
pattern="AA"
output=re.finditer(pattern,text)
print("The text is:",text)
print("The pattern is:",pattern)
print("The output is:")
for match_object in output:
    print("Text:",match_object.group(), "Span:",match_object.span())
Output:
The text is: ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA
The pattern is: AA
The output is:
Text: AA Span: (2, 4)
Text: AA Span: (4, 6)
Text: AA Span: (11, 13)
Text: AA Span: (17, 19)
Text: AA Span: (42, 44)
In the above example, the finditer() function returns an iterator of match objects for the matched substrings, not a list. We have obtained the text and position of the matched substrings using the group() and the span() methods.
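Since finditer() gives back an iterator, it can be traversed only once. If you need the spans more than once, one option is to collect them into a list first. The following is a minimal sketch on a shortened version of the text.

```python
import re

text = "ACAAAABCBCBAAAB"
matches = re.finditer("AA", text)

# Collect the spans while iterating for the first time.
spans = [match_object.span() for match_object in matches]
# The iterator is now exhausted, so a second pass yields nothing.
leftover = list(matches)

print("Spans:", spans)
print("Second pass:", leftover)
```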
The split() Function
The split() function in the re module is used to split a string into substrings at a given pattern. The split() function takes a pattern as its first input argument and a string as its second input argument. After execution, it splits the given string at the positions where the pattern is found and returns a list of substrings as shown below.
import re
text="ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA"
pattern="AA"
output=re.split(pattern,text)
print("The text is:",text)
print("The pattern is:",pattern)
print("The output is:",output)
Output:
The text is: ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA
The pattern is: AA
The output is: ['AC', '', 'BCBCB', 'ABAC', 'DABCABDBACABACDBADBCACB', 'BCDDDDCABACBCDA']
As you can observe in the above example, the original string is split at the positions where two A’s are present. After splitting, the list of remaining substrings is returned by the split() function. The empty string in the output appears because two matches of the pattern occur back to back in "AAAA", which leaves nothing between them.
If the pattern occurs at the start of the string, the list returned by the split() function also contains an empty string as its first element. You can observe this in the following example.
import re
text="ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA"
pattern="AC"
output=re.split(pattern,text)
print("The text is:",text)
print("The pattern is:",pattern)
print("The output is:",output)
Output:
The text is: ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA
The pattern is: AC
The output is: ['', 'AAAABCBCBAAAB', 'AADABCABDB', 'AB', 'DBADBC', 'BAABCDDDDCAB', 'BCDA']
If the input string doesn’t contain the given pattern, the list returned by the split() function contains the input string as its only element as shown in the following example.
import re
text="ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA"
pattern="AY"
output=re.split(pattern,text)
print("The text is:",text)
print("The pattern is:",pattern)
print("The output is:",output)
Output:
The text is: ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA
The pattern is: AY
The output is: ['ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA']
As you can see, there are no occurrences of "AY" in the input string. Hence, the input string isn’t split and the output list contains only one string.
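If the empty strings produced by split() aren't useful for your analysis, you can filter them out. The re.split() function also accepts an optional maxsplit argument that limits the number of splits. The following sketch shows both on a shortened version of the text.

```python
import re

text = "ACAAAABCB"

# Plain split: the back-to-back matches in "AAAA" leave an empty string.
parts = re.split("AA", text)
# Drop the empty strings with a list comprehension.
non_empty = [part for part in parts if part]
# Split only at the first occurrence of the pattern.
first_only = re.split("AA", text, maxsplit=1)

print("All parts:", parts)
print("Non-empty parts:", non_empty)
print("With maxsplit=1:", first_only)
```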
The sub() Function
The sub() function in the re module is used to replace the substrings that match a pattern with another string in Python. The sub() function has the following syntax.
re.sub(old_pattern, new_pattern, input_string)
The re.sub() function takes three required input arguments. Here, new_pattern is the replacement string rather than a regex pattern. After execution, the function replaces every non-overlapping match of old_pattern with new_pattern in the input_string and returns the modified string as shown in the following example.
import re
text="ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA"
pattern="A"
replacement="S"
output=re.sub(pattern,replacement,text)
print("The text is:",text)
print("The pattern is:",pattern)
print("The new pattern is:",replacement)
print("The output is:",output)
Output:
The text is: ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA
The pattern is: A
The new pattern is: S
The output is: SCSSSSBCBCBSSSBSCSSDSBCSBDBSCSBSCDBSDBCSCBSSBCDDDDCSBSCBCDS
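By default, sub() replaces every match. The function also accepts an optional count argument to limit the number of replacements, and the related re.subn() function additionally reports how many replacements were made. The following is a short sketch on a made-up string.

```python
import re

text = "ACAAB"

# Replace only the first two matches of "A".
limited = re.sub("A", "S", text, count=2)
# subn() returns a tuple of (new_string, number_of_replacements).
replaced, replacements = re.subn("A", "S", text)

print("With count=2:", limited)
print("subn() result:", replaced, replacements)
```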
Till now, we have discussed some of the regex functions in Python. For text analysis using regular expressions in Python, we also need to know some of the regex tools to create the required patterns for analysis. Let us discuss them one by one.
Regex Patterns and Character Classes For Text Analysis in Python
While performing text analysis using Python, you might need to check two or more patterns at once. In such cases, you can use character classes.
Match Single Pattern Using Regex in Python
For instance, if you are given a string containing the grades of a student and you need to count the number of A’s in the string, you can do it using the findall() function and the pattern “A” as shown below.
import re
text="ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA"
pattern="A"
output=re.findall(pattern,text)
print("The text is:",text)
print("The pattern is:",pattern)
print("The output is:",output)
Output:
The text is: ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA
The pattern is: A
The output is: ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']
Here, all the occurrences of the character "A" have been returned by the findall() function.
Match Multiple Patterns Using Regex in Python
Now, suppose that you need to check how many A’s or B’s are in the string. You cannot use the pattern “AB” in the findall() function for this, because the pattern “AB” matches A immediately followed by B.
To match A or B, we will use a character class. Hence, we will use the pattern “[AB]”. Here, when we put AB in the square brackets inside the pattern string, it behaves as a set. Hence, the pattern matches a single character that is either A or B. It won’t match AB as one unit; an A followed by a B is returned as two separate matches. You can observe this in the following example.
import re
text="ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA"
pattern="[AB]"
output=re.findall(pattern,text)
print("The text is:",text)
print("The pattern is:",pattern)
print("The output is:",output)
Output:
The text is: ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA
The pattern is: [AB]
The output is: ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'A', 'A', 'A', 'A', 'B', 'A', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'A', 'B', 'A', 'B', 'A']
In the above example, the findall() function searches for either A or B and returns all the matching substrings.
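To turn the list of matches into actual counts of A’s and B’s, one option is to pass it to collections.Counter from the standard library. The following is a minimal sketch on a shortened version of the grade string.

```python
import re
from collections import Counter

text = "ACAAAABCBCB"

# Each match is a single character, either "A" or "B".
matches = re.findall("[AB]", text)
# Counter tallies how often each matched character occurs.
counts = Counter(matches)

print("Number of A's:", counts["A"])
print("Number of B's:", counts["B"])
```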
Now, you can find the pattern “A”, “B”, and “AB”. What if you have to find A followed by B or C? In other words, you have to match the pattern “AB” or “AC”. In such a case, we can use the pattern “[A][BC]”. Here, we have put A in a separate pair of square brackets as it is a mandatory character. In the following pair of square brackets, we have put BC, out of which only one character will be matched. Hence, the pattern will match AB as well as AC. You can observe this in the following example.
import re
text="ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA"
pattern="[A][BC]"
output=re.findall(pattern,text)
print("The text is:",text)
print("The pattern is:",pattern)
print("The output is:",output)
Output:
The text is: ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA
The pattern is: [A][BC]
The output is: ['AC', 'AB', 'AB', 'AC', 'AB', 'AB', 'AC', 'AB', 'AC', 'AC', 'AB', 'AB', 'AC']
You can also use the pipe operator | to create the pattern “AB|AC”. Here, the pipe operator works as an OR operator and the pattern will match both AB and AC.
import re
text="ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA"
pattern="AB|AC"
output=re.findall(pattern,text)
print("The text is:",text)
print("The pattern is:",pattern)
print("The output is:",output)
Output:
The text is: ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA
The pattern is: AB|AC
The output is: ['AC', 'AB', 'AB', 'AC', 'AB', 'AB', 'AC', 'AB', 'AC', 'AC', 'AB', 'AB', 'AC']
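A character class that contains a single character matches the same text as the character itself, so “[A][BC]” can also be written as “A[BC]”. The following sketch checks that the three ways of writing the pattern return the same matches on a shortened version of the text.

```python
import re

text = "ACAAAABCBCB"

# Three equivalent ways to match A followed by B or C.
patterns = ["[A][BC]", "A[BC]", "AB|AC"]
results = [re.findall(pattern, text) for pattern in patterns]

for pattern, result in zip(patterns, results):
    print("The pattern is:", pattern, "The output is:", result)
```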
Match All Patterns Except One Using Regex in Python
Suppose that you want to find all the grades in the input string except grade A. In such a case, we will introduce the caret character inside the square brackets before character A as in “[^A]”. This pattern will match any single character except A as shown below.
import re
text="ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA"
pattern="[^A]"
output=re.findall(pattern,text)
print("The text is:",text)
print("The pattern is:",pattern)
print("The output is:",output)
Output:
The text is: ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA
The pattern is: [^A]
The output is: ['C', 'B', 'C', 'B', 'C', 'B', 'B', 'C', 'D', 'B', 'C', 'B', 'D', 'B', 'C', 'B', 'C', 'D', 'B', 'D', 'B', 'C', 'C', 'B', 'B', 'C', 'D', 'D', 'D', 'D', 'C', 'B', 'C', 'B', 'C', 'D']
Remember that the pattern “^A” will not give the same result. When the caret character is placed outside square brackets, it acts as an anchor that matches only at the start of the string. Hence, the pattern “^A” matches an A only if it is the first character of the string.
import re
text="ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA"
pattern="^A"
output=re.findall(pattern,text)
print("The text is:",text)
print("The pattern is:",pattern)
print("The output is:",output)
Output:
The text is: ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA
The pattern is: ^A
The output is: ['A']
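When the text spans several lines, the re.MULTILINE flag makes the caret match at the start of every line instead of only at the start of the string. The following sketch uses a made-up three-line string to show the difference.

```python
import re

# Three lines: "AB", "CA", and "AD".
text = "AB\nCA\nAD"

# Without the flag, ^ matches only at the very start of the string.
start_of_string = re.findall("^A", text)
# With re.MULTILINE, ^ also matches right after each newline.
start_of_lines = re.findall("^A", text, flags=re.MULTILINE)

print("Default:", start_of_string)
print("With re.MULTILINE:", start_of_lines)
```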
Matching Patterns by the Number of Characters in the Pattern Using Regex Quantifiers
If you want to match two consecutive A’s, you can use the pattern “AA” or “[A][A]”. What if you had to match 100 consecutive A’s in a large text? In such a case, it isn’t practical to manually create a pattern of 100 characters. Regex quantifiers can help you in this case.
Regex quantifiers are used to specify how many times the preceding pattern must repeat consecutively. A quantifier is expressed as pattern{m}. Here, pattern is the pattern we are looking for and m is the number of repetitions of the pattern. For example, you can match 4 consecutive A’s using quantifiers as follows.
import re
text="ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA"
pattern="A{4}"
output=re.findall(pattern,text)
print("The text is:",text)
print("The pattern is:",pattern)
print("The output is:",output)
Output:
The text is: ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA
The pattern is: A{4}
The output is: ['AAAA']
In the above example, you need to make sure that the curly braces don’t contain any space characters. They should only contain the number that is used as the quantifier. Otherwise, the braces aren’t interpreted as a quantifier and you won’t get the desired result. You can observe this in the following example.
import re
text="ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA"
pattern="A{4 }"
output=re.findall(pattern,text)
print("The text is:",text)
print("The pattern is:",pattern)
print("The output is:",output)
Output:
The text is: ACAAAABCBCBAAABACAADABCABDBACABACDBADBCACBAABCDDDDCABACBCDA
The pattern is: A{4 }
The output is: []
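Returning to the case of 100 consecutive A’s mentioned earlier, the quantifier keeps the pattern short no matter how large the count is. The following is a small sketch in which the input text is generated only for illustration.

```python
import re

# A made-up text: one B, then 100 A's, then another B.
text = "B" + "A" * 100 + "B"

match_obj = re.search("A{100}", text)
span = match_obj.span()
matched_length = len(match_obj.group())

print("Span of the match:", span)
print("Length of the match:", matched_length)
```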
Now, suppose that you want to match any pattern that has a minimum of 2 A’s and a maximum of 6 A’s. In such a case, you can express the pattern using the syntax pattern{m,n}. Here, m is the lower limit and n is the upper limit of the consecutive presence of the pattern.
For instance, you can match 2 to 6 A’s using the pattern “A{2,6}” as shown in the following example.
import re
text="ACAAAABCBCBAAABACAADABCABDBACABACDBAAAAAAADBCACBAABCDDDDCABACBCDA"
pattern="A{2,6}"
output=re.findall(pattern,text)
print("The text is:",text)
print("The pattern is:",pattern)
print("The output is:",output)
Output:
The text is: ACAAAABCBCBAAABACAADABCABDBACABACDBAAAAAAADBCACBAABCDDDDCABACBCDA
The pattern is: A{2,6}
The output is: ['AAAA', 'AAA', 'AA', 'AAAAAA', 'AA']
Again, you need to make sure that there are no spaces inside the curly braces. Otherwise, the program will not give the desired results.
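Quantifiers are greedy, which means they match as many repetitions as the limits allow. You can also leave out the upper limit, as in “A{2,}”, to match two or more repetitions. The following sketch compares the two forms on a made-up string that contains a run of seven A’s.

```python
import re

# "AA" followed by B, then a run of seven A's.
text = "CAAB" + "A" * 7 + "D"

# With an upper limit of 6, the run of seven A's is cut after six;
# the single A left over is too short to match.
bounded = re.findall("A{2,6}", text)
# Without an upper limit, the whole run is matched at once.
unbounded = re.findall("A{2,}", text)

print("A{2,6}:", bounded)
print("A{2,}:", unbounded)
```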
You can also use repetitions of different patterns at once. For instance, if you want to match 2 to 5 A’s followed by 1 to 2 B’s, you can use the pattern “A{2,5}B{1,2}” as shown below.
import re
text="ACAAAABCBCBAAABACAADABCABDBACABACDBAAAAAAADBCACBAABCDDDDCABACBCDA"
pattern="A{2,5}B{1,2}"
output=re.findall(pattern,text)
print("The text is:",text)
print("The pattern is:",pattern)
print("The output is:",output)
Output:
The text is: ACAAAABCBCBAAABACAADABCABDBACABACDBAAAAAAADBCACBAABCDDDDCABACBCDA
The pattern is: A{2,5}B{1,2}
The output is: ['AAAAB', 'AAAB', 'AAB']
Conclusion
In this article, we have discussed some of the regex functions, character classes, and quantifiers for text analysis in Python. To learn more about text analysis in Python, you can read this article on how to remove all occurrences of a character in a string. You might also like this article on list comprehension in Python.
Stay tuned for more informative articles.
Happy Learning!