1. Overview

Regular expressions (regexes) are powerful pattern-matching tools that can represent both simple and complex text patterns. We commonly use them for searching, validating, and parsing.

Among these patterns, (a+b)* and (a*b*)* may appear similar but behave differently, potentially confusing users. Understanding these differences is essential for adequately applying them in real-world scenarios.

In this tutorial, we’ll look at the differences between the expressions (a+b)* and (a*b*)*.

2. Problem Statement

If we have an alphabet of only two symbols, a and b, which regular expression best represents all the strings we can make with these characters? We’ll consider two candidates: (a+b)* and (a*b*)*. At first glance, they appear to be similar because both involve a and b. However, their behavior differs depending on how we interpret the + operator.

Depending on the context, the + symbol can have two distinct meanings:

  • Union (logical or): In formal languages, + stands for the union of two sets. For example, a+b denotes a choice between a and b, which means either a or b can occur. Some authors denote the union using |, but we can also use +
  • The quantifier “one or more”: The quantifier + in various regex engines indicates one or more occurrences. For instance, one or more a‘s are matched by a+

This difference in meaning produces confusion because it results in different answers to whether (a+b)* and (a*b*)* are the same.

3. Equivalence of (a+b)* and (a*b*)*

In this section, the + operator denotes a union (logical “or”), following the common notation of the formal language theory. Therefore, a+b matches a or b.

The expressions (a+b)* and (a*b*)* can seem different at first glance, we’ll see they are equivalent in the formal language theory. They describe the strings we can form using the alphabet {a, b}, including the empty string ε.

3.1. Language of (a+b)*

The expression (a+b) specifies the union of sets {a} and {b}. It defines a language that solely uses the symbols a and b.

Applying the Kleene star * to (a+b) produces all possible sequences of these symbols, including the empty string ε. This means that (a+b)* includes strings such as a, b, ab, ba, aba, and so on, allowing for any sequence and number of repeats of valid patterns a+b.

3.2. Language of (a*b*)*

The expression (a*b*) matches strings that start with zero or more a’s and end with zero or more b‘s. For example: aaa, bbb, aaabbb, and the empty string ε.

*When we apply the Kleene star * to (a*b*), it allows for repetitive concatenation of such patterns*. As a result, (a*b*)* accepts the strings such as aaa, bbb, aabbb, abab, and others.

3.3. Subset Relationship

To prove the equivalence of (a+b)* and (a*b*)*, we must show that the languages they describe are subsets of one another.

First, we can break any string described by (a+b)* into segments, with a‘s and b‘s grouped consecutively. *Each segment fits the pattern (a*b*), demonstrating that every string in (a+b)* is also part of (a*b*)*.*

Further, *any string in (a*b*)* segment contains only a‘s and b‘s.* Therefore, all the strings recognized by (a*b*)* are also the strings of the language of (a+b)*.

Therefore, (a+b)* and (a*b*)* are equivalent in the formal language theory.

4. Interpretation of + as “One or More”

In this section, we focus on the + operator as a quantifier for one or more pattern occurrences.

4.1. (a+b)*

The regular expression (a+b)* looks for sequences with zero or more occurrences of the pattern a+b. In turn, that pattern denotes a string starting with one or more a‘s and ending with one b, including the empty strings:

import re

# Define the pattern
pattern = r"(a+b)*"

# Test some strings
test_strings = ["", "ab", "aab", "abab", "aabb", "b"]

for string in test_strings:
    match = re.fullmatch(pattern, string)
    print(f"String: '{string}' -> Match: {bool(match)}")

A screenshot of Python code that tests several strings against the regular expression (a+b) using re.fullmatch to see if they match the pattern.

Let’s interpret the results:

  • The empty string matches because the * quantifier accepts zero repetitions of (a+b)
  • The strings “ab” and “aab” follow the pattern of one or more a‘s followed by one b
  • Similarly, the string “abab” also matches the pattern since it contains valid repetitions of the group (a+b)
  • However, “aabb” doesn’t match since each group (a+b) requires one b to follow the a‘s, and the additional b violates this requirement
  • Similarly, “b” doesn’t fit since the + quantifier requires at least one a before b

4.2. (a*b*)*

The structure of (a*b*)* is less rigid than (a+b)*. Each group (a*b*) can have zero or more a‘s and zero or more b‘s. The only rule is that a‘s must come before b‘s. The outer * quantifier permits the group to repeat an unlimited number of times, even zero. This means that the pattern can accommodate a wide range of sequences:

import re

# Define the pattern
pattern = r"(a*b*)*"

# Test some strings
test_strings = ["", "a", "b", "ab", "aa", "bb", "abab", "aaabbb"]

for string in test_strings:
    match = re.fullmatch(pattern, string)
    print(f"String: '{string}' -> Match: {bool(match)}")

A screenshot of Python code that tests several strings against the regular expression (ab) using re.fullmatch to see if they match the pattern.

  • In this example, the empty string is accepted because the outer * allows zero repetitions of (a*b*)
  • The string “a” is valid because a is covered by a*, and b* accepts an empty string of b‘s
  • Similarly, “b” is valid because a* allows zero a‘s, and b* matches one b
  • The string “ab” is also valid because a* handles one a and b* handles one b
  • Strings like “aa” and “bb” are accepted because a* and b* can handle multiple a‘s and b‘s, respectively
  • The string “abab” is valid because it has two correct groups, one for “a” and the other for “b”
  • Finally, “aaabbb” works since a* handles “aaa” and b* handles “bbb”

5. Which Regex Should We Use?

From the formal-language point of view, (a+b)* and (a*b*)* are equal as both describe the language of strings containing a and b. However, other factors, such as readability, may influence the choice.

Furthermore, we must determine whether the empty string is considered part of the language. If we want to include the empty string, we keep the Kleene star (*). If not, we replace the outermost * with a postfix +, which represents “one or more.” To avoid confusion, we should use another symbol for the union (e.g., |) in this case. So, the expression will be (a|b)+.

6. Conclusion

In this article, we discussed the regular expressions (a+b)* and (a*b*)*.

They’re equivalent in the formal language theory if we interpret + as the union symbol. However, when + is used as a quantifier for “one or more,” these two expressions are different:  (a+b)* matches the strings whose consecutive substrings contain at least one a and exactly one b in a strict pairing, whereas (a*b*)* allows for any number of a’s and b’s.


原始标题:Difference Between Regular Expressions (a+b)* and (a*b*)*