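Advanced Regular Expression Usage Explained (with Python Examples)

Regular expressions (regex for short) are a powerful text-processing tool, and Python supports them through the re module. This article covers advanced regex usage, including greedy versus lazy matching, the difference between () and [], named capture groups, backreferences, and zero-width assertions, each illustrated with a Python example.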

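Metacharacter	Meaning
`.`	Matches any single character except a newline (unless `re.DOTALL` is set)
`^`	Matches the start of the string
`$`	Matches the end of the string (or just before a trailing newline)
`*`	Matches the preceding element 0 or more times (greedy)
`+`	Matches the preceding element 1 or more times (greedy)
`?`	Matches the preceding element 0 or 1 time (greedy); placed after another quantifier, as in `*?`, it makes that quantifier lazy
`{n}`	Matches the preceding element exactly n times
`{n,}`	Matches the preceding element at least n times
`{n,m}`	Matches the preceding element between n and m times
`\d`	Matches any digit (equivalent to `[0-9]` under `re.ASCII`; for `str` patterns it also matches other Unicode decimal digits)
`\D`	Matches any non-digit character (equivalent to `[^0-9]` under `re.ASCII`)
`\w`	Matches letters, digits, and the underscore (equivalent to `[a-zA-Z0-9_]` under `re.ASCII`; for `str` patterns it also matches Unicode word characters)
`\W`	Matches any character that is not a letter, digit, or underscore (equivalent to `[^a-zA-Z0-9_]` under `re.ASCII`)
`\s`	Matches any whitespace character (space, tab, newline, and so on)
`\S`	Matches any non-whitespace character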
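Two rows of the table are easy to misread, so here is a small sketch that checks them. It assumes an illustrative input containing Arabic-Indic digits and the word "colour"; the variable names are made up for this example.

```python
import re

# "\u0661\u0662\u0663" are the Arabic-Indic digits for 1, 2, 3 (illustrative input).
sample = "\u0661\u0662\u0663 456"

# For str patterns, \d matches any Unicode decimal digit by default.
unicode_digits = re.findall(r"\d+", sample)
# With re.ASCII, \d is restricted to [0-9].
ascii_digits = re.findall(r"\d+", sample, flags=re.ASCII)

# A bare ? is a greedy quantifier: it takes the optional "u" when it can.
greedy_optional = re.match(r"colou?", "colour").group()
# ?? is the lazy form: it prefers to match zero times.
lazy_optional = re.match(r"colou??", "colour").group()

print(len(unicode_digits), ascii_digits)  # 2 ['456']
print(greedy_optional, lazy_optional)     # colou colo
```

If the pattern should only ever accept ASCII digits, passing re.ASCII (or writing [0-9] explicitly) is one way to make that intent visible.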

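1. Greedy and Lazy Matching

1.1 Greedy Matching

Greedy matching consumes as many characters as possible while still allowing the overall pattern to match.

import re

text = "<div>content</div>"
pattern = r"<.*>"
match = re.search(pattern, text)
print(match.group())  # Output: <div>content</div>

Explanation:

.* is greedy by default, so it matches the whole of "<div>content</div>" rather than stopping at "<div>".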

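1.2 Lazy Matching

Lazy matching consumes as few characters as possible. Append ? to a quantifier to make it lazy: *?, +?, ??, or {n,m}?.

import re

text = "<div>content</div>"
pattern = r"<.*?>"
match = re.search(pattern, text)
print(match.group())  # Output: <div>

Explanation:

.*? is lazy: it expands one character at a time and stops at the first > that lets the overall pattern match.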
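The difference is easier to see when every match is collected. The sketch below runs the greedy and lazy patterns with re.findall() on the same illustrative string, and adds a negated character class as an alternative that does not rely on a lazy quantifier.

```python
import re

html = "<div>content</div>"  # illustrative input

# Greedy: a single match that spans from the first < to the last >.
greedy_tags = re.findall(r"<.*>", html)
# Lazy: one match per tag.
lazy_tags = re.findall(r"<.*?>", html)
# Negated class: "anything except >" gives the same result here.
negated_tags = re.findall(r"<[^>]*>", html)

print(greedy_tags)   # ['<div>content</div>']
print(lazy_tags)     # ['<div>', '</div>']
print(negated_tags)  # ['<div>', '</div>']
```

These patterns are fine for short, well-formed snippets; for real HTML a dedicated parser is usually the safer choice.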

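2. The Difference Between `()` and `[]`

2.1 `()` (Capture Groups)

() groups part of a pattern and captures the substring it matched, which can be retrieved with group().
Groups are also the basis for backreferences and named captures.

import re

pattern = r"(\d{3})-(\d{4})"
match = re.search(pattern, "123-4567")
print(match.group(1))  # Output: 123
print(match.group(2))  # Output: 4567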

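2.2 `[]` (Character Classes)

[] matches exactly one character from the listed set or range.
It does not group and does not capture a substring.

import re

pattern = r"[abc]"
match = re.search(pattern, "apple")
print(match.group())  # Output: a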
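A quick way to see the distinction is to put a quantifier after each construct. In this sketch (inputs chosen purely for illustration), (ab)+ repeats the sequence "ab" as a unit, while [ab]+ repeats a single character that may be either a or b.

```python
import re

# (ab)+ repeats the two-character sequence "ab" as a unit.
group_ok = re.fullmatch(r"(ab)+", "abab") is not None
group_fail = re.fullmatch(r"(ab)+", "abba") is not None

# [ab]+ repeats a single character that is either "a" or "b".
class_ok = re.fullmatch(r"[ab]+", "abba") is not None

# Inside [], most metacharacters are literals: here . and + match themselves.
literal_chars = re.findall(r"[.+]", "1.5+2")

print(group_ok, group_fail, class_ok)  # True False True
print(literal_chars)                   # ['.', '+']
```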

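3. Named Capture Groups and Backreferences

3.1 Named Capture Groups

The (?P<name>...) syntax gives a capture group a name so that it can be referenced by name instead of by position.

import re

pattern = r"(?P<area_code>\d{3})-(?P<number>\d{4})"
match = re.search(pattern, "123-4567")
print(match.group("area_code"))  # Output: 123
print(match.group("number"))     # Output: 4567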
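Named groups can also be read back all at once. The following sketch reuses the same pattern on an illustrative sentence and shows groupdict(), positional access, and subscript access on the match object.

```python
import re

pattern = re.compile(r"(?P<area_code>\d{3})-(?P<number>\d{4})")
match = pattern.search("Call 123-4567 today")  # illustrative input

# All named groups as a dictionary.
fields = match.groupdict()
# Named groups are still numbered, so positional access keeps working.
by_index = match.group(1)
# Subscript access on a match object (available since Python 3.6).
by_subscript = match["number"]

print(fields)                  # {'area_code': '123', 'number': '4567'}
print(by_index, by_subscript)  # 123 4567
```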

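3.2 Backreferences

A backreference matches the same text that an earlier capture group matched.

import re

pattern = r"(\w+) \1"
match = re.search(pattern, "hello hello world")
print(match.group())  # Output: hello hello

Explanation:

\1 refers to the text captured by the first group, (\w+), so the pattern requires the same word to appear twice in a row.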
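One caveat with this pattern is that \w+ may start in the middle of a word. The sketch below, using made-up sample strings, shows that pitfall and one possible fix: word boundaries (\b) combined with a named backreference, (?P=name).

```python
import re

# Without word boundaries, the group can capture the tail of a longer word:
# here it captures the "is" inside "this".
loose_hit = re.search(r"(\w+) \1", "this is").group()

# \b anchors both ends to word boundaries; (?P=word) refers back to the named group.
strict = re.compile(r"\b(?P<word>\w+) (?P=word)\b")
no_hit = strict.search("this is")
# With a single group, findall returns the group's text for each match.
repeated = strict.findall("this is is a test test")

print(loose_hit)  # is is
print(no_hit)     # None
print(repeated)   # ['is', 'test']
```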

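4. Zero-Width Assertions (Lookahead and Lookbehind)

A zero-width assertion tests a condition at a position in the string without consuming any characters.

4.1 Positive Lookahead

Syntax: (?=...)

import re

pattern = r"[a-z]+(?=\d)"
match = re.search(pattern, "abc123")
print(match.group())  # Output: abc

Explanation:

[a-z]+ must be immediately followed by a digit, but the digit is not part of the match. With \w+ in place of [a-z]+ the result would be abc12: \w also matches digits, so the greedy \w+ first takes abc123 and then backs off one character until a digit follows.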
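Because a lookahead consumes nothing, several of them can be stacked at the same position to check independent conditions. The sketch below assumes a hypothetical rule (at least 8 characters, with at least one digit and one lowercase letter); both the rule and the is_valid helper are invented for illustration.

```python
import re

# Hypothetical rule: at least 8 characters, at least one digit, at least one lowercase letter.
# Both lookaheads are evaluated at position 0; .{8,} then consumes the whole string.
rule = re.compile(r"(?=.*\d)(?=.*[a-z]).{8,}")

def is_valid(candidate: str) -> bool:
    """Return True if the whole candidate satisfies the hypothetical rule."""
    return rule.fullmatch(candidate) is not None

print(is_valid("abc12345"))  # True
print(is_valid("abcdefgh"))  # False (no digit)
print(is_valid("12345678"))  # False (no lowercase letter)
print(is_valid("a1"))        # False (too short)
```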

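4.2 Negative Lookahead

Syntax: (?!...)

import re

pattern = r"\w+(?!\d)"
match = re.search(pattern, "abcXYZ")
print(match.group())  # Output: abcXYZ

Explanation:

The match may only end where the next character is not a digit. Here \w+ runs to the end of the string, where nothing follows, so the assertion succeeds. Note that \w also matches digits, so this pattern does not exclude words that contain digits: on "abc123" it matches all of abc123.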
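A negative lookahead is often more useful at the start of a token, where it can reject the token as a whole. This sketch assumes an illustrative list of file names and keeps only those that do not end in .tmp.

```python
import re

names = "report.txt draft.tmp notes.md cache.tmp"  # illustrative file names

# At each word boundary, (?!\w+\.tmp\b) rejects the position if the token
# starting there is "<name>.tmp"; otherwise \w+\.\w+ matches the file name.
kept = re.findall(r"\b(?!\w+\.tmp\b)\w+\.\w+", names)

print(kept)  # ['report.txt', 'notes.md']
```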

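4.3 Positive Lookbehind

Syntax: (?<=...)

import re

pattern = r"(?<=\$)\d+"
match = re.search(pattern, "$100")
print(match.group())  # Output: 100

Explanation:

Only the 100 that follows $ is matched; the $ itself is not part of the result. In Python's re module, the pattern inside a lookbehind must match strings of a fixed length.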
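The sketch below applies the same lookbehind to an illustrative sentence with several amounts, and also shows what happens when the lookbehind is not fixed-width. It reflects the behavior of the standard re module at the time of writing.

```python
import re

text = "apple $3.50, pear $12, 7 items"  # illustrative input

# Numbers (with an optional decimal part) that are directly preceded by "$".
prices = re.findall(r"(?<=\$)\d+(?:\.\d+)?", text)

# The standard re module rejects variable-width lookbehind such as \$+.
try:
    re.compile(r"(?<=\$+)\d+")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(prices)             # ['3.50', '12']
print(variable_width_ok)  # False
```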

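4.4 Negative Lookbehind

Syntax: (?<!...)

import re

pattern = r"(?<!\$)\d+"
match = re.search(pattern, "100 dollars")
print(match.group())  # Output: 100

Explanation:

The match may only start where the preceding character is not $. The assertion checks just that one position, so on "$100" the same pattern would still match 00, because the first 0 is preceded by 1 rather than by $.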
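To skip whole numbers that carry a $ prefix, the lookbehind can also forbid a preceding digit, which stops a match from starting in the middle of a number. A minimal sketch on an illustrative string:

```python
import re

text = "$100 and 200"  # illustrative input

# Naive version: "00" slips through because the first 0 is preceded by "1", not "$".
naive = re.findall(r"(?<!\$)\d+", text)

# Stricter version: the match may not be preceded by "$" or by another digit.
strict = re.findall(r"(?<![$\d])\d+", text)

print(naive)   # ['00', '200']
print(strict)  # ['200']
```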

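5. Advanced Usage of the `re` Module

5.1 `re.findall()` vs `re.finditer()`

re.findall() returns a list of all non-overlapping matches as strings (or as tuples of groups when the pattern contains more than one capture group).
re.finditer() returns an iterator that yields match objects one at a time, which avoids building the full list for large inputs and gives access to match positions.

import re

pattern = r"\d+"
text = "The price is 100 yuan, 80 yuan after the discount"

# Using findall
print(re.findall(pattern, text))  # Output: ['100', '80']

# Using finditer
for match in re.finditer(pattern, text):
    print(match.group())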
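Two practical differences are worth seeing side by side: findall() changes its return shape when the pattern has capture groups, and finditer() exposes positions through the match object. The input string here is an illustrative assumption.

```python
import re

text = "a=1 b=22"  # illustrative input

# With two capture groups, findall returns a list of tuples, one per match.
pairs = re.findall(r"(\w+)=(\d+)", text)

# finditer yields match objects, so start and end offsets are available.
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", text)]

print(pairs)  # [('a', '1'), ('b', '22')]
print(spans)  # [('1', 2, 3), ('22', 6, 8)]
```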

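5.2 Advanced Replacement with `re.sub()`

re.sub() replaces every match of a pattern. The replacement string can refer to captured groups with \1, \2, and so on.

import re

pattern = r"(\d+)-(\d+)"
text = "123-4567"
new_text = re.sub(pattern, r"(\1) \2", text)
print(new_text)  # Output: (123) 4567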
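re.sub() also accepts a function as the replacement, named group references of the form \g<name>, and a count limit. The following sketch uses made-up inputs to show each of these.

```python
import re

# A callable replacement receives the match object and returns the new text.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples and 10 pears")

# \g<name> refers to a named group in the replacement string.
swapped = re.sub(r"(?P<first>\w+) (?P<last>\w+)", r"\g<last>, \g<first>", "Ada Lovelace")

# count limits how many matches are replaced.
limited = re.sub(r"\d", "#", "a1b2c3", count=1)

print(doubled)  # 6 apples and 20 pears
print(swapped)  # Lovelace, Ada
print(limited)  # a#b2c3
```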

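6. Negated Matching

In regular expressions, a negated character class [^...] or a negated metacharacter matches characters that fall outside a given set.

6.1 Negated Character Classes `[^...]`

[^...] matches any single character that is not listed inside the brackets.
For example, [^abc] matches any character other than a, b, or c.

import re

text = "abcdef"
pattern = r"[^abc]"  # Match any character other than a, b, or c
match = re.findall(pattern, text)
print(match)  # Output: ['d', 'e', 'f']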
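A common use of a negated class is matching "everything up to a delimiter". The sketch below extracts quoted text from an illustrative string, and also shows that ^ only negates when it is the first character inside the brackets.

```python
import re

# [^"]* matches everything up to, but not including, the closing quote.
quoted = re.findall(r'"([^"]*)"', 'say "hi" and "bye"')

# When ^ is not the first character in the class, it is an ordinary literal.
caret_literal = re.findall(r"[a^b]", "a^b")

print(quoted)         # ['hi', 'bye']
print(caret_literal)  # ['a', '^', 'b']
```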

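6.2 Negated Metacharacters

Python regular expressions provide negated metacharacters that match any character outside a given class:

Negated metacharacter	Meaning	Equivalent to (under `re.ASCII`)
`\D`	Matches any non-digit	`[^0-9]`
`\W`	Matches any character that is not a letter, digit, or underscore	`[^a-zA-Z0-9_]`
`\S`	Matches any non-whitespace character	`[^ \t\n\r\f\v]`

Example code:

import re

text = "123 abc!@#"
pattern = r"\D+"  # Match runs of non-digits
match = re.findall(pattern, text)
print(match)  # Output: [' abc!@#']

pattern = r"\W+"  # Match runs of characters that are not letters, digits, or underscores
match = re.findall(pattern, text)
print(match)  # Output: [' ', '!@#']

pattern = r"\S+"  # Match runs of non-whitespace characters
match = re.findall(pattern, text)
print(match)  # Output: ['123', 'abc!@#']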

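6.3 Use Cases for Negated Matching

Filtering out specific characters (for example, removing all digits and keeping only letters and symbols).
Matching everything outside a character class (for example, runs of non-whitespace characters).
Excluding unwanted content during data cleaning or text parsing.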
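The use cases above can be sketched as a small cleaning pipeline. The input strings and the order of the steps are assumptions made for illustration, not a prescribed recipe.

```python
import re

raw = "  Order #42:   3 apples,\t2 pears!  "  # illustrative input

# Remove all digits, keeping letters and symbols.
no_digits = re.sub(r"\d", "", "a1b2c3!")

# Remove every character that is neither a word character nor whitespace.
no_punct = re.sub(r"[^\w\s]", "", raw)

# Collapse runs of whitespace into single spaces and trim both ends.
normalized = re.sub(r"\s+", " ", no_punct).strip()

print(no_digits)   # abc!
print(normalized)  # Order 42 3 apples 2 pears
```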
