Python Regular Expression Tutorial: Common Text-Processing Techniques (3)

Published: 2019-10-29 12:29 | Category: 21 | Source: 数据大视界

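If zero or more characters at the beginning of the string match the pattern, match() returns the corresponding match object. If the string does not match the given pattern, it returns None.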

import re

pattern = "C"
sequence1 = "IceCream"
# No match since "C" is not at the start of "IceCream"; match() returns None
re.match(pattern, sequence1)
sequence2 = "Cake"
re.match(pattern, sequence2).group()
'C'

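search() vs. match()

The match() function checks for a match only at the beginning of the string (by default), whereas the search() function checks for a match anywhere in the string.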
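The difference is easiest to see by running both functions on the same input. The following is a minimal sketch; the variable names at_start and anywhere are chosen here for illustration only.

```python
import re

text = "IceCream"  # sample string, reused from the match() example

# match() only tries the pattern at position 0, so it finds nothing here
at_start = re.match(r"C", text)

# search() scans forward through the string and finds the "C" at index 3
anywhere = re.search(r"C", text)

print(at_start)          # None
print(anywhere.group())  # C
print(anywhere.start())  # 3
```

Because match() can return None, it is usually safer to test the result before calling group() on it.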

findall(pattern, string, flags=0)

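Finds all non-overlapping matches of the pattern in the whole sequence and returns them as a list of strings. When the pattern contains no capturing groups, each returned string is one complete match.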

email_address = "Please contact us at: support@datacamp.com, xyz@datacamp.com"
# 'addresses' is a list that stores all the matches
addresses = re.findall(r'[\w\.-]+@[\w\.-]+', email_address)
for address in addresses:
    print(address)
support@datacamp.com
xyz@datacamp.com
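One detail worth knowing is that findall() changes its return shape when the pattern contains capturing groups. The sketch below reuses the sample sentence from the example; the names whole and parts are illustrative assumptions, not part of the original tutorial.

```python
import re

email_address = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# No groups: each match comes back as one string
whole = re.findall(r"[\w.-]+@[\w.-]+", email_address)

# Two groups: each match comes back as a (user, domain) tuple
parts = re.findall(r"([\w.-]+)@([\w.-]+)", email_address)

print(whole)  # ['support@datacamp.com', 'xyz@datacamp.com']
print(parts)  # [('support', 'datacamp.com'), ('xyz', 'datacamp.com')]
```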

sub(pattern, repl, string, count=0, flags=0)

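This is the substitute function. It returns the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in the string with the replacement repl. If the pattern is not found, the string is returned unchanged.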

email_address = "Please contact us at: xyz@datacamp.com" 
new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)', r'support@datacamp.com', email_address) 
print(new_email_address) 
Please contact us at: support@datacamp.com
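The repl and count parameters in the signature can be illustrated with a short sketch. The sample sentence, the example.org domain, and the "<hidden>" placeholder are made up for this illustration.

```python
import re

text = "Write to xyz@datacamp.com or abc@datacamp.com"  # made-up sample

# \1 in repl reuses the first captured group (the user name),
# so only the domain part of each address changes
moved = re.sub(r"([\w.-]+)@[\w.-]+", r"\1@example.org", text)

# count=1 replaces only the leftmost match and leaves the rest untouched
first_only = re.sub(r"[\w.-]+@[\w.-]+", "<hidden>", text, count=1)

print(moved)       # Write to xyz@example.org or abc@example.org
print(first_only)  # Write to <hidden> or abc@datacamp.com
```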

compile(pattern, flags=0)

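Compiles a regular expression pattern into a regular expression object. When you need to use an expression several times in a single program, saving the resulting object with compile() for reuse is more efficient. Note that the re module also caches the compiled versions of the most recent patterns passed to re.compile() and to the module-level matching functions, so the gain is small for programs that use only a few patterns.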

pattern = re.compile(r"cookie")
sequence = "Cake and cookie"
pattern.search(sequence).group()
'cookie'
# This is equivalent to passing the pattern string to the module-level function:
re.search(r"cookie", sequence).group()
'cookie'
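A typical reason to compile is reusing one pattern across many strings. The following is a small sketch under that assumption; the list sentences and the names word_pattern and hits are hypothetical.

```python
import re

# Compiled once, reused for every string below
word_pattern = re.compile(r"\bcookie\b", re.IGNORECASE)

sentences = ["Cake and cookie", "No dessert today", "Cookie jar"]  # made-up data

# Keep only the sentences in which the compiled pattern finds a match
hits = [s for s in sentences if word_pattern.search(s)]

print(hits)  # ['Cake and cookie', 'Cookie jar']
```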

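Tip: You can modify the behavior of an expression by specifying a flags value. The functions shown in this tutorial accept flags as an extra argument. Some commonly used flags are IGNORECASE, DOTALL, MULTILINE, and VERBOSE.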
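The effect of each of these flags can be sketched on a two-line sample string. The text and all variable names below are assumptions made for illustration.

```python
import re

text = "Cake and cookie\ncookie and Cake"  # made-up two-line sample

# IGNORECASE: "cake" matches "Cake" regardless of letter case
ignore_case = re.findall(r"cake", text, flags=re.IGNORECASE)

# MULTILINE: ^ matches at the start of every line, not only at the string start
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# DOTALL: "." also matches the newline, so one match can span both lines
no_dotall = re.search(r"Cake.*Cake", text)
with_dotall = re.search(r"Cake.*Cake", text, flags=re.DOTALL)

# VERBOSE: whitespace and comments inside the pattern are ignored;
# flags can be combined with the | operator
c_word = re.compile(r"""
    \b      # word boundary
    c\w+    # a word starting with the letter c
    """, re.VERBOSE | re.IGNORECASE)

print(ignore_case)           # ['Cake', 'Cake']
print(line_starts)           # ['Cake', 'cookie']
print(no_dotall)             # None
print(c_word.findall(text))  # ['Cake', 'cookie', 'cookie', 'Cake']
```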

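Case study: Working with regular expressions

Now that the examples above have shown how regular expressions work in Python, it is time to get hands-on. In this case study, you will put that knowledge to use.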

import re
import requests

the_idiot_url = 'https://www.gutenberg.org/files/2638/2638-0.txt'

def get_book(url):
    # Sends an HTTP request to get the text from Project Gutenberg
    raw = requests.get(url).text
    # Skips the metadata at the beginning of the book
    start = re.search(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .*\*\*\*", raw).end()
    # Cuts the text at the first occurrence of "II" in the download
    stop = re.search(r"II", raw).start()
    # Keeps the relevant text
    text = raw[start:stop]
    return text

def preprocess(sentence):
    # Replaces each run of characters other than letters, digits and periods
    # with a single space, then lowercases the result
    return re.sub(r'[^A-Za-z0-9.]+', ' ', sentence).lower()

book = get_book(the_idiot_url)
processed_book = preprocess(book)
print(processed_book)

Find the number of occurrences of the word "the" in the corpus. Hint: use the len() function.

len(re.findall(r'the', processed_book)) 
302
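Note that a bare pattern such as r'the' also matches inside longer words like "other" or "there". If only the standalone word is wanted, a word boundary \b on each side is one option. The sketch below uses a made-up sentence rather than the book, so its counts say nothing about the corpus.

```python
import re

sample = "the other day they took the path there"  # made-up sentence

# Counts every occurrence of the three letters, even inside longer words
substring_hits = len(re.findall(r"the", sample))

# \b anchors the match to word boundaries, so only the standalone word counts
whole_word_hits = len(re.findall(r"\bthe\b", sample))

print(substring_hits)   # 5
print(whole_word_hits)  # 2
```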

Try to convert every standalone instance of "i" in the corpus to "I". Make sure not to change an "i" that occurs inside a word:

processed_book = re.sub(r'\si\s', " I ", processed_book) 
print(processed_book)
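A pattern built on \s requires whitespace on both sides, so it skips an "i" at the very start or end of the text or one followed by a period. A word-boundary version is one possible alternative. The sketch below compares the two on a made-up sentence; the variable names are illustrative.

```python
import re

sample = "i think it is i. then i went"  # made-up sentence

# Whitespace-based: misses the "i" at the start and the one before the period
space_based = re.sub(r"\si\s", " I ", sample)

# Boundary-based: matches any standalone "i", leaves "it", "is", "think" alone
boundary_based = re.sub(r"\bi\b", "I", sample)

print(space_based)     # i think it is i. then I went
print(boundary_based)  # I think it is I. then I went
```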

Find the number of times anyone was quoted ("") in the corpus.

len(re.findall(r'\"', book)) 
96
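Counting individual quote characters gives roughly twice the number of quotations, since each quotation has an opening and a closing mark. One way to count whole quoted passages is sketched below. It assumes straight double quotes that are properly paired, which may not hold for every text, and it uses a made-up sentence.

```python
import re

sample = 'He said "hello" and she answered "goodbye".'  # made-up sentence

# Counts every double-quote character
quote_chars = len(re.findall(r'"', sample))

# Captures the text between each pair of double quotes
quoted_passages = re.findall(r'"([^"]*)"', sample)

print(quote_chars)           # 4
print(quoted_passages)       # ['hello', 'goodbye']
print(len(quoted_passages))  # 2
```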

(Editor: ASP站长网)