Q: Match multiple words in order using regex

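Stack Overflow user

Asked on 2020-01-05 06:28:51

Answers: 4 · Views: 302 · Followers: 0 · Votes: 0

I am trying to write a function that returns a string from a text according to the following conditions:

If "recurring payment authorized on" is in the string, get the first word after "on".
If "recurring payment" is in the string, get everything before it.

So far I have written the following: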

# Will be used in an apply statement for a column in a DataFrame
def parser(x):
    x_list = x.split()
    if " recurring payment authorized on " in x and x_list[-1] != "on":
        return x_list[x_list.index("on") + 1]
    elif " recurring payment" in x:
        return ' '.join(x_list[:x_list.index("recurring")])
    else:
        return None

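However, this code looks clumsy and is not robust. I would like to use regex to match these strings.

Here are some examples of what the function should return:

"recurring payment authorized on usps abc" should return "usps"
"usps recurring payment abc" should return "usps"

Any help writing the regex for this function would be appreciated. The input strings will contain only text; there are no digits or special characters.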

python

regex

pandas


4 Answers

Stack Overflow user

Accepted answer

Posted on 2020-01-05 07:25:01

Use regex with lookahead and lookbehind pattern matching

import re

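def parser(x):
    # Patterns to search
    # Lookbehind: match the text that follows " authorized on "
    pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')
    # Lookahead: match the text that precedes "recurring payment"
    pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')

    m = pattern_on.search(x)
    if m:
        # group(0) is the whole match, including the trailing whitespace
        return m.group(0)

    m = pattern_recur.search(x)
    if m:
        return m.group(0)

    return None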

tests =  ["recurring payment authorized on usps abc", "usps recurring payment abc", "recurring payment authorized on att xxx xxx", "recurring payment authorized on 25.05.1980 xxx xxx", "att recurring payment xxxxx", "12.14.14. att recurring payment xxxxx"]


for t in tests:
    found = parser(t)
    if found:
        print("In text: {}\n Found: {}".format(t, found))

Output

In text: recurring payment authorized on usps abc
 Found: usps 
In text: usps recurring payment abc
 Found: usps 
In text: recurring payment authorized on att xxx xxx
 Found: att 
In text: recurring payment authorized on 25.05.1980 xxx xxx
 Found: 25.05.1980 
In text: att recurring payment xxxxx
 Found: att 
In text: 12.14.14. att recurring payment xxxxx
 Found: 12.14.14. att

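Explanation

Lookahead and lookbehind pattern matching

A regex lookbehind (?<=foo) asserts that what immediately precedes the current position in the string is foo.

So in pattern_on: r'(?<= authorized on )(.*?)(\s+)'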

foo is " authorized on "
(.*?) - matches any character (? causes it not to be greedy)
(\s+) - matches at least one whitespace

As a result, (.*?) captures all characters after " authorized on " up to the first whitespace character.
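To make the lookbehind behavior concrete, here is a small sketch using the same pattern. It only illustrates the difference between group(0) and group(1) and the effect of the mandatory (\s+); the variable names whole, token and no_match are introduced here for illustration.

```python
import re

# Same lookbehind pattern as in the answer above.
pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')

m = pattern_on.search("recurring payment authorized on usps abc")
whole = m.group(0)  # the token plus the whitespace matched by (\s+)
token = m.group(1)  # only the text captured by (.*?)
print(repr(whole))  # 'usps '
print(repr(token))  # 'usps'

# Because (\s+) is required, a token at the very end of the string
# (with nothing after it) is not matched by this pattern.
no_match = pattern_on.search("recurring payment authorized on usps")
print(no_match)     # None
```

If the trailing whitespace is unwanted, returning group(1) instead of group(0) is one option.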

A regex lookahead (?=foo) asserts that what immediately follows the current position in the string is foo.

So in pattern_recur: r'^(.*?)\s(?=recurring payment)'

foo is 'recurring payment'
^ - means at beginning of the string
(.*?) - matches any character (non-greedy)
\s - matches white space

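As a result, (.*?) matches all characters from the start of the string until it reaches a whitespace character followed by "recurring payment".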
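The following sketch shows that the lookahead only checks for "recurring payment" without consuming it. The names text, prefix_with_space, prefix and remainder are illustrative and not part of the original answer.

```python
import re

# Same lookahead pattern as in the answer above.
pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')

text = "12.14.14. att recurring payment xxxxx"
m = pattern_recur.search(text)

prefix_with_space = m.group(0)  # includes the whitespace matched by \s
prefix = m.group(1)             # only the text captured by (.*?)
remainder = text[m.end():]      # the lookahead text is not part of the match

print(repr(prefix_with_space))  # '12.14.14. att '
print(repr(prefix))             # '12.14.14. att'
print(repr(remainder))          # 'recurring payment xxxxx'
```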

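Better performance is desirable because you are applying this to a DataFrame, which may be large.

Move the pattern compilation out of the parser and put it at module level (time reduced by 33%).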

def parser(x):
    # Use predefined patterns (pattern_on, pattern_recur) from globals
    m = pattern_on.search(x)
    if m:
        return m.group(0)

    m = pattern_recur.search(x)
    if m:
        return m.group(0)

    return None

# Define patterns to search (compiled once, at module level)
pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')
pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')

tests =  ["recurring payment authorized on usps abc", "usps recurring payment abc", "recurring payment authorized on att xxx xxx", "recurring payment authorized on 25.05.1980 xxx xxx", "att recurring payment xxxxx", "12.14.14. att recurring payment xxxxx"]
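One way to check the effect of precompiling on your own machine is a rough timeit comparison like the sketch below. The function names parser_precompiled and parser_inline, the uppercase pattern names and the sample list are assumptions made for this illustration, and the measured times will vary. Note that the re module keeps an internal cache of compiled patterns, so the inline version mostly pays for the repeated cache lookup rather than a full recompilation.

```python
import re
import timeit

# Module-level patterns, compiled once.
PATTERN_ON = re.compile(r'(?<= authorized on )(.*?)(\s+)')
PATTERN_RECUR = re.compile(r'^(.*?)\s(?=recurring payment)')

def parser_precompiled(x):
    # Try the "authorized on" pattern first, then the "recurring payment" one.
    m = PATTERN_ON.search(x) or PATTERN_RECUR.search(x)
    return m.group(0) if m else None

def parser_inline(x):
    # Calls re.compile on every invocation.
    pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')
    pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')
    m = pattern_on.search(x) or pattern_recur.search(x)
    return m.group(0) if m else None

samples = [
    "recurring payment authorized on usps abc",
    "usps recurring payment abc",
    "no match here",
]

# Both variants should return identical results.
results_precompiled = [parser_precompiled(s) for s in samples]
results_inline = [parser_inline(s) for s in samples]

t_pre = timeit.timeit(lambda: [parser_precompiled(s) for s in samples], number=2000)
t_inl = timeit.timeit(lambda: [parser_inline(s) for s in samples], number=2000)
print("precompiled: {:.4f}s, inline: {:.4f}s".format(t_pre, t_inl))
```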

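Votes: 2

Stack Overflow user

Posted on 2020-01-05 07:51:11

I'm not sure this level of complexity needs regex.

Hopefully regex is not a strict requirement for you; here is a solution that does not use it: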

examples = [
'stuff more stuff recurring payment authorized on ExampleA useless data',
'other useless data ExampleB recurring payment',
'Not matching phrase payment example authorized'
]


def extract_data(phrase):
    result = None
    if "recurring payment authorized on" in phrase:
        result = phrase.split("recurring payment authorized on")[1].split()[0]
    elif "recurring payment" in phrase:
        result = phrase.split("recurring payment")[0]
    return result


for example in examples:
    print(extract_data(example))

Output

ExampleA
other useless data ExampleB
None
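A slightly more defensive variant of the same split-based idea might look like the sketch below. The function name extract_data_stripped is hypothetical. It strips surrounding whitespace from the "everything before" result and returns None instead of raising IndexError when nothing follows "authorized on".

```python
def extract_data_stripped(phrase):
    marker_on = "recurring payment authorized on"
    marker = "recurring payment"
    if marker_on in phrase:
        # Words that follow the first occurrence of the longer marker.
        tokens = phrase.split(marker_on, 1)[1].split()
        return tokens[0] if tokens else None
    if marker in phrase:
        # Everything before the marker, without the trailing space.
        return phrase.split(marker, 1)[0].strip()
    return None


print(extract_data_stripped(
    "stuff more stuff recurring payment authorized on ExampleA useless data"))
print(extract_data_stripped("other useless data ExampleB recurring payment"))
print(extract_data_stripped("recurring payment authorized on"))
```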

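Votes: 0

Stack Overflow user

Posted on 2020-01-05 10:49:05

Not sure whether this is faster, but Python's regex supports conditionals:

If authorized on is present:
- match the next substring of non-space characters.
Otherwise:
- match everything that occurs before recurring.

Note that the result will be in capture group 2 or 3, depending on which branch matched.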

import re

def xstr(s):
    if s is None:
        return ''
    return str(s)

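def parser(x):
    # Pattern to search: if group 1 ("authorized on ") matched, capture the
    # next run of non-space characters; otherwise capture everything
    # before " recurring"
    pattern = re.compile(r"(authorized\son\s)?(?(1)(\S+)|(^.*) recurring)")

    m = pattern.search(x)
    if m:
        return xstr(m.group(2)) + xstr(m.group(3))

    return None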

tests =  ["recurring payment authorized on usps abc", "usps recurring payment abc", "recurring payment authorized on att xxx xxx", "recurring payment authorized on 25.05.1980 xxx xxx", "att recurring payment xxxxx", "12.14.14. att recurring payment xxxxx"]


for t in tests:
    found = parser(t)
    if found:
        print("In text: {}\n Found: {}".format(t, found))
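As a compact illustration of the conditional construct (?(1)yes|no), the sketch below uses the same pattern and returns whichever of groups 2 and 3 took part in the match, without the xstr helper. The function name parser_conditional is hypothetical.

```python
import re

# Same conditional pattern as above: group 1 decides which branch is used.
pattern = re.compile(r"(authorized\son\s)?(?(1)(\S+)|(^.*) recurring)")

def parser_conditional(x):
    m = pattern.search(x)
    if m is None:
        return None
    # Group 2 is set when "authorized on " matched, group 3 otherwise.
    return m.group(2) if m.group(2) is not None else m.group(3)


for text in ["recurring payment authorized on usps abc",
             "usps recurring payment abc",
             "12.14.14. att recurring payment xxxxx",
             "no match here"]:
    print(repr(text), "->", repr(parser_conditional(text)))
```

Unlike the lookaround version, this returns the captured text without trailing whitespace, because the spaces sit outside groups 2 and 3.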