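I am trying to create a function that returns a string from a piece of text based on the following conditions:
So far, I have written the following: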
# will be used in an apply statement for a column in a dataframe
def parser(x):
    x_list = x.split()
    if " recurring payment authorized on " in x and x_list[-1] != "on":
        return x_list[x_list.index("on") + 1]
    elif " recurring payment" in x:
        return ' '.join(x_list[:x_list.index("recurring")])
    else:
        return None
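However, this code looks clumsy and is not robust. I would like to use regex to match these strings.
Here are some examples of what the function should return: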
recurring payment authorized on usps abc
should return usps
usps recurring payment abc
should return usps
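Any help writing the regex for this function would be appreciated. The input strings will contain only text; there are no digits or special characters.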
Posted on 2020-01-05 07:25:01
Using regex with lookahead and lookbehind pattern matching
import re

def parser(x):
    # Patterns to search
    pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')
    pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')

    # Search the argument x rather than relying on a global variable
    m = pattern_on.search(x)
    if m:
        # group(1) is the captured text without the trailing whitespace
        return m.group(1)

    m = pattern_recur.search(x)
    if m:
        return m.group(1)

    return None

tests = ["recurring payment authorized on usps abc", "usps recurring payment abc", "recurring payment authorized on att xxx xxx", "recurring payment authorized on 25.05.1980 xxx xxx", "att recurring payment xxxxx", "12.14.14. att recurring payment xxxxx"]

for t in tests:
    found = parser(t)
    if found:
        print("In text: {}\n Found: {}".format(t, found))
Output
In text: recurring payment authorized on usps abc
Found: usps
In text: usps recurring payment abc
Found: usps
In text: recurring payment authorized on att xxx xxx
Found: att
In text: recurring payment authorized on 25.05.1980 xxx xxx
Found: 25.05.1980
In text: att recurring payment xxxxx
Found: att
In text: 12.14.14. att recurring payment xxxxx
Found: 12.14.14. att
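Explanation
Regex lookbehind (?<=foo) asserts that what immediately precedes the current position in the string is foo
So in pattern_on: r'(?<= authorized on )(.*?)(\s+)'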
foo is " authorized on "
(.*?) - matches any character (? causes it not to be greedy)
(\s+) - matches at least one whitespace
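As a result, (.*?) captures all characters after " authorized on " up to the first whitespace character.
Regex lookahead (?=foo) asserts that what immediately follows the current position in the string is foo
So in pattern_recur: r'^(.*?)\s(?=recurring payment)'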
foo is 'recurring payment'
^ - means at beginning of the string
(.*?) - matches any character (non-greedy)
\s - matches white space
As a result, (.*?) matches all characters from the beginning of the string up to the whitespace that is followed by "recurring payment".
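To make the two lookarounds concrete, here is a small sketch that runs both patterns on the question's sample strings. It also shows the difference between group(0), the whole match, and group(1), the first capturing group. The variable names are illustrative only.

```python
import re

text_on = "recurring payment authorized on usps abc"
# Lookbehind: " authorized on " must precede the match but is not part of it.
m_on = re.search(r"(?<= authorized on )(.*?)(\s+)", text_on)
whole_on = m_on.group(0)   # whole match: the word plus the trailing whitespace
word_on = m_on.group(1)    # first capturing group: the word only

text_recur = "usps recurring payment abc"
# Lookahead: "recurring payment" must follow the match but is not consumed.
m_recur = re.search(r"^(.*?)\s(?=recurring payment)", text_recur)
whole_recur = m_recur.group(0)   # includes the whitespace matched by \s
word_recur = m_recur.group(1)

print(repr(whole_on), repr(word_on))
print(repr(whole_recur), repr(word_recur))
```

Because the whole match carries a trailing space in both cases, group(1) is usually the value you want to store in the DataFrame column.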
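Better performance is desirable here because you are applying the function to a DataFrame, which may have many rows.
Move the pattern compilation out of the parser and up to module level (this reduced the run time by 33%).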
import re

# Define the patterns once, at module level
pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')
pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')

def parser(x):
    # Use the predefined patterns (pattern_on, pattern_recur) from globals
    m = pattern_on.search(x)
    if m:
        return m.group(1)

    m = pattern_recur.search(x)
    if m:
        return m.group(1)

    return None

tests = ["recurring payment authorized on usps abc", "usps recurring payment abc", "recurring payment authorized on att xxx xxx", "recurring payment authorized on 25.05.1980 xxx xxx", "att recurring payment xxxxx", "12.14.14. att recurring payment xxxxx"]
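If you want to check the speed-up on your own machine, a rough timing sketch with the standard timeit module could look like the following. The function names and the repetition count are made up for illustration, and the numbers will vary from run to run, so treat the 33% figure above as one measurement rather than a guarantee.

```python
import re
import timeit

PATTERN_SRC = r"(?<= authorized on )(.*?)(\s+)"
PATTERN_ON = re.compile(PATTERN_SRC)  # compiled once, at import time

def parse_compile_inside(text):
    # re.compile runs on every call; the re module caches compiled patterns,
    # so the extra cost is mostly a cache lookup rather than a full recompile.
    m = re.compile(PATTERN_SRC).search(text)
    return m.group(1) if m else None

def parse_precompiled(text):
    # Reuses the module-level pattern object directly.
    m = PATTERN_ON.search(text)
    return m.group(1) if m else None

sample = "recurring payment authorized on usps abc"
t_inside = timeit.timeit(lambda: parse_compile_inside(sample), number=20000)
t_pre = timeit.timeit(lambda: parse_precompiled(sample), number=20000)
print("compile inside: {:.4f}s, precompiled: {:.4f}s".format(t_inside, t_pre))
```

Either version can then be applied to a column with pandas' Series.apply, for example df["description"].apply(parse_precompiled), where the column name "description" is only a placeholder.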
Posted on 2020-01-05 07:51:11
I am not sure this level of complexity needs RegEx.
Hopefully RegEx is not a strict requirement for you; here is a solution that does not use it:
examples = [
    'stuff more stuff recurring payment authorized on ExampleA useless data',
    'other useless data ExampleB recurring payment',
    'Not matching phrase payment example authorized'
]

def extract_data(phrase):
    result = None
    if "recurring payment authorized on" in phrase:
        # First word after the marker
        result = phrase.split("recurring payment authorized on")[1].split()[0]
    elif "recurring payment" in phrase:
        # Everything before the marker
        result = phrase.split("recurring payment")[0]
    return result

for example in examples:
    print(extract_data(example))
Output
ExampleA
other useless data ExampleB
None
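Two edge cases are worth keeping in mind with the split-based approach: a phrase that ends right after "authorized on" raises an IndexError, and the text returned by the second branch keeps the space that sat in front of "recurring payment". A possible variant built on str.partition is sketched below. The name extract_data_safe is hypothetical, and returning None for the empty cases is an assumption about what you want.

```python
def extract_data_safe(phrase):
    """Hypothetical variant of extract_data built on str.partition."""
    marker_on = "recurring payment authorized on"
    marker = "recurring payment"

    # partition returns (before, separator, after); separator is '' if absent.
    _, sep, after = phrase.partition(marker_on)
    if sep:
        words = after.split()
        # Return None instead of raising IndexError when nothing follows "on".
        return words[0] if words else None

    before, sep, _ = phrase.partition(marker)
    if sep:
        # strip() drops the space left in front of the marker.
        return before.strip() or None

    return None

print(extract_data_safe("other useless data ExampleB recurring payment"))
print(extract_data_safe("recurring payment authorized on"))
```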
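Posted on 2020-01-05 10:49:05
Not sure if this is any faster, but Python regex has conditionals: if "authorized on" is present, match the run of non-whitespace characters that follows it; otherwise match everything that occurs before "recurring".
Note that the result will be in capture group 2 or 3, depending on which branch matched.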
import re

def xstr(s):
    # Turn None into an empty string so the two groups can be concatenated
    if s is None:
        return ''
    return str(s)

def parser(x):
    # Pattern to search
    pattern = re.compile(r"(authorized\son\s)?(?(1)(\S+)|(^.*) recurring)")
    # Search the argument x rather than relying on a global variable
    m = pattern.search(x)
    if m:
        return xstr(m.group(2)) + xstr(m.group(3))
    return None

tests = ["recurring payment authorized on usps abc", "usps recurring payment abc", "recurring payment authorized on att xxx xxx", "recurring payment authorized on 25.05.1980 xxx xxx", "att recurring payment xxxxx", "12.14.14. att recurring payment xxxxx"]

for t in tests:
    found = parser(t)
    if found:
        print("In text: {}\n Found: {}".format(t, found))
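To see which capture group is filled in each case, the short sketch below prints all groups for the two sample strings from the question. It uses the same conditional pattern; the list name all_groups and the use of "or" to pick the non-empty group are illustrative choices, not part of the answer above.

```python
import re

# (authorized\son\s)?   optional group 1
# (?(1)A|B)             try A if group 1 took part in the match, otherwise B
pattern = re.compile(r"(authorized\son\s)?(?(1)(\S+)|(^.*) recurring)")

samples = ["recurring payment authorized on usps abc",
           "usps recurring payment abc"]

all_groups = []
for text in samples:
    m = pattern.search(text)
    all_groups.append(m.groups())
    # Exactly one of groups 2 and 3 is set, so "or" picks the one that matched.
    print(m.groups(), "->", m.group(2) or m.group(3))
```

For the first string the match starts at "authorized on ", so group 2 holds the result; for the second string group 1 is unset and the else branch fills group 3.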
https://stackoverflow.com/questions/59600852