I am trying to apply some spaCy NLP functions to text contained in a pandas DataFrame. For simple operations, a lambda function seems to work. However, when the task requires more complex logic defined in a separate function, I am struggling to make the lambda approach work. Specifically, for tokenized text stored in a DataFrame column, what is the best way to filter out stop words? The example below tries to filter out stop words and return the remaining tokens. I plan to extend this to other spaCy attributes, but am starting with the token.is_stop attribute to work out the approach.
Minimal example:
import numpy as np
import pandas as pd
import spacy
df = pd.DataFrame({'Text': ['This is the first text. It is two sentences.',
                            'This is the second text, with one sentence.']})
# check dataframe
# df
nlp = spacy.load("en_core_web_sm")
# create new col and fill with tokenized text
df['Tokens'] = ''
doc = df['Text']
doc = doc.apply(lambda x: nlp(x))
df['Tokens'] = doc
# check dataframe
# df
# Confirming that the text is tokenized
doc = df.loc[0,'Tokens']
for token in doc:
    print(token.text, token.pos_, token.tag_, token.is_alpha, token.is_stop)

When I try to filter stop words with a lambda function, it returns the error AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'is_stop'.
Code that fails:
# Seeking to apply filter to tokens
def filter_stopwords(text):
    tokens_no_stop = [token.text for token in doc if not token.is_stop]
    return tokens_no_stop
df['No Stopwords'] = ''
doc = df['Tokens']
doc = doc.apply(lambda x: filter_stopwords(x))
df['No Stopwords'] = doc

What is the best way to handle tasks such as filtering stop words or POS tags and passing the results to a new column? I suspect I am not accessing the spaCy objects correctly, but I am not sure how. Thanks in advance.
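The traceback can be reproduced without spaCy. Inside filter_stopwords, the comprehension iterates the outer doc variable (the whole 'Tokens' Series) instead of the text argument, so each item is an entire document rather than a token. A minimal stand-in sketch (the Tok namedtuple is a hypothetical substitute for spaCy's Token, not part of the library):

```python
from collections import namedtuple

# Hypothetical stand-in for spacy.tokens.Token: only the fields used here.
Tok = namedtuple("Tok", ["text", "is_stop"])

# Stand-in for df['Tokens']: a sequence of documents, each a list of tokens.
docs = [[Tok("This", True), Tok("text", False)],
        [Tok("second", False), Tok("sentence", False)]]

def filter_stopwords(text):
    # Bug reproduced: iterates the outer `docs`, not the `text` argument,
    # so each `token` is a whole document, which has no .is_stop attribute.
    return [token.text for token in docs if not token.is_stop]

# filter_stopwords(docs[0]) raises AttributeError, mirroring the
# "'Doc' object has no attribute 'is_stop'" error above.
```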
Posted on 2020-06-12 17:14:53
Filter out the stop words and load the result back into the dataframe.
# Define a function, create a column, and apply the function to it
def remove_stops(tokens):
    return [token.text for token in tokens if not token.is_stop]

df['No Stop'] = df['Tokens'].apply(remove_stops)

Result:
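In this corrected version, remove_stops receives each row's Doc and iterates its tokens, so is_stop is looked up on Token objects. The logic can be sanity-checked without loading a model by substituting a namedtuple for spaCy's Token (a sketch; Tok and its field values are hypothetical):

```python
from collections import namedtuple

# Hypothetical stand-in for spacy.tokens.Token: only the fields used here.
Tok = namedtuple("Tok", ["text", "is_stop"])

def remove_stops(tokens):
    # Same logic as above: keep the text of every non-stop-word token.
    return [token.text for token in tokens if not token.is_stop]

doc = [Tok("This", True), Tok("is", True), Tok("the", True),
       Tok("first", True), Tok("text", False), Tok(".", False)]

remove_stops(doc)  # → ['text', '.']
```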

Generalizable
This approach generalizes to other filtering functions and works for longer documents.
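The three near-identical functions that follow could also be produced by a single factory; a hypothetical refactor (not part of the answer), again using a namedtuple stand-in for spaCy tokens so it runs without a model:

```python
from collections import namedtuple

# Hypothetical stand-in for spacy.tokens.Token: only the fields used here.
Tok = namedtuple("Tok", ["text", "is_alpha", "is_stop", "pos_"])

def make_pos_filter(pos):
    """Build a filter that keeps alphabetic, non-stop tokens of one POS tag."""
    def pos_filter(tokens):
        return [token.text for token in tokens
                if token.is_alpha and not token.is_stop and token.pos_ == pos]
    return pos_filter

doc = [Tok("The", True, True, "DET"),
       Tok("quick", True, False, "ADJ"),
       Tok("cat", True, False, "NOUN"),
       Tok("sleeps", True, False, "VERB")]

make_pos_filter("NOUN")(doc)  # → ['cat']
make_pos_filter("VERB")(doc)  # → ['sleeps']
```

With real spaCy Docs in the 'Tokens' column, df['Filtered Nouns'] = df['Tokens'].apply(make_pos_filter('NOUN')) would then replace the separate per-POS definitions.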
def nouns_filtered(tokens):
    return [token.text for token in tokens if token.is_alpha and not token.is_stop and token.pos_ == 'NOUN']

df['Filtered Nouns'] = df['Tokens'].apply(nouns_filtered)

def verbs_filtered(tokens):
    return [token.text for token in tokens if token.is_alpha and not token.is_stop and token.pos_ == 'VERB']

df['Filtered Verbs'] = df['Tokens'].apply(verbs_filtered)

def adjectives_filtered(tokens):
    return [token.text for token in tokens if token.is_alpha and not token.is_stop and token.pos_ == 'ADJ']

df['Filtered Adjectives'] = df['Tokens'].apply(adjectives_filtered)

Posted on 2020-06-12 00:07:56
Does this fix it for you?
def filter_stopwords(text):
    # this will:
    # - apply the spaCy pipeline to the raw text
    # - build one list of tokens, excluding stop words
    tokens_no_stop = [token.text for token in nlp(text) if not token.is_stop]
    return tokens_no_stop

df['No Stopwords'] = df['Text'].apply(filter_stopwords)

https://stackoverflow.com/questions/62266678