I'm trying to get per-sentence word counts using nltk in Python.
This is the code I wrote:
import nltk
data = "Sample sentence, for checking. Here is an exclamation mark! Here is a question? This isn't an easy-task."
for i in nltk.sent_tokenize(data):
    print(nltk.word_tokenize(i))
This is the output:
['Sample', 'sentence', ',', 'for', 'checking', '.']
['Here', 'is', 'an', 'exclamation', 'mark', '!']
['Here', 'is', 'a', 'question', '?']
['This', 'is', "n't", 'an', 'easy-task', '.']
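Is there a way to remove the punctuation, keep isn't from being split into two tokens, and split easy-task into two words?
The result I need looks like this: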
['Sample', 'sentence', 'for', 'checking']
['Here', 'is', 'an', 'exclamation', 'mark']
['Here', 'is', 'a', 'question']
['This', "isn't", 'an', 'easy', 'task']
I can handle the punctuation with a list of stopwords, like this:
import nltk
data = "Sample sentence, for checking. Here is an exclamation mark! Here is a question? This isn't an easy-task."
stopwords = [',', '.', '?', '!']
for i in nltk.sent_tokenize(data):
    for j in nltk.word_tokenize(i):
        if j not in stopwords:
            print(j, ', ', end="")
    print('\n')
Output:
Sample , sentence , for , checking ,
Here , is , an , exclamation , mark ,
Here , is , a , question ,
This , is , n't , an , easy-task ,
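But this doesn't fix isn't and easy-task. Is there a way to do this? Thanks.
Posted on 2022-03-09 21:03:52
You can use a different tokenizer to meet your needs. TweetTokenizer keeps contractions such as isn't as a single token, and filtering against string.punctuation drops the punctuation tokens.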
import nltk
import string
tokenizer = nltk.TweetTokenizer()
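# data is the string defined in the question
for i in nltk.sent_tokenize(data):
    print(i)
    # Drop tokens that are single punctuation characters
    print([x for x in tokenizer.tokenize(i) if x not in string.punctuation])
# Output (token lists only; the print(i) lines are omitted)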
['Sample', 'sentence', 'for', 'checking']
['Here', 'is', 'an', 'exclamation', 'mark']
['Here', 'is', 'a', 'question']
['This', "isn't", 'an', 'easy-task']
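Note that the output above still keeps easy-task as one token, whereas the question asked for it to be split. One possible way to get both behaviours is a regular expression that treats an apostrophe as part of a word but not a hyphen. The sketch below is not the answerer's method: it uses only the standard re module, the names WORD_RE, SENTENCE_END_RE and tokenize_sentences are hypothetical, and the sentence splitter is a naive stand-in for nltk.sent_tokenize that assumes sentences end in ., ! or ? followed by whitespace.

```python
import re

data = ("Sample sentence, for checking. Here is an exclamation mark! "
        "Here is a question? This isn't an easy-task.")

# A word is a run of ASCII letters, optionally followed by an apostrophe and
# more letters, so "isn't" stays whole while "easy-task" splits at the hyphen.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

# Naive sentence boundary (an assumption): whitespace that follows ., ! or ?
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")


def tokenize_sentences(text):
    """Return one list of word tokens per sentence, punctuation removed."""
    return [WORD_RE.findall(s) for s in SENTENCE_END_RE.split(text) if s]


for words in tokenize_sentences(data):
    # Print the word count next to the tokens for each sentence
    print(len(words), words)
```

The pattern only matches ASCII letters, so digits and accented characters would be dropped; it would need widening for other text. The same pattern should also work with nltk.RegexpTokenizer if you prefer to stay within nltk, though that is untested here.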
https://stackoverflow.com/questions/71401293