Text Processing Tool - TextBlob

Author: 种花家的奋斗兔
Published 2020-11-12 15:57:16

Introduction to TextBlob

TextBlob is an open-source text processing library written in Python. It can perform many natural language processing tasks, such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and text translation. All of TextBlob's features are described in the official documentation.

Core features

  • Noun phrase extraction
  • Part-of-speech tagging
  • Sentiment analysis
  • Classification (Naive Bayes, Decision Tree)
  • Language translation and detection powered by Google Translate
  • Tokenization (splitting text into words and sentences)
  • Word and phrase frequencies
  • Parsing
  • n-grams
  • Word inflection (pluralization and singularization) and lemmatization
  • Spelling correction
  • Add new models or languages through extensions
  • WordNet integration

Quickstart

Create a TextBlob

First, import the TextBlob class.

>>> from textblob import TextBlob

Let’s create our first TextBlob.

>>> wiki = TextBlob("Python is a high-level, general-purpose programming language.")

Part-of-speech Tagging

Part-of-speech tags can be accessed through the tags property.

>>> wiki.tags
[('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('high-level', 'JJ'), ('general-purpose', 'JJ'), ('programming', 'NN'), ('language', 'NN')]
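
Because tags is a list of ordinary (word, tag) tuples, it can be filtered with plain Python. The sketch below hard-codes the output shown above instead of calling TextBlob, and the helper name words_with_tag_prefix is hypothetical. It assumes the Penn Treebank convention that noun tags start with NN and adjective tags with JJ.

```python
# Tagged output copied from the example above (no TextBlob call is made here).
tags = [('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('high-level', 'JJ'),
        ('general-purpose', 'JJ'), ('programming', 'NN'), ('language', 'NN')]


def words_with_tag_prefix(tagged, prefix):
    """Return the words whose part-of-speech tag starts with `prefix`."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tags, 'NN')       # matches NN, NNS, NNP, NNPS
adjectives = words_with_tag_prefix(tags, 'JJ')  # matches JJ, JJR, JJS
print(nouns)       # ['Python', 'programming', 'language']
print(adjectives)  # ['high-level', 'general-purpose']
```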

Noun Phrase Extraction

Similarly, noun phrases are accessed through the noun_phrases property. Note that only noun phrases are extracted.

>>> wiki.noun_phrases
WordList(['python'])

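Sentiment Analysis

The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity).

The polarity score is a float within the range [-1.0, 1.0], where -1.0 is the most negative and 1.0 is the most positive.

The subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.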

>>> testimonial = TextBlob("Textblob is amazingly simple to use. What great fun!")
>>> testimonial.sentiment
Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)
>>> testimonial.sentiment.polarity
0.39166666666666666
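
A polarity score is often reduced to a coarse label before further use. The following is a minimal sketch of one way to do that; the function polarity_label and its 0.1 threshold are illustrative assumptions, not part of TextBlob.

```python
def polarity_label(polarity, threshold=0.1):
    """Map a polarity score in [-1.0, 1.0] to a coarse label.

    The 0.1 threshold is an arbitrary illustrative choice, not a TextBlob default.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1.0, 1.0]")
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


print(polarity_label(0.39166666666666666))  # score from the example above: positive
```

In practice the threshold would be tuned on labelled data for the domain in question.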

Tokenization

You can break TextBlobs into words or sentences.

>>> zen = TextBlob("Beautiful is better than ugly. "
...                "Explicit is better than implicit. "
...                "Simple is better than complex.")
>>> zen.words
WordList(['Beautiful', 'is', 'better', 'than', 'ugly', 'Explicit', 'is', 'better', 'than', 'implicit', 'Simple', 'is', 'better', 'than', 'complex'])
>>> zen.sentences
[Sentence("Beautiful is better than ugly."), Sentence("Explicit is better than implicit."), Sentence("Simple is better than complex.")]

Sentence objects have the same properties and methods as TextBlobs.

>>> for sentence in zen.sentences:
...     print(sentence.sentiment)

Words Inflection and Lemmatization

Each word in TextBlob.words or Sentence.words is a Word object (a subclass of unicode in Python 2, str in Python 3) with useful methods, e.g. for word inflection.

singularize() returns the singular form of a noun and pluralize() returns the plural form. Both are meant for nouns and take irregular plurals into account.
>>> sentence = TextBlob('Use 4 spaces per indentation level.')
>>> sentence.words
WordList(['Use', '4', 'spaces', 'per', 'indentation', 'level'])
>>> sentence.words[2].singularize()
'space'
>>> sentence.words[-1].pluralize()
'levels'

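Words can be lemmatized with the lemmatize() method of the Word class: nouns are reduced to their singular form and verbs to their base form. Because the part of speech is passed as an argument, nouns and verbs have to be handled in separate calls.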

>>> from textblob import Word
>>> w = Word("octopi")
>>> w.lemmatize()     # treated as a noun by default
'octopus'
>>> w = Word("went")
>>> w.lemmatize("v")  # pass "v" to lemmatize the word as a verb
'go'
'go'

WordNet Integration

You can access the synsets for a Word via the synsets property or the get_synsets method, optionally passing in a part of speech.

>>> from textblob import Word
>>> from textblob.wordnet import VERB
>>> word = Word("octopus")
>>> word.synsets
[Synset('octopus.n.01'), Synset('octopus.n.02')]
>>> Word("hack").get_synsets(pos=VERB)    # only synsets where the word is a verb; with no argument this matches the synsets property
[Synset('chop.v.05'), Synset('hack.v.02'), Synset('hack.v.03'), Synset('hack.v.04'), Synset('hack.v.05'), Synset('hack.v.06'), Synset('hack.v.07'), Synset('hack.v.08')]

You can access the definitions for each synset via the definitions property or the define() method, which can also take an optional part-of-speech argument.

>>> Word("octopus").definitions  # definitions of the word "octopus"
['tentacles of octopus prepared as food', 'bottom-living cephalopod having a soft oval body with eight long tentacles']

You can also create synsets directly.

>>> from textblob.wordnet import Synset
>>> octopus = Synset('octopus.n.02')
>>> shrimp = Synset('shrimp.n.03')
>>> octopus.path_similarity(shrimp)
0.1111111111111111
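
NLTK's path_similarity scores two synsets by the shortest path that connects them in the is-a taxonomy, computed as 1 / (path length + 1). The value 0.1111111111111111 above equals 1/9, which is consistent with a shortest path of eight links. The sketch below reproduces the formula on a tiny hand-written taxonomy; the HYPERNYM table is hypothetical illustration data, not real WordNet content, and the function names are made up for this example.

```python
from collections import deque

# Hypothetical miniature is-a taxonomy (child -> parent); not real WordNet data.
HYPERNYM = {
    "octopus": "cephalopod",
    "squid": "cephalopod",
    "cephalopod": "mollusk",
    "mollusk": "invertebrate",
    "shrimp": "crustacean",
    "crustacean": "arthropod",
    "arthropod": "invertebrate",
}


def shortest_path_length(a, b):
    """Breadth-first search over is-a links, followed in both directions."""
    neighbours = {}
    for child, parent in HYPERNYM.items():
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no connecting path


def path_similarity(a, b):
    """Score in (0, 1]: 1 / (shortest path length + 1)."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1.0 / (dist + 1)


print(path_similarity("octopus", "squid"))   # two links apart
print(path_similarity("octopus", "shrimp"))  # six links apart
```

Identical concepts score 1.0, and the score shrinks as the concepts sit further apart in the hierarchy.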

For more information on the WordNet API, see the NLTK documentation on the Wordnet Interface.

WordLists

A WordList is just a Python list with additional methods. The words property returns a WordList containing the tokenized words of the text.

>>> animals = TextBlob("cat dog octopus")
>>> animals.words
WordList(['cat', 'dog', 'octopus'])
>>> animals.words.pluralize()
WordList(['cats', 'dogs', 'octopodes'])

Spelling Correction

Use the correct() method to attempt spelling correction.

>>> b = TextBlob("I havv goood speling!")
>>> print(b.correct())
I have good spelling!

Word objects have a Word.spellcheck() method that returns a list of (word, confidence) tuples with spelling suggestions.

>>> from textblob import Word
>>> w = Word('falibility')
>>> w.spellcheck()
[('fallibility', 1.0)]

Spelling correction is based on Peter Norvig’s “How to Write a Spelling Corrector”[1] as implemented in the pattern library. It is about 70% accurate [2].
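
The core idea of Norvig's corrector is to generate every string within a small edit distance of the input and pick the candidate that is most frequent in a reference corpus. The following is a simplified sketch of that idea, not TextBlob's or pattern's actual code: it only looks one edit away, and the WORD_COUNTS table is a made-up toy vocabulary standing in for real corpus frequencies.

```python
import string

# Toy word frequencies (hypothetical); a real corrector derives these from a large corpus.
WORD_COUNTS = {"i": 50, "have": 40, "good": 30, "spelling": 5, "gold": 3}


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return the most frequent known word within one edit, else the word itself."""
    lowered = word.lower()
    if lowered in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(lowered) if w in WORD_COUNTS]
    if not candidates:
        return word
    return max(candidates, key=WORD_COUNTS.get)


print([correct(w) for w in ["I", "havv", "goood", "speling"]])
# ['I', 'have', 'good', 'spelling']
```

When several known words are one edit away, as with "goud" (both "good" and "gold" qualify), the frequency table decides, which is why the quality of the reference corpus matters so much for accuracy.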

Get Word and Noun Phrase Frequencies

There are two ways to get the frequency of a word or noun phrase in a TextBlob.

The first is through the word_counts dictionary.

>>> monty = TextBlob("We are no longer the Knights who say Ni. "
...                     "We are now the Knights who say Ekki ekki ekki PTANG.")
>>> monty.word_counts['ekki']
3

If you access the frequencies this way, the search will not be case sensitive, and words that are not found will have a frequency of 0.

The second way is to use the count() method.

>>> monty.words.count('ekki')                  # word frequency
3

You can specify whether or not the search should be case-sensitive (default is False).

>>> monty.words.count('ekki', case_sensitive=True)   # case-sensitive search; the default is case-insensitive
2

Each of these methods can also be used with noun phrases.

>>> wiki.noun_phrases.count('python')   # phrase frequency
1
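
The difference between case-insensitive and case-sensitive counting can be reproduced with the standard library alone. This is a rough sketch rather than TextBlob's implementation: it assumes a crude regular-expression tokenizer and uses collections.Counter, which, like word_counts, reports 0 for words that are not present.

```python
import re
from collections import Counter

text = ("We are no longer the Knights who say Ni. "
        "We are now the Knights who say Ekki ekki ekki PTANG.")

# Crude tokenizer for illustration: runs of word characters.
tokens = re.findall(r"\w+", text)

case_insensitive = Counter(token.lower() for token in tokens)
case_sensitive = Counter(tokens)

print(case_insensitive["ekki"])  # 3
print(case_sensitive["ekki"])    # 2
print(case_insensitive["spam"])  # 0, Counter returns 0 for missing keys
```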

Translation and Language Detection

New in version 0.5.0.

TextBlobs can be translated between languages.

>>> en_blob = TextBlob(u'Simple is better than complex.')
>>> en_blob.translate(to='es')
TextBlob("Simple es mejor que complejo.")

If no source language is specified, TextBlob will attempt to detect the language. You can also specify the source language explicitly, as in the example that follows. translate() raises TranslatorError if the TextBlob cannot be translated into the requested language, or NotTranslated if the translated result is the same as the input string.

>>> chinese_blob = TextBlob(u"美丽优于丑陋")
>>> chinese_blob.translate(from_lang="zh-CN", to='en')
TextBlob("Beautiful is better than ugly")

You can also attempt to detect a TextBlob’s language using TextBlob.detect_language().

>>> b = TextBlob(u"بسيط هو أفضل من مجمع")
>>> b.detect_language()
'ar'

As a reference, language codes can be found here.

Language translation and detection is powered by the Google Translate API.

Parsing

Use the parse() method to parse the text.

>>> b = TextBlob("And now for something completely different.")
>>> print(b.parse())
And/CC/O/O now/RB/B-ADVP/O for/IN/B-PP/B-PNP something/NN/B-NP/I-PNP completely/RB/B-ADJP/O different/JJ/I-ADJP/O ././O/O

By default, TextBlob uses pattern’s parser [3].
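
The parser output is a single string in which each token carries slash-separated annotations. The sketch below splits the string shown above into tuples with plain Python. The column names word, pos, chunk and pnp reflect my reading of pattern's default WORD/TAG/CHUNK/PNP layout and should be treated as an assumption.

```python
# Parser output copied from the example above (no TextBlob call is made here).
parsed = ("And/CC/O/O now/RB/B-ADVP/O for/IN/B-PP/B-PNP something/NN/B-NP/I-PNP "
          "completely/RB/B-ADJP/O different/JJ/I-ADJP/O ././O/O")

# Split from the right so that exactly three annotations are peeled off each token.
rows = [tuple(token.rsplit("/", 3)) for token in parsed.split()]
for word, pos, chunk, pnp in rows:
    print("{:<12}{:<5}{:<8}{}".format(word, pos, chunk, pnp))
```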

TextBlobs Are Like Python Strings!

You can use Python’s substring syntax.

>>> zen[0:19]
TextBlob("Beautiful is better")

You can use common string methods.

>>> zen.upper()
TextBlob("BEAUTIFUL IS BETTER THAN UGLY. EXPLICIT IS BETTER THAN IMPLICIT. SIMPLE IS BETTER THAN COMPLEX.")
>>> zen.find("Simple")
65

You can make comparisons between TextBlobs and strings.

>>> apple_blob = TextBlob('apples')
>>> banana_blob = TextBlob('bananas')
>>> apple_blob < banana_blob
True
>>> apple_blob == 'apples'
True

You can concatenate and interpolate TextBlobs and strings.

>>> apple_blob + ' and ' + banana_blob
TextBlob("apples and bananas")
>>> "{0} and {1}".format(apple_blob, banana_blob)
'apples and bananas'

n-grams

The TextBlob.ngrams() method returns a list of groups of n successive words. Each group is a WordList holding n consecutive words from the text.

>>> blob = TextBlob("Now is better than never.")
>>> blob.ngrams(n=3)
[WordList(['Now', 'is', 'better']), WordList(['is', 'better', 'than']), WordList(['better', 'than', 'never'])]
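
An n-gram list is a sliding window of width n over the word sequence. The following is a minimal standard-library sketch of that window, assuming whitespace tokenization with the trailing period already removed; it returns plain lists rather than WordList objects.

```python
def ngrams(words, n=3):
    """Return every run of n consecutive words as a list."""
    if n <= 0:
        return []
    # A sequence of k words has k - n + 1 windows of width n (none if k < n).
    return [words[i:i + n] for i in range(len(words) - n + 1)]


words = "Now is better than never".split()
for gram in ngrams(words, n=3):
    print(gram)
```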

Get Start and End Indices of Sentences

Use sentence.start and sentence.end to get the indices where a sentence starts and ends within a TextBlob.

>>> for s in zen.sentences:
...     print(s)
...     print("---- Starts at index {}, Ends at index {}".format(s.start, s.end))
Beautiful is better than ugly.
---- Starts at index 0, Ends at index 30
Explicit is better than implicit.
---- Starts at index 31, Ends at index 64
Simple is better than complex.
---- Starts at index 65, Ends at index 95
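
The indices behave like Python slice bounds: start is inclusive and end is exclusive, so the space between sentences belongs to neither. The sketch below reproduces the same numbers with a naive regular expression; it is an illustration of the indexing only, not the sentence tokenizer TextBlob actually uses.

```python
import re

text = ("Beautiful is better than ugly. "
        "Explicit is better than implicit. "
        "Simple is better than complex.")

# Naive sentence pattern: a non-space character, then everything up to
# and including the next '.', '!' or '?'.
SENTENCE = re.compile(r"\S[^.!?]*[.!?]")

spans = [(m.group(), m.start(), m.end()) for m in SENTENCE.finditer(text)]
for sentence, start, end in spans:
    print(sentence)
    print("---- Starts at index {}, Ends at index {}".format(start, end))
    assert text[start:end] == sentence  # end is exclusive, like a slice bound
```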

Documentation

TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

from textblob import TextBlob

text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

blob = TextBlob(text)
blob.tags           # [('The', 'DT'), ('titular', 'JJ'),
                    #  ('threat', 'NN'), ('of', 'IN'), ...]

blob.noun_phrases   # WordList(['titular threat', 'blob',
                    #            'ultimate movie monster',
                    #            'amoeba-like mass', ...])

for sentence in blob.sentences:
    print(sentence.sentiment.polarity)
# 0.060
# -0.341

blob.translate(to="es")  # 'La amenaza titular de The Blob...

TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.

Originally published: 2018/02/05
