BeautifulSoup4 Chinese Documentation

用户5760343

Published 2022-05-14 04:52:46

Included in the column: sktjsktj

1. Parse HTML and display it in a readable form with BeautifulSoup(html_doc, 'html.parser') and print(soup.prettify()):

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

2. Navigating the structure:

soup.title  # the title tag: <title>The Dormouse's story</title>
soup.title.name  # the tag's name: 'title'
soup.title.string  # the text inside the title tag: The Dormouse's story
soup.title.parent.name  # the name of the parent tag: 'head'
soup.p  # the first <p> tag
soup.p['class']  # the class attribute of the first <p> tag
soup.a  # the first <a> tag
soup.find_all('a')  # all <a> tags, returned as a list
soup.find(id="link3")  # look up by attribute

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
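The lookups listed above can be exercised end to end on a small document. This is a minimal sketch: the shortened `html_doc` below is a hypothetical stand-in modelled on the "three sisters" sample, and the variable names are chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document, modelled on the "three sisters" sample.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string           # text inside <title>
title_parent = soup.title.parent.name    # name of the enclosing tag
first_p_class = soup.p["class"]          # class is multi-valued, so this is a list
hrefs = [link.get("href") for link in soup.find_all("a")]
third_link = soup.find(id="link3")       # first element whose id is "link3"

print(title_text, title_parent, first_p_class)
print(hrefs)
print(third_link.get_text())
```

Note that `soup.p["class"]` comes back as a list rather than a string, because `class` may hold several values.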

print(soup.get_text())  # the document text with all tags stripped

3. Installing the other parsers:

pip install lxml
pip install html5lib

4. The available parsers:

BeautifulSoup(markup, "html.parser")
BeautifulSoup(markup, "lxml")
BeautifulSoup(markup, "html5lib")

5. Working with a tag:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.name
tag.name = "blockquote"
tag.string
tag.string.replace_with("No longer bold")
tag['class']
tag.attrs
tag['class'] = 'verybold'
tag['id'] = 1
del tag['class']
del tag['id']

6. tag.contents returns a tag's child nodes as a list. The .children generator lets you loop over a tag's direct children:

for child in title_tag.children:
    print(child)

The .descendants attribute iterates recursively over all of a tag's descendants:

for child in head_tag.descendants:
    print(child)

7. Loop over all the text in the document, without tags:

for string in soup.strings:
    print(repr(string))

To drop the surrounding whitespace:

for string in soup.stripped_strings:
    print(repr(string))

8. .parent returns the parent node and .parents iterates over all ancestors. .next_sibling and .previous_sibling return sibling nodes. .next_element and .previous_element point to the object that was parsed immediately after or before the current one.

9. Using a regular expression with find/find_all:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
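Looking back at sections 5 to 8, the attribute edits and the navigation properties can be tried together on one fragment. This is a sketch only: the markup and the `data-n` attribute are hypothetical, picked to keep the tree small enough to reason about.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment chosen only for illustration.
soup = BeautifulSoup(
    '<div id="box"><p class="a">one <b>two</b></p><p>three</p></div>',
    "html.parser",
)

first_p = soup.p
first_p["class"] = "intro"     # overwrite an attribute
first_p["data-n"] = "1"        # add an attribute
del first_p["data-n"]          # and delete it again

children = list(first_p.children)          # direct children only
descendants = list(first_p.descendants)    # children, grandchildren, and so on
texts = list(soup.stripped_strings)        # every string, whitespace trimmed

second_p = first_p.next_sibling            # the node right after first_p
ancestors = [t.name for t in first_p.b.parents]

print(first_p.attrs)
print(len(children), len(descendants))
print(texts)
print(second_p, ancestors)
```

The last ancestor is the `BeautifulSoup` object itself, whose name is reported as `[document]`.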

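A list matches a tag if any item in the list matches:

soup.find_all(["a", "b"])

Other filters (attribute checks, keyword arguments, attrs, class_, and string):

tag.has_attr('id')
soup.find_all(href=re.compile("elsie"), id='link1')
data_soup.find_all(attrs={"data-foo": "value"})
soup.find_all("a", class_="sister")
soup.find_all(string="Elsie")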
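The filter styles above can be compared side by side. The following is a rough sketch on a hypothetical fragment (the `data-foo` attribute and the markup are assumptions for illustration); it also shows `has_attr` used inside a function filter, which is one common way to apply it.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration.
html = """
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<b data-foo="value">bold</b>
</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

by_regex = [t.name for t in soup.find_all(re.compile("^b"))]  # tag names starting with "b"
by_list = [t.name for t in soup.find_all(["a", "b"])]         # any of the listed names
by_kwargs = soup.find_all(href=re.compile("elsie"), id="link1")
by_attrs = soup.find_all(attrs={"data-foo": "value"})         # names that are not valid kwargs
by_class = soup.find_all("a", class_="sister")                # class_ avoids the reserved word
by_string = soup.find_all(string="Elsie")                     # returns strings, not tags
with_id = soup.find_all(lambda tag: tag.has_attr("id"))       # function filter

print(by_regex, by_list)
print(len(by_kwargs), len(by_attrs), len(by_class), by_string, len(with_id))
```

The `attrs` dictionary is needed for `data-foo` because a hyphenated name cannot be written as a Python keyword argument.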

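soup.find_all("a", limit=2)  # return at most 2 results
soup.html.find_all("title", recursive=False)  # only check direct children

The related search methods come in pairs, one returning every match and one returning only the first:

find_parents() and find_parent()
find_next_siblings() and find_next_sibling()
find_previous_siblings() and find_previous_sibling()
find_all_next() and find_next()
find_all_previous() and find_previous()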
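To see how `limit`, `recursive`, and the directional search methods behave, here is a small sketch. The markup and the ids `p1`, `a1`, `a2`, `a3` are hypothetical and exist only to make the results easy to check.

```python
from bs4 import BeautifulSoup

# Hypothetical document: one paragraph holding three sibling links.
html = (
    "<html><head><title>T</title></head>"
    '<body><p id="p1"><a id="a1">x</a><a id="a2">y</a><a id="a3">z</a></p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

first_two = soup.find_all("a", limit=2)                       # stop after two matches
direct_titles = soup.html.find_all("title", recursive=False)  # <title> sits under <head>, not <html>

middle = soup.find(id="a2")
parent_p = middle.find_parent("p")              # nearest enclosing <p>
later = middle.find_next_siblings("a")          # siblings after this one
earlier = middle.find_previous_sibling("a")     # closest sibling before this one
next_a = middle.find_next("a")                  # next <a> in document order
previous_as = middle.find_all_previous("a")     # every earlier <a> in document order

print([t["id"] for t in first_two], direct_titles)
print(parent_p["id"], [t["id"] for t in later], earlier["id"], next_a["id"])
```

The plural methods return lists and the singular ones return a single element or `None`, mirroring `find_all()` and `find()`.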

Finding with CSS selectors:

soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("html head title")
# [<title>The Dormouse's story</title>]

soup.select("body > a")  # > matches direct children only; deeper descendants do not match

Sibling nodes:

soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
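The difference between the descendant, child, and sibling combinators is easiest to see when the results are reduced to ids. A minimal sketch, assuming a hypothetical fragment in which the links sit inside a `<p>`; the helper `ids` is introduced here purely for readability.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: three links nested inside a paragraph.
html = (
    "<body><p>"
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '<a class="sister" id="link3">Tillie</a>'
    "</p></body>"
)
soup = BeautifulSoup(html, "html.parser")


def ids(tags):
    """Reduce a list of tags to their id attributes."""
    return [t["id"] for t in tags]


descendant = ids(soup.select("body a"))            # any depth below <body>
child = ids(soup.select("body > a"))               # direct children of <body> only
child_of_p = ids(soup.select("p > a"))             # direct children of <p>
following = ids(soup.select("#link1 ~ .sister"))   # all later siblings
adjacent = ids(soup.select("#link1 + .sister"))    # only the immediately following sibling

print(descendant, child, child_of_p, following, adjacent)
```

Here `body > a` selects nothing because the links are grandchildren of `<body>`, while `p > a` selects all three.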

Finding by class (.xx):

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Finding by ID:

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Finding by attribute:

soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Finding by attribute value:

soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Finding only the first match:

soup.select_one(".sister")
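The attribute-value operators and `select_one()` can be checked on a compact fragment. This is a sketch under assumptions: the markup is hypothetical, and `.brother` is a class that deliberately matches nothing, to show what `select_one()` returns in that case.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: three paragraphs and two links.
html = """
<div>
<p>one</p><p>two</p><p>three</p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

third_p = soup.select("p:nth-of-type(3)")                 # note the colon before nth-of-type
prefix = soup.select('a[href^="http://example.com/"]')    # href starts with ...
suffix = soup.select('a[href$="tillie"]')                 # href ends with ...
contains = soup.select('a[href*=".com/el"]')              # href contains ...

first = soup.select_one(".sister")     # first match only
missing = soup.select_one(".brother")  # no match, so None rather than an empty list

print(third_p, len(prefix), suffix[0]["id"], contains[0]["id"])
print(first["id"], missing)
```

Because `select_one()` can return `None`, code that indexes into its result should check for that first.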

10. append() adds content to the end of a tag:

soup = BeautifulSoup("<a>Foo</a>", "html.parser")
soup.a.append("Bar")

soup
# <a>FooBar</a>

soup.a.contents
# ['Foo', 'Bar']

insert() places content at a given position among a tag's children:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

tag.insert(1, "but did not endorse ")
tag
# <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>

tag.contents
# ['I linked to ', 'but did not endorse ', <i>example.com</i>]

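insert_before() and insert_after() place content immediately before or after an element:

soup = BeautifulSoup("<b>stop</b>", "html.parser")
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
soup.b
# <b><i>Don't</i>stop</b>

soup.b.i.insert_after(soup.new_string(" ever "))
soup.b
# <b><i>Don't</i> ever stop</b>

soup.b.contents
# [<i>Don't</i>, ' ever ', 'stop']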
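The four insertion methods above (`append`, `insert`, `insert_before`, `insert_after`) can be chained on one tag to see where each piece lands. A minimal sketch; the extra strings `"!"` and `"Please: "` are made up for the illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>stop</b>", "html.parser")

tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)                    # new <i> goes in front of "stop"
soup.b.i.insert_after(soup.new_string(" ever "))    # string goes right after <i>
soup.b.append("!")                                  # appended as the last child
soup.b.insert(0, "Please: ")                        # inserted at index 0 of .contents

result = str(soup.b)
child_count = len(soup.b.contents)

print(result)
print(child_count)
```

Adjacent strings are kept as separate children rather than merged, which is why `.contents` grows with every call.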

clear() removes the contents of a tag:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

tag.clear()
tag
# <a href="http://example.com/"></a>

extract() removes an element from the tree and returns it:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a

i_tag = soup.i.extract()

a_tag
# <a href="http://example.com/">I linked to </a>

i_tag
# <i>example.com</i>

print(i_tag.parent)
# None

decompose() also removes an element, and then destroys it along with its contents:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a

soup.i.decompose()

a_tag
# <a href="http://example.com/">I linked to </a>
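The practical difference between `extract()` and `decompose()` is whether the removed element is still usable afterwards. The sketch below illustrates this with the same link markup; re-attaching the extracted tag with `append()` is an addition for demonstration, not something the text above prescribes.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

# extract(): the element leaves the tree but is returned intact.
soup = BeautifulSoup(markup, "html.parser")
i_tag = soup.i.extract()
after_extract = str(soup.a)
detached_parent = i_tag.parent      # None, since the tag is no longer attached
soup.a.append(i_tag)                # the extracted tag can be put back
restored = str(soup.a)

# decompose(): the element is removed and destroyed; nothing is returned.
soup2 = BeautifulSoup(markup, "html.parser")
returned = soup2.i.decompose()
after_decompose = str(soup2.a)

print(after_extract)
print(restored)
print(returned, after_decompose)
```

Use `extract()` when the element is going to be moved or inspected later, and `decompose()` when it is simply unwanted.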

replace_with() swaps an element for another one:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a

new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)

a_tag
# <a href="http://example.com/">I linked to <b>example.net</b></a>

wrap() wraps an element in the tag you specify:

soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")
soup.p.string.wrap(soup.new_tag("b"))
# <b>I wish I was bold.</b>

soup.p.wrap(soup.new_tag("div"))
# <div><p><b>I wish I was bold.</b></p></div>

unwrap() is the opposite of wrap(): it replaces a tag with that tag's contents:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a

a_tag.i.unwrap()
a_tag
# <a href="http://example.com/">I linked to example.com</a>
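`replace_with()`, `wrap()`, and `unwrap()` each return an element, which is easy to overlook. A short sketch on a hypothetical sentence shows the three in sequence and what each call hands back.

```python
from bs4 import BeautifulSoup

# Hypothetical sentence with an <i> tag to be replaced.
soup = BeautifulSoup("<p>I wish I was <i>bold</i>.</p>", "html.parser")

new_tag = soup.new_tag("b")
new_tag.string = "bold"
old = soup.p.i.replace_with(new_tag)          # returns the element that was replaced
after_replace = str(soup)

wrapper = soup.p.wrap(soup.new_tag("div"))    # returns the new wrapper
after_wrap = str(soup)

soup.p.b.unwrap()                             # drop the <b> tag but keep its text
after_unwrap = str(soup)

print(old.name, wrapper.name)
print(after_replace)
print(after_wrap)
print(after_unwrap)
```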

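prettify() formats the output and accepts an encoding. get_text() returns the document's text and accepts a separator string:

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

soup.get_text("|")
# '\nI linked to |example.com|\n'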
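The effect of the separator is clearer next to the default call and the `strip=True` variant. A minimal sketch, assuming markup with a newline on each side of the link text (which is what produces the `\n` characters in the output shown above):

```python
from bs4 import BeautifulSoup

# The newlines inside the <a> tag are part of the text and show up in get_text().
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

plain = soup.get_text()                      # strings concatenated as they are
piped = soup.get_text("|")                   # "|" placed between adjacent strings
stripped = soup.get_text("|", strip=True)    # each string trimmed, empty ones dropped

print(repr(plain))
print(repr(piped))
print(repr(stripped))
```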

If you do not know a document's encoding, use UnicodeDammit to detect it and convert the document to Unicode:

from bs4 import UnicodeDammit
dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
print(dammit.unicode_markup)
# Sacré bleu!

dammit.original_encoding
# 'utf-8'
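When some candidate encodings are known, they can be passed to `UnicodeDammit` as a second argument and are tried in order before any guessing. The sketch below takes that route so the result does not depend on which detection library happens to be installed; the helper name `to_unicode` and the candidate list are assumptions made for the example.

```python
from bs4 import UnicodeDammit


def to_unicode(data, candidates=("utf-8", "latin-1")):
    """Decode bytes by trying the candidate encodings in order (hypothetical helper)."""
    dammit = UnicodeDammit(data, list(candidates))
    return dammit.unicode_markup, dammit.original_encoding


utf8_text, utf8_encoding = to_unicode(b"Sacr\xc3\xa9 bleu!")   # valid UTF-8 bytes
latin_text, latin_encoding = to_unicode(b"Sacr\xe9 bleu!")     # invalid as UTF-8, valid as Latin-1

print(utf8_text, utf8_encoding)
print(latin_text, latin_encoding)
```

Note that the input must be a bytes object under Python 3; a `str` is already Unicode and needs no detection.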

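11. lxml parses faster than the other parsers

Beautiful Soup can never parse a document faster than the parser it sits on top of. If computing time is critical, or if the computer's time is worth more than the programmer's time, you should use lxml directly.

That said, there is a way to make Beautiful Soup faster: use lxml as its parser. Beautiful Soup with lxml is much faster than with html5lib or Python's built-in parser.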
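Since lxml is an optional install, one possible pattern is to prefer it and fall back to the built-in parser when it is missing. This is only a sketch: `make_soup` and its `preferred` parameter are hypothetical names, and it relies on Beautiful Soup raising `FeatureNotFound` when a requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse with the first parser in `preferred` that is installed (hypothetical helper)."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed, so try the next one
    raise RuntimeError("none of the requested parsers is available")


soup, used = make_soup("<p>hi</p>")
print(used, soup.p.string)
```

Keep in mind that different parsers can build slightly different trees from the same invalid markup, so a fallback like this trades some consistency for speed.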

https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

Originally published 2022-05-13 on the author's personal site/blog.
