A Summary of Common BeautifulSoup Methods for Scraping Data
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your parser of choice to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it can save you hours or even days of work.

pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple beautifulsoup4 lxml

(The lxml package is included in the install because the "lxml" parser is used in the examples below.)

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""soup = BeautifulSoup(html_doc,"lxml")soup.title<title>The Dormouse's story</title>soup.title.name'title'soup.title.string"The Dormouse's story"soup.title.text"The Dormouse's story"soup.title.parent.name'head'soup.p<p class="title"><b>The Dormouse's story</b></p>soup.p.name'p'soup.p["class"]['title']soup.a<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>soup.find("a")<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>soup.find_all("a")[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]for link in soup.find_all("a"):
print(link.get("href"))http://example.com/elsie
http://example.com/lacie
http://example.com/tillieprint(soup.get_text())The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
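In real scraping the HTML usually comes from a live page rather than a hard-coded string. Below is a minimal sketch of that workflow; it assumes the requests library is installed and uses a placeholder URL, neither of which appears in the examples above:

import requests
from bs4 import BeautifulSoup

url = "http://example.com/"          # placeholder URL for illustration only
response = requests.get(url, timeout=10)
response.raise_for_status()          # fail loudly on HTTP errors

page = BeautifulSoup(response.text, "lxml")

# Collect each link's text and href into a list of dicts.
links = [
    {"text": a.get_text(strip=True), "href": a.get("href")}
    for a in page.find_all("a")
]
print(links)

# get_text() also takes a separator and strip=True to tidy the extracted text.
print(page.get_text(" ", strip=True))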

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "lxml")
tag = soup.b
tag<b class="boldest">Extremely bold</b>type(tag)bs4.element.Tagtag.name'b'tag.name = "blockquote"
tag<blockquote class="boldest">Extremely bold</blockquote>tag["class"]['boldest']tag.attrs{'class': ['boldest']}tag["class"] = "verybold"
tag["id"] = 1
tag<blockquote class="verybold" id="1">Extremely bold</blockquote>del tag["class"]
tag<blockquote id="1">Extremely bold</blockquote>多值属性
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "lxml")
css_soup.p['class']
# ['body', 'strikeout']

css_soup = BeautifulSoup('<p class="body"></p>', "lxml")
css_soup.p['class']
# ['body']
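As a small follow-up sketch that is not in the original examples: a multi-valued attribute can be written back as a Python list, and get_attribute_list() (Beautiful Soup 4.6+) always returns a list even for a single value:

css_soup.p['class'] = ['body', 'strikeout']   # assign a list back
print(css_soup.p)
# <p class="body strikeout"></p>

css_soup.p.get_attribute_list('class')
# ['body', 'strikeout']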
tag.string
# 'Extremely bold'

type(tag.string)
# bs4.element.NavigableString

tag.string.replace_with("No longer bold")
tag
# <blockquote id="1">No longer bold</blockquote>

soup.name
# '[document]'

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "lxml")
comment = soup.b.string
comment
# 'Hey, buddy. Want to buy a used parser?'

type(comment)
# bs4.element.Comment

But when it appears in an HTML document, the Comment object is rendered with special formatting:
print(soup.prettify())
# <html>
#  <body>
#   <b>
#    <!--Hey, buddy. Want to buy a used parser?-->
#   </b>
#  </body>
# </html>
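A related sketch that is not in the original post: because comments are a distinct NavigableString subclass, they can be located explicitly with find_all() and a filter function, which is handy when pages bury data or noise inside comments. This reuses the soup built from markup above (matches could then be removed with extract(), but that is skipped here so the CData example below still works):

from bs4 import Comment

soup.find_all(string=lambda text: isinstance(text, Comment))
# ['Hey, buddy. Want to buy a used parser?']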
from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)
print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""from bs4 import BeautifulSoupsoup = BeautifulSoup(html_doc,"html.parser")一个Tag可能包含多个字符串或其它的Tag,这些都是这个Tag的子节点.Beautiful Soup提供了许多操作和遍历子节点的属性.
soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>

This is a handy trick for getting at a tag, and it can be chained to drill down through the parse tree. The following code gets the first <b> tag inside the <body> tag:
soup.body.b
# <b>The Dormouse's story</b>

Dotted attribute access like this only returns the first tag with the given name:
soup.a<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>如果想要得到所有的标签,或是通过名字得到比一个tag更多的内容的时候,就需要用到 Searching the tree 中描述的方法,比如: find_all()
soup.find_all("a")[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]head_tag = soup.head
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

head_tag.contents[0]
# <title>The Dormouse's story</title>

head_tag.contents[0].contents
# ["The Dormouse's story"]

The BeautifulSoup object itself always has children; in other words, the <html> tag is a child of the BeautifulSoup object:
soup.contents
# ['\n', <html><head><title>The Dormouse's story</title></head>
# <body>
# <p class="title"><b>The Dormouse's story</b></p>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p>
# </body></html>]

len(soup.contents)
# 2

soup.contents[0].name

That's all for this post. If it helped you, likes, follows, and comments are very welcome.