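Python Learning Diary for iOS Developers, Part 15

Source: 企鹅号 - 猿老师

Not all data arrives conveniently as tabular data, an Excel spreadsheet, or JSON that can be loaded straight into the working environment. Sometimes the data we need is scattered across different corners of the web, and not every website provides an API (Application Programming Interface) that lets you bring the data home with little effort. That is when we need web scraping.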

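Preparation

Besides the BeautifulSoup package, we also need the lxml and requests packages. Because our development environment was installed with Anaconda, none of these packages has to be downloaded or installed separately; the usual import is enough.

The lxml package

The lxml package serves as the parser for BeautifulSoup. BeautifulSoup supports more than one parser, including html.parser (built into Python) and html5lib. Following the recommendation in the official documentation, we use lxml, the fastest of them.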
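As a small illustration of how the parser is chosen, the sketch below parses an inline HTML string (a made-up snippet, not taken from PTT) with Python's built-in html.parser, so it runs even where lxml is not installed. Replacing the second argument with "lxml" or "html5lib" selects the other parsers, provided those packages are available.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration.
html_doc = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# The second argument names the parser; "html.parser" ships with Python.
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string
paragraph_text = soup.p.string
print(title_text, paragraph_text)  # Demo Hello
```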

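The requests package

The requests package lets us send "organic, grass-fed" HTTP/1.1 requests and receive the responses (yes, this really is American humor).

A first BeautifulSoup application

Let's take a first sip of the beautiful soup.

import requests as rq
from bs4 import BeautifulSoup

url = "https://www.ptt.cc/bbs/NBA/index.html"  # PTT NBA board
response = rq.get(url)  # fetch the page with requests' get method
html_doc = response.text  # the text attribute holds the HTML document
soup = BeautifulSoup(html_doc, "lxml")  # use lxml as the parser
print(soup.prettify())  # print the HTML with tidy indentation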
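The snippet above assumes the request always succeeds. A slightly more defensive variant might look like the sketch below; the helper name fetch_html and the 10-second default timeout are assumptions made for illustration, not part of the original walkthrough.

```python
import requests as rq


def fetch_html(url, timeout=10):
    """Return the HTML text at url, raising an exception on HTTP errors."""
    response = rq.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx replies
    return response.text


# Usage (requires network access):
# html_doc = fetch_html("https://www.ptt.cc/bbs/NBA/index.html")
```

Failing loudly here is a design choice: it keeps an error page from being parsed as if it were the board index.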

Some BeautifulSoup attributes and methods

Let's quickly try out a few BeautifulSoup attributes and methods.

title attribute

title.name attribute

title.string attribute

a attribute

find_all() method

import requests as rq
from bs4 import BeautifulSoup

url = "https://www.ptt.cc/bbs/NBA/index.html"  # PTT NBA board
response = rq.get(url)  # fetch the page with requests' get method
html_doc = response.text  # the text attribute holds the HTML document
soup = BeautifulSoup(html_doc, "lxml")  # use lxml as the parser

# some attributes and methods
print(soup.title)  # the <title> tag
print("---")
print(soup.title.name)  # the name of the <title> tag
print("---")
print(soup.title.string)  # the text inside the <title> tag
print("---")
print(soup.title.parent.name)  # the name of the tag one level above <title>
print("---")
print(soup.a)  # the first <a> tag
print("---")
print(soup.find_all('a'))  # all <a> tags
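To see what each of these returns without depending on the live page, here is a minimal offline sketch. The HTML string is invented for illustration and html.parser is used so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, loosely modelled on a board index.
html_doc = """
<html><head><title>NBA board</title></head>
<body><a id="logo" href="/bbs/">Home</a><a href="/bbs/NBA/">NBA</a></body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title)               # <title>NBA board</title>
print(soup.title.name)          # title
print(soup.title.string)        # NBA board
print(soup.title.parent.name)   # head
print(soup.a["href"])           # /bbs/  (soup.a is only the first <a>)
print(len(soup.find_all("a")))  # 2
```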

bs4 elements

Beautiful Soup converts the HTML document into bs4 objects, such as tags (Tag), the text inside tags (NavigableString), and the BeautifulSoup object itself.

Tag

import requests as rq
from bs4 import BeautifulSoup

url = "https://www.ptt.cc/bbs/NBA/index.html"  # PTT NBA board
response = rq.get(url)  # fetch the page with requests' get method
html_doc = response.text  # the text attribute holds the HTML document
soup = BeautifulSoup(html_doc, "lxml")  # use lxml as the parser

print(type(soup.a))
print("---")
print(soup.a.name)  # the tag name, a
print("---")
print(soup.a['id'])  # the id of the first <a> tag
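Indexing a tag like a dictionary raises KeyError when the attribute is missing, so it can help to know the alternatives. The sketch below uses an invented two-link snippet to compare tag["id"], tag.attrs and tag.get("id").

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: only the first link carries an id attribute.
html_doc = '<a id="logo" href="/bbs/">Home</a><a href="/bbs/NBA/">NBA</a>'
soup = BeautifulSoup(html_doc, "html.parser")

first, second = soup.find_all("a")
print(first["id"])       # logo
print(first.attrs)       # {'id': 'logo', 'href': '/bbs/'}
print(second.get("id"))  # None, whereas second["id"] would raise KeyError
```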

The text inside a tag (NavigableString)

import requests as rq
from bs4 import BeautifulSoup

url = "https://www.ptt.cc/bbs/NBA/index.html"  # PTT NBA board
response = rq.get(url)  # fetch the page with requests' get method
html_doc = response.text  # the text attribute holds the HTML document
soup = BeautifulSoup(html_doc, "lxml")  # use lxml as the parser

print(type(soup.a.string))
print("---")
print(soup.a.string)
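One detail worth knowing about the string attribute: it is None when a tag has more than one child. The following sketch, built on a made-up snippet, contrasts it with get_text(), which concatenates all nested text.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical snippet: <a> has a single text child, <p> has three children.
html_doc = '<a href="/bbs/">Home</a><p>Go <b>Lakers</b>!</p>'
soup = BeautifulSoup(html_doc, "html.parser")

link_text = soup.a.string
print(isinstance(link_text, NavigableString))  # True
print(str(link_text))     # Home  (converted to a plain Python str)
print(soup.p.string)      # None: <p> has more than one child
print(soup.p.get_text())  # Go Lakers!
```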

BeautifulSoup

import requests as rq
from bs4 import BeautifulSoup

url = "https://www.ptt.cc/bbs/NBA/index.html"  # PTT NBA board
response = rq.get(url)  # fetch the page with requests' get method
html_doc = response.text  # the text attribute holds the HTML document
soup = BeautifulSoup(html_doc, "lxml")  # use lxml as the parser

print(type(soup))

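Climbing the tree

The tree structure of the DOM (Document Object Model) plays a crucial role when using BeautifulSoup, so we should practice climbing the tree as well.

Going down

Get more information from inside a tag.

contents attribute

children attribute

string attribute

import requests as rq
from bs4 import BeautifulSoup

url = "https://www.ptt.cc/bbs/NBA/index.html"  # PTT NBA board
response = rq.get(url)  # fetch the page with requests' get method
html_doc = response.text  # the text attribute holds the HTML document
soup = BeautifulSoup(html_doc, "lxml")  # use lxml as the parser

print(soup.body.a.contents)  # the children as a list
print(list(soup.body.a.children))  # the children as an iterator, turned into a list
print(soup.body.a.string)  # the text, when there is exactly one child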
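The difference between the three attributes shows up more clearly on a tag with several children. Here is a small sketch on an invented list:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two children under <ul>.
html_doc = "<ul><li>one</li><li>two</li></ul>"
soup = BeautifulSoup(html_doc, "html.parser")

ul = soup.ul
print(ul.contents)                        # [<li>one</li>, <li>two</li>]  (a list)
print([li.string for li in ul.children])  # ['one', 'two']  (children is iterated lazily)
print(ul.string)                          # None, because <ul> has two children
print(ul.li.string)                       # one
```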

Going up

Return the tag one level above.

parent attribute

import requests as rq
from bs4 import BeautifulSoup

url = "https://www.ptt.cc/bbs/NBA/index.html"  # PTT NBA board
response = rq.get(url)  # fetch the page with requests' get method
html_doc = response.text  # the text attribute holds the HTML document
soup = BeautifulSoup(html_doc, "lxml")  # use lxml as the parser

print(soup.title)
print("---")
print(soup.title.parent)
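Besides parent, every element also has a parents generator that walks all the way up to the BeautifulSoup object, whose name is "[document]". A minimal sketch on an invented document:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document.
html_doc = "<html><head><title>Demo</title></head><body></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

title = soup.title
ancestor_names = [ancestor.name for ancestor in title.parents]
print(title.parent.name)  # head
print(ancestor_names)     # ['head', 'html', '[document]']
```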

Going sideways

Return the elements on the same level.

next_sibling attribute

previous_sibling attribute

import requests as rq
from bs4 import BeautifulSoup

url = "https://www.ptt.cc/bbs/NBA/index.html"  # PTT NBA board
response = rq.get(url)  # fetch the page with requests' get method
html_doc = response.text  # the text attribute holds the HTML document
soup = BeautifulSoup(html_doc, "lxml")  # use lxml as the parser

first_a_tag = soup.body.a
next_to_first_a_tag = first_a_tag.next_sibling
print(first_a_tag)
print("---")
print(next_to_first_a_tag)
print("---")
print(next_to_first_a_tag.previous_sibling)
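A common surprise with next_sibling is that the sibling is often a whitespace text node rather than the next tag. The sketch below, on an invented snippet, shows this and uses find_next_sibling() to skip straight to the next tag.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the two links are separated by a line break.
html_doc = """<div>
<a href="/1">first</a>
<a href="/2">second</a>
</div>"""
soup = BeautifulSoup(html_doc, "html.parser")

first_a = soup.a
second_a = first_a.find_next_sibling("a")
print(repr(first_a.next_sibling))  # '\n': the line break is itself a text node
print(second_a)                    # <a href="/2">second</a>
print(second_a.find_previous_sibling("a") is first_a)  # True
```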

Searching

This is the main way we use the BeautifulSoup package to scrape websites.

find() method

find_all() method

import requests as rq
from bs4 import BeautifulSoup

url = "https://www.ptt.cc/bbs/NBA/index.html"  # PTT NBA board
response = rq.get(url)  # fetch the page with requests' get method
html_doc = response.text  # the text attribute holds the HTML document
soup = BeautifulSoup(html_doc, "lxml")  # use lxml as the parser

print(soup.find("a"))  # the first <a> tag
print("---")
print(soup.find_all("a"))  # all <a> tags
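The two methods also differ in what they return when nothing matches, which matters for error handling later on. A short sketch with invented links:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with three links and no table.
html_doc = '<a href="/1">one</a><a href="/2">two</a><a href="/3">three</a>'
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.find("a"))                    # <a href="/1">one</a>
print(soup.find("table"))                # None when nothing matches
print(soup.find_all("table"))            # [] when nothing matches
print(len(soup.find_all("a", limit=2)))  # 2: limit stops the search early
```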

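We can also filter by CSS class by passing the keyword argument class_ (with a trailing underscore, because class is a reserved word in Python).

import requests as rq
from bs4 import BeautifulSoup

url = "https://www.ptt.cc/bbs/NBA/index.html"  # PTT NBA board
response = rq.get(url)  # fetch the page with requests' get method
html_doc = response.text  # the text attribute holds the HTML document
soup = BeautifulSoup(html_doc, "lxml")  # use lxml as the parser

print(soup.find("div", class_="r-ent"))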
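For comparison, the same class filter can be written as a CSS selector with select(). The sketch below runs both on an invented fragment; the author names are placeholders, and select() relies on the soupsieve package that is normally installed together with bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two entries with class r-ent, one without.
html_doc = """
<div class="r-ent"><div class="author">alice</div></div>
<div class="r-ent hot"><div class="author">bob</div></div>
<div class="other"><div class="author">carol</div></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_keyword = soup.find_all("div", class_="r-ent")  # matches any one of a tag's classes
by_selector = soup.select("div.r-ent")             # equivalent CSS selector
authors = [entry.find("div", class_="author").string for entry in by_keyword]

print(len(by_keyword), len(by_selector))  # 2 2
print(authors)                            # ['alice', 'bob']
```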

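BeautifulSoup: a quick trial run

Having roughly followed the official documentation for the exercises above, we now follow the Tutorial of PTT crawler and use BeautifulSoup to collect the information on the front page of the PTT NBA board, including the number of recommendations, the author id, the post title, and the post date.

Everything we need sits inside the <div> elements whose CSS class is r-ent.

import requests as rq
from bs4 import BeautifulSoup

url = 'https://www.ptt.cc/bbs/NBA/index.html'
response = rq.get(url)
html_doc = response.text  # the text attribute holds the HTML document
soup = BeautifulSoup(html_doc, "lxml")  # use lxml as the parser

posts = soup.find_all("div", class_="r-ent")
print(posts)
print(type(posts))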

Note that posts is a ResultSet. We usually loop over it to pull out each element; let's practice with the author ids first.

import requests as rq
from bs4 import BeautifulSoup

url = 'https://www.ptt.cc/bbs/NBA/index.html'
response = rq.get(url)
html_doc = response.text  # the text attribute holds the HTML document
soup = BeautifulSoup(html_doc, "lxml")  # use lxml as the parser

author_ids = []  # an empty list to hold the author ids
posts = soup.find_all("div", class_="r-ent")
for post in posts:
    # append the text of the author <div> for each post
    author_ids.append(post.find("div", class_="author").string)
print(author_ids)

Next we add the number of recommendations, the post title, and the post date.

import numpy as np
import requests as rq
from bs4 import BeautifulSoup

url = 'https://www.ptt.cc/bbs/NBA/index.html'
response = rq.get(url)
html_doc = response.text  # the text attribute holds the HTML document
soup = BeautifulSoup(html_doc, "lxml")  # use lxml as the parser

author_ids = []  # an empty list to hold the author ids
recommends = []  # an empty list to hold the recommendation counts
post_titles = []  # an empty list to hold the post titles
post_dates = []  # an empty list to hold the post dates

posts = soup.find_all("div", class_="r-ent")
for post in posts:
    # find() returns None for a missing element, so .string raises AttributeError
    try:
        author_ids.append(post.find("div", class_="author").string)
    except AttributeError:
        author_ids.append(np.nan)
    try:
        post_titles.append(post.find("a").string)
    except AttributeError:
        post_titles.append(np.nan)
    try:
        post_dates.append(post.find("div", class_="date").string)
    except AttributeError:
        post_dates.append(np.nan)

# the recommendation count sits in a span inside a div, so it is handled separately
recommendations = soup.find_all("div", class_="nrec")
for recommendation in recommendations:
    try:
        recommends.append(int(recommendation.find("span").string))
    except (AttributeError, TypeError, ValueError):
        # no span, empty span, or non-numeric text
        recommends.append(np.nan)

print(author_ids)
print(recommends)
print(post_titles)
print(post_dates)
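The repeated try-except blocks could also be folded into one small helper. The sketch below is one possible shape for it: the function name text_or_nan and the sample HTML are assumptions for illustration, and it uses math.nan from the standard library in place of np.nan so that it runs without NumPy.

```python
import math

from bs4 import BeautifulSoup


def text_or_nan(parent, name, **filters):
    """Return the text of the first matching tag under parent, or NaN if absent."""
    tag = parent.find(name, **filters)
    if tag is None or tag.string is None:
        return math.nan
    return str(tag.string)


# Hypothetical fragment: the second post is deleted, so its title has no <a>.
html_doc = """
<div class="r-ent"><div class="title"><a href="/p1">Game thread</a></div><div class="author">alice</div></div>
<div class="r-ent"><div class="title">(deleted)</div><div class="author">-</div></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
posts = soup.find_all("div", class_="r-ent")

titles = [text_or_nan(post, "a") for post in posts]
authors = [text_or_nan(post, "div", class_="author") for post in posts]
print(titles)   # ['Game thread', nan]
print(authors)  # ['alice', '-']
```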

Once we have checked that the results look right, we can put these lists into a dictionary and convert it into a dataframe.

import numpy as np
import pandas as pd
import requests as rq
from bs4 import BeautifulSoup

url = 'https://www.ptt.cc/bbs/NBA/index.html'
response = rq.get(url)
html_doc = response.text  # the text attribute holds the HTML document
soup = BeautifulSoup(html_doc, "lxml")  # use lxml as the parser

author_ids = []  # an empty list to hold the author ids
recommends = []  # an empty list to hold the recommendation counts
post_titles = []  # an empty list to hold the post titles
post_dates = []  # an empty list to hold the post dates

posts = soup.find_all("div", class_="r-ent")
for post in posts:
    # find() returns None for a missing element, so .string raises AttributeError
    try:
        author_ids.append(post.find("div", class_="author").string)
    except AttributeError:
        author_ids.append(np.nan)
    try:
        post_titles.append(post.find("a").string)
    except AttributeError:
        post_titles.append(np.nan)
    try:
        post_dates.append(post.find("div", class_="date").string)
    except AttributeError:
        post_dates.append(np.nan)

# the recommendation count sits in a span inside a div, so it is handled separately
recommendations = soup.find_all("div", class_="nrec")
for recommendation in recommendations:
    try:
        recommends.append(int(recommendation.find("span").string))
    except (AttributeError, TypeError, ValueError):
        # no span, empty span, or non-numeric text
        recommends.append(np.nan)

ptt_nba_dict = {
    "author": author_ids,
    "recommends": recommends,
    "title": post_titles,
    "date": post_dates,
}
ptt_nba_df = pd.DataFrame(ptt_nba_dict)
print(ptt_nba_df)
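An alternative worth considering is to build one dictionary per post, which keeps the four fields of a row together by construction instead of relying on four parallel lists having the same length. The sketch below does this on an invented fragment; the helper text_of and the sample values are assumptions, and the resulting list can be handed to pd.DataFrame when pandas is available.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating two entries; the second one is a deleted post.
html_doc = """
<div class="r-ent"><div class="nrec"><span>12</span></div><div class="title"><a href="/p1">Game thread</a></div><div class="author">alice</div><div class="date">12/25</div></div>
<div class="r-ent"><div class="nrec"></div><div class="title">(deleted)</div><div class="author">-</div><div class="date">12/26</div></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def text_of(tag):
    """Return the tag's text as str, or None when the tag or its text is missing."""
    if tag is None or tag.string is None:
        return None
    return str(tag.string)


records = []
for post in soup.find_all("div", class_="r-ent"):
    nrec = text_of(post.select_one("div.nrec span"))
    records.append({
        "author": text_of(post.find("div", class_="author")),
        "recommends": int(nrec) if nrec is not None and nrec.isdigit() else None,
        "title": text_of(post.find("a")),
        "date": text_of(post.find("div", class_="date")),
    })

print(records)
# With pandas installed, pd.DataFrame(records) yields one row per post.
```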

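rvest: a quick trial run

library(rvest)
library(magrittr)

ptt_nba_parser <- function() {
  url <- "https://www.ptt.cc/bbs/NBA/index.html"
  html_doc <- read_html(url)

  # specify the XPath expressions (the original values are not preserved in the source)
  xpath_author_ids <- "..."
  xpath_recommends <- "..."
  xpath_titles <- "..."
  xpath_dates <- "..."

  # extract the data
  author_ids <- html_doc %>%
    html_nodes(xpath = xpath_author_ids) %>%
    html_text
  recommends <- html_doc %>%
    html_nodes(xpath = xpath_recommends) %>%
    html_text %>%
    as.integer
  titles <- html_doc %>%
    html_nodes(xpath = xpath_titles) %>%
    html_text
  dates <- html_doc %>%
    html_nodes(xpath = xpath_dates) %>%
    html_text

  # assemble into a data frame
  df <- data.frame(author_ids, recommends, titles, dates)
  return(df)
}

ptt_nba_df <- ptt_nba_parser()
View(ptt_nba_df)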

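Notes on the trial runs

With BeautifulSoup we selected elements by CSS class; with rvest we used XPath selectors.

Both approaches have to deal with the same basic problem: deleted posts. In Python we use try-except so that the program is not interrupted; in R we specify the XPath more broadly.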
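To make the "broader XPath" idea concrete in Python, the sketch below uses lxml directly on an invented fragment. Selecting the title <div> itself rather than the <a> inside it also returns an entry for a deleted post, so no exception handling is needed. This is one plausible reading of the approach, not the XPath used in the R code.

```python
from lxml import html

# Hypothetical fragment: the second post is deleted and has no <a> in its title.
html_doc = """
<html><body>
<div class="r-ent"><div class="title"><a href="/p1">Game thread</a></div></div>
<div class="r-ent"><div class="title">(deleted)</div></div>
</body></html>
"""
tree = html.fromstring(html_doc)

# Narrow: only posts that still have a link.
linked_titles = tree.xpath("//div[@class='title']/a/text()")
# Broad: every title <div>, with or without a link.
all_titles = [div.text_content().strip() for div in tree.xpath("//div[@class='title']")]

print(linked_titles)  # ['Game thread']
print(all_titles)     # ['Game thread', '(deleted)']
```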

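We have had a brief practice session with BeautifulSoup, Python's renowned web scraping package, working through some examples from the official documentation and the PTT exercise.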

Published: 2017-12-26 08:00:22
Original link: http://kuaibao.qq.com/s/20171226A0361N00?refer=cp_1026
