012: An Introduction to pyquery, with Hands-On Scraping of Qiushibaike and the Maoyan Rankings

李玺

Published 2021-11-22 18:42:18

Included in the column: 爬虫逆向案例 (Crawler Reverse-Engineering Cases)

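It has been a while since my last update. Lately I have been using pyquery for a number of small scrapers, and I think it is worth recommending. This post walks through how to use pq (the usual import alias for PyQuery) and then applies it to real pages. Most of the content is code.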

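PyQuery is a powerful and flexible library for parsing web pages. If you have front-end experience you have almost certainly used jQuery, and in that case PyQuery is an excellent fit: it is a Python library modeled closely on jQuery, and its syntax is nearly identical, so there is no new set of odd method names to memorize.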

PyQuery basics:

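1. Initializing from a string

from pyquery import PyQuery as pq
 
html = '''
<div id="haha">
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>
'''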
 
doc = pq(html)
print(doc)
print(type(doc))
print(doc('li'))

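Output:

<div id="haha">
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>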

2. Opening an HTML file (mind the path)

from pyquery import PyQuery as pq
doc = pq(filename='index.html')
print(doc)
print(doc('head'))

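Output (the whole document first, then only its head):

<html>
<head>
    <title>Title</title>
</head>
<body>
<div id="haha">
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>
</body>
</html>
<head>
    <title>Title</title>
</head>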

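3. Loading a web page

from pyquery import PyQuery as pq
import requests
# pyquery can also fetch the page itself:
# doc = pq(url='https://www.baidu.com')
# print(doc)
# Here the page is fetched with requests and decoded explicitly as UTF-8
content = requests.get(url='https://www.baidu.com').content.decode('utf-8')
doc = pq(content)
print(doc('head'))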

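4. Searching with CSS selectors

from pyquery import PyQuery as pq

html = '''
<div id="haha">
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>
'''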

doc = pq(html)
print(doc)
print("- - - - - -  - - - - - - - - - - - -- -  - - - - - - -- -  - -")
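# The span tag inside an a tag inside a class "item-0" element inside the id "haha" element
# (each level of nesting is separated by a space)
print(doc('#haha .item-0 a span'))

Output:

<div id="haha">
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>
- - - - - -  - - - - - - - - - - - -- -  - - - - - - -- -  - -
<span class="bold">third item</span>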

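5. Starting from a tag you have already found, you can look up its child or parent tags without searching from the top again.

from pyquery import PyQuery as pq

html = '''
<div id="haha">
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>
'''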

doc = pq(html)
item = doc('div ul')
print(item)
print("-  ----------------------------------------")
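# children() looks only at the direct children of ul, i.e. the li tags; '[class]' keeps
# those that have a class attribute. Replacing class with href would match nothing,
# because children() covers direct children only, not deeper descendants.
print(item.children('[class]'))

Output:

<ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
-  ----------------------------------------
<li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>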

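6. Reading attribute values

from pyquery import PyQuery as pq

html = '''
<div id="haha">
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>
'''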
doc = pq(html)

item = doc(".item-0.active a")

print(type(item))
print(item)
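# Two ways to read an attribute value
print(item.attr.href)
print(item.attr('href'))

Note: class="item-0 active" is one class attribute holding two class names. In a selector, a space means "descendant", so '.item-0 .active a' would look for an a tag inside an element with class active that is itself inside an element with class item-0. To match a single element that carries both classes, join them with a dot instead of a space: '.item-0.active'.

Output:

<class 'pyquery.pyquery.PyQuery'>
<a href="link3.html"><span class="bold">third item</span></a>
link3.html
link3.html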

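7. Reading the text of a tag

from pyquery import PyQuery as pq
 
html = '''
<div id="haha">
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>
'''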
 
doc = pq(html)
a = doc("a").text()
print(a)

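The result is worth noting: text() collects the text of every matched tag and returns it as one string, with the pieces joined by spaces: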

second item third item fourth item fifth item

Advanced usage:

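8. DOM manipulation

1. Adding and removing classes

from pyquery import PyQuery as pq
 
html = '''
<div id="haha">
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>
'''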
 
doc = pq(html)
li = doc('.item-0.active')
print(li)
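# Remove the class "active"
print(li.removeClass('active'))
# Add the class "haha"
print(li.addClass('haha'))

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
         
<li class="item-0 haha"><a href="link3.html"><span class="bold">third item</span></a></li>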

Pretty neat, right?

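2. attr and css: setting attributes and style values

from pyquery import PyQuery as pq
 
html = '''
<div id="haha">
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>
'''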
 
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.attr('id','id_test'))
print(li.css('font-size','20px'))

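Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         
<li class="item-0 active" id="id_test"><a href="link3.html"><span class="bold">third item</span></a></li>
         
<li class="item-0 active" id="id_test" style="font-size: 20px"><a href="link3.html"><span class="bold">third item</span></a></li>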

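3. Removing tags. When scraping, the tag or block of content you pull down often contains child tags you do not want; they can be removed along the following lines.

from pyquery import PyQuery as pq
 
html = '''
<div class="content">
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>
'''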
 
doc = pq(html)
data = doc('.content')
print(data.text())
# Remove every a tag
data.find('a').remove()
# Print again
print(data.text())

Output:

first item second item third item fourth item fifth item
first item

Common methods:

Hands-on examples:

Scraping Tencent job listings:

import os
import random

import requests
from pyquery import PyQuery as pq

USER_AGENTS = [
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
            'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
    ]
# Pick one User-Agent at random for this run (the list itself is left intact)
headers = {
    "User-Agent": random.choice(USER_AGENTS)
}

def req(kw, pages):
    # Create the output directory if it does not exist yet
    os.makedirs('content', exist_ok=True)
    for i in range(pages):
        start = i * 10  # offset of the first result on page i
        base_url = "https://hr.tencent.com/position.php?keywords={}&start={}"
        url = base_url.format(kw, start)
        response = requests.get(url=url, headers=headers).content.decode('utf-8')
        shuju(response, kw)

def shuju(response, kw):
    doc = pq(response)
    items = doc('table')
    # text() joins the rows with spaces; the cells inside a row are separated by newlines
    tr = items('tr').text().split(" ")
    # Keep only the job rows (drop the first piece and the last two)
    tr = tr[1:-2]
    for data in tr:
        data = data.split("\n")
        print("Downloading - - - - - - - - -")
        try:
            cont = ("Position: " + data[0] + "   Category: " + data[1] + "   Openings: " + data[2] + "  Location: " + data[3] + "  Posted: " + data[4] + '\n')
            with open("content/%s.txt" % kw, 'a+', encoding='utf-8') as fp:
                fp.write(cont)
        except IndexError:
            # Skip rows that do not have all five columns
            pass

if __name__ == '__main__':
    kw = input("Job title keyword: ")
    pages = int(input("Number of pages to fetch: "))
    req(kw, pages)

Scraping Qiushibaike:

from pyquery import PyQuery as pq
import requests
import os

base_url = "https://www.qiushibaike.com/hot/page/1/"
headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
            }

response = requests.get(base_url, headers=headers).content.decode('utf-8')
doc = pq(response)
items = doc('#content-left')

# text() joins all matches with spaces, so split the string back into one entry per post
# (the lists fall out of step if a name or a post itself contains a space)
name = items('h2').text().split(' ')
content = items('.content span').text().split(' ')
dianzan = items('.stats-vote i').text().split(' ')
pinglun = items('.stats-comments i').text().split(' ')
# Stop at the shortest list so a short page cannot raise IndexError
count = min(len(name), len(content), len(dianzan), len(pinglun))
with open("qiushi.txt", "a+", encoding='utf-8') as fp:
    for i in range(count):
        data = "Name: " + name[i] + " Content: " + content[i] + '\n' + "    Likes: " + dianzan[i] + " Comments: " + pinglun[i] + '\n'
        print(data)
        fp.write(data)

# Download the thumbnail images into ./qiushi
kw = "qiushi"
os.makedirs(kw, exist_ok=True)
for img in items('.thumb a img'):
    img_url = "https:" + img.attrib['src']  # src has no scheme, so prepend one
    print(img_url)
    img_data = requests.get(img_url, headers=headers).content
    img_name = img_url[-10:]  # last 10 characters of the URL serve as the file name
    # "wb" rather than "ab": rerunning the script must not append to an existing image
    with open(os.path.join(kw, img_name), "wb") as f:
        f.write(img_data)

Scraping the Maoyan rankings:

from pyquery import PyQuery as pq
import requests

headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
            }
url = "https://maoyan.com/board"
response = requests.get(url, headers=headers)

content_all = response.content.decode('utf-8')

doc = pq(content_all)
items = doc('div dd')
# text() joins all matches with spaces, so split the string back into one entry per movie
# (the lists fall out of step if a title itself contains a space)
paiming = items('.board-index').text().split(" ")
name = items('.name').text().split(" ")
zhuyan = items('.star').text().split(" ")
daytime = items('.releasetime').text().split(" ")
pingfen = items('.score').text().split(" ")
print(name)
# The poster URL is stored in the data-src attribute of each image
url_list = [img.attrib['data-src'] for img in items('.board-img')]
# Stop at the shortest list so a short page cannot raise IndexError
count = min(len(paiming), len(name), len(zhuyan), len(daytime), len(pingfen), len(url_list))
with open('maoyan.txt', 'a+', encoding='utf-8') as fp:
    for i in range(count):
        data = "Rank: " + paiming[i] + " Movie: " + name[i] + " " + zhuyan[i] + " " + daytime[i] + " Score: " + pingfen[i] + " Image URL: " + url_list[i]
        print(data)
        fp.write(data + '\n')

Originally published: 2019/03/01. In case of copyright infringement, contact cloudcommunity@tencent.com for removal.
