This is issue #4 of Python Knowledge Weekly (PKW).
"Few things are impossible in themselves; and it is often for want of will, rather than of means, that man fails to succeed."
1. An introduction to the asyncio standard library
2. Crawling a novel with asynchronous I/O
What is asyncio for? It is a standard-library module for writing concurrent code with the async/await syntax: coroutines declared with async def are scheduled on a single-threaded event loop, which switches between them at await points.
The example below illustrates asyncio's key constructs (async, await, and the event loop):
import asyncio
import threading


async def say_hello():
    print('SAY HELLO')


async def hello():
    # print the current thread so we can see which thread runs each coroutine
    print("hello! %s" % threading.current_thread())
    await asyncio.sleep(2)   # suspend here and hand control back to the event loop
    print('here')
    print('Hello! %s' % threading.current_thread())


async def no_hello():
    print('no Hello! %s' % threading.current_thread())
    # awaiting another coroutine runs it right away, much like a normal call
    await say_hello()


def use_loop():
    loop = asyncio.get_event_loop()
    # note: passing coroutine objects straight to asyncio.wait works on
    # Python 3.10 and earlier but is deprecated; on 3.11+ wrap them in tasks
    # first, e.g. loop.create_task(hello())
    tasks = [hello(), no_hello()]
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()


if __name__ == "__main__":
    use_loop()
The output (asyncio.wait wraps the coroutines in no particular order, so the first few lines may swap between runs):
no Hello! <_MainThread(MainThread, started 30864)>
SAY HELLO
hello! <_MainThread(MainThread, started 30864)>
here
Hello! <_MainThread(MainThread, started 30864)>
As the thread names show, all of these coroutines are executed concurrently on a single thread. asyncio also exposes a higher-level Task API; the same example can be rewritten with asyncio.create_task and asyncio.run:
import asyncio
import threading


async def say_hello():
    print('SAY HELLO')


async def hello():
    print("hello! %s" % threading.current_thread())
    await asyncio.sleep(2)
    print('here')
    print('Hello! %s' % threading.current_thread())


async def no_hello():
    print('no Hello! %s' % threading.current_thread())
    await say_hello()


async def use_task():
    # create_task schedules each coroutine on the running event loop immediately
    task1 = asyncio.create_task(hello())
    task2 = asyncio.create_task(no_hello())
    # await both tasks so use_task() does not return before they finish
    await task1
    await task2


if __name__ == "__main__":
    # asyncio.run creates the event loop, runs use_task() and closes the loop for us
    asyncio.run(use_task())
The output:
hello! <_MainThread(MainThread, started 28288)>
no Hello! <_MainThread(MainThread, started 28288)>
SAY HELLO
here
Hello! <_MainThread(MainThread, started 28288)>
As you can see, the effect is the same.
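Besides asyncio.wait and create_task, asyncio.gather is another common way to run several coroutines concurrently. It is also a convenient way to see what "concurrent on one thread" really means: the event loop can only switch between coroutines at await points, so a blocking call such as time.sleep stalls everything. A minimal sketch of that contrast (the function names here are illustrative, they are not part of the examples above):

import asyncio
import time


async def cooperative():
    await asyncio.sleep(2)   # yields to the event loop, so other coroutines can run
    print('cooperative done')


async def blocking():
    time.sleep(2)            # blocks the only thread, nothing else runs meanwhile
    print('blocking done')


async def main():
    start = time.perf_counter()
    await asyncio.gather(cooperative(), cooperative())
    print('two cooperative sleeps: %.1f s' % (time.perf_counter() - start))  # ~2 s

    start = time.perf_counter()
    await asyncio.gather(blocking(), blocking())
    print('two blocking sleeps: %.1f s' % (time.perf_counter() - start))     # ~4 s


if __name__ == "__main__":
    asyncio.run(main())

For I/O-bound work such as the crawler below, awaiting at the right places is exactly what lets one thread overlap all the waiting.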
Good, now let's put this into practice. Here we use the aiohttp library, an asynchronous HTTP client built on top of asyncio; you can think of it as requests with async I/O support. The novel site we will crawl is http://www.jinyongwang.com/fei/, Jin Yong's 《飞狐外传》 (Other Tales of the Flying Fox). Looking at the page structure first: the chapter list sits in a <ul class="mlist">, each <li> holds a link whose text is the chapter number and title separated by a full-width space (\u3000), and each chapter page keeps its text in <p> tags inside a <div class="vcon">. The crawler below follows exactly that structure:
import asyncio

import aiohttp
import requests
from bs4 import BeautifulSoup


def get_chapter_feihuwaizhuan():
    # fetch the chapter list synchronously; it is a single small page
    url = 'http://www.jinyongwang.com/fei'
    res = requests.get(url).text
    content = BeautifulSoup(res, "html.parser")
    ul = content.find('ul', attrs={'class': 'mlist'}).find_all('li')
    chapter = []
    for i in ul:
        # each entry looks like "chapter\u3000title"
        chap_name = i.find('a').text.split('\u3000')
        if len(chap_name) == 2:
            chap = chap_name[0]
            name = chap_name[1]
            uri = i.find('a')['href']
            chapter.append([chap, name, uri])
    # print(chapter)
    return chapter


async def fetch(session, url):
    # perform the request and read the body without blocking the event loop
    async with session.get(url) as response:
        return await response.text()


async def get_fei_details(chapter):
    baseurl = 'http://www.jinyongwang.com'
    url = baseurl + chapter
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        content = BeautifulSoup(html, "html.parser")
        div = content.find('div', attrs={'class': 'vcon'}).find_all('p')
        details = []
        for p in div:
            de = p.text
            details.append(de)
        print(details)


if __name__ == "__main__":
    # asyncio.run(get_fei_details('/fei/484.html'))
    chap = get_chapter_feihuwaizhuan()
    loop = asyncio.get_event_loop()
    # wrap each coroutine in a Task; asyncio.wait no longer accepts bare
    # coroutines on Python 3.11+
    tasks = [loop.create_task(get_fei_details(url[2])) for url in chap]
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()
With this, the different chapters are downloaded almost simultaneously.
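One practical caveat: the code above fires one request per chapter all at once. That is usually fine for a couple of dozen chapters, but if you want to be gentler on the site you can cap the number of in-flight requests with asyncio.Semaphore. A rough sketch of how that could wrap the get_fei_details defined above (the limit of 5 is an arbitrary choice, not from the article):

import asyncio


async def crawl_all(chapter_uris, limit=5):
    # allow at most `limit` chapter downloads in flight at any one time
    sem = asyncio.Semaphore(limit)

    async def bounded(uri):
        async with sem:
            await get_fei_details(uri)   # the URI-based version defined above

    await asyncio.gather(*(bounded(uri) for uri in chapter_uris))

It could then be driven with asyncio.run(crawl_all([c[2] for c in get_chapter_feihuwaizhuan()])).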
Of course, a crawler that doesn't save what it fetches is pointless! We use the aiofiles library, which provides asynchronous file I/O, for the saving step:
import aiofiles


async def save(chapter, details):
    print('save to txt')
    # write each paragraph to "<chapter>.txt" without blocking the event loop
    async with aiofiles.open(chapter + '.txt', 'w', encoding='gb18030') as fd:
        for i in range(len(details)):
            s = details[i] + '\n'
            await fd.write(s)
    print('save finish!')
Then we just call it from the get_fei_details function (note that it now receives the whole [chap, name, uri] entry rather than just the URI):
async def get_fei_details(chapter):
    # chapter is one [chap, name, uri] entry from get_chapter_feihuwaizhuan()
    baseurl = 'http://www.jinyongwang.com'
    url = baseurl + chapter[2]
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        content = BeautifulSoup(html, "html.parser")
        div = content.find('div', attrs={'class': 'vcon'}).find_all('p')
        details = []
        for p in div:
            de = p.text
            details.append(de)
        print(details)
        print(chapter)
        # save the chapter text under its title
        await save(chapter[1], details)
With that, we have implemented asynchronously crawling the novel and saving it to disk.
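Since get_fei_details now expects the whole chapter entry, the __main__ block shown earlier also needs a small tweak so that each task receives the entry instead of just the URI. A minimal sketch of the final wiring (same functions as above, using asyncio.run instead of managing the loop by hand):

if __name__ == "__main__":
    chapters = get_chapter_feihuwaizhuan()

    async def main():
        # one download-and-save coroutine per chapter, run concurrently
        await asyncio.gather(*(get_fei_details(c) for c in chapters))

    asyncio.run(main())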