This is issue #4 of Python Knowledge Weekly (PKW).
"Few things are impossible in themselves; and it is often for want of will, rather than of means, that man fails to succeed."
1. An introduction to the asyncio standard library
2. Crawling a novel with asynchronous I/O
What is asyncio for? It is a standard-library module for writing concurrent code with the async/await syntax: coroutines declared with async def are scheduled on a single-threaded event loop, which switches between them at await points.
The example below illustrates asyncio's key constructs (async, await, and the event loop):
import asyncio
import threading


async def say_hello():
    print('SAY HELLO')


async def hello():
    # print the current thread so we can see which thread runs each coroutine
    print("hello! %s" % threading.current_thread())
    await asyncio.sleep(2)   # suspend here and hand control back to the event loop
    print('here')
    print('Hello! %s' % threading.current_thread())


async def no_hello():
    print('no Hello! %s' % threading.current_thread())
    # awaiting another coroutine runs it right away, much like a normal call
    await say_hello()


def use_loop():
    loop = asyncio.get_event_loop()
    # note: passing coroutine objects straight to asyncio.wait works on
    # Python 3.10 and earlier but is deprecated; on 3.11+ wrap them in tasks
    # first, e.g. loop.create_task(hello())
    tasks = [hello(), no_hello()]
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()


if __name__ == "__main__":
    use_loop()
The output (asyncio.wait wraps the coroutines in no particular order, so the first few lines may swap between runs):
no Hello! <_MainThread(MainThread, started 30864)>
SAY HELLO
hello! <_MainThread(MainThread, started 30864)>
here
Hello! <_MainThread(MainThread, started 30864)>
As the thread names show, all of these coroutines are executed concurrently on a single thread. asyncio also exposes a higher-level Task API; the same example can be rewritten with asyncio.create_task and asyncio.run:
import asyncio
import threading


async def say_hello():
    print('SAY HELLO')


async def hello():
    print("hello! %s" % threading.current_thread())
    await asyncio.sleep(2)
    print('here')
    print('Hello! %s' % threading.current_thread())


async def no_hello():
    print('no Hello! %s' % threading.current_thread())
    await say_hello()


async def use_task():
    # create_task schedules each coroutine on the running event loop immediately
    task1 = asyncio.create_task(hello())
    task2 = asyncio.create_task(no_hello())
    # await both tasks so use_task() does not return before they finish
    await task1
    await task2


if __name__ == "__main__":
    # asyncio.run creates the event loop, runs use_task() and closes the loop for us
    asyncio.run(use_task())
The output:
hello! <_MainThread(MainThread, started 28288)>
no Hello! <_MainThread(MainThread, started 28288)>
SAY HELLO
here
Hello! <_MainThread(MainThread, started 28288)>
As you can see, the effect is the same.
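Besides asyncio.wait and create_task, asyncio.gather is another common way to run several coroutines concurrently. It is also a convenient way to see what "concurrent on one thread" really means: the event loop can only switch between coroutines at await points, so a blocking call such as time.sleep stalls everything. A minimal sketch of that contrast (the function names here are illustrative, they are not part of the examples above):

import asyncio
import time


async def cooperative():
    await asyncio.sleep(2)   # yields to the event loop, so other coroutines can run
    print('cooperative done')


async def blocking():
    time.sleep(2)            # blocks the only thread, nothing else runs meanwhile
    print('blocking done')


async def main():
    start = time.perf_counter()
    await asyncio.gather(cooperative(), cooperative())
    print('two cooperative sleeps: %.1f s' % (time.perf_counter() - start))  # ~2 s

    start = time.perf_counter()
    await asyncio.gather(blocking(), blocking())
    print('two blocking sleeps: %.1f s' % (time.perf_counter() - start))     # ~4 s


if __name__ == "__main__":
    asyncio.run(main())

For I/O-bound work such as the crawler below, awaiting at the right places is exactly what lets one thread overlap all the waiting.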
Good, now let's put this into practice. Here we use the aiohttp library, an asynchronous HTTP client built on top of asyncio; you can think of it as requests with async I/O support. The novel site we will crawl is http://www.jinyongwang.com/fei/, Jin Yong's 《飞狐外传》 (Other Tales of the Flying Fox). Looking at the page structure first: the chapter list sits in a <ul class="mlist">, each <li> holds a link whose text is the chapter number and title separated by a full-width space (\u3000), and each chapter page keeps its text in <p> tags inside a <div class="vcon">. The crawler below follows exactly that structure:
import asyncio

import aiohttp
import requests
from bs4 import BeautifulSoup


def get_chapter_feihuwaizhuan():
    # fetch the chapter list synchronously; it is a single small page
    url = 'http://www.jinyongwang.com/fei'
    res = requests.get(url).text
    content = BeautifulSoup(res, "html.parser")
    ul = content.find('ul', attrs={'class': 'mlist'}).find_all('li')
    chapter = []
    for i in ul:
        # each entry looks like "chapter\u3000title"
        chap_name = i.find('a').text.split('\u3000')
        if len(chap_name) == 2:
            chap = chap_name[0]
            name = chap_name[1]
            uri = i.find('a')['href']
            chapter.append([chap, name, uri])
    # print(chapter)
    return chapter


async def fetch(session, url):
    # perform the request and read the body without blocking the event loop
    async with session.get(url) as response:
        return await response.text()


async def get_fei_details(chapter):
    baseurl = 'http://www.jinyongwang.com'
    url = baseurl + chapter
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        content = BeautifulSoup(html, "html.parser")
        div = content.find('div', attrs={'class': 'vcon'}).find_all('p')
        details = []
        for p in div:
            de = p.text
            details.append(de)
        print(details)


if __name__ == "__main__":
    # asyncio.run(get_fei_details('/fei/484.html'))
    chap = get_chapter_feihuwaizhuan()
    loop = asyncio.get_event_loop()
    # wrap each coroutine in a Task; asyncio.wait no longer accepts bare
    # coroutines on Python 3.11+
    tasks = [loop.create_task(get_fei_details(url[2])) for url in chap]
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()
With this, the different chapters are downloaded almost simultaneously.
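One practical caveat: the code above fires one request per chapter all at once. That is usually fine for a couple of dozen chapters, but if you want to be gentler on the site you can cap the number of in-flight requests with asyncio.Semaphore. A rough sketch of how that could wrap the get_fei_details defined above (the limit of 5 is an arbitrary choice, not from the article):

import asyncio


async def crawl_all(chapter_uris, limit=5):
    # allow at most `limit` chapter downloads in flight at any one time
    sem = asyncio.Semaphore(limit)

    async def bounded(uri):
        async with sem:
            await get_fei_details(uri)   # the URI-based version defined above

    await asyncio.gather(*(bounded(uri) for uri in chapter_uris))

It could then be driven with asyncio.run(crawl_all([c[2] for c in get_chapter_feihuwaizhuan()])).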
Of course, a crawler that doesn't save what it fetches is pointless! We use the aiofiles library, which provides asynchronous file I/O, for the saving step:
import aiofiles


async def save(chapter, details):
    print('save to txt')
    # write each paragraph to "<chapter>.txt" without blocking the event loop
    async with aiofiles.open(chapter + '.txt', 'w', encoding='gb18030') as fd:
        for i in range(len(details)):
            s = details[i] + '\n'
            await fd.write(s)
    print('save finish!')
Then we just call it from the get_fei_details function (note that it now receives the whole [chap, name, uri] entry rather than just the URI):
async def get_fei_details(chapter):
    # chapter is one [chap, name, uri] entry from get_chapter_feihuwaizhuan()
    baseurl = 'http://www.jinyongwang.com'
    url = baseurl + chapter[2]
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        content = BeautifulSoup(html, "html.parser")
        div = content.find('div', attrs={'class': 'vcon'}).find_all('p')
        details = []
        for p in div:
            de = p.text
            details.append(de)
        print(details)
        print(chapter)
        # save the chapter text under its title
        await save(chapter[1], details)
With that, we have implemented asynchronously crawling the novel and saving it to disk.
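Since get_fei_details now expects the whole chapter entry, the __main__ block shown earlier also needs a small tweak so that each task receives the entry instead of just the URI. A minimal sketch of the final wiring (same functions as above, using asyncio.run instead of managing the loop by hand):

if __name__ == "__main__":
    chapters = get_chapter_feihuwaizhuan()

    async def main():
        # one download-and-save coroutine per chapter, run concurrently
        await asyncio.gather(*(get_fei_details(c) for c in chapters))

    asyncio.run(main())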