I wrote my first web scraper, which (surprisingly) does the job. I'm scraping a popular comic site for its images (there are 900+ of them), but the problem is that the scraper is far too slow.
For example, downloading a sample of 10 comics takes on average 4 to 5 seconds per image (more than 40 seconds for the whole sample), which is a bit too slow if you ask me, since each image is only about 80KB to 800KB in size.
I've read that I could switch to lxml to do the scraping asynchronously, but that package is not compatible with Python 3.6.
I tried this:
pip3 install lxml
only to get this:
Could not find a version that satisfies the requirement python-lxml (from versions: )
No matching distribution found for python-lxml
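(As an aside, lxml is an HTML/XML parser, not an async downloader; asynchronous fetching in Python is normally done with asyncio plus an async HTTP client. A minimal sketch, assuming the third-party aiohttp package is installed:)

import asyncio
import aiohttp

async def fetch_bytes(session, url):
    # Download one URL and return its raw bytes.
    async with session.get(url) as response:
        return await response.read()

async def fetch_all(urls):
    # Run all downloads concurrently over one shared session.
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_bytes(session, url) for url in urls]
        return await asyncio.gather(*tasks)

# Python 3.6 has no asyncio.run() yet, so drive the event loop manually.
loop = asyncio.get_event_loop()
pages = loop.run_until_complete(fetch_all(['http://www.poorlydrawnlines.com/archive/']))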
So my question is: how can I speed up the scraper? Is my scraping logic to blame? And finally, is there a way to parse only the relevant part of a web page?
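On that last point: BeautifulSoup can restrict parsing to just the tags of interest with a SoupStrainer, which skips building the rest of the tree. A small sketch (here html stands for page source that has already been fetched):

from bs4 import BeautifulSoup, SoupStrainer

# Build the tree from <img> tags only, ignoring everything else on the page.
only_imgs = SoupStrainer('img')
soup = BeautifulSoup(html, 'html.parser', parse_only=only_imgs)
for img in soup.find_all('img', src=True):
    print(img['src'])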
Here's the code. I've stripped out all the eye candy and input validation; the full code is here:
import re
import time
import requests
import itertools
from requests import get
from bs4 import BeautifulSoup as bs


def generate_comic_link(array, num):
    # Yield the first `num` links from the archive list.
    for link in itertools.islice(array, 0, num):
        yield link


def grab_image_src_url(link):
    # Fetch a comic page and return the URL of the first image inside a <p> tag.
    req = requests.get(link)
    comic = req.text
    soup = bs(comic, 'html.parser')
    for i in soup.find_all('p'):
        for img in i.find_all('img', src=True):
            return img['src']


def download_image(link):
    # Save the image to a file named after the last URL segment.
    file_name = link.split('/')[-1]
    with open(file_name, "wb") as file:
        response = get(link)
        file.write(response.content)


def fetch_comic_archive():
    # Collect every href on the archive page.
    url = 'http://www.poorlydrawnlines.com/archive/'
    req = requests.get(url)
    page = req.text
    soup = bs(page, 'html.parser')
    all_links = []
    for link in soup.find_all('a'):
        all_links.append(link.get('href'))
    return all_links


def filter_comic_archive(archive):
    # Keep only links that point to individual comics.
    # link.get('href') can return None, so guard before matching.
    pattern = re.compile(r'http://www.poorlydrawnlines.com/comic/.+')
    filtered_links = [i for i in archive if i and pattern.match(i)]
    return filtered_links


all_comics = fetch_comic_archive()
found_comics = filter_comic_archive(all_comics)
print("\nThe scraper has found {} comics.".format(len(found_comics)))
print("How many comics do you want to download?")
n_of_comics = int(input(">> ").strip())

start = time.time()
for link in generate_comic_link(found_comics, n_of_comics):
    print("Downloading: {}".format(link))
    url = grab_image_src_url(link)
    download_image(url)
end = time.time()

print("Successfully downloaded {} comics in {:.2f} seconds.".format(n_of_comics, end - start))
Posted on 2018-03-18 21:39:30
The solution was to import threading. Keeping the same code as in the question, here is the change:
...
for link in generate_comic_link(found_comics, n_of_comics):
    print("Downloading: {}".format(link))
    url = grab_image_src_url(link)
    thread = threading.Thread(target=download_image, args=(url,))
    thread.start()
    thread.join()  # wait for this download before starting the next
...
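One caveat about the snippet above: calling thread.join() inside the loop makes the main thread wait for each download to finish before starting the next one. To let the downloads actually overlap, a common variant (a sketch, not the refactored code linked below) starts all the threads first and joins them afterwards:

import threading

threads = []
for link in generate_comic_link(found_comics, n_of_comics):
    print("Downloading: {}".format(link))
    url = grab_image_src_url(link)
    thread = threading.Thread(target=download_image, args=(url,))
    thread.start()          # kick off the download without waiting for it
    threads.append(thread)

for thread in threads:
    thread.join()           # now wait for all downloads to finish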
This cut the download time by almost 50%, even with the rough code shown above: the 10-image sample now downloads in about 21 seconds, compared to more than 40 seconds before.
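For comparison, the standard-library concurrent.futures module offers a fixed-size thread pool that also parallelizes the page fetches and caps the number of simultaneous connections. A sketch along the same lines, where fetch_one is a hypothetical helper combining the two steps:

from concurrent.futures import ThreadPoolExecutor

def fetch_one(link):
    # Hypothetical helper: resolve the image URL, then download the image.
    download_image(grab_image_src_url(link))

with ThreadPoolExecutor(max_workers=8) as executor:
    # map() schedules every link on the pool and blocks until all finish.
    list(executor.map(fetch_one, generate_comic_link(found_comics, n_of_comics)))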
The fully refactored code is here.
https://stackoverflow.com/questions/49322145