I have already tested where the bottleneck is: it comes from the SELECT query in the middleware.
import pymysql

class CheckDuplicatesFromDB(object):
    def process_request(self, request, spider):
        # url_list is just a Python list holding some URLs.
        if request.url not in url_list:
            self.crawled_urls = dict()
            connection = pymysql.connect(host='123
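Since the SELECT runs once per request, one way to remove that bottleneck is to load the already-crawled URLs into an in-memory set once at startup and test membership locally. The sketch below is an assumption about how that could look, not the original middleware; the injected `crawled_urls` stands in for a one-time `SELECT url FROM ...` query, and `mark_crawled` stands in for a batched write-back.

```python
class CheckDuplicatesFromDB:
    """Sketch: cache crawled URLs in a set instead of querying MySQL per request."""

    def __init__(self, crawled_urls):
        # crawled_urls would be loaded once from MySQL at startup;
        # here it is injected directly to keep the sketch self-contained.
        self.crawled_urls = set(crawled_urls)

    def is_duplicate(self, url):
        # O(1) membership test replaces the per-request SELECT.
        return url in self.crawled_urls

    def mark_crawled(self, url):
        # Record locally; a periodic batched INSERT could sync back to MySQL.
        self.crawled_urls.add(url)
```

With this shape, `process_request` only touches the database when the set says the URL is new, so the hot path never issues a query.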
Since we start the crawl with Scrapy's own terminal command, how can I also run a function I define myself? For example:
import scrapy

class Fcc(scrapy.Spider):
    name = "fcc"
    start_urls = ["http://freecodecamp.org/"]

    def parse(self, response):
        for link in response.css("a::attr(href)").getall():
            yield {
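One common answer is to run the spider from a plain Python script with Scrapy's `CrawlerProcess` instead of the `scrapy crawl` command; then your own code can run before and after the crawl. The sketch below assumes the `Fcc` spider lives in a module named `fcc_spider`, and `report_done` is a made-up example of "your own function".

```python
def report_done(n_items):
    # A hypothetical user-defined function to run after the crawl.
    return f"crawl finished, {n_items} items"

def main():
    # Scrapy imports kept inside main() so the module stays importable
    # even where Scrapy is not installed.
    from scrapy.crawler import CrawlerProcess
    from fcc_spider import Fcc  # assumed module holding the Fcc spider

    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    process.crawl(Fcc)
    process.start()  # blocks until the crawl finishes
    print(report_done(0))

if __name__ == "__main__":
    main()
```

Because `process.start()` blocks until the crawl is done, anything placed after it (like the `report_done` call here) runs once the spider has finished.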