LxmlLinkExtractor is implemented on top of lxml's HTMLParser:
class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow...) the file extensions to ignore; if left empty, this defaults to the IGNORED_EXTENSIONS list in the scrapy.linkextractors package, shown below:
IGNORED_EXTENSIONS = [
# images...bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif',
'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg...----
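To illustrate how an extension ignore list of this kind might be applied, here is a minimal standard-library sketch. It is not Scrapy's own implementation: the helper `has_ignored_extension` and the shortened list `SAMPLE_IGNORED` are hypothetical names introduced only for this example, and the list contains just a few of the image extensions quoted above.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical, shortened stand-in for an ignore list (assumption:
# only a handful of image extensions, not Scrapy's full list).
SAMPLE_IGNORED = ['bmp', 'gif', 'jpg', 'jpeg', 'png', 'tif', 'tiff', 'svg']


def has_ignored_extension(url, extensions=SAMPLE_IGNORED):
    """Return True if the URL path ends in one of the given extensions."""
    # Look only at the path, so query strings such as ?size=2 are ignored.
    path = urlparse(url).path
    # splitext returns ('/a/logo', '.PNG'); normalise case and drop the dot.
    ext = posixpath.splitext(path)[1].lower().lstrip('.')
    return ext in extensions


urls = [
    'http://example.com/a/logo.PNG',
    'http://example.com/pic.jpg?size=2',
    'http://example.com/page.html',
]
# Keep only the links a crawler would plausibly want to follow.
followable = [u for u in urls if not has_ignored_extension(u)]
```

In this sketch only `http://example.com/page.html` remains in `followable`; a link extractor configured with an ignore list filters candidate links in a broadly similar spirit before they are scheduled.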
The CrawlSpider example from the official documentation:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors...%s', response.url)
item = scrapy.Item()
item['id'] = response.xpath('//td[@id="item_id