I run a spider from a script file inside a Scrapy project, and the spider logs its output/results. But I want to use the spider's output/results in that script, inside some function, without saving them to any file or DB. Below is the script code, taken from https://doc.scrapy.org/en/latest/topics/practices.html#run-from-script
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner(get_project_settings())
d = runner.crawl('my_spider')
d.addBoth(lambda _: reactor.stop())
reactor.run()
def spider_output(output):
    # do something with that output
    pass

How can I get the spider output inside the spider_output method? Is it possible to get the output/results there?
Posted on 2016-10-25 21:01:41
Here is the solution that collects all the output/results in a list:
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.signalmanager import dispatcher
def spider_results():
    results = []
    def crawler_results(signal, sender, item, response, spider):
        results.append(item)
    dispatcher.connect(crawler_results, signal=signals.item_scraped)
    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)  # MySpider is your spider class; a spider name string also works
    process.start()  # the script will block here until the crawling is finished
    return results
if __name__ == '__main__':
    print(spider_results())
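A variation on the same idea (my own sketch, not part of the answer above): instead of the global pydispatch dispatcher, you can connect the handler to the crawler's own signal manager via create_crawler(); 'my_spider' is a placeholder spider name.

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def spider_results_via_crawler_signals(spider_name='my_spider'):
    results = []

    def collect_item(item, response, spider):
        # called once per scraped item
        results.append(item)

    process = CrawlerProcess(get_project_settings())
    crawler = process.create_crawler(spider_name)  # resolves the spider by name in the project
    crawler.signals.connect(collect_item, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks until the crawl is finished
    return results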
Posted on 2020-07-15 03:25:50

This is an old question, but for future reference: if you are using Python 3.6+, I recommend scrapyscript, which lets you run your spiders and collect the results in a very simple way:
from scrapyscript import Job, Processor
from scrapy.spiders import Spider
from scrapy import Request
import json
# Define a Scrapy Spider, which can accept *args or **kwargs
# https://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments
class PythonSpider(Spider):
    name = 'myspider'
    def start_requests(self):
        yield Request(self.url)
    def parse(self, response):
        title = response.xpath('//title/text()').extract()
        return {'url': response.request.url, 'title': title}
# Create jobs for each instance. *args and **kwargs supplied here will
# be passed to the spider constructor at runtime
githubJob = Job(PythonSpider, url='http://www.github.com')
pythonJob = Job(PythonSpider, url='http://www.python.org')
# Create a Processor, optionally passing in a Scrapy Settings object.
processor = Processor(settings=None)
# Start the reactor, and block until all spiders complete.
data = processor.run([githubJob, pythonJob])
# Print the consolidated results
print(json.dumps(data, indent=4))

Output:

[
    {
        "title": [
            "Welcome to Python.org"
        ],
        "url": "https://www.python.org/"
    },
    {
        "title": [
            "The world's leading software development platform \u00b7 GitHub",
            "1clr-code-hosting"
        ],
        "url": "https://github.com/"
    }
]
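A small follow-up (an assumption of mine, not shown in the answer): since Processor optionally accepts a Scrapy Settings object, you can reuse your project's settings instead of passing None:

from scrapy.utils.project import get_project_settings
from scrapyscript import Job, Processor

# assumption: run the PythonSpider defined above with the project's own settings
processor = Processor(settings=get_project_settings())
data = processor.run([Job(PythonSpider, url='http://www.example.com')])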
Posted on 2016-10-25 20:15:07

AFAIK there is no way to do this directly, because crawl() returns a Deferred that fires when the crawl is finished, and the crawler does not store results anywhere other than sending them to the logger.

Returning the output would also conflict with Scrapy's whole asynchronous nature and structure, so saving to a file first and then reading it back is the better approach here. You can simply devise a pipeline that saves your items to a file and read that file in spider_output, as sketched below. You will get your results, because reactor.run() blocks your script until the output file is complete.
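A minimal sketch of that pipeline approach, assuming an items.jl output path and a pipeline class name of my own choosing (neither is prescribed by the answer):

import json

# pipelines.py -- write every scraped item as one JSON line
# enable it in settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.JsonWriterPipeline': 300}
class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

# in the runner script, after reactor.run() has returned
def spider_output():
    with open('items.jl') as f:
        return [json.loads(line) for line in f]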
https://stackoverflow.com/questions/40237952