I'm trying to get Scrapy to write its results to an S3 bucket. My settings file contains the following:
ITEM_PIPELINES = {
    'scrapy.pipelines.files.S3FilesStore': 1
}
FEED_URI = 's3://1001-results-bucket/results.json'
FEED_FORMAT = 'json'
My parse function is very simple:
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        for n in range(0, 1):
            request = scrapy.FormRequest("https://website", formdata={'id': "%s" % n})
            yield request

    def parse(self, response):
        yield {
            'foo': 'bar'
        }
I get the following error:
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/scrapy/core/scraper.py", line 71, in __init__
self.itemproc = itemproc_cls.from_crawler(crawler)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/scrapy/middleware.py", line 40, in from_settings
mw = mwcls()
TypeError: __init__() missing 1 required positional argument: 'uri'
Any ideas?
Posted on 2018-02-14 17:05:48
I was able to get around this by creating a custom item pipeline, and it seems to work fine. (The original error comes from registering S3FilesStore in ITEM_PIPELINES: it is a storage backend whose __init__ requires a uri, not an item pipeline, so Scrapy instantiates it with no arguments.)
settings.py
ITEM_PIPELINES = {
    'test.pipelines.S3FileUpload': 1
}
pipelines.py
import os
from os.path import join, dirname

import boto
import boto.s3
from boto.s3.key import Key
from dotenv import Dotenv

# Load AWS credentials from a .env file next to this module
dotenv = Dotenv(join(dirname(__file__), ".env"))
os.environ.update(dotenv)

AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY")

# Open the S3 connection once at import time
s3 = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)


class S3FileUpload(object):

    def process_item(self, item, spider):
        # Upload each scraped item to the bucket under the key test.json
        bucket = s3.get_bucket('my-bucket')
        k = Key(bucket)
        k.key = 'test.json'
        k.set_contents_from_string(str(item))
        return item
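Note that, as written, the pipeline writes every item to the same test.json key, so later items overwrite earlier ones. For reference, the FEED_URI/FEED_FORMAT settings from the question can also export directly to S3 via Scrapy's built-in feed exports, provided botocore (or boto) is installed and AWS credentials are set in settings; a minimal sketch (credential values are placeholders, bucket name taken from the question):

# settings.py -- sketch of Scrapy's built-in feed export to S3
# (assumes botocore or boto is installed)
AWS_ACCESS_KEY_ID = 'your-access-key'        # placeholder
AWS_SECRET_ACCESS_KEY = 'your-secret-key'    # placeholder
FEED_URI = 's3://1001-results-bucket/results.json'
FEED_FORMAT = 'json'
# No S3FilesStore entry in ITEM_PIPELINES is needed for this.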
https://stackoverflow.com/questions/48778381