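I am using Scrapy to scrape financial data from the link below:
response.body looks like this:
I tried using a regular expression to split the response and then convert it to JSON, but no JSON object comes out. Here is my code: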
import scrapy
import re
import json

class StocksSpider(scrapy.Spider):
    name = 'stocks'
    allowed_domains = ['web.ifzq.gtimg.cn']
    start_urls = ['http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=1&code=00001&startyear=1990&endyear=2016&_callback=jQuery11240339550$']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse,
                #endpoint='render.json', # optional; default is render.html
                #splash_url='<url>', # optional; overrides SPLASH_URL
                #slot_policy=scrapy_splash.SlotPolicy.PER_DOMAIN, # optional
            )

    def parse(self, response):
        try:
            json_data = re.search('\{\"data\"\:(.+?)\}\}\]', response.text).group(1)
        except AttributeError:
            json_data = ''
        #print json_data
        loaded_json = json.loads(json_data)
        print loaded_json
It throws an error saying that no JSON object could be decoded:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/lib/python2.7/dist-packages/scrapy_splash/middleware.py", line 156, in process_spider_output
for el in result:
File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
for x in result:
File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/root/finance/finance/spiders/stocks.py", line 25, in parse
loaded_json = json.loads(json_data)
File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
2018-06-09 23:54:26 [scrapy.core.engine] INFO: Closing spider (finished)
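My goal is to convert it to JSON so that I can easily iterate over the content. Is it necessary to convert it to JSON at all, and if so, how do I do that in this case? The response is in unicode, so do I also need to convert it to UTF-8? Is there another good way to iterate over it?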
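Posted on 2018-06-10 00:56:26
The problem seems to be that the actual data is wrapped in jQuery1124033955090772971586_1528569153921(). I was able to get rid of the wrapper by removing one parameter from the request URL. If you really do need that parameter, this might do the trick: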
>>> import json
>>> url = 'http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=1&code=00001&startyear=1990&endyear=2016&_callback=jQuery1124033955090772971586_1528569153921&_=1528569153953'
>>> fetch(url)
2018-06-09 21:55:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=1&code=00001&startyear=1990&endyear=2016&_callback=jQuery1124033955090772971586_1528569153921&_=1528569153953> (referer: None)
>>> data = response.text.strip('jQuery1124033955090772971586_1528569153921()')
>>> parsed_data = json.loads(data)
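One thing to keep in mind about the snippet above: str.strip treats its argument as a set of characters to remove from both ends, not as a literal prefix. It appears to work here only because the payload starts with { and ends with }, neither of which is in that set. A minimal sketch of a more explicit way to remove the wrapper, written for Python 3, could look like the following. The helper name unwrap_jsonp and the sample payload (including its "code" field) are assumptions made for illustration, not part of the real response.

```python
import json
import re

# Matches "<callback name>( ... )" with an optional trailing semicolon.
_JSONP_RE = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def unwrap_jsonp(text):
    """Decode a JSONP body; hypothetical helper, not part of Scrapy."""
    match = _JSONP_RE.match(text)
    # Fall back to the raw text when no callback wrapper is present.
    payload = match.group(1) if match else text
    return json.loads(payload)


# Made-up sample body; the real response has different fields.
sample = 'jQuery1124033955090772971586_1528569153921({"code": 0, "data": {"data": []}})'
result = unwrap_jsonp(sample)
print(result)
```

Because the pattern does not hard-code the callback name, the same helper should keep working if the name in the URL changes.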
If you would rather drop the _callback parameter from the URL, simply:
>>> import json
>>> url = 'http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=1&code=00001&startyear=1990&endyear=2016&_=1528569153953'
>>> fetch(url)
2018-06-09 21:53:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=1&code=00001&startyear=1990&endyear=2016&_=1528569153953> (referer: None)
>>> parsed_data = json.loads(response.text)
Posted on 2018-06-10 00:51:33
As bla said, the data is valid without &_callback=jQuery1124033955090772971586_1528569153921. The callback is not required, and it is not static either; for example, http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=1&code=00001&startyear=1990&endyear=2016&_callback=test gives the same result.
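If the URL comes from somewhere else and already carries a _callback value, the parameter can be removed programmatically rather than by hand. The sketch below uses only the Python 3 standard library; drop_query_param is a hypothetical helper name introduced for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_param(url, name):
    """Return url with every occurrence of the query parameter `name` removed."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = ('http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport'
       '?type=3&reporttime_type=1&code=00001&startyear=1990&endyear=2016'
       '&_callback=test')
clean_url = drop_query_param(url, '_callback')
print(clean_url)
```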
Posted on 2018-06-10 03:11:17
import re
import scrapy

class StocksSpider(scrapy.Spider):
    name = 'stocks'
    allowed_domains = ['gtimg.cn']
    start_urls = ['http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=1&code=00001&startyear=1990&endyear=2016&_callback=jQuery1124033955090772971586_1528569153921&_=1528569153953']

    def parse(self, response):
        try:
            json = eval(re.findall(r'jQuery\d+_\d+(\(\{.+\}\))', response.body)[0])
            print json
        except:
            self.log('Response couldn\'t be parsed, seems like it is having different format')
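I used eval instead of converting to JSON, since in the end you will be working with it as a dictionary of lists and so on.
It could be something like this: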
import re
import scrapy

class StocksSpider(scrapy.Spider):
    name = 'stocks'
    allowed_domains = ['gtimg.cn']
    start_urls = ['http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=1&code=00001&startyear=1990&endyear=2016&_callback=jQuery1124033955090772971586_1528569153921&_=1528569153953']

    def parse(self, response):
        data = eval(re.findall(r'jQuery\d+_\d+(\(\{.+\}\))', response.body)[0])
        items = data.get('data', {}).get('data', [])
        for item in items:
            yield item
Or you can use json.loads instead of eval; that works too.
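For comparison, here is a rough sketch of the json.loads route in Python 3. One practical difference from eval is that JSON literals such as null, true and false are not valid Python names, so eval would raise NameError on a payload containing them, whereas json.loads maps them to None, True and False. The decoded strings are ordinary unicode text, so no separate UTF-8 conversion step should be needed just to iterate. The data -> data nesting follows the spider above; the function name iter_records and the field names in the sample are invented for illustration.

```python
import json


def iter_records(json_text):
    """Yield each record stored under data -> data in the decoded payload."""
    decoded = json.loads(json_text)
    # Tolerate a missing or null "data" level instead of raising.
    inner = decoded.get('data') or {}
    for record in inner.get('data') or []:
        yield record


# Made-up payload with the callback wrapper already removed.
sample = '{"data": {"data": [{"fd_year": "2016", "note": null, "audited": true}]}}'
records = list(iter_records(sample))
for record in records:
    print(record)
```

Inside a spider, the same loop could yield each record from parse, as in the eval version above.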
https://stackoverflow.com/questions/50779567