When writing crawlers, have you ever had this experience?
You painstakingly captured the packets,
diligently analyzed the DOM,
scratched your head raw cracking the JS encryption,
and then, finally excited to check whether your demo actually works...
oh no.
You still have to take the raw HTTP request's headers string
and type the fields into a dict, one by one.
When there are a lot of header fields,
your hands go stiff and your head explodes.
Why become a programmer?
Why write crawlers?
Looking back on it: the most important thing about being a programmer is being happy.
Writing this, a thought flashed through my head: why not build a tool
that converts a raw HTTP request's headers into a Python dict?
As it happens, I have done related work before, so I am sharing it now,
so that none of us has to keep doing this repetitive work.
Here is an example.
Take a GET request with headers like the following:
GET https://www.google.com/complete/search?client=chrome-omni&gs_ri=chrome-ext-ansg&xssi=t&q=&oit=0&pgcl=7&gs_rn=42&psi=uwwl1_ZqKAGUC4ku&sugkey=AIzaSyBOti4mM-6x9WDnZIjIeyEU21OpBXqWBgw HTTP/1.1
Host: www.google.com
Connection: keep-alive
X-Client-Data: CJS2yQEIpbbJAQjBtskBCKmdygEIqKPKAQ==
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9
Cookie: NID=123=I8MOj96l2pOfO41yeQ1eiazObgghJa-gxm8xnQEYTP5QGZJUPoM0wQTVWoHAydnRybn-wxH56hfUNTttYw4ojmn8ik1zoBG6lh2J2eI1XL01mzcd6lfZ9RplO48qml6n; 1P_JAR=2018-6-29-7
Traditionally, most of us hand-type this into a dictionary when writing a crawler.
So I wrote a tool called pyheader. So far it has passed testing on macOS and Windows 7; Windows 10 probably works too, but I have not tested it.
It lives in my git repo: leegohi/always
For an HTTP request like the one above, usage is:
1. Copy the raw request.
2. Run pyheader from the command line or a shell.
You then get this result:
HEADERS:
{
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "Host": "www.google.com",
    "X-Client-Data": "CJS2yQEIpbbJAQjBtskBCKmdygEIqKPKAQ=="
}
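The core idea behind a conversion like this is simple: split the raw request on line breaks, skip the request line, and split each header line on the first colon. Here is a minimal sketch of that idea (this is not pyheader's actual implementation, just an illustration):

```python
def headers_to_dict(raw: str) -> dict:
    """Convert the header block of a raw HTTP request into a Python dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and the request line (e.g. "GET /path HTTP/1.1").
        if not line or " HTTP/" in line:
            continue
        # Split on the FIRST colon only, since values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers

raw = """GET https://www.example.com/search?q=demo HTTP/1.1
Host: www.example.com
Connection: keep-alive
Accept-Language: zh-CN,zh;q=0.9"""

print(headers_to_dict(raw))
```

The resulting dict can then be passed straight to `requests.get(url, headers=...)`.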
The tool also handles raw POST or GET requests that carry query data.
For example, given a request like this:
GET /5a1Fazu8AA54nxGko9WTAnF6hhy/su?wd=windows%20python%20%E5%86%99%E5%85%A5%E5%89%AA%E5%88%87&sugmode=2&json=1&p=3&sid=1454_26458_21101_18560_26350_20927&req=2&bs=windows%20python%20%E5%86%99%E5%85%A5%E5%89%AA%E5%88%87&csor=19&cb=jQuery110202842497758386793_1530541567216&_=1530541567219 HTTP/1.1
Host: sp0.baidu.com
Connection: keep-alive
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36
Accept: */*
Referer: https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&tn=baidu&wd=windows%20python%20%E5%86%99%E5%85%A5%E5%89%AA%E5%88%87&rsv_pq=d6bd6cd70000cb1f&rsv_t=9dd0ikH5qzck12PlUA%2FO8yZSNMNn4uVF2NdGhcxJUMnzlDlpuYBMZiJGURQ&rqlang=cn&rsv_enter=1&rsv_sug3=41&rsv_sug1=20&rsv_sug7=101&rsv_sug2=0&inputT=110804&rsv_sug4=110803&rsv_jmp=slow
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
Cookie: BAIDUID=CDBB02FC314A6F5552413452B0872E23:FG=1; BIDUPSID=CDBB02FC314A6F5552413452B0872E23; PSTM=1529893958; H_PS_PSSID=1454_26458_21101_18560_26350_20927; PSINO=3
The result looks like this:
QUERY DATA:
{
    "wd": "windows",
    "sugmode": "2",
    "cb": "jQuery110202842497758386793_1530541567216",
    "req": "2",
    "bs": "windows",
    "p": "3",
    "json": "1",
    "csor": "19",
    "sid": "1454_26458_21101_18560_26350_20927",
    "_": "1530541567219"
}
HEADERS:
{
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Accept": "*/*",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "Host": "sp0.baidu.com",
    "Referer": "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&tn=baidu&wd=windows%20python%20%E5%86%99%E5%85%A5%E5%89%AA%E5%88%87&rsv_pq=d6bd6cd70000cb1f&rsv_t=9dd0ikH5qzck12PlUA%2FO8yZSNMNn4uVF2NdGhcxJUMnzlDlpuYBMZiJGURQ&rqlang=cn&rsv_enter=1&rsv_sug3=41&rsv_sug1=20&rsv_sug7=101&rsv_sug2=0&inputT=110804&rsv_sug4=110803&rsv_jmp=slow",
    "Cookie": "BAIDUID=CDBB02FC314A6F5552413452B0872E23:FG=1; BIDUPSID=CDBB02FC314A6F5552413452B0872E23; PSTM=1529893958; H_PS_PSSID=1454_26458_21101_18560_26350_20927; PSINO=3"
}
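As an aside, Python's standard library can already do the query-string half of this job, including percent-decoding. A minimal sketch (the request line below is shortened from the Baidu example above; this is not pyheader's actual implementation):

```python
from urllib.parse import parse_qsl, urlsplit

# A shortened request line for illustration.
request_line = "GET /su?wd=windows%20python&sugmode=2&json=1 HTTP/1.1"

# The path is the second whitespace-separated token of the request line.
path = request_line.split(" ")[1]

# parse_qsl splits on "&" and "=" and percent-decodes values
# ("%20" becomes a space).
query = dict(parse_qsl(urlsplit(path).query))
print(query)  # {'wd': 'windows python', 'sugmode': '2', 'json': '1'}
```

This decoded dict can be passed to `requests.get(url, params=...)`, which will re-encode it for you.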
OK, that's the introduction. For detailed documentation, head over to my git repo.
One last word: you're welcome to follow my WeChat public account, 什么的干货,
or search for:
smdgh88