Web Crawlers for Beginners (Ready to Use)

星辉

Published 2019-04-07 06:36:04


This article is included in the column: 用户2119464的专栏

What is a crawler

A crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules.

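Why Python? Python is well suited to writing crawlers: its rich collection of third-party libraries is powerful, and a few lines of code are often enough to get the job done.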

Choosing an editor

Consider using PyCharm; a dedicated editor is more comfortable to work in.

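Rather than going your own way with the lightest, handiest tool, it is better to use an IDE with a large user base: the ecosystem is richer, and when you run into a problem you are more likely to find a solution. I suspect this is the moat that protects popular IDEs and programming languages.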

Setup on macOS

Checking the Python version: type python in Terminal to start Python 2, or type python3 to start Python 3. The interpreter prints its version number on startup.

Note: by default, python refers to Python 2.7 and pip refers to pip2. python2 pairs with pip2, and python3 pairs with pip3. If you do not want the system default python and pip, use python3 and pip3 explicitly.
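Because python and python3 can point at different interpreters, a script can check which one it was started with. The following is a minimal sketch; the helper names interpreter_info and require_python3 are hypothetical and chosen for illustration only.

```python
import sys


def interpreter_info():
    """Return (major, minor, executable path) for the running interpreter."""
    version = sys.version_info
    return version.major, version.minor, sys.executable


def require_python3():
    """Raise an error if the script was started with Python 2."""
    if sys.version_info.major < 3:
        raise RuntimeError("Run this script with python3, not python")


require_python3()
major, minor, path = interpreter_info()
# %-formatting is used so the file still parses under Python 2
# and the check above can report the problem cleanly.
print("Running Python %d.%d from %s" % (major, minor, path))
```

In the same spirit, running python3 -m pip instead of a bare pip command makes sure packages are installed for the interpreter you actually intend to use.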

Installing python3: macOS ships with Python 2.7, so Python 3 has to be installed separately.

brew install python3

Linking python3: if python3 was installed but not linked, it has to be linked manually.

brew link python

This may fail with the following error:

Error: Permission denied @ dir_s_mkdir - /usr/local/Frameworks

Run the following commands to create the directory and give your user ownership of it:

sudo mkdir /usr/local/Frameworks
sudo chown $(whoami):admin /usr/local/Frameworks

Simple crawler code

A bare-bones crawler snippet that fetches the HTML of a page:

import urllib.request

# Fetch the page and decode the raw bytes as UTF-8 text.
response = urllib.request.urlopen('http://python.org/')
result = response.read().decode('utf-8')
print(result)

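Some sites have anti-crawler measures and reject requests that carry no headers. In Chrome, press F12, open the Network tab, inspect the request headers for the page, and copy them for use in your own requests.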

A crawler snippet that sends headers with the request:

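import urllib.request

# Paste the User-Agent string copied from the browser between the quotes.
# The header name is spelled 'User-Agent', with a hyphen.
headers = {'User-Agent': ''}
request = urllib.request.Request('http://python.org/', headers=headers)
response = urllib.request.urlopen(request)
result = response.read().decode('utf-8')
print(result)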
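Headers copied from the Network tab usually arrive as plain "Name: value" lines, while urllib expects a dict. The sketch below shows one way to convert between the two. The function name parse_raw_headers and the sample header values are hypothetical placeholders, not taken from the original article.

```python
def parse_raw_headers(raw):
    """Turn 'Name: value' lines copied from the browser into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ":authority".
        if not line or line.startswith(":"):
            continue
        # Split on the first colon only, since values may contain colons.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        headers[name.strip()] = value.strip()
    return headers


# Placeholder text standing in for whatever you copy from your own browser.
SAMPLE = """
User-Agent: Mozilla/5.0 (placeholder)
Referer: http://python.org/
"""

print(parse_raw_headers(SAMPLE))
```

The resulting dict can be passed directly as the headers argument of urllib.request.Request.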

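Catching and reporting exceptions is essential, so that a single failed request does not interrupt and terminate the whole crawl.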

A crawler snippet wrapped in a try...except structure:

import urllib.request
import urllib.error

try:
    headers = {}
    request = urllib.request.Request('http://python.org/', headers=headers)
    response = urllib.request.urlopen(request)
    result = response.read().decode('utf-8')
# HTTPError is a subclass of URLError, so it must be caught first.
except urllib.error.HTTPError as e:
    print('HTTP status code: ' + str(e.code))
except urllib.error.URLError as e:
    print('Reason for the error: ' + str(e.reason))
else:
    print('Request succeeded.')
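To show how this error handling keeps a crawl running across several pages, here is a minimal sketch that wraps the request in a function and loops over a list of URLs. The names fetch and crawl, the 10-second timeout, and the charset fallback are assumptions made for illustration. Only HTTPError and URLError are handled, so other failures (for example a malformed URL string) would still raise.

```python
import urllib.error
import urllib.request


def fetch(url, headers=None, timeout=10):
    """Return the decoded page text, or None if the request fails."""
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Use the charset declared by the server, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as e:
        # Caught first because HTTPError is a subclass of URLError.
        print("%s: HTTP status code %s" % (url, e.code))
    except urllib.error.URLError as e:
        print("%s: request failed, reason: %s" % (url, e.reason))
    return None


def crawl(urls, headers=None):
    """Fetch every URL, skipping failures instead of stopping."""
    pages = {}
    for url in urls:
        text = fetch(url, headers=headers)
        if text is not None:
            pages[url] = text
    return pages
```

Calling crawl(['http://python.org/']) returns a dict mapping each URL that succeeded to its page text. URLs that fail are reported and skipped, and the loop carries on with the next one.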