In the data-driven era, web crawlers have become a core tool for gathering information. When a target site's anti-crawling measures get in the way, proxy IPs act like an "invisibility cloak" that lets the crawler slip past those limits. This article walks through the complete workflow of scraping data with Python crawlers combined with proxy IPs, in plain language.

Picture a crawler as a "digital spider": it visits pages by sending HTTP requests, receives the HTML, and parses out the data it needs. Python's Requests library is the spider's "legs", while BeautifulSoup and the Scrapy framework act as its "brain".
A proxy server is like a parcel relay station: a request you send from Python first reaches the proxy, which then forwards it to the target website. The target site therefore sees the proxy's IP rather than your real address. The basic pattern with Requests and BeautifulSoup looks like this:
import requests
from bs4 import BeautifulSoup
# Configure the proxy (format: protocol://IP:port)
proxies = {
    'http': 'http://123.45.67.89:8080',
    'https': 'http://123.45.67.89:8080'
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get('https://www.zdaye.com/blog/article/just_changip', proxies=proxies, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.text)
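To confirm that requests really go out through the proxy, you can hit an IP-echo service and compare the address it reports with your own. A minimal check reusing the proxies dict defined above; the httpbin.org/ip endpoint is an assumption here, any service that echoes the caller's IP will do:

# httpbin.org/ip returns the IP address it sees; through a working proxy
# this should print the proxy's IP rather than your own.
ip_check = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10)
print(ip_check.json()['origin'])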
When a single proxy is not enough, you can rotate through a pool of proxies and fetch pages concurrently with threads:

import threading
import time
def fetch_data(url, proxy):
    try:
        # Route both HTTP and HTTPS traffic through the given proxy
        response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
        if response.status_code == 200:
            print(f"Success with {proxy}")
            # Process the data ...
    except requests.RequestException:
        print(f"Failed with {proxy}")
# Paid proxy pool (example)
proxy_pool = [
    'http://proxy1.com:8080',
    'http://proxy2.com:8080',
    # add more proxies ...
]
urls = ['https://example.com/page1', 'https://example.com/page2']
# Start one thread per (url, proxy) pair
threads = []
for url in urls:
    for proxy in proxy_pool:
        t = threading.Thread(target=fetch_data, args=(url, proxy))
        threads.append(t)
        t.start()
        time.sleep(0.1)  # avoid firing too many requests at once
# Wait for all threads to finish
for t in threads:
    t.join()
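Starting one raw thread per (url, proxy) pair can spawn a lot of threads at once. The standard library's concurrent.futures provides a real thread pool that caps concurrency; here is a sketch of the same fetch logic on top of it (the max_workers value of 5 and picking a random proxy per URL are assumptions, tune them to your proxy plan):

from concurrent.futures import ThreadPoolExecutor
import random

def fetch_with_random_proxy(url):
    # Pick a proxy at random for each request instead of trying every one
    return fetch_data(url, random.choice(proxy_pool))

# At most 5 requests are in flight at any moment
with ThreadPoolExecutor(max_workers=5) as pool:
    pool.map(fetch_with_random_proxy, urls)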
For larger projects, Scrapy is a better fit. Configure the proxy middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'myproject.middlewares.ProxyMiddleware': 100,
}
PROXY_POOL = [
    'http://user:pass@proxy1.com:8080',
    'http://user:pass@proxy2.com:8080',
]

Then create the middleware in middlewares.py:
import random

class ProxyMiddleware:
    def process_request(self, request, spider):
        # Pick a random proxy from the PROXY_POOL defined in settings.py
        request.meta['proxy'] = random.choice(spider.settings.get('PROXY_POOL'))
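With the middleware enabled, every request a spider makes is routed through a random proxy from the pool. A minimal spider to try it out; the spider name and start URL are illustrative assumptions, and 'myproject' matches the project name used in the settings above:

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://example.com/page1']

    def parse(self, response):
        # Each request issued here already carries request.meta['proxy']
        yield {'title': response.css('title::text').get()}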
Many sites also throttle clients that send requests too quickly. Spacing requests out with a random delay makes the crawler look less like a bot:

import time
import random

def safe_request(url):
    time.sleep(random.uniform(1, 3))  # wait a random 1-3 seconds before each request
    return requests.get(url)

# Use a Session to keep cookies across requests
session = requests.Session()
response = session.get('https://login.example.com', proxies=proxies)
# Handle the login, then reuse the cookies it sets ...
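The login flow itself depends on the target site. As a rough illustration only, a form-based login with the session above might look like the following; the URLs, field names, and credentials are hypothetical placeholders, not from the original article:

# Hypothetical form-based login: the URL and field names are placeholders
login_payload = {'username': 'alice', 'password': 'secret'}
session.post('https://login.example.com/login', data=login_payload, proxies=proxies)

# The session now carries the login cookies, so protected pages are reachable
profile = session.get('https://login.example.com/profile', proxies=proxies)
print(profile.status_code)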
Once the raw items are scraped, pandas makes it easy to clean them up and export the result:

import pandas as pd

data = []
# Assume the crawler has produced a list called items
for item in items:
    clean_item = {
        'title': item['title'].strip(),
        'price': float(item['price'].replace('$', '')),
        'date': pd.to_datetime(item['date'])
    }
    data.append(clean_item)
df = pd.DataFrame(data)
df.to_csv('output.csv', index=False)

For larger or ongoing scrapes, a database such as MongoDB is a better home for the results than flat files:

import pymongo
client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['mydatabase']
collection = db['products']
for item in items:
    collection.insert_one(item)

By combining Python crawlers with proxy IPs, we can efficiently gather publicly available information from the web. But technology is only a tool, and it creates value only when used responsibly. While enjoying the convenience of data, always keep in mind: technology should have a conscience, and scraping should have limits. The intelligent scraping systems of the future will balance efficiency with ethics.