In web-scraper development, timeouts and lazy loading are two common technical challenges. This article shows how to handle both gracefully in Python, with complete code covering best practices for `requests`, `Selenium`, and `Playwright`.

**Setting a timeout with `requests`**

Python's `requests` library lets you set a timeout on any HTTP request:
```python
import requests

url = "https://example.com"

try:
    # Set a connect timeout and a read timeout separately
    response = requests.get(url, timeout=(3, 10))  # 3 s to connect, 10 s to read
    print(response.status_code)
except requests.exceptions.Timeout:
    print("Request timed out; check the network or the target server")
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
```
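Beyond a bare timeout, transient timeouts are often best absorbed by automatic retries. As a sketch (the retry counts and backoff values here are illustrative choices, not from the original), a `requests.Session` can mount an `HTTPAdapter` configured with urllib3's `Retry`:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(retries=3, backoff=0.5):
    """Build a Session that retries failed connections with exponential backoff."""
    retry = Retry(
        total=retries,
        backoff_factor=backoff,                  # waits 0.5 s, 1 s, 2 s between attempts
        status_forcelist=[500, 502, 503, 504],   # also retry on these status codes
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

# Usage: make_session().get("https://example.com", timeout=(3, 10))
```

The per-request `timeout` still applies to each attempt; the retry policy only decides how many attempts are made.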
Key point: `timeout=(connect_timeout, read_timeout)` controls the connect phase and the read phase independently.

**Async timeout control with `aiohttp`**

For high-concurrency scrapers, `aiohttp` (an asynchronous HTTP client) manages timeouts more efficiently:
```python
import aiohttp
import asyncio

async def fetch(session, url):
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as response:
            return await response.text()
    except asyncio.TimeoutError:
        print("Async request timed out")
    except Exception as e:
        print(f"Request failed: {e}")

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, "https://example.com")
        if html:  # fetch() returns None on timeout or failure
            print(html[:100])  # print the first 100 characters

asyncio.run(main())
```
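The total-budget behavior of `ClientTimeout(total=...)` is analogous to wrapping any coroutine in the standard library's `asyncio.wait_for`. A stdlib-only sketch of that idea (the helper name `bounded` is my own, not an aiohttp API):

```python
import asyncio

async def bounded(coro, seconds):
    """Await coro, but return None instead of raising if it exceeds the budget."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return None

async def demo():
    async def slow_fetch():          # stand-in for a slow network call
        await asyncio.sleep(10)
        return "<html>"
    # The budget expires long before slow_fetch finishes, so this yields None
    return await bounded(slow_fetch(), seconds=0.1)

print(asyncio.run(demo()))
```

`wait_for` cancels the underlying task when the budget is spent, so the slow call does not linger in the background.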
Advantage: `ClientTimeout` lets you set a total timeout, a connect timeout, and other limits.

**Lazy loading** means a page does not deliver all of its content at once but loads data dynamically, as is common in infinite-scroll lists and single-page applications.
**Simulating browser behavior with `Selenium`**

`Selenium` can simulate user actions to trigger dynamic loading:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome()
driver.get("https://example.com/lazy-load-page")

# Scroll to the bottom repeatedly to trigger loading
for _ in range(3):  # scroll 3 times
    driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)
    time.sleep(2)  # wait for the data to load

# Grab the fully loaded page
full_html = driver.page_source
print(full_html)
driver.quit()
```
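A fixed number of scrolls is fragile: a longer page needs more rounds, a shorter one fewer. A common pattern is to keep scrolling until `document.body.scrollHeight` stops growing. Here is a sketch with the two driver calls factored into callables (my own structuring, so the loop can be exercised without a browser):

```python
import time

def scroll_until_stable(get_height, scroll_once, max_rounds=10, pause=0.0):
    """Scroll repeatedly until the reported page height stops growing.

    get_height: callable returning the current page height, e.g.
        lambda: driver.execute_script("return document.body.scrollHeight")
    scroll_once: callable performing one scroll, e.g.
        lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    """
    last = get_height()
    for _ in range(max_rounds):
        scroll_once()
        time.sleep(pause)          # give lazy-loaded content time to arrive
        current = get_height()
        if current == last:        # nothing new appeared: assume fully loaded
            return current
        last = current
    return last
```

`max_rounds` keeps a truly infinite feed from trapping the scraper forever.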
Key points: `send_keys(Keys.END)` simulates scrolling to the bottom, and `time.sleep(2)` gives the data time to finish loading.

**Handling dynamic content with `Playwright`**

`Playwright` (an open-source tool from Microsoft) is more efficient than Selenium and supports headless browsers:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/lazy-load-page")

    # Simulate scrolling
    for _ in range(3):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)  # wait 2 seconds

    # Grab the full HTML
    full_html = page.content()
    print(full_html[:500])  # print the first 500 characters
    browser.close()
```
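Note that `wait_for_timeout()` is still a fixed sleep; condition-based waits such as Playwright's `page.wait_for_selector()` are usually preferable because they return as soon as the content actually appears. The idea behind such waits is a poll-until-deadline loop, sketched here with only the standard library (the helper name `wait_until` is mine, not a Playwright API):

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.1):
    """Poll predicate() until it returns truthy or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True          # condition met: stop waiting early
        time.sleep(interval)     # back off briefly before polling again
    return False                 # deadline passed without the condition holding

# Usage idea: wait_until(lambda: ".product-name" in page.content(), timeout=5)
```

Unlike a fixed sleep, this returns the moment the condition holds, and it fails loudly (returns `False`) instead of silently scraping a half-loaded page.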
Advantage: `wait_for_timeout()` is more flexible than `time.sleep()`.

**Worked example**: scrape an e-commerce site that loads products via infinite scroll (such as Taobao or JD.com) while handling timeouts.
```python
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

def fetch_with_requests(url):
    try:
        response = requests.get(url, timeout=(3, 10))
        return response.text
    except requests.exceptions.Timeout:
        print("Request timed out; falling back to Selenium")
        return None

def fetch_with_selenium(url):
    driver = webdriver.Chrome()
    driver.get(url)
    # Scroll 3 times to trigger lazy loading
    for _ in range(3):
        driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)
        time.sleep(2)
    html = driver.page_source
    driver.quit()
    return html

def main():
    url = "https://example-shop.com/products"
    # Try requests first (faster)
    html = fetch_with_requests(url)
    # On failure, fall back to Selenium (handles dynamic loading)
    if html is None or "Loading more products..." in html:
        html = fetch_with_selenium(url)
    # Parse the data (example: extract product names)
    soup = BeautifulSoup(html, 'html.parser')
    products = soup.find_all('div', class_='product-name')
    for product in products[:10]:  # print the first 10 products
        print(product.text.strip())

if __name__ == "__main__":
    main()
```
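The requests-then-Selenium downgrade in `main()` generalizes to an ordered chain of fetchers: try the cheapest strategy first and fall through to heavier ones. A small sketch (the `fetch_with_fallback` helper is my own, not from the original):

```python
def fetch_with_fallback(url, fetchers):
    """Try each fetcher in order; return the first result that is not None."""
    for fetch in fetchers:
        html = fetch(url)
        if html is not None:
            return html          # this strategy succeeded: stop here
    return None                  # every strategy failed

# Usage with the functions above:
# html = fetch_with_fallback(url, [fetch_with_requests, fetch_with_selenium])
```

Keeping the strategies in a list makes it easy to slot in, say, an `aiohttp`-backed fetcher later without touching the control flow.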
Optimizations: try `requests` first (fast), fall back to `Selenium` (handles dynamic loading) on failure, and parse the HTML with `BeautifulSoup`.

| Problem | Solution | Best for |
| --- | --- | --- |
| HTTP request timeouts | `requests.get(timeout=(3, 10))` | Static pages |
| High-concurrency timeout control | `aiohttp` + `ClientTimeout` | Async scrapers |
| Dynamically loaded data | `Selenium` simulated scrolling/clicking | Traditional dynamic pages |
| Efficient headless scraping | `Playwright` + `wait_for_timeout` | Modern SPAs (single-page applications) |
Best-practice recommendations: always set an explicit timeout (e.g. `timeout=(3, 10)`) so requests never wait forever, and start with a lightweight HTTP client (`requests`), reaching for browser automation (`Selenium`/`Playwright`) only when the page truly requires it.