If you have ever scraped a global e-commerce site like Amazon, you have probably had this maddening experience:
the same product link opens in English on the US site, switches to full-width characters on the Japanese site, and on the German site the € sign even comes out as mojibake.
Worst of all: your scraper "confidently" prints a pile of garbled-looking content without raising a single error.
Problems like this are usually not a coding mistake; they come from ignoring character sets, layout differences between locales, and localization strategy.
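To see how silent this failure is, here is a minimal repro (the price string is invented for illustration): the same UTF-8 bytes decoded with the right and the wrong charset, and neither raises an exception.

raw = "€ 19,99".encode("utf-8")   # the bytes a German product page would send
print(raw.decode("utf-8"))        # € 19,99  (correct)
print(raw.decode("cp1252"))       # â‚¬ 19,99 (silent mojibake, no error raised)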
Today we won't start from the "perfect" implementation. Instead we will work backwards: look at a broken example, unpack why it falls into these traps, and then repair it into a more robust version.
You have probably seen something very similar to the snippet below on Stack Overflow. It runs, but the traps come along with it.
import requests
from bs4 import BeautifulSoup
url = "https://www.amazon.com/dp/B08N5WRWNW"
headers = {"User-Agent": "my-bot/1.0"}
resp = requests.get(url, headers=headers, timeout=10)
# Problem: uses resp.text directly, regardless of what the encoding actually is
soup = BeautifulSoup(resp.text, "html.parser")
# Trap: select_one() returns None when the selector misses; .get_text() then raises
title = soup.select_one("#productTitle").get_text(strip=True)
print("Title:", title)

On the surface, the logic looks right: request → parse → extract the title.
But when it actually runs, the following disasters can show up:

- resp.text auto-decodes with the wrong charset and the text reads like gibberish.
- #productTitle is simply not found, so .get_text() is called on None, a NoneType error is raised, and the run crashes immediately.

The first failure is worth a closer look; see the quick check below.
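Why does resp.text misbehave? When the Content-Type header carries no charset, requests falls back to ISO-8859-1, following the old HTTP spec. Here is a quick way to check on any response (the URL is the one from the example above; whether Amazon actually omits the charset varies by page and locale):

import requests

resp = requests.get("https://www.amazon.com/dp/B08N5WRWNW",
                    headers={"User-Agent": "my-bot/1.0"}, timeout=10)
print(resp.encoding)            # what requests inferred from the headers
print(resp.apparent_encoding)   # what the raw bytes actually look like
# If these two disagree, resp.text is almost certainly producing mojibake.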
Now let's fix it step by step. Goals: detect the real encoding instead of trusting the default, retry failed requests with randomized waits, and tolerate selector differences between locales and layouts.
pip install requests beautifulsoup4 chardet

# robust_amazon_scraper.py
import time
import random
import requests
import chardet
from bs4 import BeautifulSoup
# === Proxy configuration (using 亿牛云 as an example) ===
proxy_host = "proxy.16yun.cn"
proxy_port = 3100
proxy_user = "16YUN"
proxy_pass = "16IP"
proxy_url = f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}"
proxies = {"http": proxy_url, "https": proxy_url}
# === Request headers ===
HEADERS = {
"User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36"),
"Accept-Language": "en-US,en;q=0.9",
}
# === Helper functions ===
def check_robots():
"""简单检查 robots.txt"""
try:
r = requests.get("https://www.amazon.com/robots.txt", timeout=8)
print("robots.txt status:", r.status_code)
print("\n".join(r.text.splitlines()[:10]))
except Exception as e:
print("robots.txt 读取失败:", e)
def detect_encoding(resp):
"""更聪明的编码检测"""
if resp.encoding and resp.encoding.lower() != "iso-8859-1":
return resp.encoding
detected = chardet.detect(resp.content)
return detected.get("encoding") or "utf-8"
def safe_get(url, max_retries=3):
"""带重试与随机等待的请求函数"""
for i in range(max_retries):
try:
r = requests.get(url, headers=HEADERS, proxies=proxies, timeout=15)
if r.status_code == 200 and len(r.content) > 200:
return r
except Exception as e:
print("请求失败:", e)
time.sleep(random.uniform(1, 3))
    raise RuntimeError("All retries failed")
def parse_amazon_page(html):
"""针对不同语言/布局容错"""
soup = BeautifulSoup(html, "html.parser")
title = None
for sel in ["#productTitle", "h1 span#productTitle", ".a-size-large.a-color-base.a-text-normal"]:
node = soup.select_one(sel)
if node:
title = node.get_text(strip=True)
break
price = None
for sel in ["#priceblock_ourprice", ".a-price .a-offscreen", "#corePriceDisplay_desktop_feature_div .a-offscreen"]:
node = soup.select_one(sel)
if node:
price = node.get_text(strip=True)
break
return title, price
def scrape_product(url):
check_robots()
resp = safe_get(url)
encoding = detect_encoding(resp)
html = resp.content.decode(encoding, errors="replace")
title, price = parse_amazon_page(html)
return {"title": title, "price": price, "encoding": encoding}
if __name__ == "__main__":
result = scrape_product("https://www.amazon.com/dp/B08N5WRWNW")
    print(result)

The takeaways: use chardet to determine the real encoding instead of blindly trusting requests' auto-detection, and make reaching for chardet.detect() a habit whenever the declared charset looks suspicious.
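As a quick sanity check on that habit, chardet.detect() can be run on any byte string directly; the confidence score tells you how far to trust the guess (the output below is illustrative and can vary with chardet's version and the sample length):

import chardet

sample = "日本語の商品タイトルのテスト".encode("utf-8")
print(chardet.detect(sample))
# e.g. {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}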