Python Scrapy Cross-Platform Crawling in Practice: XPath Parsing and Structured Data Extraction


小白学大数据

Published 2026-06-29 16:52:52


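A crawler is basically a four-stage pipeline: request, download, parse, store. Requesting and downloading look much the same in every language and framework; the parsing layer is where productivity really diverges. BeautifulSoup gets clumsy with deeply nested markup and conditional filtering, and regular expressions are hard to read and costly to maintain. XPath is a W3C-standard query language designed for tree structures, and paired with Scrapy's asynchronous engine it is a strong fit for large crawlers that have to cover many different sites (the "cross-platform" of the title).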

1. Scrapy project setup

pip install scrapy
scrapy startproject multispider && cd multispider
scrapy genspider technews example.com

Declare the structured fields in items.py:

import scrapy

class NewsItem(scrapy.Item):
    title        = scrapy.Field()
    url          = scrapy.Field()
    author       = scrapy.Field()
    publish_date = scrapy.Field()
    content      = scrapy.Field()
    tags         = scrapy.Field()
    source       = scrapy.Field()

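2. XPath quick reference

Scenario	Expression	Notes
Global search	//div[@class="list"]	Matches at any depth in the document
Relative lookup	.//h2/a/@href	Searches from the current node; the most important pattern in practice
Fuzzy match	contains(@class, "active")	For elements with several classes; a substring test, so it also matches "inactive"
Position filter	//li[position()<=3]	First N <li> elements under each parent
Axis traversal	//h2/following-sibling::p	Following sibling nodes
Exclusion	//p[not(contains(@class,"ad"))]	Filters ads inside XPath; "ad" also matches classes such as "lead"

Core rule: when looping over list items, every child XPath must start with . (that is, .//). A bare // goes back to the document root and searches the whole page, so fields end up attached to the wrong item.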

3. Core spider: two-level parsing, from list page to detail page

Edit spiders/technews.py:

import scrapy
from multispider.items import NewsItem

class TechNewsSpider(scrapy.Spider):
    name = 'technews'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/news']

    def parse(self, response):
        # List page: locate every article entry
        for article in response.xpath('//div[@class="article-list"]/article'):
            detail_url = article.xpath('.//h2/a/@href').get()
            if detail_url:
                yield response.follow(
                    detail_url,
                    callback=self.parse_detail,
                    meta={'list_title': article.xpath('.//h2/a/text()').get(default='').strip()}
                )

        # Pagination
        next_page = response.xpath('//a[contains(@class,"next")]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        item = NewsItem()
        item['title'] = (
            response.xpath('//h1[@class="article-title"]/text()').get(default='').strip()
            or response.meta.get('list_title', '')
        )
        item['url']          = response.url
        item['author']       = response.xpath('//span[@class="author-name"]/text()').get(default='anonymous').strip()
        item['publish_date'] = response.xpath('//time[@class="publish-date"]/@datetime').get()
        item['tags']         = response.xpath('//div[@class="tags"]//a/text()').getall()
        item['source']       = 'technews'

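        # Body text: skip ad / recommendation paragraphs.
        # contains() is a substring test, so "ad" also excludes classes such as "lead";
        # /text() returns only a <p>'s direct text, not text nested in <a> or <strong>.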
        paragraphs = response.xpath(
            '//div[@class="article-body"]'
            '//p[not(contains(@class,"ad")) and not(contains(@class,"recommend"))]'
            '/text()'
        ).getall()
        item['content'] = '\n'.join(p.strip() for p in paragraphs if p.strip())

        yield item

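Four key techniques:

.// relative paths: inside the loop body, start with . so one entry never picks up another entry's data
get(default=''): returns a safe fallback instead of None, so the chained .strip() cannot raise AttributeError
response.follow(): resolves relative URLs automatically, with no manual domain concatenation
meta pass-through: carries list-page data to the detail callback, where it serves as a fallback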
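
To make the response.follow() point concrete, here is a small standard-library sketch of the URL resolution it saves you from writing. response.follow() resolves an href against the page it was found on (honoring a <base> tag if the page has one), which for ordinary pages behaves like urllib.parse.urljoin. The page URL and hrefs below are made up for the example.

```python
from urllib.parse import urljoin

# Hypothetical page URL and hrefs, for illustration only.
page_url = "https://example.com/news/page/2"
hrefs = [
    "/news/123.html",                   # root-relative
    "detail/456.html",                  # relative to the current directory
    "?page=3",                          # query-only
    "https://cdn.example.com/x.html",   # already absolute
]

resolved = [urljoin(page_url, href) for href in hrefs]
for href, url in zip(hrefs, resolved):
    print(f"{href!r:40} -> {url}")
```

Concatenating the domain by hand gets the second and third cases wrong, which is why the spider leaves resolution to response.follow().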

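4. Cross-platform adaptation: decoupling rule configuration from spider logic

HTML structure differs from site to site, but the data model and the cleaning logic can be reused as they are. The core idea is to pull the XPath rules out into a configuration dictionary: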

import scrapy
from multispider.items import NewsItem

SITE_RULES = {
    'siteA': {
        'start_urls':    ['https://site-a.com/news'],
        'list_item':     '//div[@class="news-item"]',
        'detail_link':   './/a[@class="title"]/@href',
        'title':         '//h1[@class="post-title"]/text()',
        'author':        '//span[@itemprop="author"]/text()',
        'publish_date':  '//meta[@property="article:published_time"]/@content',
        'content':       '//div[@class="post-content"]//p/text()',
        'tags':          '//div[@class="tag-list"]//a/text()',
        'next_page':     '//a[@rel="next"]/@href',
    },
    'siteB': {
        # ... rules for another site, using the same keys as siteA
    },
}

class MultiSiteSpider(scrapy.Spider):
    name = 'multisite'

    def start_requests(self):
        for site_name, rules in SITE_RULES.items():
            for url in rules['start_urls']:
                yield scrapy.Request(url, callback=self.parse_list,
                                     meta={'site_name': site_name, 'rules': rules})

    def parse_list(self, response):
        rules = response.meta['rules']
        for article in response.xpath(rules['list_item']):
            link = article.xpath(rules['detail_link']).get()
            if link:
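                # Pass only our own keys: reusing response.meta would also copy
                # Scrapy's internal bookkeeping (e.g. redirect_times, download_slot)
                yield response.follow(link, callback=self.parse_detail,
                                      meta={'site_name': response.meta['site_name'], 'rules': rules})
        # Pagination
        next_page = response.xpath(rules['next_page']).get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_list,
                                  meta={'site_name': response.meta['site_name'], 'rules': rules})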

    def parse_detail(self, response):
        rules = response.meta['rules']
        item = NewsItem()
        item['url']    = response.url
        item['source'] = response.meta['site_name']
        item['title']  = response.xpath(rules['title']).get(default='').strip()
        item['author'] = response.xpath(rules['author']).get(default='anonymous').strip()
        item['content'] = '\n'.join(p.strip() for p in response.xpath(rules['content']).getall() if p.strip())
        item['tags']   = response.xpath(rules['tags']).getall()
        item['publish_date'] = response.xpath(rules['publish_date']).get()
        yield item

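Adding a site then only takes a new rule entry, with no change to the spider code. That is the engineering advantage of this layout when a Scrapy project has to cover many sites.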
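
Because the spider indexes the rule dictionary directly (rules['start_urls'], rules['list_item'], and so on), a half-filled entry only fails once the crawl is running. One possible safeguard, sketched here as an assumption rather than anything the article's code includes, is to check every entry before starting. REQUIRED_KEYS and find_incomplete_rules are hypothetical names, and the sample rules are placeholders.

```python
# Keys the spider reads from each site's rule entry (taken from the siteA example).
REQUIRED_KEYS = {
    "start_urls", "list_item", "detail_link", "title", "author",
    "publish_date", "content", "tags", "next_page",
}

def find_incomplete_rules(site_rules):
    """Return {site_name: sorted list of missing keys} for incomplete entries."""
    problems = {}
    for site_name, rules in site_rules.items():
        missing = REQUIRED_KEYS - set(rules)
        if missing:
            problems[site_name] = sorted(missing)
    return problems

# Hypothetical rules: siteA is complete, siteB is still a stub.
example_rules = {
    "siteA": {key: "placeholder" for key in REQUIRED_KEYS},
    "siteB": {"start_urls": ["https://site-b.example/news"]},
}

problems = find_incomplete_rules(example_rules)
for site, missing in problems.items():
    print(f"{site}: missing {', '.join(missing)}")
```

Calling such a check at import time or at the top of start_requests() turns a mid-crawl KeyError into an immediate, readable message.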

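5. Adding proxy IPs: coping with anti-scraping blocks

A large multi-site crawl will sooner or later run into per-IP rate limits on the target sites. Taking the 亿牛云 (16yun) crawler proxy as an example, hooking a proxy into Scrapy only takes one downloader middleware.

Edit middlewares.py (scrapy startproject already generates this file):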

import base64
import random

def base64ify(bytes_or_str):
    """Base64-encode "user:password" for the Proxy-Authorization header."""
    input_bytes = bytes_or_str.encode('utf8') if isinstance(bytes_or_str, str) else bytes_or_str
    # Basic auth uses the standard Base64 alphabet, not the URL-safe variant
    return base64.b64encode(input_bytes).decode('ascii')

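class ProxyMiddleware:
    def process_request(self, request, spider):
        # 亿牛云 (16yun) crawler proxy parameters (site: www.16yun.cn)
        proxyHost = "t.16yun.cn"
        proxyPort = "31111"
        proxyUser = "username"    # replace with your username
        proxyPass = "password"    # replace with your password

        # Route the request through the proxy
        request.meta['proxy'] = f"http://{proxyHost}:{proxyPort}"

        # Add the auth header. On Scrapy 2.6.2+ the credentials can instead go in the
        # proxy URL (http://user:pass@host:port), and HttpProxyMiddleware sets the header.
        request.headers['Proxy-Authorization'] = 'Basic ' + base64ify(f"{proxyUser}:{proxyPass}")

        # Proxy-Tunnel: requests carrying the same number share an exit IP (use a fixed
        # value to keep a login session); a fresh random number per request rotates the IP.
        tunnel = random.randint(1, 10000)
        request.headers['Proxy-Tunnel'] = str(tunnel)

        # Disable connection reuse so every request is forced onto a new IP
        request.headers['Connection'] = "Close"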
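
The two ways of supplying proxy credentials mentioned in the middleware above can be sketched with the standard library alone. This is an illustration under assumptions, not the provider's official sample: basic_proxy_auth and proxy_url_with_credentials are hypothetical helper names, and the credentials are placeholders. HTTP Basic authentication (RFC 7617) expects the standard Base64 alphabet, which differs from the URL-safe one in its last two symbols.

```python
import base64
from urllib.parse import quote

def basic_proxy_auth(user, password):
    """Build the Proxy-Authorization value using standard Base64 (RFC 7617)."""
    raw = f"{user}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")

def proxy_url_with_credentials(host, port, user, password):
    """Embed percent-encoded credentials in the proxy URL instead of a manual header."""
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"

header = basic_proxy_auth("username", "password")
proxy_url = proxy_url_with_credentials("t.16yun.cn", "31111", "username", "p@ss:word")

# The two Base64 alphabets disagree on some inputs, which is enough to
# break authentication for credentials that happen to hit those symbols.
standard = base64.b64encode(b"~~~")
url_safe = base64.urlsafe_b64encode(b"~~~")

print(header)
print(proxy_url)
print(standard, url_safe)
```

Percent-encoding matters for the URL form because characters such as @ and : in a password would otherwise be read as URL delimiters.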

Enable the middleware in settings.py and configure the retry policy:

DOWNLOADER_MIDDLEWARES = {
    'multispider.middlewares.ProxyMiddleware': 100,
}

# Retry automatically on these status codes, including proxy auth failures (407)
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 407, 408, 429]

# Concurrency and throttling
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0.5
DOWNLOAD_TIMEOUT = 15

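Proxy usage notes:

Scenario	Configuration	Notes
New IP per request	Connection: Close + random Tunnel	Most common; suited to bulk crawling
Keep the same IP	Fixed Proxy-Tunnel value	For flows that depend on a login or cookie session
HTTPS sites	Use the library's native proxy authentication	Keeps a hand-set Proxy-Authorization from being forwarded to the target site
407 errors	Check host, port, username, password	Authentication details are wrong
429 errors	Lower concurrency or add delay	Request rate exceeds the plan's limit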
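
The first two rows of the table can share one helper. The sketch below is an assumption about how one might organize it, not part of the provider's API: tunnel_for is a hypothetical function, and it relies on the behavior described above, namely that requests carrying the same Proxy-Tunnel number leave through the same exit IP.

```python
import random
import zlib

TUNNEL_RANGE = 10000  # same 1..10000 range the middleware draws from

def tunnel_for(session_key=None):
    """Pick a Proxy-Tunnel value.

    session_key=None  -> fresh random value (rotate the IP on each request)
    session_key="..." -> deterministic value, so one logical session keeps
                         sending the same tunnel number
    """
    if session_key is None:
        return str(random.randint(1, TUNNEL_RANGE))
    # crc32 is stable across runs and processes, unlike the built-in hash()
    return str(zlib.crc32(session_key.encode("utf-8")) % TUNNEL_RANGE + 1)

sticky_a = tunnel_for("login-user-42")
sticky_b = tunnel_for("login-user-42")
rotating = tunnel_for()
print(sticky_a, sticky_b, rotating)
```

In the middleware this could be used as request.headers['Proxy-Tunnel'] = tunnel_for(request.meta.get('session')), where the 'session' meta key is likewise an invented convention; in sticky mode the Connection: Close header would normally be left out.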

6. Data-cleaning pipeline

Edit pipelines.py so that the cleaning logic stays separate from the spider logic:

import re
import json
from datetime import datetime
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class DataCleaningPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

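        # Strip control characters and surrounding whitespace
        for field in ['title', 'author', 'content']:
            val = adapter.get(field, '')
            if val:
                # Turn non-breaking spaces into plain spaces, then drop control
                # characters except \n, which separates paragraphs in content
                val = val.replace('\u00a0', ' ')
                val = re.sub(r'[\x00-\x09\x0b-\x1f\x7f-\x9f]', '', val).strip()
                adapter[field] = val if val else None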

        # De-duplicate tags case-insensitively, keeping first-seen order (max 10)
        tags = adapter.get('tags') or []
        seen, cleaned = set(), []
        for tag in (t.strip() for t in tags if t.strip()):
            key = tag.lower()
            if key not in seen:
                seen.add(key)
                cleaned.append(tag)
        adapter['tags'] = cleaned[:10]

        # Required-field check
        if not adapter.get('title'):
            raise DropItem("Missing title")
        return item

class JsonExportPipeline:
    def open_spider(self, spider):
        self.file = open('output.jsonl', 'w', encoding='utf-8')
    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item
    def close_spider(self, spider):
        self.file.close()

# settings.py
ITEM_PIPELINES = {
    'multispider.pipelines.DataCleaningPipeline': 100,
    'multispider.pipelines.JsonExportPipeline': 200,
}
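
The pipeline above imports datetime but leaves publish_date as the raw attribute string. One way the field could be normalized is sketched below; normalize_publish_date is a hypothetical helper, and treating timestamps without an offset as UTC is an assumption that should be adjusted per site.

```python
from datetime import datetime, timezone

def normalize_publish_date(raw):
    """Return an ISO 8601 UTC timestamp string, or None if raw cannot be parsed."""
    if not raw:
        return None
    text = raw.strip()
    # datetime.fromisoformat() only accepts a trailing 'Z' on Python 3.11+,
    # so rewrite it as an explicit UTC offset first.
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()

for sample in ["2024-05-01T08:30:00+08:00", "2024-05-01T08:30:00Z", "2024-05-01", "yesterday"]:
    print(sample, "->", normalize_publish_date(sample))
```

Inside DataCleaningPipeline.process_item this would be a single line, adapter['publish_date'] = normalize_publish_date(adapter.get('publish_date')), placed before the required-field check.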

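7. Debugging and running

# Check XPath in the Scrapy shell (always do this before writing spider code)
scrapy shell 'https://example.com/news'
>>> response.xpath('//h1[@class="article-title"]/text()').get()
'Python 3.12 new features explained'

# Run the spider (-o appends to an existing file; use -O to overwrite it)
scrapy crawl technews -o results.json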

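8. XPath pitfalls

Pitfall	Wrong	Right
Global search grabs the wrong node	article.xpath('//h2/text()')	article.xpath('.//h2/text()')
Multi-class mismatch	@class="item active"	contains(@class, "active")
Unhandled missing value or whitespace	using .get() directly	.get(default='').strip()
Garbled or escaped non-ASCII output	default encoding	FEED_EXPORT_ENCODING='utf-8'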
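
One caveat on the multi-class row: contains(@class, "active") is a substring test, so it also accepts class="inactive". A common XPath 1.0 idiom pads the attribute with spaces and matches a whole token instead. The sketch below builds that predicate as a string and mirrors both behaviors in plain Python; has_class, substring_match, and token_match are hypothetical helper names.

```python
def has_class(name):
    """XPath 1.0 predicate matching `name` as a whole class token."""
    return f'contains(concat(" ", normalize-space(@class), " "), " {name} ")'

def substring_match(class_attr, name):
    """What contains(@class, name) tests: plain substring search."""
    return name in class_attr

def token_match(class_attr, name):
    """What the has_class() predicate tests: whole-token membership."""
    return name in class_attr.split()

# Ready to drop into a rule such as '//p[not(...)]'
ad_free_paragraphs = f'//p[not({has_class("ad")})]'
print(ad_free_paragraphs)

for attr in ["item active", "inactive", "lead", "ad"]:
    print(f"{attr!r:15} substring(ad)={substring_match(attr, 'ad')} token(ad)={token_match(attr, 'ad')}")
```

The trade-off is that a token match for "ad" no longer catches names like "ad-banner"; if those should be excluded too, list them explicitly or keep the substring test and accept the false positives.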

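9. Summary

The engineering value of Scrapy + XPath shows up at three levels:

Parsing: XPath queries the document tree directly, so deep nesting, multi-condition filters, and axis traversal are built in rather than hand-coded as they often are with BeautifulSoup
Architecture: the asynchronous engine, middlewares, and pipelines support large crawls across many sites. With rule configuration decoupled from spider logic, the marginal cost of a new site is close to zero
Anti-blocking: a proxy middleware (for example the 亿牛云 (16yun) crawler proxy) plugs an IP pool in transparently, the Proxy-Tunnel header controls when the exit IP changes, and retrying on 407 helps stability

In real projects: verify XPath expressions in scrapy shell before writing code, keep all cleaning logic in pipelines, and choose rotating or fixed IP mode in the proxy middleware according to the use case. Getting these three right noticeably improves a crawler's maintainability and stability.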

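Original-content notice: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reposted without permission.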
