An Idea-Driven Solution: From Information Bottleneck to Engineering Implementation
Take CCTV News (央视新闻), China News Network (中国新闻网), and Huanqiu (环球网) as examples: together, these three sources cover most of the core domestic and international current events.
A recurring pain point in the collection process is re-fetching pages whose content has not changed. Borrowing the "change notification" mechanism used in financial systems, we can design a multi-source incremental collection engine for news: an article is processed only when it is new or its content has changed. This avoids repeated crawling and substantially saves bandwidth and compute resources.
The following sample code demonstrates a simplified multi-site incremental collection flow, using Python as the example:
import requests
from bs4 import BeautifulSoup
import hashlib
import time
from urllib.parse import urljoin
# ========== Proxy configuration (example: 亿牛云 crawler proxy) ==========
proxy_host = "proxy.16yun.cn"
proxy_port = "10000"
proxy_user = "16YUN"
proxy_pass = "16IP"
proxies = {
"http": f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}",
"https": f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}"
}
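# NOTE: proxy.16yun.cn and the 16YUN/16IP credentials above are placeholder
# example values from the provider; replace them with your own proxy account
# before running.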
# ========== Incremental logic ==========
# Maps each article URL to the content hash seen on the last successful fetch.
visited = {}
def get_hash(text: str) -> str:
    # Hash the article body so content changes can be detected cheaply.
    return hashlib.md5(text.encode("utf-8")).hexdigest()
def fetch_list(url: str, selector: str, attr="href"):
    # Pull article links from a site's list page via a CSS selector.
r = requests.get(url, proxies=proxies, timeout=10)
r.encoding = "utf-8"
soup = BeautifulSoup(r.text, "html.parser")
return [a[attr] for a in soup.select(selector) if a.get(attr)]
def fetch_detail(url: str, title_sel: str, content_sel: str):
    # Fetch one article page and extract its title and body text.
r = requests.get(url, proxies=proxies, timeout=10)
r.encoding = "utf-8"
soup = BeautifulSoup(r.text, "html.parser")
title = soup.select_one(title_sel).text.strip()
content = "\n".join([p.text.strip() for p in soup.select(content_sel)])
return {"title": title, "content": content}
def update(site: str, url: str, title_sel: str, content_sel: str):
    # Report an article as new or updated based on its content hash.
try:
data = fetch_detail(url, title_sel, content_sel)
content_hash = get_hash(data["content"])
if url not in visited:
            print(f"[{site}] new: {data['title']}")
visited[url] = content_hash
elif visited[url] != content_hash:
            print(f"[{site}] updated: {data['title']}")
visited[url] = content_hash
except Exception as e:
        print(f"[{site}] failed: {url} -> {e}")
# ========== Site configuration ==========
sites = {
"央视新闻": {
"list_url": "https://news.cctv.com/roll/index.shtml",
"list_selector": "div.roll_yc a",
"title_selector": "h1",
"content_selector": "div.content_area p"
},
"中国新闻网": {
"list_url": "https://www.chinanews.com.cn/scroll-news/news1.html",
"list_selector": "div.news_list a",
"title_selector": "h1",
"content_selector": "div.left_zw p"
},
"环球网": {
"list_url": "https://www.huanqiu.com/channel/23",
"list_selector": "a.item",
"title_selector": "h1",
"content_selector": "div.b-container p"
}
}
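# NOTE: the CSS selectors above reflect each site's page structure at the time
# of writing; they will need updating whenever a site redesigns its templates.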
# ========== Main loop ==========
if __name__ == "__main__":
while True:
for site, conf in sites.items():
            print(f"\n=== Crawling {site} ===")
            try:
                urls = fetch_list(conf["list_url"], conf["list_selector"])
            except Exception as e:
                print(f"[{site}] list fetch failed: {e}")
                continue
            for link in urls:
                # Resolve relative and protocol-relative URLs against the list page.
                link = urljoin(conf["list_url"], link)
                update(site, link, conf["title_selector"], conf["content_selector"])
        time.sleep(120)  # poll every two minutes
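One caveat with the code above: the visited map lives only in memory, so a restart would re-report every article as new. A minimal sketch for persisting the map between runs, assuming a local JSON file (the visited_state.json path is hypothetical):

import json
import os

STATE_FILE = "visited_state.json"  # hypothetical path; adjust as needed

def load_state() -> dict:
    # Restore the url -> content-hash map from disk, if present.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, "r", encoding="utf-8") as f:
            return json.load(f)
    return {}

def save_state(state: dict) -> None:
    # Persist the map after each round so restarts stay incremental.
    with open(STATE_FILE, "w", encoding="utf-8") as f:
        json.dump(state, f, ensure_ascii=False)

Calling load_state() once at startup and save_state(visited) after each polling round keeps increment detection stable across restarts.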
Tested against the three news sources above, the unified multi-site incremental mechanism proved more efficient for news crawling: each polling round reports only genuinely new or changed articles rather than re-emitting the full list.
From technical implementation to application value, this approach translates directly into an industry-grade public-opinion radar solution.
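As one illustration of that direction, the printed events in update() could instead be pushed to a downstream alerting service. A minimal sketch, assuming a hypothetical webhook endpoint at http://localhost:8000/alerts (a placeholder, not a real API):

def notify(site: str, event: str, item: dict) -> None:
    # Forward a detected change to a downstream consumer; swallow network
    # errors so one failed notification does not stop the crawl loop.
    payload = {"site": site, "event": event, "title": item["title"]}
    try:
        requests.post("http://localhost:8000/alerts", json=payload, timeout=5)
    except requests.RequestException as e:
        print(f"notify failed: {e}")

Replacing the print calls in update() with notify(site, "new", data) and notify(site, "updated", data) would turn the script into the ingestion edge of such a radar pipeline.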
Original statement: this article is published on the Tencent Cloud Developer Community with the author's authorization; reproduction without permission is prohibited.
In case of infringement, please contact cloudcommunity@tencent.com for removal.