Using AI to Batch-Download Pages from Sam Altman's Personal Blog

AIGC部落

Published 2025-01-15 22:06:45


This article is included in the column: Dance with GenAI

Sam Altman's personal blog: https://blog.samaltman.com/

The pagination pattern is a page query parameter, for example: https://blog.samaltman.com/?page=12
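As a small illustration of that pattern, the listing URLs can be generated up front. This is only a sketch: the function name `listing_urls` and the constant `BLOG_ROOT` are hypothetical, and the range 1 to 12 is taken from the prompt further down.

```python
from urllib.parse import urlencode

# Root of the blog, as given in the article.
BLOG_ROOT = "https://blog.samaltman.com/"


def listing_urls(first_page=1, last_page=12):
    """Return the listing-page URLs for first_page..last_page, inclusive."""
    return [
        f"{BLOG_ROOT}?{urlencode({'page': n})}"
        for n in range(first_page, last_page + 1)
    ]
```

Calling `listing_urls()` yields twelve URLs, from `?page=1` through `?page=12`.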

Enter the following prompt into DeepSeek:

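You are a Python programming expert. Write a script for a web-scraping task with the following steps:

Open the web page https://blog.samaltman.com/?page={pagenumber}, where the parameter {pagenumber} runs from 1 to 12;

Locate every article element with class="post" on the page, then locate the a element inside it and extract its text content as the page title;

Extract its href attribute value as the page URL;

Download the page and save it to the folder F:\Sam Altman as an HTML file, using the title extracted in the previous step as the file name; all images in the page must be saved as they appear in the original;

Notes: Print progress information to the screen at every step. File names may contain illegal characters (for example, the question mark ?), which prevents the file from being saved. The Windows file system does not allow certain special characters in file names, such as <, >, :, ", /, \, |, ?, *. To solve this, clean the file name before saving by removing or replacing these characters; a regular expression can do this.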
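The file-name concern at the end of the prompt can be sketched on its own. The version below is an assumption, not DeepSeek's output: it goes somewhat further than replacing the nine listed characters, also handling control characters, trailing dots and spaces, and reserved device names such as CON. The name `safe_filename` and the 150-character limit are hypothetical choices.

```python
import re

# Characters Windows rejects in file names, plus ASCII control characters.
_ILLEGAL = re.compile(r'[<>:"/\\|?*\x00-\x1f]')

# Device names that Windows reserves, regardless of extension.
_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def safe_filename(title, replacement="_", max_length=150, fallback="untitled"):
    """Turn an arbitrary post title into a name that Windows will accept."""
    name = _ILLEGAL.sub(replacement, title)
    # Windows drops trailing dots and spaces, so remove them explicitly.
    name = name.strip().rstrip(". ")
    # Keep the name well below the 255-character limit for a path component.
    name = name[:max_length].rstrip(". ")
    if not name:
        return fallback
    # Prefix reserved device names so they become ordinary file names.
    if name.split(".")[0].upper() in _RESERVED:
        name = replacement + name
    return name
```

For example, `safe_filename("What I Wish Someone Had Told Me?")` returns `What I Wish Someone Had Told Me_`.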

DeepSeek returned the following source code:

import os
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Base URL template and page range (pages 1 to 12)
base_url = "https://blog.samaltman.com/?page={}"
pages = range(1, 13)

# Folder where pages and images are saved
save_folder = r"F:\Sam Altman"

# Make sure the save folder exists
os.makedirs(save_folder, exist_ok=True)

# Clean a file name so that Windows accepts it
def clean_filename(filename):
    # Replace illegal characters with underscores
    cleaned_filename = re.sub(r'[<>:"/\\|?*]', '_', filename)
    return cleaned_filename

# Iterate over every listing page
for page_number in pages:
    print(f"Processing page {page_number}")
    url = base_url.format(page_number)

    # Request the listing page
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to retrieve page {page_number}")
        continue

    # Parse the HTML
    soup = BeautifulSoup(response.text, 'html.parser')

    # Locate all article elements with class="post"
    posts = soup.find_all('article', class_='post')

    # Iterate over every post
    for post in posts:
        # Locate the first a element that has an href; its text is the title
        title_element = post.find('a', href=True)
        if title_element:
            title = title_element.text.strip()
            post_url = urljoin(url, title_element['href'])
            print(f"Found post: {title} with URL: {post_url}")

            # Clean the file name
            cleaned_title = clean_filename(title)

            # Download the post page
            post_response = requests.get(post_url)
            if post_response.status_code != 200:
                print(f"Failed to retrieve post: {title}")
                continue

            # Save the post page to a file
            post_html_path = os.path.join(save_folder, f"{cleaned_title}.html")
            with open(post_html_path, 'w', encoding='utf-8') as file:
                file.write(post_response.text)
            print(f"Saved post: {title} to {post_html_path}")

            # Parse the post page
            post_soup = BeautifulSoup(post_response.text, 'html.parser')

            # Locate all img elements that have a src attribute
            images = post_soup.find_all('img', src=True)

            # Download every image
            for img in images:
                img_url = urljoin(post_url, img['src'])
                img_response = requests.get(img_url)
                if img_response.status_code != 200:
                    print(f"Failed to retrieve image: {img_url}")
                    continue

                # Save the image; clean the name because the URL may contain "?"
                img_name = clean_filename(os.path.basename(img_url))
                img_path = os.path.join(save_folder, img_name)
                with open(img_path, 'wb') as img_file:
                    img_file.write(img_response.content)
                print(f"Saved image: {img_name} to {img_path}")

print("All pages processed.")

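Script description:

Page traversal: the script iterates over page numbers 1 to 12, builds the URL for each page, and sends a request.

Post extraction: for each page, the script locates every article element with class="post" and takes the text of the a element inside it as the title and its href attribute as the post URL.

Post download: the script downloads the HTML of each post and saves it to the specified folder, using the extracted title as the file name.

Image download: for each post page, the script locates every img element, downloads the image, and saves it to the same folder.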
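Because every image lands in the same folder, two posts that each contain, say, an image.png would overwrite one another, and an image URL with a query string does not map directly to a valid file name. The sketch below shows one possible way to derive a local name; it is an assumption rather than part of DeepSeek's output, and the name `local_image_name` and the 8-character hash prefix are hypothetical choices. It does not rewrite the src attributes in the saved HTML, which would still point at the original image URLs.

```python
import hashlib
import posixpath
import re
from urllib.parse import unquote, urlsplit


def local_image_name(img_url):
    """Derive a local file name from an image URL, ignoring query and fragment."""
    # Use only the path part of the URL, with percent-escapes decoded.
    path = unquote(urlsplit(img_url).path)
    base = posixpath.basename(path) or "image"
    # Replace characters that Windows does not allow in file names.
    base = re.sub(r'[<>:"/\\|?*]', "_", base)
    # A short hash of the full URL keeps same-named images from colliding.
    digest = hashlib.sha1(img_url.encode("utf-8")).hexdigest()[:8]
    return f"{digest}_{base}"
```

In the script, the result could stand in for the value assigned to `img_name` before the image path is built.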

Notes:

Make sure the requests and beautifulsoup4 libraries are installed. If they are not, install them with the following command:

pip install requests beautifulsoup4

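Because network requests can fail, the script checks the HTTP status code of every response and skips any page, post, or image that does not return 200, so a failed response does not stop the whole crawl. It does not catch exceptions raised by requests itself, such as connection errors.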
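One way to keep the crawl going when a request raises an exception, rather than returning a non-200 status, is to wrap the call in a small retry helper. The sketch below is an assumption and not part of DeepSeek's output: `fetch_with_retries` is a hypothetical name, and it takes the fetching function as a parameter so that it can be tried without network access.

```python
import time


def fetch_with_retries(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times; return its result, or None.

    `fetch` is any callable that raises an exception on failure.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # deliberately broad for this sketch
            print(f"Attempt {attempt}/{retries} failed for {url}: {exc}")
            if attempt < retries:
                # Wait a moment before trying again.
                time.sleep(delay)
    return None
```

With requests, the call might look like `fetch_with_retries(lambda u: requests.get(u, timeout=10), url)`; a result of `None` would then be handled the same way as a non-200 status.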

File names and paths are built with the os.path module so that the paths are assembled correctly.

This script should meet your requirements, and it prints progress information to the screen at every step.

This article is part of the Tencent Cloud self-media syndication program and is shared from the Dance with GenAI WeChat official account.

Originally published: 2025-01-14
