During job-hunting season, have you ever stared blankly at hundreds of job postings? HR teams screen hundreds of resumes a day, and candidates have to find their direction in a sea of openings. If you can use technology to quickly extract the core requirements of a role, both job seekers polishing their resumes and companies analyzing the talent market get twice the result for half the effort.

This article walks you step by step through scraping job requirements from a certain recruitment site (某联招聘) with Python and visualizing the high-frequency keywords as a word cloud. The process has three stages: data collection (crawling), data processing (cleaning), and data visualization (the word cloud). Even without a programming background, you can follow along.
pip install requests beautifulsoup4 pandas wordcloud jieba matplotlib
What each library does:

- requests: send HTTP requests
- BeautifulSoup: parse HTML
- pandas: data processing
- wordcloud: generate the word cloud
- jieba: Chinese word segmentation
- matplotlib: plotting and display

Open the recruitment site (https://www.***.com) and, as our running example, search for "Python开发" (Python development):
Observing the results URL, https://sou.***.com/?kw=Python开发&page=1, the page parameter increments by one per page, and each job card sits under the CSS selector .job-list ul li. With that, the list-page crawler looks like this:

import requests
from bs4 import BeautifulSoup

def get_job_list(keyword, page):
    url = f"https://sou.***.com/?kw={keyword}&page={page}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            jobs = []
            for li in soup.select('.job-list ul li'):
                title = li.select_one('.job-title').text.strip()
                company = li.select_one('.company-name').text.strip()
                # "面议" means the salary is negotiable / not listed
                salary = li.select_one('.salary').text.strip() if li.select_one('.salary') else "面议"
                requirement = li.select_one('.job-requirements').text.strip() if li.select_one('.job-requirements') else ""
                jobs.append({
                    'title': title,
                    'company': company,
                    'salary': salary,
                    'requirement': requirement
                })
            return jobs
        else:
            print(f"Request failed, status code: {response.status_code}")
            return []
    except Exception as e:
        print(f"Error: {e}")
        return []
Problem 1: direct requests return 403. The site inspects request headers; sending a browser-like User-Agent (and, as in the complete script later, a Referer) usually gets past this.

Problem 2: frequent requests get your IP banned. Slow down between requests, or rotate proxy IPs (see the FAQ at the end).
Advanced technique: add a random delay between requests with time.sleep(random.uniform(1, 3)) so the traffic pattern looks less mechanical. Building on get_job_list, a multi-page crawler:

import time
import random

def crawl_all_pages(keyword, max_pages=5):
    all_jobs = []
    for page in range(1, max_pages + 1):
        print(f"Crawling page {page}...")
        jobs = get_job_list(keyword, page)
        if not jobs:   # an empty page means no more results (or we were blocked)
            break
        all_jobs.extend(jobs)
        time.sleep(2)  # polite delay between pages
    return all_jobs

# Run the crawl
jobs_data = crawl_all_pages("Python开发", 3)
import pandas as pd

df = pd.DataFrame(jobs_data)
# Drop rows with an empty requirement field
df = df.dropna(subset=['requirement'])
# Save the raw data (optional); utf_8_sig keeps Chinese text readable in Excel
df.to_csv('zhaopin_raw.csv', index=False, encoding='utf_8_sig')
On inspection, the raw requirement text mixes list numbering, punctuation, and boilerplate in with the words we care about, so we first clean it into a clean_require column and then join everything into one long string for segmentation.
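A minimal cleaning sketch (the exact regex rules here are assumptions; tune them against the text you actually scrape):

import re

def extract_req(text):
    # Drop list numbering like "1." or "2、", then collapse punctuation to spaces.
    text = re.sub(r'\d+[\.、]', ' ', text)
    text = re.sub(r'[^\w\u4e00-\u9fa5]+', ' ', text)
    return text.strip()

df['clean_require'] = df['requirement'].apply(extract_req)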

all_requirements = ' '.join(df['clean_require'].tolist())
import jieba

# Load a custom dictionary (optional) so multi-character tech terms stay whole
jieba.load_userdict("tech_terms.txt")  # prepare your own list of technical terms

# Segment the text and filter out stopwords
stopwords = set()
with open('stopwords.txt', 'r', encoding='utf-8') as f:  # one stopword per line
    for line in f:
        stopwords.add(line.strip())

words = [word for word in jieba.cut(all_requirements)
         if len(word) > 1 and word not in stopwords and not word.isspace()]
word_str = ' '.join(words)
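Before generating the cloud, it helps to sanity-check the top tokens with collections.Counter (mentioned again in the tips further down):

from collections import Counter

# Peek at the 20 most frequent tokens to spot leftover noise words before plotting
for word, count in Counter(words).most_common(20):
    print(f"{word}: {count}")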
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Configure the word cloud
wc = WordCloud(
    font_path='simhei.ttf',   # path to a Chinese font file, required for CJK text
    background_color='white',
    width=800,
    height=600,
    max_words=100,
    max_font_size=100
)

# Generate the word cloud
wc.generate(word_str)

# Display and save it
plt.figure(figsize=(10, 8))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.savefig('zhaopin_wordcloud.png', dpi=300, bbox_inches='tight')
plt.show()
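As a variation, WordCloud's colormap parameter accepts any matplotlib color scheme name (the choice of 'viridis' here is arbitrary):

# Same data, different matplotlib color scheme via the colormap parameter
wc_colored = WordCloud(
    font_path='simhei.ttf',
    background_color='white',
    colormap='viridis'   # any matplotlib colormap name works here
).generate(word_str)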

Two finishing tips: the colormap parameter selects a matplotlib color scheme (see the variation above), and collections.Counter lets you inspect the high-frequency words (as in the sketch earlier). Finally, the complete script, all three stages in one place:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import time
import random
# 1. Data collection
def get_job_list(keyword, page):
    url = f"https://sou.***.com/?kw={keyword}&page={page}"
    headers = {
        "User-Agent": "Mozilla/5.0...",
        "Referer": "https://www.***.com/"
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            jobs = []
            for li in soup.select('.job-list ul li'):
                pass  # parsing logic as above...
            return jobs
        return []
    except Exception:
        return []
# 2. Data processing
def process_data(jobs_data):
    df = pd.DataFrame(jobs_data)
    df = df.dropna(subset=['requirement'])

    def extract_req(text):
        # extraction logic as above...
        return text  # placeholder so the pipeline still runs end to end

    df['clean_require'] = df['requirement'].apply(extract_req)
    all_text = ' '.join(df['clean_require'].tolist())
    return all_text
# 3. Generate the word cloud
def generate_wordcloud(text):
    stopwords = set(['可以', '能够', '具有', '等'])  # sample stopwords; use a full list in practice
    words = [word for word in jieba.cut(text)
             if len(word) > 1 and word not in stopwords]
    wc = WordCloud(
        font_path='simhei.ttf',
        width=800,
        height=600
    ).generate(' '.join(words))
    plt.figure(figsize=(10, 8))
    plt.imshow(wc)
    plt.axis('off')
    plt.savefig('output.png')
    plt.show()
# Main flow
if __name__ == "__main__":
    keyword = "Python开发"
    jobs = []
    for page in range(1, 4):
        jobs.extend(get_job_list(keyword, page))
        time.sleep(2)
    text = process_data(jobs)
    generate_wordcloud(text)
Q1: What if the site bans my IP?
A: Switch to a backup proxy pool. Residential proxies (e.g., 站大爷 IP proxy) combined with a change-IP-per-request strategy work well; the more reliable option is a paid proxy service, since those proxies usually offer better anonymity and stability. A sketch of per-request rotation follows.
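A minimal sketch of per-request proxy rotation with requests (the proxy addresses are placeholders; substitute your provider's real endpoints):

import random
import requests

# Placeholder proxy endpoints -- replace with ones from your provider
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

def get_with_proxy(url, headers):
    proxy = random.choice(PROXY_POOL)          # pick a new IP for each request
    proxies = {"http": proxy, "https": proxy}
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)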
Q2: How do I get more accurate job descriptions?
A: The site's HTML structure can change at any time. Inspect the live page in your browser's developer tools and update the CSS selectors accordingly (e.g., .job-description > p).

Q3: How do I filter irrelevant words out of the word cloud?
A: Three steps:
check the high-frequency words (collections.Counter, as above), add the noise words to your stopword list, and use the min_font_size parameter to filter out low-frequency words.

Q4: How do I scrape other job sites?
A: The core workflow is the same; just note that the URL pattern, page structure (CSS selectors), and anti-scraping measures differ from site to site.
Q5: What if the code throws errors?
A: Debug in this order: confirm the dependencies are installed (the pip command at the top), check the HTTP status code of the response, verify the CSS selectors still match the page, and make sure the file paths (font, stopwords) exist.
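A small diagnostic sketch along those lines (it assumes the url, headers, and selector used by the crawler above):

import requests
from bs4 import BeautifulSoup

response = requests.get(url, headers=headers, timeout=10)
print("status code:", response.status_code)   # expect 200; 403 usually means you are blocked
soup = BeautifulSoup(response.text, 'html.parser')
print("matched job cards:", len(soup.select('.job-list ul li')))  # 0 means the selector is stale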
Going further, you can use the schedule library to run the crawl automatically every day (a minimal sketch follows the closing paragraph).

Through this project you have picked up not just the basics of web crawling and data analysis, but, more importantly, a complete chain of thinking from data collection to value extraction. The next time you face a flood of job postings, you can use code to distill the key information quickly and put the technology to work on a real need.
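The promised daily-automation sketch with schedule, reusing the functions from the complete script (the run time is an arbitrary choice):

import schedule
import time

def daily_job():
    jobs = []
    for page in range(1, 4):
        jobs.extend(get_job_list("Python开发", page))
        time.sleep(2)
    generate_wordcloud(process_data(jobs))

schedule.every().day.at("09:00").do(daily_job)  # run once a day at 09:00

while True:
    schedule.run_pending()
    time.sleep(60)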
Original statement: this article is published on the Tencent Cloud Developer Community with the author's authorization; reproduction without permission is prohibited.
For infringement concerns, contact cloudcommunity@tencent.com for removal.