因近期小组的一个项目有文本挖掘的需求,需要用到Word2Vec的文本特征抽取,为了进行技术预演需要我们提前对模型进行训练。而只要涉及数据挖掘相关的模型,数据集是不必可少的。中文文本挖掘领域,百科词条涵盖面广,而且内容比较丰富,于是便选择百科的词条作为数据集 (http://baike.com)。
2.1 抓取方案
step1:
收集百科词条种子(后台的id列表)
step2:
获取详情页并解析html中的词条正文
step3:
数据保存(以文本txt保存或者存库)
a)如何获取加载列表的js请求地址和请求参数格式?
打开Chrome浏览器之后,键盘上按“F12”进入调试界面
b)如从词条详情页获取正文的css样式 ?
2.2 代码实现
step1:收集词条id列表并保存到redis
1 def fetch_seeds():
2 print "-- fetch seeds --"
3 cnt = 0
4 for def_index in range(4, 10):
5 ret = do_run(index=def_index)
6 cnt += ret
7 print("cnt = %d" % cnt)
8
9 def do_run(index, page_num=100):
10 artical_list = []
11 for pn in range(1, page_num + 1):
12 try:
13 url = 'http://api.hudong.com/flushjiemi.do?flag=2&topic=%d&page=%d&type=2' % (index, pn)
14 retText = fetch(url)
15 print("ret = %s" % retText)
16 ret_json = json.loads(retText, encoding='utf-8')
17 result = ret_json["result"]
18 if len(result) > 0:
19 for ob in result:
20 # artical_list.append(ob["article_topic_name"])
21 # artical_list.append("%s%s%s" % (ob["article_topic_name"], "-", ob["article_id"]))
22 artical_list.append(ob["article_id"])
23 save2redis(index, artical_list)
24 # sleep
25 if pn % 5 == 0:
26 print 'pn=%d, sleeping...' % pn
27 time.sleep(1)
28 except:
29 print "http get or parse error!"
30
31 return 1
32
33 def save2redis(index, article_list):
34 r = redis.Redis(host=redis_db_host, port=redis_db_port, db=redis_db_index)
35 for article in article_list:
36 r.sadd("%s-%d" % ("news.set", index), article)
step2:抓取词条详情并保存到redis
1 def fetch_detail():
2 print "-- fetch detail --"
3 r = redis.Redis(host=redis_db_host, port=redis_db_port, db=redis_db_index)
4 cnt = 0
5 for news_index in range(4, 10):
6 seeds = r.smembers("%s-%s" % ("news.set", news_index))
7 if len(seeds) > 0:
8 for seed in seeds:
9 try:
10 ret = crawl(seed)
11 cnt += 1
12 if cnt % 10 == 0:
13 time.sleep(2)
14 print 'cnt=%d, sleeping...' % cnt
15 # save to redis
16 save_detail(seed, result=ret)
17 # break # unit test
18 except:
19 print "fetch detail error!!!"
20 pass
21
22 def crawl(page_no):
23 url = 'http://jiemi.baike.com/pa/detail?id=%s&type=1' % page_no
24 print "url=", url
25 content = fetch(url)
26 soup = BeautifulSoup(content, "html.parser")
27 return fetch_with_class(soup, class_type="jiemi-content")
28
29 def save_detail(seed, result=""):
30 r = redis.Redis(host=redis_db_host, port=redis_db_port, db=redis_db_index_2)
31 r.set("id_%s" % seed, result)
32 return 1
1)环境说明 python2.7, redis4.x
2)github项目完整源码 https://github.com/SeaSky0606/baike-crawler
3)为维护网络和谐,词条数据仅适用于研究与学习,请勿恶意抓取。
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
我的博客即将同步至腾讯云开发者社区,邀请大家一同入驻:https://cloud.tencent.com/developer/support-plan?invite_code=3jkio1or10cgg