
Learning Python Web Scraping: Crawling Baidu Tieba Content

python学习教程
Published 2019-12-25 16:51:38
This article is included in the column: python学习教程

Scrape the content of each floor (reply) in a Baidu Tieba thread.
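
As a minimal sketch of how a thread page is addressed, the snippet below builds the request URL from a thread ID, the `see_lz` flag (1 to show only the original poster's floors) and the `pn` page number, the same two query parameters the script relies on. The function name `build_thread_url` and the thread ID `123456` are hypothetical, and the code assumes Python 3.

```python
from urllib.parse import urlencode

BASE_URL = "http://tieba.baidu.com/p/"


def build_thread_url(thread_id, see_lz=0, page=1):
    """Return the URL of one page of a thread (hypothetical helper)."""
    # urlencode keeps the insertion order of the dict: see_lz first, then pn.
    query = urlencode({"see_lz": see_lz, "pn": page})
    return BASE_URL + str(thread_id) + "?" + query


# "123456" is a made-up thread ID used only for illustration.
print(build_thread_url("123456", see_lz=1, page=2))
# → http://tieba.baidu.com/p/123456?see_lz=1&pn=2
```

Letting `urlencode` assemble the query string avoids hand-placing `?` and `&`, which is easy to get wrong once more parameters are added.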

Example source code

# Requires Python 3 with the beautifulsoup4 and html5lib packages installed.

import urllib.error
import urllib.request

from bs4 import BeautifulSoup


class BDTB:

    def __init__(self, baseurl, seeLZ, floorTag):
        self.baseurl = baseurl
        # see_lz=1 restricts the thread to the original poster's floors
        self.seeLZ = '?see_lz=' + str(seeLZ)
        self.file = None
        self.floor = 1
        self.floorTag = floorTag
        self.defaultTitle = "Baidu Tieba"

    def getpage(self, pagenum):
        """Fetch and parse one page of the thread; return None on failure."""
        try:
            url = self.baseurl + self.seeLZ + '&pn=' + str(pagenum)
            request = urllib.request.Request(url)
            response = urllib.request.urlopen(request)
            return BeautifulSoup(response, "html5lib")
        except urllib.error.URLError as e:
            print("Failed to connect to Baidu Tieba, reason:",
                  getattr(e, 'reason', e))
            return None

    def getTitle(self):
        """Return the thread title taken from the first <h3>, or None."""
        page = self.getpage(1)
        if page is None or page.h3 is None:
            return None
        title = page.h3.get('title')
        print(title)
        return title

    def getPageNum(self):
        """Return the number of pages in the thread, or None if unknown."""
        page = self.getpage(1)
        if page is None:
            return None
        # The second element with class "red" holds the page count.
        num = page.find_all(attrs={"class": "red"})
        if len(num) < 2:
            return None
        return int(num[1].string)

    def getcontent(self, pagenum):
        """Collect the text of every floor (<cc> element) on all pages."""
        contents = []
        for num in range(1, pagenum + 1):
            page = self.getpage(num)
            if page is None:
                continue
            for item in page.find_all('cc'):
                contents.append(item.get_text())
        return contents

    def getFileTitle(self):
        """Open the output file, named after the thread title if available."""
        title = self.getTitle() or self.defaultTitle
        self.file = open(title + ".txt", "w", encoding="utf-8")

    def writeData(self, pagenum):
        for item in self.getcontent(pagenum):
            if self.floorTag == '1':
                # Separator line carrying the floor number
                floorLine = '\n' + str(self.floor) + \
                    '---------------------------------------------\n'
                self.file.write(floorLine)
            self.file.write(item)
            self.floor += 1

    def start(self):
        # Check the page count first so no file is created for a dead URL.
        pagenum = self.getPageNum()
        if pagenum is None:
            print("The URL is no longer valid, please try again")
            return
        self.getFileTitle()
        try:
            print("This thread has " + str(pagenum) + " page(s)")
            self.writeData(pagenum)
        except IOError as e:
            print("Write failed, reason: " + str(e))
        else:
            print("Write succeeded")
        finally:
            self.file.close()


if __name__ == '__main__':
    print("Enter the thread ID")
    baseurl = 'http://tieba.baidu.com/p/' + input('http://tieba.baidu.com/p/')
    seeLZ = input("Fetch only the original poster's floors? "
                  "Enter 1 for yes, 0 for no\n")
    floorTag = input("Write floor numbers? Enter 1 for yes, 0 for no\n")
    bdtb = BDTB(baseurl, seeLZ, floorTag)
    bdtb.start()
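
The extraction step, taking the text of every `<cc>` element as one floor, can be tried offline without touching the network. The sketch below is written for Python 3 and uses the standard library's `html.parser` as a dependency-free stand-in for BeautifulSoup's `find_all('cc')` plus `get_text()`; it is not the script's own method. The class `FloorTextParser`, the function `extract_floors` and the sample markup are all hypothetical, and real Tieba pages may be structured differently.

```python
from html.parser import HTMLParser


class FloorTextParser(HTMLParser):
    """Collect the text inside each <cc> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.floors = []    # finished floor texts, in document order
        self._depth = 0     # greater than 0 while inside a <cc> element
        self._chunks = []   # text pieces of the floor being read

    def handle_starttag(self, tag, attrs):
        if tag == "cc":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "cc" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # The outermost <cc> just closed: store its joined text.
                self.floors.append("".join(self._chunks).strip())
                self._chunks = []

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


def extract_floors(html):
    """Return a list with the text of every <cc> element in `html`."""
    parser = FloorTextParser()
    parser.feed(html)
    parser.close()
    return parser.floors


# Made-up markup for illustration only, not a real Tieba page.
SAMPLE = (
    "<div><cc><div class='post'> first floor </div></cc></div>"
    "<div><cc><div class='post'>second <b>floor</b></div></cc></div>"
)

floors = extract_floors(SAMPLE)
print(floors)  # → ['first floor', 'second floor']
```

Testing the parsing logic on a static string like this makes it easier to notice when a change in the page markup, rather than a network problem, is the reason no floors are found.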
Originally published on 2019-12-13; shared from the WeChat official account "python教程".
