From Login to Scraping: Using Python Anti-Anti-Crawling Techniques to Collect Thousands of Public Commercial Records from Taobao

荣仔_最靓的仔
Published 2021-02-02 09:50:50

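At some point I started to enjoy the feeling of scraping thousands upon thousands of records!

This article uses Python anti-anti-crawling techniques to show how to collect thousands of public commercial records from Taobao.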

Contents

1 Preparation

2 Case Walkthrough

2.1 Importing Modules

2.2 Core Code

2.3 Complete Code

3 Summary and Notice


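1 Preparation

Python environment: Python 3.8.2

IDE: JetBrains PyCharm 2018.1.2 x64

Libraries and modules: selenium (third-party); time, csv, and re (standard library)

In addition, a browser driver (WebDriver) is required.

Of these, only selenium is a third-party library that has to be installed separately. Enter the following command in a terminal: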

pip install selenium

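If the command finishes without errors, the library was installed successfully. Note that the code in this article uses the Selenium 3 API (the find_element_by_* methods and passing the driver path directly to webdriver.Chrome). Later Selenium 4 releases removed both, so a Selenium 3.x installation is needed to run the code unchanged.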

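Next, how to install the browser driver (using Google Chrome as the example):

First, download the WebDriver for your browser.

Chrome driver download: http://npm.taobao.org/mirrors/chromedriver/

Firefox driver download: https://github.com/mozilla/geckodriver/releases

Edge driver download: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver

Safari driver download: https://webkit.org/blog/6900/webdriver-support-in-safari-10/

For Chrome, you first need to know the browser's version number.

Only the leading part of the version number has to match. As long as the major version lines up, that is enough. Then find the matching driver version and download it.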
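
The version-matching rule above can be sketched as a small check. This is only an illustration: the helper names major_version and versions_match and the sample version strings are hypothetical, and it assumes that comparing the first dot-separated component is sufficient.

```python
def major_version(version: str) -> str:
    """Return the leading (major) component of a dotted version string."""
    return version.strip().split(".")[0]


def versions_match(browser_version: str, driver_version: str) -> bool:
    """True when the browser and the driver share the same major version."""
    return major_version(browser_version) == major_version(driver_version)


# Hypothetical version strings, for illustration only
print(versions_match("83.0.4103.116", "83.0.4103.39"))
print(versions_match("83.0.4103.116", "81.0.4044.69"))
```

If the check fails, the driver on disk probably belongs to a different browser release and should be downloaded again.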

Once it is downloaded, run a quick test:

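# Import webdriver from selenium
from selenium import webdriver

# Point to the Chrome driver (the local path of the downloaded driver executable)
driver = webdriver.Chrome('E:/software/chromedriver_win32/chromedriver.exe')

# Open the given URL with the get method
driver.get('http://www.baidu.com')

With that, the preparation is complete. Next comes the scraping case itself.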

2 Case Walkthrough

2.1 Importing Modules

Import the third-party library and the modules listed above:

from selenium.webdriver import ActionChains  # import the action chain class
from selenium import webdriver
import time
import csv
import re

2.2 Core Code

Target site: Taobao (official website)

Write the code that opens the target page automatically:

# Pass in the local path of the browser driver
driver = webdriver.Chrome('E:/software/chromedriver_win32/chromedriver.exe')
# Pass in the URL of the target page
driver.get('https://www.taobao.com/')

Maximize the browser window:

driver.maximize_window()  # maximize the browser window

Enter the keyword and search for the product automatically:

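keyword = input('Enter the name of the product to search for: ')
driver.find_element_by_id('q').send_keys(keyword)  # locate the Taobao search box by the id found via "Inspect" and type the keyword
driver.find_element_by_class_name('btn-search').click()  # locate the search button by its class 'btn-search' and click it

At this point the site requires a login before the search results can be viewed, so the login has to be handled next.

Enter the account name and password (the XPath values are located here with the F12 developer tools):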

driver.find_element_by_xpath('//*[@id="fm-login-id"]').send_keys('your account')
driver.find_element_by_xpath('//*[@id="fm-login-password"]').send_keys('your password')

Handle the human-verification check (anti-anti-crawling: drag the slider to the right):

login = driver.find_element_by_xpath('//*[@id="nc_1_n1z"]')  # find the slider by XPath
action = ActionChains(driver)  # create an action chain
action.click_and_hold(on_element=login)  # press the mouse button and hold it
action.move_by_offset(xoffset=300-42, yoffset=0)  # drag horizontally by 300-42 pixels
action.pause(0.5).release().perform()  # pause 0.5 s, then release the mouse button; perform() runs the queued actions
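
The snippet above moves the slider in a single jump of 300-42 = 258 pixels. As a hedged variation (not the author's method), the same distance could be split into several smaller moves. The helper split_drag below is hypothetical, and reading 300 and 42 as the track width and the slider-knob width is an assumption.

```python
def split_drag(track_width: int, knob_width: int, steps: int) -> list:
    """Split the total drag distance into `steps` integer offsets.

    The offsets always sum to track_width - knob_width.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    total = track_width - knob_width
    base, extra = divmod(total, steps)
    # Spread the remainder over the first `extra` moves
    return [base + 1 if i < extra else base for i in range(steps)]


offsets = split_drag(300, 42, 5)
print(offsets, sum(offsets))
```

Each offset could then be passed to action.move_by_offset(xoffset=dx, yoffset=0) in turn before calling release() and perform(). Whether this changes how the verification behaves is not something the article establishes.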

Extract the target data from the whole page (for loop):

divs = driver.find_elements_by_xpath('//div[@class="items"]/div[@class="item J_MouserOnverReq  "]')
for div in divs:
    info = div.find_element_by_xpath('.//div[@class="row row-2 title"]/a').text
    price = div.find_element_by_xpath('.//strong').text
    deal = div.find_element_by_xpath('.//div[@class="deal-cnt"]').text
    shop = div.find_element_by_xpath('.//div[@class="shop"]/a').text
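
The deal value extracted above is raw text. A minimal sketch of turning it into a number follows, assuming the text looks like '1.5万+人付款' or '300+人付款'. That format is an assumption and is not confirmed by the article, and parse_deal_count is a hypothetical helper.

```python
import re


def parse_deal_count(text: str) -> int:
    """Convert text such as '1.5万+人付款' into an integer lower bound."""
    match = re.search(r"(\d+(?:\.\d+)?)(万?)", text)
    if match is None:
        return 0  # no digits found in the text
    number = float(match.group(1))
    if match.group(2):  # '万' means ten thousand
        number *= 10_000
    return int(round(number))


# Made-up sample strings in the assumed format
for sample in ["1.5万+人付款", "300+人付款", "56人付款"]:
    print(sample, parse_deal_count(sample))
```

Storing the parsed number next to the raw text keeps the original value available if the assumed format turns out to be wrong.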

Save to a file (stored in CSV format):

with open('data.csv', mode='a', newline="") as csvfile:
    csvWriter = csv.writer(csvfile, delimiter=',')
    csvWriter.writerow([info, price, deal, shop])
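
One possible variation on the saving step, offered as a sketch rather than the author's code: write a header row once and set the encoding explicitly so that Chinese text opens cleanly in spreadsheet programs. The function name save_rows, the header names, and the sample rows are all hypothetical.

```python
import csv
import os
import tempfile


def save_rows(path, rows, header=("info", "price", "deal", "shop")):
    """Append rows to a CSV file, writing the header only for a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # utf-8-sig lets spreadsheet programs such as Excel detect the encoding
    with open(path, mode="a", newline="", encoding="utf-8-sig") as csvfile:
        writer = csv.writer(csvfile, delimiter=",")
        if is_new:
            writer.writerow(header)
        writer.writerows(rows)


# Demo with made-up rows in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "data.csv")
    save_rows(demo_path, [["sample item A", "99.00", "100+", "shop A"]])
    save_rows(demo_path, [["sample item B", "59.90", "20", "shop B"]])
    with open(demo_path, newline="", encoding="utf-8-sig") as f:
        demo_rows = list(csv.reader(f))
print(demo_rows)
```

Collecting one page of rows and calling save_rows once per page also avoids reopening the file for every single item.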

The steps above scrape a single page. So how is the code written to scrape multiple pages?

Get the total number of pages:

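page = driver.find_element_by_xpath('//*[@id="mainsrp-pager"]/div/div/div/div[1]').text  # text of the element that shows the total page count
page_list = re.findall(r'(\d+)', page)  # the regular expression extracts every number (returns a list)
page_num = page_list[0]   # still a string at this point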
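
The extraction step can be tried without a browser. The sketch below assumes the pager text looks like '共 100 页,'. That sample string is an assumption, and total_pages is a hypothetical helper that also converts the result to an integer and falls back to 1 when no number is found.

```python
import re


def total_pages(pager_text: str) -> int:
    """Return the first integer found in the pager text, or 1 if none."""
    numbers = re.findall(r"\d+", pager_text)
    return int(numbers[0]) if numbers else 1


print(total_pages("共 100 页,"))
print(total_pages(""))
```

Converting to int at this point matters because multiplying a string by 44 repeats the string instead of computing a number.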

Loop over all pages with a for loop to collect all the data for the product:

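for page_index in range(1, int(page_num)):  # page_num is a string, so convert it first
    # page 2 uses s=44, page 3 uses s=88, and so on
    driver.get('https://s.taobao.com/search?q={}&s={}'.format(keyword, page_index*44))
    # extract and save this page's data here, as shown above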

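Note that the page URL used above comes from a pattern found by comparing the URLs of several result pages.

Clearly, starting from the URL of page 2, the value of the s parameter begins at 44 and grows by 44 with each page (44, 88, 132, ...).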
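
The URL pattern can be written out as a small function. This is a sketch under two assumptions: that page 1 corresponds to s=0, which the article does not state, and that the keyword should be percent-encoded before it goes into the URL. The names ITEMS_PER_PAGE and search_urls are hypothetical.

```python
from urllib.parse import quote

ITEMS_PER_PAGE = 44  # step of the s parameter described in the article


def search_urls(keyword: str, total_pages: int) -> list:
    """Build the search URL for every page, with page 1 at s=0 (assumed)."""
    base = "https://s.taobao.com/search?q={}&s={}"
    return [
        base.format(quote(keyword), (page - 1) * ITEMS_PER_PAGE)
        for page in range(1, total_pages + 1)
    ]


for url in search_urls("laptop", 3):
    print(url)
```

Building the full list up front makes it easy to check the offsets before any page is requested.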

2.3 Complete Code

from selenium.webdriver import ActionChains  # import the action chain class
from selenium import webdriver
import time
import csv
import re

# Search for the keyword and log in to Taobao
def search_product(key):
    driver.get('https://www.taobao.com/')
    driver.find_element_by_id('q').send_keys(key)  # locate the Taobao search box by the id found via "Inspect" and type the keyword
    driver.find_element_by_class_name('btn-search').click()  # locate the search button by its class 'btn-search' and click it

    driver.implicitly_wait(10)  # implicit wait (seconds): element lookups retry for up to 10 s instead of failing immediately
    driver.maximize_window()  # maximize the browser window

    # Handle the login (login anti-crawling measures such as the slider)
    driver.find_element_by_xpath('//*[@id="fm-login-id"]').send_keys('your account name or phone number')
    time.sleep(1)
    driver.find_element_by_xpath('//*[@id="fm-login-password"]').send_keys('your account password')
    time.sleep(2)

    # Handle the slider
    login = driver.find_element_by_xpath('//*[@id="nc_1_n1z"]')  # find the slider by XPath
    action = ActionChains(driver)  # create an action chain
    action.click_and_hold(on_element=login)  # press the mouse button and hold it
    action.move_by_offset(xoffset=300-42, yoffset=0)  # drag horizontally by 300-42 pixels
    action.pause(0.5).release().perform()  # pause 0.5 s, then release the mouse button; perform() runs the queued actions
    driver.find_element_by_xpath('//*[@id="login-form"]/div[4]/button').click()  # click the login button; the site redirects back to the keyword search
    driver.implicitly_wait(10)  # implicit wait

    page = driver.find_element_by_xpath('//*[@id="mainsrp-pager"]/div/div/div/div[1]').text  # text of the element that shows the total page count
    page_list = re.findall(r'(\d+)', page)  # the regular expression extracts every number (returns a list)
    page_num = page_list[0]   # still a string at this point

    return int(page_num)

# Scrape the data and save it
def get_data():
    divs = driver.find_elements_by_xpath('//div[@class="items"]/div[@class="item J_MouserOnverReq  "]')
    for div in divs:
        info = div.find_element_by_xpath('.//div[@class="row row-2 title"]/a').text
        price = div.find_element_by_xpath('.//strong').text
        deal = div.find_element_by_xpath('.//div[@class="deal-cnt"]').text
        shop = div.find_element_by_xpath('.//div[@class="shop"]/a').text
        print(info, price, deal, shop, sep='|')

        # Save the row
        with open('data.csv', mode='a', newline="") as csvfile:
            csvWriter = csv.writer(csvfile, delimiter=',')
            csvWriter.writerow([info, price, deal, shop])

def main():
    print('Scraping page 1...')
    page = search_product(keyword)
    get_data()

    # Fetch the data from page 2 onward
    page_num = 1   # the s parameter is page_num * 44
    while page_num != page:
        print('*' * 100)
        print('Scraping page {}...'.format(page_num+1))
        print('*' * 100)
        driver.get('https://s.taobao.com/search?q={}&s={}'.format(keyword, page_num*44))
        driver.implicitly_wait(10)  # implicit wait
        get_data()
        page_num += 1

    driver.quit()

if __name__ == '__main__':
    driver = webdriver.Chrome('E:/software/chromedriver_win32/chromedriver.exe')
    # keyword = '电脑'  # "computer"
    keyword = input('Enter the name of the product to search for: ')
    main()

Screenshots of the overall run:

[Screenshot: output in the PyCharm console]

[Screenshot: the CSV file opened]

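3 Summary and Notice

I have recently been reviewing for final exams. After July I will systematically write a web-scraping column, "Python Web Data Scraping and Analysis: From Beginner to Expert". Anyone interested is welcome to follow in advance!

More original articles and categorized columns can be found on my homepage.

★ Copyright notice: This is an original article by CSDN blogger 「荣仔!最靓的仔!」, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.


Comments are welcome. Let's learn and exchange ideas together.

Thanks for reading.

END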

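This article is part of the Tencent Cloud self-media sync exposure program and is shared from the author's personal site/blog.
Originally published: 2020/06/25. In case of infringement, contact cloudcommunity@tencent.com for removal.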
