To scrape the web with Python without running into "please confirm you are human" checks, you can try the following approaches:
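# Send a browser-like User-Agent header so the request resembles a regular browser visit
import requests
url = 'https://example.com'  # placeholder; replace with the page you want to fetch
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers)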
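# Route requests through a proxy so they do not all come from the same IP address
import requests
url = 'https://example.com'  # placeholder; replace with the page you want to fetch
proxies = {
    'http': 'http://127.0.0.1:8888',
    # A local proxy normally speaks plain HTTP even when tunnelling HTTPS traffic,
    # so the proxy URL itself uses the http:// scheme
    'https': 'http://127.0.0.1:8888'
}
response = requests.get(url, proxies=proxies)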
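# Recognize a simple image CAPTCHA with OCR and submit it with the login form
import requests
import pytesseract
from PIL import Image
captcha_url = 'https://example.com/captcha'  # placeholder
login_url = 'https://example.com/login'      # placeholder
# Use one session so the CAPTCHA and the login request share the same cookies
session = requests.Session()
# Download the CAPTCHA image
response = session.get(captcha_url)
with open('captcha.png', 'wb') as f:
    f.write(response.content)
# Recognize the CAPTCHA text; strip the trailing whitespace that OCR output usually carries
image = Image.open('captcha.png')
captcha = pytesseract.image_to_string(image).strip()
# Send the request together with the CAPTCHA
data = {
    'username': 'your_username',
    'password': 'your_password',
    'captcha': captcha
}
response = session.post(login_url, data=data)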
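# Drive a real browser with Selenium to simulate user actions
from selenium import webdriver
from selenium.webdriver.common.by import By
url = 'https://example.com'  # placeholder; replace with the page you want to open
# Use the Chrome browser driver
driver = webdriver.Chrome()
# Open the page
driver.get(url)
# Simulate user input (Selenium 4 removed find_element_by_xpath; use find_element with By.XPATH)
element = driver.find_element(By.XPATH, '//input[@id="username"]')
element.send_keys('your_username')
# Submit the form
element.submit()
# Read the result
result = driver.find_element(By.XPATH, '//div[@id="result"]').text
# Close the browser
driver.quit()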
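Note that when scraping you should follow the site's terms of use and its robots.txt rules, and keep your request rate low enough that you do not put excessive load on the site.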
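As a rough illustration of that advice, the sketch below checks a URL against robots.txt rules and enforces a minimum delay between requests, using only the standard library. The `Throttle` class, the `build_robot_parser` helper, the `my-scraper` user-agent name, the sample robots.txt text and the `example.com` URLs are all assumptions made for the example, not something from the original answer. A real scraper would download the site's actual robots.txt and call `requests.get` where the comment indicates.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last = None  # time of the previous request; None before the first one

    def wait(self) -> None:
        # Sleep only for whatever part of the interval has not yet elapsed.
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

parser = build_robot_parser(SAMPLE_ROBOTS)
throttle = Throttle(min_interval=0.2)

for path in ["/index.html", "/private/data.html"]:
    target = "https://example.com" + path
    if parser.can_fetch("my-scraper", target):
        throttle.wait()
        # A real scraper would call requests.get(target) here.
        print("allowed:", target)
    else:
        print("skipped:", target)
```

A suitable interval depends on the site. If its robots.txt declares a `Crawl-delay`, `RobotFileParser.crawl_delay()` returns it and that value could be used for `min_interval`.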