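I have a CSS selector that works fine when executed in the Chrome JS console, but when I run it through Beautiful Soup it works on one example and fails on another (I can't tell what the difference between the two is).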
url_1 = 'https://www.amazon.com/s?k=bacopa&page=1'
url_2 = 'https://www.amazon.com/s?k=acorus+calamus&page=1'
When executed in the Chrome console, the following query works fine for both URLs.
document.querySelectorAll('div.s-result-item')
When I run the two URLs through Beautiful Soup, however, this is the output I get.
url_1 (works)
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
r = requests.get(url_1, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
listings = soup.select('div.s-result-item')
print(len(listings))
Output: 53 (correct)
url_2 (does not work)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
r = requests.get(url_2, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
listings = soup.select('div.s-result-item')
print(len(listings))
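Output: 0 (incorrect; expected: 49)
Does anyone know what might be happening here, and how I can get the CSS selector to work with Beautiful Soup?
Posted on 2019-05-30 04:58:00
I think the problem is the HTML itself. Change the parser to 'lxml' (this requires the lxml package to be installed). For efficiency, you can also shorten the CSS selector to just the class and reuse the connection through a Session object.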
import requests
from bs4 import BeautifulSoup as bs
urls = ['https://www.amazon.com/s?k=bacopa&page=1', 'https://www.amazon.com/s?k=acorus+calamus&page=1']
# Reuse one connection for both requests
with requests.Session() as s:
    for url in urls:
        r = s.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        soup = bs(r.content, 'lxml')
        listings = soup.select('.s-result-item')
        print(len(listings))
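One way to check whether the parser or the downloaded HTML is at fault is to count the matching start tags in the raw markup before Beautiful Soup builds a tree. The following is only a rough sketch using the standard library's html.parser; the names ClassCounter, count_elements and sample are made up for illustration, and the sample string stands in for r.text from the code above.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of a given name whose class attribute contains a token."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attrs is a list of (name, value) pairs.
        if tag != self.tag:
            return
        # A valueless attribute yields None, so fall back to an empty string.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_elements(html, tag="div", class_name="s-result-item"):
    """Return how many <tag> start tags carry class_name (hypothetical helper)."""
    counter = ClassCounter(tag, class_name)
    counter.feed(html)
    counter.close()
    return counter.count


# Stand-in for the downloaded page; in practice pass r.text instead.
sample = """
<div class="s-result-item s-asin">A</div>
<div class="s-result-item">B</div>
<span class="s-result-item">not a div</span>
<div class="other">C</div>
"""
print(count_elements(sample))  # 2
```

If this count is non-zero for url_2 while soup.select returns 0, the elements are present in the response and the parse tree is the likely culprit, which is consistent with switching parsers. If it is also 0, the server probably returned a different page than the browser shows.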
Posted on 2019-05-30 00:36:02
Try the selenium library to download the web page.
from selenium import webdriver
from bs4 import BeautifulSoup
url_1 = 'https://www.amazon.com/s?k=bacopa&page=1'
url_2 = 'https://www.amazon.com/s?k=acorus+calamus&page=1'
#set chrome webdriver path
driver = webdriver.Chrome('/usr/bin/chromedriver')
#download webpage
driver.get(url_2)
soup = BeautifulSoup(driver.page_source, 'html.parser')
listings = soup.find_all('div',{'class':'s-result-item'})
print(len(listings))
Output:
url_1: 50
url_2: 48
https://stackoverflow.com/questions/56374167