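I am using Scrapy with Selenium to scrape URLs from a particular search engine (ekoru). This is a screenshot of the response I get back from the search engine after just one request:

Since I am using Selenium, I assumed my user agent would look normal, so what else could let the search engine detect the bot right away?
Here is my code: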
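import scrapy
from scrapy.selector import Selector
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys


class CompanyUrlSpider(scrapy.Spider):
    name = 'company_url'

    def start_requests(self):
        # Load the search engine's home page through Selenium
        yield SeleniumRequest(
            url='https://ekoru.org',
            wait_time=3,
            screenshot=True,
            callback=self.parseEkoru
        )

    def parseEkoru(self, response):
        # scrapy-selenium exposes the live WebDriver via response.meta
        driver = response.meta['driver']
        # Type the query into the search box and submit it
        search_input = driver.find_element(By.XPATH, "//input[@id='fld_q']")
        search_input.send_keys('Hello World')
        search_input.send_keys(Keys.ENTER)
        # Parse the page as rendered right after submitting and extract the result links
        html = driver.page_source
        response_obj = Selector(text=html)
        links = response_obj.xpath("//div[@class='serp-result-web-title']/a")
        for link in links:
            yield {
                'ekoru_URL': link.xpath(".//@href").get()
            }

Posted on 2020-10-08 01:44:33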
Sometimes you need to pass additional options so that a web page does not detect the automated browser.
Here is some code you can use:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# This code helps to simulate a "human being" visiting the website
chrome_options = Options()
chrome_options.add_argument('--start-maximized')
# executable_path is accepted by Selenium 3; Selenium 4 uses a Service object instead
driver = webdriver.Chrome(options=chrome_options, executable_path=r"chromedriver")
# Before any page script runs, make navigator.webdriver report undefined
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source":
    """Object.defineProperty(navigator,
    'webdriver', {get: () => undefined})"""})
url = 'https://ekoru.org'
driver.get(url)

This yields the following (note the "Chrome is being controlled..." notice below the address bar):

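The "Chrome is being controlled..." notice and the `navigator.webdriver` property are two of the signals a page can use to spot automation. As a rough sketch only, the snippet below collects a few Chrome switches that are commonly combined with the script above. The helper names (`stealth_arguments`, `stealth_experimental_options`, `build_driver`) are hypothetical and not part of the original answer, and whether ekoru actually checks any of these signals is an assumption.

```python
# Hypothetical helpers for illustration; none of these names come from the original answer.

# Same JavaScript as in the answer: hide navigator.webdriver from page scripts.
WEBDRIVER_PATCH = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)


def stealth_arguments():
    """Command-line switches to pass to Chrome."""
    return [
        "--start-maximized",
        # Stops Blink from exposing the AutomationControlled feature
        "--disable-blink-features=AutomationControlled",
    ]


def stealth_experimental_options():
    """ChromeDriver experimental options that drop the automation switch and extension."""
    return {
        "excludeSwitches": ["enable-automation"],
        "useAutomationExtension": False,
    }


def build_driver():
    """Create a Chrome driver with the options above (assumes chromedriver is on PATH)."""
    # Imported lazily so the helpers above can be inspected without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in stealth_arguments():
        options.add_argument(argument)
    for name, value in stealth_experimental_options().items():
        options.add_experimental_option(name, value)

    driver = webdriver.Chrome(options=options)
    # Register the patch so it runs before any script on each newly loaded document
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument", {"source": WEBDRIVER_PATCH}
    )
    return driver
```

Calling `build_driver()` and then `driver.get('https://ekoru.org')` would exercise the sketch. With scrapy-selenium, which creates the driver itself, the command-line switches can instead be supplied through the `SELENIUM_DRIVER_ARGUMENTS` setting; the CDP script would still need to be registered on the driver before the page of interest is loaded.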
https://stackoverflow.com/questions/64245471