Scraping data using CSS selectors (Python, BS4)

Stack Overflow user

Asked 2022-02-01 00:17:47

Answers: 1, Views: 120, Followers: 0, Votes: 0

This is my first time scraping data with CSS selectors.

I am having trouble scraping the contents of an anchor tag.

Here is my code:

import requests
from bs4 import BeautifulSoup

url = "https://weworkremotely.com/remote-jobs/search?utf8=✓&term=ruby"
wwr_result = requests.get(url)
wwr_soup = BeautifulSoup(wwr_result.text, "html.parser")
posts = wwr_soup.find_all("li", {"class": "feature"})
link = post.select("#category-2 > article > ul > li:nth-child(1) > a[href]")

title = post.find("span", {"class": "title"}).get_text()
company = post.find("span", {"class": "company"}).get_text()
location = post.find("span", {"class": "region company"}).get_text()
link = post.select("#category-2 > article > ul > li:nth-child(1) > a[href]")

print({"title": title, "company": company, "location": location, "link": f"https://weworkremotely.com/{link}"})

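I want to scrape the anchor's contents to build the link for each post, so I used a[href].

But it doesn't work; instead, the contents of all the child elements get scraped.

How can I scrape just the anchor's href?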

python

web-scraping

beautifulsoup

css-selectors

1 Answer

Stack Overflow user

Accepted answer

Answered 2022-02-01 05:48:41

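Assuming you have correctly selected the jobs of interest from all those listed, you need a loop. Inside the loop, extract the first href attribute that contains the substring -jobs, i.e. post.select_one('[href*=-jobs]'):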

import requests
from bs4 import BeautifulSoup

url = "https://weworkremotely.com/remote-jobs/search?utf8=✓&term=ruby"
wwr_result = requests.get(url)
wwr_soup = BeautifulSoup(wwr_result.text, "html.parser")
posts = wwr_soup.find_all("li", {"class": "feature"})

for post in posts:
    print('https://weworkremotely.com' + post.select_one('a[href*=-jobs]')['href'])
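
The same loop can be tried without hitting the live site. The following is a minimal sketch that runs against a hand-written HTML fragment; the markup, the `extract_jobs` helper, and the `BASE_URL` constant are assumptions made for illustration and only imitate the `li.feature` structure used above, not the real page. It collects the title, company, location, and link for each post, reading the `href` from the single tag returned by `select_one` and joining it to the base URL with `urllib.parse.urljoin`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup that loosely imitates two featured job listings.
SAMPLE_HTML = """
<section id="category-2"><article><ul>
  <li class="feature">
    <a href="/company/acme">logo</a>
    <a href="/remote-jobs/acme-ruby-developer">
      <span class="company">Acme</span>
      <span class="title">Ruby Developer</span>
      <span class="region company">Anywhere</span>
    </a>
  </li>
  <li class="feature">
    <a href="/remote-jobs/globex-rails-engineer">
      <span class="company">Globex</span>
      <span class="title">Rails Engineer</span>
      <span class="region company">Europe</span>
    </a>
  </li>
</ul></article></section>
"""

BASE_URL = "https://weworkremotely.com"  # assumed base for relative links


def extract_jobs(html):
    """Return one dict per li.feature element found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for post in soup.find_all("li", {"class": "feature"}):
        # select_one returns a single Tag (or None), not a list like select.
        anchor = post.select_one('a[href*="-jobs"]')
        if anchor is None:
            continue  # skip items without a job link
        jobs.append({
            "title": post.find("span", {"class": "title"}).get_text(strip=True),
            "company": post.find("span", {"class": "company"}).get_text(strip=True),
            "location": post.find("span", {"class": "region company"}).get_text(strip=True),
            "link": urljoin(BASE_URL, anchor["href"]),
        })
    return jobs


jobs = extract_jobs(SAMPLE_HTML)
for job in jobs:
    print(job)
```

Scoping the selector to each `post` is what keeps the link matched to its own listing. A page-wide selector starting at `#category-2` is evaluated against the whole document rather than the current item.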

To get all the listings on the page, switch to:

posts = wwr_soup.select('li:has(.tooltip)')
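
As a rough illustration of what `:has()` does in this selector, the sketch below runs it against a small invented fragment. The class names `feature` and `tooltip` come from the code above; the rest of the markup, including the `view-all` item, is an assumption. `li:has(.tooltip)` keeps every `li` that contains a descendant with class `tooltip`, whether or not the `li` itself carries the `feature` class.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one featured listing, one regular listing,
# and one list item that is not a job listing at all.
SAMPLE_HTML = """
<ul>
  <li class="feature"><span class="tooltip">Featured</span><a href="/remote-jobs/a-ruby">A</a></li>
  <li><span class="tooltip">New</span><a href="/remote-jobs/b-ruby">B</a></li>
  <li class="view-all"><a href="/categories/programming">View all</a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only items explicitly marked as featured.
featured_only = soup.find_all("li", {"class": "feature"})

# Every item that contains an element with class "tooltip".
all_listings = soup.select("li:has(.tooltip)")

hrefs = [li.select_one('a[href*="-jobs"]')["href"] for li in all_listings]
print(len(featured_only), len(all_listings), hrefs)
```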

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/70934334
