Q: Web scraping with BeautifulSoup won't work

Stack Overflow user

Asked 2020-04-19 05:06:33

2 answers, 664 views, 0 followers, 2 votes

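Ultimately, I am trying to open all the articles of a news website and then make a top 10 of the words used across those articles. To do that, I first wanted to see how many articles there are, so that I can iterate over them at some point. I haven't really figured out yet how I want to do everything.

For this I wanted to use BeautifulSoup4. I think the class I am trying to get is generated by JavaScript, because I am not getting anything back. This is my code: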

import requests
from bs4 import BeautifulSoup

url = "http://ad.nl"
ad = requests.get(url)
soup = BeautifulSoup(ad.text.lower(), "xml")
titels = soup.findAll("article")

print(titels)
for titel in titels:
    print(titel)

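The article title is sometimes an h2 and sometimes an h3. It always has the same class, but I can't get anything out of that class. It has some parents that use the same name with a -wrapper suffix. I don't even know how to use a parent to get what I want, and I think those classes are JavaScript too. There is also an href I am interested in, but again, that is probably JavaScript as well, since it returns nothing.

Does anyone know how I can get anything (preferably the href, but the article name would be fine too) using BeautifulSoup?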

Tags: javascript, python, class, web-scraping, beautifulsoup

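2 Answers

Stack Overflow user

Accepted answer

Posted 2020-04-19 06:01:28

As @Sri mentioned in the comments, when you open this url you get a page where you first have to accept cookies, which requires interaction. When interaction is required, consider using something like selenium (https://selenium-python.readthedocs.io/).

Here is something to get you started.

(Edit: before running the code below, you need to run pip install selenium)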

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://ad.nl'

# launch firefox with your url above
# note that you could change this to some other webdriver (e.g. Chrome)
driver = webdriver.Firefox()
driver.get(url)

# click the "accept cookies" button
# (Selenium 4 removed find_element_by_name; use find_element with By.NAME)
btn = driver.find_element(By.NAME, 'action')
btn.click()

# grab the html of the page as currently rendered; if the articles have not
# loaded yet at this point, an explicit wait may be needed before this line
html = driver.page_source

# parse the html soup (note that lower() also lowercases the titles themselves)
soup = BeautifulSoup(html.lower(), "html.parser")
articles = soup.find_all("article")

for article in articles:
    # check for article titles in both h2 and h3 elems
    titles = article.find_all(['h2', 'h3'], class_='ankeiler__title')
    for t in titles:
        # first I was doing print(t.text), but some of them had leading
        # newlines and things like '22:30', which I assume was the hour of the day
        text = ''.join(t.find_all(string=True, recursive=False)).lstrip()
        print(text)

# close the browser and end the webdriver session
driver.quit()

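This may or may not be what you want, but it is just an example of how to use Selenium and Beautiful Soup together. Feel free to copy, use, or modify whatever you see fit. If you are wondering which selectors to use, read the comment by @JL Peyret.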
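Since the question also asks for the href, here is a minimal sketch of pulling both the title and the link out of each article. It runs against a small hand-written HTML sample, not the live site: the class ankeiler__title comes from the code above, but the surrounding markup (the a element wrapping the heading, the ankeiler__link and ankeiler__time classes, the example URLs) and the helper name extract_articles are assumptions made for illustration, so the real page may need different lookups.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only; the real ad.nl HTML may differ.
SAMPLE_HTML = """
<article class="ankeiler">
  <a class="ankeiler__link" href="/binnenland/voorbeeld-een~a1/">
    <h2 class="ankeiler__title"><span class="ankeiler__time">22:30</span>
      Eerste titel</h2>
  </a>
</article>
<article class="ankeiler">
  <a class="ankeiler__link" href="/sport/voorbeeld-twee~a2/">
    <h3 class="ankeiler__title">Tweede titel</h3>
  </a>
</article>
"""


def extract_articles(html):
    """Return a list of (title, href) tuples, one per <article> that has both."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for article in soup.find_all("article"):
        # the title can be an h2 or an h3, always with the same class
        heading = article.find(["h2", "h3"], class_="ankeiler__title")
        # assumption: the first link with an href inside the article is the article link
        link = article.find("a", href=True)
        if heading is None or link is None:
            continue
        # keep only the heading's own text nodes, skipping nested tags such as the time
        title = "".join(heading.find_all(string=True, recursive=False)).strip()
        results.append((title, link["href"]))
    return results


for title, href in extract_articles(SAMPLE_HTML):
    print(title, href)
```

With the Selenium code above, the same function could be fed driver.page_source in place of SAMPLE_HTML. Passing the HTML without lower() keeps the hrefs and titles in their original case.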

Votes: 1

Stack Overflow user

Posted 2020-04-19 16:10:22

In case you don't want to use selenium, this works for me. I tried it on two PCs with different internet connections. Can you try it?

from bs4 import BeautifulSoup
import requests

# send the cookies that appear to mark the cookie prompt as already accepted
cookies = {
    "pwv": "2",
    "pws": "functional|analytics|content_recommendation|targeted_advertising|social_media",
}

page = requests.get("https://www.ad.nl/", cookies=cookies)

soup = BeautifulSoup(page.content, 'html.parser')

articles = soup.find_all("article")

Then follow kimbo's code to extract the h2/h3.
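Once the titles (or later the full article texts) are collected, the top 10 words that the question is ultimately after could be counted roughly like this. This is only a sketch using the standard library: the function name top_words and the sample strings are made up for illustration, and splitting on \w+ is a simplistic notion of a "word" that you may want to refine, for example by filtering out stop words.

```python
import re
from collections import Counter


def top_words(texts, n=10):
    """Return the n most common words across all texts as (word, count) pairs."""
    counts = Counter()
    for text in texts:
        # \w+ matches runs of letters, digits and underscores; lower() merges case variants
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common(n)


# made-up sample input standing in for the scraped titles or article texts
sample_texts = ["De kat en de hond", "De hond blaft"]
print(top_words(sample_texts, n=2))
```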

Votes: 2

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/61296242
