How to use the find_all method in BS4 to scrape certain strings

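To scrape certain strings with the find_all method in BS4, follow these steps:

Import BeautifulSoup and its dependencies: make sure Python and the BeautifulSoup library (the beautifulsoup4 package) are installed, then import BeautifulSoup.
Get the HTML content: obtain the HTML that contains the target strings, either by fetching the page over the network or by reading it from a local file.
Create a BeautifulSoup object: pass the HTML content to the BeautifulSoup class to create an object for the parsing steps that follow.
Call find_all: use find_all to look up the target strings. find_all accepts several arguments that specify the tag name, the attribute names and values, and the string content to match.
Iterate over the results and extract the strings: loop over the result set returned by find_all. When the results are tags, a text extraction method such as get_text() yields the string.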

Here is an example:

from bs4 import BeautifulSoup

# Get the HTML content
html = """
<html>
<body>
<div class="content">
    <h1>标题1</h1>
    <p>段落1</p>
    <h2>标题2</h2>
    <p>段落2</p>
</div>
</body>
</html>
"""

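# Create the BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')

# Use find_all to find the strings that exactly match one of the targets
# (string= is the current name for the older text= argument)
elements = soup.find_all(string=['标题1', '段落2'])

# Iterate over the results and print each matched string
for element in elements:
    print(element)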

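The example above uses HTML that contains headings and paragraphs. Given a list of strings, find_all returns the text nodes whose content exactly matches "标题1" or "段落2", and the loop prints them.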
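
Exact matching is often too strict for real pages. As a minimal sketch reusing the same HTML as above, the string argument can also take a compiled regular expression (substring match) or a function, and each matched text node exposes its enclosing tag through .parent. The helper name ends_with_2 is made up for this illustration.

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="content">
    <h1>标题1</h1>
    <p>段落1</p>
    <h2>标题2</h2>
    <p>段落2</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A compiled regex matches any text node that contains the pattern
headings = soup.find_all(string=re.compile("标题"))

# Each result is a text node; .parent gives the tag that encloses it
parent_names = [s.parent.name for s in headings]

# A function receives each candidate string and returns True to keep it
def ends_with_2(s):  # hypothetical helper for this sketch
    return s is not None and s.strip().endswith("2")

twos = soup.find_all(string=ends_with_2)

print([str(s) for s in headings], parent_names, [str(s) for s in twos])
```

Going through .parent is useful when the goal is the element around the text (for example to read its attributes) rather than the text itself.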


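Problem with filter functions in BS4 find_all() in Python

I am scraping an HTML page. I am using Python (3.7) and the BeautifulSoup library (4.6.0) on a Mac. Among other things, I see a bunch of 'div' tags that have a class attribute. Some 'div' tags carry multiple class attribute values. Now I want to filter by tag name and class attribute value. For example, I want to find the 'div' tags that have class='a' but not class='b' (yes, some div tags have class='a b'). To get these tags, I tried the filter functions mentioned in the BS4 documentation. My impression is that find_all() passes the bs4 tag element to the function, where you can, on the BS

Views: 103, asked 2018-07-09, votes: -1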
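
One possible way to express the filter described in this question is sketched below. It is not taken from the original thread: the sample HTML and the function name div_with_a_without_b are invented for illustration. It relies on find_all accepting a function that is called with each tag, and on class being parsed as a list of values.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for this sketch
html = """
<div class="a">only a</div>
<div class="a b">a and b</div>
<div class="b">only b</div>
<p class="a">paragraph with a</p>
"""
soup = BeautifulSoup(html, "html.parser")

def div_with_a_without_b(tag):
    # class is multi-valued, so it arrives as a list such as ['a', 'b']
    classes = tag.get("class", [])
    return tag.name == "div" and "a" in classes and "b" not in classes

matches = soup.find_all(div_with_a_without_b)
texts = [tag.get_text() for tag in matches]
print(texts)
```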

1 answer

I want to scrape all member details from a table with BeautifulSoup

import requests from bs4 import BeautifulSoup url = 'http://www.gmcgujarat.org/searchdoctor.aspx' html = requests.get(url).text soup = BeautifulSoup(html, 'html.parser') name = soup.find(" ") for count in range(3333,4444): data = {name: " "} r = requ

Views: 0, asked 2019-03-12, votes: 2

2 answers

My code raises "TypeError: not all arguments converted during string formatting". What is wrong?

The code is as follows: from bs4 import BeautifulSoup import requests x = 0 value = 1 x = x + value url = "https://www.bol.com/nl/s/algemeen/zoekresultaten/sc/media_all/index.html?" + "page=%s" + "&searchtext=ipad" % x response = requests.get(url) html = re

Views: 14, asked 2017-07-18, votes: 0

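2 answers

How to scrape text from a div with an empty class value

Hi, how can I scrape text from a div that has no class? First I tried to scrape all the data from the div with the "job page" class, and then from the one with no class value, but it does not work. from bs4 import BeautifulSoup import requests a = {} def antal_pl(name=''): try: page_response = requests.get('https://antal.pl/oferty-pracy?s=&sid=&did=Accountancy', timeout=40).text pag

Views: 1, asked 2018-04-28, votes: 0

Answer accepted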

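2 answers

Opening pages with Python and BeautifulSoup

I would like to know how to use BeautifulSoup to open another page from my list. The resource I followed does not explain how to open another page in the list. Also, how do I open an "a href" nested inside a class? Here is my code: # coding: utf-8 import requests from bs4 import BeautifulSoup r = requests.get("") soup = BeautifulSoup(r.content) soup.find_all("a") for link in soup.find_all("a"): pr

Views: 0, asked 2015-09-24, votes: 8

Answer accepted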

2 answers

BeautifulSoup output stays []

I am trying to scrape text from a website with BeautifulSoup and Python requests, but it only gives [] as output. from bs4 import BeautifulSoup import requests url = "http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html" r = requests.get(url) soup = BeautifulSoup(r.content) data = soup.find_all("div"

Views: 1, asked 2016-03-16, votes: 1

Answer accepted

2 answers

How to write scraped data to a CSV?

Hi all, I am new to Python and I do not know how to convert scraped data to CSV format. Here is my program: import requests import urllib.request from bs4 import BeautifulSoup import pandas url = 'https://menupages.com/restaurants/ny-new-york/2' response = requests.get(url) soup = BeautifulSoup(response.text, "html.parser") all_links = soup

Views: 1, asked 2019-09-25, votes: 1

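2 answers

Python: iterating over pages with BeautifulSoup

I am using BeautifulSoup4 to scrape data from several web pages. For example, the URL in the case below has 96 pages. My problem is that the script throws an error after a few pages, usually when the code reaches pages 15-20. Error message: Traceback (most recent call last): File "main.py", line 34, in if next_page.text != 'Next': AttributeError: 'NoneType' object has no attribute 'text' Thanks in advance for your help! import requests import os import csv from itertools import count from bs4 imp

Views: 1, asked 2018-12-06, votes: 1

Answer accepted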

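4 answers

How to scrape multiple URLs efficiently in BS4

I am trying to find an efficient way to scrape multiple pages with BS4. I can easily scrape the first page and get all the data I need, but unfortunately not all of the data is on it. There are two more pages to scrape, and rather than hard-coding and changing the URLs of the second and third pages, I wonder whether there is a better way to do this in Python with BS4. The only part that needs to change is page=1, to the corresponding page number (1, 2, 3). import csv import requests from bs4 import BeautifulSoup url = "https://www.congress.gov/members?q={%22congress%22:%22115%22}&

Views: 0, asked 2018-04-04, votes: 0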

1 answer

find_all in bs4 does not work for strings

I am trying to search for a specific string in an HTML page I scraped. I used the find_all() method in bs4 with the string argument, but it does not work. Web page: https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx6TVdZU0FtVnVHZ0pKVGlnQVAB?hl=en-IN&gl=IN&ceid=IN%3Aen from bs4 import BeautifulSoup import requests def search(soup):

Views: 34, asked 2020-03-22, votes: 0

Answer accepted

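1 answer

Beautiful Soup 4: finding text in a table

I have been trying to scrape a web page with BS4. I cannot find the data I want (the player names in a table, i.e. "Claiborne, Morris"). When I use: soup = BeautifulSoup(r.content, "html.parser") PlayerName = soup.find_all("table") print (PlayerName) the player names are not even in the output; it only shows a different table. When I use: soup = BeautifulSoup(r.content, 'html.parser') texts = soup.findAl

Views: 11, asked 2016-07-23, votes: 1

Answer accepted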

1 answer

Converting a table scraped with Beautiful Soup to a list

Scraping a column from Wikipedia with Beautifulsoup returns only the last row, whereas I want all of the rows in a list: from urllib.request import urlopen from bs4 import BeautifulSoup site = "https://en.wikipedia.org/wiki/Agriculture_in_India" html = urlopen(site) soup = BeautifulSoup(html, "html.parser") table = soup.find("table", {'cla

Views: 3, asked 2017-05-11, votes: 1

Answer accepted

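2 answers

How to save attachments from a website with Beautiful Soup?

I have written code to scrape attachments from a website. It essentially scrapes the hyperlinks of the attachments. I cannot work out a way to save these attachments directly to a local location. import requests import pandas as pd from requests import get url = 'https://www.amfiindia.com/research-information/amfi-monthly' response = get(url,verify=False) import bs4 from bs4 import BeautifulSoup html_soup = BeautifulSoup(re

Views: 0, asked 2020-06-20, votes: 0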

3 answers

Web scraping: scraping multiple web pages with Python

from bs4 import BeautifulSoup import requests url = 'https://uk.trustpilot.com/review/thread.com' for pg in range(1, 10): pg = url + '?page=' + str(pg) soup = BeautifulSoup(page.content, 'lxml') for paragraph in soup.find_all('p'): print(paragraph.text) I want

Views: 1, asked 2019-01-13, votes: 3

Answer accepted

2 answers

Python - Beautiful Soup - How to filter the extracted data by keyword?

I want to scrape data from a website with Beautiful Soup and requests. I already get the data I want, but now I want to filter it: from bs4 import BeautifulSoup import requests url = "website.com" keyword = "22222" r = requests.get(url) data = r.text soup = BeautifulSoup(data, 'lxml') for article in soup.find_all('a'): for a in artic

Views: 29, asked 2019-03-18, votes: 1

Answer accepted

4 answers

How to scrape all headings from a website with BeautifulSoup?

I am trying to scrape all the headings from a simple website. My attempt: from bs4 import BeautifulSoup, SoupStrainer import requests url = "http://nypost.com/business" page = requests.get(url) data = page.text soup = BeautifulSoup(data) soup.find_all('h') soup.find_all('h') returns [], but if I do something like soup.h1 or soup.h2, it returns the corresponding data. Am I not

Views: 5, asked 2017-07-12, votes: 13

Answer accepted
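
find_all('h') looks for tags named exactly h, which is why it comes back empty here. The sketch below shows two ways to match h1 through h6 instead; it is an illustration on invented HTML, not the accepted answer from the thread.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample markup for this sketch
html = "<h1>Top</h1><p>text</p><h2>Sub</h2><h3>Detail</h3><header>not a heading</header>"
soup = BeautifulSoup(html, "html.parser")

# Option 1: pass a list of tag names
by_list = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])

# Option 2: pass a regex on the tag name; the anchors keep <header> out
by_regex = soup.find_all(re.compile(r"^h[1-6]$"))

list_texts = [tag.get_text() for tag in by_list]
regex_texts = [tag.get_text() for tag in by_regex]
print(list_texts, regex_texts)
```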

1 answer

Exported table is misaligned

I am trying to scrape a table from this link. When the table is scraped, the names and stat categories are aligned, but the numbers themselves are not. import csv from bs4 import BeautifulSoup import requests soup = BeautifulSoup( requests.get("https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc", timeout=30).text, 'lxml') def scrape_data(url): # t

Views: 0, asked 2019-10-28, votes: 0

Answer accepted

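2 answers

Scraping all the text on a web page hidden within tags, in Python 3

I need to scrape a web page, but I have run into a problem: the text I need from the front page is buried in many different formatting tags. I know how to scrape a regular page with Beautiful Soup, but that does not meet my needs (e.g. text is missing, some tags get through...) import requests from bs4 import BeautifulSoup from collections import Counter urls = ['https://www304.americanexpress.com/credit-card/compare'] with open('thisisanew.txt'

Views: 1, asked 2014-09-09, votes: 0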

1 answer

Extracting a list from a Python script in a scraped HTML page

I am new to web scraping and have hit a small roadblock with the following code: import requests from bs4 import BeautifulSoup url = "www.website.com" page = requests.get(url) soup = BeautifulSoup(page.content, "html.parser") price_scripts = soup.find_all('script')[23] print(price_scripts) The extracted scripts all appear to be Python scripts. Here is what the code above prints:

Views: 5, asked 2019-11-27, votes: 1

Answer accepted

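1 answer

Cannot find a tag I know is in the document: find_all() returns []

I am using bs4 to scrape a user profile on khanacademy, https://www.khanacademy.org/profile/DFletcher1990/. I am trying to get the user statistics (date joined, energy points earned, videos completed). I have checked https://www.crummy.com/software/BeautifulSoup/bs4/doc/ and this seems to be the case: "The most common type of unexpected behavior is that you can't find a tag that you know is in the document. You saw it going in, but find_all() returns [] or find() returns None. This is another common problem with Python's built-in HTML parser, which sometimes skips tags it doesn't understand. Again, the solution is to install lxml or htm

Views: 16, asked 2019-02-16, votes: 2

Answer accepted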

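1 answer

Python data scraping: scraping titles with href and prettify does not work

I am new to Python, and my first attempt is to scrape some pages from a random website. Here is my code, and I cannot figure out what is going on. I am scraping the title and size of the episodes, but there are 2 hrefs and prettify does not work. The code is as follows: from bs4 import BeautifulSoup import requests source = requests.get('https://1337x.to/popular-tv').text soup = BeautifulSoup(source, 'lxml') tvhead = soup.find('tbody') filename =

Views: 5, asked 2018-09-12, votes: 1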

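2 answers

BeautifulSoup only recognizes a few elements on the page

I did some web scraping on a website. The result only contains the first 20 elements on the page. If we scroll down, the remaining elements get loaded. How can I scrape those elements too? Is there a different approach? import requests from bs4 import BeautifulSoup r=requests.get("https://www.century21.com/real-estate/rock-spring-ga/LCGAROCKSPRING/") c=r.content c soup=BeautifulSoup(c,"html5lib") soup all=soup.find_all("div&

Views: 3, asked 2017-11-11, votes: 0

Answer accepted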

1 answer

Python: AttributeError: 'NoneType' object has no attribute 'startswith'

Why does this code not work and give an AttributeError? internship = parser.find_all('a', attrs = {'title': lambda job: job.startswith('Internship')}) when this one works: internship = parser.find_all('a', attrs = {'title': lambda job: job and job.startswith('Internship')}) This is the error I get from the first code

Views: 0, asked 2017-05-08, votes: 4

Answer accepted
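
A plausible reading of the difference between the two calls, consistent with the error in the title, is that the filter is also invoked for <a> tags that have no title attribute, in which case it receives None. The sketch below spells out the guard on invented HTML; the function name is_internship is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up sample markup: the second link has no title attribute
html = """
<a title="Internship in Berlin" href="/1">one</a>
<a href="/2">no title</a>
<a title="Full-time role" href="/3">three</a>
"""
parser = BeautifulSoup(html, "html.parser")

def is_internship(title):
    # Guard against None before calling a string method
    return title is not None and title.startswith("Internship")

internship = parser.find_all("a", attrs={"title": is_internship})
hrefs = [a["href"] for a in internship]
print(hrefs)
```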

1 answer

bs4 p tag returned as None

import bs4 import requests import re r = requests.get('https://www.the961.com/latest-news/lebanon-news/').text soup = bs4.BeautifulSoup(r, 'lxml') for article in soup.find_all('article'): title = article.h3.text print(title) date = article.find('span'

Views: 1, asked 2021-04-29, votes: 1

Answer accepted

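1 answer

Data displayed in an overlay / new window

I am completely new to web scraping, and I want to scrape the reviews and the property responses. However, the HTML I get seems to be for the hostel page rather than for the overlay page with the reviews, and I would like to know how to fetch and scrape the reviews panel. I can scrape user reviews with the following snippet, from bs4 import BeautifulSoup url = 'https://www.hostelworld.com/hosteldetails.php/HI-NYC-Hostel/New-York/1850#reviews' response = requests.get(url) SoupPage = BeautifulSoup(response.text, &

Views: 0, asked 2019-05-25, votes: 0

Answer accepted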

3 answers

Trouble scraping table data with Beautiful Soup

I want to scrape table data from this site. I tried the code below, but for whatever reason BS4 does not seem to get the table data: import bs4 as bs import urllib.request sauce = urllib.request.urlopen('https://drafty.cs.brown.edu/csprofessors').read() soup = bs.BeautifulSoup(sauce, 'lxml') table = soup.find('table', attrs={"id": "table"

Views: 30, asked 2020-10-24, votes: 1

Answer accepted

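2 answers

Why can't I access the information in tbody?

This is the source code of the website. I am doing web scraping with BeautifulSoup, but I cannot find the tr in tbody. In the website's source code there actually are tr elements in tbody, but the find_all function only returns the tr in the head. The link I am scraping: here is some of my code: ```from bs4 import BeautifulSoup ```html = urlopen(url) ```type(soup) ```print(tr)

Views: 0, asked 2019-06-18, votes: 1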

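1 answer

How to scrape a list of already-scraped links with the requests library

I have scraped a set of links from a website. All of the links that contain the word "meeting" (i.e. the meeting papers) are held in "meeting_links". I now need to follow each of those links to scrape more links inside them. I went back to using the requests library and tried r2 = requests.get("meeting_links") but it returns the following error: MissingSchema: Invalid URL 'list_meeting_links': No schema supplied. Perhaps you meant http://list_meeting_links? I have changed it, but it still makes no

Views: 7, asked 2019-07-12, votes: 1

Answer accepted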

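1 answer

How to scrape multiple pages with bs4 in Python

I have a query: I have been scraping a website "", but I am unable to scrape the email id from the links given in the table. I need to scrape the name, email and directors from the links in the given table. Could anyone please solve my problem, as I am a newbie to web scraping with Python, Beautiful Soup and requests. Thank you, Diksha #Scraping the website #Import a library to query a website import requests #Specify the URL companies_list = "https://www.zaubacorp.com/company-list" link = requests.

Views: 2, asked 2020-05-03, votes: 0

Answer accepted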

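1 answer

Beautiful Soup scraping

I have run into a problem where old working code no longer works properly. My Python code uses Beautiful Soup to scrape a website and extract event data (date, event, link). My code pulls all the events located in tbody. Each event is stored in a <tr class="Box">. The problem is that my scraper seems to stop after this <tr style ="box-shadow: none;>: once it reaches this section (a section containing a site ad for 3 events that I do not want to scrape), the code stops pulling event data from <tr class="Box">. Is there a way to skip this tr style / ignore future cases? ? i

Views: 12, asked 2020-09-30, votes: 2

Answer accepted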

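1 answer

How do I scrape only the number and not the text after it?

Below is an extract from the HTML code that I want to scrape from a web page. Given: <tbody> <tr> <th>SAT Math</th> <td>"541 average"</td> </tr> </tbody> I am using Python and Beautiful Soup for web scraping and extracting 541, but my question is: once I have extracted "541 average", how do I get rid of all the excess (e.g. for GPA), that is, how do I strip "average"? Thank you very much, I would really appreciate any help! (Sorry, I am

Views: 3, asked 2017-10-26, votes: 0

Answer accepted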

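2 answers

How to turn web scraping code into a loop?

I am able to loop the web scraping process, but the data collected from each page replaces the data collected from the previous page, so the Excel file produced contains only the data from the last page. What should I do? from bs4 import BeautifulSoup import requests import pandas as pd print ('all imported successfuly') for x in range(1, 44): link = (f'https://www.trustpilot.com/review/birchbox.com?page={x}') print (link) req

Views: 2, asked 2020-03-04, votes: 0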

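1 answer

Problems with data format using Beautiful Soup

I created a data frame by scraping data with Beautiful Soup. However, there are two problems. Why does the for loop run twice? How do I remove the brackets in the data frame? import urllib.request as req from bs4 import BeautifulSoup import bs4 import requests import pandas as pd url = "https://finance.yahoo.com/quote/BF-B/profile?p=BF-B" root = requests.get(url) soup = BeautifulSoup(root.text, 'ht

Views: 2, asked 2020-09-26, votes: 1

Answer accepted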

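1 answer

"Treating a list of items like a single item" error: how to find the links inside each "link" in already-scraped strings

I am writing Python code to scrape meeting PDFs from this website. The PDF links are inside links, which are themselves inside links. I have the first set of links from the page above, and then I need to scrape the links within the new urls. When I do this, I get the following error: AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()? So far, this is my co

Views: 9, asked 2019-07-11, votes: 0

Answer accepted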

2 answers

Cannot scrape links from DuckDuckGo search results

I want to scrape the first link from the DuckDuckGo search results. I wrote the following code: import requests from bs4 import BeautifulSoup class Bse: def currentPrice(self,symbol): headers = { "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:84.0) Gecko/20100101 Firefox/84.0" }

Views: 6, asked 2021-04-02, votes: 0

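2 answers

I want to scrape with BeautifulSoup in Python, but I get the error 'str' object has no attribute 'str'

I want to scrape with BeautifulSoup in Python, but I get the error 'str' object has no attribute 'str'. The expected result is to assign a number to each value in the array. Here is my code import requests from bs4 import BeautifulSoup url = "https://ja.wikipedia.org/wiki/メインページ" response= requests.get(url) soup = BeautifulSoup(response.content, "html.parser") today = soup.find("di

Views: 0, asked 2021-05-16, votes: 0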

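2 answers

Python Beautiful Soup: iterating over a table

I am trying to scrape table data into a CSV file. Unfortunately, I have hit a roadblock: the following code just repeats the TDs from the first TR in all subsequent TRs. import urllib.request from bs4 import BeautifulSoup f = open('out.txt','w') url = "http://www.international.gc.ca/about-a_propos/atip-aiprp/reports-rapports/2012/02-atip_aiprp.aspx" page = urllib.request.urlopen(u

Views: 1, asked 2012-04-25, votes: 22

Answer accepted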

1 answer

How to scrape the table rows of the top gainers and top losers from Finviz.com with Python and BeautifulSoup or Pandas?

This is as far as I got. How can I scrape the table rows of the top gainers and top losers from Finviz.com with Python and BeautifulSoup or Pandas? import requests from bs4 import BeautifulSoup r=requests.get("https://finviz.com") c=r.content soup = BeautifulSoup(c, "html.parser") table =soup.find("table", {"class": "t-home-table"})

Views: 15, asked 2020-04-09, votes: 1

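1 answer

How to scrape link titles across multiple pages and by a specified tag

I am having trouble figuring out how to use BeautifulSoup to scrape all 100 link titles on the page, since they are under "a href =.". I have tried the code below, but it returns a blank. from bs4 import BeautifulSoup from urllib.request import urlopen import bs4 url = 'https://www150.statcan.gc.ca/n1/en/type/data?count=100' page = urlopen(url) soup = bs4.BeautifulSoup(page,'html.parser

Views: 17, asked 2020-06-02, votes: 0

Answer accepted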

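2 answers

How to put scraped web data into a CSV file

I have scraped data (basically train details such as No, Name, Type, Zone, etc.) from a website using code in a Jupyter notebook. How do I put the result obtained in the "output" into a DataFrame and then into a CSV file? import requests from bs4 import BeautifulSoup import pandas as pd r=requests.get("https://indiarailinfo.com/arrivals/kanpur-central-cnb/452") print(r.text[0:200000]) soup=BeautifulSou

Views: 1, asked 2018-06-06, votes: 0

Answer accepted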

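1 answer

How to split the elements inside a <p> tag when web scraping

I am trying to scrape a url. However, the output is not in the desired format. I only need the branch name and address. How do I split this information out of the p tag? import re import requests from bs4 import BeautifulSoup page = requests.get(url) Branch_list=[] soup = BeautifulSoup(page.content, 'html.parser') for i in soup.find_all('div',class_="

Views: 17, asked 2020-11-09, votes: 1

Answer accepted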

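1 answer

How to capture a table in a structured format from this website with Beautiful Soup and pandas?

I want to scrape a table from this website, and since it is updated hourly, I also want to track the changes. I tried scraping the data with selenium, but it all ends up in a single column without any table. How can I scrape the table in a structured format and track changes with pandas and Beautiful Soup? This is the code I am trying to figure out. import pandas as pd from bs4 import BeautifulSoup soup = BeautifulSoup(html, "html.parser") table = soup.find('table', attrs={'id':'subs noBorders

Views: 7, asked 2020-09-23, votes: 0

Answer accepted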

2 answers

How to scrape a Wikipedia infobox and store it in a CSV file

I have scraped the Wikipedia infobox, but I do not know how to store the data in a CSV file. Please help me. from bs4 import BeautifulSoup as bs from urllib.request import urlopen def infobox(query) : query = query url = 'https://en.wikipedia.org/wiki/'+query raw = urlopen(url) soup = bs(raw) table = soup.find('table',{'

Views: 22, asked 2019-02-20, votes: 0

1 answer

Python web scraping from ASX: cannot get the announcements table

I am trying to scrape the announcements table from the ASX page. However, when I parse the HTML with BeautifulSoup, the table is not there. import requests import pandas as pd from bs4 import BeautifulSoup url='https://www2.asx.com.au/markets/trade-our-cash-market/announcements.cba' page = requests.get(url) soup = BeautifulSoup(page.text, 'html.parser') table = soup.

Views: 17, asked 2021-02-26, votes: 0

3 answers

Python BeautifulSoup table scraping

I am trying to build a table scraper with BeautifulSoup. I wrote this Python code: import urllib2 from bs4 import BeautifulSoup url = "http://dofollow.netsons.org/table1.htm" # change to whatever your url is page = urllib2.urlopen(url).read() soup = BeautifulSoup(page) for i in soup.find_all('form'): print i.attrs[

Views: 0, asked 2013-09-24, votes: 27

Answer accepted

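1 answer

How to select specific table rows by comparing inner table cells with a given criterion?

I want to scrape all the US proxies from the site. I have scraped all the rows, but I cannot select only the proxy records whose country is the US. After that I want to take the individual US proxies with their corresponding ports and save them. from bs4 import BeautifulSoup as bs # loading web page r = requests.get("https://sslproxies.org/") # convert to a beautiful-soup object webpage = bs(r.content, "html.parser") rows = iter(webpage.find('table

Views: 4, asked 2022-06-21, votes: -1

Answer accepted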

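1 answer

Saving a separate CSV for each queried list item

I am new to Python and working on a web scraping script that has a list of sites. Each time the script queries a site from the list, I need to save the result to a separate CSV. Currently it seems to iterate over every site on my list, but it only saves the items from the last query (www.website.com/3) to the CSV. I realize the records list gets reset once the script has gone through it, but shouldn't it save the CSV first? Unless the file is simply being overwritten with the new data, but if that is the case, how do I increment the file name for each query? from typing import Counter import requests from bs4 import BeautifulSoup import sys import

Views: 8, asked 2021-08-06, votes: 0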

1 answer

Error parsing XML with Python

I parsed an XML file with BeautifulSoup and am having trouble extracting data from it. An example of the XML structure is as follows: <Products page="0" pages="-1" records="27"> <Product id="ABC001"> <Name>This product name</Name> <Cur>USD</Cur> <Tag>Text</Tag> <Classes>

Views: 1, asked 2016-08-31, votes: 0

Answer accepted

1 answer

What is the difference between the find_all() function and SoupStrainer in the BeautifulSoup package?

The following code prints to the screen the tags of html_doc, a variable that contains HTML code: from bs4 import SoupStrainer only_a_tags = SoupStrainer("a") print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify()) The code below returns the same result: print(BeautifulSoup(html_doc, "html.parser").find_all("a").prettify()

Views: 1, asked 2017-11-10, votes: 1

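2 answers

How to scrape all the text content of a website's home page?

So I am new to web scraping, and I just want to scrape all the text content of the home page. This is my code, but it does not work properly right now. from bs4 import BeautifulSoup import requests website_url = "http://www.traiteurcheminfaisant.com/" ra = requests.get(website_url) soup = BeautifulSoup(ra.text, "html.parser") full_text = soup.find_all() print(full_text) When I print " fu

Views: 12, asked 2020-03-01, votes: 0

Answer accepted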