I'm trying to automatically scrape all of the data from every table on a website and output each table to its own tab in Excel.
I've been working from the code currently available in https://www.thepythoncode.com/article/convert-html-tables-into-csv-files-in-python, https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059, and questions such as Python - Web Scraping HTML table and printing to CSV.
With this URL I'm having a hard time extracting both the underlying data and the table headers. The HTML formatting is very dense, which makes it difficult for me to pull the tables out with the correct structure.
My current code:
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import re
import html2text
import requests
import pandas as pd

USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
# US english
LANGUAGE = "en-US,en;q=0.5"

def get_soup(url):
    """Constructs and returns a soup using the HTML content of `url` passed"""
    # initialize a session
    session = requests.Session()
    # set the User-Agent as a regular browser
    session.headers['User-Agent'] = USER_AGENT
    # request for english content (optional)
    session.headers['Accept-Language'] = LANGUAGE
    session.headers['Content-Language'] = LANGUAGE
    # make the request
    html = session.get(url)
    # return the soup
    return bs(html.content, "html.parser")

def get_all_tables(soup):
    """Extracts and returns all tables in a soup object"""
    return soup.find_all("table")

def get_table_headers(table):
    """Given a table soup, returns all the headers"""
    headers = []
    for th in table.find("tr").find_all("th"):
        headers.append(th.text.strip())
    return headers

def get_table_rows(table):
    """Given a table, returns all its rows"""
    rows = []
    for tr in table.find_all("tr")[1:]:
        cells = []
        # grab all td tags in this table row
        tds = tr.find_all("td")
        if len(tds) == 0:
            # if no td tags, search for th tags
            # can be found especially in wikipedia tables below the table
            ths = tr.find_all("th")
            for th in ths:
                cells.append(th.text.strip())
        else:
            # use regular td tags
            for td in tds:
                cells.append(td.text.strip())
        rows.append(cells)
    return rows

def save_as_csv(table_name, headers, rows):
    pd.DataFrame(rows, columns=headers).to_csv(f"{table_name}.csv")

def main(url):
    # get the soup
    soup = get_soup(url)
    # extract all the tables from the web page
    tables = get_all_tables(soup)
    print(f"[+] Found a total of {len(tables)} tables.")
    # iterate over all tables
    for i, table in enumerate(tables, start=1):
        # get the table headers
        headers = get_table_headers(table)
        # get all the rows of the table
        rows = get_table_rows(table)
        # save table as csv file
        table_name = f"table-{i}"
        print(f"[+] Saving {table_name}")
        save_as_csv(table_name, headers, rows)

main("https://www.sec.gov/Archives/edgar/data/1701605/000170160519000089/bkr-2019093010xq.htm")
For example, I need the code to identify a table like the one in the attached image and put all of its information into Excel format.
The code from questions like Extract HTML Tables With Similar Data from Different Sources with Different Formatting - Python and Extract HTML Table Based on Specific Column Headers - Python can search a URL, but the criteria it looks for are too specific, since I need every table at the URL.
Any help would be greatly appreciated! I'm sure there's an elegant solution I'm not seeing.
Posted on 2020-04-15 20:37:23
I took a look. The URL in your post relies heavily on JavaScript to populate the page with its elements, which is why BeautifulSoup can't see it. The template HTML has 12 tables, and they all initially look like this:
<table class="table table-striped table-sm">
<tbody id="form-information-modal-carousel-page-1">
<!-- Below is populated dynamically VIA JS -->
<tr>
<td class="text-center">
<i class="fas fa-spinner fa-spin"></i>
</td>
</tr>
</tbody>
</table>
</div>
<div class="carousel-item table-responsive">
<table class="table table-striped table-bordered table-sm">
<tbody id="form-information-modal-carousel-page-2">
<!-- Below is populated dynamically VIA JS -->
...
Note the comment <!-- Below is populated dynamically VIA JS -->. Basically, none of the interesting data is in this HTML. I logged my network traffic, and the page makes two XHR requests. One of them looked promising, namely MetaLinks.json. It's large, but unfortunately the table data isn't in it (still interesting, and probably useful for other things). The other XHR resource is an actual HTML document with the table data baked in. JSON would have been nicer, since we wouldn't need BeautifulSoup to parse it, but oh well. That HTML is what we actually want to scrape. We don't want to scrape the URL you provided (the interactive Inline XBRL viewer) - it in fact uses the HTML XHR resource to populate itself. This is the same HTML you see when you click the hamburger menu in the top-left corner of the Inline XBRL viewer and choose "Open as HTML". If you're having trouble finding it, here is the URL: https://www.sec.gov/Archives/edgar/data/1701605/000170160519000089/bkr-2019093010xq.htm
EDIT - Here's a small example. I'm just picking some numbers out of the tables:
def main():
    import requests
    from bs4 import BeautifulSoup

    url = "https://www.sec.gov/Archives/edgar/data/1701605/000170160519000089/bkr-2019093010xq.htm"
    response = requests.get(url, headers={})
    response.raise_for_status()

    soup = BeautifulSoup(response.content, "html.parser")
    for table in soup.find_all("table"):
        for row in table.find_all("tr"):
            for data in row.find_all("ix:nonfraction"):
                print(data.text, end=" ")
            print()
        print()
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
3,339 3,142 9,886 9,421
2,543 2,523 7,604 7,191
5,882 5,665 17,490 16,612
2,901 2,819 8,647 8,371
1,880 1,873 5,705 5,491
679 608 2,083 1,944
71 66 183 374
54 17 128 113
5,585 5,383 16,746 16,293
297 282 744 319
14 6 124 51
59 55 174 164
224 233 446 206
— 85 — 139
107 110 269 86
117 38 177 19
60 25 97 83
57 13 80 64
...
The actual output is much longer than what I've shown, but you get the idea. Also, I'm not pulling every relevant number out of the tables, since I'm only looking at ix:nonfraction tags and there are other kinds (decimal figures, for example). The HTML really is dense - you'll have to work out how to get all the other fields from each row, handle the non-ix:nonfraction tags, deal with empty columns, and so on.
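As a starting point toward your Excel goal, here is a rough sketch that rebuilds whole rows from the raw td/th cells and dumps every table into its own sheet of one workbook. It's an assumption-laden sketch, not a finished solution: it assumes openpyxl is installed for pandas' .xlsx support, does no header detection or colspan handling, and will include layout-only tables too:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.sec.gov/Archives/edgar/data/1701605/000170160519000089/bkr-2019093010xq.htm"
response = requests.get(url, headers={})
response.raise_for_status()
soup = BeautifulSoup(response.content, "html.parser")

with pd.ExcelWriter("bkr-tables.xlsx") as writer:  # needs openpyxl
    for i, table in enumerate(soup.find_all("table"), start=1):
        # one list of cell strings per <tr>; empty cells stay as ""
        rows = [
            [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
            for tr in table.find_all("tr")
        ]
        rows = [row for row in rows if any(row)]  # drop all-empty rows
        if not rows:
            continue
        # ragged rows get padded with NaN by pandas; clean up in Excel
        pd.DataFrame(rows).to_excel(
            writer, sheet_name=f"table-{i}", index=False, header=False
        )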
https://stackoverflow.com/questions/61234991