I'm trying to automatically scrape all of the data from every table on a website and output each table to its own tab in Excel.
I've been working from the code currently available in https://www.thepythoncode.com/article/convert-html-tables-into-csv-files-in-python, https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059, and questions such as Python - Web Scraping HTML table and printing to CSV.
With this URL I'm having a hard time extracting both the underlying data and the table headers. The HTML formatting is very dense, which makes it difficult for me to pull the tables out with the correct structure.
My current code:
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import re
import html2text
import requests
import pandas as pd

USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
# US english
LANGUAGE = "en-US,en;q=0.5"

def get_soup(url):
    """Constructs and returns a soup using the HTML content of `url` passed"""
    # initialize a session
    session = requests.Session()
    # set the User-Agent as a regular browser
    session.headers['User-Agent'] = USER_AGENT
    # request for english content (optional)
    session.headers['Accept-Language'] = LANGUAGE
    session.headers['Content-Language'] = LANGUAGE
    # make the request
    html = session.get(url)
    # return the soup
    return bs(html.content, "html.parser")

def get_all_tables(soup):
    """Extracts and returns all tables in a soup object"""
    return soup.find_all("table")

def get_table_headers(table):
    """Given a table soup, returns all the headers"""
    headers = []
    for th in table.find("tr").find_all("th"):
        headers.append(th.text.strip())
    return headers

def get_table_rows(table):
    """Given a table, returns all its rows"""
    rows = []
    for tr in table.find_all("tr")[1:]:
        cells = []
        # grab all td tags in this table row
        tds = tr.find_all("td")
        if len(tds) == 0:
            # if no td tags, search for th tags
            # can be found especially in wikipedia tables below the table
            ths = tr.find_all("th")
            for th in ths:
                cells.append(th.text.strip())
        else:
            # use regular td tags
            for td in tds:
                cells.append(td.text.strip())
        rows.append(cells)
    return rows

def save_as_csv(table_name, headers, rows):
    pd.DataFrame(rows, columns=headers).to_csv(f"{table_name}.csv")

def main(url):
    # get the soup
    soup = get_soup(url)
    # extract all the tables from the web page
    tables = get_all_tables(soup)
    print(f"[+] Found a total of {len(tables)} tables.")
    # iterate over all tables
    for i, table in enumerate(tables, start=1):
        # get the table headers
        headers = get_table_headers(table)
        # get all the rows of the table
        rows = get_table_rows(table)
        # save table as csv file
        table_name = f"table-{i}"
        print(f"[+] Saving {table_name}")
        save_as_csv(table_name, headers, rows)

main("https://www.sec.gov/Archives/edgar/data/1701605/000170160519000089/bkr-2019093010xq.htm")
For example, I need the code to identify a table like the one in the attached image and put all of its information into Excel format.
The code from questions like Extract HTML Tables With Similar Data from Different Sources with Different Formatting - Python and Extract HTML Table Based on Specific Column Headers - Python can search a URL, but the criteria it looks for are too specific, since I need every table at the URL.
Any help would be greatly appreciated! I'm sure there's an elegant solution I'm not seeing.
Posted on 2020-04-15 20:37:23
I took a look. The URL in your post relies heavily on JavaScript to populate the page with its elements, which is why BeautifulSoup can't see it. The template HTML has 12 tables, and they all initially look like this:
<table class="table table-striped table-sm">
<tbody id="form-information-modal-carousel-page-1">
<!-- Below is populated dynamically VIA JS -->
<tr>
<td class="text-center">
<i class="fas fa-spinner fa-spin"></i>
</td>
</tr>
</tbody>
</table>
</div>
<div class="carousel-item table-responsive">
<table class="table table-striped table-bordered table-sm">
<tbody id="form-information-modal-carousel-page-2">
<!-- Below is populated dynamically VIA JS -->
...
Note the comment <!-- Below is populated dynamically VIA JS -->. Basically, none of the interesting data is in this HTML. I logged my network traffic, and the page makes two XHR requests. One of them looked promising, namely MetaLinks.json. It's large, but unfortunately the table data isn't in it (still interesting, and probably useful for other things). The other XHR resource is an actual HTML document with the table data baked in. JSON would have been nicer, since we wouldn't need BeautifulSoup to parse it, but oh well. That HTML is what we actually want to scrape. We don't want to scrape the URL you provided (the interactive Inline XBRL viewer) - it in fact uses the HTML XHR resource to populate itself. This is the same HTML you see when you click the hamburger menu in the top-left corner of the Inline XBRL viewer and choose "Open as HTML". If you're having trouble finding it, here is the URL: https://www.sec.gov/Archives/edgar/data/1701605/000170160519000089/bkr-2019093010xq.htm
EDIT - Here's a small example. I'm just picking some numbers out of the tables:
def main():
    import requests
    from bs4 import BeautifulSoup

    url = "https://www.sec.gov/Archives/edgar/data/1701605/000170160519000089/bkr-2019093010xq.htm"
    response = requests.get(url, headers={})
    response.raise_for_status()

    soup = BeautifulSoup(response.content, "html.parser")
    for table in soup.find_all("table"):
        for row in table.find_all("tr"):
            for data in row.find_all("ix:nonfraction"):
                print(data.text, end=" ")
            print()
        print()
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
3,339 3,142 9,886 9,421
2,543 2,523 7,604 7,191
5,882 5,665 17,490 16,612
2,901 2,819 8,647 8,371
1,880 1,873 5,705 5,491
679 608 2,083 1,944
71 66 183 374
54 17 128 113
5,585 5,383 16,746 16,293
297 282 744 319
14 6 124 51
59 55 174 164
224 233 446 206
— 85 — 139
107 110 269 86
117 38 177 19
60 25 97 83
57 13 80 64
...
The actual output is much longer than what I've shown, but you get the idea. Also, I'm not pulling every relevant number out of the tables, since I'm only looking at ix:nonfraction tags and there are other kinds (decimal figures, for example). The HTML really is dense - you'll have to work out how to get all the other fields from each row, handle the non-ix:nonfraction tags, deal with empty columns, and so on.
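As a starting point toward your Excel goal, here is a rough sketch that rebuilds whole rows from the raw td/th cells and dumps every table into its own sheet of one workbook. It's an assumption-laden sketch, not a finished solution: it assumes openpyxl is installed for pandas' .xlsx support, does no header detection or colspan handling, and will include layout-only tables too:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.sec.gov/Archives/edgar/data/1701605/000170160519000089/bkr-2019093010xq.htm"
response = requests.get(url, headers={})
response.raise_for_status()
soup = BeautifulSoup(response.content, "html.parser")

with pd.ExcelWriter("bkr-tables.xlsx") as writer:  # needs openpyxl
    for i, table in enumerate(soup.find_all("table"), start=1):
        # one list of cell strings per <tr>; empty cells stay as ""
        rows = [
            [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
            for tr in table.find_all("tr")
        ]
        rows = [row for row in rows if any(row)]  # drop all-empty rows
        if not rows:
            continue
        # ragged rows get padded with NaN by pandas; clean up in Excel
        pd.DataFrame(rows).to_excel(
            writer, sheet_name=f"table-{i}", index=False, header=False
        )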
https://stackoverflow.com/questions/61234991