Q: Python: grabbing the contents of the first div tag with a given class

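Stack Overflow user

Asked 2014-02-21 18:24:11

1 answer · 8.6K views · 0 followers · 0 votes

I am using Python 3.3 and this website: http://www.nasdaq.com/markets/ipos/

My goal is to read only the companies with upcoming IPOs. That data sits in a div tag with class "genTable thin floatL". Two divs on the page have this class, and the data I want is in the first one.

Here is my code: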

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
for divparent in soup.find_all('div', attrs={'class':'genTable thin floatL'}) [0]: # I tried putting a [0] so it will only return divs in the first genTable thin floatL class
    for div in soup.find_all('div', attrs={'class':'ipo-cell-height'}):
        s = div.string
        if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
            div_next = div.find_next('div')
            print('{} - {}'.format(s, div_next.string))

I want it to return only:

3/7/2014 - RECRO PHARMA, INC.
2/28/2014 - VARONIS SYSTEMS INC
2/27/2014 - LUMENIS LTD
2/21/2014 - SUNDANCE ENERGY AUSTRALIA LTD
2/21/2014 - SEMLER SCIENTIFIC, INC.

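Instead, it prints every div on the page that matches the re.match pattern, and prints each one several times. I added the for divparent loop hoping to retrieve only the first block, but that is what introduced the repetition.

Edit: Here is the updated code based on warunsl's solution. This works.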

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)

divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
table = divparent.find('table')
for div in table.find_all('div', attrs={'class':'ipo-cell-height'}):
    s = div.string
    if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
        div_next = div.find_next('div')
        print('{} - {}'.format(s, div_next.string))

python

web-scraping

beautifulsoup

python-3.3

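1 Answer

Stack Overflow user

Accepted answer

Posted 2014-02-21 18:49:04

You mentioned that two elements match the 'class':'genTable thin floatL' criterion. Since you only want the first of them, it makes no sense to run a for loop over that first element.

So replace the outer for loop with: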
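To see why the original loop repeats its output, here is a minimal sketch on a small made-up HTML snippet (the markup is an assumption for illustration, not the real NASDAQ page). Iterating over a single Tag walks its direct children, so an inner soup.find_all runs once per child and scans the whole document each time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = (
    '<div class="genTable thin floatL"><p>a</p><p>b</p><p>c</p></div>'
    '<div class="genTable thin floatL"><p>d</p></div>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find_all("div", attrs={"class": "genTable thin floatL"})[0]

# Iterating a Tag yields its direct children (here, three <p> tags).
children = list(first)

# The inner soup.find_all ignores the loop variable and searches the whole
# document, so the same four <p> tags come back on every pass.
hits = [p.string for _child in first for p in soup.find_all("p")]

print(len(children), len(hits))
```

The inner search is repeated once for each child of the first div, which is the duplication described in the question.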

divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]

Now you should no longer call soup.find_all, because that searches the entire document. You need to limit the search to divparent. So you need:

table = divparent.find('table')

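The rest of the code that extracts the dates and company names stays the same, except that it searches from the table variable instead of soup. For example, to walk every cell in that table: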

for row in table.find_all('tr'):
    for data in row.find_all('td'):
        print(data.string)  # print() call, since the question uses Python 3.3
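Putting the pieces together, the following is a self-contained sketch of the scoped search. It runs on inline markup that only imitates the structure described in the question, so the HTML, the placeholder row in the second block, and the helper name upcoming_ipos are all assumptions; the live page may well look different.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure described in the question.
# "SOME OTHER CO" is a placeholder row for the second, unwanted block.
html = """
<div class="genTable thin floatL">
  <table>
    <tr><td><div class="ipo-cell-height">3/7/2014</div></td>
        <td><div class="ipo-cell-height">RECRO PHARMA, INC.</div></td></tr>
    <tr><td><div class="ipo-cell-height">2/28/2014</div></td>
        <td><div class="ipo-cell-height">VARONIS SYSTEMS INC</div></td></tr>
  </table>
</div>
<div class="genTable thin floatL">
  <table>
    <tr><td><div class="ipo-cell-height">1/2/2014</div></td>
        <td><div class="ipo-cell-height">SOME OTHER CO</div></td></tr>
  </table>
</div>
"""

DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}$")


def upcoming_ipos(markup):
    """Return (date, company) pairs from the first matching div only."""
    soup = BeautifulSoup(markup, "html.parser")
    # find() returns the first match, the same element as find_all(...)[0]
    # when a match exists, and None otherwise.
    first = soup.find("div", attrs={"class": "genTable thin floatL"})
    if first is None:
        return []
    results = []
    # Search from `first`, not from `soup`, so the second block is never visited.
    for div in first.find_all("div", attrs={"class": "ipo-cell-height"}):
        text = div.string
        if text and DATE_RE.match(text):
            name_div = div.find_next("div")
            results.append((text, name_div.string))
    return results


for date, name in upcoming_ipos(html):
    print("{} - {}".format(date, name))
```

The `if text` guard is an addition: div.string is None for a div with more than one child, and re.match would raise a TypeError on it.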

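Hope that helps.

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: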

https://stackoverflow.com/questions/21942286
