
Beautiful Soup 4's extract() changes the tag to NoneType

Stack Overflow user
Asked 2015-10-29 05:28:44
Answers: 1 · Views: 106 · Followers: 0 · Votes: 0

I'm trying to scrape an item's name, price, and description from a web page.

Here is the HTML:

...
<div id="ProductDesc">
                            <a href="javascript:void(0);" onClick="loadStyle('imageView','http://tendeep.vaesite.net/__data/03cc09aa3700a50b17caf5963821f603.jpg', '',  '',  '')"><h5 id="productTitle">Split Sport Longsleeve T-shirt</h5></a>
                            <h5 id="productPrice">$42.00</h5>
                            <br style="clear:both;" /><br />
                        Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>
...

Here is the code I have so far:

line = soup.find(id="ProductDesc")
name = line.h5.extract()
print name.get_text()
price = line.h5.extract()
print price.get_text()
desc = line.get_text()
print desc

It produces the following output:

Split Sport Longsleeve T-shirt
$42.00

Followed by this error:

Traceback (most recent call last):
  ...
  File "/home/myfile.py", line 35, in siftInfo
    print line.get_text()
  File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 901, in get_text
    strip, types=types)])
  File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 876, in _all_strings
    for descendant in self.descendants:
  File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 1273, in descendants
    current = current.next_element
AttributeError: 'NoneType' object has no attribute 'next_element'

The output I want:

Split Sport Longsleeve T-shirt
$42.00
Style # 53TD4141 Screenprinted longsleeve cotton tee.

Note:

If I print line instead of line.get_text(), it returns:

Split Sport Longsleeve T-shirt
$42.00
<div id="ProductDesc">
                            <a href="javascript:void(0);" onclick="loadStyle('imageView','http://tendeep.vaesite.net/__data/03cc09aa3700a50b17caf5963821f603.jpg', '',  '',  '')"></a>

                            <br style="clear:both;"/><br/>
                        Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>

Edit 1:

If I omit the two lines that handle the price and add some whitespace cleanup, I get the following.

New code:

line = soup.find(id="ProductDesc")
name = line.h5.extract()
print name.get_text()
desc = line.get_text()
print (' ').join(desc.split())

Output:

Split Sport Longsleeve T-shirt
$42.00 Style # 53TD4141 Screenprinted longsleeve cotton tee.

So the second line.h5.extract() somehow changes the type of line, while the first one does not.
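
The whitespace cleanup in Edit 1 relies on str.split() with no arguments, which splits on any run of whitespace and drops empty pieces. Below is a minimal sketch of that step on its own; the helper name normalize_whitespace and the sample string are hypothetical and only imitate what get_text() might return for the div.

```python
def normalize_whitespace(text):
    """Collapse each run of spaces, tabs, and newlines into a single space."""
    # split() without arguments discards leading/trailing whitespace
    # and treats consecutive whitespace characters as one separator.
    return " ".join(text.split())


# Hypothetical sample text, shaped like the div's text after only the
# first <h5> has been extracted (the price is still present).
raw = "\n\n   $42.00\n \n\n        Style # 53TD4141 Screenprinted longsleeve cotton tee."

print(normalize_whitespace(raw))
```

Because only the name was extracted in Edit 1, the price is still part of the div's text, which is why it appears on the same line as the description.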


1 Answer

Stack Overflow user

Accepted answer

Posted 2015-11-01 16:19:58

This doesn't format well as a comment, so I'm posting it here. This is the code I ran and its output:

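from bs4 import BeautifulSoup
from urllib.request import urlopen

def mainTest():
    url = "http://10deep.com/store/split-sport-longsleeve-t-shirt"
    print("url is: " + url)
    # urlopen is imported by name above, so call it directly
    page = urlopen(url)

    # No parser is named here, so Beautiful Soup picks the best one installed
    soup = BeautifulSoup(page.read())
    line = soup.find(id="ProductDesc")
    # Each extract() removes the first remaining <h5> from the div and returns it
    name = line.h5.extract()
    print(name.get_text())
    price = line.h5.extract()
    print(price.get_text())
    # Whatever text is left in the div is the description
    desc = line.get_text()
    print(desc)

mainTest()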

Output:

C:\Python34\python.exe C:/{path}/testPython.py
url is: http://10deep.com/store/split-sport-longsleeve-t-shirt
Split Sport Longsleeve T-shirt
$42.00




                        Style # 53TD4141 Screenprinted longsleeve cotton tee.

Process finished with exit code 0
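
For anyone who cannot reproduce this run, here is a hedged sketch of one alternative that reads the three values without calling extract(), so the tree is never modified. It is not part of the original answer. The function name sift_info, the explicit "html.parser" argument, and the idea that the description is the last non-empty string in the div are all assumptions based on the HTML fragment shown in the question; the onClick attribute is omitted for brevity.

```python
from bs4 import BeautifulSoup

# HTML fragment adapted from the question (onClick attribute left out).
HTML = """
<div id="ProductDesc">
    <a href="javascript:void(0);"><h5 id="productTitle">Split Sport Longsleeve T-shirt</h5></a>
    <h5 id="productPrice">$42.00</h5>
    <br style="clear:both;" /><br />
    Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>
"""


def sift_info(html, parser="html.parser"):
    """Return (name, price, description) without modifying the parsed tree.

    Hypothetical helper: the parser choice is an assumption, and the
    description is assumed to be the last non-empty string in the div.
    """
    soup = BeautifulSoup(html, parser)
    block = soup.find(id="ProductDesc")
    # Look the two headings up by their ids instead of extracting them.
    name = block.find(id="productTitle").get_text(strip=True)
    price = block.find(id="productPrice").get_text(strip=True)
    # stripped_strings yields each text node stripped, skipping empty ones.
    desc = list(block.stripped_strings)[-1]
    return name, price, desc


name, price, desc = sift_info(HTML)
print(name)
print(price)
print(desc)
```

Naming the parser explicitly keeps the parse tree the same across machines, which may matter when two environments give different results for the same page.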
Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/33406556
