I am trying to scrape the name, price, and description of an item from a web page.

Here is the HTML:
...
<div id="ProductDesc">
<a href="javascript:void(0);" onClick="loadStyle('imageView','http://tendeep.vaesite.net/__data/03cc09aa3700a50b17caf5963821f603.jpg', '', '', '')"><h5 id="productTitle">Split Sport Longsleeve T-shirt</h5></a>
<h5 id="productPrice">$42.00</h5>
<br style="clear:both;" /><br />
Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>
...

Here is the code I have so far:
line = soup.find(id="ProductDesc")
name = line.h5.extract()
print name.get_text()
price = line.h5.extract()
print price.get_text()
desc = line.get_text()
print desc

It produces the following output:
Split Sport Longsleeve T-shirt
$42.00

Followed by this error:
Traceback (most recent call last):
...
File "/home/myfile.py", line 35, in siftInfo
print line.get_text()
File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 901, in get_text
strip, types=types)])
File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 876, in _all_strings
for descendant in self.descendants:
File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 1273, in descendants
current = current.next_element
AttributeError: 'NoneType' object has no attribute 'next_element'

The output I want:
Split Sport Longsleeve T-shirt
$42.00
Style # 53TD4141 Screenprinted longsleeve cotton tee.

Note:
If I print line instead of line.get_text(), it returns:
Split Sport Longsleeve T-shirt
$42.00
<div id="ProductDesc">
<a href="javascript:void(0);" onclick="loadStyle('imageView','http://tendeep.vaesite.net/__data/03cc09aa3700a50b17caf5963821f603.jpg', '', '', '')"></a>
<br style="clear:both;"/><br/>
Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>

Edit 1:
If I omit the two lines that deal with the price and add some whitespace handling, I get the following.

New code:
line = soup.find(id="ProductDesc")
name = line.h5.extract()
print name.get_text()
desc = line.get_text()
print (' ').join(desc.split())

Output:
Split Sport Longsleeve T-shirt
$42.00 Style # 53TD4141 Screenprinted longsleeve cotton tee.

So the second line.h5.extract() somehow changes the type of line, but the first one does not.
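The observation above can be checked in isolation. The sketch below is a minimal harness, assuming Python 3 and a recent bs4 with the built-in html.parser; that differs from the Python 2.7 setup in the traceback, so the AttributeError may well not reproduce. HTML and after_extracts are names invented for illustration. It reports, after each number of extract() calls, whether line is still a Tag, which <h5> line.h5 now resolves to, and what text remains.

```python
from bs4 import BeautifulSoup, Tag

# The product block from the question, without the elided "..." context.
HTML = """<div id="ProductDesc">
<a href="javascript:void(0);" onClick="loadStyle('imageView','http://tendeep.vaesite.net/__data/03cc09aa3700a50b17caf5963821f603.jpg', '', '', '')"><h5 id="productTitle">Split Sport Longsleeve T-shirt</h5></a>
<h5 id="productPrice">$42.00</h5>
<br style="clear:both;" /><br />
Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>"""


def after_extracts(html, count):
    """Hypothetical helper: extract the first <h5> `count` times, then inspect the div."""
    soup = BeautifulSoup(html, "html.parser")
    line = soup.find(id="ProductDesc")
    for _ in range(count):
        line.h5.extract()  # detaches the first remaining <h5> from the tree
    # line.h5 is None once no <h5> is left inside the div
    next_h5 = line.h5["id"] if line.h5 is not None else None
    remaining_text = " ".join(line.get_text().split())
    return isinstance(line, Tag), next_h5, remaining_text


if __name__ == "__main__":
    for count in range(3):
        print(count, after_extracts(HTML, count))
```

Under these assumptions, one extract() leaves the price in the remaining text, which matches the output in Edit 1, and line stays a Tag after both calls. That would suggest the error comes from traversing the tree rather than from line changing type, though this sketch cannot confirm it for the original environment.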
Posted on 2015-11-01 16:19:58

This would not format well as a comment, so I am putting it here. This is the code I ran, and its output:
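from bs4 import BeautifulSoup
from urllib.request import urlopen


def mainTest():
    url = "http://10deep.com/store/split-sport-longsleeve-t-shirt"
    print("url is: " + url)
    # urlopen is imported by name above, so call it directly
    page = urlopen(url)
    # No parser is named here, so bs4 picks the best one installed
    soup = BeautifulSoup(page.read())
    line = soup.find(id="ProductDesc")
    # Each extract() removes the first remaining <h5> from the div
    name = line.h5.extract()
    print(name.get_text())
    price = line.h5.extract()
    print(price.get_text())
    # With both <h5> tags gone, only the description text is left
    desc = line.get_text()
    print(desc)


mainTest()

Output: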
C:\Python34\python.exe C:/{path}/testPython.py
url is: http://10deep.com/store/split-sport-longsleeve-t-shirt
Split Sport Longsleeve T-shirt
$42.00
Style # 53TD4141 Screenprinted longsleeve cotton tee.
Process finished with exit code 0

https://stackoverflow.com/questions/33406556
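Since this run completes without the AttributeError, the failure in the question may depend on the environment, for instance the installed bs4 version or the parser in use; that is an inference, not something the thread confirms. Whatever the cause, a variant that never calls extract() avoids modifying the tree at all. The sketch below assumes Python 3, the built-in html.parser, and the element ids shown in the question; HTML and sift_info are illustrative names, not part of the original code.

```python
from bs4 import BeautifulSoup, NavigableString

# The product block from the question, without the elided "..." context.
HTML = """<div id="ProductDesc">
<a href="javascript:void(0);" onClick="loadStyle('imageView','http://tendeep.vaesite.net/__data/03cc09aa3700a50b17caf5963821f603.jpg', '', '', '')"><h5 id="productTitle">Split Sport Longsleeve T-shirt</h5></a>
<h5 id="productPrice">$42.00</h5>
<br style="clear:both;" /><br />
Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>"""


def sift_info(html):
    """Hypothetical helper: return (name, price, description) without altering the tree."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find(id="ProductDesc")
    # Look the two <h5> tags up by id instead of extracting them in order
    name = block.find(id="productTitle").get_text(strip=True)
    price = block.find(id="productPrice").get_text(strip=True)
    # The description is the only text sitting directly inside the <div>,
    # so keep the div's own string children and skip whitespace-only ones
    parts = [
        child.strip()
        for child in block.children
        if isinstance(child, NavigableString) and child.strip()
    ]
    return name, price, " ".join(parts)


if __name__ == "__main__":
    for field in sift_info(HTML):
        print(field)
```

Looking elements up by id also removes the dependence on the order of the <h5> tags, at the cost of assuming those ids stay stable on the page.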