<div id="div_1">
<p class="keywords">
<strong> Those are the main keywords </strong>
<ol>
<li>Decentralization</li>
<li>Planning</li>
</ol>
</p>
</div>
<div id="div_2">
<p class="keywords">
<strong>This is the first paragraph of the second div </strong>
<strong>This is the second paragraph of the second div </strong>
</p>
</div>
<div id="div_3">
<p> This is the first paragraph of the second div </p>
</div>
Desired output, with the text of each div on a single line:
Those are the main keywords Decentralization Planning
This is the first paragraph of the second div This is the second paragraph of the second div
This is the first paragraph of the third div
Here is my code:
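import re
from bs4 import BeautifulSoup

# document is the path to the HTML file shown above
with open(document, encoding="utf8") as f:
    soup = BeautifulSoup(f, "html.parser")
myDivs = soup.find_all("div", id=re.compile("^div_"))
for div in myDivs:
    txt = div.text + "\n"
    print(txt)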
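This returns the text of each <div>, but the text of each nested tag (such as <p>) ends up on its own line instead of everything from one div being on a single line.
Do you know how I can do this?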
Posted on 2020-04-16 07:01:17
Yes, run a for loop over the <p> elements inside each div (div > p):
<html>
<head></head>
<body>
<div id="div_1">
<p class="keywords">
<strong> Those are the main keywords </strong>
<ol>
<li>Decentralization</li>
<li>Planning</li>
</ol>
</p>
</div>
<div id="div_2">
<p class="keywords">
<strong>This is the first paragraph of the second div </strong>
<strong>This is the second paragraph of the second div </strong>
</p>
</div>
<div id="div_3">
<p> This is the first paragraph of the second div </p>
</div>
</body>
</html>
from bs4 import BeautifulSoup

url = r"D:\Temp\example.html"
with open(url, "r") as page:
    contents = page.read()

html = BeautifulSoup(contents, 'html.parser')
html_body = html.find('body')
elements = html.find_all('div')
for div in elements:
    # Collect the text of every <p> inside this div
    p = div.find_all('p')
    text = [i.text for i in p]
    print(text)
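Note that the loop above prints a Python list per div, and each string still contains the newlines from the markup. A minimal sketch of one way to collapse that into a single line per div, assuming the same html.parser backend: the helper name paragraphs_per_div and the shortened SAMPLE markup are hypothetical, introduced only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, shortened from the HTML in the question
SAMPLE = """
<div id="div_1">
<p class="keywords">
<strong> Those are the main keywords </strong>
<ol>
<li>Decentralization</li>
<li>Planning</li>
</ol>
</p>
</div>
<div id="div_3">
<p> This is the first paragraph of the second div </p>
</div>
"""


def paragraphs_per_div(markup):
    """Return one whitespace-normalized line per div whose id starts with "div_"."""
    soup = BeautifulSoup(markup, "html.parser")
    lines = []
    for div in soup.select('div[id^="div_"]'):
        # recursive=False keeps only <p> tags that are direct children (div > p)
        paragraphs = div.find_all("p", recursive=False)
        # str.split() with no arguments drops newlines and repeated spaces
        parts = [" ".join(p.get_text().split()) for p in paragraphs]
        lines.append(" ".join(parts))
    return lines


for line in paragraphs_per_div(SAMPLE):
    print(line)
```

Splitting and rejoining on whitespace is a plain-Python choice here; it does not depend on any BeautifulSoup option beyond get_text().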
Posted on 2020-04-16 07:10:27
import re
from bs4 import BeautifulSoup
html = """
<div id="div_1">
<p class="keywords">
<strong> Those are the main keywords </strong>
<ol>
<li>Decentralization</li>
<li>Planning</li>
</ol>
</p>
</div>
<div id="div_2">
<p class="keywords">
<strong>This is the first paragraph of the second div </strong>
<strong>This is the second paragraph of the second div </strong>
</p>
</div>
<div id="div_3">
<p> This is the first paragraph of the second div </p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
for item in soup.findAll("div", id=re.compile("^div_")):
    target = [a.get_text(strip=True, separator=" ") for a in item.findAll("p")]
    print(*target)
Output:
Those are the main keywords Decentralization Planning
This is the first paragraph of the second div This is the second paragraph of the second div
This is the first paragraph of the second div
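The answer above only collects text that sits inside a <p>. If a div might also hold text outside any <p>, one possible variation is to call get_text() on the div itself. This is a sketch rather than the answerer's method: the function name div_lines and the sample string are hypothetical.

```python
import re

from bs4 import BeautifulSoup


def div_lines(markup):
    """Return the text of every div whose id starts with "div_", one line each."""
    soup = BeautifulSoup(markup, "html.parser")
    return [
        # strip=True trims each text fragment and skips empty ones;
        # separator=" " joins the remaining fragments with a single space
        div.get_text(separator=" ", strip=True)
        for div in soup.find_all("div", id=re.compile("^div_"))
    ]


# Hypothetical sample with text both inside and outside a <p>
sample = '<div id="div_2"><p><strong>first </strong>\n<strong>second </strong></p>loose text</div>'
print("\n".join(div_lines(sample)))
```

Because the call is made on the div, the result does not depend on how the parser nests <ol> inside <p>.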
https://stackoverflow.com/questions/61239444