Q: .get_text() not working properly on a span with Beautiful Soup
Stack Overflow user

Asked 2020-05-12 07:30:06

3 answers · 55 views · 0 followers · 0 votes

I am trying to extract the body of the following article. This is the code I am using:

from bs4 import BeautifulSoup
import requests

a_url = "https://www.business-standard.com/article/current-affairs/up-plans-100-000-covid-beds-as-325-000-stranded-labourers-return-in-2-weeks-120051100865_1.html"
# `headers` was not defined in the original post; this User-Agent value is an assumed placeholder.
headers = {"User-Agent": "Mozilla/5.0"}
y = requests.get(a_url, headers=headers)
soup2 = BeautifulSoup(y.content, 'html.parser')
body = soup2.find('span', class_="p-content").get_text()

I expected to get only the article text, but this is the output:

\nspan.p-content div[id^="div-gpt"]{line-height:0;font-size:0}\n\r\n\tAmid the large-scale 
influx of migrant labourers due to the lockdown, the Uttar Pradesh government is planning to make arrangements for 100,000 covid-19 beds across the state.\n\r\n\tAs commercial and 
industrial activity in UP has started reviving under the controlled relaxations announced by the 
government, the state is gearing up to deal with exigencies, with more than a million migrants
 expected to arrive in the near future.document.write("<!--");if(isUserBanner=="free"&&
(displayConBanner==1))document.write("-->");googletag.cmd.push(function()
{googletag.defineOutOfPageSlot(\'/6516239/outofpage_1x1_desktop\',\'div-gpt-ad-1490771277198-
0\').addService(googletag.pubads());googletag.pubads().enableSyncRendering();googletag.enableSer
vices();});\n\ngoogletag.cmd.push(function(){googletag.display(\'div-gpt-ad-1490771277198-
0\');});\n\nvar banHeight=$(".article-middle-banner iframe").height();if(banHeight<=1)
{$(".article-middle-banner").height(0);$(".article-middle-
banner").next().next().remove();}displayConBanner=1;\n\r\n\tIn fact, some 325,000 stranded
 workers have returned in the past two weeks by either train or bus.\n\r\n\tSo far, the state 
government has made arrangements for more than 52,000 covid-19 beds in the public and private 
sectors hospitals.\n\r\n\tChairing a review meeting here, chief minister Yogi Adityanath 
directed officials to ramp up the number of covid beds to 75,000 by May 20 and eventually 
upgrade it to about 100,000 beds in the coming weeks.\nALSO READ: Coronavirus LIVE: 4,213 new 
cases; govt says India recovery rate 31.15%\n\n\r\n\t“The higher number of covid-19 beds would
 ensure that the patients get best medical care in the state whenever required,” UP additional
 chief secretary Awanish Kumar Awasthi said this evening.\n\r\n\tThe state has created an 
elaborate network of level 1, 2 and 3 covid-19 hospitals across the state, of which the L1 
pertain to the primary care at the district level, followed by L2 and L3 at the state level 
having superior medical facilities and equipped with oxygen and ventilator support 
respectively.\n\r\n\tBesides, the state has planned to increase the daily testing capacity to 
10,000 per day from less than 5,000 at present. The government is promoting pool testing too so that a larger number of people could be tested in a given period of time.\n\r\n\tAt present, the instance of covid-19 in UP has been the highest in the 21-40 year age category with more than 48 per cent of the total cases in UP, followed by 41-60 year, 0-20 year and 61+ year categories reporting about 26 per cent, 18 per cent and 8 per cent cases respectively.\n\r\n\t“The percentage of men patients in UP is 78.5 per cent compared to women at 21.5 per cent,” UP principal secretary, medical and health Amit Mohan Prasad said.\n\r\n\tA majority of the coronavirus patients in UP have been asymptomatic that is the patients do not experience any known symptoms of disease, but are found to be active in sample testing.\n\r\n\tMeanwhile, the state has set the target of arranging for 50-55 trains on a daily basis to speedily evacuate its workers stranded in other states, including Maharashtra, Gujarat, Punjab, Karnataka etc.\n\r\n\t“We are creating a database of migrant workers, who are coming back, so that we could identify their unique skill sets for providing them with suitable employment opportunities in UP itself,” Awasthi informed. The state is looking to provide jobs to more than two million migrant workers.\n

Some extra HTML and the JavaScript for Google ads are retrieved as well. How can I fix this?

python-requests

python

beautifulsoup

3 Answers

Stack Overflow user

Accepted answer

Answered 2020-05-12 07:41:44

Before calling get_text(), you need to remove the unwanted text that sits inside the script and style tags.

from bs4 import BeautifulSoup
import urllib.request

url = "https://www.business-standard.com/article/current-affairs/up-plans-100-000-covid-beds-as-325-000-stranded-labourers-return-in-2-weeks-120051100865_1.html"
with urllib.request.urlopen(url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'html.parser')

# Remove every <script> and <style> tag, together with its contents, from the tree
for tag in soup(['script', 'style']):
    tag.extract()

body = soup.find('span', class_="p-content").get_text()
print(body)

This removes all script and style tags before the text is extracted.
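
The same idea can be tried offline. The sketch below is only an illustration: `SAMPLE_HTML` and the `extract_clean_text` helper are made-up names, and the snippet is a tiny stand-in for the real article markup, not a copy of it. Depending on the installed Beautiful Soup version, get_text() may or may not include script and style contents on its own; removing those tags explicitly should give clean text either way.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article markup; the real page is far larger.
SAMPLE_HTML = """
<span class="p-content">
  <style>span.p-content div{line-height:0}</style>
  <p>First paragraph.</p>
  <script>googletag.cmd.push(function(){});</script>
  <p>Second paragraph.</p>
</span>
"""


def extract_clean_text(html, css_class="p-content"):
    """Return the text of the first matching span with script/style removed."""
    soup = BeautifulSoup(html, "html.parser")
    # decompose() deletes the tag and everything inside it from the tree.
    for tag in soup(["script", "style"]):
        tag.decompose()
    span = soup.find("span", class_=css_class)
    if span is None:
        return ""
    # strip=True drops the whitespace-only strings between tags;
    # separator puts each remaining piece of text on its own line.
    return span.get_text(separator="\n", strip=True)


clean_text = extract_clean_text(SAMPLE_HTML)
print(clean_text)
```

Here decompose() is used instead of extract() because the removed tags are not needed afterwards; extract() would work the same way for this purpose.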

Votes: 1

Stack Overflow user

Answered 2020-05-12 07:39:27

from bs4 import BeautifulSoup
import urllib.request

# a_url = "https://www.business-standard.com/article/current-affairs/up-plans-100-000-covid-beds-as-325-000-stranded-labourers-return-in-2-weeks-120051100865_1.html"
# y = requests.get(a_url, headers=headers)
# soup2 = BeautifulSoup(y.content, 'html.parser')
# body = soup2.find('span',class_= "p-content").get_text()


def crawl():
    with urllib.request.urlopen("https://www.business-standard.com/article/current-affairs/up-plans-100-000-covid-beds-as-325-000-stranded-labourers-return-in-2-weeks-120051100865_1.html") as response:
        html = response.read()
        soup = BeautifulSoup(html, 'html.parser')

    for b in soup.find_all('span',{'class':'p-content'}):
        print(b.text)

crawl()

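I think this gets you what you want.

I would like to explain the code in detail, but writing in English is hard for me, so I hope you can work it out yourself!

Votes: 0

Stack Overflow user

Answered 2020-05-12 07:48:44

Here is a solution.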

from bs4 import BeautifulSoup
import requests

a_url = "https://www.business-standard.com/article/current-affairs/up-plans-100-000-covid-beds-as-325-000-stranded-labourers-return-in-2-weeks-120051100865_1.html"
y = requests.get(a_url)
soup2 = BeautifulSoup(y.content, 'html.parser')
body = soup2.select('span.p-content p')
for item in body:
    print(item.getText())
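
A small offline sketch of the selector approach follows. It assumes, as this answer implies, that the article paragraphs are wrapped in p tags inside the span; the `SAMPLE_HTML` string is invented for illustration and is not the real page. Because `span.p-content p` matches only p elements nested in that span, script and style elements are never visited.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two article paragraphs, ad code, and an unrelated <p>.
SAMPLE_HTML = """
<p>Site navigation, outside the article.</p>
<span class="p-content">
  <style>span.p-content div{line-height:0}</style>
  <p>First paragraph.</p>
  <script>googletag.cmd.push(function(){});</script>
  <p>Second paragraph.</p>
</span>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() takes a CSS selector: every <p> that is a descendant of
# a <span> whose class list contains "p-content".
paragraphs = [p.get_text(strip=True) for p in soup.select("span.p-content p")]

for paragraph in paragraphs:
    print(paragraph)
```

One limitation of this approach: any article text that sits directly in the span, outside a p tag, would be skipped.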