Web scraping with BeautifulSoup and urllib

Stack Overflow user
Asked 2017-08-30 17:46:16
Answers: 2 · Views: 559 · Followers: 0 · Votes: 0

I'm using Python 3.6 and BeautifulSoup to scrape text. I'm practicing on the Walmart website and trying to scrape text from a product page. Here is my code.

Code language: python
from bs4 import BeautifulSoup
from urllib.request import urlopen
main_page=urlopen('http://www.walmart.com/ip/Sceptre-32-Class-HD-720P-LED-TV-X322BV-SR/55427159')
soup = BeautifulSoup(main_page,"lxml")
title=soup.select_one("h1.prod-ProductTitle.no-margin.heading-a").get_text()
price=soup.select_one("span.Price-group").get_text()
highLights=soup.select_one("div.ProductPage-short-description-body").get_text()
description=soup.select_one("div.about-desc").get_text()
print(title,"\n",highLights,"\n",description,"\n",price)

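In the code above I extract the product title, price, highlights, and description, but I can't get the description (the "About This Item" section). Instead of the description, I get something else.

Please help me solve this problem.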

2 Answers

Stack Overflow user

Answered 2017-08-30 18:46:44

There are two divs with class="about-desc". Because you are using select_one, only the first div is returned, but you need the second one. Here is the adjustment:

Code language: python
description=soup.select("div.about-desc")[1].get_text()
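
The difference between the two calls can be checked offline on a small inline document. This is only a minimal sketch: the HTML below is made up for illustration (just the about-desc class name comes from the question), and it uses the built-in html.parser so lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two divs sharing the class name from the question.
html = """
<div class="about-desc">Shipping and returns</div>
<div class="about-desc">About this item: 32-inch LED TV</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one() returns only the first match (or None if nothing matches).
first = soup.select_one("div.about-desc").get_text()

# select() returns every match as a list, so the second div is index 1.
matches = soup.select("div.about-desc")
second = matches[1].get_text() if len(matches) > 1 else None

print(first)
print(second)
```

Checking the length before indexing avoids an IndexError if the page layout changes and only one such div is present.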

Update: the site actually blocks urllib's default user agent, so you should mask it by sending a browser-like User-Agent header.

Code language: python
from bs4 import BeautifulSoup
import urllib.request
user_agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'}
req = urllib.request.Request(url="http://www.walmart.com/ip/Sceptre-32-Class-HD-720P-LED-TV-X322BV-SR/55427159", headers=user_agent)
main_page = urllib.request.urlopen(req)
soup = BeautifulSoup(main_page,"lxml")
title=soup.select_one("h1.prod-ProductTitle.no-margin.heading-a").get_text()
price=soup.select_one("span.Price-group").get_text()
highLights=soup.select_one("div.ProductPage-short-description-body").get_text()
description=soup.select("div.about-desc")[1].get_text()
print(title,"\n",highLights,"\n",description,"\n",price)
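
To see what the header change actually does, the sketch below compares the User-Agent that urllib sends by default with the one set on the Request object. It makes no network call; it only inspects the objects, and the URL and header value are the ones from the answer.

```python
import urllib.request

# The header urllib sends when none is supplied, e.g. "Python-urllib/3.6".
default_agent = dict(urllib.request.build_opener().addheaders)["User-agent"]

# Same header dict as in the answer above.
user_agent = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0"
}
req = urllib.request.Request(
    url="http://www.walmart.com/ip/Sceptre-32-Class-HD-720P-LED-TV-X322BV-SR/55427159",
    headers=user_agent,
)

# Request stores header names capitalized, so the key is "User-agent".
custom_agent = req.get_header("User-agent")

print(default_agent)
print(custom_agent)
```

A default value that starts with "Python-urllib/" is easy for a site to recognize and reject, which is presumably why the unmodified request was blocked.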
Votes: 0

Stack Overflow user

Answered 2021-06-05 17:55:30

There are two options:

  • parse the embedded JSON using requests + beautifulsoup
  • parse the page using requests-html

If you run the following in the Chrome console, it logs the product data embedded in the page:

Code language: javascript
test = JSON.parse(document.querySelector("#item").textContent).item.product.buyBox.products[0]
console.log(test)

Code language: python
import json, requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get('https://www.walmart.com/ip/Wilson-The-Duke-Official-NFL-Game-Football/5192758', headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# https://stackoverflow.com/a/63151716/15164646
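fus = soup.select_one('#item').string  # raw JSON text inside the element with id="item"
ro = json.loads(fus)  # parse the text into Python dicts and lists
dah = json.dumps(ro, indent=2, ensure_ascii=False)  # pretty-print the parsed data, not the raw string
print(dah)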

Part of the output:

Code language: json
{
  "item": {
    "ads": {
      "config": {
        "lazy-homepage-expose1": "800",
        "lazy-search-expose1": "800",
        "lazy-browse-expose1": "800",
        "lazy-category-expose1": "200",
        "no-category-marquee2": true,
        "no-deals-skyline1": true,
        "no-homepage-twocolumnhp": true,
        "lazy-item-expose1": "800",
        "lazy-item-marquee2": "1200",
        "lazy-item-rightrail2": "1200",
        "adblockImgSource": "//i5.walmartimages.com/dfw/63fd9f59-8bc2/8fe200ec-4c4d-4ab0-89e5-0662af6f506d/v1/ads.png",
        "displayAdsS2sScript": "//i5.wal.co/dfw/63fd9f59-a579/be6f8cae-248d-40e2-8cad-32d04468ea59/v29/usgm-s2s-midas.js",
        "displayAdsS2sScriptWithPoly": "//i5.wal.co/dfw/63fd9f59-5870/c8ceb4ee-1e68-40ec-a38e-ca0623f075a0/v29/usgm-s2s-midas-poly.js",
        "safeframeUrl": "https://i5.wal.co/dfw/63fd9f59-d6ba/07b8ea82-184c-4ea3-8ac0-5dc1981e40c8/v50/safeframe.html",
        "displayAds": true,
        "exts2s": true,
        "isTwoDayDeliveryTextEnabled": true,
        "ads2s": true,
        "bypassproxy": false,
        "adblockDetectionEnabled": false,
        "marqueeSafeframe": true,
        "exposeSafeframe": true,
        "skylineSafeframe": true,
        "leftrailSafeframe": true,
        "rightrailSafeframe": true,
        "cloud": "scus-prod-a29"
      }
}
# much more down below...
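
Once the JSON is parsed, the entry that the Chrome console snippet reads can be reached from Python with ordinary dict and list indexing. The sketch below runs offline on a made-up, heavily trimmed page: only the "#item" id and the item.product.buyBox.products path come from this answer, while the productName and price fields are hypothetical.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical page: the field names inside the product entry are invented.
html = """
<html><body>
<script id="item" type="application/json">
{"item": {"product": {"buyBox": {"products": [{"productName": "Example TV", "price": 129.0}]}}}}
</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
script = soup.select_one("#item")

# Guard against the script tag being absent (for example on a blocked request).
data = json.loads(script.string) if script is not None else {}

# Python equivalent of item.product.buyBox.products[0] in the console snippet.
products = data.get("item", {}).get("product", {}).get("buyBox", {}).get("products", [])
product = products[0] if products else None

print(json.dumps(product, indent=2, ensure_ascii=False))
```

Chaining .get() with empty defaults means a missing key yields None at the end instead of raising a KeyError halfway down the path.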

The code below uses requests-html. One way to get the "About This Item" description is to use XPath.

Code (tested on multiple listings):

Code language: python
from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://www.walmart.com/ip/Sceptre-32-Class-720P-HD-LED-TV-X322BV-SR/55427159')

# first=True means that it will grab the first occurrence and skip everything else
title = response.html.find('.prod-productTitle-buyBox', first=True).text
price = response.html.find('.prod-PriceHero', first=True).text.split('$')[1]
description = response.html.xpath('//*[@id="about-product-section"]/div/div[1]/div[1]/div[3]/p[1]/text()', first=True)
key_features = response.html.xpath('//*[@id="about-product-section"]/div/div[1]/div[1]/div[3]/ul[1]', first=True).text

print(title)
print(price)
print(description)
print(key_features)

Output:

Code language: text
Sceptre 32" Class 720P HD LED TV X322BV-SR
129.00
Escape into a world of splendid color and clarity with the X322BV-SR. 
Clear QAM tuner is included to make cable connection as easy as possible, without an antenna. 
HDMI input delivers the unbeatable combination of high-definition video and clear audio. 
A USB port comes in handy when you want to flip through all of your stored pictures and tune into your stored music. 
More possibilities: with HDMI, VGA, Component and Composite inputs, we offer a convenient balance between the old and new to suit your diverse preferences. 
With the ability to connect your computer, laptop, monitor, or TV to all your favorite variety of input options, VGA inputs deliver superb analog video.

Screen Size (Diag.) 31.5"
Backlight Type LED
Resolution 720p
Effective Refresh Rate 60Hz
Smart Functionality no
Aspect Ratio 16 9
Dynamic Contrast Ratio 5,000 1
Viewable Angle (H/V) 178 degrees/178 degrees
Number of Colors 16.7 M
OSD Language English, Spanish, French
Speakers/Power Output 10W x 2
Surround Sound Mode
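
The price line above relies on .text.split('$')[1], which raises an IndexError when the text has no dollar sign and keeps whatever text follows the amount. A slightly more defensive alternative is sketched below; parse_price is a hypothetical helper, not part of requests-html, and the sample strings are made up.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Return the first dollar amount found in text as a Decimal, or None."""
    match = re.search(r"\$\s*(\d[\d,]*(?:\.\d+)?)", text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return Decimal(match.group(1).replace(",", ""))


print(parse_price("$129.00"))
print(parse_price("Now $1,299.99"))
print(parse_price("Out of stock"))
```

Returning None for text without a price lets the caller decide how to handle an unavailable item instead of crashing mid-scrape.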

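Alternatively, you can use the third-party Walmart Product API from SerpApi. It is a paid API with a free trial of 5,000 searches. A completely free trial is currently in development.

Code to integrate:

Code language: python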
from serpapi import GoogleSearch

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "walmart_product",
  "product_id": "55427159"
}

search = GoogleSearch(params)
results = search.get_dict()

title = results['product_result']['title']
price = results['product_result']['price_map']['price']
key_features = results['product_result']['detailed_description_html']
print(title)
print(price)
print(key_features)

Output:

Code language: text
Sceptre 32" Class 720P HD LED TV X322BV-SR
129
<b>Sceptre 32" Class 720p HD LED TV X322BV-SR</b><br /><b>Key Features </b></p><ul><li>Screen Size (Diag.) 31.5"</li><li>Backlight Type LED</li><li>Resolution 720p</li><li>Effective Refresh Rate 60Hz</li><li>Smart Functionality no</li><li>Aspect Ratio 16 9</li><li>Dynamic Contrast Ratio 5,000 1</li><li>Viewable Angle (H/V) 178 degrees/178 degrees</li><li>Number of Colors 16.7 M</li><li>OSD Language English, Spanish, French</li><li>Speakers/Power Output 10W x 2</li><li>Surround Sound Mode</li></ul><b>Connectivity </b><ul><li>Component/Composite Video 1</li><li>HDMI 2</li><li>Headphone 1</li><li>Optical Digital Audio 1</li><li>RCA Audio L+R 1</li><li>RF (Coaxial) 1</li><li>USB 2.0 1</li><li>Assembled Product Dims 28.78 x 18.39 x 7.95 Inches<br /></li></ul><b>What's In The Box </b><ul><li>Remote Control</li></ul><b>Wall-mountable </b><ul><li>Mount Pattern 100mm x 100mm</li><li>Screw Size M4</li><li>Screw Length 6mm</li></ul><b>Support and Warranty </b><ul><li>1-year limited labor and parts</li></ul><br /><br />Flat Screen TV stand sold separately. See all <b> TV stands.</b><br /><br />Flat Screen TV mount sold separately. See all <b> TV mounts. </b><br /><br />TV audio equipment sold separately. See all <b> Home Theater Systems. </b><br /><br />HDMI cables sold separately. See all <b> HDMI Cables.</b><br /><br />Accessories sold separately. See all <b> Accessories.<br /></b><br /><br /><b>ENERGY STAR<sup></sup></b><br />Products that are ENERGY STAR-qualified prevent greenhouse gas emissions by meeting strict energy efficiency guidelines set by the U.S. Environmental Protection Agency and the U.S. Department of Energy. The ENERGY STAR name and marks are registered marks owned by the U.S. government, as part of their energy efficiency and environmental activities.
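
Because detailed_description_html comes back as raw HTML, it usually needs one more pass to become readable text. A minimal sketch, assuming BeautifulSoup is available and using a short excerpt of the output above in place of a live API response:

```python
from bs4 import BeautifulSoup

# Short excerpt taken from the output above (not the full response).
detailed_description_html = (
    '<b>Key Features </b></p><ul>'
    '<li>Screen Size (Diag.) 31.5"</li>'
    '<li>Backlight Type LED</li>'
    '<li>Resolution 720p</li></ul>'
)

soup = BeautifulSoup(detailed_description_html, "html.parser")

# Each <li> holds one feature; strip=True trims surrounding whitespace.
features = [li.get_text(strip=True) for li in soup.select("li")]

for feature in features:
    print(feature)
```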

Disclaimer: I work for SerpApi.

Votes: -1
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/45957083