I'm using Python 3.6 and can scrape text with BeautifulSoup. I'm practicing on the Walmart website, trying to scrape text from a product page. Here is my code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
main_page=urlopen('http://www.walmart.com/ip/Sceptre-32-Class-HD-720P-LED-TV-X322BV-SR/55427159')
soup = BeautifulSoup(main_page,"lxml")
title=soup.select_one("h1.prod-ProductTitle.no-margin.heading-a").get_text()
price=soup.select_one("span.Price-group").get_text()
highLights=soup.select_one("div.ProductPage-short-description-body").get_text()
description=soup.select_one("div.about-desc").get_text()
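print(title,"\n",highLights,"\n",description,"\n",price)
With the code above I extract the product title, price, highlights, and description, but I can't get the right description (the "About This Item" section). Instead of the description I get something else.
Please help me solve this.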
Answered 2017-08-30 18:46:44
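There are two divs with class "about-desc", and because you are using select_one, only the first one is returned, whereas you need the second. Here is the adjustment:
description=soup.select("div.about-desc")[1].get_text()
Update: the site actually blocks urllib's default user agent, so you should replace it with a browser-like one, as in the full script below.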
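To see what the default user agent actually is, here is a small offline sketch (it makes no network call). It prints the User-Agent that urllib sends when none is supplied and the one attached to a Request that carries a custom header. The header value and URL are the ones used in this answer; the variable names are only illustrative.

```python
import urllib.request

# The User-Agent urllib adds when the request does not set one,
# e.g. "Python-urllib/3.6" on Python 3.6.
default_agent = dict(urllib.request.build_opener().addheaders)["User-agent"]

browser_agent = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) "
    "Gecko/20100101 Firefox/12.0"
)
req = urllib.request.Request(
    url="http://www.walmart.com/ip/Sceptre-32-Class-HD-720P-LED-TV-X322BV-SR/55427159",
    headers={"User-Agent": browser_agent},
)

# Request stores header names capitalized, so the key is "User-agent".
sent_agent = req.get_header("User-agent")

print(default_agent)
print(sent_agent)
```

The first value printed is the default that the site presumably rejects; the second is the browser string that will be sent once the request is opened with urlopen.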
from bs4 import BeautifulSoup
import urllib.request
user_agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'}
req = urllib.request.Request(url="http://www.walmart.com/ip/Sceptre-32-Class-HD-720P-LED-TV-X322BV-SR/55427159", headers=user_agent)
main_page = urllib.request.urlopen(req)
soup = BeautifulSoup(main_page,"lxml")
title=soup.select_one("h1.prod-ProductTitle.no-margin.heading-a").get_text()
price=soup.select_one("span.Price-group").get_text()
highLights=soup.select_one("div.ProductPage-short-description-body").get_text()
description=soup.select("div.about-desc")[1].get_text()
print(title,"\n",highLights,"\n",description,"\n",price)
Answered 2021-06-05 17:55:30
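There are two options: parse the JSON embedded in the page with requests + beautifulsoup, or use requests-html.
If you run the following in the Chrome console, you get the product data as the response: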
test = JSON.parse(document.querySelector("#item").textContent).item.product.buyBox.products[0]
console.log(test)
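The same lookup can be sketched in Python without touching the network. The snippet below uses only the standard library: it pulls the text of the script tag with id "item" out of an HTML string, parses it as JSON, and follows the item.product.buyBox.products[0] path from the console command. The sample HTML and the "productName" key are made up for illustration; the real payload is far larger and its structure may have changed.

```python
import json
from html.parser import HTMLParser


class ScriptJsonExtractor(HTMLParser):
    """Collects the text inside the <script> tag that has the given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


# Hypothetical, heavily trimmed stand-in for the real product page.
SAMPLE_HTML = """
<html><body>
<script id="item" type="application/json">
{"item": {"product": {"buyBox": {"products": [{"productName": "Sample TV"}]}}}}
</script>
</body></html>
"""


def extract_first_product(html):
    """Return the first buyBox product from the embedded JSON."""
    parser = ScriptJsonExtractor("item")
    parser.feed(html)
    parser.close()
    data = json.loads("".join(parser.chunks))
    return data["item"]["product"]["buyBox"]["products"][0]


first_product = extract_first_product(SAMPLE_HTML)
print(first_product)
```

On a live page any of these keys may be missing, so in practice it is safer to wrap the lookup in a try/except KeyError or to walk the path with dict.get.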

import json, requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get('https://www.walmart.com/ip/Wilson-The-Duke-Official-NFL-Game-Football/5192758', headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
# https://stackoverflow.com/a/63151716/15164646
fus = soup.select_one('#item').string
ro = json.loads(fus)
# dump the parsed object (ro), not the raw string (fus), to get indented JSON
dah = json.dumps(ro, indent=2, ensure_ascii=False)
print(dah)
Part of the output:
{
  "item": {
    "ads": {
      "config": {
        "lazy-homepage-expose1": "800",
        "lazy-search-expose1": "800",
        "lazy-browse-expose1": "800",
        "lazy-category-expose1": "200",
        "no-category-marquee2": true,
        "no-deals-skyline1": true,
        "no-homepage-twocolumnhp": true,
        "lazy-item-expose1": "800",
        "lazy-item-marquee2": "1200",
        "lazy-item-rightrail2": "1200",
        "adblockImgSource": "//i5.walmartimages.com/dfw/63fd9f59-8bc2/8fe200ec-4c4d-4ab0-89e5-0662af6f506d/v1/ads.png",
        "displayAdsS2sScript": "//i5.wal.co/dfw/63fd9f59-a579/be6f8cae-248d-40e2-8cad-32d04468ea59/v29/usgm-s2s-midas.js",
        "displayAdsS2sScriptWithPoly": "//i5.wal.co/dfw/63fd9f59-5870/c8ceb4ee-1e68-40ec-a38e-ca0623f075a0/v29/usgm-s2s-midas-poly.js",
        "safeframeUrl": "https://i5.wal.co/dfw/63fd9f59-d6ba/07b8ea82-184c-4ea3-8ac0-5dc1981e40c8/v50/safeframe.html",
        "displayAds": true,
        "exts2s": true,
        "isTwoDayDeliveryTextEnabled": true,
        "ads2s": true,
        "bypassproxy": false,
        "adblockDetectionEnabled": false,
        "marqueeSafeframe": true,
        "exposeSafeframe": true,
        "skylineSafeframe": true,
        "leftrailSafeframe": true,
        "rightrailSafeframe": true,
        "cloud": "scus-prod-a29"
      }
    }
# much more down below...
The code below uses requests-html. One way to get the "About this item" description is to use XPath.
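As a rough illustration of the XPath idea, here is a minimal offline sketch using the standard library's xml.etree.ElementTree on a made-up, well-formed fragment. ElementTree supports only a subset of XPath and needs well-formed markup, so this is a stand-in for the technique, not a replacement for requests-html (which, as far as I know, relies on lxml for its xpath() method). The sample markup is hypothetical and much simpler than the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the "about" section of a product page.
SAMPLE = """
<html><body>
  <div id="about-product-section">
    <div>
      <p>Escape into a world of splendid color and clarity.</p>
      <ul><li>Screen Size (Diag.) 31.5"</li><li>Backlight Type LED</li></ul>
    </div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Anchor on the id first, then walk down by position, as the XPath idea suggests.
section = root.find(".//div[@id='about-product-section']")
description = section.find("./div/p[1]").text
key_features = [li.text for li in section.findall("./div/ul[1]/li")]

print(description)
print(key_features)
```

Anchoring on an id and keeping the positional part short tends to survive page redesigns better than a long chain of div[n] steps.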
Code (tested on multiple listings):
from requests_html import HTMLSession
session = HTMLSession()
response = session.get('https://www.walmart.com/ip/Sceptre-32-Class-720P-HD-LED-TV-X322BV-SR/55427159')
# first=True means that it will grab the first occurrence and skip everything else
title = response.html.find('.prod-productTitle-buyBox', first=True).text
price = response.html.find('.prod-PriceHero', first=True).text.split('$')[1]
description = response.html.xpath('//*[@id="about-product-section"]/div/div[1]/div[1]/div[3]/p[1]/text()', first=True)
key_features = response.html.xpath('//*[@id="about-product-section"]/div/div[1]/div[1]/div[3]/ul[1]', first=True).text
print(title)
print(price)
print(description)
print(key_features)
Output:
Sceptre 32" Class 720P HD LED TV X322BV-SR
129.00
Escape into a world of splendid color and clarity with the X322BV-SR.
Clear QAM tuner is included to make cable connection as easy as possible, without an antenna.
HDMI input delivers the unbeatable combination of high-definition video and clear audio.
A USB port comes in handy when you want to flip through all of your stored pictures and tune into your stored music.
More possibilities: with HDMI, VGA, Component and Composite inputs, we offer a convenient balance between the old and new to suit your diverse preferences.
With the ability to connect your computer, laptop, monitor, or TV to all your favorite variety of input options, VGA inputs deliver superb analog video.
Screen Size (Diag.) 31.5"
Backlight Type LED
Resolution 720p
Effective Refresh Rate 60Hz
Smart Functionality no
Aspect Ratio 16 9
Dynamic Contrast Ratio 5,000 1
Viewable Angle (H/V) 178 degrees/178 degrees
Number of Colors 16.7 M
OSD Language English, Spanish, French
Speakers/Power Output 10W x 2
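Surround Sound Mode
Alternatively, you can use the third-party Walmart Product API from SerpApi. It is a paid API with a free trial of 5,000 searches. A completely free trial is currently in development.
Code to integrate: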
from serpapi import GoogleSearch
params = {
    "api_key": "YOUR_API_KEY",
    "engine": "walmart_product",
    "product_id": "55427159"
}
search = GoogleSearch(params)
results = search.get_dict()
title = results['product_result']['title']
price = results['product_result']['price_map']['price']
key_features = results['product_result']['detailed_description_html']
print(title)
print(price)
print(key_features)
Output:
Sceptre 32" Class 720P HD LED TV X322BV-SR
129
<b>Sceptre 32" Class 720p HD LED TV X322BV-SR</b><br /><b>Key Features </b></p><ul><li>Screen Size (Diag.) 31.5"</li><li>Backlight Type LED</li><li>Resolution 720p</li><li>Effective Refresh Rate 60Hz</li><li>Smart Functionality no</li><li>Aspect Ratio 16 9</li><li>Dynamic Contrast Ratio 5,000 1</li><li>Viewable Angle (H/V) 178 degrees/178 degrees</li><li>Number of Colors 16.7 M</li><li>OSD Language English, Spanish, French</li><li>Speakers/Power Output 10W x 2</li><li>Surround Sound Mode</li></ul><b>Connectivity </b><ul><li>Component/Composite Video 1</li><li>HDMI 2</li><li>Headphone 1</li><li>Optical Digital Audio 1</li><li>RCA Audio L+R 1</li><li>RF (Coaxial) 1</li><li>USB 2.0 1</li><li>Assembled Product Dims 28.78 x 18.39 x 7.95 Inches<br /></li></ul><b>What's In The Box </b><ul><li>Remote Control</li></ul><b>Wall-mountable </b><ul><li>Mount Pattern 100mm x 100mm</li><li>Screw Size M4</li><li>Screw Length 6mm</li></ul><b>Support and Warranty </b><ul><li>1-year limited labor and parts</li></ul><br /><br />Flat Screen TV stand sold separately. See all <b> TV stands.</b><br /><br />Flat Screen TV mount sold separately. See all <b> TV mounts. </b><br /><br />TV audio equipment sold separately. See all <b> Home Theater Systems. </b><br /><br />HDMI cables sold separately. See all <b> HDMI Cables.</b><br /><br />Accessories sold separately. See all <b> Accessories.<br /></b><br /><br /><b>ENERGY STAR<sup></sup></b><br />Products that are ENERGY STAR-qualified prevent greenhouse gas emissions by meeting strict energy efficiency guidelines set by the U.S. Environmental Protection Agency and the U.S. Department of Energy. The ENERGY STAR name and marks are registered marks owned by the U.S. government, as part of their energy efficiency and environmental activities.
Disclaimer: I work for SerpApi.
https://stackoverflow.com/questions/45957083