利用python requests库获取网页信息;
利用bs库解析网页内容;
pymysql用于操作mysql数据库;
将获取的信息存储至mysql数据库。
效果如图:
1.导入需要的库,并建立数据库连接。需要先安装好mysql数据库在本机上。
importrequests
frombs4importBeautifulSoup
importre
importpymysql
connect=pymysql.connect(user='root',password='root',host='localhost',port=3306,db='python',charset='utf8')
conn=connect.cursor()
conn.execute("create database if not exists Ganjizufang character set utf8;")
conn.execute("use Ganjizufang;")
sql="""create table if not exists roominfo (id INT PRIMARY KEY AUTO_INCREMENT,title VARCHAR(200),price VARCHAR(200),room_type VARCHAR(200),room_size VARCHAR(200),room_direction VARCHAR(200),
room_total_floor VARCHAR(200),elecator VARCHAR(200),room_fixtures VARCHAR(200),viliage_name VARCHAR(200),raiway_distance VARCHAR(300),url VARCHAR(200))"""
conn.execute('drop table if exists roominfo;')
conn.execute(sql)
2.获取主网页信息。加上headers,利用requests.get()方法访问网页,用BeautifulSoup和lxml解析;
foriinrange(1,20):
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}
link='http://cd.ganji.com/fang1/ditie/m1o'+str(i)+'l21s2298/'
r = requests.get(link,headers=headers)
print'The page is:',i
soup = BeautifulSoup(r.text,'lxml')
house_list = soup.find_all('div',class_="f-list-item ershoufang-list")
3.获取子网页信息。主网页对room 信息只有很少的描述,需要获取子网页链接,访问子网页获取很多的room信息。获取href可获取子网页链接。
forhouseinhouse_list:
next_link=house.find('dd',class_="dd-item title").find('a',class_='js-title value title-font')
# find the next level link
link1='http://cd.ganji.com'+str(next_link['href'])
r1=requests.get(link1,headers=headers)
soup1=BeautifulSoup(r1.text,'lxml')
title=soup1.find('p',class_='card-title').text.strip()
price=soup1.find('ul',class_='card-pay f-clear').find('li',class_='price').find('span',class_='num').text.strip()
# how to handle the same tag
room_type=soup1.find('ul',class_='er-list f-clear').contents[1].text.strip()
room_size=soup1.find('ul',class_='er-list f-clear').contents[3].text.strip()
room_direction=soup1.find('ul',class_='er-list f-clear').contents[5].text.strip()
room_total_floor=soup1.find('ul',class_='er-list f-clear').contents[7].text.strip()
elecator=soup1.find('ul',class_='er-list f-clear').contents[9].text.strip()
try:
room_fixtures=soup1.find('ul',class_='er-list f-clear').contents[11].text.strip()
except:
room_fixtures='no info'
try:
viliage_name=soup1.find('ul',class_='er-list-two f-clear').find('a',class_='blue').text.strip()
except:
viliage_name='no info'
raiway_distance=soup1.find('div',class_='subway-wrap').find('span',class_='content').text.strip()
try:
address=soup1.find('ul',class_='er-list-two f-clear').contents[5].text.strip()
except:
#address='No info'
phone=soup1.find('div',class_='card-info f-fr').find('div',class_='c_phone f-clear')
# how match the phone bunber, or use the click solution.
phone_number=re.search(r'[0-9]+',phone) 上面try--except因为某些room的信息不全,无法获取对应的信息,程序会报错。将没有的信息给定为‘no info’. 4.每获取一条信息,将该信息存储至mysql数据库;用insert into将信息插入到数据库;
conn.execute("insert into roominfo(title,price,room_type,room_size,room_direction,room_total_floor,elecator,room_fixtures,viliage_name,raiway_distance,url) VALUES ('%s','%s','%s','%s','%s','%s','%s','%s','%s','%s','%s')"%(title,price,room_type,room_size,room_direction,room_total_floor,elecator,room_fixtures,viliage_name,raiway_distance,link1))
conn.commit() 5.也可以将数据存储为text或者csv文件在本地。
withopen('Ganjizufang.csv','a+')asf:
f.write(title.encode('utf-8')+',')
f.write(price.encode('utf-8') +',')
f.write(room_type.encode('utf-8') +',')
f.write(room_size.encode('utf-8') +',')
f.write(room_direction.encode('utf-8') +',')
f.write(room_total_floor.encode('utf-8') +',')
f.write(elecator.encode('utf-8') +',')
f.write(room_fixtures.encode('utf-8') +',')
f.write(viliage_name.encode('utf-8') +',')
f.write(raiway_distance.encode('utf-8') +',')
#f.write(address.encode('utf-8') + ',')
f.write(link1.encode('utf-8') +',')
#f.write(price.encode('utf-8'))
f.write('\n')
f.close()
6.每获取一页内容,暂停两秒。最后关闭数据库。
time.sleep(2)
conn.close()
connect.close() 保存csv效果如图:
7.结语:访问量过多会被网站禁用ip,要求输入验证码后才能继续访问。可以建立代理池或者用代理服务器的方法伪装ip进行访问。 ------------------------------------------------ 8.附加:本地mysql数据库操作,以上面写入的数据数据为例。 1.将安装好的mysql添加至path环境变量里; 2.windows+R输入cmd进入命令行,输入mysql;输入密码; 3.connect Ganjizufang; use Ganjizufang; select * from roominfo;获取roominfo所有的信息; select * from roominfo where price
领取专属 10元无门槛券
私享最新 技术干货