Creating a DataFrame from parallel lists:
>>> import pandas as pd
>>> import numpy as np
>>> fname = ["Paul", "John", "Richard", "George"]
>>> lname = ["McCartney", "Lennon", "Starkey", "Harrison"]
>>> birth = [1942, 1940, 1940, 1943]
Create a dictionary:
>>> people = {"first": fname, "last": lname, "birth": birth}
Create a DataFrame from the dictionary:
>>> beatles = pd.DataFrame(people)
>>> beatles
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
When the DataFrame constructor is called, pandas creates a RangeIndex object:
>>> beatles.index
RangeIndex(start=0, stop=4, step=1)
Specify a different index:
>>> pd.DataFrame(people, index=["a", "b", "c", "d"])
first last birth
a Paul McCartney 1942
b John Lennon 1940
c Richard Starkey 1940
d George Harrison 1943
A DataFrame can also be built from a list of dictionaries:
>>> pd.DataFrame(
... [
... {
... "first": "Paul",
... "last": "McCartney",
... "birth": 1942,
... },
... {
... "first": "John",
... "last": "Lennon",
... "birth": 1940,
... },
... {
... "first": "Richard",
... "last": "Starkey",
... "birth": 1940,
... },
... {
... "first": "George",
... "last": "Harrison",
... "birth": 1943,
... },
... ]
... )
birth first last
0 1942 Paul McCartney
1 1940 John Lennon
2 1940 Richard Starkey
3 1943 George Harrison
Use the columns parameter to specify the column order:
>>> pd.DataFrame(
... [
... {
... "first": "Paul",
... "last": "McCartney",
... "birth": 1942,
... },
... {
... "first": "John",
... "last": "Lennon",
... "birth": 1940,
... },
... {
... "first": "Richard",
... "last": "Starkey",
... "birth": 1940,
... },
... {
... "first": "George",
... "last": "Harrison",
... "birth": 1943,
... },
... ],
... columns=["last", "first", "birth"],
... )
last first birth
0 McCartney Paul 1942
1 Lennon John 1940
2 Starkey Richard 1940
3 Harrison George 1943
Write a DataFrame to a CSV file with the .to_csv method:
>>> beatles
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> from io import StringIO
>>> fout = StringIO()
>>> beatles.to_csv(fout)  # a file path would also work here
Inspect the file contents:
>>> print(fout.getvalue())
,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943
If we read back the CSV we just saved, a redundant column named Unnamed: 0 appears:
>>> _ = fout.seek(0)
>>> pd.read_csv(fout)
Unnamed: 0 first last birth
0 0 Paul McCartney 1942
1 1 John Lennon 1940
2 2 Richard Starkey 1940
3 3 George Harrison 1943
Use the index_col parameter to specify which column to use as the index:
>>> _ = fout.seek(0)
>>> pd.read_csv(fout, index_col=0)
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
To avoid saving the row index when writing the CSV, set the index parameter to False:
>>> fout = StringIO()
>>> beatles.to_csv(fout, index=False)
>>> print(fout.getvalue())
first,last,birth
Paul,McCartney,1942
John,Lennon,1940
Richard,Starkey,1940
George,Harrison,1943
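Reading that output back produces no redundant index column; the row index is simply regenerated. A minimal round-trip sketch, reusing the fout buffer from above:
>>> _ = fout.seek(0)
>>> pd.read_csv(fout)
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943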
Pandas processes files in memory; as a rule of thumb, you need 3 to 10 times as much memory as the size of the file. Here we use the diamonds dataset, reading only 1,000 rows with the nrows parameter:
>>> diamonds = pd.read_csv("data/diamonds.csv", nrows=1000)
>>> diamonds
carat cut color clarity ... price x y z
0 0.23 Ideal E SI2 ... 326 3.95 3.98 2.43
1 0.21 Premium E SI1 ... 326 3.89 3.84 2.31
2 0.23 Good E VS1 ... 327 4.05 4.07 2.31
3 0.29 Premium I VS2 ... 334 4.20 4.23 2.63
4 0.31 Good J SI2 ... 335 4.34 4.35 2.75
.. ... ... ... ... ... ... ... ... ...
995 0.54 Ideal D VVS2 ... 2897 5.30 5.34 3.26
996 0.72 Ideal E SI1 ... 2897 5.69 5.74 3.57
997 0.72 Good F VS1 ... 2897 5.82 5.89 3.48
998 0.74 Premium D VS2 ... 2897 5.81 5.77 3.58
999 1.12 Premium J SI2 ... 2898 6.68 6.61 4.03
Use the .info method to see how much memory is consumed:
>>> diamonds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat 1000 non-null float64
cut 1000 non-null object
color 1000 non-null object
clarity 1000 non-null object
depth 1000 non-null float64
table 1000 non-null float64
price 1000 non-null int64
x 1000 non-null float64
y 1000 non-null float64
z 1000 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 78.2+ KB
The 1,000 rows use 78.2 KB of memory, so a billion rows would take roughly 78 GB.
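The '+' after 78.2 KB warns that object columns are only estimated, not measured. As a hedged aside (not in the original recipe), passing deep=True to .memory_usage measures the Python strings as well; the exact byte count depends on the data, so no output is shown:
>>> diamonds.memory_usage(deep=True).sum()  # exact bytes, object strings included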
Use the dtype parameter to choose smaller numeric types while reading:
>>> diamonds2 = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "x": np.float32,
... "y": np.float32,
... "z": np.float32,
... "price": np.int16,
... },
... )
>>> diamonds2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat 1000 non-null float32
cut 1000 non-null object
color 1000 non-null object
clarity 1000 non-null object
depth 1000 non-null float32
table 1000 non-null float32
price 1000 non-null int16
x 1000 non-null float32
y 1000 non-null float32
z 1000 non-null float32
dtypes: float32(6), int16(1), object(3)
memory usage: 49.0+ KB
With the numeric types changed, compare the new DataFrame against the original:
>>> diamonds.describe()
carat depth ... y z
count 1000.000000 1000.000000 ... 1000.000000 1000.000000
mean 0.689280 61.722800 ... 5.599180 3.457530
std 0.195291 1.758879 ... 0.611974 0.389819
min 0.200000 53.000000 ... 3.750000 2.270000
25% 0.700000 60.900000 ... 5.630000 3.450000
50% 0.710000 61.800000 ... 5.760000 3.550000
75% 0.790000 62.600000 ... 5.910000 3.640000
max 1.270000 69.500000 ... 7.050000 4.330000
>>> diamonds2.describe()
carat depth ... y z
count 1000.000000 1000.000000 ... 1000.000000 1000.000000
mean 0.689453 61.718750 ... 5.601562 3.457031
std 0.195312 1.759766 ... 0.611816 0.389648
min 0.199951 53.000000 ... 3.750000 2.269531
25% 0.700195 60.906250 ... 5.628906 3.449219
50% 0.709961 61.812500 ... 5.761719 3.550781
75% 0.790039 62.593750 ... 5.910156 3.640625
max 1.269531 69.500000 ... 7.050781 4.328125
Changing the types saved about 37% of the memory (49.0 KB versus 78.2 KB).
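The cost of float32 is precision: it keeps only about seven significant digits. A quick sanity check, a sketch assuming a tolerance that is ample for carat values, should return True here:
>>> np.allclose(diamonds.carat, diamonds2.carat, atol=1e-3)
True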
The dtype parameter can also convert columns to category. First, use .value_counts to check the cardinality:
>>> diamonds2.cut.value_counts()
Ideal 333
Premium 290
Very Good 226
Good 89
Fair 62
Name: cut, dtype: int64
>>> diamonds2.color.value_counts()
E 240
F 226
G 139
D 129
H 125
I 95
J 46
Name: color, dtype: int64
>>> diamonds2.clarity.value_counts()
SI1 306
VS2 218
VS1 159
SI2 154
VVS2 62
VVS1 58
I1 29
IF 14
Name: clarity, dtype: int64
Because the cardinality is low, converting these columns to category saves roughly another 40% (29.4 KB versus 49.0 KB):
>>> diamonds3 = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "x": np.float32,
... "y": np.float32,
... "z": np.float32,
... "price": np.int16,
... "cut": "category",
... "color": "category",
... "clarity": "category",
... },
... )
>>> diamonds3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat 1000 non-null float32
cut 1000 non-null category
color 1000 non-null category
clarity 1000 non-null category
depth 1000 non-null float32
table 1000 non-null float32
price 1000 non-null int16
x 1000 non-null float32
y 1000 non-null float32
z 1000 non-null float32
dtypes: category(3), float32(6), int16(1)
memory usage: 29.4 KB
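The same conversion also works after loading, via .astype; a minimal sketch on the already-loaded diamonds2:
>>> diamonds2.cut.astype("category").dtype
CategoricalDtype(categories=['Fair', 'Good', 'Ideal', 'Premium', 'Very Good'], ordered=False)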
Use the usecols parameter to specify which columns to load. Here we skip the x, y, and z columns:
>>> cols = [
... "carat",
... "cut",
... "color",
... "clarity",
... "depth",
... "table",
... "price",
... ]
>>> diamonds4 = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "price": np.int16,
... "cut": "category",
... "color": "category",
... "clarity": "category",
... },
... usecols=cols,
... )
>>> diamonds4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
carat 1000 non-null float32
cut 1000 non-null category
color 1000 non-null category
clarity 1000 non-null category
depth 1000 non-null float32
table 1000 non-null float32
price 1000 non-null int16
dtypes: category(3), float32(3), int16(1)
memory usage: 17.7 KB
Memory usage is now only about 23% of the original.
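usecols also accepts a callable over the column names, which can be handier when it is easier to say which columns to drop; a sketch under that assumption:
>>> diamonds4b = pd.read_csv(
...     "data/diamonds.csv",
...     nrows=1000,
...     usecols=lambda name: name not in {"x", "y", "z"},
... )
>>> list(diamonds4b.columns)
['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price']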
If you only need to work on one piece of the data at a time, use the chunksize parameter:
>>> cols = [
... "carat",
... "cut",
... "color",
... "clarity",
... "depth",
... "table",
... "price",
... ]
>>> diamonds_iter = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "price": np.int16,
... "cut": "category",
... "color": "category",
... "clarity": "category",
... },
... usecols=cols,
... chunksize=200,
... )
>>> def process(df):
... return f"processed {df.size} items"
>>> for chunk in diamonds_iter:
... process(chunk)
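Each chunk is an ordinary DataFrame, so partial results can be accumulated across chunks without ever holding all rows at once. A minimal sketch (the name price_total is ours) that totals the price column; note the iterator above was consumed by the loop, so we create a fresh one:
>>> diamonds_iter = pd.read_csv(
...     "data/diamonds.csv",
...     nrows=1000,
...     usecols=["price"],
...     chunksize=200,
... )
>>> price_total = sum(chunk.price.sum() for chunk in diamonds_iter)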
Because CSV files store no type information, pandas has to infer the type of each column. If every value in a column is a whole number and none are missing, pandas uses int64. If a column is numeric but not integral, or has missing values, pandas uses float64. Both types take up a lot of memory; if, say, every value is below 200, you can use a smaller type such as np.int16 (or np.uint8 if they are all non-negative, since np.int8 only reaches 127).
If a column turns out to be non-numeric, pandas converts it to object. Object columns are memory-hungry because each value is stored as a Python string; converting such a column to category can save a lot of space, since each unique string is then stored only once.
Storing price as int8 would lose information. You can use NumPy's iinfo function to list the limits of the NumPy integer types:
>>> np.iinfo(np.int8)
iinfo(min=-128, max=127, dtype=int8)
Use np.finfo for the limits of the floating-point types:
>>> np.finfo(np.float16)
finfo(resolution=0.001, min=-6.55040e+04,
max=6.55040e+04, dtype=float16)
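Instead of picking types by hand, pd.to_numeric can downcast a column to the smallest type that holds its values; for this 1,000-row slice (maximum price 2898) it should pick int16:
>>> pd.to_numeric(diamonds.price, downcast="integer").dtype
dtype('int16')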
You can also ask a DataFrame or Series how many bytes it is using with the .memory_usage method. Note that this includes the row index; pass deep=True to get the true usage of an object-typed Series:
>>> diamonds.price.memory_usage()
8080
>>> diamonds.price.memory_usage(index=False)
8000
>>> diamonds.cut.memory_usage()
8080
>>> diamonds.cut.memory_usage(deep=True)
63413
Once the types are settled, you can save the data in a binary format that records them, such as Feather (pandas relies on the pyarrow library for this):
>>> diamonds4.to_feather("d.arr")
>>> diamonds5 = pd.read_feather("d.arr")
Another option is the Parquet format:
>>> diamonds4.to_parquet("/tmp/d.pqt")
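Both formats can be loaded back with the matching readers (read_parquet also needs pyarrow, or fastparquet). The dtypes should survive the round trip; a sketch:
>>> diamonds6 = pd.read_parquet("/tmp/d.pqt")
>>> diamonds6.dtypes.equals(diamonds4.dtypes)
True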
To write Excel files you need xlwt (for .xls) or openpyxl (for .xlsx) installed. Save with the .to_excel method:
>>> beatles.to_excel("beat.xls")
>>> beatles.to_excel("beat.xlsx")
Read the Excel file back with read_excel:
>>> beat2 = pd.read_excel("beat.xls")
>>> beat2
Unnamed: 0 first last birth
0 0 Paul McCartney 1942
1 1 John Lennon 1940
2 2 Richard Starkey 1940
3 3 George Harrison 1943
Use the index_col parameter to set the row index:
>>> beat2 = pd.read_excel("beat.xls", index_col=0)
>>> beat2
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
The types survived the Excel round trip:
>>> beat2.dtypes
first object
last object
birth int64
dtype: object
Use the sheet_name parameter to name the sheets:
>>> xl_writer = pd.ExcelWriter("beat2.xlsx")
>>> beatles.to_excel(xl_writer, sheet_name="All")
>>> beatles[beatles.birth < 1941].to_excel(
... xl_writer, sheet_name="1940"
... )
>>> xl_writer.save()
This Excel file contains two sheets, one named All and one named 1940.
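ExcelWriter also works as a context manager, which saves the file on exit without an explicit .save call; the same two-sheet workbook as a sketch:
>>> with pd.ExcelWriter("beat2.xlsx") as xl_writer:
...     beatles.to_excel(xl_writer, sheet_name="All")
...     beatles[beatles.birth < 1941].to_excel(
...         xl_writer, sheet_name="1940"
...     )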
If a CSV file is the only file inside a ZIP archive, it can be read directly with the read_csv function:
>>> autos = pd.read_csv("data/vehicles.csv.zip")
>>> autos
barrels08 barrelsA08 ... phevHwy phevComb
0 15.695714 0.0 ... 0 0
1 29.964545 0.0 ... 0 0
2 12.207778 0.0 ... 0 0
3 29.964545 0.0 ... 0 0
4 17.347895 0.0 ... 0 0
... ... ... ... ... ...
41139 14.982273 0.0 ... 0 0
41140 14.330870 0.0 ... 0 0
41141 15.695714 0.0 ... 0 0
41142 15.695714 0.0 ... 0 0
41143 18.311667 0.0 ... 0 0
>>> autos.modifiedOn.dtype
dtype('O')
The modifiedOn column holds dates, but they arrive as strings. One option is to pass parse_dates to read_csv when loading; the other is to load first and then parse with the to_datetime function:
>>> autos= pd.read_csv(
... "data/vehicles.csv.zip", parse_dates=["modifiedOn"]
... )
>>> autos.modifiedOn
0 2013-01-0...
1 2013-01-0...
2 2013-01-0...
3 2013-01-0...
4 2013-01-0...
...
41139 2013-01-0...
41140 2013-01-0...
41141 2013-01-0...
41142 2013-01-0...
41143 2013-01-0...
Name: modifiedOn, Length: 41144, dtype: datetime64[ns, tzlocal()]
Without parse_dates, the same column comes back as plain strings:
>>> autos.modifiedOn
0 Tue Jan 01 00:00:00 EST 2013
1 Tue Jan 01 00:00:00 EST 2013
2 Tue Jan 01 00:00:00 EST 2013
3 Tue Jan 01 00:00:00 EST 2013
4 Tue Jan 01 00:00:00 EST 2013
...
39096 Tue Jan 01 00:00:00 EST 2013
39097 Tue Jan 01 00:00:00 EST 2013
39098 Tue Jan 01 00:00:00 EST 2013
39099 Tue Jan 01 00:00:00 EST 2013
39100 Tue Jan 01 00:00:00 EST 2013
Name: modifiedOn, Length: 39101, dtype: object
>>> pd.to_datetime(autos.modifiedOn)
0 2013-01-01
1 2013-01-01
2 2013-01-01
3 2013-01-01
4 2013-01-01
...
39096 2013-01-01
39097 2013-01-01
39098 2013-01-01
39099 2013-01-01
39100 2013-01-01
Name: modifiedOn, Length: 39101, dtype: datetime64[ns]
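Note the dtype difference between the two approaches: parse_dates kept the timezone, while the plain pd.to_datetime call above dropped it. Passing utc=True is one way to get a consistent timezone-aware result; a sketch:
>>> pd.to_datetime(autos.modifiedOn, utc=True).dtype
datetime64[ns, UTC]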
If the ZIP archive holds multiple files, use the zipfile module. Because the first data row of this survey (the file's second line) contains the question text, we split it off into kag_questions and keep the remaining rows in survey:
>>> import zipfile
>>> with zipfile.ZipFile(
... "data/kaggle-survey-2018.zip"
... ) as z:
... print("\n".join(z.namelist()))
... kag = pd.read_csv(
... z.open("multipleChoiceResponses.csv")
... )
... kag_questions = kag.iloc[0]
... survey = kag.iloc[1:]
multipleChoiceResponses.csv
freeFormResponses.csv
SurveySchema.csv
>>> survey.head(2).T
1 2
Time from... 710 434
Q1 Female Male
Q1_OTHER_... -1 -1
Q2 45-49 30-34
Q3 United S... Indonesia
... ... ...
Q50_Part_5 NaN NaN
Q50_Part_6 NaN NaN
Q50_Part_7 NaN NaN
Q50_Part_8 NaN NaN
Q50_OTHER... -1 -1
As long as the compressed archive holds a single file, the read_csv function can also read GZIP, BZ2, and XZ files.
Create a SQLite database to store the Beatles data:
>>> import sqlite3
>>> con = sqlite3.connect("data/beat.db")
>>> with con:
... cur = con.cursor()
... cur.execute("""DROP TABLE Band""")
... cur.execute(
... """CREATE TABLE Band(id INTEGER PRIMARY KEY,
... fname TEXT, lname TEXT, birthyear INT)"""
... )
... cur.execute(
... """INSERT INTO Band VALUES(
... 0, 'Paul', 'McCartney', 1942)"""
... )
... cur.execute(
... """INSERT INTO Band VALUES(
... 1, 'John', 'Lennon', 1940)"""
... )
... _ = con.commit()
Read from the database into a DataFrame, here using SQLAlchemy:
>>> import sqlalchemy as sa
>>> engine = sa.create_engine(
... "sqlite:///data/beat.db", echo=True
... )
>>> sa_connection = engine.connect()
>>> beat = pd.read_sql(
... "Band", sa_connection, index_col="id"
... )
>>> beat
fname lname birthyear
id
0 Paul McCartney 1942
1 John Lennon 1940
Data can also be read with a SQL statement, using either the SQLite connection or the SQLAlchemy one:
>>> sql = """SELECT fname, birthyear from Band"""
>>> fnames = pd.read_sql(sql, con)
>>> fnames
fname birthyear
0 Paul 1942
1 John 1940
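The reverse direction is .to_sql, which writes a DataFrame into a database table; a sketch using a table name of our own, Band2:
>>> _ = beat.to_sql("Band2", con, if_exists="replace")
>>> pd.read_sql("SELECT * FROM Band2", con)
   id fname      lname  birthyear
0   0  Paul  McCartney       1942
1   1  John     Lennon       1940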
Encoding and loading JSON data:
>>> import json
>>> encoded = json.dumps(people)
>>> encoded
'{"first": ["Paul", "John", "Richard", "George"], "last": ["McCartney", "Lennon", "Starkey", "Harrison"], "birth": [1942, 1940, 1940, 1943]}'
>>> json.loads(encoded)
{'first': ['Paul', 'John', 'Richard', 'George'], 'last': ['McCartney', 'Lennon', 'Starkey', 'Harrison'], 'birth': [1942, 1940, 1940, 1943]}
Use read_json to read the JSON; the dictionary keys become the column names:
>>> beatles = pd.read_json(encoded)
>>> beatles
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
When reading JSON, pandas supports several specific orientations:
- columns (the default): maps column names to lists of that column's values;
- records: a list of rows, each row a dictionary mapping column names to values;
- split: maps columns to the column names, index to the row index values, and data to a list of row values;
- index: maps index values to rows, each row a dictionary of column-to-value pairs;
- values: a list of data rows (each itself a list), with no column names or index values;
- table: maps schema to the DataFrame's schema and data to a list of dictionaries.
See the following code:
>>> records = beatles.to_json(orient="records")
>>> records
'[{"first":"Paul","last":"McCartney","birth":1942},{"first":"John","last":"Lennon","birth":1940},{"first":"Richard","last":"Starkey","birth":1940},{"first":"George","last":"Harrison","birth":1943}]'
>>> pd.read_json(records, orient="records")
birth first last
0 1942 Paul McCartney
1 1940 John Lennon
2 1940 Richard Starkey
3 1943 George Harrison
>>> split = beatles.to_json(orient="split")
>>> split
'{"columns":["first","last","birth"],"index":[0,1,2,3],"data":[["Paul","McCartney",1942],["John","Lennon",1940],["Richard","Starkey",1940],["George","Harrison",1943]]}'
>>> pd.read_json(split, orient="split")
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> index = beatles.to_json(orient="index")
>>> index
'{"0":{"first":"Paul","last":"McCartney","birth":1942},"1":{"first":"John","last":"Lennon","birth":1940},"2":{"first":"Richard","last":"Starkey","birth":1940},"3":{"first":"George","last":"Harrison","birth":1943}}'
>>> pd.read_json(index, orient="index")
birth first last
0 1942 Paul McCartney
1 1940 John Lennon
2 1940 Richard Starkey
3 1943 George Harrison
>>> values = beatles.to_json(orient="values")
>>> values
'[["Paul","McCartney",1942],["John","Lennon",1940],["Richard","Starkey",1940],["George","Harrison",1943]]'
>>> pd.read_json(values, orient="values")
0 1 2
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> (
... pd.read_json(values, orient="values").rename(
... columns=dict(
... enumerate(["first", "last", "birth"])
... )
... )
... )
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> table = beatles.to_json(orient="table")
>>> table
'{"schema": {"fields":[{"name":"index","type":"integer"},{"name":"first","type":"string"},{"name":"last","type":"string"},{"name":"birth","type":"integer"}],"primaryKey":["index"],"pandas_version":"0.20.0"}, "data": [{"index":0,"first":"Paul","last":"McCartney","birth":1942},{"index":1,"first":"John","last":"Lennon","birth":1940},{"index":2,"first":"Richard","last":"Starkey","birth":1940},{"index":3,"first":"George","last":"Harrison","birth":1943}]}'
>>> pd.read_json(table, orient="table")
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
To convert the data to a dictionary, use the .to_dict method; this makes it easy to add extra entries:
>>> output = beat.to_dict()
>>> output
{'fname': {0: 'Paul', 1: 'John'}, 'lname': {0: 'McCartney', 1: 'Lennon'}, 'birthyear': {0: 1942, 1: 1940}}
>>> output["version"] = "0.4.1"
>>> json.dumps(output)
'{"fname": {"0": "Paul", "1": "John"}, "lname": {"0": "McCartney", "1": "Lennon"}, "birthyear": {"0": 1942, "1": 1940}, "version": "0.4.1"}'
Pandas can also read tables embedded in HTML, such as those in the Wikipedia article on the Beatles discography:
>>> url = 'https://en.wikipedia.org/wiki/The_Beatles_discography'
>>> dfs = pd.read_html(url)
>>> len(dfs)
51
Fifty-one DataFrames were read; inspect the first one:
>>> dfs[0]
The Beatles discography The Beatles discography.1
0 The Beat... The Beat...
1 Studio a... 23
2 Live albums 5
3 Compilat... 53
4 Video al... 15
5 Music vi... 64
6 EPs 21
7 Singles 63
8 Mash-ups 2
9 Box sets 15
The .read_html function has a match parameter that takes a string or a regular expression, and an attrs parameter for locating a table by its HTML attributes.
Inspect the table element in the HTML:
<table class="wikitable plainrowheaders" style="text-align:center;">
<caption>List of studio albums,<sup id="cite_ref-1" class="reference"><a href="#cite_note-1">[A]</a></sup> with selected chart positions and certifications
</caption>
<tbody>
<tr>
<th scope="col" rowspan="2" style="width:20em;">Title
</th>
<th scope="col" rowspan="2" style="width:20em;">Release
...
The table has no usable attributes, but the string List of studio albums matches its caption. Missing values are marked with "—", which we pass as na_values:
>>> url = "https://en.wikipedia.org/wiki/The_Beatles_discography"
>>> dfs = pd.read_html(
... url, match="List of studio albums", na_values="—"
... )
>>> len(dfs)
1
>>> dfs[0].columns
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
Use the first two rows as column names; the result is still messy:
>>> url = "https://en.wikipedia.org/wiki/The_Beatles_discography"
>>> dfs = pd.read_html(
... url,
... match="List of studio albums",
... na_values="—",
... header=[0, 1],
... )
>>> len(dfs)
1
>>> dfs[0]
Title Release ... Peak chart positions Certifications
Title Release ... US[8][9] Certifications
0 Please P... Released... ... NaN BPI: Gol...
1 With the... Released... ... NaN BPI: Gol...
2 Introduc... Released... ... 2 RIAA: Pl...
3 Meet the... Released... ... 1 MC: Plat...
4 Twist an... Released... ... NaN MC: 3× P...
.. ... ... ... ... ...
22 The Beat... Released... ... 1 BPI: 2× ...
23 Yellow S... Released... ... 2 BPI: Gol...
24 Abbey Road Released... ... 1 BPI: 2× ...
25 Let It Be Released... ... 1 BPI: Gol...
26 "—" deno... "—" deno... ... "—" deno... "—" deno...
>>> dfs[0].columns
MultiIndex(levels=[['Certifications', 'Peak chart positions', 'Release', 'Title'], ['AUS[3]', 'CAN[4]', 'Certifications', 'FRA[5]', 'GER[6]', 'NOR[7]', 'Release', 'Title', 'UK[1][2]', 'US[8][9]']],
codes=[[3, 2, 1, 1, 1, 1, 1, 1, 1, 0], [7, 6, 8, 0, 1, 3, 4, 5, 9, 2]])
In a case like this, the easiest fix is to overwrite the column index:
>>> df = dfs[0]
>>> df.columns = [
... "Title",
... "Release",
... "UK",
... "AUS",
... "CAN",
... "FRA",
... "GER",
... "NOR",
... "US",
... "Certifications",
... ]
>>> df
Title Release ... US Certifications
0 Please P... Released... ... NaN BPI: Gol...
1 With the... Released... ... NaN BPI: Gol...
2 Introduc... Released... ... 2 RIAA: Pl...
3 Meet the... Released... ... 1 MC: Plat...
4 Twist an... Released... ... NaN MC: 3× P...
.. ... ... ... ... ...
22 The Beat... Released... ... 1 BPI: 2× ...
23 Yellow S... Released... ... 2 BPI: Gol...
24 Abbey Road Released... ... 1 BPI: 2× ...
25 Let It Be Released... ... 1 BPI: Gol...
26 "—" deno... "—" deno... ... "—" deno... "—" deno...
Next, clean up the Release column. In the HTML, a cell looks like this:
<th scope="row" rowspan="2">
<i><a href="/wiki/A_Hard_Day%27s_Night_(album)" title="A Hard Day's Night (album)">A Hard Day's Night</a></i>
<img alt="double-dagger" src="//upload.wikimedia.org/wikipedia/commons/f/f9/Double-dagger-14-plain.png" decoding="async" width="9" height="14" data-file-width="9" data-file-height="14">
</th>
Split the Release column into separate release_date and label columns:
>>> res = (
... df.pipe(
... lambda df_: df_[
... ~df_.Title.str.startswith("Released")
... ]
... )
... .assign(
... release_date=lambda df_: pd.to_datetime(
... df_.Release.str.extract(
... r"Released: (.*) Label"
... )[0].str.replace(r"\[E\]", "")
... ),
... label=lambda df_: df_.Release.str.extract(
... r"Label: (.*)"
... ),
... )
... .loc[
... :,
... [
... "Title",
... "UK",
... "AUS",
... "CAN",
... "FRA",
... "GER",
... "NOR",
... "US",
... "release_date",
... "label",
... ],
... ]
... )
>>> res
Title UK ... release_date label
0 Please P... 1 ... 1963-03-22 Parlopho...
1 With the... 1 ... 1963-11-22 Parlopho...
2 Introduc... NaN ... 1964-01-10 Vee-Jay ...
3 Meet the... NaN ... 1964-01-20 Capitol ...
4 Twist an... NaN ... 1964-02-03 Capitol ...
.. ... ... ... ... ...
21 Magical ... 31 ... 1967-11-27 Parlopho...
22 The Beat... 1 ... 1968-11-22 Apple
23 Yellow S... 3 ... 1969-01-13 Apple (U...
24 Abbey Road 1 ... 1969-09-26 Apple
25 Let It Be 1 ... 1970-05-08 Apple
GitHub renders a CSV file as an HTML table, so an online CSV hosted there can be read with read_html by targeting the table's class attribute:
>>> url = "https://github.com/mattharrison/datasets/blob/master/data/anscombes.csv"
>>> dfs = pd.read_html(url, attrs={"class": "csv-data"})
>>> len(dfs)
1
>>> dfs[0]
Unnamed: 0 quadrant x y
0 NaN I 10.0 8.04
1 NaN I 14.0 9.96
2 NaN I 6.0 7.24
3 NaN I 9.0 8.81
4 NaN I 4.0 4.26
.. ... ... ... ...
39 NaN IV 8.0 6.58
40 NaN IV 8.0 7.91
41 NaN IV 8.0 8.47
42 NaN IV 8.0 5.25
43 NaN IV 8.0 6.89
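read_html works here only because GitHub displays the CSV as an HTML table. To fetch the file itself, pd.read_csv accepts a URL directly; the raw.githubusercontent.com address below is our assumption about where GitHub serves the raw bytes:
>>> raw_url = (
...     "https://raw.githubusercontent.com/mattharrison/"
...     "datasets/master/data/anscombes.csv"
... )
>>> anscombe = pd.read_csv(raw_url)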