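This installment wraps up the pandas beginner series. The series has three parts in total; after working through them you should have a basic grasp of how to use the classic pandas library. Previous parts:
10 Minutes to Pandas, Series (1)
10 Minutes to Pandas, Series (2)
pandas can include categorical data in a DataFrame.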
In []: import pandas as pd
...: import numpy as np
...:
...: df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
...: "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
Convert the raw grades to a categorical data type:
In []: df["grade"] = df["raw_grade"].astype("category")
...: df["grade"]
Out[]:
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
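To make the conversion above more concrete, here is a minimal sketch of what a categorical Series holds internally: the distinct labels once, plus one integer code per row. The variable names `raw` and `grade` are chosen for illustration only.

```python
import pandas as pd

raw = pd.Series(["a", "b", "b", "a", "a", "e"])
grade = raw.astype("category")

# The distinct labels are stored once (sorted lexically by default).
print(list(grade.cat.categories))   # ['a', 'b', 'e']

# Each row holds a small integer that indexes into the categories above.
print(grade.cat.codes.tolist())     # [0, 1, 1, 0, 0, 2]
```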
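Rename the categories to more meaningful names. Older pandas versions allowed in-place assignment to Series.cat.categories; that setter was removed in pandas 2.0, so use rename_categories and assign the result back:
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])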
Reorder the categories and add the missing ones at the same time (by default, the methods under Series.cat return a new Series):
In []: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium",
...: "good", "very good"])
In []: df
Out[]:
   id raw_grade      grade
0   1         a  very good
1   2         b       good
2   3         b       good
3   4         a  very good
4   5         a  very good
5   6         e   very bad
In []: df["grade"]
Out[]:
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
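One detail of set_categories is worth a quick check: a value whose label is missing from the new category list becomes NaN. The following is a small sketch with made-up data (the names `s` and `narrowed` are illustrative), not part of the original walkthrough.

```python
import pandas as pd

s = pd.Series(["a", "b", "e"], dtype="category")

# Keep only "a" and "b"; the row labelled "e" no longer has a category.
narrowed = s.cat.set_categories(["a", "b"])

print(list(narrowed.cat.categories))   # ['a', 'b']
print(narrowed.isna().tolist())        # [False, False, True]
```

So when reordering, it helps to make sure every existing label appears in the new list.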
Sorting follows the order of the categories, not the lexical order of the labels:
In []: df.sort_values(by="grade")
Out[]:
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good
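The difference from a plain string sort can be seen side by side. This is a minimal sketch using a shortened list of labels; `as_text` and `as_category` are illustrative names.

```python
import pandas as pd

labels = ["very good", "good", "good", "very bad"]
order = ["very bad", "bad", "medium", "good", "very good"]

as_text = pd.Series(labels)
as_category = as_text.astype("category").cat.set_categories(order)

# Plain strings sort alphabetically.
print(as_text.sort_values().tolist())
# ['good', 'good', 'very bad', 'very good']

# Categorical values sort by their position in `order`.
print(as_category.sort_values().tolist())
# ['very bad', 'good', 'good', 'very good']
```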
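Grouping by a categorical column also shows empty categories. Pass observed=False explicitly, because recent pandas versions warn that the default is changing to observed=True, which hides them:
In []: df.groupby("grade", observed=False).size()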
Out[]:
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
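Whether empty categories appear in the result is controlled by the `observed` argument of groupby. The sketch below rebuilds the grade column directly with pd.Categorical (a shortcut used here for brevity, not the steps from the walkthrough) and compares both settings.

```python
import pandas as pd

grade = pd.Categorical(
    ["very good", "good", "good", "very good", "very good", "very bad"],
    categories=["very bad", "bad", "medium", "good", "very good"],
)
df = pd.DataFrame({"grade": grade})

# observed=False keeps every declared category, even with zero rows.
all_groups = df.groupby("grade", observed=False).size()
# observed=True keeps only the categories that actually occur.
seen_groups = df.groupby("grade", observed=True).size()

print(all_groups.to_dict())
# {'very bad': 1, 'bad': 0, 'medium': 0, 'good': 2, 'very good': 3}
print(seen_groups.to_dict())
# {'very bad': 1, 'good': 2, 'very good': 3}
```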
Plotting documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.close('all')
# 1000 random steps, one per day; any length works as long as both counts match
ts = pd.Series(np.random.randn(1000),
               index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
[Figure: line plot of the cumulative-sum time series ts]
# One row per timestamp in ts, one column per label
df = pd.DataFrame(np.random.randn(len(ts.index), 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')
[Figure: line plot of columns A, B, C and D with a legend]
Write to a CSV file:
df.to_csv('foo.csv')
Read from a CSV file:
pd.read_csv('foo.csv')
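A detail to watch in the CSV round trip: to_csv writes the index as the first column, and read_csv does not turn it back into the index unless asked. The sketch below uses an in-memory buffer and a tiny made-up frame instead of foo.csv so that it leaves no file behind.

```python
import io
import pandas as pd

df = pd.DataFrame({"A": [1.5, 2.5]}, index=["x", "y"])

# With no path given, to_csv returns the CSV text as a string.
csv_text = df.to_csv()

# Default read: the old index comes back as an unnamed data column.
plain = pd.read_csv(io.StringIO(csv_text))
# index_col=0 restores the first column as the index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())     # ['Unnamed: 0', 'A']
print(restored.index.tolist())    # ['x', 'y']
```

Alternatively, df.to_csv('foo.csv', index=False) skips writing the index when it carries no information.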
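Write to an HDF5 store (requires the optional PyTables package; pass the key by keyword, since recent pandas versions deprecate passing it positionally):
df.to_hdf('foo.h5', key='df')
Read from an HDF5 store:
pd.read_hdf('foo.h5', key='df')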
EXCEL
Write to an Excel file:
df.head().to_excel('foo.xlsx', sheet_name='Sheet1')
Read from an Excel file:
pd.read_excel('foo.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])
If you try an operation like the following, you may see an exception such as this one:
if pd.Series([False, True, False]):
    print("I was true")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-5c782b38cd2f> in <module>
----> 1 if pd.Series([False, True, False]):
print("I was true")
D:\Anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
"The truth value of a {0} is ambiguous. "
"Use a.empty, a.bool(), a.item(), a.any() or a.all().".format(
-> self.__class__.__name__
)
)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
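The error occurs because the truth value of a Series with several elements is ambiguous (some may be True and others False). Say which test you mean by using a.empty, a.item(), a.any() or a.all(); the message also lists a.bool(), which recent pandas versions deprecate.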
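As a minimal sketch of the explicit alternatives, using the same three-element Series as the failing example:

```python
import pandas as pd

s = pd.Series([False, True, False])

print(s.any())     # True: at least one element is True
print(s.all())     # False: not every element is True
print(s.empty)     # False: the Series contains elements

# Using the Series itself as a condition still raises ValueError.
try:
    if s:
        pass
except ValueError as exc:
    print("ValueError:", exc)
```

Which of these to use depends on the question being asked: any() for "is at least one value True", all() for "are all values True", and empty for "is there any data at all".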