Pandas can fetch Yahoo Finance data directly, which is extremely convenient. Note that the official docs' from pandas.io import data, wb
must be replaced with from pandas_datareader import data, wb
.
See the pandas-datareader documentation for details.
We use the Shanghai Composite Index, ticker 000001.SS.
import pandas as pd
import numpy as np
from pandas_datareader import data, wb # install with: pip install pandas_datareader
import datetime
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
%matplotlib inline
The site also offers the data as a CSV download; you can download it and then read it with sh_table = pd.read_csv('sh_table.csv')
.
# define the time range for the data
start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2016, 5, 20)
sh = data.DataReader("000001.SS", 'yahoo', start, end)
sh.head(3) # data fetched successfully
Date | Open | High | Low | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2010-01-04 | 3243.76 | 3243.76 | 3243.76 | 3243.76 | 0 | 3243.76 |
2010-01-05 | 3282.18 | 3282.18 | 3282.18 | 3282.18 | 0 | 3282.18 |
2010-01-06 | 3254.22 | 3254.22 | 3254.22 | 3254.22 | 0 | 3254.22 |
sh.describe() # overall summary of the data
Open | High | Low | Close | Volume | Adj Close | |
---|---|---|---|---|---|---|
count | 1558.000000 | 1558.000000 | 1558.000000 | 1558.000000 | 1558.0 | 1558.000000 |
mean | 2659.427452 | 2659.427452 | 2659.427452 | 2659.427452 | 0.0 | 2659.427452 |
std | 595.170673 | 595.170673 | 595.170673 | 595.170673 | 0.0 | 595.170673 |
min | 1950.010000 | 1950.010000 | 1950.010000 | 1950.010000 | 0.0 | 1950.010000 |
25% | 2199.832500 | 2199.832500 | 2199.832500 | 2199.832500 | 0.0 | 2199.832500 |
50% | 2492.600000 | 2492.600000 | 2492.600000 | 2492.600000 | 0.0 | 2492.600000 |
75% | 2968.977500 | 2968.977500 | 2968.977500 | 2968.977500 | 0.0 | 2968.977500 |
max | 5166.350000 | 5166.350000 | 5166.350000 | 5166.350000 | 0.0 | 5166.350000 |
sh['Close'].plot(); # plot the closing-price trend
pd.read_csv and to_csv
Reading data from a file and writing it back out are very common operations.
sh.to_csv('sh.csv', header=False)
names = ['Date','Open','High','Low','Close','Volume','Adj Close']
sh1 = pd.read_csv('sh.csv',names=names,index_col='Date')
sh1.tail(2)
Date | Open | High | Low | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2016-05-19 | 2806.91 | 2806.91 | 2806.91 | 2806.91 | 0 | 2806.91 |
2016-05-20 | 2825.48 | 2825.48 | 2825.48 | 2825.48 | 0 | 2825.48 |
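The no-header round trip above can be sketched on a toy frame (all values below are made up for illustration): writing without a header means the column names must be supplied again when reading.

```python
import pandas as pd
from io import StringIO

# a tiny stand-in for the downloaded index data (made-up values)
df = pd.DataFrame({"Open": [10.0, 11.0], "Close": [10.5, 11.2]},
                  index=pd.to_datetime(["2010-01-04", "2010-01-05"]))
df.index.name = "Date"

buf = StringIO()
df.to_csv(buf, header=False)  # no header row, like the example above
buf.seek(0)

# reading it back: supply the column names by hand and parse the Date index
df2 = pd.read_csv(buf, names=["Date", "Open", "Close"],
                  index_col="Date", parse_dates=True)
```

An in-memory StringIO buffer stands in for the file here, so the sketch runs without touching disk.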
# drop the Volume column, since all its values are 0
sh = sh.drop('Volume',axis=1)
sh.head(2) # the Volume column is gone
Date | Open | High | Low | Close | Adj Close |
---|---|---|---|---|---|
2010-01-04 | 3243.76 | 3243.76 | 3243.76 | 3243.76 | 3243.76 |
2010-01-05 | 3282.18 | 3282.18 | 3282.18 | 3282.18 | 3282.18 |
Sometimes data is missing or incomplete, so it pays to check.
# detect missing values with the isnull function
pd.isnull(sh).head() # or sh.isnull()
Date | Open | High | Low | Close | Adj Close |
---|---|---|---|---|---|
2010-01-04 | False | False | False | False | False |
2010-01-05 | False | False | False | False | False |
2010-01-06 | False | False | False | False | False |
2010-01-07 | False | False | False | False | False |
2010-01-08 | False | False | False | False | False |
This returns a boolean DataFrame, though; I first just want to know whether any value is missing at all, without scanning by eye. any() answers that, and if there are missing values, sum() counts them.
#参考 http://stackoverflow.com/questions/29530232/python-pandas-check-if-any-value-is-nan-in-dataframe
sh.isnull().values.any() # the data is complete
False
sh.isnull().values.sum() # no missing values
0
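The same pattern on a toy frame with one deliberate hole (made-up data) shows what each call returns:

```python
import numpy as np
import pandas as pd

# made-up frame containing exactly one NaN
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

has_missing = toy.isnull().values.any()  # True: at least one NaN somewhere
n_missing = toy.isnull().values.sum()    # total count of NaN cells
per_column = toy.isnull().sum()          # NaN count per column, as a Series
```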
sh.head()
Date | Open | High | Low | Close | Adj Close |
---|---|---|---|---|---|
2010-01-04 | 3243.76 | 3243.76 | 3243.76 | 3243.76 | 3243.76 |
2010-01-05 | 3282.18 | 3282.18 | 3282.18 | 3282.18 | 3282.18 |
2010-01-06 | 3254.22 | 3254.22 | 3254.22 | 3254.22 | 3254.22 |
2010-01-07 | 3192.78 | 3192.78 | 3192.78 | 3192.78 | 3192.78 |
2010-01-08 | 3196.00 | 3196.00 | 3196.00 | 3196.00 | 3196.00 |
Here we set a few positions to np.nan and then fill them back in. This has no practical meaning; it is just for practice.
Set the NaNs as follows:
sh.iloc[0,:] = np.nan
sh.iloc[[1,3],1] = np.nan
sh.iloc[2,2] = np.nan
sh.iloc[3,3] = np.nan
sh.head(4)
Date | Open | High | Low | Close | Adj Close |
---|---|---|---|---|---|
2010-01-04 | NaN | NaN | NaN | NaN | NaN |
2010-01-05 | 3282.18 | NaN | 3282.18 | 3282.18 | 3282.18 |
2010-01-06 | 3254.22 | 3254.22 | NaN | 3254.22 | 3254.22 |
2010-01-07 | 3192.78 | NaN | 3192.78 | NaN | 3192.78 |
# count NaN values per column
def num_missing(x):
    return x.isnull().sum()
print("Missing values count per column:")
print(sh.apply(num_missing, axis=0))
Missing values count per column:
Open 1
High 3
Low 2
Close 2
Adj Close 1
dtype: int64
dropna takes a few parameters:
how='all' drops only rows that are entirely NaN (columns, if axis=1);
how='any' (the default) drops every row or column containing any NaN;
inplace=True modifies the DataFrame in place and returns nothing;
inplace=False (the default) returns the dropped result and leaves the original DataFrame unchanged.
sh.dropna(how='all',inplace=True);
sh.head(3)
Date | Open | High | Low | Close | Adj Close |
---|---|---|---|---|---|
2010-01-05 | 3282.18 | NaN | 3282.18 | 3282.18 | 3282.18 |
2010-01-06 | 3254.22 | 3254.22 | NaN | 3254.22 | 3254.22 |
2010-01-07 | 3192.78 | NaN | 3192.78 | NaN | 3192.78 |
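The how='all' vs how='any' distinction is easy to see on a small made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, np.nan],
                   [1.0,    np.nan],
                   [2.0,    3.0]], columns=["x", "y"])

all_dropped = df.dropna(how="all")  # only the fully-NaN first row is removed
any_dropped = df.dropna(how="any")  # every row containing a NaN is removed
```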
Filling in missing values with fillna
sh['High'].fillna(sh['High'].mean(),inplace=True) # fill with the column mean
sh['Low'].fillna(method='ffill',inplace=True) # fill with the value from the previous row
sh['Close'].fillna(method='bfill',inplace=True) # fill with the value from the next row
sh.head()
Date | Open | High | Low | Close | Adj Close |
---|---|---|---|---|---|
2010-01-05 | 3282.18 | 2658.308199 | 3282.18 | 3282.18 | 3282.18 |
2010-01-06 | 3254.22 | 3254.220000 | 3282.18 | 3254.22 | 3254.22 |
2010-01-07 | 3192.78 | 2658.308199 | 3192.78 | 3196.00 | 3192.78 |
2010-01-08 | 3196.00 | 3196.000000 | 3196.00 | 3196.00 | 3196.00 |
2010-01-11 | 3212.75 | 3212.750000 | 3212.75 | 3212.75 | 3212.75 |
sh.isnull().values.sum()
0
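The three fill strategies used above, sketched on a made-up Series; newer pandas spells the forward and backward fills as .ffill() and .bfill():

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

filled_mean = s.fillna(s.mean())  # constant fill with the mean (here 3.0)
filled_ffill = s.ffill()          # carry the previous value forward
filled_bfill = s.bfill()          # pull the next value backward
```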
The daily change is the day's price minus the previous day's close. Add a Change column holding the difference between each day's Close and the Close of the day before. Note that the data has gaps: some dates carry no record.
change = sh.Close.diff()
change.fillna(change.mean(),inplace=True)
sh['Change'] = change
sh.head(3)
Date | Open | High | Low | Close | Adj Close | Change |
---|---|---|---|---|---|---|
2010-01-05 | 3282.18 | 2658.308199 | 3282.18 | 3282.18 | 3282.18 | -0.293509 |
2010-01-06 | 3254.22 | 3254.220000 | 3282.18 | 3254.22 | 3254.22 | -27.960000 |
2010-01-07 | 3192.78 | 2658.308199 | 3192.78 | 3196.00 | 3192.78 | -58.220000 |
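diff() itself is simple: each element minus the one before it, with NaN in the first slot. On a made-up series:

```python
import pandas as pd

close = pd.Series([100.0, 102.0, 101.0, 101.0])
change = close.diff()  # first value is NaN: there is no previous close
```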
# sort by Change; sh itself is unchanged (pass inplace=True if you want that)
sh.sort_values(by='Change',ascending=False).head(3)
Date | Open | High | Low | Close | Adj Close | Change |
---|---|---|---|---|---|---|
2015-10-12 | 3287.66 | 3287.66 | 3287.66 | 3287.66 | 3287.66 | 234.88 |
2015-06-30 | 4277.22 | 4277.22 | 4277.22 | 4277.22 | 4277.22 | 224.19 |
2015-06-01 | 4828.74 | 4828.74 | 4828.74 | 4828.74 | 4828.74 | 217.00 |
sh.head()
Date | Open | High | Low | Close | Adj Close | Change |
---|---|---|---|---|---|---|
2010-01-05 | 3282.18 | 2658.308199 | 3282.18 | 3282.18 | 3282.18 | -0.293509 |
2010-01-06 | 3254.22 | 3254.220000 | 3282.18 | 3254.22 | 3254.22 | -27.960000 |
2010-01-07 | 3192.78 | 2658.308199 | 3192.78 | 3196.00 | 3192.78 | -58.220000 |
2010-01-08 | 3196.00 | 3196.000000 | 3196.00 | 3196.00 | 3196.00 | 0.000000 |
2010-01-11 | 3212.75 | 3212.750000 | 3212.75 | 3212.75 | 3212.75 | 16.750000 |
# stagger the series with shift to compute it by hand
# sh['pct_change'] = (sh['Change'] - sh['Change'].shift(1)) / sh['Change'].shift(1)
# or use the pct_change method
sh['pct_change'] = sh.Change.pct_change()
sh.iloc[5:9]
Date | Open | High | Low | Close | Adj Close | Change | pct_change |
---|---|---|---|---|---|---|---|
2010-01-12 | 3273.97 | 3273.97 | 3273.97 | 3273.97 | 3273.97 | 61.22 | 2.654925 |
2010-01-13 | 3172.66 | 3172.66 | 3172.66 | 3172.66 | 3172.66 | -101.31 | -2.654851 |
2010-01-14 | 3215.55 | 3215.55 | 3215.55 | 3215.55 | 3215.55 | 42.89 | -1.423354 |
2010-01-15 | 3224.15 | 3224.15 | 3224.15 | 3224.15 | 3224.15 | 8.60 | -0.799487 |
Use applymap to format the data, keeping two decimal places everywhere. applymap applies a function element-wise.
sh.head(2)
Date | Open | High | Low | Close | Adj Close | Change | pct_change |
---|---|---|---|---|---|---|---|
2010-01-05 | 3282.18 | 2658.308199 | 3282.18 | 3282.18 | 3282.18 | -0.293509 | NaN |
2010-01-06 | 3254.22 | 3254.220000 | 3282.18 | 3254.22 | 3254.22 | -27.960000 | 94.261134 |
fmt = lambda x: '%.2f' % x # renamed so it does not shadow the built-in format
sh = sh.applymap(fmt)
sh.head(2)
Date | Open | High | Low | Close | Adj Close | Change | pct_change |
---|---|---|---|---|---|---|---|
2010-01-05 | 3282.18 | 2658.31 | 3282.18 | 3282.18 | 3282.18 | -0.29 | nan |
2010-01-06 | 3254.22 | 3254.22 | 3282.18 | 3254.22 | 3254.22 | -27.96 | 94.26 |
sh = sh.applymap(float) # convert the values back to float
sh.iloc[1] # .ix is deprecated; use .iloc for positional access
Open 3254.22
High 3254.22
Low 3282.18
Close 3254.22
Adj Close 3254.22
Change -27.96
pct_change 94.26
Name: 2010-01-06 00:00:00, dtype: float64
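The format-then-convert-back round trip on a made-up frame; the string stage deliberately loses precision, which is the point of the formatting:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.2345, 2.0]})

fmt = lambda v: "%.2f" % v      # element-wise: float -> 2-decimal string
as_text = df.applymap(fmt)
back = as_text.applymap(float)  # and back to float again
```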
Use apply to find each column's extreme values and range.
def f(x):
    return pd.Series([x.min(),x.max(),x.max()-x.min()],index=['min','max','range'])
sh.apply(f)
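Because the function returns a labelled Series, apply() stacks the results into a DataFrame with one column of stats per input column. A made-up frame shows the shape:

```python
import pandas as pd

def col_stats(x):
    # a labelled Series per column -> apply() assembles a stats DataFrame
    return pd.Series([x.min(), x.max(), x.max() - x.min()],
                     index=["min", "max", "range"])

toy = pd.DataFrame({"a": [1.0, 4.0], "b": [10.0, 2.0]})
stats = toy.apply(col_stats)
```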
# group by month
sh['group_index'] = sh.index.map(lambda x: 100*x.year + x.month)
pd.concat([sh.head(2), sh.tail(2)]) # inspect group_index, as expected; DataFrame.append has been removed from pandas
Date | Open | High | Low | Close | Adj Close | Change | pct_change | group_index |
---|---|---|---|---|---|---|---|---|
2010-01-05 | 3282.18 | 2658.31 | 3282.18 | 3282.18 | 3282.18 | -0.29 | NaN | 201001 |
2010-01-06 | 3254.22 | 3254.22 | 3282.18 | 3254.22 | 3254.22 | -27.96 | 94.26 | 201001 |
2016-05-19 | 2806.91 | 2806.91 | 2806.91 | 2806.91 | 2806.91 | -0.60 | -0.98 | 201605 |
2016-05-20 | 2825.48 | 2825.48 | 2825.48 | 2825.48 | 2825.48 | 18.57 | -31.95 | 201605 |
sh.groupby('group_index') # returns a GroupBy object
<pandas.core.groupby.DataFrameGroupBy object at 0x10f7a9110>
# aggregate different columns with different functions
def peak_to_peak(arr):
    return arr.max() - arr.min()
grouped = sh.groupby('group_index').agg({
    'Open' : peak_to_peak,
    'High' : 'max',
    'Low' : 'min',
    'Change' : 'mean'
})
pd.concat([grouped.head(3), grouped.tail(3)])
group_index | High | Open | Low | Change |
---|---|---|---|---|
201001 | 3273.97 | 295.57 | 2986.61 | -15.430526 |
201002 | 3060.62 | 125.91 | 2934.71 | 3.132500 |
201003 | 3128.47 | 151.53 | 2976.94 | 2.485217 |
201603 | 3018.80 | 285.63 | 2733.17 | 13.736522 |
201604 | 3082.36 | 144.04 | 2938.32 | -3.452632 |
201605 | 2997.84 | 190.93 | 2806.91 | -8.060000 |
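The dict-based agg pattern on made-up data, with a custom function for one column and a builtin name for another:

```python
import pandas as pd

def peak_to_peak(arr):
    return arr.max() - arr.min()

df = pd.DataFrame({"group_index": [201001, 201001, 201002],
                   "Open": [3.0, 5.0, 7.0],
                   "Change": [1.0, -1.0, 2.0]})

g = df.groupby("group_index").agg({"Open": peak_to_peak, "Change": "mean"})
```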
sh.index # a DatetimeIndex offers many convenient operations
DatetimeIndex(['2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
'2010-01-11', '2010-01-12', '2010-01-13', '2010-01-14',
'2010-01-15', '2010-01-18',
...
'2016-05-09', '2016-05-10', '2016-05-11', '2016-05-12',
'2016-05-13', '2016-05-16', '2016-05-17', '2016-05-18',
'2016-05-19', '2016-05-20'],
dtype='datetime64[ns]', name=u'Date', length=1557, freq=None)
grouped_1 = sh.groupby(pd.Grouper(freq="M")).agg({  # pd.TimeGrouper was removed; pd.Grouper replaces it
    'Open' : peak_to_peak,
    'High' : 'max',
    'Low' : 'min',
    'Change' : 'mean'
})
pd.concat([grouped_1.head(3), grouped_1.tail(3)])
Date | High | Open | Low | Change |
---|---|---|---|---|
2010-01-31 | 3273.97 | 295.57 | 2986.61 | -15.430526 |
2010-02-28 | 3060.62 | 125.91 | 2934.71 | 3.132500 |
2010-03-31 | 3128.47 | 151.53 | 2976.94 | 2.485217 |
2016-03-31 | 3018.80 | 285.63 | 2733.17 | 13.736522 |
2016-04-30 | 3082.36 | 144.04 | 2938.32 | -3.452632 |
2016-05-31 | 2997.84 | 190.93 | 2806.91 | -8.060000 |
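pd.TimeGrouper has since been removed from pandas; pd.Grouper(freq=...) replaced it, and grouping on index.to_period('M') gives the same monthly buckets in a version-robust way. A sketch on made-up data:

```python
import pandas as pd

idx = pd.to_datetime(["2010-01-04", "2010-01-05", "2010-02-01"])
df = pd.DataFrame({"Change": [1.0, 3.0, 10.0]}, index=idx)

# one bucket per calendar month, like TimeGrouper("M") used to produce
monthly = df.groupby(df.index.to_period("M"))["Change"].mean()
```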
You can downsample as well as upsample.
# the resample('M', how='mean') syntax is no longer supported; use .mean() or .apply() instead
# a dict can be passed to apply
grouped_2 = sh[['Open','High','Low','Change']].resample('M').apply({
    'Open' : peak_to_peak,
    'High' : 'max',
    'Low' : 'min',
    'Change' : 'mean'
})
pd.concat([grouped_2.head(3), grouped_2.tail(3)])
Date | High | Open | Low | Change |
---|---|---|---|---|
2010-01-31 | 3273.97 | 295.57 | 2986.61 | -15.430526 |
2010-02-28 | 3060.62 | 125.91 | 2934.71 | 3.132500 |
2010-03-31 | 3128.47 | 151.53 | 2976.94 | 2.485217 |
2016-03-31 | 3018.80 | 285.63 | 2733.17 | 13.736522 |
2016-04-30 | 3082.36 | 144.04 | 2938.32 | -3.452632 |
2016-05-31 | 2997.84 | 190.93 | 2806.91 | -8.060000 |
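Downsampling and upsampling in one made-up sketch, using a daily index so the frequency strings are uncontroversial:

```python
import pandas as pd

idx = pd.date_range("2010-01-01", periods=4, freq="D")
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

down = s.resample("2D").mean()   # downsample: average every two days
up = down.resample("D").ffill()  # upsample back, forward-filling the gaps
```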
sh[['Open']].plot();
plt.show()