Factor Evaluation: Double Sorting

量化小白

Published 2020-02-24 05:49:45


Included in the column: 量化小白上分记


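For factor evaluation, earlier posts covered the regression method and the layering (quantile portfolio) method for single-factor tests, as well as Fama-MacBeth regression for multi-factor evaluation (links at the bottom). This post explains the principle of double sorting (also called bivariate sorting) in factor analysis and gives a code implementation.

Double sorting assesses whether using two factors together beats using either one alone. In other words, it analyzes how much the information in the two factors overlaps and whether one factor adds information beyond the other.

The principle is the same as the construction of SMB and HML in the Fama-French three-factor model. Specifically, for two factors X and Y, stocks are sorted into groups on both X and Y (the double sort), portfolios are formed from the resulting groups, and the performance of those portfolios is analyzed.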

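The detail that needs particular care in practice is whether to sort independently or conditionally. An independent sort ranks stocks on X and on Y separately and takes the intersections of the two groupings as the final portfolios. A conditional sort first sorts stocks into layers on one factor X, then sorts on Y within each X layer to obtain the final portfolios.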
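To make the two procedures concrete, here is a minimal sketch on one synthetic cross-section of 100 stocks. The helper `quantile_group` and the column names `gx`, `gy_indep`, `gy_cond` are illustrative assumptions, not names from the post's code; the bucket rule (rank divided by group size, rounded up) mirrors the one used later in the post.

```python
import numpy as np
import pandas as pd

def quantile_group(values: pd.Series, groups: int) -> pd.Series:
    """Assign each value to one of `groups` rank-based buckets (1 = smallest)."""
    return np.ceil(values.rank(method="first") / (len(values) / groups)).astype(int)

# synthetic factor values for a single date (assumption: 100 stocks, unrelated factors)
rng = np.random.default_rng(0)
df = pd.DataFrame({"X": rng.normal(size=100), "Y": rng.normal(size=100)})

groups = 5
df["gx"] = quantile_group(df["X"], groups)

# independent sort: Y is ranked over the whole cross-section
df["gy_indep"] = quantile_group(df["Y"], groups)

# conditional sort: Y is ranked again inside each X bucket
df["gy_cond"] = df.groupby("gx")["Y"].transform(lambda s: quantile_group(s, groups))

# number of stocks in each of the 5x5 portfolios under the two schemes
indep_counts = pd.crosstab(df["gx"], df["gy_indep"])
cond_counts = pd.crosstab(df["gx"], df["gy_cond"])
print(indep_counts)
print(cond_counts)
```

Under the independent sort the cell sizes vary with the sample, while the conditional sort yields four stocks in every cell by construction.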

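The difference is that an independent sort ignores the correlation between X and Y. If the two are highly correlated, the two groupings are nearly identical, so stocks concentrate in the portfolios on the diagonal and the off-diagonal portfolios contain very few stocks. With such imbalance, analyzing portfolio returns is not very meaningful. For the same reason, the independent sort can be used to gauge how strongly X and Y are correlated.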
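The concentration on the diagonal can be illustrated with simulated data. The sketch below is an assumption-laden toy example (standard normal factors, 2000 stocks, one date; `cell_counts` is a hypothetical helper), comparing an unrelated factor pair with a highly correlated one.

```python
import numpy as np
import pandas as pd

def quantile_group(values: pd.Series, groups: int) -> pd.Series:
    """Rank-based buckets 1..groups (1 = smallest value)."""
    return np.ceil(values.rank(method="first") / (len(values) / groups)).astype(int)

def cell_counts(x: pd.Series, y: pd.Series, groups: int = 5) -> pd.DataFrame:
    """Stock counts per cell under an independent double sort."""
    return pd.crosstab(quantile_group(x, groups), quantile_group(y, groups))

rng = np.random.default_rng(42)
n = 2000
x = pd.Series(rng.normal(size=n))
y_unrelated = pd.Series(rng.normal(size=n))
y_correlated = x + 0.2 * pd.Series(rng.normal(size=n))  # almost a copy of x

counts_unrelated = cell_counts(x, y_unrelated)
counts_correlated = cell_counts(x, y_correlated)

# share of stocks sitting on the diagonal (same bucket for both factors)
share_unrelated = np.trace(counts_unrelated.values) / n
share_correlated = np.trace(counts_correlated.values) / n
print(share_unrelated, share_correlated)
print(counts_correlated)
```

With unrelated factors the diagonal share should be close to 1/5; with the correlated pair most stocks land on the diagonal and the far corners are nearly empty.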

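A conditional sort requires choosing whether to sort on X or on Y first. It examines how one factor performs after controlling for the other, and so shows whether one factor adds information beyond the other. If it does, the second factor remains monotonic within every layer of the control factor, with a clear excess return. If it adds little, then after controlling for the first factor the layers of the second may show hardly any difference. In addition, under a conditional sort every portfolio holds the same number of stocks (up to rounding), so the imbalance problem does not arise.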
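One possible way to read a conditional sort is to look at the top-minus-bottom return spread of the second factor inside each layer of the control factor. The following sketch uses simulated returns; the function `conditional_spread`, the factor names `X`, `Y`, `W` and the return model are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def quantile_group(values: pd.Series, groups: int) -> pd.Series:
    """Rank-based buckets 1..groups (1 = smallest value)."""
    return np.ceil(values.rank(method="first") / (len(values) / groups)).astype(int)

def conditional_spread(df: pd.DataFrame, control: str, target: str, groups: int = 5) -> pd.Series:
    """Sort on `control` first, then on `target` inside each control bucket.

    Returns, per control bucket, the mean return of the top target bucket
    minus that of the bottom target bucket.
    """
    g1 = quantile_group(df[control], groups)
    g2 = df[target].groupby(g1).transform(lambda s: quantile_group(s, groups))
    cell_ret = df["ret"].groupby([g1, g2]).mean().unstack()
    return cell_ret[groups] - cell_ret[1]

rng = np.random.default_rng(7)
n = 5000
df = pd.DataFrame({
    "X": rng.normal(size=n),
    "Y": rng.normal(size=n),  # assumed to carry return information beyond X
    "W": rng.normal(size=n),  # assumed to be pure noise
})
df["ret"] = -0.01 * df["X"] - 0.01 * df["Y"] + 0.02 * rng.normal(size=n)

spread_y = conditional_spread(df, "X", "Y")
spread_w = conditional_spread(df, "X", "W")
print(spread_y.round(4))
print(spread_w.round(4))
```

In this toy setup the spread for `Y` is clearly negative in every `X` layer, while the spread for `W` stays near zero, which is the pattern the paragraph above describes for a factor with and without incremental information.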

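Both kinds of sort are useful. A code example follows.

The example uses the market-capitalization factor and the price-to-book (PB) factor in the A-share market, with data from 2010 to 2018, and double-sorts on these two factors. The data and code can be obtained by replying “双重排序” in the backend.

First, run single-factor tests on the two factors, using the following functions: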

import os
os.chdir('E:\\quant\\doublesort\\')

import numpy as np
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import seaborn as sns

def getRet(price,freq ='d',if_shift = True):
    """
    Per-stock returns at daily ('d'), weekly ('w') or monthly ('m') frequency.
    If if_shift is True, each date carries the NEXT period's return, so that
    it lines up with the factor value observed on that date.
    """
    price = price.copy()
   
    if freq == 'w':
        # keep the last observation of each ISO week
        price['weeks'] = price['tradedate'].apply(lambda x:x.isocalendar()[0]*100 + x.isocalendar()[1])
        ret = price.groupby(['weeks','stockcode']).last().reset_index()
        del ret['weeks']
    
    elif freq =='m':
        # keep the last observation of each calendar month
        price['ym'] = price.tradedate.apply(lambda x:x.year*100 + x.month)
        ret = price.groupby(['ym','stockcode']).last().reset_index()
        del ret['ym']
    
    else:
        # daily frequency: use the prices as they are
        ret = price
    
    ret = ret[['tradedate','stockcode','price']]
    if if_shift:
        ret = ret.groupby('stockcode').apply(lambda x:x.set_index('tradedate').price.pct_change(1).shift(-1))
    else:
        ret = ret.groupby('stockcode').apply(lambda x:x.set_index('tradedate').price.pct_change(1))
    
    ret = ret.reset_index()
    ret = ret.rename(columns = {ret.columns[2]:'ret'})
    return ret


def getdate(x):
    if type(x) == str:
        return pd.Timestamp(x).date()
    else:
        return datetime.date(int(str(x)[:4]),int(str(x)[4:6]),int(str(x)[6:]))


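def getICSeries(factors,ret,method):
    """
    Cross-sectional IC per date: correlation of each factor with the return
    """
    # method = 'spearman';factors = fall.copy();

    fall = pd.merge(factors,ret,left_on = ['tradedate','stockcode'],right_on = ['tradedate','stockcode'])
    # correlate only the factor and return columns, not the date/code keys
    fcols = [c for c in fall.columns if c not in ('tradedate','stockcode')]
    icall = fall.groupby('tradedate')[fcols].apply(lambda x:x.corr(method = method)['ret']).reset_index()
    # drop the correlation of ret with itself
    icall = icall.drop(['ret'],axis = 1).set_index('tradedate')

    return icall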


def ifst(x):
    if pd.isnull(x.entry_dt):
        return 0
    elif (x.tradedate < x.entry_dt) |(x.tradedate > x.remove_dt):
        return 0
    else:
        return 1
 

def GroupTestAllFactors(factors,ret,groups):
    # factors = f1_norm[['tradedate','stockcode','pb']].copy();ret = ret.copy();groups = 10
    """
    Test several factors in one pass
    """
    fnames = factors.columns
    fall = pd.merge(factors,ret,left_on = ['stockcode','tradedate'],right_on = ['stockcode','tradedate'])
    Groupret = []
 
    for f in fnames: # f = fnames[2]
        if ((f != 'stockcode')&(f != 'tradedate')):
            fuse = fall[['stockcode','tradedate','ret',f]].copy()
            # fuse['groups'] = pd.qcut(fuse[f],q = groups,labels = False)
            # rank-based buckets within each date, 1 = smallest factor value
            fuse['groups'] = fuse[f].groupby(fuse.tradedate).transform(lambda x:np.ceil(x.rank()/(len(x)/groups)))
            # equal-weighted mean return of each bucket on each date
            result = fuse.groupby(['tradedate','groups'])['ret'].mean()
            result = result.unstack().reset_index()
            result.insert(0,'factor',f)
            Groupret.append(result)
 

    Groupret = pd.concat(Groupret,axis = 0).reset_index(drop = True)
 
    # compound each bucket's returns into a net-value curve, factor by factor
    Groupnav = Groupret.iloc[:,2:].groupby(Groupret.factor,group_keys = False).apply(lambda x:(1 + x).cumprod())
    Groupnav = pd.concat([Groupret[['tradedate','factor']],Groupnav],axis = 1)
 
    return Groupnav


def plotnav(Groupnav):
    """
    Plot the output of the group test
    """
    for f in Groupnav.factor.unique(): # f = Groupnav.factor.unique()[0]
        fnav = Groupnav.loc[Groupnav.factor ==f,:].set_index('tradedate').iloc[:,1:]
        groups = fnav.shape[1]
        lwd = [2]*groups
        ls = ['-']*groups
        
        plt.figure(figsize = (10,5))
        for i in range(groups):
            plt.plot(fnav.iloc[:,i],linewidth = lwd[i],linestyle = ls[i])
        plt.legend(list(range(groups)))
        plt.title('factor group test: ' + f ,fontsize = 20)

Read the data and compute the IC

mktcap = pd.read_excel('mktcap.xlsx')
pb = pd.read_excel('PB_monthly.xlsx')
price = pd.read_excel('price.xlsx')
ST = pd.read_excel('ST.xlsx')


startdate = datetime.date(2010,12,31)
enddate = datetime.date(2018,12,31)

pb = pb.fillna(0)
price = price.loc[(price.tradedate >= startdate) & (price.tradedate <= enddate)].reset_index(drop = True)
pb = pb.loc[(pb.tradedate >= startdate) & (pb.tradedate <= enddate)].reset_index(drop = True)
mktcap = mktcap.loc[(mktcap.tradedate >= startdate) & (mktcap.tradedate <= enddate)].reset_index(drop = True)
  

ret_m = getRet(price,freq ='m',if_shift = True)


fall = pd.merge(mktcap,pb,left_on = ['tradedate','stockcode'],right_on = ['tradedate','stockcode'])
 



# Exclude ST stocks
fall = pd.merge(fall,ST,left_on = 'stockcode',right_on = 'stockcode',how = 'left')
fall['if_st'] = fall.apply(ifst,axis = 1)
fall = fall.loc[fall.if_st == 0].reset_index(drop = True)
fall = fall.drop(['if_st','entry_dt','remove_dt'],axis = 1)



# Compute the IC
ics = getICSeries(fall,ret_m,'spearman')
ics.cumsum().plot(title = 'cum IC',figsize = (8,4))
ics.mean()
ics.mean()/ics.std()*np.sqrt(12)

Cumulative IC

The IC and annualized ICIR are as follows

The layered test, with five layers, is as follows
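As a small illustration of what these two numbers measure, the sketch below computes a rank IC per month and the annualized ICIR on simulated data. The panel, its column names and the return model are assumptions for illustration; the Spearman correlation is computed as the Pearson correlation of ranks, and the annualization uses the same square root of 12 as the code above.

```python
import numpy as np
import pandas as pd

# simulated panel (assumption): 24 months x 200 stocks, factor negatively related to next-month return
rng = np.random.default_rng(1)
n_months, n_stocks = 24, 200
panel = pd.DataFrame({
    "month": np.repeat(np.arange(n_months), n_stocks),
    "factor": rng.normal(size=n_months * n_stocks),
})
panel["ret"] = -0.02 * panel["factor"] + 0.05 * rng.normal(size=len(panel))

def rank_ic(group: pd.DataFrame) -> float:
    """Spearman rank IC for one cross-section: Pearson correlation of the ranks."""
    return group["factor"].rank().corr(group["ret"].rank())

# one IC value per month
ic = panel.groupby("month")[["factor", "ret"]].apply(rank_ic)

ic_mean = ic.mean()
icir_annual = ic.mean() / ic.std() * np.sqrt(12)  # monthly data, hence sqrt(12)
print(ic_mean, icir_annual)
```

A negative mean IC here reflects the simulated negative relation between the factor and the next-period return.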

# Layered (group) test
groups = 5
Groupnav = GroupTestAllFactors(fall,ret_m,groups)
plotnav(Groupnav)

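Next, double-sort on the two factors to form 5x5 portfolios. Start with the independent sort. The share of stocks falling in each group is shown below; the horizontal axis is the PB group and the vertical axis is the market-cap group, with factor values increasing from 1 to 5.

The shares differ little across groups, which indicates that the two factors are not highly correlated. The net-value curves of the 25 portfolios are as follows

The curves are hard to read, so compute each portfolio's average monthly return and plot it as a 5x5 heatmap:

Along both the vertical and the horizontal axis, the average portfolio return falls as the group number of a factor rises, which indicates that the factors are independent and that each adds information. For a closer look, volatility and the Sharpe ratio can also be examined.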
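For the volatility and Sharpe ratio mentioned above, a minimal sketch could look like the following. It assumes simple monthly returns, a zero risk-free rate, and made-up numbers for two portfolios labeled in the post's "mktcap group-pb group" style; `portfolio_stats` is a hypothetical helper, not part of the post's code.

```python
import numpy as np
import pandas as pd

def portfolio_stats(monthly_ret: pd.DataFrame, periods_per_year: int = 12) -> pd.DataFrame:
    """Annualized mean return, volatility and Sharpe ratio for each column.

    Assumes simple periodic returns and a zero risk-free rate.
    """
    ann_ret = monthly_ret.mean() * periods_per_year
    ann_vol = monthly_ret.std() * np.sqrt(periods_per_year)
    return pd.DataFrame({"ann_ret": ann_ret, "ann_vol": ann_vol, "sharpe": ann_ret / ann_vol})

# made-up monthly returns for two of the 25 portfolios (illustration only)
monthly_ret = pd.DataFrame({
    "1-1": [0.03, 0.01, 0.04, -0.01, 0.02, 0.03],
    "5-5": [0.01, -0.02, 0.02, -0.01, 0.00, 0.01],
})

stats = portfolio_stats(monthly_ret)
print(stats.round(3))
```

The same function could be applied to the table of monthly portfolio returns built in the code below (all columns except `tradedate`), and the resulting Sharpe ratios pivoted into a 5x5 heatmap in the same way as the mean returns.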

# Independent sort

f1 = 'mktcap'
f2 = 'pb'


f = fall[['tradedate','stockcode',f1,f2]].copy()

# rank each factor over the whole cross-section of every date
f[f1] = f[f.columns[2]].groupby(f.tradedate).transform(lambda x:np.ceil(x.rank()/(len(x)/groups)))
f[f2] = f[f.columns[3]].groupby(f.tradedate).transform(lambda x:np.ceil(x.rank()/(len(x)/groups)))


res = f.groupby([f1,f2]).count()
res = res.iloc[:,1].reset_index()
res = res.pivot(index = f1,columns = f2,values = 'stockcode')
res = res/f.shape[0]


# Roughly independent: stocks are spread evenly across the groups
plt.figure(figsize = (8,8))
sns.heatmap(res,cmap = 'YlGnBu', annot=True,square = True)
plt.show()


f = pd.merge(f,ret_m,left_on = ['tradedate','stockcode'],right_on = ['tradedate','stockcode'])

f['groups'] = f.apply(lambda x:str(int(x[f1])) + '-' + str(int(x[f2])),axis = 1)

res = f.groupby(['tradedate','groups']).apply(lambda x:x.ret.mean())
res = res.unstack().reset_index()

res.set_index('tradedate').cumsum().plot(figsize = (8,6))


yret = res.iloc[:,1:].mean()
yret = yret.reset_index()
yret.columns = ['groups','ret']
yret[f1] = yret.groups.apply(lambda x:x[0])
yret[f2] = yret.groups.apply(lambda x:x[2])


plt.figure(figsize = (8,8))
sns.heatmap(yret.pivot(index = f1,columns = f2,values = 'ret'),cmap = 'YlGnBu',annot = True,square = True)
plt.show()

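Next, run the conditional sort in each of the two orders and repeat the steps above.

PB sorted within market-cap groups

Market cap sorted within PB groups

The factor layers perform well, which indicates information gain. The code is as follows: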

# Conditional double sort: group by f1 first, then by f2 within each f1 group


f1 = 'mktcap'
f2 = 'pb'


f = fall[['tradedate','stockcode',f1,f2]].copy()

# f1 is ranked per date; f2 is ranked per date inside each f1 group
f[f1] = f[f.columns[2]].groupby(f.tradedate).transform(lambda x:np.ceil(x.rank()/(len(x)/groups)))
f[f2] = f[f.columns[3]].groupby([f.tradedate,f[f1]]).transform(lambda x:np.ceil(x.rank()/(len(x)/groups)))
f = pd.merge(f,ret_m,left_on = ['tradedate','stockcode'],right_on = ['tradedate','stockcode'])

f['groups'] = f.apply(lambda x:str(int(x[f1])) + '-' + str(int(x[f2])),axis = 1)

res = f.groupby(['tradedate','groups']).apply(lambda x:x.ret.mean())
res = res.unstack().reset_index()

res.iloc[:,1:].cumsum().plot(figsize = (8,6))


yret = res.iloc[:,1:].mean()
yret = yret.reset_index()
yret.columns = ['groups','ret']
yret[f1] = yret.groups.apply(lambda x:x[0])
yret[f2] = yret.groups.apply(lambda x:x[2])

plt.figure(figsize = (8,8))
sns.heatmap(yret.pivot(index = f1,columns = f2,values = 'ret'),cmap = 'YlGnBu',annot = True,square = True)
plt.show()

# Same conditional sort in the other order: group by pb first, then by mktcap
f1 = 'pb'
f2 = 'mktcap'

f = fall[['tradedate','stockcode',f1,f2]].copy()

f[f1] = f[f.columns[2]].groupby(f.tradedate).transform(lambda x:np.ceil(x.rank()/(len(x)/groups)))
f[f2] = f[f.columns[3]].groupby([f.tradedate,f[f1]]).transform(lambda x:np.ceil(x.rank()/(len(x)/groups)))
f = pd.merge(f,ret_m,left_on = ['tradedate','stockcode'],right_on = ['tradedate','stockcode'])

f['groups'] = f.apply(lambda x:str(int(x[f1])) + '-' + str(int(x[f2])),axis = 1)

res = f.groupby(['tradedate','groups']).apply(lambda x:x.ret.mean())
res = res.unstack().reset_index()

res.iloc[:,1:].cumsum().plot(figsize = (8,6))


yret = res.iloc[:,1:].mean()
yret = yret.reset_index()
yret.columns = ['groups','ret']
yret[f1] = yret.groups.apply(lambda x:x[0])
yret[f2] = yret.groups.apply(lambda x:x[2])

plt.figure(figsize = (8,8))
sns.heatmap(yret.pivot(index = f1,columns = f2,values = 'ret'),cmap = 'YlGnBu',annot = True,square = True)
plt.show()