In information theory and statistics, entropy is a measure of the uncertainty of a random variable.
Let X be a discrete random variable taking a finite number of values, with probability distribution:
P(X = x_i) = p_i, \quad i = 1, 2, \dots, n
Then the entropy of the random variable X is defined as:
H(X) = -\sum_{i=1}^{n} p_i \log p_i
When the random variable takes only two values (0 and 1), its distribution is also called the Bernoulli distribution, and its entropy is:
H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)
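As a quick sanity check on this formula: at p = 0.5 the two outcomes are equally likely and the uncertainty is largest,
H(0.5) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1 bit,
while H(p) falls to 0 as p approaches 0 or 1, where the outcome is essentially certain. This is the inverted-U shape traced out by the plot below.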
The following code plots the entropy of the Bernoulli distribution as a function of the probability p:
### H(p) curve of the 0-1 (Bernoulli) distribution ###
from math import log
import numpy as np
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
p = np.arange(0.01, 1, 0.01)
Hp = []
for pi in p:
    Hp.append(-pi*log(pi, 2) - (1-pi)*log(1-pi, 2))   # binary entropy in bits
plt.plot(p, Hp, 'r')
plt.xlabel('p')
plt.ylabel('H(p)')
plt.show()
Computing the information gain and the information gain ratio
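The functions below compute two standard quantities. The information gain of a feature A with respect to a data set D is
g(D, A) = H(D) - H(D \mid A),
i.e. the reduction in the entropy of the target once the feature is known, and the information gain ratio normalizes it by the entropy of the feature's own distribution, H_A(D):
g_R(D, A) = g(D, A) / H_A(D).
In the code, CalcEntropy computes an entropy, CalcHentropy the conditional entropy H(D | A), and InfoGain combines the two.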
### Functions to compute the information gain and the information gain ratio: InfoGain() ###
# Compute the entropy (in bits) of a discrete variable
def CalcEntropy(col):
    colP = pd.Series(col).value_counts(normalize=True)   # relative frequency of each value
    def Entr(p):
        # a zero-probability value contributes nothing (0 * log 0 is taken as 0)
        if p == 0:
            return 0
        return -p * log(p, 2)
    entropy = sum(map(Entr, colP))
    return entropy
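A quick check of CalcEntropy on made-up labels: two equally likely values should give exactly one bit of entropy.
### sanity check for CalcEntropy (illustrative values) ###
CalcEntropy(np.array(["heads", "tails", "heads", "tails"]))   # 1.0: two equally likely values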
# Compute the conditional entropy H(y | feature)
def CalcHentropy(feature, y):
    crossP = np.array(pd.crosstab(feature, y))      # contingency table of counts
    featP = crossP.sum(axis=1) / crossP.sum()       # marginal distribution of the feature values
    def entr(row):
        # entropy of y within one value of the feature; empty cells are skipped so 0 * log 0 = 0
        probs = row[row > 0] / row.sum()
        return -(probs * np.log2(probs)).sum()
    hentropy = list(map(entr, crossP))
    hentropy = np.dot(featP, hentropy)              # average weighted by the feature's marginal probabilities
    return hentropy
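Two analogous checks for CalcHentropy (labels again invented for illustration): a feature that determines y completely leaves no remaining uncertainty, while a feature that says nothing about y leaves all of it.
### sanity checks for CalcHentropy (illustrative values) ###
f = np.array(["a", "a", "b", "b"])
CalcHentropy(f, np.array(["yes", "yes", "no", "no"]))   # 0.0: the feature determines y exactly
CalcHentropy(f, np.array(["yes", "no", "yes", "no"]))   # 1.0: knowing the feature tells us nothing about y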
# Compute the information gain and the information gain ratio
def InfoGain(feature, y):
    feat_entr = CalcEntropy(feature)        # H_A(D): entropy of the feature itself
    y_entr = CalcEntropy(y)                 # H(D): entropy of the target
    H_entr = CalcHentropy(feature, y)       # H(D|A): conditional entropy of the target given the feature
    infogain = y_entr - H_entr              # g(D, A) = H(D) - H(D|A)
    infogainrate = infogain / feat_entr     # g_R(D, A) = g(D, A) / H_A(D)
    return infogain, infogainrate
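One more sketch before the real example (values invented for illustration): using a copy of y itself as the feature is the most informative case possible, so the gain equals H(y) and the gain ratio is 1.
### sanity check for InfoGain (illustrative values) ###
toy_y = np.array(["yes", "yes", "no", "no"])
InfoGain(toy_y, toy_y)   # (1.0, 1.0): gain = H(y) = 1 bit, gain ratio = 1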
### Example data ###
A1 = np.array([elem for elem in ["youth", "middle-aged", "senior"]
               for i in range(5)])
A2 = np.array(["no", "no", "yes", "yes", "no", "no", "no", "yes",
               "no", "no", "no", "no", "yes", "yes", "no"])
A3 = np.array(["no", "no", "no", "yes", "no", "no", "no", "yes",
               "yes", "yes", "yes", "yes", "no", "no", "no"])
A4 = np.array(["fair", "good", "good", "fair", "fair", "fair",
               "good", "good", "very good", "very good", "very good",
               "good", "good", "very good", "fair"])
Y = np.array(["no", "no", "yes", "yes", "no", "no", "no", "yes", "yes",
              "yes", "yes", "yes", "yes", "yes", "no"])
data = DataFrame({'A1': A1, 'A2': A2, 'A3': A3, 'A4': A4, 'Y': Y})
infogain, infogainrate = InfoGain(A1, Y)
print(infogain, infogainrate)
That is, the information gain of feature A1 is 0.083 and its information gain ratio is 0.052.
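As a small illustrative extension, the same function can be applied to every feature to see which one is most informative; with the data above, A3 should come out with the largest information gain.
### compare all four features (illustrative extension) ###
for name in ['A1', 'A2', 'A3', 'A4']:
    gain, gainrate = InfoGain(data[name], data['Y'])
    print(name, round(gain, 3), round(gainrate, 3))
# expected gains of roughly 0.083, 0.324, 0.420 and 0.363, so A3 is the most informative feature here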