Logistic regression is a classic linear classification method, also known as log-odds regression; it belongs to the family of log-linear models.
Linear regression fits the data directly; by introducing a sigmoid function, we can build a classifier on top of the linear regression model.
The sigmoid function is defined as follows:
$$
y = \frac{1}{1 + e^{-z}}
$$
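As a quick illustration, here is a minimal NumPy sketch of the sigmoid (the function name and the overflow-avoiding branch are my own, not from the text):

```python
import numpy as np

def sigmoid(z):
    """y = 1 / (1 + exp(-z)), computed stably for either sign of z."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    exp_z = np.exp(z[~pos])          # z < 0 on this branch, so exp(z) <= 1
    out[~pos] = exp_z / (1.0 + exp_z)
    return out
```

Splitting on the sign of $z$ avoids overflow in `exp` for large $|z|$.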
Taking a binary classification task as an example, with $y \in \{0, 1\}$, we define the binomial logistic regression model as the following conditional probability distribution:
$$
P(Y=1|x) = \frac{\exp(w\cdot x + b)}{1 + \exp(w\cdot x + b)}, \qquad P(Y=0|x) = \frac{1}{1 + \exp(w\cdot x + b)}
$$
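Since $\exp(z)/(1+\exp(z)) = 1/(1+\exp(-z))$, this distribution maps directly onto the sigmoid above; a sketch (`predict_proba`, `w`, `b` are illustrative names):

```python
def predict_proba(X, w, b):
    """Return P(Y=1|x) for each row x of X."""
    # exp(w.x + b) / (1 + exp(w.x + b)) == sigmoid(w.x + b)
    return sigmoid(X @ w + b)
```

$P(Y=0|x)$ is then simply `1 - predict_proba(X, w, b)`.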
The odds of an event are the ratio of the probability that it occurs to the probability that it does not. If an event occurs with probability $p$, its odds are $\frac{p}{1-p}$, and its log-odds are:
$$
\log \frac{p}{1-p}
$$
For the logistic regression model,
$$
\log \frac{P(Y=1|x)}{1-P(Y=1|x)} = w\cdot x + b
$$
In other words, the log-odds of the output $Y=1$ is a linear function of the input $x$.
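This linearity is easy to verify numerically with the functions above (the numbers here are made up purely for the check):

```python
x = np.array([0.5, -1.2])
w = np.array([2.0, 0.3])
b = 0.1

p = sigmoid(w @ x + b)                  # P(Y=1|x)
log_odds = np.log(p / (1 - p))          # logit of the model output
assert np.isclose(log_odds, w @ x + b)  # recovers the linear function
```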
Given a training dataset, we estimate the model parameters by maximum likelihood. The likelihood function is:
$$
\prod_{i=1}^N P(y_i=1|x_i)^{y_i}\,\big(1-P(y_i=1|x_i)\big)^{1-y_i}
$$
The log-likelihood function is:
$$
\begin{aligned}
L(w,b) &= \sum_{i=1}^N y_i\log P(y_i=1|x_i) + (1-y_i) \log \big(1- P(y_i=1|x_i)\big) \\
&= \sum_{i=1}^N y_i\log \frac{P(y_i=1|x_i)}{1-P(y_i=1|x_i)} + \log \big(1-P(y_i=1|x_i)\big) \\
&= \sum_{i=1}^N y_i(w\cdot x_i + b) - \log \big(1 + \exp(w\cdot x_i + b)\big)
\end{aligned}
$$
Maximizing $L(w,b)$ then yields the estimates of $w$ and $b$; in practice, we usually recast this as the equivalent minimization problem:
$$
L(w,b) = -\sum_{i=1}^N \Big[ y_i(w\cdot x_i + b) - \log \big(1 + \exp(w\cdot x_i + b)\big) \Big]
$$
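Written in code, this negative log-likelihood is nearly a one-liner; a sketch, assuming `X` holds the samples $x_i$ as rows and `y` the 0/1 labels (`np.logaddexp(0, z)` computes $\log(1+e^z)$ without overflow):

```python
def neg_log_likelihood(w, b, X, y):
    """-sum_i [ y_i (w.x_i + b) - log(1 + exp(w.x_i + b)) ]."""
    z = X @ w + b
    return -np.sum(y * z - np.logaddexp(0.0, z))
```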
The methods commonly used to solve it are gradient descent and Newton's method.
Below we write $\theta$ for the parameters.
The gradient descent update is:
$$
\theta \gets \theta - \alpha \frac{\partial L(\theta)}{\partial \theta}
$$
Newton's method iterates as:
$$
\theta^{t+1} = \theta^{t} - \left(\frac{\partial^2 L(\theta)}{\partial \theta^2}\right)^{-1} \frac{\partial L(\theta)}{\partial \theta}
$$
In vector form, this is:
$$
\theta^{t+1} = \theta^t - \left(\frac{\partial^2 L(\theta)}{\partial \theta\,\partial\theta^T}\right)^{-1}\frac{\partial L(\theta)}{\partial \theta}
$$
We now derive the first- and second-order derivatives with respect to $\theta$.
Consider the cost function in the following per-sample form,
$$
L(\theta) = -\big[ y\log \hat{y} + (1-y)\log(1-\hat{y}) \big]
$$
where $\hat{y} = \frac{1}{1 + \exp(-z)}$ and $z = \theta^T x$.
By the chain rule,
$$
\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial z}\frac{\partial z}{\partial \theta}
$$
$$
\frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z}
$$
We have
$$
\frac{\partial L}{\partial \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})}
$$
$$
\begin{aligned}
\frac{\partial \hat{y}}{\partial z} &= \Big(\frac{1}{1 + \exp(-z)}\Big)' \\
&= \frac{1}{1+\exp(-z)} - \frac{1}{\big(1 + \exp(-z)\big)^2} \\
&= \hat{y}(1-\hat{y})
\end{aligned}
$$
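A quick finite-difference check of this identity (the test point and step size are arbitrary choices of mine):

```python
z0, eps = 0.7, 1e-6
numeric = (sigmoid(z0 + eps) - sigmoid(z0 - eps)) / (2 * eps)
analytic = sigmoid(z0) * (1 - sigmoid(z0))
assert np.isclose(numeric, analytic)
```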
Therefore, we obtain
$$
\frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z} = \hat{y} - y
$$
Then
$$
\begin{aligned}
\frac{\partial L}{\partial \theta} &= \frac{\partial L}{\partial z}\frac{\partial z}{\partial \theta} \\
&= (\hat{y} - y)\frac{\partial z}{\partial \theta} \\
&= (\hat{y} - y)x
\end{aligned}
$$
Here $x$ and $\theta$ are in the augmented representation (the bias $b$ absorbed as an extra component of $\theta$, with a matching constant-1 feature in $x$), so we obtain the update rule
$$
\begin{aligned}
\theta &\gets \theta - \alpha\frac{\partial L}{\partial \theta} \\
&= \theta - \alpha(\hat{y} - y)x
\end{aligned}
$$
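Putting the pieces together, a minimal batch gradient-descent trainer might look like this (a sketch: `fit_gd`, the learning rate, and the iteration count are my choices, and the gradient is averaged over samples so the step size does not depend on $N$):

```python
def fit_gd(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the negative log-likelihood."""
    # Augmented representation: append a constant-1 feature so the
    # last entry of theta plays the role of the bias b.
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    theta = np.zeros(X_aug.shape[1])
    for _ in range(n_iters):
        y_hat = sigmoid(X_aug @ theta)
        grad = X_aug.T @ (y_hat - y) / len(y)  # mean of (y_hat_i - y_i) x_i
        theta -= lr * grad
    return theta
```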
For the second derivative, differentiating $(\hat{y}-y)x$ once more with respect to $\theta$ (using $\partial \hat{y}/\partial \theta = \hat{y}(1-\hat{y})x$) gives, in the same way,
$$
\frac{\partial^2 L(\theta)}{\partial \theta\,\partial\theta^T} = \hat{y}(1-\hat{y})\,xx^T
$$
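With gradient and Hessian in hand, Newton's method can be sketched the same way (both summed over samples; the small ridge term `1e-8` is my addition for numerical safety, not part of the derivation):

```python
def fit_newton(X, y, n_iters=20):
    """Newton updates: theta <- theta - H^{-1} grad."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    theta = np.zeros(X_aug.shape[1])
    for _ in range(n_iters):
        y_hat = sigmoid(X_aug @ theta)
        grad = X_aug.T @ (y_hat - y)
        # Hessian: sum_i y_hat_i (1 - y_hat_i) x_i x_i^T
        weights = y_hat * (1 - y_hat)
        H = X_aug.T @ (X_aug * weights[:, None])
        H += 1e-8 * np.eye(H.shape[0])     # keep H invertible
        theta -= np.linalg.solve(H, grad)
    return theta
```

Newton's method typically converges in far fewer iterations than gradient descent here, at the cost of solving a linear system each step.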