When writing papers or blog posts about AI/machine learning, I often need LaTeX equations. Yet, as a seasoned "copy-paster", I searched the web and surprisingly could not find a ready-made resource.
So I have collected the equations I run into most often, focusing on NLP and some general metric functions. Feel free to take what you need, and corrections or additions are welcome if you find mistakes or omissions.
A collection of classical ML equations in LaTeX. Some are provided with simple notes and paper links. Hopefully this helps with writing papers and blogs.
Better viewed at https://blmoistawinde.github.io/ml_equations_latex/
encoder hidden state $h_t$ at time step $t$ and decoder hidden state $s_t$ at time step $t$:
h_t = RNN_{enc}(x_t, h_{t-1})
s_t = RNN_{dec}(y_t, s_{t-1})
$RNN_{enc}$ and $RNN_{dec}$ are usually either an LSTM or a GRU.
The attention weight $\alpha_{ij}$, of the $i$th decoder step over the $j$th encoder step, results in the context vector $c_i$:
c_i = \sum_{j=1}^{T_x} \alpha_{ij}h_j
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}
e_{ij} = a(s_{i-1}, h_j)
$a$ is a specific attention function, which can be one of the following:
Additive/concat attention (Paper: Neural Machine Translation by Jointly Learning to Align and Translate)
e_{ij} = v^T \tanh(W[s_{i-1}; h_j])
Dot-product attention (Paper: Effective Approaches to Attention-based Neural Machine Translation)
If $s_{i-1}$ and $h_j$ have the same dimensionality, use the dot product; otherwise, insert a weight matrix $W$ between them:
e_{ij} = s_{i-1}^T h_j
e_{ij} = s_{i-1}^T W h_j
Finally, the output $o_t$ is produced by:
s_t = \tanh(W[s_{t-1};y_t;c_t])
o_t = softmax(Vs_t)
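A minimal NumPy sketch of one decoder step with the additive/concat score above; the parameter names (`W_a`, `v_a`, `W_s`, `V_o`) and shapes are illustrative assumptions, not taken from the papers:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def decoder_step_with_attention(s_prev, y_prev, H, W_a, v_a, W_s, V_o):
    """One decoder step with additive/concat attention.

    s_prev: previous decoder state s_{i-1}, shape (d,)
    y_prev: current input embedding y_t,    shape (d_y,)
    H:      encoder states h_1..h_{T_x},    shape (T_x, d)
    """
    T_x = H.shape[0]
    # e_ij = v^T tanh(W [s_{i-1}; h_j]) for every encoder step j
    concat = np.concatenate([np.tile(s_prev, (T_x, 1)), H], axis=1)  # (T_x, 2d)
    e = np.tanh(concat @ W_a.T) @ v_a                                # (T_x,)
    alpha = softmax(e)                    # attention weights alpha_ij
    c = alpha @ H                         # context vector c_i
    s = np.tanh(W_s @ np.concatenate([s_prev, y_prev, c]))  # new state s_t
    o = softmax(V_o @ s)                  # output distribution o_t
    return s, o, alpha
```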
Paper: Attention Is All You Need
Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V
where $d_k$ is the dimension of the key vector $k$ and query vector $q$.
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O
head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)
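A single-head NumPy sketch of the scaled dot-product attention above, without masking or batching; adding the per-head projections $W^Q_i, W^K_i, W^V_i$ and the output projection $W^O$ would give multi-head attention:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # row-wise softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```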
Paper: Generative Adversarial Networks
\min_{G}\max_{D}\mathbb{E}_{x\sim p_{\text{data}}(x)}[\log{D(x)}] + \mathbb{E}_{z\sim p_{\text{generated}}(z)}[\log{(1 - D(G(z)))}]
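A hedged NumPy sketch that simply evaluates a Monte-Carlo estimate of the value function above; it assumes `D_real` holds $D(x)$ on data samples and `D_fake` holds $D(G(z))$ on generated samples, both as arrays of probabilities:

```python
import numpy as np

def gan_value(D_real, D_fake, eps=1e-12):
    """E[log D(x)] + E[log(1 - D(G(z)))], estimated from samples."""
    return np.mean(np.log(D_real + eps)) + np.mean(np.log(1.0 - D_fake + eps))
```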
Paper: Auto-Encoding Variational Bayes
To produce a latent variable $z$ such that $z \sim q_{\mu, \sigma}(z) = \mathcal{N}(\mu, \sigma^2)$, we sample $\epsilon \sim \mathcal{N}(0,1)$, then $z$ is produced by
z \sim q_{\mu, \sigma}(z) = \mathcal{N}(\mu, \sigma^2)
\epsilon \sim \mathcal{N}(0,1)
z = \mu + \epsilon \cdot \sigma
The above is for the 1-D case. For a multi-dimensional (vector) case we use:
\vec{\epsilon} \sim \mathcal{N}(0, \textbf{I})
\vec{z} \sim \mathcal{N}(\vec{\mu}, \sigma^2 \textbf{I})
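A minimal NumPy sketch of the reparameterization trick above, covering both the 1-D case and the vector case with diagonal covariance:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, sigma):
    """Sample z ~ N(mu, sigma^2) as z = mu + eps * sigma, eps ~ N(0, I)."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + eps * sigma

z_scalar = reparameterize(0.0, 1.0)          # 1-D case
z_vector = reparameterize(np.zeros(3), 0.5)  # vector case, sigma^2 I
```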
Related to Logistic Regression. For single-label/multi-label binary classification.
\sigma(z) = \frac{1} {1 + e^{-z}}
For multi-class single label classification.
\sigma(z_i) = \frac{e^{z_{i}}}{\sum_{j=1}^K e^{z_{j}}} \ \ \ for\ i=1,2,\dots,K
Relu(z) = max(0, z)
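Straightforward NumPy versions of the three activations above; the softmax shifts by the max purely for numerical stability:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

def relu(z):
    return np.maximum(0, z)
```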
Below, $x$ and $y$ are $D$-dimensional vectors, and $x_i$ denotes the value on the $i$th dimension of $x$.
\sum_{i=1}^{D}|x_i-y_i|
\sum_{i=1}^{D}(x_i-y_i)^2
It's less sensitive to outliers than the MSE, as it treats the error quadratically only inside an interval.
L_{\delta}=
\left\{\begin{matrix}
\frac{1}{2}(y - \hat{y})^{2} & \text{if } \left | y - \hat{y} \right | < \delta\\
\delta (\left | y - \hat{y} \right | - \frac{1}{2}\delta) & \text{otherwise}
\end{matrix}\right.
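The three regression losses above as short NumPy functions; note that MAE/MSE are written here as sums over dimensions, matching the equations (divide by $D$ for the usual means):

```python
import numpy as np

def mae(x, y):
    return np.sum(np.abs(x - y))

def mse(x, y):
    return np.sum((x - y) ** 2)

def huber(y, y_hat, delta=1.0):
    r = np.abs(y - y_hat)
    return np.where(r < delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))
```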
-{(y\log(p) + (1 - y)\log(1 - p))}
-\sum_{c=1}^My_{o,c}\log(p_{o,c})
- M - number of classes
- log - the natural log
- y - binary indicator (0 or 1) of whether class label c is the correct classification for observation o
- p - predicted probability that observation o is of class c
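A NumPy sketch of both cross-entropy forms above; the small `eps` guards against $\log 0$ and is an implementation detail, not part of the definition:

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """y in {0, 1}; p is the predicted probability of the positive class."""
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def cross_entropy(y_onehot, p, eps=1e-12):
    """y_onehot: one-hot labels over M classes; p: predicted distribution."""
    return -np.sum(y_onehot * np.log(p + eps))
```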
Minimizing the negative log-likelihood
is equivalent to Maximum Likelihood Estimation (MLE).
Here $p(y)$ is a scalar rather than a vector: it is the predicted probability on the single dimension where the ground truth $y$ lies. It is thus equivalent to cross entropy (see wiki).
NLL(y) = -{\log(p(y))}
\min_{\theta} \sum_y {-\log(p(y;\theta))}
\max_{\theta} \prod_y p(y;\theta)
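A tiny numeric check of the equivalence: summing negative log-likelihoods equals the negative log of the likelihood product (the probabilities below are made up for illustration):

```python
import numpy as np

probs = np.array([0.9, 0.7, 0.8])              # p(y; theta) for three observations
nll = -np.log(probs).sum()                     # sum of per-observation NLLs
assert np.isclose(nll, -np.log(probs.prod()))  # = -log(likelihood product)
```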
Used in the Support Vector Machine (SVM).
max(0, 1 - y \cdot \hat{y})
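As a NumPy one-liner, with $y \in \{-1, +1\}$ and $\hat{y}$ a raw margin score:

```python
import numpy as np

def hinge(y, y_hat):
    return np.maximum(0, 1 - y * y_hat)
```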
KL(\hat{y} || y) = \sum_{c=1}^{M}\hat{y}_c \log{\frac{\hat{y}_c}{y_c}}

JS(\hat{y} || y) = \frac{1}{2}(KL(y||\frac{y+\hat{y}}{2}) + KL(\hat{y}||\frac{y+\hat{y}}{2}))
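A NumPy sketch of both divergences, assuming `p` and `q` are discrete distributions over the same $M$ classes; the `eps` is only there to avoid division by zero:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    return np.sum(p * np.log((p + eps) / (q + eps)))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```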
The $Error$ below can be any of the losses above.
A regression model that uses L1 regularization technique is called Lasso Regression.
Loss = Error(Y - \widehat{Y}) + \lambda \sum_1^n |w_i|
A regression model that uses the L2 regularization technique is called Ridge Regression.
Loss = Error(Y - \widehat{Y}) + \lambda \sum_1^n w_i^{2}
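Both penalties as a NumPy sketch, where `error` is any loss value from above, `w` the weight vector, and `lam` the coefficient $\lambda$:

```python
import numpy as np

def lasso_loss(error, w, lam):
    return error + lam * np.sum(np.abs(w))   # L1 penalty

def ridge_loss(error, w, lam):
    return error + lam * np.sum(w ** 2)      # L2 penalty
```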
Some of them overlap with the losses above, like MAE and KL-divergence.
Accuracy = \frac{TP+TN}{TP+TN+FP+FN}
Precision = \frac{TP}{TP+FP}
Recall = \frac{TP}{TP+FN}
F1 = \frac{2*Precision*Recall}{Precision+Recall} = \frac{2*TP}{2*TP+FP+FN}
Sensitivity = Recall = \frac{TP}{TP+FN}
Specificity = \frac{TN}{FP+TN}
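All of the metrics above computed from the four confusion-matrix counts, as a plain Python sketch:

```python
def classification_metrics(tp, fp, fn, tn):
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)          # = sensitivity
    f1          = 2 * tp / (2 * tp + fp + fn)
    specificity = tn / (fp + tn)
    return accuracy, precision, recall, f1, specificity
```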
AUC is calculated as the Area Under the $Sensitivity$ (TPR) vs. $(1-Specificity)$ (FPR) curve.
MAE, MSE: see the equations above.
The Mutual Information is a measure of the similarity between two labelings of the same data. Where $|U_i|$ is the number of samples in cluster $U_i$ and $|V_j|$ is the number of samples in cluster $V_j$, the Mutual Information between clusterings $U$ and $V$ is given as:
MI(U,V)=\sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i\cap V_j|}{N}
\log\frac{N|U_i \cap V_j|}{|U_i||V_j|}
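A direct NumPy translation of the MI formula, assuming `U` and `V` are arrays of cluster labels for the same $N$ samples:

```python
import numpy as np

def mutual_information(U, V):
    N = len(U)
    mi = 0.0
    for u in np.unique(U):
        for v in np.unique(V):
            n_uv = np.sum((U == u) & (V == v))    # |U_i ∩ V_j|
            if n_uv == 0:
                continue                          # 0 log 0 treated as 0
            n_u, n_v = np.sum(U == u), np.sum(V == v)
            mi += (n_uv / N) * np.log(N * n_uv / (n_u * n_v))
    return mi
```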
Normalized Mutual Information (NMI) is a normalization of the Mutual Information (MI) score to scale the results between 0 (no mutual information) and 1 (perfect correlation). In this function, mutual information is normalized by some generalized mean of H(labels_true) and H(labels_pred); see wiki.
Skipping RI and ARI due to their complexity.
Also skipping metrics for related tasks (e.g. modularity for community detection [graph clustering], coherence score for topic modeling [soft clustering]).
Skip nDCG (Normalized Discounted Cumulative Gain) for its complexity.
Average Precision is calculated as:
\text{AP} = \sum_n (R_n - R_{n-1}) P_n
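A NumPy sketch of AP, assuming `precisions` and `recalls` hold the $P_n$ and $R_n$ values ordered by threshold, with $R_0 = 0$ prepended:

```python
import numpy as np

def average_precision(precisions, recalls):
    R = np.concatenate([[0.0], recalls])          # prepend R_0 = 0
    return np.sum((R[1:] - R[:-1]) * np.asarray(precisions))
```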
where $P_n$ and $R_n$ are the precision and recall at the $n$th threshold.
MAP is the mean of AP over all the queries.
Cosine(x,y) = \frac{x \cdot y}{|x||y|}
Similarity of two sets $U$ and $V$.
Jaccard(U,V) = \frac{|U \cap V|}{|U \cup V|}
Relevance of two events $x$ and $y$.
PMI(x;y) = \log{\frac{p(x,y)}{p(x)p(y)}}
For example, $p(x)$ and $p(y)$ are the frequencies of words $x$ and $y$ appearing in a corpus, and $p(x,y)$ is the frequency of their co-occurrence.
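The three measures above as small NumPy/Python functions; for PMI, the probabilities are assumed to be pre-estimated frequencies from a corpus:

```python
import numpy as np

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard(U, V):
    U, V = set(U), set(V)
    return len(U & V) / len(U | V)

def pmi(p_xy, p_x, p_y):
    return np.log(p_xy / (p_x * p_y))
```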
This repository currently contains only simple equations for ML, mainly about deep learning and NLP due to personal research interests.
Due to time constraints, elegant equations from traditional ML approaches like SVM, SVD, PCA, and LDA are not included yet.
Moreover, there is a trend towards more complex metrics, which have to be calculated with complicated programs (e.g. BLEU, ROUGE, METEOR), iterative algorithms (e.g. PageRank), optimization (e.g. Earth Mover Distance), or even learning-based methods (e.g. BERTScore). They thus cannot be described with simple equations.
https://blog.floydhub.com/gans-story-so-far/
https://ermongroup.github.io/cs228-notes/extras/vae/
Thanks to a-rodin's solution for showing LaTeX in GitHub markdown, which I have wrapped into latex2pic.py.