Cross-entropy
Source: Wikipedia
[1] http://willwolf.io/2017/05/18/minimizing_the_negative_log_likelihood_in_english/
[2] https://www.quora.com/What-are-the-differences-between-maximum-likelihood-and-cross-entropy-as-a-loss-function
[3] https://jhui.github.io/2017/01/05/Deep-learning-Information-theory/
[4] https://en.wikipedia.org/wiki/Categorical_distribution
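The topic these sources cover, the cross-entropy between two discrete distributions p and q over the same support, is defined as H(p, q) = −Σ_x p(x) log q(x). A minimal Python sketch of that definition (the helper name `cross_entropy` is illustrative, not from any of the linked sources):

```python
import math

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_x p(x) * log(q(x)).

    p and q are discrete distributions given as equal-length
    lists of probabilities over the same outcomes. Terms with
    p(x) = 0 contribute nothing, so they are skipped.
    """
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# When q == p, cross-entropy reduces to the entropy of p:
# for a fair coin, H(p, p) = ln 2 (using natural log).
p = [0.5, 0.5]
print(cross_entropy(p, p))  # ln 2 ≈ 0.6931
```

Cross-entropy is minimized when q = p, which is why it serves as a loss function: pushing the model distribution q toward the data distribution p (a connection discussed in links [1] and [2]).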