Lecture 10: Mutual information
- Mutual Information (MI)
  - detects non-linear relationships; operates on discrete features
  - preprocessing: continuous features are first discretised into bins (by domain knowledge, equal width, or equal frequency)
  - Entropy is a measure used to assess the amount of uncertainty in an outcome
  - the formula for calculating MI is on the formula sheet
  - mutual information indicates the amount of information about X we gain from knowing Y
- Normalised mutual information: NMI can be used to provide a more interpretable measure of correlation than MI
- Differences from Pearson correlation:
  - MI can measure non-linear relationships
- MI is very effective for use with discrete features
- understand the meaning of the variables in the (normalised) mutual information and how they can be calculated. Be able to compute this measure on a pair of features. The formula for (normalised) mutual information will be provided on the exam.
- Normalised Mutual Information (NMI)
- Range [0,1]
- Large = high correlation
- Small = low correlation
- understand the role of data discretization in computing (normalised) mutual information
- Variable discretisation (sketched in code below)
  - Domain knowledge: assign thresholds manually
  - Equal-width binning
    - Divide the range of the continuous feature into k equal-length intervals
    - Width = (max − min) / k
  - Equal-frequency binning
    - Divide the range of the continuous feature into k equal-frequency intervals
    - Sort the values & divide → each bin has the same number of objects
    - Freq = n / k objects per bin (for n objects and k bins)
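A minimal sketch of both binning schemes, assuming NumPy; the feature values and the choice of k = 4 bins are made up for illustration:

```python
import numpy as np

# Hypothetical values for a continuous feature; any 1-D numeric array works.
x = np.array([2.0, 3.5, 4.1, 5.0, 7.2, 8.8, 9.9, 12.4])
k = 4  # number of bins (a modelling choice)

# Equal-width: each interval has length (max - min) / k.
width = (x.max() - x.min()) / k
width_edges = x.min() + width * np.arange(1, k)  # k - 1 internal cut points
width_bins = np.digitize(x, width_edges)         # bin index 0..k-1 per value

# Equal-frequency: cut points at quantiles, so each bin gets ~n/k values.
freq_edges = np.quantile(x, np.linspace(0, 1, k + 1))[1:-1]
freq_bins = np.digitize(x, freq_edges)

print(width_bins)  # [0 0 0 1 2 2 3 3] -- equal interval lengths
print(freq_bins)   # [0 0 1 1 2 2 3 3] -- equal counts per bin
```

Note how the two schemes assign the same data to different bins: different bin choices will lead to different mutual information estimates later on.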
- understand the meaning of the entropy of a random variable and how to interpret an entropy value. Understand its extension to conditional entropy
- Entropy
  - A measure used to assess the amount of uncertainty in an outcome
    - quantifies the degree of uncertainty
    - the amount of randomness
  - Low entropy = more certain
  - Higher entropy = less certain
  - H(X) = −Σᵢ pᵢ log₂(pᵢ) (computed in the sketch below)
    - X = feature
    - pᵢ = proportion of points in the i-th bin
  - Properties
    - H(X) ≥ 0
    - Entropy is maximised for a uniform distribution
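A small sketch of the entropy computation, assuming log base 2 (entropy in bits) and NumPy for counting; the example values are made up:

```python
import numpy as np

def entropy(labels):
    """H(X) = -sum_i p_i * log2(p_i), from a list of discrete values."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()               # proportion of points per bin
    return float(-np.sum(p * np.log2(p)))

print(entropy([0, 0, 0, 0]))  # 0.0   -- one outcome, no uncertainty
print(entropy([0, 1, 0, 1]))  # 1.0   -- uniform over 2 bins: the maximum
print(entropy([0, 0, 0, 1]))  # ~0.81 -- skewed distribution, more certain
```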
- Conditional entropy H(Y|X)
  - Measures how much information is needed to describe outcome Y, given that outcome X is known
- Mutual information MI(X, Y) (computed in the sketch below)
  - The amount of information shared between two variables X & Y
  - The amount of information about X gained by knowing Y
  - The amount of information about Y gained by knowing X
  - Large → highly correlated (more dependent)
  - Small → low correlation (more independent)
  - MI(X, Y) ≥ 0
  - More precisely, 0 ≤ MI(X, Y) ≤ min(H(X), H(Y))
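A sketch of MI over discrete values, using the identity MI(X,Y) = H(X) + H(Y) − H(X,Y), which is equivalent to the usual double-sum definition; the toy data is made up:

```python
from collections import Counter
import numpy as np

def entropy(labels):
    """H over the empirical distribution of a list of discrete values."""
    counts = np.array(list(Counter(labels).values()))
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mutual_information(x, y):
    """MI(X,Y) = H(X) + H(Y) - H(X,Y).

    H(X,Y) is the entropy of the joint outcomes (x_i, y_i); the
    conditional entropy follows as H(Y|X) = H(X,Y) - H(X).
    """
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

x = [0, 0, 1, 1]
y = [0, 0, 1, 1]  # y is fully determined by x
z = [0, 1, 0, 1]  # z is independent of x in this sample
print(mutual_information(x, y))  # 1.0 = H(Y): knowing X removes all doubt about Y
print(mutual_information(x, z))  # 0.0: the variables share no information
```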
- be able to interpret the meaning of the (normalised) mutual information between two variables
- With X and Y features (columns) in a dataset, MI (mutual information) is a measure of correlation
- the amount of information about X we gain by knowing Y, or the amount of information about Y we gain by knowing X
- MI(X,Y) is always at least zero and may be larger than 1
- A correlation measure that can detect non-linear relationships
- Operates with discrete features
- Pre-processing: continuous features are discretised into bins
- In fact, one can show that
- 0 ≤ MI(X,Y) ≤ min(H(X),H(Y)) (where min(a,b) indicates the minimum of a and b)
- Thus, if we want a measure in the interval [0,1], we can define the normalised mutual information (NMI), sketched in code below
- NMI(X,Y) = MI(X,Y) / min(H(X),H(Y))
- NMI(X,Y)
- large: X and Y are highly correlated (more dependent)
- small: X and Y have low correlation (more independent)
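Building on the entropy and mutual_information helpers sketched above, NMI is a one-line ratio; the guard for zero entropy (a constant feature) is a convention I've assumed, not something specified in the lecture:

```python
def nmi(x, y):
    """NMI(X,Y) = MI(X,Y) / min(H(X), H(Y)); always in [0, 1]."""
    denom = min(entropy(x), entropy(y))
    # A constant feature has zero entropy and shares no information.
    return mutual_information(x, y) / denom if denom > 0 else 0.0

print(nmi([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0: perfectly dependent
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0: independent in this sample
```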
- understand the use of (normalised) mutual information for computing correlation of some feature with a class feature and why this is useful. Understand how this provides a ranking of features, according to their predictiveness of the class (see the sketch below)
- NMI is a good measure for determining the quality of clustering.
- It is an external measure because we need the class labels of the instances to determine the NMI.
- Since it is normalised, we can measure and compare the NMI between different clusterings with different numbers of clusters.
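A sketch of ranking features by their NMI with the class, reusing the nmi helper above; the feature names and values are entirely made up:

```python
# Hypothetical discretised feature columns and a class label.
features = {
    "age_bin":    [0, 0, 1, 1, 2, 2],
    "height_bin": [0, 1, 2, 0, 1, 2],
}
class_label = [0, 0, 1, 1, 1, 1]

# Rank features by NMI with the class: higher NMI = more predictive.
ranking = sorted(features, key=lambda f: nmi(features[f], class_label),
                 reverse=True)
for name in ranking:
    print(f"{name}: NMI = {nmi(features[name], class_label):.3f}")
# age_bin: NMI = 1.000    (each age bin maps to exactly one class)
# height_bin: NMI = 0.274 (weak association with the class)
```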
- understand that normalised mutual information can be used to provide a more interpretable measure of correlation than mutual information. The formula for normalised mutual information will be provided on the exam
- understand the advantages and disadvantages of using (normalised) mutual information for computing correlation between a pair of features. Understand the main differences between this and Pearson correlation.
- Advantages
- Can detect both linear & non-linear dependencies
- Applicable and very effective for use with discrete features
- Disadvantages
  - If a feature is continuous, it must first be discretised to compute mutual information
    - This requires making choices about which bins to use
- Different bin choices will lead to different estimations of mutual information
- Differences from Pearson correlation (illustrated in the sketch below):
  - MI can measure non-linear relationships
- MI is very effective for use with discrete features
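A quick illustration of the contrast, reusing the nmi helper sketched earlier: y = x² is perfectly dependent on x, yet Pearson correlation is near zero because positive and negative slopes cancel, while NMI (after equal-width binning into 10 bins, a choice I've assumed) is clearly positive:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
y = x ** 2  # y is completely determined by x, but the relation is not linear

# Pearson correlation is ~0: the linear trend averages out to nothing.
print(np.corrcoef(x, y)[0, 1])

# Discretise both features with equal-width bins, then measure NMI.
x_bins = list(np.digitize(x, np.linspace(-1, 1, 11)[1:-1]))
y_bins = list(np.digitize(y, np.linspace(0, 1, 11)[1:-1]))
print(nmi(x_bins, y_bins))  # clearly above 0: MI detects the dependency
```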