Lecture 8: Hierarchical clustering and dimension reduction
-be able to explain the steps of (agglomerative) hierarchical clustering, using single linkage (min)
- Two main types of hierarchical clustering
- Agglomerative:
- Start with the points as individual clusters
- At each step, merge the closest pair of clusters until only one cluster (or k clusters) left
- Divisive:
- Start with one, all-inclusive cluster
- At each step, split a cluster until each cluster contains a single point (or there are k clusters)
- Traditional hierarchical algorithms use a similarity or distance matrix
- Merge or split one cluster at a time
- Basic algorithm is straightforward
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat:
- Merge the two closest clusters
- Update the proximity matrix
- Until only a single cluster remains
- Key operation is the computation of the proximity of two clusters
- Different approaches to defining the distance between clusters distinguish the different algorithms
- We define the inter-cluster similarity using the minimum distance between points in the two clusters. This is also known as single linkage (a code sketch follows this list).
- Single linkage
- Similarity of two clusters is based on the two most similar (closest) points in the different clusters
- Determined by one pair of points, i.e., by one link in the proximity graph.
-understand how a hierarchical clustering corresponds to a tree structure (dendrogram)
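A minimal sketch of the steps above, assuming SciPy and matplotlib are available; the toy points are made up for illustration only.

```python
# Single-linkage (MIN) agglomerative clustering with SciPy on made-up 2-D
# points, plus the corresponding dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

points = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 1.1], [5.0, 5.0]])

# linkage() starts from singleton clusters and repeatedly merges the closest
# pair; method='single' defines inter-cluster distance as the minimum pairwise
# distance between points in the two clusters (single linkage / MIN).
Z = linkage(points, method='single', metric='euclidean')

# The merge history Z is the tree structure: plotting it gives the dendrogram.
dendrogram(Z)
plt.show()

# Cut the tree into k=2 flat clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```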
- define inter-cluster similarity
- MIN (single linkage) = similarity of two clusters is based on the two most similar (closest) points in different clusters
- Strength: can handle non-elliptical shapes
- Limitations: sensitive to noise & outliers
- MAX (Complete Linkage)
- Strength: less susceptible to noise & outliers
- Limitations: tends to break large clusters
- Problems & Limitations
- Once a decision is made to combine two clusters, it cannot be undone
- No objective function is directly minimized
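A small follow-up sketch, again assuming SciPy, that contrasts MIN (single) and MAX (complete) linkage on the same made-up points; only the method argument changes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 1.1], [5.0, 5.0]])

for method in ('single', 'complete'):          # MIN vs MAX linkage
    Z = linkage(points, method=method)
    labels = fcluster(Z, t=2, criterion='maxclust')
    print(method, labels)
```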
-be able to explain why knowledge of clustering can assist in outlier detection
- An outlier is expected to be far away from any groups of normal objects
- Each object is associated with exactly one cluster, and its outlier score is equal to the distance from its cluster centre.
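A minimal sketch of this idea, assuming scikit-learn: each object is assigned to one cluster (here via k-means) and scored by its distance to that cluster's centre. The data are made up so that one point sits far from both groups.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],       # group 1
              [10.0, 10.0], [10.2, 10.1], [9.9, 9.8], [10.1, 10.3],  # group 2
              [5.0, 5.0]])                                           # intended outlier

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Centre of the cluster each object belongs to, then distance to that centre.
centre_of_each_point = kmeans.cluster_centers_[kmeans.labels_]
outlier_score = np.linalg.norm(X - centre_of_each_point, axis=1)

print(outlier_score)   # the point at (5, 5) receives the largest score
```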
-understand the concept of a dissimilarity matrix and the steps for its construction
- Clustering methods are typically distance based. Represent each object/instance as a vector (a row in the data, with values for each different feature) and then can compute Euclidean distance between pairs of vectors.
- Commonly normalise (scale) each attribute into range [0,1] via a pre-processing step before computing distances
- Compute all pairwise distances between objects. This gives a dissimilarity matrix
- The diagonal of D is all zeros
- D is symmetric about its leading diagonal
- D(i,j)=D(j,i) for all i and j
- Objects follow the same order along rows and columns
- In general, visualising the (raw) dissimilarity matrix may not reveal enough useful information
- Further processing is needed
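A minimal NumPy sketch of the construction described above, on made-up data: min-max scale each attribute into [0, 1], then compute all pairwise Euclidean distances to obtain the dissimilarity matrix D.

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 100.0]])

# Min-max normalisation of each attribute into [0, 1].
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# All pairwise Euclidean distances -> dissimilarity matrix D.
diff = X_scaled[:, None, :] - X_scaled[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

print(D)                      # zero diagonal, symmetric
print(np.allclose(D, D.T))    # True: D(i, j) == D(j, i)
```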
-appreciate the relative advantages and disadvantages of k-means versus hierarchical clustering
- K-Means:
- Advantages
- Easy to implement
- With a large number of variables, K-Means may be computationally faster than hierarchical clustering (if K is small).
- k-Means may produce tighter clusters than hierarchical clustering
- An instance can change cluster (move to another cluster) when the centroids are recomputed.
- Disadvantages
- Difficult to predict the number of clusters (K-Value)
- Initial seeds have a strong impact on the final results
- The order of the data has an impact on the final results
- Sensitive to scale: rescaling your dataset (normalisation or standardisation) will completely change the results. This is not bad in itself, but not realising that you have to pay extra attention to scaling your data might be bad.
- Hierarchical Clustering:
- Advantages
- Hierarchical clustering outputs a hierarchy, i.e. a structure that is more informative than the unstructured set of flat clusters returned by k-means. Therefore, it is easier to decide on the number of clusters by looking at the dendrogram (see the suggestion on how to cut a dendrogram in lab8).
- Easy to implement
- Disadvantages
- It is not possible to undo the previous step: once the instances have been assigned to a cluster, they can no longer be moved around.
- Time complexity: not suitable for large datasets
- Initial seeds have a strong impact on the final results
- The order of the data has an impact on the final results
- Very sensitive to outliers
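For reference, a short sketch (assuming scikit-learn) that runs both algorithms on the same made-up data, mainly to show the two APIs side by side.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),    # two well-separated blobs
               rng.normal(3, 0.3, (20, 2))])

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=2, linkage='single').fit_predict(X)

print(km_labels)
print(hc_labels)
```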
-appreciate the following motivations for dimensionality reduction i) reduce amount of time and memory required by data processing algorithms, ii) allow data to be more easily visualized, iii) help eliminate irrelevant features or noise
- Motivation
- The curse of dimensionality: data analysis techniques that work well in lower dimensions often perform poorly as the dimensionality (number of features) of the analysed data increases
- As dimensionality increases:
- data becomes increasingly sparse & all the distances between pairs of points begin to look the same
- impacts any algorithm that is based on distances between objects
- Purpose
- avoid curse of dimensionality
- reduce amount of time & memory required by data processing algorithms
- allow data to be more easily visualised
- may help to eliminate irrelevant features / reduce noise
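A small sketch illustrating the "distances look the same" effect on random data, assuming SciPy: the ratio between the largest and the smallest pairwise distance shrinks as the number of features grows.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(200, d))        # 200 random points with d features
    dist = pdist(X)                       # all pairwise Euclidean distances
    print(d, dist.max() / dist.min())     # ratio shrinks as d grows
```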
-understand the concept of dimensionality reduction of a dataset (what is the input and what is the output and what is their relationship)
- Method
- input: A dataset with N features & K objects
- select a subset of the original features
- output: A transformed dataset with n<N features & K objects
- n is often 2 or 3 -> easily visualised
- do not need to be a subset of the input N features
- can be new features whose values are constructed using some function applied to the input N features
-understand that dimensionality reduction may be performed by i) selecting a subset of the original features or ii) generating a small number of new features
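A minimal sketch of the input/output relationship on made-up data, assuming scikit-learn for route (ii): start from K objects with N features and end with the same K objects described by n < N features, obtained either by selecting columns or by constructing new ones.

```python
import numpy as np
from sklearn.decomposition import PCA

K, N, n = 50, 10, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(K, N))                    # input: K objects, N features

X_subset = X[:, :n]                            # (i) select n of the original features
X_new = PCA(n_components=n).fit_transform(X)   # (ii) n new, constructed features

print(X.shape, X_subset.shape, X_new.shape)    # (50, 10) (50, 2) (50, 2)
```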
-understand the purpose of using PCA for dimensionality reduction. Understand the potential benefits of using PCA for data visualisation
Principal Components Analysis (PCA)
- Intuition: use fewer features while losing as little as possible of the information needed to describe the objects
- Motivation: PCA is a systematic way to perform dimensionality reduction, much better than doing it manually
- PCA idea: find a new axis (direction) along which the data have the largest variance
- Not much information is lost when PC2 is dropped in a 2-D plot, since it captures the least variance
- Disadvantages: can lose a lot of information if applied carelessly; can be slow for large datasets
- Second principal component: orthogonal to the first, and captures as much of the remaining variability as possible
- Find a new set of features that better captures the variability of the data
- First dimension chosen to capture as much of the variability as possible
- Second dimension is orthogonal to the first & captures as much of the remaining variability as possible
- Third dimension is orthogonal to the first & second, captures as much of the remaining variability as possible
- Goal: find a projection that captures the largest amount of variation in data
- Principal Component Analysis (PCA) is a well-established mathematical technique for reducing the dimensionality of data, while keeping as much variation as possible.
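A short sketch of PCA for visualisation, assuming scikit-learn and matplotlib, on made-up correlated data: project onto the first two principal components, check how much variance they capture, and plot PC1 against PC2.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))   # made-up correlated data

X_std = StandardScaler().fit_transform(X)     # standardise features first
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)          # share of variance captured by PC1, PC2
plt.scatter(X_2d[:, 0], X_2d[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
```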
-understand the intuition of how PCA works. It is not necessary to understand the mathematical formulas used for PCA.
- Use fewer features while losing as little as possible of the information needed to describe the objects
-Understand intuitively what is meant by "First Principal Component" and "Second Principal Component"
- Find a new set of features that better captures the variability of the data
- First dimension chosen to capture as much of the variability as possible.
- The second dimension is orthogonal to the first, and subject to that constraint, captures as much of the remaining variability as possible.
- The third dimension is orthogonal to the first and second, and subject to that constraint, captures as much of the remaining variability as possible.
-understand how to interpret a 2-D visualization of the first two principal components of a dataset
- Step1: Standardization
- Step2: Covariance Matrix computation
- Step3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
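A minimal NumPy sketch of these three steps on made-up data; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # made-up correlated data

# Step 1: standardisation (zero mean, unit variance per feature).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardised features.
C = np.cov(X_std, rowvar=False)

# Step 3: eigen-decomposition; eigenvectors are the principal components,
# eigenvalues give the variance captured by each component.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]                 # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

X_pca = X_std @ eigvecs[:, :2]                    # project onto the first two PCs
print(eigvals / eigvals.sum())                    # proportion of variance per PC
```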