Lectures 12 and 13: Classification and regression techniques: decision tree and k-nearest neighbor
-understand what is meant by the task of classification and be able to identify scenarios where it is useful. Understand how it relates to data wrangling.
- Classification
- Definition: given a collection of records (training set), where each record contains a set of attributes and one of the attributes is the class, find a predictive model for the class attribute as a function of the values of the other attributes. How does this differ from prediction? Classification is prediction of a category: the model is learned from known (labelled) data and then applied to previously unseen records.
- Useful scenarios:
- animal classification
- bank classifying borrower
- detecting tax fraud/tax cheats
- Given a collection of records (training set)
- Each record contains a set of attributes + one class label
- Find a predictive model for class label as a function of the values of other attributes
- Goal: previously unseen records should be assigned a class as accurately as possible
- Test set = used to determine the accuracy of the model
- The given data set is divided into training & test sets
- Training set used to build the model
- Test set used to validate the model
-understand what is meant by the task of regression and be able to identify scenarios where it is useful. Understand how it relates to data wrangling.
- Regression
- given a collection of records, where each record contains a set of attributes and one of the attributes is the target variable, regression uses the other attributes to fit (predict) the target variable.
- useful scenarios:
- predict a number (a continuous value)
- predicting ice-creams consumption from temperature
- predict activity level of a target gene
- Similar to classification, but for continuous variables
- Given a collection of records (training set)
- Each record contains a set of attributes, one of which is the target variable
- Learn predictive model from data (see the sketch below)
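A minimal regression sketch in Python, assuming scikit-learn is available; the temperature/consumption numbers are made up for illustration, echoing the ice-cream example above.

```python
# Hypothetical regression sketch: fit ice-cream consumption (continuous target)
# from temperature (attribute). The numbers are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

temperature = np.array([[15], [20], [25], [30], [35]])   # attribute (2-D for sklearn)
consumption = np.array([120, 180, 260, 330, 410])        # continuous target variable

model = LinearRegression().fit(temperature, consumption)
print(model.predict(np.array([[28]])))                   # predict for an unseen temperature
```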
-understand the use of accuracy as a metric for measuring the performance of a classification method. Understand how TP, TN, FP and FN are used in the accuracy calculation. The formula for accuracy will be provided on the exam
- Appreciate the benefits and limitations of accuracy as a performance metric for classification
- suppose there are 2 classes, 0 and 1, with 9990 and 10 instances respectively
- accuracy can be misleading if all test cases are predicted as class 0
- Metrics for performance evaluation
- TP: should say yes, and we said yes.
- FN: should say yes, but we said no.
- FP: should say no, but we said yes.
- TN: should say no, and we said no.
- The first letter says whether the prediction was right (T) or wrong (F); the second letter is what we predicted, positive (P) or negative (N).
- Accuracy formula is on the formula sheet: Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Accuracy can be misleading here: a model may fail to detect the minority class entirely and still score very highly. Other metrics, such as recall, can help in that case.
- Divide the given dataset into training and test sets
- Learn decision tree using the training set
- Evaluate performance of decision tree on the test set
- Can be summarized in a Confusion Matrix (contingency table)
- Rows = actual class, columns = predicted class:

| | Predicted: Yes | Predicted: No |
|---|---|---|
| Actual: Yes | TP | FN |
| Actual: No | FP | TN |
- For an accurate decision tree classifier -> minimise:
- False positives
- False negatives
- Limitations of accuracy
- E.g. a 2-class problem:
- Number of class 0 = 9990
- Number of class 1 = 10
- If model predicts everything to be class 0
- Accuracy = 9990/10000 = 99.9%
- Accuracy is misleading, as the model does not detect any class 1 example (see the sketch below)
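A small sketch of the accuracy calculation and why it misleads on imbalanced data; the counts mirror the 9990-vs-10 example above, and the helper name is illustrative.

```python
# Accuracy from confusion-matrix counts; the values mirror the 9990-vs-10 example.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# A model that predicts everything as class 0 (treating class 1 as "positive"):
# the 9990 class-0 records become true negatives, the 10 class-1 records false negatives.
print(accuracy(tp=0, tn=9990, fp=0, fn=10))   # 0.999, yet no class-1 example is detected
```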
-understand the operation and rationale of the k nearest neighbor algorithm for classification
- Requires
- Set of stored records
- Distance Metric to compute distance between records
- Value of k -> number of nearest neighbours to retrieve
- To classify an unknown record:
- Compute distance to other training records
- Euclidean distance
- Pearson coefficient
- Identify k nearest neighbours
- data points that have the k smallest distances
- Use class labels of nearest neighbours to determine the class label of unknown record
- Take the majority vote of class labels among the k nearest neighbours
- OR weight the vote according to distance
- Choosing the value of k
- Too small -> sensitive to noise points
- Too large -> neighbourhood may include points from other classes
- K-nearest neighbors (K-NN / KNN)
- k-nearest neighbors of a record x are the data points that have the k smallest distances to x
- process
- compute the distance to other training records
- identify k nearest neighbors
- use class labels of nearest neighbors to determine the class label of unknown record
- Advantage
- simpler to compute than a decision tree (no model has to be built first)
- can handle datasets with complex structure
- Disadvantages
- hard to determine the size of k
- if k is too small, sensitive to noise points
- if k is too large, neighborhood may include points from other classes
- Classification may be slow for large datasets
- Need to store the training data in order to make predictions (a minimal sketch follows below)
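A minimal NumPy sketch of the k-NN procedure described above (Euclidean distance, majority vote); the toy points and the helper name knn_predict are illustrative, not from the lecture.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify one unknown record x by majority vote among its k nearest neighbours."""
    dists = np.linalg.norm(X_train - x, axis=1)              # Euclidean distance to every stored record
    nearest = np.argsort(dists)[:k]                          # indices of the k smallest distances
    return Counter(y_train[nearest]).most_common(1)[0][0]    # majority vote of their class labels

# Toy usage: two well-separated classes in 2-D
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([7.5, 8.5]), k=3))   # -> 1
```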
-understand the operation and rationale of the decision tree algorithm for classification
- A flow-chart-like tree structure
- Internal node = denotes a test on an attribute
- Branch = an outcome of the test
- Leaf nodes = class labels / class distribution
- Advantage:
- Easy to interpret
- Relatively efficient to construct
- Fast for making a decision about a test instance
- Disadvantage:
- Sometimes too simple for data with complex structure
- For complex datasets, the tree might grow very big & not easy to understand
- May behave strangely for some types of features (e.g. a unique identifier like student ID, which splits the training data perfectly but does not generalise)
- Decision Tree
- How to specify test condition
- Depends on attribute types
- nominal (e.g. family, sports, luxury for car)
- ordinal (e.g. small, medium, large for size)
- continuous (e.g. numeric data over a range)
- Depends on number of ways to split
- 2-way split
- Multi-way split
- Splitting Based on nominal attributes
- Multi-way split: use as many partitions as there are distinct values
- Binary split: divides values into two subsets. Need to find optimal partitioning.
- How good is a split?
- Gain = impurity of the parent node (before splitting) minus the weighted sum of the impurities of its children nodes, where each child's impurity is weighted by (number of data points in the child node) / (number of data points in the parent node).
- The larger the gain, the better the split.
- Create the decision tree
- Step 1: For each feature, calculate the information gain of its candidate split into left and right children
- Step 2: Choose feature + split with the highest information gain & use this as the root node and its split
- Step 3: Recurse on each child node, terminating when a node contains only Cheat = No or only Cheat = Yes records (a scikit-learn sketch follows below)
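A sketch of these steps using scikit-learn's DecisionTreeClassifier with the entropy criterion; the Refund/Income records and Cheat labels are made-up stand-ins for the lecture's tax example.

```python
# Sketch: build a decision tree with the entropy (information gain) criterion.
from sklearn.tree import DecisionTreeClassifier, export_text

# columns: [Refund (1 = Yes), Taxable income in $1000s] -- made-up records
X = [[1, 125], [0, 100], [0, 70], [1, 120], [0, 95], [0, 60], [1, 220], [0, 85]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes"]   # Cheat label

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=["Refund", "Income"]))   # the learned rules
```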
-Understand how a decision tree may be used to make predictions about the class of a test instance
- Hunt’s algorithm
- Dt = set of training records that reach a node t
- If Dt contains records that belong to the same class yt, then t is a leaf node labelled as yt
- If Dt is an empty set, then t is a leaf node labelled by the default class yd
- If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset (see the sketch below).
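A schematic Python rendering of Hunt's recursion, assuming records are (attribute-dict, label) pairs; it picks attributes in a fixed order rather than by information gain, which the notes cover below.

```python
from collections import Counter

def hunt(records, attributes, default_class):
    """Schematic Hunt's algorithm; records are (attribute_dict, label) pairs."""
    if not records:                                     # Dt is empty -> leaf with default class yd
        return ("leaf", default_class)
    labels = [label for _, label in records]
    if len(set(labels)) == 1:                           # all records share class yt -> leaf yt
        return ("leaf", labels[0])
    if not attributes:                                  # no attribute test left -> majority class
        return ("leaf", Counter(labels).most_common(1)[0][0])
    attr, rest = attributes[0], attributes[1:]          # naive choice; real trees use information gain
    majority = Counter(labels).most_common(1)[0][0]
    children = {value: hunt([(r, l) for r, l in records if r[attr] == value], rest, majority)
                for value in {r[attr] for r, _ in records}}   # multi-way split on attr's values
    return ("node", attr, children)

# Toy usage with made-up tax-style records
data = [({"Refund": "Yes", "Marital": "Single"}, "No"),
        ({"Refund": "No",  "Marital": "Married"}, "No"),
        ({"Refund": "No",  "Marital": "Single"}, "Yes")]
print(hunt(data, ["Refund", "Marital"], default_class="No"))
```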
-Understand the key steps in building a decision tree. How to split the instances, how to specify the attribute test condition, how to determine the best split and how to decide when to stop splitting
- Specify test condition
- Depends on attribute types
- Depends on number of ways to split
- 2-way split
- Multi-way split
| Split type | Nominal | Ordinal | Continuous |
|---|---|---|---|
| Binary (2-way) split | Divides values into two subsets; need to find the optimal partitioning | Divides values into two subsets; need to find the optimal partitioning | Binary decision: (A < v) or (A >= v); consider all possible splits and find the best cut; can be more compute intensive |
| Multi-way split | Use as many partitions as there are distinct values | Use as many partitions as there are distinct values | Discretisation to form an ordinal categorical attribute; static: discretise once at the beginning; dynamic: ranges can be found by equal interval / equal frequency / clustering |
- Determine best split
- Nodes with homogeneous class distribution are preferred
- Entropy: measure of node impurity
- Low entropy = low uncertainty & high purity
- High entropy = high uncertainty & low purity
- Measures homogeneity of a node
- Maximum -> records are equally distributed among all classes
- Minimum -> all records belong to one class
- How good is a split?
- Compare the entropy of parent node (before splitting)
- Compare the entropy of children nodes (after splitting)
- Information gain = H(parent) − Σ_j [ N(vj) / N ] × H(vj)
- H(vj) = impurity measure of child node vj
- j = child node index
- N(vj) = number of data points in child node vj
- N = number of data points in the parent node
- The information gain = mutual information between the class feature & feature being split on
- Splitting using the information gain = choosing the feature that shares the most information with the class variable
- Compute the gain of all splits & choose the largest gain
Example:
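A worked sketch of this computation; the parent/child class counts below are made up, and entropy uses log base 2, so the unit is bits.

```python
import math

def entropy(counts):
    """Entropy of a node given its class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

parent = [10, 10]               # 10 "Yes" and 10 "No" records at the parent -> H = 1.0 bit
left, right = [9, 1], [1, 9]    # class counts in the two children of a candidate split

gain = entropy(parent) - (sum(left) / sum(parent)) * entropy(left) \
                       - (sum(right) / sum(parent)) * entropy(right)
print(round(gain, 3))           # ~0.531: the larger the gain, the better the split
```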
- When to stop
- If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt
- If Dt is an empty set, then t is a leaf node labeled by the default class, yd
-understand the advantages and disadvantages of using k nearest neighbor or decision tree for classification and the reasons for these advantages and disadvantages
- KNN-Classification
- Advantage:
- The cost of the learning process is zero (k-NN is a "lazy" learner: no model is built in advance)
- No assumptions have to be made about the characteristics of the concepts to be learned
- Complex concepts can be learned by local approximation using simple procedures
- Disadvantages:
- Outlier sensitivity: the K-NN algorithm is very sensitive to outliers, as it simply chooses the neighbours based on a distance criterion.
- Missing value treatment: K-NN inherently has no capability of dealing with missing values.
- Slow for large datasets: K-NN is easy to implement, but as the dataset grows the speed of classification declines quickly, since every prediction scans the stored training data.
- Curse of dimensionality: K-NN works well with a small number of input variables, but as the number of variables grows it struggles to predict the output for new data points.
- K-NN needs homogeneous features: if k-NN is built with a common distance such as Euclidean or Manhattan, the features must be on the same scale, since absolute differences are weighted equally, i.e. a given difference in feature 1 should mean the same as in feature 2 (see the scaling sketch after this list).
- Optimal number of neighbours: one of the biggest issues with K-NN is choosing the optimal number of neighbours to consider when classifying a new data entry.
- Imbalanced data causes problems: k-NN doesn’t perform well on imbalanced data. If we consider two classes, A and B, and the majority of the training data is labeled as A, then the model will ultimately give a lot of preference to A. This might result in getting the less common class B wrongly classified.
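A small sketch of putting features on the same scale before k-NN, assuming scikit-learn; the age/income records are made up.

```python
# Scale features before k-NN, since Euclidean distance weighs a unit of
# difference the same in every feature (income would otherwise dominate age).
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X = [[25, 30000], [40, 90000], [35, 40000], [50, 120000]]   # made-up [age, income] records
y = [0, 1, 0, 1]

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[30, 50000]]))
```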
- Decision Tree Classification
- Advantage:
- Decision trees are easy to explain: the tree corresponds to a set of rules.
- It follows the same approach as humans generally follow while making decisions.
- Interpretation of a complex decision tree model can be simplified by visualising it; even a non-expert can follow the logic.
- The number of hyper-parameters to tune is small.
- Disadvantages:
- Decision trees are prone to overfitting.
- They often give lower prediction accuracy than other machine learning algorithms.
- Information gain with categorical variables is biased towards attributes with a larger number of categories.
- Calculations can become complex when there are many class labels.
-understand the use of entropy as a node impurity measure for decision tree node splitting. Understand the benefits of entropy for this task and why it is effective for assessing the goodness of a split
- Entropy
- We have seen entropy in the feature correlation section, where it was used to measure the amount of uncertainty in an outcome
- Entropy can also be viewed as an impurity measure
- The set {A,B,C,A,A,A,A,A} has low entropy: low uncertainty and high purity
- The set {A,B,C,D,B,E,A,F} has high entropy: high uncertainty and low purity (both values are computed in the sketch below)
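A minimal sketch computing the entropy of the two sets above (log base 2, so the unit is bits).

```python
from collections import Counter
from math import log2

def entropy(items):
    """Entropy of a collection of class labels."""
    n = len(items)
    return -sum(c / n * log2(c / n) for c in Counter(items).values())

print(round(entropy(list("ABCAAAAA")), 2))   # ~1.06 bits: low uncertainty, high purity
print(round(entropy(list("ABCDBEAF")), 2))   # 2.5 bits: high uncertainty, low purity
```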
-appreciate why it is necessary to separate the dataset into training and testing in order to fairly evaluate the performance of a classifier
- Why we need to split the dataset into training set and test set
- Learn the decision tree using the training set and evaluate performance on the test set. This gives an unbiased evaluation (see the sketch below).
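A sketch of the train/test workflow with scikit-learn; the iris dataset is only a stand-in so the snippet runs end to end.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                     # stand-in dataset for illustration
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy").fit(X_train, y_train)   # learn on training set
print(accuracy_score(y_test, tree.predict(X_test)))   # evaluate on held-out test set
```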
-appreciate that when the distribution of the testing data is quite different from the distribution of the training data, the accuracy of the classifier may be degraded
-appreciate the benefits and limitations of accuracy as a performance metric for classification
- Limitations of accuracy
- E.g. a 2-class problem:
- Number of class 0 = 9990
- Number of class 1 = 10
- If model predicts everything to be class 0
- Accuracy = 9990/10000 = 99.9%
- Accuracy is misleading as model does not detect any class 1 example