Published 2021-05-19 13:32:22
Lectures 4 and 5: Data cleaning: missing values and outlier detection

-be able to explain the need for and the motivation behind data preprocessing and data cleaning

  • Measuring data quality
    • Accuracy
      • Correct or wrong, accurate or not
    • Completeness
      • Not recorded, unavailable
    • Consistency
      • E.g. discrepancies in representation
    • Timeliness
      • Updated in a timely way
    • Believability
      • Do I trust the data is correct?
    • Interpretability
      • How easily can I understand the data?

-be able to explain the need for and what is involved in each of the major data pre-processing activities (data cleaning, integration, reduction and transformation)

  • Data Cleaning
    • Many tools exist
      • Data scrubbing
      • Data discrepancy detection
      • Data auditing
      • ETL (Extract Transform Load) tools: users specify transformations via a graphical interface
    • Noisy data
      • Truncated fields (exceeded 80 character limit)
      • Text incorrectly split across cells (e.g. separator issues)
      • Salary=“-5”
      • Some causes:
        • Imprecise instruments
        • Data entry issues
        • Data transmission issues
    • Inconsistent data
      • Different naming representations (“Melbourne University” versus “University of Melbourne”) or (“three” versus “3”)
      • Different date formats (“3/4/2016” versus “3rd April 2016”)
      • Age=20, Birthdate=“1/1/2002”
      • Two students with the same student id
      • Outliers
        • E.g. 62,72,75,75,78,80,82,84,86,87,87,89,89,90,999
          • Not plausible if this is a list of ages of hospital patients
          • Might be fine for a list of people's numbers of contacts on LinkedIn
        • Can use automated techniques, but also need domain knowledge
    • Intentionally disguised data
      • Everyone’s birthday is January 1st?
      • Email address is xx@xx.com
      • Adriaans and Zantinge
        • “Recently, a colleague rented a car in the USA. Since he was Dutch, his post-code did not fit the fields of the computer program. The car hire representative suggested that she use the zip code of the rental office instead.”
      • How to handle
        • Look for “unusual” or suspicious values in the dataset, using knowledge about the domain
    • Incomplete (missing data)
      • Lacking feature values
        • Name=“”
        • Age=null
      • Some types of missing data (Rubin 1976)
        • Missing completely at random: Data are missing independently of observed and unobserved data.
          • E.g. flipping a coin to decide whether or not to answer an exam question.
        • Missing not at random
          • I create a dataset by surveying the class about how healthy they feel. What is the meaning of missing values for those who don’t respond?
  • Data Integration
    • Bringing data from multiple sources together
      • Resolve conflicts
      • Detect duplicates
  • Data Reduction
    • Decrease the number of features / instances in a dataset
      • Modifying sampling strategies
      • Removing irrelevant features & reducing noise
    • Makes data easier to visualise & faster to analyse
  • Data Transformation
    • Standardisation
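
Some of the noisy, inconsistent and disguised values listed above can be screened for with simple rule-based checks. The following is a minimal sketch only: the records, the field names (`student_id`, `salary`, `email`) and the placeholder-email rule are hypothetical and chosen to mirror the examples in these notes.

```python
from collections import Counter

# Hypothetical records; field names are illustrative assumptions.
records = [
    {"student_id": 1, "salary": 52000, "email": "ann@example.org"},
    {"student_id": 2, "salary": -5, "email": "bob@example.org"},
    {"student_id": 2, "salary": 48000, "email": "xx@xx.com"},
]

def find_discrepancies(rows):
    """Return (row_index, problem) pairs found by simple domain rules."""
    problems = []
    id_counts = Counter(row["student_id"] for row in rows)
    for i, row in enumerate(rows):
        if row["salary"] < 0:                      # noisy value, e.g. Salary = -5
            problems.append((i, "negative salary"))
        if id_counts[row["student_id"]] > 1:       # two students with the same id
            problems.append((i, "duplicate student id"))
        if row["email"] == "xx@xx.com":            # intentionally disguised value
            problems.append((i, "placeholder email"))
    return problems

issues = find_discrepancies(records)
```

Rules like these only catch what the analyst already suspects, which is why domain knowledge is still needed alongside automated checks.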

-understand the terminologies: features, attributes, instances, objects.

  • Columns = features / attributes
  • Data rows = instances / objects

-understand the difference between categorical/discrete features versus continuous features

-be able to explain the reasons why data might be missing, what are the possible causes?

  • Causes:
    • Malfunction of equipment (e.g. sensors)
    • Not recorded due to misunderstanding
    • May not be considered important at time of entry
    • Deliberate

-understand the difference between data missing completely at random versus data missing not completely at random

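  • Missing completely at random: whether a value is missing is independent of both the observed and the unobserved data
  • Missing not completely at random: whether a value is missing is related to the data, e.g. to the value of that variable itself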
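
The practical difference can be illustrated with a small simulation. This is a sketch under stated assumptions: the health scores are synthetic, and the rule that people with a score of 3 or lower never respond is invented purely to illustrate data that are missing not at random.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic "how healthy do you feel" scores from 1 (poor) to 10 (great).
scores = [random.randint(1, 10) for _ in range(1000)]

# Missing completely at random: each answer is kept or dropped
# by an independent coin flip, unrelated to the score.
mcar = [s for s in scores if random.random() < 0.5]

# Missing not at random (assumed rule): people with a score of 3 or lower
# never respond, so missingness depends on the value itself.
mnar = [s for s in scores if s > 3]

true_mean = mean(scores)
mcar_mean = mean(mcar)   # close to the true mean
mnar_mean = mean(mnar)   # biased upwards
```

Under the coin-flip mechanism the observed mean stays close to the true mean, whereas under the assumed non-random mechanism the observed mean overstates how healthy the class feels.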

-understand what is meant by noise in data or “noisy” data

  • Any data that has been received, stored, or changed in such a manner that it cannot be read or used by the program that originally created it can be described as noisy

-understand the following strategies for handling missing data and their relative advantages/disadvantages (delete all instances with a missing value, manual correction, imputation)

  • Delete all instances with a missing value (case deletion)
    • Easy to analyse the new complete data
    • May bias the analysis if the remaining sample is small or if the missing values follow some structure (i.e. are not missing completely at random)
  • Manually correct
    • Human finds the missing value and fills it in using expert knowledge
  • Imputation
    • Replace the missing value with a substitute one
    • After imputing all missing values, can use standard analysis techniques for complete datasets
    • Fill in with 0
    • Fill in with mean
    • Fill in with median value (if skewed distribution)
    • Fill in mode value (categorical)
      • Use mode (most frequent value) imputation for categorical features
    • Fill in Category mean
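
The simple fill-in strategies above can be sketched with Python's standard `statistics` module. The helper `impute` and the example values are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Replace None entries with a substitute chosen by `strategy`."""
    observed = [v for v in values if v is not None]
    if strategy == "zero":
        fill = 0
    elif strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

ages = [20, 22, None, 24, 90]           # skewed by one large value
colours = ["red", "blue", None, "red"]  # categorical feature

ages_zero = impute(ages, "zero")        # missing age becomes 0
ages_mean = impute(ages, "mean")        # missing age becomes 39
ages_median = impute(ages, "median")    # missing age becomes 23
colours_mode = impute(colours, "mode")  # missing colour becomes "red"
```

In this example the single large age pulls the mean up to 39, while the median (23) stays near the bulk of the data, which is why the median is preferred for skewed distributions.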

-understand the following strategies for imputation of missing values and their relative advantages/disadvantages (fill in with zeros, fill in with mean/median value, fill in with category mean)

  • Fill in with 0
    • Simple
    • Won’t break application programs
    • Limited utility for analysis
  • Fill in with mean
    • Popular method
      • Can be good for supervised classification
      • Apply separately to each attribute
    • Drawbacks
      • Reduces the variance of the feature
      • Incorrect view of the distribution of that attribute
      • Relationships to other features change
  • Fill in with median value (if skewed distribution)
  • Fill in Category mean
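
A small sketch can illustrate category-mean imputation and the variance drawback of overall-mean imputation. The rows and the category labels (`child`, `adult`) are made up for illustration, and `None` is assumed to mark a missing value.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Hypothetical (category, value) rows; None marks a missing value.
rows = [("child", 8), ("child", 10), ("child", None),
        ("adult", 40), ("adult", 44), ("adult", None)]

observed = [v for _, v in rows if v is not None]
overall_mean = mean(observed)

# Mean of the observed values within each category.
by_category = defaultdict(list)
for cat, v in rows:
    if v is not None:
        by_category[cat].append(v)
category_mean = {cat: mean(vs) for cat, vs in by_category.items()}

# Fill with the overall mean versus the mean of the row's own category.
mean_filled = [overall_mean if v is None else v for _, v in rows]
category_filled = [category_mean[cat] if v is None else v for cat, v in rows]

# Filling with the overall mean adds points exactly at the centre,
# so the (population) variance of the feature shrinks.
var_observed = pvariance(observed)
var_mean_filled = pvariance(mean_filled)
```

Here the overall mean (25.5) is a value no child or adult in the data actually has, while the category means (9 and 42) keep each imputed value close to its own group.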

-be able to explain the importance of finding outliers and give concrete examples where this would be useful

  • Outlier objects deviate from the process that generates the normal data
  • Outliers are different from noisy data
    • Noise is random error or variance in a measured variable
    • Noise should be removed before outlier detection
  • Outliers are interesting: Violation of the mechanism that generates the normal data

-be able to explain what is an outlier

  • Outlier: A data object that deviates significantly from the normal objects as if it were generated by a different mechanism
  • From a statistics perspective:
    • Normal (non-outlier) objects are generated using some statistical process
  • How to detect Outliers
    • 1-D data
      • Boxplot
      • Histogram
      • Statistical tests
    • 2-D Data: Scatter plot and eyeball
    • 3-D data: Can also use scatter plot and eyeball
    • >3-D data: Statistical or algorithmic methods

-be able to explain the difference between a global outlier and a contextual outlier

  • Global outlier (or point anomaly)
    • Object is Og if it significantly deviates from the rest of the data set
    • Ex. Intrusion detection in computer networks
    • Issue: Find an appropriate measurement of deviation
  • Contextual outlier (or conditional outlier)
    • Object is Oc if it deviates significantly based on a selected context
    • Is 5° in Melbourne an outlier? (It depends on whether it is summer or winter.)
      • Attributes of data should be divided into two groups
      • Contextual attributes: defines the context, e.g., time & location
      • Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature
    • Issue: How to define or formulate meaningful context?
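
One possible way to make the contextual idea concrete is to compare a behavioral attribute (temperature) only against readings that share the same contextual attribute (season). The sketch below rests on assumptions: the historical readings are invented, and the rule "more than 3 standard deviations from the context's mean" is just one possible deviation measure.

```python
from statistics import mean, stdev

# Hypothetical minimum temperatures (degrees), grouped by the
# contextual attribute "season".
history = {
    "summer": [14, 16, 15, 17, 18, 16, 15],
    "winter": [4, 6, 5, 7, 8, 6, 5],
}

def is_contextual_outlier(temperature, season, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations
    away from the mean of the readings for the same season."""
    values = history[season]
    z = (temperature - mean(values)) / stdev(values)
    return abs(z) > threshold

summer_flag = is_contextual_outlier(5, "summer")  # unusual for summer
winter_flag = is_contextual_outlier(5, "winter")  # ordinary for winter
```

The same reading of 5 is flagged in one context and not in the other, which is exactly what distinguishes a contextual outlier from a global one.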

-be able to explain how a histogram can be used to detect outliers, their relative advantages/disadvantages for this task and be able to construct and interpret a histogram for data understanding and outlier detection

  • Outlier detection using histogram:
    • Figure shows the histogram of purchase amounts in transactions
    • A transaction in the amount of $7,500 is an outlier, since only 0.2% of transactions have an amount higher than $5,000
  • Problem: Hard to choose an appropriate bin size for histogram
    • Bin size too small → normal objects fall in empty/rare bins (false positives)
    • Bin size too large → outliers fall in frequent bins (false negatives)
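
The bin-size problem can be reproduced with a tiny equal-width histogram. This is a sketch under assumptions: the purchase amounts are invented, and the helper `rare_bin_values` with its `min_fraction` cut-off is a hypothetical rule for what counts as a "rare" bin.

```python
from collections import Counter

def rare_bin_values(values, bin_width, min_fraction):
    """Return the values that fall in bins holding less than
    `min_fraction` of all values (equal-width bins starting at 0)."""
    counts = Counter(v // bin_width for v in values)
    n = len(values)
    return [v for v in values if counts[v // bin_width] / n < min_fraction]

# Hypothetical purchase amounts in dollars with one very large transaction.
amounts = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 7500]

# Reasonable bin width: only the large transaction sits in a rare bin.
flag_good = rare_bin_values(amounts, bin_width=100, min_fraction=0.1)

# Too narrow: every value is alone in its bin, so everything is flagged.
flag_narrow = rare_bin_values(amounts, bin_width=5, min_fraction=0.1)

# Too wide: all values share one bin, so nothing is flagged.
flag_wide = rare_bin_values(amounts, bin_width=10000, min_fraction=0.1)
```

With a width of 100 only the 7500 transaction is flagged; a width of 5 flags every transaction (false positives) and a width of 10000 flags none (a false negative).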

-be able to draw and read a 2-D scatter plot and visually identify outliers from it

-be able to construct and interpret a Tukey Boxplot and explain why it is a useful tool for data understanding and outlier detection
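
A Tukey boxplot draws the box from the first quartile (Q1) to the third quartile (Q3) and treats points beyond the fences Q1 − 1.5 × IQR and Q3 + 1.5 × IQR as outliers, where IQR = Q3 − Q1. The sketch below applies that rule to the example list from the data cleaning section. Quartile conventions differ between tools; this version assumes the default method of Python's `statistics.quantiles`, so other software may give slightly different fences.

```python
from statistics import quantiles

def tukey_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using Tukey's k * IQR rule."""
    q1, _median, q3 = quantiles(values, n=4)  # default "exclusive" method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return lower, upper, [v for v in values if v < lower or v > upper]

data = [62, 72, 75, 75, 78, 80, 82, 84, 86, 87, 87, 89, 89, 90, 999]
lower_fence, upper_fence, outliers = tukey_outliers(data)
# Q1 = 75, Q3 = 89, IQR = 14, so the fences are 54 and 110;
# only 999 lies outside them.
```

Because the fences are built from quartiles rather than the mean and standard deviation, the extreme value 999 has little influence on where they fall.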

-it is not necessary to know Grubbs' test for outlier detection


