
Assessing correlations

Published 2021-05-19 13:35:44
Included in the column: hsdoifh biuwedsy

Lecture 9: Assessing correlations

-be able to explain why identifying correlations is useful for data wrangling/analysis

  • Discover relationships between features
  • One step towards discovering causality

-understand what is correlation between a pair of features

  • Correlation is used to detect pairs of variables that might have some relationship
  • Correlation does not necessarily imply causality
  • Purpose:
    • discover relationships
    • feature ranking : select the best features for building better predictive models

-understand how correlation can be identified using visualisation

  • Scatterplot

-understand the concept of a linear relation versus a non-linear relation for a pair of features

-understand why the concept of correlation is important, where it is used and understand why correlation is not the same as causation

  • Discover relationships
  • One step towards discovering causality (A causes B)
    • Example: Gene A causes lung cancer
  • Feature ranking: select the best features for building better predictive models
    • A good feature to use is one that is highly correlated with the outcome one is trying to predict
  • Correlation can hint at potential causal relationships
  • Correlation does not imply causality!
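
The feature-ranking idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed method: the helper `pearson_r`, the feature names and all data values are hypothetical and do not come from the lecture. Features are ranked by the absolute value of their correlation with the outcome, so a strongly negative feature still ranks above an uncorrelated one.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples (hypothetical helper)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

# Made-up outcome and candidate features, for illustration only.
outcome = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "feature_a": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises exactly with the outcome
    "feature_b": [5.0, 3.0, 4.0, 1.0, 2.0],   # mostly falls as the outcome rises
    "feature_c": [3.0, 1.0, 2.0, 1.0, 3.0],   # no linear trend
}

# Score each feature by |r| with the outcome, then sort from strongest to weakest.
scores = {name: abs(pearson_r(values, outcome)) for name, values in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 3))
```

A high rank here only says the feature moves with the outcome in a roughly linear way; it says nothing about whether the feature causes the outcome.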

-understand the use of Euclidean distance for computing correlation between two features and its advantages/ disadvantages

  • Advantages: easy to implement
  • Disadvantages:
    • Objects can be represented with different measurement scales, which distorts the distance
    • Does not give a clear intuition about how well variables are correlated
    • Cannot discover variables with similar behaviour / dynamics but at different scales
    • Cannot discover variables with similar behaviour / dynamics but in the opposite direction (negative correlation)
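
The scale and direction problems can be seen on a toy example. The sketch below uses made-up values and a hypothetical helper `euclidean_distance`; it only illustrates the points in the list and is not taken from the lecture.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Made-up data: the same rising pattern recorded in metres and in centimetres.
height_m = [1.5, 1.6, 1.7, 1.8]
height_cm = [150.0, 160.0, 170.0, 180.0]  # identical behaviour, different scale
shuffled = [1.8, 1.5, 1.7, 1.6]           # no consistent pattern, same scale
falling = [1.8, 1.7, 1.6, 1.5]            # exact opposite direction, same scale

d_scaled = euclidean_distance(height_m, height_cm)
d_shuffled = euclidean_distance(height_m, shuffled)
d_falling = euclidean_distance(height_m, falling)

# The perfectly related (rescaled) feature is by far the most "distant",
# and the distance to the falling feature gives no hint that the
# relationship is negative: it is just another small positive number.
print(d_scaled, d_shuffled, d_falling)
```

The distance is driven by the units of measurement rather than by how the two features move together, which is why it is hard to read as a measure of correlation.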

-understand the use of Pearson correlation coefficient for computing correlation between two features and its advantages/ disadvantages

  • Advantage: the result is easy to interpret, since its sign and magnitude show how the data are related
  • Disadvantage: can only detect linear relationships

-understand the meaning of the variables in the Pearson correlation coefficient formula and how they can be calculated. Be able to compute this coefficient on a simple pair of features. The formula for this coefficient will be provided on the exam.
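
The notes do not reproduce the formula itself. For reference, the standard definition of the Pearson correlation coefficient for n paired samples is given below, followed by a small worked example on made-up values (the data are an assumption chosen for easy arithmetic, not taken from the lecture). Here x_i and y_i are the i-th sample values of the two features, and x̄ and ȳ are their sample means.

```latex
% Standard definition of the Pearson correlation coefficient
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i

% Worked example (illustrative data): x = (1, 2, 3), y = (1, 3, 2)
% Means: \bar{x} = 2, \bar{y} = 2
% Deviations: x - \bar{x} = (-1, 0, 1), \quad y - \bar{y} = (-1, 1, 0)
\sum_{i}(x_i - \bar{x})(y_i - \bar{y}) = (-1)(-1) + (0)(1) + (1)(0) = 1
\sum_{i}(x_i - \bar{x})^2 = 2, \qquad \sum_{i}(y_i - \bar{y})^2 = 2
r_{xy} = \frac{1}{\sqrt{2}\,\sqrt{2}} = 0.5
```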

  • rxy lies in the range [-1, 1]
    • 1 = perfect positive linear correlation
    • -1 = perfect negative linear correlation
    • 0 = no linear correlation
    • |r| = strength of the linear correlation


  • Degree (by |r|)
    • > 0.5 = large
    • 0.3 ~ 0.5 = moderate
    • 0.1 ~ 0.3 = small
    • < 0.1 = trivial
  • Scale invariant: r(x,y) = r(x, Ky) for any constant K > 0
  • Location invariant: r(x,y) = r(x, K+y) for any constant K
  • Can only detect linear relationships: y = a × x + b + noise
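
These properties can be checked numerically. The following is a minimal sketch, assuming the standard Pearson definition; the helper name `pearson_r` and all data values are hypothetical. It computes r for a small pair of features, confirms that rescaling or shifting y leaves r unchanged, shows that a negative scale factor flips the sign, and shows that a perfect but non-linear relation (y = x²) can still give r = 0.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]           # deviations of x from its mean
    dy = [b - mean_y for b in y]           # deviations of y from its mean
    cov = sum(a * b for a, b in zip(dx, dy))
    ss_x = sum(a * a for a in dx)          # sum of squared deviations of x
    ss_y = sum(b * b for b in dy)          # sum of squared deviations of y
    return cov / math.sqrt(ss_x * ss_y)

# Made-up pair of features with a clear but imperfect rising trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

r = pearson_r(x, y)                               # 0.8 for this data
r_scaled = pearson_r(x, [10 * v for v in y])      # scale invariance, K = 10
r_shifted = pearson_r(x, [v + 100 for v in y])    # location invariance, K = 100
r_flipped = pearson_r(x, [-v for v in y])         # negative factor flips the sign

# A perfect non-linear dependence that Pearson does not detect.
x_sym = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_sq = [v * v for v in x_sym]
r_quadratic = pearson_r(x_sym, y_sq)              # 0 for this symmetric data

print(r, r_scaled, r_shifted, r_flipped, r_quadratic)
```

By the degree guideline above, r = 0.8 counts as a large positive linear correlation, while the quadratic pair would be scored as uncorrelated even though y is fully determined by x.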

-be able to interpret the meaning of a computed Pearson correlation coefficient

We will define a correlation measure rxy, assessing samples from two features x and y:

  • Assess how close their scatter plot is to a straight line (a linear relationship)

-understand the advantages and disadvantages of using the Pearson correlation coefficient for assessing the degree of relationship between two features

  • Same as the advantages and disadvantages listed for the Pearson correlation coefficient above

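Original-content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, contact cloudcommunity@tencent.com to request removal.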
