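I am learning how to build a simple linear model that predicts the price of a flat from its square meters and number of rooms. I have a .csv dataset with several features. 'Price' is of course one of them, but it contains several suspicious values such as '1' or '4000'. I want to remove these values based on the mean and standard deviation, so I use the following function to remove outliers: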
import numpy as np
import pandas as pd

def reject_outliers(data):
    u = np.mean(data)
    s = np.std(data)
    data_filtered = [e for e in data if (u - 2 * s < e < u + 2 * s)]
    return data_filtered
Then I write a function to build the linear regression:
def linear_regression(data):
    data_filtered = reject_outliers(data['Price'])
    print(len(data_filtered))  # based on the length I see that several outliers have been removed
The next step is to define the data/predictors. I set up my features:
    # still inside linear_regression
    features = data[['SqrMeters', 'Rooms']]
    target = data_filtered
    X = features
    Y = target
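Here is my problem: how can I get the same number of observations for X and Y? Right now my sample counts are inconsistent (after removing outliers, X has 5000 rows and Y has 4995). Thanks for any help on this.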
Posted on 2018-01-05 13:01:08
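The features and the labels must have the same length, so you should pass the whole data object to reject_outliers and filter rows rather than filtering the 'Price' column on its own: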
def reject_outliers(data):
    u = np.mean(data["Price"])
    s = np.std(data["Price"])
    data_filtered = data[(data["Price"] > (u - 2 * s)) & (data["Price"] < (u + 2 * s))]
    return data_filtered
You can use it like this:
data_filtered=reject_outliers(data)
features = data_filtered[['SqrMeters', 'Rooms']]
target = data_filtered['Price']
X=features
y=target
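To check that this keeps X and y aligned, here is a minimal sketch on a made-up DataFrame. The toy values and the extra `column` and `n_std` parameters are assumptions added for illustration; only the column names come from the question.

```python
import numpy as np
import pandas as pd

def reject_outliers(data, column="Price", n_std=2):
    """Keep only rows whose `column` value lies within n_std standard deviations of the mean."""
    u = np.mean(data[column])
    s = np.std(data[column])
    in_range = (data[column] > u - n_std * s) & (data[column] < u + n_std * s)
    return data[in_range]

# Hypothetical toy data: ten plausible prices and one suspicious value (1).
data = pd.DataFrame({
    "SqrMeters": [50, 60, 45, 55, 48, 70, 40, 65, 42, 52, 58],
    "Rooms":     [2, 3, 2, 2, 2, 3, 1, 3, 1, 2, 3],
    "Price":     [200000, 210000, 190000, 205000, 195000, 220000,
                  180000, 215000, 185000, 200000, 1],
})

data_filtered = reject_outliers(data)

# Both X and y are taken from the same filtered rows, so they stay aligned.
X = data_filtered[["SqrMeters", "Rooms"]]
y = data_filtered["Price"]

print(len(data), len(X), len(y))
```

Because whole rows are dropped, X and y share the same index and can be passed together to whatever regression routine you use.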
Posted on 2018-01-05 13:59:39
The following works when data is a pandas DataFrame:
def reject_outliers(data):
    u = np.mean(data.Price)
    s = np.std(data.Price)
    data_filtered = data[(data.Price > u - 2 * s) & (data.Price < u + 2 * s)]
    return data_filtered
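Another way to look at the same fix, as a rough sketch: the comparison yields a boolean mask with one entry per row, and applying that one mask to both the features and the target guarantees matching lengths. The toy values and the name `keep` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names follow the question.
data = pd.DataFrame({
    "SqrMeters": [50, 60, 45, 55, 48, 70, 40, 65, 42, 52, 58],
    "Rooms":     [2, 3, 2, 2, 2, 3, 1, 3, 1, 2, 3],
    "Price":     [200000, 210000, 190000, 205000, 195000, 220000,
                  180000, 215000, 185000, 200000, 1],
})

u = data["Price"].mean()
s = np.std(data["Price"].to_numpy())  # population standard deviation (ddof=0)

# True for rows whose price is within two standard deviations of the mean.
keep = (data["Price"] - u).abs() < 2 * s

# The same mask selects rows for both the features and the target.
X = data.loc[keep, ["SqrMeters", "Rooms"]]
y = data.loc[keep, "Price"]

print(int(keep.sum()), len(X), len(y))
```

The original list comprehension loses this row-level information, which is why the filtered prices could no longer be matched to the unfiltered features.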
https://stackoverflow.com/questions/48114054