面对缺失值三种处理方法:
对于dropna和fillna,dataframe和series都有,在这主要讲datafame的
使用DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
参数说明:
建议在使用时将全部的缺省参数都写上,便于快速理解 examples:
df = pd.DataFrame( {
"name": ['Alfred', 'Batman', 'Catwoman'], "toy": [np.nan, 'Batmobile', 'Bullwhip'], "born": [pd.NaT, pd.Timestamp("1940-04-25") pd.NaT]}) >>> df name toy born 0 Alfred NaN NaT 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT # Drop the rows where at least one element is missing. >>> df.dropna() name toy born 1 Batman Batmobile 1940-04-25 # Drop the columns where at least one element is missing. >>> df.dropna(axis='columns') name 0 Alfred 1 Batman 2 Catwoman # Drop the rows where all elements are missing. >>> df.dropna(how='all') name toy born 0 Alfred NaN NaT 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT # Keep only the rows with at least 2 non-NA values. >>> df.dropna(thresh=2) name toy born 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT # Define in which columns to look for missing values. >>> df.dropna(subset=['name', 'born']) name toy born 1 Batman Batmobile 1940-04-25 # Keep the DataFrame with valid entries in the same variable. >>> df.dropna(inplace=True) >>> df name toy born 1 Batman Batmobile 1940-04-25
可以使用dropna 或者drop函数
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
df = pd.DataFrame(np.arange(12).reshape(3,4), columns=['A', 'B', 'C', 'D']) >>>df A B C D 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11 # 删除列 >>> df.drop(['B', 'C'], axis=1) A D 0 0 3 1 4 7 2 8 11 >>> df.drop(columns=['B', 'C']) A D 0 0 3 1 4 7 2 8 11 # 删除行(索引) >>> df.drop([0, 1]) A B C D 2 8 9 10 11
使用DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
f = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1], [np.nan, np.nan, np.nan, 5], [np.nan, 3, np.nan, 4]], columns=list('ABCD')) >>> df A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 NaN 1 2 NaN NaN NaN 5 3 NaN 3.0 NaN 4 # 使用0代替所有的缺失值 >>> df.fillna(0) A B C D 0 0.0 2.0 0.0 0 1 3.0 4.0 0.0 1 2 0.0 0.0 0.0 5 3 0.0 3.0 0.0 4 # 使用后边或前边的值填充缺失值 >>> df.fillna(method='ffill') A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 NaN 1 2 3.0 4.0 NaN 5 3 3.0 3.0 NaN 4 >>>df.fillna(method='bfill') A B C D 0 3.0 2.0 NaN 0 1 3.0 4.0 NaN 1 2 NaN 3.0 NaN 5 3 NaN 3.0 NaN 4 # Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively. # 每一列使用不同的缺失值 >>> values = {
'A': 0, 'B': 1, 'C': 2, 'D': 3} >>> df.fillna(value=values) A B C D 0 0.0 2.0 2.0 0 1 3.0 4.0 2.0 1 2 0.0 1.0 2.0 5 3 0.0 3.0 2.0 4 #只替换第一个缺失值 >>>df.fillna(value=values, limit=1) A B C D 0 0.0 2.0 2.0 0 1 3.0 4.0 NaN 1 2 NaN 1.0 NaN 5 3 NaN 3.0 NaN 4
房价分析: 在此问题中,只有bedroom一列有缺失值,按照此三种方法处理代码为:
# option 1 将含有缺失值的行去掉 housing.dropna(subset=["total_bedrooms"]) # option 2 将"total_bedrooms"这一列从数据中去掉 housing.drop("total_bedrooms", axis=1) # option 3 使用"total_bedrooms"的中值填充缺失值 median = housing["total_bedrooms"].median() housing["total_bedrooms"].fillna(median)
版权声明:本文内容由互联网用户自发贡献,该文观点仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 举报,一经查实,本站将立刻删除。