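Learning column comparison: how do I create a new column based on two columns?
I can handle the two conditions, fruit or vegetable, but I can't get the third condition to work. :(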
df
basket1 basket2
0 fruit fruit
1 vegetable vegetable
2 vegetable both
3 fruit both
Result
newdf
basket1 basket2 total
0 fruit fruit fruit
1 vegetable vegetable vegetable
2 vegetable both Unknown
3 fruit both fruit
Thanks a lot for your help!
Posted on 2018-11-20 21:17:20
Update
Revisiting this, DataFrame.apply is slow AF. Let's look at some other options and then compare them.
Alternatives to DataFrame.apply
numpy.where
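This approach applies when there are only two possible outcomes. That is true in your case, because we return df.a when df.a == df.b or when df.a == 'fruit' and df.b == 'both', and 'Unknown' otherwise. The syntax is np.where(condition, value_if_true, value_if_false).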
In [42]: df['np_where'] = np.where(
...: ((df.a == df.b) | ((df.a == 'fruit') & (df.b == 'both'))),
...: df.a,
...: 'Unknown'
...: )
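For readers who want to run this outside an IPython session, here is a minimal self-contained sketch of the same np.where call. It assumes the question's original column names (basket1, basket2) rather than the a/b used in this answer, and the variable name keep is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "basket1": ["fruit", "vegetable", "vegetable", "fruit"],
    "basket2": ["fruit", "vegetable", "both", "both"],
})

# True where both baskets agree, or where basket1 is 'fruit' and basket2 is 'both'.
keep = (df["basket1"] == df["basket2"]) | (
    (df["basket1"] == "fruit") & (df["basket2"] == "both")
)

# Take basket1 where the mask is True, 'Unknown' everywhere else.
df["total"] = np.where(keep, df["basket1"], "Unknown")
print(df)
```

The parentheses around each comparison matter: & and | bind more tightly than ==, so leaving them out changes what gets compared.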
numpy.select
Use this option when you have multiple conditions. Its syntax is np.select(condlist, choicelist, default), where default is an optional argument (it is 0 if omitted).
In [43]: conditions = df.a == df.b, (df.a == 'fruit') & (df.b == 'both')
In [44]: choices = df['a'], df['a']
In [45]: df['np_select'] = np.select(conditions, choices, default='Unknown')
Note that I created two separate conditions for demonstration purposes, even though both produce the same value.
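np.select pays off once different conditions map to different values. The sketch below is illustrative only: the third rule and its 'mixed' label are hypothetical additions that are not part of the question. They are there to show that conditions are checked in order and the first matching one wins.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["fruit", "vegetable", "vegetable", "fruit"],
    "b": ["fruit", "vegetable", "both", "both"],
})

conditions = [
    df["a"] == df["b"],                          # rule 1: both columns agree
    (df["a"] == "fruit") & (df["b"] == "both"),  # rule 2: fruit paired with 'both'
    df["b"] == "both",                           # hypothetical rule 3: any other row with 'both'
]
# One choice per condition; a scalar such as 'mixed' is broadcast to every row.
choices = [df["a"], df["a"], "mixed"]

df["np_select"] = np.select(conditions, choices, default="Unknown")
print(df)
```

The last row satisfies both rule 2 and rule 3, but it receives the value from rule 2 because that condition comes first in the list.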
Comparing the alternatives
As you can see, all three methods produce the same result.
In [47]: df
Out[47]:
a b np_where np_select df_apply
0 fruit fruit fruit fruit fruit
1 vegetable vegetable vegetable vegetable vegetable
2 vegetable both Unknown Unknown Unknown
3 fruit both fruit fruit fruit
But how do they compare in speed? To check, let's create a new, larger DataFrame so we can see how each option handles a large amount of data.
In [48]: df_large = pd.DataFrame({
...: 'a': np.random.choice(['fruit', 'vegetable'], size=1_000_000),
...: 'b': np.random.choice(['fruit', 'vegetable', 'both'], size=1_000_000)
...: })
In [49]: %timeit df_large['np_where'] = np.where(((df_large.a == df_large.b) | ((df_large.a == 'fruit')
...: & (df_large.b == 'both'))), df_large.a, 'Unknown')
379 ms ± 64.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [50]: %timeit df_large['np_select'] = np.select(((df_large.a == df_large.b), ((df_large.a == 'fruit'
...: ) & (df_large.b == 'both'))), (df_large.a, df_large.a), default='Unknown')
580 ms ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [51]: %timeit df_large['df_apply'] = df_large.apply(total, axis=1)
40.5 s ± 6 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Wow! As you can see, DataFrame.apply is far slower than the other two options, and np.where edges out np.select.
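The timings above rely on IPython's %timeit magic. In a plain Python script, a comparable measurement can be sketched with the standard library's timeit module. This is a rough sketch under stated assumptions: the row count (20,000, far smaller than the 1,000,000 rows above so that the apply case finishes quickly), the single run per method, and the helper names with_where, with_select, and with_apply are all illustrative choices. Absolute numbers will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

n = 20_000  # assumed size, kept small so the apply case runs quickly
rng = np.random.default_rng(0)
df_large = pd.DataFrame({
    "a": rng.choice(["fruit", "vegetable"], size=n),
    "b": rng.choice(["fruit", "vegetable", "both"], size=n),
})

def total(r):
    # Row-wise version of the rule, same logic as the answer's total().
    if r["a"] == r["b"] or (r["a"] == "fruit" and r["b"] == "both"):
        return r["a"]
    return "Unknown"

def with_where(df):
    keep = (df["a"] == df["b"]) | ((df["a"] == "fruit") & (df["b"] == "both"))
    return np.where(keep, df["a"], "Unknown")

def with_select(df):
    conditions = [df["a"] == df["b"], (df["a"] == "fruit") & (df["b"] == "both")]
    return np.select(conditions, [df["a"], df["a"]], default="Unknown")

def with_apply(df):
    return df.apply(total, axis=1)

# Time one run of each approach and print the elapsed seconds.
for fn in (with_where, with_select, with_apply):
    seconds = timeit.timeit(lambda: fn(df_large), number=1)
    print(f"{fn.__name__}: {seconds:.3f} s")
```

A single run is noisy; raising number, or using timeit.repeat, gives steadier figures at the cost of a longer wait for the apply case.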
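Conclusion
Use np.where when you only have two choices.
Use np.select when you have multiple conditions.
Avoid DataFrame.apply (especially for large datasets)!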
Create your own function and use DataFrame.apply
In [104]: def total(r):
...: if r.a == r.b:
...: return r.a
...: elif (r.a == 'fruit') and (r.b == 'both'):
...: return r.a
...: return 'Unknown'
...:
In [105]: df = pd.DataFrame({'a': ['fruit', 'vegetable', 'vegetable', 'fruit'], 'b': ['fruit', 'vegetable', 'both', 'both']})
In [106]: df
Out[106]:
a b
0 fruit fruit
1 vegetable vegetable
2 vegetable both
3 fruit both
In [107]: df['total'] = df.apply(total, axis=1)
In [108]: df
Out[108]:
a b total
0 fruit fruit fruit
1 vegetable vegetable vegetable
2 vegetable both Unknown
3 fruit both fruit
Posted on 2018-11-20 22:22:17
df["total"] = df.apply(lambda x: x.a if (x.a == x.b) or ((x.a == 'fruit') and (x.b == 'both')) else 'Unknown', axis=1)
Output
a b total
0 fruit fruit fruit
1 vegetable vegetable vegetable
2 vegetable both Unknown
3 fruit both fruit
Posted on 2018-11-20 21:38:01
Here is a solution using np.select:
df['total'] = np.select([df['a']==df['b'], (df['a']=='fruit')&(df['b']=='both')], [df['a'], 'fruit'], 'Unknown')
Output:
a b total
0 fruit fruit fruit
1 vegetable vegetable vegetable
2 vegetable both Unknown
3 fruit both fruit
https://stackoverflow.com/questions/53405453