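I have a df that I want to filter based on groups. I want to group by the combination of cc, odd, tree1 and tree2, and keep a whole group if any of its day values is > 4; otherwise the group should be dropped.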
import pandas as pd
df = pd.DataFrame()
df['cc'] = ['BB', 'BB', 'BB', 'BB','BB', 'BB','BB', 'BB', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ']
df['odd'] = [3434, 3434, 3434, 3434, 3435, 3435, 3435, 3435, 3434, 3434, 3434, 3434, 3435, 3435, 3435, 3435, 3434, 3434, 3434, 3434, 3435, 3435, 3435, 3435]
df['tree1'] = ['ASP', 'ASP', 'ASP', 'ASP', 'SAP', 'SAP', 'SAP', 'SAP', 'ASP', 'ASP', 'ASP', 'ASP', 'SAP', 'SAP', 'SAP', 'SAP', 'ASP', 'ASP', 'ASP', 'ASP', 'SAP', 'SAP', 'SAP', 'SAP']
df['tree2'] = ['ATK', 'ATK','ATK','ATK','ATK','ATK','ATK','ATK', 'ATK', 'ATK','ATK','ATK','ATK','ATK','ATK','ATK', 'ATK', 'ATK','ATK','ATK','ATK','ATK','ATK','ATK']
df['day'] = [1, 2, 3, 4, 3, 4, 5, 6, 2, 3, 4, 5, 1, 3, 5, 7, 1, 2, 6, 8, 2, 4, 6, 8]
df
I tried this, but it removes any row where the day value is less than 4.
df_grouped = df.groupby(['cc', 'odd', 'tree1', 'tree2']).filter(df['day'] > 4)
I got this error: TypeError: 'Series' object is not callable
Then I tried this:
df_grouped = df.groupby(['cc', 'odd', 'tree1', 'tree2']).filter(lambda x: x['day'] > 4)
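I got this error: TypeError: filter function returned a Series, but expected a scalar bool.
I searched and tried to resolve these errors, but the suggested solutions did not work for me. I want to get a df like the one below: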
df1 = pd.DataFrame()
df1['cc'] = ['BB', 'BB','BB', 'BB', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'DD', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ', 'ZZ']
df1['odd'] = [3435, 3435, 3435, 3435, 3434, 3434, 3434, 3434, 3435, 3435, 3435, 3435, 3434, 3434, 3434, 3434, 3435, 3435, 3435, 3435]
df1['tree1'] = ['SAP', 'SAP', 'SAP', 'SAP', 'ASP', 'ASP', 'ASP', 'ASP', 'SAP', 'SAP', 'SAP', 'SAP', 'ASP', 'ASP', 'ASP', 'ASP', 'SAP', 'SAP', 'SAP', 'SAP']
df1['tree2'] = ['ATK','ATK','ATK','ATK', 'ATK', 'ATK','ATK','ATK','ATK','ATK','ATK','ATK', 'ATK', 'ATK','ATK','ATK','ATK','ATK','ATK','ATK']
df1['day'] = [3, 4, 5, 6, 2, 3, 4, 5, 1, 3, 5, 7, 1, 2, 6, 8, 2, 4, 6, 8]
df1
I also tried using the any logical function, but I could not get it to work: it only returns True or False, not the filtered DataFrame.
Answered 2018-06-05 14:53:39
Now that I understand what you want, let's try something like transform + any:
df[df.assign(key=df.day > 4)
.groupby(['cc', 'odd', 'tree1', 'tree2']).key.transform('any')
]
Or,
df[df.day.gt(4).groupby([df.cc, df.odd, df.tree1, df.tree2]).transform('any')]
cc odd tree1 tree2 day
4 BB 3435 SAP ATK 3
5 BB 3435 SAP ATK 4
6 BB 3435 SAP ATK 5
7 BB 3435 SAP ATK 6
8 DD 3434 ASP ATK 2
9 DD 3434 ASP ATK 3
10 DD 3434 ASP ATK 4
11 DD 3434 ASP ATK 5
12 DD 3435 SAP ATK 1
13 DD 3435 SAP ATK 3
14 DD 3435 SAP ATK 5
15 DD 3435 SAP ATK 7
16 ZZ 3434 ASP ATK 1
17 ZZ 3434 ASP ATK 2
18 ZZ 3434 ASP ATK 6
19 ZZ 3434 ASP ATK 8
20 ZZ 3435 SAP ATK 2
21 ZZ 3435 SAP ATK 4
22 ZZ 3435 SAP ATK 6
23 ZZ 3435 SAP ATK 8
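To make the transform + any idea easier to follow, here is a minimal sketch on a tiny made-up frame. The `toy` DataFrame and its `grp` column are hypothetical and not part of the question's data. It shows that `transform('any')` broadcasts each group's result back onto every row of that group, which is what makes the result usable as a boolean mask.

```python
import pandas as pd

# Hypothetical two-group frame, used only for illustration.
toy = pd.DataFrame({
    "grp": ["a", "a", "b", "b"],
    "day": [1, 5, 2, 3],
})

# Row-level condition: True where day > 4.
over_4 = toy["day"].gt(4)

# transform('any') computes any() per group and repeats the result for
# every row of that group, so `mask` shares the index of `toy`.
mask = over_4.groupby(toy["grp"]).transform("any")

# Group "a" has one day above 4, so both of its rows are kept;
# group "b" has none, so both of its rows are dropped.
kept = toy[mask]
print(mask.tolist())
print(kept)
```

Because the mask is aligned with the original index, ordinary boolean indexing is enough and no Python-level function runs per group.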
Answered 2018-06-05 14:53:38
You want:
In[116]:
df_grouped = df.groupby(['cc', 'odd', 'tree1', 'tree2']).filter(lambda x: (x['day'] > 4).any())
df_grouped
Out[116]:
cc odd tree1 tree2 day
4 BB 3435 SAP ATK 3
5 BB 3435 SAP ATK 4
6 BB 3435 SAP ATK 5
7 BB 3435 SAP ATK 6
8 DD 3434 ASP ATK 2
9 DD 3434 ASP ATK 3
10 DD 3434 ASP ATK 4
11 DD 3434 ASP ATK 5
12 DD 3435 SAP ATK 1
13 DD 3435 SAP ATK 3
14 DD 3435 SAP ATK 5
15 DD 3435 SAP ATK 7
16 ZZ 3434 ASP ATK 1
17 ZZ 3434 ASP ATK 2
18 ZZ 3434 ASP ATK 6
19 ZZ 3434 ASP ATK 8
20 ZZ 3435 SAP ATK 2
21 ZZ 3435 SAP ATK 4
22 ZZ 3435 SAP ATK 6
23 ZZ 3435 SAP ATK 8
So this filters out the groups in which no 'day' value is greater than 4.
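As a small illustration of why the callable has to end in `.any()` (or another reduction), the sketch below uses a hypothetical `toy` frame with a made-up `grp` column. `filter` calls the function once per group and expects a single True or False back, so a row-wise comparison must be reduced first. Swapping `.any()` for `.all()` changes the rule from "at least one day above 4" to "every day above 4".

```python
import pandas as pd

# Hypothetical three-group frame, used only for illustration.
toy = pd.DataFrame({
    "grp": ["a", "a", "b", "b", "c", "c"],
    "day": [1, 5, 2, 3, 6, 7],
})

grouped = toy.groupby("grp")

# The callable receives each group's sub-DataFrame and must return one bool.
# Keep a group if at least one of its day values is above 4.
any_over_4 = grouped.filter(lambda g: (g["day"] > 4).any())

# Keep a group only if all of its day values are above 4.
all_over_4 = grouped.filter(lambda g: (g["day"] > 4).all())

print(any_over_4)
print(all_over_4)
```

With `.any()`, groups "a" and "c" survive; with `.all()`, only group "c" does.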
Timings
%timeit df[df.day.gt(4).groupby([df.cc, df.odd, df.tree1, df.tree2]).transform('any')]
%timeit df.groupby(['cc', 'odd', 'tree1', 'tree2']).filter(lambda x: (x['day'] > 4).any())
%timeit df[df.assign(key=df.day > 4).groupby(['cc', 'odd', 'tree1', 'tree2']).key.transform('any')]
100 loops, best of 3: 5.9 ms per loop
100 loops, best of 3: 5.42 ms per loop
100 loops, best of 3: 3.62 ms per loop
So @coldspeed's first method (assign + transform, 3.62 ms) is the fastest here.
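The timings above are for the 24-row example, so the ranking may not carry over to larger data. A rough way to check on your own data is sketched below with the standard `timeit` module. The synthetic `big` frame, its size, and the helper names `with_transform` and `with_filter` are assumptions made for illustration; no particular result is claimed.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (an assumption for this sketch): many more rows and groups
# than the question's frame.
rng = np.random.default_rng(0)
n_rows = 5_000
big = pd.DataFrame({
    "cc": rng.choice(["BB", "DD", "ZZ"], size=n_rows),
    "odd": rng.integers(0, 100, size=n_rows),
    "day": rng.integers(1, 9, size=n_rows),  # values 1..8
})
keys = ["cc", "odd"]


def with_transform(frame):
    """Boolean mask built with groupby + transform('any')."""
    mask = frame["day"].gt(4).groupby([frame[k] for k in keys]).transform("any")
    return frame[mask]


def with_filter(frame):
    """groupby.filter with a callable that returns one bool per group."""
    return frame.groupby(keys).filter(lambda g: (g["day"] > 4).any())


# Sanity check: both approaches should select the same rows.
assert with_transform(big).index.equals(with_filter(big).index)

for fn in (with_transform, with_filter):
    seconds = timeit.timeit(lambda: fn(big), number=3)
    print(f"{fn.__name__}: {seconds / 3:.4f} s per call")
```

The `filter` version calls a Python function once per group, so its cost tends to grow with the number of groups; measuring on data shaped like yours is the safer guide.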
https://stackoverflow.com/questions/50702709