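I am trying to use Python's pandas dataframes to select entries from one dataframe based on conditions held in another dataframe.
The first dataframe gives the priority date range for each person: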
import pandas as pd
df_priority = pd.DataFrame({'Person': ['Alfred', 'Bob', 'Charles'], 'Start Date': ['2018-01-01', '2018-03-01', '2018-05-01'] , 'End Date': ['2018-02-01', '2018-04-01', '2018-06-01']})
df_priority.head()
Start Date End Date Person
0 2018-01-01 2018-02-01 Alfred
1 2018-03-01 2018-04-01 Bob
2 2018-05-01 2018-06-01 Charles
The second dataframe gives the sales for each person and each month:
df_sales = pd.DataFrame({'Person': ['Alfred', 'Alfred', 'Alfred','Bob','Bob','Bob','Bob','Bob','Bob','Charles','Charles','Charles','Charles','Charles','Charles'],'Date': ['2018-01-01', '2018-02-01', '2018-03-01', '2018-01-01', '2018-02-01', '2018-03-01','2018-04-01', '2018-05-01', '2018-06-01', '2018-01-01', '2018-02-01', '2018-03-01','2018-04-01', '2018-05-01', '2018-06-01'], 'Sales': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]})
df_sales.head(15)
Date Person Sales
0 2018-01-01 Alfred 1
1 2018-02-01 Alfred 2
2 2018-03-01 Alfred 3
3 2018-01-01 Bob 4
4 2018-02-01 Bob 5
5 2018-03-01 Bob 6
6 2018-04-01 Bob 7
7 2018-05-01 Bob 8
8 2018-06-01 Bob 9
9 2018-01-01 Charles 10
10 2018-02-01 Charles 11
11 2018-03-01 Charles 12
12 2018-04-01 Charles 13
13 2018-05-01 Charles 14
14 2018-06-01 Charles 15
Now I want the sales figures for each person that fall within that person's priority date range, i.e. the result should be:
Date Person Sales
0 2018-01-01 Alfred 1
1 2018-02-01 Alfred 2
5 2018-03-01 Bob 6
6 2018-04-01 Bob 7
13 2018-05-01 Charles 14
14 2018-06-01 Charles 15
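Any help would be appreciated.
Answered on 2020-09-04 20:53:24
You can apply a lambda row by row (axis=1), reading several columns at once, to get the desired result: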
# custom function that gives the prioritized date range for each person by person name
def salesByNameAndDate(name):
    start_date = df_priority[df_priority['Person'] == name]['Start Date'].values[0]
    end_date = df_priority[df_priority['Person'] == name]['End Date'].values[0]
    date_range = pd.date_range(start=start_date, end=end_date)
    return date_range
# return sales value if the date is inside the date range for this person or "nothing" if the date is outside this range
df_sales['new_sales'] = df_sales.apply(lambda x: x['Sales'] if x['Date'] in salesByNameAndDate(x['Person']) else 'nothing',axis=1)
# after that you drop all "nothing" and duplicate column "new_sales"
new_df = df_sales[df_sales['new_sales'] != 'nothing'].drop('new_sales', axis=1)[['Date', 'Person', 'Sales']]
# output
Date Person Sales
0 2018-01-01 Alfred 1
1 2018-02-01 Alfred 2
5 2018-03-01 Bob 6
6 2018-04-01 Bob 7
13 2018-05-01 Charles 14
14 2018-06-01 Charles 15
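The row-wise apply above rebuilds a date range for every row. As a rough alternative sketch (not the answerer's code), the same bounds can be looked up once per row with Series.map and compared in a single vectorized step. The names bounds, start, end and result are illustrative, and the comparison relies on the assumption that the dates are ISO-formatted strings (YYYY-MM-DD), which sort correctly as text:

```python
import pandas as pd

df_priority = pd.DataFrame({
    'Person': ['Alfred', 'Bob', 'Charles'],
    'Start Date': ['2018-01-01', '2018-03-01', '2018-05-01'],
    'End Date': ['2018-02-01', '2018-04-01', '2018-06-01'],
})
months = ['2018-01-01', '2018-02-01', '2018-03-01',
          '2018-04-01', '2018-05-01', '2018-06-01']
df_sales = pd.DataFrame({
    'Person': ['Alfred'] * 3 + ['Bob'] * 6 + ['Charles'] * 6,
    'Date': months[:3] + months + months,
    'Sales': list(range(1, 16)),
})

# Index the priority table by person so it can act as a lookup table.
bounds = df_priority.set_index('Person')

# For every sales row, fetch that person's start and end date.
start = df_sales['Person'].map(bounds['Start Date'])
end = df_sales['Person'].map(bounds['End Date'])

# Keep rows whose date lies inside the range (both bounds included).
# Assumption: ISO date strings, so text comparison matches date order.
result = df_sales[df_sales['Date'].between(start, end)]
print(result)
```

This assumes each person appears exactly once in df_priority; with duplicate names the lookup would be ambiguous.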
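Answered on 2020-09-04 21:09:00
What I would do is join the dataframes on a unique identifier (I am hoping the name is that unique identifier), filter the new dataframe on the start and end dates, and then group: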
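df = df_sales.join(df_priority.set_index('Person'), on='Person', how='left')
# keep rows whose Date lies inside [Start Date, End Date], bounds included
df = df[(df['Start Date'] <= df['Date']) & (df['Date'] <= df['End Date'])]
df.groupby(['Person', 'Date'])['Sales'].sum()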
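A left join is already the default for the join method, but stating it explicitly makes the intent clearer. You must make sure the date columns have a datetime dtype; pandas.to_datetime() takes care of that.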
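To make the datetime advice concrete, here is a minimal end-to-end sketch under the question's sample data. It swaps join for merge (which matches on the shared Person column directly) and the variable names merged, in_range, result and totals are illustrative, not part of the original answer:

```python
import pandas as pd

df_priority = pd.DataFrame({
    'Person': ['Alfred', 'Bob', 'Charles'],
    'Start Date': ['2018-01-01', '2018-03-01', '2018-05-01'],
    'End Date': ['2018-02-01', '2018-04-01', '2018-06-01'],
})
months = ['2018-01-01', '2018-02-01', '2018-03-01',
          '2018-04-01', '2018-05-01', '2018-06-01']
df_sales = pd.DataFrame({
    'Person': ['Alfred'] * 3 + ['Bob'] * 6 + ['Charles'] * 6,
    'Date': months[:3] + months + months,
    'Sales': list(range(1, 16)),
})

# Convert every date column from string to datetime64 before comparing.
for col in ('Start Date', 'End Date'):
    df_priority[col] = pd.to_datetime(df_priority[col])
df_sales['Date'] = pd.to_datetime(df_sales['Date'])

# Attach each person's priority range to every one of their sales rows.
merged = df_sales.merge(df_priority, on='Person', how='left')

# Keep only rows inside the range; <= and >= keep the boundary dates.
in_range = (merged['Date'] >= merged['Start Date']) & (merged['Date'] <= merged['End Date'])
result = merged.loc[in_range, ['Date', 'Person', 'Sales']]

# Optional aggregation: total sales per person within the priority range.
totals = result.groupby('Person')['Sales'].sum()
print(result)
print(totals)
```

Converting to datetime first means the comparison no longer depends on the strings happening to be in a sortable format.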
https://stackoverflow.com/questions/63740667