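I am looking for an efficient way to iterate over a dataframe and, for each row, do something that depends on values in past or future rows.
I iterate row by row (with a for loop) over a datetime-indexed dataframe that can have more than 200,000 rows. Depending on the value in one of two columns (Bi and Icats), I set a value in a third column (To_set). The code executed for each row includes a condition that uses the current index and a timedelta to look up a value in previous rows (in column Bi).
At the moment, looping over the dataframe takes very long, and I wonder whether a faster or more elegant approach is available.
The dataframe the code loops over has three columns (Bi, Icats, To_set). Below is a snippet of df.
Note: my code has already iterated over df and set the values in column 'To_set'. The rows without a value show None because I initially initialized the column with None rather than pd.np.nan.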
Bi Icats To_set
2014-11-28 10:17:00 NaN NaN None
2014-11-28 10:30:00 NaN 0.040220 0.04022
2014-11-28 10:32:00 NaN NaN None
2014-11-28 10:35:00 0.217 NaN 0.217
2014-11-28 10:38:00 0.365 NaN 0.365
2014-11-28 10:44:00 0.227 NaN 0.227
2014-11-28 10:45:00 NaN 0.040220 None
2014-11-28 10:47:00 0.149 NaN 0.149
2014-11-28 10:50:00 0.109 NaN 0.109
2014-11-28 10:56:00 NaN NaN None
2014-11-28 10:59:00 0.065 NaN 0.065
2014-11-28 11:00:00 NaN 0.063687 None
2014-11-28 11:14:00 NaN NaN None
2014-11-28 11:15:00 NaN 0.047007 0.0470067
2014-11-28 11:30:00 NaN 0.041165 0.041165
2014-11-28 11:35:00 NaN NaN None
2014-11-28 11:45:00 NaN 0.040600 0.0406
2014-11-28 12:00:00 NaN 0.039667 0.0396667
2014-11-28 12:15:00 NaN 0.039460 0.03946
2014-11-28 12:30:00 NaN 0.038955 0.038955
The code that currently performs the loop looks like this:
Note: the column index of 'Bi' is 3, of 'Icats' is 4, and of 'To_set' is 5.
import datetime

import numpy as np
import pandas as pd

# To_set is the column at position 5
df['To_set'] = np.nan
for i in range(len(df)):
    if pd.notnull(df.iloc[i, 3]):
        # if there is a value in Bi, always take it
        df.iloc[i, 5] = df.iloc[i, 3]
        continue
    if pd.notnull(df.iloc[i, 4]):
        # take the Icats value only if there was no Bi value in the
        # past 10 mins
        # --> find the index of the last Bi value with last_valid_index() (a),
        # and if timedelta between (a) and i > 10 mins, take Icats value
        # the try statement is needed because, until the code hits the first
        # real value in Bi, last_valid_index() returns None and the
        # comparison below raises a TypeError
        try:
            if df.iloc[:i, 3].last_valid_index() < (df.index[i].to_pydatetime() -
                                                     datetime.timedelta(minutes=10)):
                # last Bi value is older than 10 minutes.
                # Take the current Icats value
                df.iloc[i, 5] = df.iloc[i, 4]
        except TypeError:
            df.iloc[i, 5] = df.iloc[i, 4]
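Is there a better or more elegant way to iterate row by row over a dataframe when you need to access values in previous or future rows? I know there are things like df.itertuples(), but I don't think it lets me look at previous rows.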
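For conditions that only need a fixed offset (the row directly before or after the current one), `Series.shift` is one option that avoids an explicit loop. The following is a minimal sketch on a made-up frame; the column name `Bi` is borrowed from the question, while the values and the derived column names (`Bi_prev`, `Bi_next`, `rose`) are purely illustrative.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2014-11-28 10:35', '2014-11-28 10:38',
                      '2014-11-28 10:44', '2014-11-28 10:45'])
demo = pd.DataFrame({'Bi': [0.217, 0.365, np.nan, 0.149]}, index=idx)

# shift(1) aligns each row with the value from the previous row,
# shift(-1) aligns it with the value from the next row
demo['Bi_prev'] = demo['Bi'].shift(1)
demo['Bi_next'] = demo['Bi'].shift(-1)

# a condition on the previous row, evaluated for all rows at once
# (comparisons against NaN evaluate to False)
demo['rose'] = demo['Bi'] > demo['Bi_prev']
```

Note that `shift` works by position, so on its own it cannot express "the last value within 10 minutes"; that kind of condition also needs the timestamps.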
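Edit:
I rewrote the code so that it no longer has to look at previous rows; instead, it keeps all the information it needs from previous rows in variables. Unsurprisingly, this runs much faster. Written this way, I can also use df.itertuples() to speed the code up further. However, my original question remains: is there an elegant way to iterate over a dataframe and use values from previous rows in conditional statements?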
firstBiValueMet = False
for i in range(len(df)):
    if pd.notnull(df.iloc[i, 3]):
        # save time for future calculations
        firstBiValueMet = True
        lastTime = df.index[i].to_pydatetime()
        # if there is a value in Bi, always take it
        df.iloc[i, 5] = df.iloc[i, 3]
        continue
    if pd.notnull(df.iloc[i, 4]) and not firstBiValueMet:
        # no Bi value has been seen yet, so take the Icats value anyway
        df.iloc[i, 5] = df.iloc[i, 4]
    if (pd.notnull(df.iloc[i, 4]) and firstBiValueMet
            and df.index[i] - lastTime > datetime.timedelta(minutes=10)):
        # take the Icats value only if there was no Bi value in the
        # past 10 mins
        df.iloc[i, 5] = df.iloc[i, 4]
    if i % 15000 == 0:
        # progress indicator
        print(i)
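The same state-carrying idea could be combined with df.itertuples(), collecting the results in a list and assigning the column once instead of writing through iloc on every row. This is only a sketch under assumptions: the columns are addressed by the names `Bi` and `Icats`, the index is a sorted DatetimeIndex, and the helper name `fill_to_set` and the small demo frame are made up for illustration.

```python
import datetime

import numpy as np
import pandas as pd


def fill_to_set(df, window=datetime.timedelta(minutes=10)):
    """Return the To_set values as a list (hypothetical helper)."""
    last_bi_time = None  # timestamp of the most recent Bi value seen so far
    result = []
    for row in df[['Bi', 'Icats']].itertuples():
        now = row.Index
        if pd.notnull(row.Bi):
            # a Bi value always wins; remember when it occurred
            last_bi_time = now
            result.append(row.Bi)
        elif pd.notnull(row.Icats) and (last_bi_time is None
                                        or now - last_bi_time > window):
            # no Bi value yet, or the last one is older than the window
            result.append(row.Icats)
        else:
            result.append(np.nan)
    return result


idx = pd.to_datetime(['2014-11-28 10:30', '2014-11-28 10:44',
                      '2014-11-28 10:45', '2014-11-28 11:00'])
demo = pd.DataFrame({'Bi': [np.nan, 0.227, np.nan, np.nan],
                     'Icats': [0.04022, np.nan, 0.04022, 0.063687]},
                    index=idx)
demo['To_set'] = fill_to_set(demo)
```

In this demo the Icats value at 10:45 is skipped because the Bi value at 10:44 is only one minute old, while the one at 11:00 is taken because the last Bi value is 16 minutes old.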
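Answered 2017-09-02 06:23:35
How important is it to you to show None in the To_set column?
This problem is hard to move away from a for loop, because deciding what to set To_set to depends on time-dependent conditions in previous rows.
Here is an "outside the box" approach that does not rely on a for loop. It also has no notion of None as a value of To_set; it simply keeps a running record of the current To_set value.
Recreating the DataFrame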
import pandas as pd
import numpy as np
timestamps = [pd.Timestamp('2014-11-28 10:17:00'), pd.Timestamp('2014-11-28 10:30:00'), pd.Timestamp('2014-11-28 10:32:00'), pd.Timestamp('2014-11-28 10:35:00'), pd.Timestamp('2014-11-28 10:38:00'), pd.Timestamp('2014-11-28 10:44:00'), pd.Timestamp('2014-11-28 10:45:00'), pd.Timestamp('2014-11-28 10:47:00'), pd.Timestamp('2014-11-28 10:50:00'), pd.Timestamp('2014-11-28 10:56:00'), pd.Timestamp('2014-11-28 10:59:00'), pd.Timestamp('2014-11-28 11:00:00'), pd.Timestamp('2014-11-28 11:14:00'), pd.Timestamp('2014-11-28 11:15:00'), pd.Timestamp('2014-11-28 11:30:00'), pd.Timestamp('2014-11-28 11:35:00'), pd.Timestamp('2014-11-28 11:45:00'), pd.Timestamp('2014-11-28 12:00:00'), pd.Timestamp('2014-11-28 12:15:00'), pd.Timestamp('2014-11-28 12:30:00')]
data = {'Bi': [np.nan, np.nan, np.nan, 0.217, 0.365, 0.22699999999999998, np.nan, 0.149, 0.109,
np.nan, 0.065, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'Dummy1': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'Dummy2': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'Dummy3': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'Icats': [np.nan, 0.04022, np.nan, np.nan, np.nan, np.nan, 0.04022, np.nan, np.nan, np.nan, np.nan,
0.063687, np.nan, 0.047007, 0.041165, np.nan, 0.0406, 0.039667, 0.03946,
0.038955000000000004],
'To_set': ['None', 0.040219999999999999, 'None', '0.217', '0.365', '0.227',
'None', '0.149', '0.109', 'None', '0.065', 'None', 'None',
'0.0470067', '0.041165', 'None', '0.0406', '0.0396667', '0.03946',
'0.038955']}
columns = ['Dummy1', 'Dummy2', 'Dummy3', 'Bi', 'Icats', 'To_set']
original_df = pd.DataFrame(data, index=timestamps, columns=columns)
original_df
It looks like this:
Dummy1 Dummy2 Dummy3 Bi Icats To_set
2014-11-28 10:17:00 NaN NaN NaN NaN NaN None
2014-11-28 10:30:00 NaN NaN NaN NaN 0.040220 0.04022
2014-11-28 10:32:00 NaN NaN NaN NaN NaN None
2014-11-28 10:35:00 NaN NaN NaN 0.217 NaN 0.217
2014-11-28 10:38:00 NaN NaN NaN 0.365 NaN 0.365
2014-11-28 10:44:00 NaN NaN NaN 0.227 NaN 0.227
2014-11-28 10:45:00 NaN NaN NaN NaN 0.040220 None
2014-11-28 10:47:00 NaN NaN NaN 0.149 NaN 0.149
2014-11-28 10:50:00 NaN NaN NaN 0.109 NaN 0.109
2014-11-28 10:56:00 NaN NaN NaN NaN NaN None
2014-11-28 10:59:00 NaN NaN NaN 0.065 NaN 0.065
2014-11-28 11:00:00 NaN NaN NaN NaN 0.063687 None
2014-11-28 11:14:00 NaN NaN NaN NaN NaN None
2014-11-28 11:15:00 NaN NaN NaN NaN 0.047007 0.0470067
2014-11-28 11:30:00 NaN NaN NaN NaN 0.041165 0.041165
2014-11-28 11:35:00 NaN NaN NaN NaN NaN None
2014-11-28 11:45:00 NaN NaN NaN NaN 0.040600 0.0406
2014-11-28 12:00:00 NaN NaN NaN NaN 0.039667 0.0396667
2014-11-28 12:15:00 NaN NaN NaN NaN 0.039460 0.03946
2014-11-28 12:30:00 NaN NaN NaN NaN 0.038955 0.038955
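Here is the code for the next part; the explanation follows:
df = original_df.copy()
# drop the existing To_set column; a new one is computed below
df.drop('To_set', axis=1, inplace=True)
# build a 1-minute index spanning the whole period and reindex onto it
new_index = pd.date_range(start=df.index.min(), end=df.index.max(), freq='1min')
df = df.reindex(new_index)
# carry each Bi value forward for at most 10 rows (10 minutes)
df['Bi'] = df['Bi'].ffill(limit=10)
# prefer Bi, fall back to Icats, then carry the chosen value forward
df['To_set_NEW'] = df['Bi'].combine_first(df['Icats']).ffill()
# keep only the timestamps of the original frame for comparison
compare_df = df.loc[original_df.index]
df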
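Drop the To_set column from df, and use the new index to fill in the missing time periods at 1-minute frequency. This approach :) could be bad if your df covers a long period of time, since it creates a row for every minute of every day. If there are no memory errors, continue...
Forward fill Bi, but limit it to at most 10 fills.
combine_first picks either Bi or Icats. This works because if the 10-minute forward fill of Bi does not reach a row and Icats has a value there, the Icats value is selected.
The trailing ffill() then carries the last selected value forward, which gives the running record mentioned earlier.
Select the original timestamps into compare_df and compare it to original_df to assess whether it does what you want. You can compare the outputs with the following: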
output = pd.DataFrame({'To_set': original_df['To_set'], 'To_set_NEW': compare_df['To_set_NEW']})
The output looks like this:
To_set To_set_NEW
2014-11-28 10:17:00 None NaN
2014-11-28 10:30:00 0.04022 0.040220
2014-11-28 10:32:00 None 0.040220
2014-11-28 10:35:00 0.217 0.217000
2014-11-28 10:38:00 0.365 0.365000
2014-11-28 10:44:00 0.227 0.227000
2014-11-28 10:45:00 None 0.227000
2014-11-28 10:47:00 0.149 0.149000
2014-11-28 10:50:00 0.109 0.109000
2014-11-28 10:56:00 None 0.109000
2014-11-28 10:59:00 0.065 0.065000
2014-11-28 11:00:00 None 0.065000
2014-11-28 11:14:00 None 0.065000
2014-11-28 11:15:00 0.0470067 0.047007
2014-11-28 11:30:00 0.041165 0.041165
2014-11-28 11:35:00 None 0.041165
2014-11-28 11:45:00 0.0406 0.040600
2014-11-28 12:00:00 0.0396667 0.039667
2014-11-28 12:15:00 0.03946 0.039460
2014-11-28 12:30:00 0.038955 0.038955
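To see how the limited forward fill and combine_first interact, here is a small sketch on a made-up 1-minute series. It uses limit=2 instead of 10 to keep the example short; the variable names (`bi`, `icats`, `bi_filled`, `combined`, `running`) and the values are illustrative assumptions, not data from the question.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2014-11-28 10:00', periods=6, freq='1min')
bi = pd.Series([0.2, np.nan, np.nan, np.nan, np.nan, np.nan], index=idx)
icats = pd.Series([np.nan, 0.04, np.nan, 0.05, np.nan, np.nan], index=idx)

# carry the Bi value forward for at most 2 rows (2 minutes here)
bi_filled = bi.ffill(limit=2)

# prefer Bi where it is present (real or carried), otherwise fall back to Icats
combined = bi_filled.combine_first(icats)

# carry the last chosen value forward so every row has a running value
running = combined.ffill()
```

The Icats value at 10:01 is suppressed because the carried Bi value still covers that row, while the one at 10:03 is picked up because the fill limit has been exhausted by then.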
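Is all of this best practice?
Maybe not, but it is a different way of looking at it. np.where(cond, value if true, value otherwise) could also come in handy here. The sticking point is the rolling 10-minute window, which depends on the timestamp of the current row. Maybe someone else has a better idea!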
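One possible way to handle the time window without reindexing to 1-minute rows might be to forward-fill the timestamp of the most recent Bi value and compare it with each row's own timestamp, then pick the value with np.where. This is a sketch rather than the method used above; it assumes a sorted DatetimeIndex and columns named `Bi` and `Icats`, and the demo frame and intermediate names (`stamps`, `last_bi_time`, `age`, `icats_allowed`) are made up.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2014-11-28 10:30', '2014-11-28 10:44',
                      '2014-11-28 10:45', '2014-11-28 11:00'])
demo = pd.DataFrame({'Bi': [np.nan, 0.227, np.nan, np.nan],
                     'Icats': [0.04022, np.nan, 0.04022, 0.063687]},
                    index=idx)

# each row's own timestamp as a Series
stamps = pd.Series(demo.index, index=demo.index)

# keep the timestamp only where Bi has a value, then carry it forward:
# this gives the time of the most recent Bi value (NaT before the first one)
last_bi_time = stamps.where(demo['Bi'].notna()).ffill()

# time elapsed since the most recent Bi value
age = stamps - last_bi_time

# Icats may be used if no Bi value was seen yet or the last one is too old
icats_allowed = last_bi_time.isna() | (age > pd.Timedelta(minutes=10))

demo['To_set'] = np.where(demo['Bi'].notna(), demo['Bi'],
                          np.where(icats_allowed, demo['Icats'], np.nan))
```

Unlike the forward-fill approach, this leaves NaN in rows where neither rule applies, which is closer to the None entries in the original To_set column.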
https://stackoverflow.com/questions/45993156