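I have data like this, but sometimes the mileage is wrong. Mileage should only increase, but some readings are wrong: either too low or too high. Is it possible to clean this data in R? Do you have any ideas? To fix such an error I could use the average of the records below and above, but how do I catch the readings that are out of order?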
CarID FuelTransactionDate Mileage
AAA555 05.01.2019 5060
AAA555 30.01.2019 7800
AAA555 14.02.2019 9100
AAA555 24.02.2019 9900
AAA555 07.04.2019 101110 <- mistake
AAA555 12.04.2019 12500
AAA555 15.05.2019 13000
AAA555 09.06.2019 13422
BBB788 15.05.2018 15000
BBB788 04.06.2018 15200
BBB788 19.06.2018 16150
BBB788 16.07.2018 100 <- mistake
BBB788 27.08.2018 17500
BBB788 10.09.2018 17999
BBB788 13.10.2018 18200
BBB788 02.11.2018 18555

Posted on 2020-10-21 13:36:36
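Here is one approach that shows how to identify outliers and then fill them in using approx. First, I look for cases where the mileage decreases; you can add any additional conditions you want to check inside the if_else to identify outliers: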
library(dplyr)

# dd is the data frame from the question
dd %>%
  group_by(CarID) %>%
  dplyr::mutate(
    # replace mistakes with NA
    MileageNA = if_else(Mileage < lag(Mileage, 1, default = 0), NA_integer_, Mileage),
    # fill in missing values with approx
    # approx is nicely robust in case you have multiple mistakes in a row
    # See the help page and the rule argument to control behavior
    # in case you have mistakes as the first or last observations
    MileageCorrected = approx(MileageNA, xout = 1:n())$y
  )
# # A tibble: 16 x 5
# # Groups: CarID [2]
# CarID FuelTransactionDate Mileage MileageNA MileageCorrected
# <chr> <chr> <int> <int> <dbl>
# 1 AAA555 05.01.2019 5060 5060 5060
# 2 AAA555 30.01.2019 7800 7800 7800
# 3 AAA555 14.02.2019 9100 9100 9100
# 4 AAA555 24.02.2019 9900 9900 9900
# 5 AAA555 07.04.2019 101110 101110 101110
# 6 AAA555 12.04.2019 12500 NA 57055
# 7 AAA555 15.05.2019 13000 13000 13000
# 8 AAA555 09.06.2019 13422 13422 13422
# 9 BBB788 15.05.2018 15000 15000 15000
# 10 BBB788 04.06.2018 15200 15200 15200
# 11 BBB788 19.06.2018 16150 16150 16150
# 12 BBB788 16.07.2018 100 NA 16825
# 13 BBB788 27.08.2018 17500 17500 17500
# 14 BBB788 10.09.2018 17999 17999 17999
# 15 BBB788 13.10.2018 18200 18200 18200
# 16 BBB788 02.11.2018 18555 18555 18555

Posted on 2020-10-21 13:36:02
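A note on the dplyr answer above: in its output the too-high reading 101110 is kept and the valid reading after it (12500) is the one replaced, because only a drop relative to the previous row is flagged. Below is a minimal sketch of a neighbour-based check that might handle both kinds of error, assuming each error is an isolated single reading. `flag_isolated` is a hypothetical helper, not part of dplyr or base R, and base R is used so the block runs without extra packages.

```r
# Example data copied from the question
df <- data.frame(
  CarID = rep(c("AAA555", "BBB788"), each = 8),
  Mileage = c(5060L, 7800L, 9100L, 9900L, 101110L, 12500L, 13000L, 13422L,
              15000L, 15200L, 16150L, 100L, 17500L, 17999L, 18200L, 18555L)
)

# Hypothetical helper: TRUE where a reading lies outside the range spanned by
# its two neighbours while the neighbours themselves are in increasing order.
flag_isolated <- function(x) {
  n <- length(x)
  prev <- c(NA, x[-n])   # previous reading (NA for the first row)
  nxt  <- c(x[-1], NA)   # next reading (NA for the last row)
  bad <- prev < nxt & (x < prev | x > nxt)
  bad[is.na(bad)] <- FALSE  # first and last readings cannot be checked this way
  bad
}

# ave() applies the helper per car; the logical result comes back as 0/1
df$bad <- ave(df$Mileage, df$CarID, FUN = flag_isolated) == 1

# Blank out flagged readings, then interpolate by row position within each car
df$MileageNA <- ifelse(df$bad, NA, df$Mileage)
df$MileageCorrected <- ave(
  df$MileageNA, df$CarID,
  FUN = function(y) approx(seq_along(y), y, xout = seq_along(y))$y
)

df[df$bad, ]
```

With the sample data this flags rows 5 and 12, and interpolating by row position gives 11200 and 16825 for them. Two wrong readings in a row would not be caught by this rule, so it is only a starting point.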
If you want to identify where the errors occur, one option is to use ave + cummax + cummin in base R
within(
  df,
  err <- ave(
    Mileage,
    CarID,
    FUN = function(x) replace(cummax(x) == rev(cummax(rev(x))), length(x), 0) + replace(cummin(x) == rev(cummin(rev(x))), 1, 0)
  )
)

which gives
CarID FuelTransactionDate Mileage err
1 AAA555 05.01.2019 5060 0
2 AAA555 30.01.2019 7800 0
3 AAA555 14.02.2019 9100 0
4 AAA555 24.02.2019 9900 0
5 AAA555 07.04.2019 101110 1
6 AAA555 12.04.2019 12500 0
7 AAA555 15.05.2019 13000 0
8 AAA555 09.06.2019 13422 0
9 BBB788 15.05.2018 15000 0
10 BBB788 04.06.2018 15200 0
11 BBB788 19.06.2018 16150 0
12 BBB788 16.07.2018 100 1
13 BBB788 27.08.2018 17500 0
14 BBB788 10.09.2018 17999 0
15 BBB788 13.10.2018 18200 0
16 BBB788 02.11.2018 18555 0

Posted on 2020-10-21 13:43:17
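Once the `err` flags from the base R answer above are available, the flagged readings could be replaced by interpolating between the valid neighbours on the date axis rather than by row position. The following is a sketch under assumptions: the dates are parsed with the format `%d.%m.%Y`, `flag_err` is just a hypothetical name wrapping the expression from that answer, and `rule = 2` (reuse the nearest valid reading at the ends) is one possible choice, not something the original answer prescribes.

```r
# Example data copied from the question
df <- data.frame(
  CarID = rep(c("AAA555", "BBB788"), each = 8),
  FuelTransactionDate = c(
    "05.01.2019", "30.01.2019", "14.02.2019", "24.02.2019",
    "07.04.2019", "12.04.2019", "15.05.2019", "09.06.2019",
    "15.05.2018", "04.06.2018", "19.06.2018", "16.07.2018",
    "27.08.2018", "10.09.2018", "13.10.2018", "02.11.2018"
  ),
  Mileage = c(5060L, 7800L, 9100L, 9900L, 101110L, 12500L, 13000L, 13422L,
              15000L, 15200L, 16150L, 100L, 17500L, 17999L, 18200L, 18555L)
)

# Same flagging expression as in the answer, wrapped in a named function
flag_err <- function(x) {
  replace(cummax(x) == rev(cummax(rev(x))), length(x), 0) +
    replace(cummin(x) == rev(cummin(rev(x))), 1, 0)
}
df$err <- ave(df$Mileage, df$CarID, FUN = flag_err)

# Parse the day.month.year strings into Date objects
df$Date <- as.Date(df$FuelTransactionDate, format = "%d.%m.%Y")

# Interpolate over time, one car at a time, using only unflagged readings
df$MileageCorrected <- NA_real_
for (id in unique(df$CarID)) {
  rows <- df$CarID == id
  ok <- rows & df$err == 0
  df$MileageCorrected[rows] <- approx(
    x = as.numeric(df$Date[ok]),
    y = df$Mileage[ok],
    xout = as.numeric(df$Date[rows]),
    rule = 2  # assumption: at the ends, reuse the nearest valid reading
  )$y
}

df[df$err > 0, c("CarID", "FuelTransactionDate", "Mileage", "MileageCorrected")]
```

Interpolating on dates weights the estimate by the time elapsed between fill-ups, which may matter when the gaps are uneven, as they are around 07.04.2019 in the sample data.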
I am just putting my comment into an answer so the output displays better. Here is the code:
library(dplyr)
library(ggplot2)

# minimum and maximum mileage per car
df %>%
  group_by(CarID) %>%
  summarise(min = min(Mileage),
            max = max(Mileage))

# ratio to the previous reading: if < 1 then the previous value was higher.
df %>% group_by(CarID) %>% mutate(rate = Mileage / lag(Mileage, n = 1, default = NA))

# difference to the previous reading: if < 0 then the previous value was higher.
df %>% group_by(CarID) %>% mutate(rate = Mileage - lag(Mileage, n = 1, default = NA))

# boxplot of mileage per car to spot extreme values
ggplot(data = df, aes(x = CarID, y = Mileage)) +
  geom_boxplot()

Some output that may be useful:

Removing cases with dplyr where the reading is not higher than the previous one. Warning: you may want to remove the outliers first!
> df %>%
+ group_by(CarID) %>%
+ mutate(rate = Mileage - lag(Mileage, n = 1, default = NA)) %>%
+ filter(rate > 0)
# A tibble: 12 x 4
# Groups: CarID [2]
CarID FuelTransactionDate Mileage rate
<chr> <chr> <int> <int>
1 AAA555 30.01.2019 7800 2740
2 AAA555 14.02.2019 9100 1300
3 AAA555 24.02.2019 9900 800
4 AAA555 07.04.2019 101110 91210
5 AAA555 15.05.2019 13000 500
6 AAA555 09.06.2019 13422 422
7 BBB788 04.06.2018 15200 200
8 BBB788 19.06.2018 16150 950
9 BBB788 27.08.2018 17500 17400
10 BBB788 10.09.2018 17999 499
11 BBB788 13.10.2018 18200 201
12 BBB788 02.11.2018 18555 355

Data:
df <- structure(list(CarID = c("AAA555", "AAA555", "AAA555", "AAA555",
"AAA555", "AAA555", "AAA555", "AAA555", "BBB788", "BBB788", "BBB788",
"BBB788", "BBB788", "BBB788", "BBB788", "BBB788"), FuelTransactionDate = c("05.01.2019",
"30.01.2019", "14.02.2019", "24.02.2019", "07.04.2019", "12.04.2019",
"15.05.2019", "09.06.2019", "15.05.2018", "04.06.2018", "19.06.2018",
"16.07.2018", "27.08.2018", "10.09.2018", "13.10.2018", "02.11.2018"
), Mileage = c(5060L, 7800L, 9100L, 9900L, 101110L, 12500L, 13000L,
13422L, 15000L, 15200L, 16150L, 100L, 17500L, 17999L, 18200L,
18555L)), class = "data.frame", row.names = c(NA, -16L))

Source: https://stackoverflow.com/questions/64464186
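Building on the `rate` idea from the last answer, the difference between consecutive readings can be divided by the number of days between fill-ups, so that negative or implausibly large daily distances stand out. This is only a sketch: the date format `%d.%m.%Y` is inferred from the sample, and `max_km_per_day` is a hypothetical threshold that would need tuning for a real fleet.

```r
# Example data copied from the question
df <- data.frame(
  CarID = rep(c("AAA555", "BBB788"), each = 8),
  FuelTransactionDate = c(
    "05.01.2019", "30.01.2019", "14.02.2019", "24.02.2019",
    "07.04.2019", "12.04.2019", "15.05.2019", "09.06.2019",
    "15.05.2018", "04.06.2018", "19.06.2018", "16.07.2018",
    "27.08.2018", "10.09.2018", "13.10.2018", "02.11.2018"
  ),
  Mileage = c(5060L, 7800L, 9100L, 9900L, 101110L, 12500L, 13000L, 13422L,
              15000L, 15200L, 16150L, 100L, 17500L, 17999L, 18200L, 18555L)
)

df$Date <- as.Date(df$FuelTransactionDate, format = "%d.%m.%Y")

# Difference to the previous reading within each car (NA for a car's first row)
df$diff_km <- ave(df$Mileage, df$CarID, FUN = function(x) c(NA, diff(x)))
df$diff_days <- ave(as.numeric(df$Date), df$CarID,
                    FUN = function(d) c(NA, diff(d)))
df$km_per_day <- df$diff_km / df$diff_days

# Hypothetical plausibility limit; an assumption, not a value from the thread
max_km_per_day <- 1000

# A transition is suspect if mileage went down or rose faster than the limit
df$suspect <- !is.na(df$km_per_day) &
  (df$km_per_day < 0 | df$km_per_day > max_km_per_day)

df[df$suspect, c("CarID", "FuelTransactionDate", "Mileage", "km_per_day")]
```

A flagged row marks a suspicious transition between two readings, not necessarily a wrong reading, so the rows on either side of it still need to be compared before anything is corrected or dropped.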