I have a data frame that looks like this:
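      d   b c  a
1  3400 100 3 -1
2  3400  50 3  1
3  3400 100 1 -1
4  3408  50 1  1
5  3412 100 3  1
6  3423  50 1  1
7  3434 100 1  1
8  3436 100 3  1
9  3438  50 3  1
10 3445  50 1  1
11 3454 100 3  1
12 3465 100 1  1

I want to group by columns a and b, with the condition that a group starts at a row where column c has the value 3 and ends once the value in column d is +30 ahead of the group's first entry (so the interval length is 30, but each interval can start at a different value of d). I then want to count the number of rows in each group.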
So the expected output for this example should be:

b    a   rowcount
100  -1  2  (starting at d = 3400)
50    1  3  (starting at d = 3400)
100   1  3  (starting at d = 3412)
50    1  2  (starting at d = 3438)
100   1  2  (starting at d = 3454)

I tried:
df<-df%>%
group_by(b,a,first(c) == 3 & lead(d) - d < 30)
summarise(number = n())

But this does not give the output I want. Any suggestions would be appreciated!

Update: a new example:
      d   b c a
1  3400 100 3 1
2  3400 100 3 1
3  3400 100 1 1
4  3408 100 1 1
5  3412 100 3 1
6  3434 100 3 1
7  3436 100 1 1
8  3438 100 3 1
9  3445 100 1 1
10 3443 100 3 1
11 3444 100 1 1
12 3463 100 3 1
13 3463 100 1 1
14 3463 100 3 1

Your code produces the following output:
      a     b count desc                    addition_info
  <dbl> <dbl> <int> <chr>                   <chr>
1     1   100     5 ( starting at d = 3400) There is 3 `c == 3` in this group
2     1   100     6 ( starting at d = 3434) There is 3 `c == 3` in this group
3     1   100     3 ( starting at d = 3463) There is 2 `c == 3` in this group

But the third group is wrong: the difference in d is 29 (3463 - 3434), which is < 30, so those rows should still belong to the second group. Why does this happen? For this example, the correct output should be:

      a     b count desc                    addition_info
  <dbl> <dbl> <int> <chr>                   <chr>
1     1   100     5 ( starting at d = 3400) There is 3 `c == 3` in this group
2     1   100     9 ( starting at d = 3434) There is 3 `c == 3` in this group

Answered 2021-04-05 15:44:32
I tweaked your example slightly, but I think this largely does what you need. (See also the note below the code.)
df <- read.table(text = " d b c a
1 3400 100 3 1
2 3400 100 3 1
3 3400 100 1 1
4 3408 100 1 1
5 3412 100 3 1
6 3434 100 3 1
7 3436 100 1 1
8 3438 100 3 1
9 3445 100 1 1
10 3443 100 3 1
11 3444 100 1 1
12 3463 100 3 1
13 3463 100 1 1
14 3463 100 3 1
15 3465 100 3 1", header = TRUE)
# added one row (row 15) to df
> df
      d   b c a
1  3400 100 3 1
2  3400 100 3 1
3  3400 100 1 1
4  3408 100 1 1
5  3412 100 3 1
6  3434 100 3 1
7  3436 100 1 1
8  3438 100 3 1
9  3445 100 1 1
10 3443 100 3 1
11 3444 100 1 1
12 3463 100 3 1
13 3463 100 1 1
14 3463 100 3 1
15 3465 100 3 1

Now follow this strategy:
library(tidyverse)
library(data.table) # for rleid()

df %>%
  mutate(r = row_number()) %>%   # remember the original row order
  group_by(b, a) %>%
  # carry the current window start forward; take a new start when d is
  # more than 30 above it, then number the runs of identical starts
  mutate(grp_no = rleid(accumulate(d, ~ifelse(.y - .x > 30, .y, .x)))) %>%
  group_by(b, a, grp_no) %>%
  summarise(row_count = n(), r = first(r), d = first(d)) %>%
  arrange(r) %>%
  mutate(additional = paste("group starts at d =", d)) %>%
  select(-r, -d)
# A tibble: 3 x 5
# Groups:   b, a [1]
      b     a grp_no row_count additional
  <int> <int>  <int>     <int> <chr>
1   100     1      1         5 group starts at d = 3400
2   100     1      2         9 group starts at d = 3434
3   100     1      3         1 group starts at d = 3465

For the first example, the output is:
# A tibble: 5 x 5
# Groups:   b, a [3]
      b     a grp_no row_count additional
  <int> <int>  <int>     <int> <chr>
1   100    -1      1         2 group starts at d = 3400
2    50     1      1         3 group starts at d = 3400
3   100     1      1         3 group starts at d = 3412
4    50     1      2         2 group starts at d = 3438
5   100     1      2         2 group starts at d = 3454

Note: you can also use dplyr::dense_rank instead of rleid in the syntax above, like this:
df %>% mutate(r = row_number()) %>%
group_by(b, a) %>%
mutate(grp_no = dense_rank(accumulate(d, ~ifelse(.y - .x > 30, .y, .x)) )) %>%
group_by(b, a, grp_no) %>%
summarise(row_count = n(), r = first(r), d = first(d)) %>%
arrange(r) %>%
mutate(additional = paste("group starts at d =", d)) %>%
select(-r, -d)

End note: I don't yet see how your `c == 3` logic fits into this. If you can clarify, I can try again.
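The windowing step in the pipelines above can be hard to read inline, so here is a minimal sketch of it in isolation. It uses base R's `Reduce(accumulate = TRUE)` as a stand-in for `purrr::accumulate`, and base-R expressions in place of `rleid` and `dense_rank`; the names `anchor`, `run_id` and `rank_id` are illustrative and not part of the answer. It only covers the single (b, a) group of the updated 15-row example and, like the answer, does not take the `c == 3` condition into account.

```r
# d values of the single (b, a) group in the updated 15-row example
d <- c(3400, 3400, 3400, 3408, 3412, 3434, 3436, 3438,
       3445, 3443, 3444, 3463, 3463, 3463, 3465)

# Carry the current window start forward; take the next d as the new
# start only when it lies more than 30 above the current start.
anchor <- Reduce(function(start, nxt) if (nxt - start > 30) nxt else start,
                 d, accumulate = TRUE)

# Run-length ids: a new id each time the start changes
run_id <- cumsum(c(TRUE, diff(anchor) != 0))

# Dense ranks of the starts
rank_id <- match(anchor, sort(unique(anchor)))

print(unique(anchor))                                      # [1] 3400 3434 3465
print(as.vector(table(run_id)))                            # [1] 5 9 1
print(identical(as.integer(run_id), as.integer(rank_id)))  # [1] TRUE
```

Because a new start is only taken when d is more than 30 above the current one, the sequence of starts never decreases, even where d itself is not sorted (3445 is followed by 3443 here). That is why run-length ids and dense ranks coincide for this vector, and why either helper yields the same grp_no.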
https://stackoverflow.com/questions/66925758