I have a dataframe that looks like this:
      d   b c  a
1  3400 100 3 -1
2  3400  50 3  1
3  3400 100 1 -1
4  3408  50 1  1
5  3412 100 3  1
6  3423  50 1  1
7  3434 100 1  1
8  3436 100 3  1
9  3438  50 3  1
10 3445  50 1  1
11 3454 100 3  1
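12 3465 100 1  1
I want to group by columns a and b, with the condition that a group starts at a row where column c = 3 and ends once the value in column d is more than 30 ahead of the group's first entry (so the interval length is 30, but each interval can have a different starting point). I then want to count the number of rows in each group.
So the expected output for this example would be: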
b    a   rowcount 
100 -1    2        ( starting at d = 3400)
50   1    3        ( starting at d = 3400)
100  1    3        (starting at d= 3412)
50   1    2        (starting at d= 3438)
100  1    2        (starting at d= 3454)
I tried:
df <- df %>%
  group_by(b, a, first(c) == 3 & lead(d) - d < 30) %>%
  summarise(number = n())
But this does not give my desired output. Any comments are appreciated!
Update: new example:
      d   b c a
1  3400 100 3 1
2  3400 100 3 1
3  3400 100 1 1
4  3408 100 1 1
5  3412 100 3 1
6  3434 100 3 1
7  3436 100 1 1
8  3438 100 3 1
9  3445 100 1 1
10 3443 100 3 1
11 3444 100 1 1
12 3463 100 3 1
13 3463 100 1 1
14 3463 100 3 1
Your code gives the following output:
      a     b count desc                    addition_info                    
  <dbl> <dbl> <int> <chr>                   <chr>                            
1     1   100     5 ( starting at d = 3400) There is 3 `c == 3` in this group
2     1   100     6 ( starting at d = 3434) There is 3 `c == 3` in this group
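3     1   100     3 ( starting at d = 3463) There is 2 `c == 3` in this group
But the third group is wrong: the difference in d from the start of the second group is 3463 - 3434 = 29, which is < 30. Why does this happen? The correct output for this example should therefore be: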
      a     b count desc                    addition_info                    
  <dbl> <dbl> <int> <chr>                   <chr>                            
1     1   100     5 ( starting at d = 3400) There is 3 `c == 3` in this group
2     1   100     9 ( starting at d = 3434) There is 3 `c == 3` in this group
Answered 2021-04-05 15:44:32
I tweaked your example slightly, but I think this should largely do the job. (See also the note below the code.)
df <- read.table(text = "     d   b c a
1  3400 100 3 1
2  3400 100 3 1
3  3400 100 1 1
4  3408 100 1 1
5  3412 100 3 1
6  3434 100 3 1
7  3436 100 1 1
8  3438 100 3 1
9  3445 100 1 1
10 3443 100 3 1
11 3444 100 1 1
12 3463 100 3 1
13 3463 100 1 1
14 3463 100 3 1
15 3465 100 3 1", header = TRUE)
#added one row in df
> df
      d   b c a
1  3400 100 3 1
2  3400 100 3 1
3  3400 100 1 1
4  3408 100 1 1
5  3412 100 3 1
6  3434 100 3 1
7  3436 100 1 1
8  3438 100 3 1
9  3445 100 1 1
10 3443 100 3 1
11 3444 100 1 1
12 3463 100 3 1
13 3463 100 1 1
14 3463 100 3 1
15 3465 100 3 1
Now follow this strategy:
library(tidyverse)
library(data.table) # for rleid()
df %>% mutate(r = row_number()) %>%
  group_by(b, a) %>% mutate(grp_no = rleid(accumulate(d, ~ifelse(.y - .x > 30, .y, .x)))) %>%
  group_by(b, a, grp_no) %>%
  summarise(row_count = n(), r = first(r), d = first(d)) %>%
  arrange(r) %>%
  mutate(additional = paste("group starts at d =", d)) %>%
  select(-r, -d)
# A tibble: 3 x 5
# Groups:   b, a [1]
      b     a grp_no row_count additional              
  <int> <int>  <int>     <int> <chr>                   
1   100     1      1         5 group starts at d = 3400
2   100     1      2         9 group starts at d = 3434
3   100     1      3         1 group starts at d = 3465
For the first example, the output is:
# A tibble: 5 x 5
# Groups:   b, a [3]
      b     a grp_no row_count additional              
  <int> <int>  <int>     <int> <chr>                   
1   100    -1      1         2 group starts at d = 3400
2    50     1      1         3 group starts at d = 3400
3   100     1      1         3 group starts at d = 3412
4    50     1      2         2 group starts at d = 3438
5   100     1      2         2 group starts at d = 3454
Note: you can also use dplyr::dense_rank instead of rleid in the syntax above, as follows:
df %>% mutate(r = row_number()) %>%
  group_by(b, a) %>% 
  mutate(grp_no = dense_rank(accumulate(d, ~ifelse(.y - .x > 30, .y, .x)) )) %>%
  group_by(b, a, grp_no) %>%
  summarise(row_count = n(), r = first(r), d = first(d)) %>%
  arrange(r) %>%
  mutate(additional = paste("group starts at d =", d)) %>%
  select(-r, -d)
EndNote: I am not sure how your c == 3 logic fits into this. If you can clarify, I can try again.
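To make the accumulate() step in the code above easier to follow, here is a minimal base-R sketch that traces it on the d values of the updated example. It swaps in Reduce(accumulate = TRUE) for purrr::accumulate, cumsum() over changes for rleid, and match() against the sorted unique values for dense_rank, so it is only an illustration of the idea, not the answer's exact code. The names anchors, grp_rle and grp_rank are made up for this sketch, and the c == 3 condition is not handled here either.

```r
# d values of the single (b, a) group in the updated 15-row example
d <- c(3400, 3400, 3400, 3408, 3412, 3434, 3436, 3438,
       3445, 3443, 3444, 3463, 3463, 3463, 3465)

# Carry the current group's starting d forward; switch to the new d
# only when it lies more than 30 past that start.
anchors <- Reduce(function(start, d_next) {
  if (d_next - start > 30) d_next else start
}, d, accumulate = TRUE)

# Number the runs of identical anchors (the role rleid plays above)
grp_rle <- cumsum(c(TRUE, diff(anchors) != 0))

# Dense-rank the anchors (the role dense_rank plays above)
grp_rank <- match(anchors, sort(unique(anchors)))

print(anchors)         # starting d carried along each row
print(table(grp_rle))  # rows per group
```

Both numberings coincide in this sketch because an anchor is only ever replaced by a strictly larger d, so the anchors never decrease even though d itself is not fully sorted (3445 is followed by 3443).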
https://stackoverflow.com/questions/66925758