The following problem is giving me a real headache.
I have a large data set that looks like this.
Name Date C1 C2 C3 C4 C5 C6 C7
A 2008-01-03 100
A 2008-01-05 NA
A 2008-01-07 120
A 2008-02-03 NA
A 2008-03-10 50
A 2008-07-14 70
A 2008-07-15 NA
A 2009-01-03 40
A 2009-01-05 NA
A 2010-01-07 NA
A 2010-03-03 30
A 2010-03-10 20
A 2011-07-14 10
A 2011-07-15 NA
B 2008-01-03 NA
B 2008-01-05 5
B 2008-01-07 3
B 2008-02-03 11
B 2008-03-10 13
B 2008-07-14 ....
As you can see, there are many NAs in my observations. The other columns look similar, and the data set has more than 100,000 rows, so it is large.
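What I want to do is aggregate my data in the following way. Taking C1 as an example: I want to build monthly averages for each name, each year, and each month within the time frame 2000-01 to 2012-12.
The monthly average should be computed from the dates that are available in that month.
After the calculation, my data set should look like this.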
Name Date C1 C2 C3 C4 C5 C6 C7
A 2008-01 monthly average
A 2008-02 monthly average
A 2008-03 monthly average
A 2008-04 monthly average
A 2008-05 monthly average
A 2008-06 monthly average
A 2008-07 monthly average
A 2008-08 monthly average
A 2008-09 monthly average
A 2008-10 monthly average
A 2008-11 monthly average
A 2008-12 monthly average
A 2009-01 monthly average
B 2008-01 monthly average
B 2008-02 monthly average
B 2008-03 monthly average
B 2008-04 monthly average
B 2008-05 monthly average
B 2008-06 ....
So my output data should show every month of the year for each name. The value is either NA, if the month contains only NA values, or the monthly average.
For example:
Name Date C1
A 2008-01-03 100
A 2008-01-05 NA
A 2008-01-07 120
For this, we expect:
Name Date C1
A 2008-01 (100+120)/2 = 110
For example:
Name Date C1
A 2008-01-03 NA
A 2008-01-05 NA
A 2008-01-07 NA
For this, we expect:
Name Date C1
A 2008-01 NA
For example:
Name Date C1
A 2008-01-03 100
A 2008-01-05 50
A 2008-01-07 120
For this, we expect:
Name Date C1
A 2008-01 (100+50+120)/3 = 90
Since I am relatively new to R, I don't know how to solve this problem. I hope to find someone who can solve it and show me how something like this is done. I would really appreciate your support :)
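The rule in the three cases above (average the available values, return NA when every value is missing) can be written as a small helper. This is only a minimal sketch in base R; `mean_or_na` is a hypothetical name that does not appear in the original post.

```r
# Hypothetical helper (not from the original post): mean of the available
# values, or NA when every value in the group is missing.
mean_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)
}

mean_or_na(c(100, NA, 120))   # 110
mean_or_na(c(NA, NA, NA))     # NA
mean_or_na(c(100, 50, 120))   # 90
```

The explicit `all(is.na(x))` check matters because `mean(x, na.rm = TRUE)` on an all-NA vector returns NaN rather than NA.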
Posted on 2018-05-11 17:48:21
You can use dplyr::summarise_all to calculate the mean of all columns (C1, C2, etc.).
First group_by on Name and YearMon, deselect the Date column, and then use summarise_all.
library(dplyr)
library(lubridate)
# C2 is added only to demonstrate the calculation for multiple columns in one go.
df %>%
  mutate(Date = ymd(Date), C2 = C1 * 2) %>%
  group_by(Name, YearMon = format(Date, "%Y-%m")) %>%
  select(-Date) %>%
  summarise_all("mean", na.rm = TRUE)

# OR: use summarise_at and calculate the mean for all columns starting with 'C'
df %>%
  mutate(Date = ymd(Date), C2 = C1 * 2) %>%
  group_by(Name, YearMon = format(Date, "%Y-%m")) %>%
  summarise_at(vars(starts_with("C")), mean, na.rm = TRUE)
# A tibble: 12 x 4
# Groups: Name [?]
Name YearMon C1 C2
<chr> <chr> <dbl> <dbl>
1 A 2008-01 110 220
2 A 2008-02 NaN NaN
3 A 2008-03 50.0 100
4 A 2008-07 70.0 140
5 A 2009-01 40.0 80.0
6 A 2010-01 NaN NaN
7 A 2010-03 25.0 50.0
8 A 2011-07 10.0 20.0
9 B 2008-01 4.00 8.00
10 B 2008-02 11.0 22.0
11 B 2008-03 13.0 26.0
12 B 2008-07 NaN NaN

Data:
df <- read.table(text =
"Name Date C1
A 2008-01-03 100
A 2008-01-05 NA
A 2008-01-07 120
A 2008-02-03 NA
A 2008-03-10 50
A 2008-07-14 70
A 2008-07-15 NA
A 2009-01-03 40
A 2009-01-05 NA
A 2010-01-07 NA
A 2010-03-03 30
A 2010-03-10 20
A 2011-07-14 10
A 2011-07-15 NA
B 2008-01-03 NA
B 2008-01-05 5
B 2008-01-07 3
B 2008-02-03 11
B 2008-03-10 13
B 2008-07-14 NA",
header = TRUE, stringsAsFactors = FALSE)

Posted on 2018-05-11 17:52:37
library(dplyr)
#generating sample data
data <- data.frame(Name = c(rep("A",25), rep("B",50)),
Date = seq(as.Date("2018-01-01"), as.Date("2020-01-12"), by = 10),
C1 = rep(c(100,NA,NA,NA,NA,500,320,102,412,NA,200,NA,145,800,230),5))
# group by Name and month, then take the mean of the available values
data %>%
  group_by(Name, month = cut(Date, "month")) %>%
  summarise(C1 = mean(C1, na.rm = TRUE)) %>%
  mutate(C1 = ifelse(is.nan(C1), NA, C1))  # an all-NA month gives NaN; turn it into NA

https://stackoverflow.com/questions/50297265
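Neither snippet above produces rows for months that have no observations at all, although the expected output in the question lists them (for example A 2008-04). One possible way to pad those months is sketched below in base R. It assumes a tiny stand-in data frame with values copied from the question and a shortened window of 2008-01 to 2008-04; the question's actual window is 2000-01 to 2012-12. The names `mean_or_na`, `monthly`, `grid`, and `result` are illustrative and not from the original answers.

```r
# Small stand-in for the question's data (values copied from the post).
df <- data.frame(
  Name = c("A", "A", "A", "A", "A", "B", "B"),
  Date = as.Date(c("2008-01-03", "2008-01-05", "2008-01-07",
                   "2008-02-03", "2008-03-10",
                   "2008-01-05", "2008-01-07")),
  C1   = c(100, NA, 120, NA, 50, 5, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: mean of available values, NA (not NaN) if all are NA.
mean_or_na <- function(x) if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)

# Monthly means for the months that actually occur in the data.
# na.action = na.pass keeps the NA rows so the helper can see them.
df$YearMon <- format(df$Date, "%Y-%m")
monthly <- aggregate(C1 ~ Name + YearMon, data = df, FUN = mean_or_na,
                     na.action = na.pass)

# Every Name x month combination in the window (shortened here; an assumption).
months <- format(seq(as.Date("2008-01-01"), as.Date("2008-04-01"), by = "month"),
                 "%Y-%m")
grid <- expand.grid(Name = unique(df$Name), YearMon = months,
                    stringsAsFactors = FALSE)

# Left join: months without any rows end up with NA in C1.
result <- merge(grid, monthly, by = c("Name", "YearMon"), all.x = TRUE)
result <- result[order(result$Name, result$YearMon), ]
print(result)
```

With more value columns, the same grid-and-join idea should carry over by listing them on the left-hand side of the formula, for example `cbind(C1, C2) ~ Name + YearMon`.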