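I need to filter my data according to certain conditions, and the data is hierarchical in a certain way.
My export data looks like this, but for many countries and years.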
df3dgt <- data.frame(
  "Reporter" = c("USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "EU", "EU", "EU", "EU", "EU", "EU", "EU", "EU", "EU"),
  "Partner" = c("EU", "EU", "EU", "EU", "EU", "EU", "EU", "EU", "EU", "EU", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA"),
  "Commodity code" = c("1", "11", "111", "112", "12", "2", "21", "211", "22", "3", "1", "11", "111", "112", "2", "21", "211", "212", "22"),
  "Value" = c(100, 50, 25, 5, 40, 200, 170, 170, 30, 220, 190, 190, 120, 30, 300, 200, 150, 50, 100),
  stringsAsFactors = FALSE)

The commodity codes aggregate the data at different levels. For example, 111 (e.g. apples) and 112 (e.g. bananas) are subcategories of commodity 11 (e.g. fruit); similarly, 11 (fruit) and 12 (vegetables) are subcategories of 1 (e.g. food).
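I need to filter the data so that the complete data is separated from the rest.
I would like to filter on two conditions:
(1) Filter for the data in which the values of the sub-commodity categories add up to the value reported at the next higher level of aggregation. For example, USA exports to the EU in commodity code 1 are not complete: commodity 112 (val=5) and commodity 111 (val=25) do not add up to the value of commodity 11 (val=50). Likewise, the values of 11 (val=50) and 12 (val=40) do not add up to the value of commodity code 1 (100). In contrast, EU exports to the USA in category 2 are complete: commodities 211 (val=150) and 212 (val=50) add up to the level of commodity 21 (val=200), and product categories 21 (200) and 22 (100) add up to the level of commodity 2 (300).
(2) I would also like to know how to separately filter the data that is reported only at the higher-level commodity codes. To see which data is reported only at a higher level, refer to the following illustrative list of commodity codes: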
Comlist <- c("1", "11", "111", "112", "12","2", "21","211", "22","221", "3","31", "32", "311", "321")
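Comlist <- as.data.frame(Comlist)

So I would like to filter commodity 22 in the exports between the USA and the EU, because I know that category 221 exists and is not reported. The same goes for category 3, which is not reported at its lower levels.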
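A minimal base R sketch of what condition (2) amounts to for a single Reporter/Partner pair, assuming that the parent of a code is simply the code without its last digit. The names `all_codes`, `reported` and `parent_of` are hypothetical and only serve the illustration; with many countries the same logic would have to be applied per Reporter/Partner group.

```r
# Reference list of all commodity codes that exist (the illustrative list above).
all_codes <- c("1", "11", "111", "112", "12", "2", "21", "211", "22", "221",
               "3", "31", "32", "311", "321")

# Codes actually reported for one pair (USA -> EU in the sample data).
reported <- c("1", "11", "111", "112", "12", "2", "21", "211", "22", "3")

# Assumption: the parent of a code is the code without its last digit
# (top-level codes get the empty string as parent).
parent_of <- function(code) substr(code, 1, nchar(code) - 1)

# A code is "reported only at the higher level" when the reference list
# knows children for it but none of those children was reported.
has_children      <- reported %in% parent_of(all_codes)
children_reported <- reported %in% parent_of(reported)
only_higher_level <- reported[has_children & !children_reported]

print(only_higher_level)
```

For the USA -> EU sample this flags codes 22 and 3, which matches the two cases described above.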
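To tackle (1), I was thinking of working one level at a time (first the two-digit and three-digit product categories, then the one-digit and two-digit ones). I first create a new variable for each level of the product category: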
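library(dplyr)
library(stringr)

# create new variable Prodcat1 for the 1st level product category
df1 <- df3dgt %>%
  group_by(Reporter, Partner) %>%
  mutate(Prodcat1 = str_extract(Commodity.code, "^.{1}"))
# create new variable Prodcat2 for my 2nd level product category
df2 <- df1 %>%
  group_by(Reporter, Partner) %>%
  mutate(Prodcat2 = str_extract(Commodity.code, "^.{2}"))

Then I filter:

df2.Incomplete <- df2 %>%
  group_by(Reporter, Partner, Prodcat2) %>%
  filter(sum(Value[2:n()]) < Value[1])

However, this only includes data with two-digit or three-digit commodity codes, whereas I also want to include the one-digit level of the incomplete data. For example, it returns the rows with "Commodity code" 111 and 11, but not the row with commodity code "1" of the incomplete group. In addition, given that I have a hundred countries and product categories to consider, I am not sure how to proceed with filtering for case (2).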
Thank you very much for your help.
Posted on 2019-01-08 07:55:16
It is still a bit confusing what exactly your goal is. But here is a function that may help:
Function
# check function
checkLowerLevels <- function (df = df3dgt, defaultValue = TRUE) {
  # set column commodity.code to integer
  df$Commodity.code <- as.integer(df$Commodity.code)
  # checks
  out <- vapply(seq_len(nrow(df)), function (k, defaultValue) {
    # Checks at lower level if values add up to parent level
    # if no lower exists, default value is TRUE.
    #
    # For a given incomplete parent return parent + children
    # separated by ';' (can be split using strsplit).
    currentCode <- .subset2(df, 3)[k]
    currentRepo <- .subset2(df, 1)[k]
    currentPart <- .subset2(df, 2)[k]
    # direct children of code n are the codes from 10*n to 10*n + 9
    lowerLevels <- (.subset2(df, 3) > currentCode*10-1) &
      (.subset2(df, 3) < (currentCode+1)*10) &
      (.subset2(df, 1) == currentRepo) &
      (.subset2(df, 2) == currentPart)
    check <- ifelse(!any(lowerLevels), defaultValue, sum(.subset2(df, 4L)[lowerLevels]) == .subset2(df, 4L)[k])
    res <- NA_character_
    if (check) return (c(res,check))
    res <- paste(currentCode, paste(.subset2(df, 3L)[lowerLevels], collapse = ';'), sep = ';')
    return(c(res,check))
  }, character(2), defaultValue = defaultValue)
  out <- t(out)
  ans <- data.frame(check=out[,2], subset=out[,1], stringsAsFactors = FALSE)
  ans$check <- as.logical(ans$check)
  ans
}

Result
Here is an example application:
# create new columns in data frame
df3dgt <- cbind(df3dgt, checkLowerLevels())
df3dgt
#    Reporter Partner Commodity.code Value check     subset
# 1       USA      EU              1   100 FALSE    1;11;12
# 2       USA      EU             11    50 FALSE 11;111;112
# 3       USA      EU            111    25  TRUE       <NA>
# 4       USA      EU            112     5  TRUE       <NA>
# 5       USA      EU             12    40  TRUE       <NA>
# 6       USA      EU              2   200  TRUE       <NA>
# 7       USA      EU             21   170  TRUE       <NA>
# 8       USA      EU            211   170  TRUE       <NA>
# 9       USA      EU             22    30  TRUE       <NA>
# 10      USA      EU              3   220  TRUE       <NA>
# 11       EU     USA              1   190  TRUE       <NA>
# 12       EU     USA             11   190 FALSE 11;111;112
# 13       EU     USA            111   120  TRUE       <NA>
# 14       EU     USA            112    30  TRUE       <NA>
# 15       EU     USA              2   300  TRUE       <NA>
# 16       EU     USA             21   200  TRUE       <NA>
# 17       EU     USA            211   150  TRUE       <NA>
# 18       EU     USA            212    50  TRUE       <NA>
# 19       EU     USA             22   100  TRUE       <NA>

Explanation
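checkLowerLevels checks the completeness of any code n, filtered by Reporter and Partner (so that n = 1 is checked once for USA -> EU and once for EU -> USA). If a code is incomplete, it also returns the code n and its sub-levels n1, n2, ..., nm in the format n;n1;n2;...;nm.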
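As a possible follow-up, the `subset` strings can be split with `strsplit` to pull out every row that belongs to an incomplete group. The sketch below is only an illustration: it rebuilds a few USA -> EU rows by hand so that it runs on its own, and the names `res`, `incomplete_codes` and `incomplete_rows` are hypothetical. It handles a single Reporter/Partner pair; with several pairs the matching would need to be done per pair.

```r
# A few rows shaped like the result of checkLowerLevels(), rebuilt by hand
# (single Reporter/Partner pair) so the sketch is self-contained.
res <- data.frame(
  Reporter       = "USA",
  Partner        = "EU",
  Commodity.code = c("1", "11", "111", "112", "12", "2", "21"),
  check          = c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE),
  subset         = c("1;11;12", "11;111;112", NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Split the 'subset' strings of the incomplete parents into single codes
# and pool them (parents and their direct children).
incomplete_codes <- unique(unlist(
  strsplit(res$subset[!res$check], ";", fixed = TRUE)
))

# Keep every row whose code appears in an incomplete group.
incomplete_rows <- res[res$Commodity.code %in% incomplete_codes, ]
print(incomplete_rows$Commodity.code)
```

In this small example the rows for codes 2 and 21 are dropped, while code 1 and everything below it are kept as one incomplete group.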
https://stackoverflow.com/questions/54081416