I have a dataset in Pig that looks like this:
6009544 "NY" 6009545 "NY"
6009544 "NY" 6009545 "NY"
6009548 "NY" 6009546 "OR"
6009546 "OR" 6009546 "OR"
6009545 "NY" 6009546 "OR"
6009548 "NY" 6009547 "AZ"
6009547 "AZ" 6009547 "AZ"
6009547 "AZ" 6009548 "NY"
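6009544 "NY" 6009548 "NY"
The first row reads: "Patent 6009544 originated in NY and cites patent 6009545, which also originated in NY." For each state, I am trying to find the percentage of its patents that cite a patent from the same state. So my expected output should be: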
NY: .5
OR: 1
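AZ: .5
This is because 6 rows have a patent that originated in NY, and 3 of them cite a patent that also originated in NY. The 1 patent that originated in OR cites a patent that also originated in OR. Of the 2 rows with a patent that originated in AZ, 1 cites a patent that also originated in AZ.
Can anyone suggest a good way to do this in Pig?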
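As a sanity check on the expected numbers, the same calculation can be sketched outside Pig in a few lines of Python. This is only an illustration of the arithmetic, not a Pig solution; the names `ROWS`, `LINE_RE`, and `same_state_share` are made up for this sketch, and the rows are copied from the sample data above.

```python
import re
from collections import Counter

# Sample rows from the question:
# citing patent, its state, cited patent, cited patent's state.
ROWS = """\
6009544 "NY" 6009545 "NY"
6009544 "NY" 6009545 "NY"
6009548 "NY" 6009546 "OR"
6009546 "OR" 6009546 "OR"
6009545 "NY" 6009546 "OR"
6009548 "NY" 6009547 "AZ"
6009547 "AZ" 6009547 "AZ"
6009547 "AZ" 6009548 "NY"
6009544 "NY" 6009548 "NY"
"""

# One row: number, quoted state, number, quoted state.
LINE_RE = re.compile(r'(\d+)\s+"(\w+)"\s+(\d+)\s+"(\w+)"')


def same_state_share(text):
    """Return, per citing state, the fraction of rows citing a same-state patent."""
    total = Counter()  # rows per citing state
    same = Counter()   # rows where citing and cited state match
    for line in text.splitlines():
        match = LINE_RE.fullmatch(line.strip())
        if match is None:
            continue  # skip blank or malformed lines
        _, state, _, cited_state = match.groups()
        total[state] += 1
        if state == cited_state:
            same[state] += 1
    return {state: same[state] / total[state] for state in total}


if __name__ == "__main__":
    for state, share in sorted(same_state_share(ROWS).items()):
        print(f"{state}: {share}")
```

For the sample rows this gives 0.5 for NY (3 of 6), 1.0 for OR (1 of 1), and 0.5 for AZ (1 of 2), which matches the expected output.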
Posted on 2014-10-09 07:50:05
Can you try this?
input.txt
6009544 "NY" 6009545 "NY"
6009544 "NY" 6009545 "NY"
6009548 "NY" 6009546 "OR"
6009546 "OR" 6009546 "OR"
6009545 "NY" 6009546 "OR"
6009548 "NY" 6009547 "AZ"
6009547 "AZ" 6009547 "AZ"
6009547 "AZ" 6009548 "NY"
6009544 "NY" 6009548 "NY"
PigScript:
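-- Load each row as one string, then split it into four fields with a regex.
A = LOAD 'input.txt' AS line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(\\d+)\\s+"(\\w+)"\\s+(\\d+)\\s+"(\\w+)"')) AS (f1:int,f2:chararray,f3:int,f4:chararray);
-- Group by the state of the citing patent (f2).
C = GROUP B BY f2;
-- Per state: rows citing a same-state patent (f2 == f4) divided by all rows.
D = FOREACH C {
    FilterByPatent = FILTER B BY f2==f4;
    CityPatentCount = COUNT(B.f2);
    GENERATE group,((float)COUNT(FilterByPatent)/(float)CityPatentCount);
}
DUMP D;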
Output:
(AZ,0.5)
(NY,0.5)
(OR,1.0)
Posted on 2015-06-23 20:01:59
I changed the sample data so that the fields are separated by single spaces, and loaded it with PigStorage:
A = load '/padata' using PigStorage(' ' ) as (pno:int,pcity:chararray,pci:int,pccity:chararray);
b = group A by pcity ;
r = foreach b {
    copcity = COUNT(A.pcity) ;
    samdata = FILTER A by pcity==pccity;
    csamdata = COUNT(samdata);
    percent = (float)csamdata/(float)copcity;
    generate group,percent ;
}
dump r ;
Output:
("AZ",0.5)
("NY",0.5)
("OR",1.0)
https://stackoverflow.com/questions/26271820