I have a log file like this, containing many entries:
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
Report Job information
Job ID : 12345
Job name : Background Execution
Job priority : 5
Job group : normal
Report group : General
Job description : Brief
User : user01
Report : General101
Modified user : 12
Modified date : 2016-09-16 15:18:52.08
Report mode : Simple
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
Report Job information
Job ID : 12367
Job name : Background Execution
Job priority : 5
Job group : normal
Report group : General1
Job description : Brief
User : user01
Report : General101
Modified user : 12
Modified date : 2016-09-16 15:45:52.08
Report mode : Simple
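I need to extract this section from the log and put it into a data frame. The log will contain many such entries, and every time the script encounters one of these blocks it should append it to the data frame.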
Report Job information
Job ID : 12367
Job name : Background Execution
Job priority : 5
Job group : normal
Report group : General1
Job description : Brief
User : user01
Report : General101
Modified user : 12
Modified date : 2016-09-16 15:45:52.08
Report mode : Simple
I tried this:
i<-c("app.log")
xx <- readLines(i)
ii <- 10 + grep("Report Job information",xx, fixed=TRUE)[1]
line_count <- Filter(function(v) v!="", strsplit(xx[ii]," ")[[2]])
Is there a way to do this?
Given this data:
Job ID : 12345
Job name : Background Execution
Job priority : 5
Job group : normal
Report group : General
Job description : Brief
User : user01
Report : General101
Modified user : 12
Modified date : 2016-09-16 15:18:52.08
Report mode : Simple
Each set of entries needs to be one row. For example:
title=Report Job information,jobid=12345, JobName=Background Execution, job priority=5,jobgroup=normal, reportgroup=General,ModifiedDate=2016-09-16 15:18:52.08 etc
Answered 2016-10-10 13:16:55
Here is the data, read in (note the use of the dplyr/magrittr pipe):
library(dplyr)  # provides the %>% pipe and bind_rows()
inData <-
"2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
Report Job information
Job ID : 12345
Job name : Background Execution
Job priority : 5
Job group : normal
Report group : General
Job description : Brief
User : user01
Report : General101
Modified user : 12
Modified date : 2016-09-16 15:18:52.08
Report mode : Simple
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
2016-09-16 15:18:52,555 DEBUG Thread-3
Report Job information
Job ID : 12367
Job name : Background Execution
Job priority : 5
Job group : normal
Report group : General1
Job description : Brief
User : user01
Report : General101
Modified user : 12
Modified date : 2016-09-16 15:45:52.08
Report mode : Simple" %>%
strsplit("\\n") %>%
unlist
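I think the simplest approach is to first find where each report starts. Here I assume that "Report Job information" sits at the top of every report (it does not need to go into the data.frame, so I start one line after it).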
reportStarts <-
grep("Report Job information", inData) + 1
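Next, I find the end of each report. Without knowing more about the structure, I assume that "Report mode" is the last thing reported. If a report is instead a fixed number of lines, or ends with some other break indicator, this step may need to be modified.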
reportStops <-
grep("Report mode", inData)
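Because the loop that follows pairs the n-th start with the n-th stop, it may be worth checking that the two marker vectors line up before parsing. The following is a minimal sketch in base R on a hypothetical miniature log (logLines, starts, stops and fieldCounts are illustrative names, not part of the answer's code):

```r
# Hypothetical miniature log, for illustration only
logLines <- c(
  "2016-09-16 15:18:52,555 DEBUG Thread-3",
  "Report Job information",
  "Job ID : 12345",
  "Report mode : Simple",
  "2016-09-16 15:18:52,555 DEBUG Thread-3",
  "Report Job information",
  "Job ID : 12367",
  "Report mode : Simple"
)

# First field line of each report (one line after the header)
starts <- grep("Report Job information", logLines, fixed = TRUE) + 1
# Last field line of each report (assumed to be "Report mode")
stops  <- grep("Report mode", logLines, fixed = TRUE)

# Every report needs exactly one start and one stop, with the start first
stopifnot(length(starts) == length(stops), all(starts <= stops))

# Number of field lines found in each report
fieldCounts <- stops - starts + 1
```

If the check fails, the log probably contains a truncated report or a different closing field, and the end-of-report assumption would need revisiting.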
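Next, I loop over each report (using idx to indicate which report is being processed). This first restricts to the lines in that report, then splits each line on " : ", which appears to be the field separator, and uses the part before the " : " to name the output (paste is used defensively in case a value also contains " : "). Finally, I put each report into a one-row data.frame and bind the rows with bind_rows, which handles any extra columns that appear in only some reports.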
lapply(seq_along(reportStarts), function(idx){
inData[reportStarts[idx]:reportStops[idx]] %>%
strsplit(" : ") %>%
sapply(function(thisField){
setNames(paste(thisField[-1]
, collapse = " : ")
, thisField[1])
}) %>%
rbind %>%
data.frame()
}) %>%
bind_rows()
This outputs the following data.frame:
Job.ID Job.name Job.priority Job.group Report.group Job.description User Report Modified.user Modified.date Report.mode
1 12345 Background Execution 5 normal General Brief user01 General101 12 2016-09-16 15:18:52.08 Simple
2 12367 Background Execution 5 normal General1 Brief user01 General101 12 2016-09-16 15:45:52.08 Simple
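Since every value is split out of a text line, all columns in a result like this hold text rather than numbers or dates. As a hedged sketch, the columns could be converted afterwards in base R. The reports data frame below is a hypothetical stand-in with only three of the columns, and the UTC time zone is an assumption, since the log does not state one:

```r
# Hypothetical stand-in for the parsed result; every column is text
reports <- data.frame(
  Job.ID        = c("12345", "12367"),
  Job.priority  = c("5", "5"),
  Modified.date = c("2016-09-16 15:18:52.08", "2016-09-16 15:45:52.08"),
  stringsAsFactors = FALSE
)

# Numeric-looking fields become integers
reports$Job.ID       <- as.integer(reports$Job.ID)
reports$Job.priority <- as.integer(reports$Job.priority)

# %OS reads seconds including the fractional part; tz = "UTC" is an assumption
reports$Modified.date <- as.POSIXct(reports$Modified.date,
                                    format = "%Y-%m-%d %H:%M:%OS",
                                    tz = "UTC")
```

If the columns arrive as factors (the default for data.frame() before R 4.0), wrapping each one in as.character() first avoids converting the factor codes instead of the values.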
Answered 2016-10-10 13:30:29
You can try:
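library(stringr)
# xx holds the log lines from readLines() in the question
foo <- function(x, y){
  tmp <- y[x:(x + 11)]          # header line plus the 11 field lines
  tmp <- str_split(tmp, ": ")   # split each line at the field separator
  unlist(lapply(tmp, tail, 1))  # keep the last piece of each line, i.e. the value
}
# i must hold the header line numbers, not the file name assigned to i in the question
i <- grep("Report Job information", xx, fixed = TRUE)
data.frame(t(sapply(i, foo, xx)))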
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 Report Job information 12345 Background Execution 5 normal General Brief user01 General101 12
2 Report Job information 12367 Background Execution 5 normal General1 Brief user01 General101 12
X11 X12
1 2016-09-16 15:18:52.08 Simple
2 2016-09-16 15:45:52.08 Simple
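The fixed-offset approach above yields generic column names (X1 to X12) because only the values are kept. A possible variation, sketched here in base R on a hypothetical miniature log, also keeps the text before the colon as the column name. The names parseBlock, nFields and result are illustrative, and the two-field blocks stand in for the 11 fields of the real log:

```r
# Hypothetical miniature log: each report is a header plus 2 field lines
xx <- c(
  "2016-09-16 15:18:52,555 DEBUG Thread-3",
  "Report Job information",
  "Job ID : 12345",
  "Report mode : Simple",
  "2016-09-16 15:18:52,555 DEBUG Thread-3",
  "Report Job information",
  "Job ID : 12367",
  "Report mode : Simple"
)
nFields <- 2  # would be 11 for the log in the question

# Line numbers of the report headers
starts <- grep("Report Job information", xx, fixed = TRUE)

parseBlock <- function(start) {
  fieldLines <- xx[(start + 1):(start + nFields)]
  keys   <- trimws(sub(":.*$", "", fieldLines))     # text before the first colon
  values <- trimws(sub("^[^:]*:", "", fieldLines))  # text after the first colon
  setNames(as.list(values), keys)
}

# One single-row data frame per report, stacked into one result
result <- do.call(rbind, lapply(starts, function(s) {
  as.data.frame(parseBlock(s), stringsAsFactors = FALSE)
}))
```

Splitting on the first colon only keeps timestamp values such as 15:18:52.08 intact. as.data.frame() turns spaces in the keys into dots, so "Job ID" becomes the column Job.ID.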
https://stackoverflow.com/questions/39958733