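I recently got a task to scrape some data that sits in a table on a web page, a few hundred rows in total. Opening the browser's debug tools showed that the endpoint simply returns an HTML page, so the response can be handled as a plain string. (Parsing the HTML with an XPath-based crawler would have been somewhat cumbersome.) The approach I settled on was to match every table row with a regular expression and then extract the cell contents, which ran into a few other problems along the way:
Sharing the code here for reference: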
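public static void main(String[] args) {
    // getHttpGet, getHttpResponse, output, testOver and the constants LINE, EMPTY, SPACE_1 are helpers not shown here
    String url = "https://docs.oracle.com/cd/E13214_01/wli/docs92/xref/xqisocodes.html";
    HttpGet httpGet = getHttpGet(url);
    JSONObject httpResponse = getHttpResponse(httpGet);
    // The response body is the raw HTML page, handled as a plain string
    String content = httpResponse.getString("content");
    // '.' does not match line breaks by default, so a table row is spelled out as seven lines joined by LINE
    List<String> strings = regexAll(content, "<tr.+</a>" + LINE + ".+" + LINE + ".+" + LINE + ".+" + LINE + ".+" + LINE + ".+" + LINE + "</div>");
    int size = strings.size();
    for (int i = 0; i < size; i++) {
        // Strip the tags (greedy, line by line) and the line breaks, leaving the two cell values
        String s = strings.get(i).replaceAll("<.+>", EMPTY).replaceAll(LINE, EMPTY);
        // Split at the first space: first cell on the left, second cell on the right
        String[] split = s.split(" ", 2);
        String sql = "INSERT country_code (country,code) VALUES (\"%s\",\"%s\");";
        output(String.format(sql, split[0].replace(SPACE_1, EMPTY), split[1].replace(SPACE_1, EMPTY)));
    }
    testOver();
}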
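To make the row matching and cleanup steps easier to follow without a network call, here is a minimal offline sketch. It assumes LINE is "\n", EMPTY is "" and SPACE_1 is " " (the real constants are not shown in this post), and it runs against a hypothetical two-row HTML fragment that only mimics the seven-line row shape the regex expects; it is not the actual markup of the Oracle page. The names RowCleanupSketch, sampleRow, sampleHtml and toInsert are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RowCleanupSketch {
    // Assumed values; the original constants are defined elsewhere and not shown.
    static final String LINE = "\n";
    static final String EMPTY = "";
    static final String SPACE_1 = " ";

    // Same shape as the pattern in main: first line, five arbitrary lines, then "</div>".
    static final String ROW_REGEX = "<tr.+</a>" + LINE + ".+" + LINE + ".+" + LINE
            + ".+" + LINE + ".+" + LINE + ".+" + LINE + "</div>";

    // Collects every match of regex in text, in order of appearance.
    static List<String> regexAll(String text, String regex) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    // Hypothetical row markup: each cell value sits on a line of its own.
    static String sampleRow(String name, String code) {
        return "<tr><td><a name=\"x\"></a>" + LINE
                + "<div>" + LINE
                + name + LINE
                + "</div></td>" + LINE
                + "<td><div>" + LINE
                + " " + code + LINE
                + "</div>";
    }

    static String sampleHtml() {
        return "<table>" + LINE
                + sampleRow("German", "de") + LINE + "</td></tr>" + LINE
                + sampleRow("Greek", "el") + LINE + "</td></tr>" + LINE
                + "</table>";
    }

    // Same cleanup as the loop body in main, applied to one matched row.
    static String toInsert(String row) {
        // "<.+>" is greedy: on each line it removes everything from the first '<' to the last '>'.
        String s = row.replaceAll("<.+>", EMPTY).replaceAll(LINE, EMPTY);
        String[] split = s.split(" ", 2);
        String sql = "INSERT country_code (country,code) VALUES (\"%s\",\"%s\");";
        return String.format(sql, split[0].replace(SPACE_1, EMPTY), split[1].replace(SPACE_1, EMPTY));
    }

    public static void main(String[] args) {
        for (String row : regexAll(sampleHtml(), ROW_REGEX)) {
            System.out.println(toInsert(row));
        }
    }
}
```

Two properties of this cleanup are worth keeping in mind. Because "<.+>" is greedy, any text sitting between two tags on the same line is removed along with them, so the approach relies on the cell values being on lines of their own. And because the split happens at the first space, a first-column value that itself contains a space would end up partly merged into the second column.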
One of the helper methods used in main is shown below:
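import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Returns all matches of a regular expression.
 *
 * @param text  the text to search
 * @param regex the regular expression
 * @return every matched substring, in order of appearance (empty if there is no match)
 */
public static List<String> regexAll(String text, String regex) {
    List<String> result = new ArrayList<>();
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        result.add(matcher.group());
    }
    return result;
}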
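The long pattern in main chains ".+" with LINE because, with default flags, '.' in java.util.regex does not match line terminators. As a possible alternative, the sketch below shows a hypothetical variant of regexAll that accepts Pattern flags, so that Pattern.DOTALL combined with a reluctant quantifier can match a multi-line row in one go. The class name DotAllSketch, the extra flags parameter and the sample text are assumptions made for illustration, not part of the original helper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllSketch {
    // Hypothetical variant of regexAll that also takes java.util.regex.Pattern flags.
    static List<String> regexAll(String text, String regex, int flags) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex, flags).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample: two rows, each spread over three lines.
        String text = "<tr>\nGerman\n</tr>\n<tr>\nGreek\n</tr>";

        // Default flags: '.' stops at line breaks, so a multi-line row never matches.
        System.out.println(regexAll(text, "<tr>.+</tr>", 0).size());

        // DOTALL with a greedy quantifier: a single match that swallows both rows.
        System.out.println(regexAll(text, "<tr>.+</tr>", Pattern.DOTALL).size());

        // DOTALL with a reluctant quantifier: one match per row.
        System.out.println(regexAll(text, "<tr>.+?</tr>", Pattern.DOTALL).size());
    }
}
```

Whether this is preferable to spelling out each line depends on the page: the explicit line-by-line pattern is stricter about the row layout, while the DOTALL version tolerates rows with a varying number of lines.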
Part of the resulting SQL output:
INSERT country_code (country,code) VALUES ("German","de");
INSERT country_code (country,code) VALUES ("Greek","el");
INSERT country_code (country,code) VALUES ("Greenlandic","kl");
INSERT country_code (country,code) VALUES ("Guarani","gn");
INSERT country_code (country,code) VALUES ("Gujarati","gu");
INSERT country_code (country,code) VALUES ("Hausa","ha");
INSERT country_code (country,code) VALUES ("Hebrew","he");
INSERT country_code (country,code) VALUES ("Hindi","hi");
INSERT country_code (country,code) VALUES ("Hungarian","hu");
INSERT country_code (country,code) VALUES ("Icelandic","is");
INSERT country_code (country,code) VALUES ("Indonesian","id");
INSERT country_code (country,code) VALUES ("Interlingua","ia");
INSERT country_code (country,code) VALUES ("Interlingue","ie");
INSERT country_code (country,code) VALUES ("Inuktitut","iu");
INSERT country_code (country,code) VALUES ("Inupiak","ik");
INSERT country_code (country,code) VALUES ("Irish","ga");
INSERT country_code (country,code) VALUES ("Italian","it");
INSERT country_code (country,code) VALUES ("Japanese","ja");