I have an iterator that operates on a sequence of WARC documents and yields a modified list of tokens for each document:
from bs4 import BeautifulSoup

class MyCorpus(object):
    def __init__(self, warc_file_instance):
        self.warc_file = warc_file_instance

    def clean_text(self, html):
        soup = BeautifulSoup(html)  # parse the loaded HTML into a bs4 object
        for script in soup(["script",
-[\d]+) | ([\d]+ - [\d]+) | CAR
SMOULDERING | GAS BOTTLE EXPLOSION | INPUT | OFF | OPPOSITE | CNR |
SPARKINGHEARD | WASHAWAY AS A
RESULT OF ACCIDENT | ENTRANCE | ENT |FIRE| LHS | RHS | POWER LINES
ARCING AND SPARKING
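
A minimal, self-contained sketch of the iterator pattern described at the top of this question. It is not the original `MyCorpus` code, which is cut off above. Assumptions: the documents are given as a plain list of HTML strings rather than a WARC file object, the standard-library `html.parser` stands in for BeautifulSoup so that the example runs without third-party packages, and tokens are simply lowercased word characters. The names `TokenCorpus` and `_TextExtractor` are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)


class TokenCorpus:
    """Hypothetical stand-in for MyCorpus: yields one token list per document."""

    def __init__(self, html_documents):
        # Assumption: a re-iterable of HTML strings, not a real WARC file object.
        self.html_documents = html_documents

    def clean_text(self, html):
        parser = _TextExtractor()
        parser.feed(html)
        parser.close()  # flush any text still buffered by the parser
        return " ".join(parser.chunks)

    def __iter__(self):
        # A generator-based __iter__ lets the corpus be iterated more than once.
        for html in self.html_documents:
            text = self.clean_text(html)
            # Assumption: tokens are lowercased runs of word characters.
            yield re.findall(r"\w+", text.lower())


docs = [
    "<html><head><script>var x = 1;</script></head>"
    "<body><p>Hello WARC world</p></body></html>",
    "<p>Second doc</p>",
]
corpus = TokenCorpus(docs)
for tokens in corpus:
    print(tokens)
```

Because `__iter__` is a generator method, each `for` loop over `corpus` starts a fresh pass over the documents, which matters if a downstream consumer needs to stream the corpus several times.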