I'm trying to convert a docx file to text, but I keep getting an error. I'm using Python 2.7.
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

Traceback:
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 764: character maps to <undefined>

Posted on 2017-06-25 06:01:06
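It looks like it doesn't like \u2019, and it probably won't like \u2018 either. These are the left (\u2018) and right (\u2019) single quotation marks. I would encode the unicode data to ASCII and ignore anything that can't be converted, so that those characters are dropped: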
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        # Drop every character that has no ASCII equivalent
        txt = para.text.encode('ascii', 'ignore')
        fullText.append(txt)
    return '\n'.join(fullText)

Posted on 2017-06-25 06:05:06
It looks like the single quotation mark is the problem. Could you do something like this:
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        # Document has no replace() method, so swap the quote in each paragraph's text
        fullText.append(para.text.replace(u"\u2019", "'"))
    return '\n'.join(fullText)

Replying from my phone, so I can't test this.
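The two suggestions above can be combined. Below is a minimal sketch that works on plain strings, so it can be tried without a .docx file. The names `QUOTE_MAP` and `to_ascii_text` are hypothetical helpers for illustration and are not part of python-docx; `paragraphs` is assumed to stand in for the `para.text` values of a document.

```python
# Hypothetical helper for illustration; not part of python-docx.
QUOTE_MAP = {
    u"\u2018": u"'",  # left single quotation mark
    u"\u2019": u"'",  # right single quotation mark
}

def to_ascii_text(paragraphs):
    """Join paragraph strings, swapping curly single quotes for ASCII ones
    and dropping any remaining non-ASCII characters."""
    cleaned = []
    for text in paragraphs:
        # First replace the quotes we know about so they are not lost.
        for curly, plain in QUOTE_MAP.items():
            text = text.replace(curly, plain)
        # Anything still outside ASCII is silently dropped here.
        cleaned.append(text.encode("ascii", "ignore").decode("ascii"))
    return u"\n".join(cleaned)
```

With python-docx this would presumably be called as `to_ascii_text(p.text for p in docx.Document(filename).paragraphs)`. Replacing the quotes first keeps apostrophes in the output, whereas encoding with 'ignore' alone would remove them.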
https://stackoverflow.com/questions/44741226