readworddocpython

Reading the contents of a Word document with Python

import fnmatch, os, win32com.client

readpath = r'D:\123'

# Start Word through COM (requires Windows with Word and pywin32 installed)
wordapp = win32com.client.gencache.EnsureDispatch("Word.Application")
try:
    for path, dirs, files in os.walk(readpath):
        for filename in files:
            if not fnmatch.fnmatch(filename, '*.docx'):
                continue
            doc = os.path.abspath(os.path.join(path, filename))
            print('processing %s' % doc)
            wordapp.Documents.Open(doc)
            # 'name.docx' -> 'name.txt'
            docastext = doc[:-4] + 'txt'
            wordapp.ActiveDocument.SaveAs(docastext, FileFormat=win32com.client.constants.wdFormatText)
            wordapp.ActiveDocument.Close()
finally:
    wordapp.Quit()

print('end')

# Read one converted file back; the text is decoded as GBK, as in the original
with open(r'd:\123\test.txt', 'r', encoding='gbk') as f:
    for line in f:
        print(line, end='')
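The file-discovery part of the script above can be tried without Word installed. The following is a minimal sketch, not part of the original answer: the helper name find_docx_files is hypothetical, and it swaps the doc[:-4] slice for os.path.splitext so that the target name does not depend on the length of the extension.

```python
import fnmatch
import os


def find_docx_files(root):
    """Yield (docx_path, txt_path) pairs for every .docx file under root.

    Hypothetical helper for illustration; it only computes paths and
    does not start Word.
    """
    for path, _dirs, files in os.walk(root):
        for filename in sorted(files):
            if not fnmatch.fnmatch(filename, '*.docx'):
                continue
            docx_path = os.path.abspath(os.path.join(path, filename))
            # splitext removes the extension whatever its length
            txt_path = os.path.splitext(docx_path)[0] + '.txt'
            yield docx_path, txt_path
```

Each pair could then be passed to Documents.Open and SaveAs inside the try block shown earlier.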

How to read a Word file with Python

>>> def PrintAllParagraphs(doc):
...     count = doc.Paragraphs.Count
...     for i in range(count - 1, -1, -1):
...         pr = doc.Paragraphs[i].Range
...         print(pr.Text)
...
>>> app = my.Office.Word.GetInstance()
>>> doc = app.Documents[0]
>>> PrintAllParagraphs(doc)
1. What is a field
Field application basics

@staticmethod
def GetInstance():
    '''Get the Application object of the Word application'''
    import win32com.client
    return win32com.client.Dispatch('Word.Application')

my.Office.Word.GetInstance is implemented as shown above; it is a thin wrapper that uses win32com to drive Word's COM interface. Every Paragraph object exposes its text through Paragraph.Range.Text.
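Text returned by Paragraph.Range.Text normally ends with the paragraph mark (a carriage return), and text taken from table cells also carries a cell marker. The sketch below shows one way to strip these before further processing. The helper names are hypothetical, and it assumes doc is a Word Document object obtained through win32com as above.

```python
def clean_paragraph_text(raw):
    """Strip the trailing paragraph mark and table-cell marker, if present."""
    return raw.rstrip('\r\x07')


def paragraphs_as_list(doc):
    """Return the cleaned text of every paragraph of a Word COM Document.

    Only doc.Paragraphs and Paragraph.Range.Text are used, so any object
    exposing those attributes works.
    """
    return [clean_paragraph_text(p.Range.Text) for p in doc.Paragraphs]
```

Calling paragraphs_as_list(doc) returns the paragraphs in document order rather than in the reverse order used by PrintAllParagraphs.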

How to read Word with Python

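Use Python's built-in open() to read a plain text file:

f = None
try:
    f = open('/file', 'r')
    print(f.read())
finally:
    # f stays None if open() failed, so close() is only called on a real file
    if f:
        f.close()

To read a Word document, a third-party package is recommended: python-docx, which can be downloaded from its official site.

Usage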

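import docx

file_path = 'example.docx'  # placeholder: replace with the path to your .docx file
document = docx.Document(file_path)

# Join the text of all paragraphs, separated by blank lines
docText = '\n\n'.join([
    paragraph.text for paragraph in document.paragraphs
])

print(docText)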
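If installing python-docx is not an option, the paragraph text can also be pulled out with the standard library alone, because a .docx file is a ZIP archive whose main body lives in word/document.xml. This is a different technique from python-docx, shown only as a rough sketch: the function name read_docx_paragraphs is hypothetical, and headers, footnotes and formatting are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the <w:p> and <w:t> elements
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def read_docx_paragraphs(path):
    """Return the text of each <w:p> paragraph in word/document.xml."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    paragraphs = []
    for para in root.iter(W_NS + 'p'):
        # A paragraph's visible text is split across several <w:t> runs
        paragraphs.append(''.join(t.text or '' for t in para.iter(W_NS + 't')))
    return paragraphs
```

Joining the result with '\n\n' gives output comparable to the python-docx example, and the same code works on Linux because it does not need Word.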

Can Python open a Word document?

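First download and install win32com (part of the pywin32 package).

from win32com import client as wc
word = wc.Dispatch('Word.Application')
doc = word.Documents.Open('c:/test')
doc.SaveAs('c:/test.text', 2)
doc.Close()
word.Quit()

A text file produced this way cannot be read by Python in plain 'r' mode. To make it readable in 'r' mode, write the call as:

doc.SaveAs('c:/test', 4)

Note: once the call completes, the .txt suffix is added to the file name automatically, even though no suffix was given.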

On Windows XP, the file should then be opened with open(r'c:\text','r').

The save-format constants are:

wdFormatDocument = 0
wdFormatDocument97 = 0
wdFormatDocumentDefault = 16
wdFormatDOSText = 4
wdFormatDOSTextLineBreaks = 5
wdFormatEncodedText = 7
wdFormatFilteredHTML = 10
wdFormatFlatXML = 19
wdFormatFlatXMLMacroEnabled = 20
wdFormatFlatXMLTemplate = 21
wdFormatFlatXMLTemplateMacroEnabled = 22
wdFormatHTML = 8
wdFormatPDF = 17
wdFormatRTF = 6
wdFormatTemplate = 1
wdFormatTemplate97 = 1
wdFormatText = 2
wdFormatTextLineBreaks = 3
wdFormatUnicodeText = 7
wdFormatWebArchive = 9
wdFormatXML = 11
wdFormatXMLDocument = 12
wdFormatXMLDocumentMacroEnabled = 13
wdFormatXMLTemplate = 14
wdFormatXMLTemplateMacroEnabled = 15
wdFormatXPS = 18

Each name should map to the file format it suggests. If you are on Office 2003, not all of these formats may be supported.

There are two formats for converting a Word file to HTML: wdFormatHTML and wdFormatFilteredHTML (values 8 and 10). The difference is that with wdFormatHTML, OLE objects in the Word file such as formulas are stored as WMF images, while with wdFormatFilteredHTML the formula images are stored as GIF. On visual inspection, the HTML generated with wdFormatFilteredHTML is also noticeably cleaner than the HTML generated with wdFormatHTML.
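Passing bare numbers such as 2 or 4 to SaveAs is easy to get wrong, so it can help to give the values names on the Python side. The following is only a sketch: it copies a few values from the list above into an IntEnum, and the save_format_for helper and its extension mapping are assumptions made for illustration, not part of Word's API.

```python
from enum import IntEnum


class WdSaveFormat(IntEnum):
    """A subset of the save-format values listed above."""
    wdFormatText = 2
    wdFormatDOSText = 4
    wdFormatRTF = 6
    wdFormatHTML = 8
    wdFormatFilteredHTML = 10
    wdFormatPDF = 17


# Hypothetical mapping from a target extension to a save format
_BY_EXTENSION = {
    '.txt': WdSaveFormat.wdFormatText,
    '.rtf': WdSaveFormat.wdFormatRTF,
    '.html': WdSaveFormat.wdFormatFilteredHTML,
    '.pdf': WdSaveFormat.wdFormatPDF,
}


def save_format_for(extension):
    """Return the save format for an extension such as '.pdf'."""
    try:
        return _BY_EXTENSION[extension.lower()]
    except KeyError:
        raise ValueError('no save format known for %r' % extension)
```

Because IntEnum members behave like integers, a call such as doc.SaveAs(target, int(save_format_for('.txt'))) would pass the same value 2 as the numeric form used earlier.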

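Of course, you can also call the Office API through COM from any other language, for example PHP.

from win32com import client as wc
import re

# Convert the document to DOS text
word = wc.Dispatch('Word.Application')
doc = word.Documents.Open(r'c:/test1.doc')
doc.SaveAs('c:/test1.text', 4)
doc.Close()

strings = open(r'c:\test1.text', 'r').read()

# Matches an answer letter in parentheses, e.g. ( A ).
# \xa1 is a GBK byte value (a full-width space is \xa1\xa1), so this
# assumes the file contents are handled as a GBK byte string.
pattern = r'\(\s*[A-D]\s*\)|\(\xa1*[A-D]\xa1*\)|\(\s*[A-D]\s*\)|\(\xa1*[A-D]\xa1*\)'
result = re.findall(pattern, strings)
chan = re.sub(pattern, '()', strings)

# Questions with the answers blanked out
question = open(r'c:\question', 'a+')
question.write(chan)
question.close()

# Answer key: one "number letter" pair per line
answer = open(r'c:\answeronly', 'a+')
for i, a in enumerate(result):
    m = re.search('[A-D]', a)
    answer.write(str(i + 1) + ' ' + m.group() + '\n')
answer.close()

# \xa3\xa8 and \xa3\xa9 are the GBK bytes of the full-width parentheses.
# Do not type the full-width parentheses literally in the pattern; that easily causes ambiguity.
chan = re.sub(r'\xa3\xa8\s*[A-D]\s*\xa3\xa9', '()', strings)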
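The script above works on GBK byte strings, which ties it to Python 2. As a rough Python 3 counterpart, the sketch below operates on already-decoded text and accepts both ASCII and full-width parentheses. The names ANSWER_RE and split_answers are hypothetical, and the pattern is a simplification rather than the original author's exact expression.

```python
import re

# An answer letter A-D inside ASCII or full-width parentheses.
# In Python 3 str patterns, \s also matches the full-width space U+3000.
ANSWER_RE = re.compile(r'[(（]\s*([A-D])\s*[)）]')


def split_answers(text):
    """Return (text with answers blanked out, list of answer letters)."""
    answers = ANSWER_RE.findall(text)
    blanked = ANSWER_RE.sub('()', text)
    return blanked, answers


def format_answer_key(answers):
    """Format the answers as numbered lines, e.g. '1 B'."""
    return ''.join('%d %s\n' % (i + 1, a) for i, a in enumerate(answers))
```

The text passed to split_answers could come from a file opened with an explicit encoding, for example open(path, encoding='gbk').read().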

How to read Word file information with Python on Linux

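One thing must be made clear: unlike tools such as Illustrator, InDesign, CorelDRAW, OpenOffice Draw and Inkscape, Word paginates by reflowing content, and the file itself does not store the result of pagination.

Where each page breaks and how many pages result can only be determined after all text and graphics have been rendered on the spot. (In short: a Word file contains only lines of text plus the page size specified in the page setup. Each time Word opens the file, it lays the text out line by line and starts a new page whenever the current one is full. The real Word rendering engine certainly behaves in more complex ways.)

Reading the page count directly from a .doc/.docx file is therefore a false premise, so do not look for a solution in the direction of "reading the page count directly". Poor technique in software development can be corrected, but a wrong direction is fatal. You need a tool that can actually render the contents of the Word file and that supports programmatic use.

Only by rendering all the contents of the Word file into viewable pages can the total page count be known accurately. On Linux, LibreOffice can probably do this. On Windows, the obvious choice is Word itself.

Note that Word's pagination result is not guaranteed. Missing fonts, different glyphs, different software environments and other factors can all cause the same Word file to have a different page count on different computers. Servers are no exception.

Treat any page count you obtain as a reference only, and do not rely on it 100%.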
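As a rough illustration of the rendering route on Linux, the sketch below asks LibreOffice to convert a document to PDF in headless mode. It assumes LibreOffice is installed and that its soffice executable is on the PATH. The helper names are hypothetical, and the page count would still have to be read from the resulting PDF with a separate PDF tool.

```python
import subprocess


def build_convert_command(doc_path, out_dir, soffice='soffice'):
    """Build the LibreOffice command line that renders doc_path to PDF."""
    return [soffice, '--headless', '--convert-to', 'pdf',
            '--outdir', out_dir, doc_path]


def convert_to_pdf(doc_path, out_dir):
    """Run the conversion; raises CalledProcessError if LibreOffice fails."""
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
```

As the text warns, the page count of the PDF reflects the fonts and layout engine of the machine doing the rendering, so it may differ from what Word shows.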
