Text::Extract::Word(3) User Contributed Perl Documentation Text::Extract::Word(3)

Text::Extract::Word - Extract text from Word files

 # object-based interface
 use Text::Extract::Word;
 my $file = Text::Extract::Word->new("test1.doc");
 my $text = $file->get_text();
 my $body = $file->get_body();
 my $footnotes = $file->get_footnotes();
 my $headers = $file->get_headers();
 my $annotations = $file->get_annotations();
 my $bookmarks = $file->get_bookmarks();
 
 # specify :raw if you don't want the text cleaned
 my $raw = $file->get_text(':raw');

 # legacy interface
 use Text::Extract::Word qw(get_all_text);
 my $text = get_all_text("test1.doc");

This simple module extracts the textual contents of a Word file. The code was ported from Java code that was originally part of the Apache POI project, but extensive changes were made internally.

The constructor, Text::Extract::Word->new, accepts either a file name or an open file handle and returns an instance that can be used to query the file contents.
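
As a minimal sketch of the file-handle form, the script below reads the same document once by name and once through a handle, then compares the body text. The helper name body_via_handle is hypothetical, the path comes from the command line, and switching the handle to binary mode is an assumption made here because Word files are binary, not a requirement stated by this page.

```perl
use strict;
use warnings;
use IO::File;

# Return the body text of a Word file read through an already-open handle.
# (body_via_handle is a hypothetical helper, not part of the module.)
sub body_via_handle {
    my ($path) = @_;
    require Text::Extract::Word;    # loaded on demand

    my $fh = IO::File->new($path, 'r') or die "Cannot open $path: $!";
    binmode $fh;                    # assumption: treat the file as binary

    my $doc  = Text::Extract::Word->new($fh);
    my $body = $doc->get_body();
    $fh->close;
    return $body;
}

if (@ARGV) {
    my $path = $ARGV[0];
    require Text::Extract::Word;

    # The same file, opened by name.
    my $by_name = Text::Extract::Word->new($path)->get_body();

    print "Both forms returned the same body text\n"
        if $by_name eq body_via_handle($path);
}
```

Run it as `perl script.pl test1.doc`; with no argument the script does nothing.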

All the query methods accept an optional filter argument that can take the value ':raw' -- if this is passed the original Word file contents will be returned without any attempt to clean the text.

The default filter attempts to remove Word internal characters used to identify fields (including field instructions), and translate common Unicode 'fancy' quotes into more conventional ISO-8859-1 equivalents, for ease of processing. Table cell markers are also translated into tabs, and paragraph marks into Perl newlines.
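
One way to see what the default filter changes is to print the cleaned and the ':raw' body text side by side with every character outside printable ASCII escaped. The helper visible below is hypothetical and only affects how the text is displayed; it makes no claim about which characters the module itself removes or translates.

```perl
use strict;
use warnings;

# Escape everything except tab, newline and printable ASCII as \x{..},
# so control characters and 'fancy' quotes show up in the output.
# (visible is a hypothetical display helper, not part of the module.)
sub visible {
    my ($text) = @_;
    $text =~ s/([^\t\n\x20-\x7E])/sprintf('\\x{%02X}', ord $1)/ge;
    return $text;
}

if (@ARGV) {
    require Text::Extract::Word;    # loaded on demand
    my $doc = Text::Extract::Word->new($ARGV[0]);

    print "cleaned:\n", visible($doc->get_body()),       "\n";
    print "raw:\n",     visible($doc->get_body(':raw')), "\n";
}
```

Because the output is pure ASCII, the comparison does not depend on the terminal's encoding.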

get_body returns the text for the main body of the Word document. This excludes headers, footers, and annotations.

get_headers returns the header and footer texts for the Word document, as a single scalar string.

get_footnotes returns the footnote and endnote texts for the Word document, as a single scalar string.

get_annotations returns the annotation texts for the Word document, as a single scalar string.

get_text returns the concatenated text from the body, headers, footnotes, and annotations of the Word document, as a single scalar string.

get_bookmarks returns the bookmark texts for the Word document, as a hash reference. The keys in the hash are the bookmark names (Word requires that these are unique) and the values are the filtered bookmark texts.

This method can be used to get Word form text data out of a Word file. All text fields in a Word form will normally be labelled as bookmarks, and will be returned by this method. Non-textual form fields (including drop-downs) will not be returned, as these are not labelled as bookmarks.
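
Since the bookmarks come back as a hash reference keyed by bookmark name, dumping the text fields of a form is a short loop. The sketch below prints one "name: text" line per bookmark, sorted by name; format_bookmarks is a hypothetical helper, and the path is taken from the command line.

```perl
use strict;
use warnings;

# Render a bookmark hash reference as "name: text" lines, sorted by name.
# (format_bookmarks is a hypothetical helper, not part of the module.)
sub format_bookmarks {
    my ($bookmarks) = @_;
    return join '', map { "$_: $bookmarks->{$_}\n" } sort keys %$bookmarks;
}

if (@ARGV) {
    require Text::Extract::Word;    # loaded on demand
    my $doc = Text::Extract::Word->new($ARGV[0]);
    print format_bookmarks($doc->get_bookmarks());
}
```

Sorting the keys keeps the output stable between runs, because Perl does not guarantee hash ordering.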

get_all_text, the only function exportable by this module, takes a file name and returns the raw text contents of the Word file. The contents are returned as UTF-8 encoded text. The text is unfiltered, for compatibility with previous versions of the module.

To do: handle non-textual form fields.

To do: support legacy Word formats. The module does not currently extract text from Word version 6 or earlier.

OLE::Storage also has a script, "lhalw" (Let's Have a Look at Word), which extracts text from Word files. Text::Extract::Word is simply a much smaller module with lighter dependencies, using OLE::Storage_Lite for its storage management.

Stuart Watt, stuart@morungos.com

Copyright (c) 2010 Stuart Watt. All rights reserved.
2012-03-08 perl v5.32.1
