Reading Word 2007 Documents

Microsoft Word 2007 saves files in an XML-based format by default. A
.docx file is really a zip archive, so to read the raw data, simply
unzip (yes, unzip) it. Then look in the "word" folder for a file called
"document.xml". This is the main body text as XML; the styles, the font
table and the rest of the cool stuff live in separate files.

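To convert your Word 2007 (docx) file into plain text, feed that
document.xml to the perl script below on standard input. For example,
if you saved the script as docx2txt.pl (any name will do):

    unzip -p myfile.docx word/document.xml | perl docx2txt.pl > myfile.txt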

Email me if you have questions: ryan@dragonfires.net

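#!/usr/bin/perl
use strict;
use warnings;

# Input XML from STDIN. document.xml is usually one or two very long
# lines, so read it all at once; that also lets the tag pattern below
# match a tag that happens to span a line break.
my $xml = do { local $/; <> };
$xml = '' unless defined $xml;

# Add a newline where MS Word ends a paragraph (</w:p>).
# Note: </w:r> only closes a run of same-formatted text inside a
# paragraph, so breaking there would split lines in mid-sentence.
$xml =~ s{</w:p>}{\n}g;

# Remove all XML tags.
$xml =~ s/<[^>]*>//g;

# Print it.  We're done.
print $xml;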
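
If you would rather skip the manual unzip step, the same idea can be
sketched as one script that pulls word/document.xml out of the .docx
itself. This is only a rough sketch under a few assumptions: the
function names docx_xml_to_text and read_document_xml are made up for
this example, the external "unzip" command is assumed to be installed
on a Unix-like system, and only the five predefined XML entities are
decoded (numeric character references are left alone).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert the contents of word/document.xml to plain text.
sub docx_xml_to_text {
    my ($xml) = @_;
    $xml =~ s{</w:p>}{\n}g;     # end of paragraph -> newline
    $xml =~ s/<[^>]*>//g;       # drop every remaining tag
    # Decode the five predefined XML entities. &amp; goes last so that
    # "&amp;lt;" ends up as "&lt;" and not as "<".
    $xml =~ s/&lt;/</g;
    $xml =~ s/&gt;/>/g;
    $xml =~ s/&quot;/"/g;
    $xml =~ s/&apos;/'/g;
    $xml =~ s/&amp;/&/g;
    return $xml;
}

# Hypothetical helper: read word/document.xml straight out of a .docx
# by piping from the external "unzip" command (assumed to be on PATH).
sub read_document_xml {
    my ($docx) = @_;
    open(my $fh, '-|', 'unzip', '-p', $docx, 'word/document.xml')
        or die "cannot run unzip: $!";
    local $/;                   # slurp mode: read everything at once
    my $xml = <$fh>;
    close($fh) or die "unzip failed on $docx";
    return defined $xml ? $xml : '';
}

# With a file argument, convert that .docx; otherwise show a sample.
if (@ARGV) {
    print docx_xml_to_text(read_document_xml($ARGV[0]));
} else {
    my $sample =
        '<w:p><w:r><w:t>Fish </w:t></w:r><w:r><w:t>&amp; chips</w:t></w:r></w:p>'
      . '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>';
    print docx_xml_to_text($sample);
}
```

Run with no arguments, it prints the built-in sample as two lines,
"Fish & chips" and "Second paragraph". Run with the path to a .docx
file as its only argument, it prints that document's text instead.
Note that the two runs in the first sample paragraph stay on one line,
because the newline is added at </w:p> and not at </w:r>.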