poi-dev mailing list archives

From Antoni Mylka <antoni.my...@gmail.com>
Subject Re: Initial Word 6/95 support
Date Fri, 02 Jul 2010 22:17:04 GMT
On 2010-07-02 23:04, Nick Burch wrote:
> Hi All
>
> As you might've seen from my commits in the last few days, I've added
> some initial support to HWPF for word 6 and word 95 files. I've only
> been working with a view to doing text extraction (so I can ditch the
> text mining library from a work project). With lots of trial and error,
> some offset tips from WV's FIB parsing code, and some refactoring, we
> can now get text and paragraphs out of word 6 and word 95 files!
>
> To play with this, you'll want HWPFOldDocument / Word6Extractor (catch
> OldWordFileFormatException and switch to the old one as needed)
>
> I've got this working with various sample files produced by doing
> save-as from newer software. This means that it's not impossible that
> real Word 6 / Word 95 files will break it, especially if they're
> quick-saved (I didn't have any examples)
>
> As usual, please upload files that don't work to new bugzilla entries,
> or even better upload the broken file and the patch that fixes it :)

A great idea.

Did you try to compare the Word6Extractor against the one from text 
mining? How well does it extract text?

BTW, http://code.google.com/p/text-mining/ contains examples of
fast-saved files you could use in your tests. They probably can't be
committed to ASF for legal reasons (can they?), but they make great
tests nonetheless.

Antoni Myłka
antoni.mylka@gmail.com



