tika-dev mailing list archives

From "Mattmann, Chris A (398J)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: Patches for parser.microsoft.WordExtractor
Date Mon, 01 Jul 2013 22:03:09 GMT
Dear Denis,

Thank you for your contribution to Tika!

Filing an issue would be great, head over here:


Please sign up for an account, create an issue
and then attach your patch there. I for one would
welcome the contribution and am happy to help shepherd
it into the sources.

Thank you!


Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA

-----Original Message-----
From: kildishev <kildishev@ispras.ru>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Monday, July 1, 2013 5:00 AM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Cc: Khoroshilov <khoroshilov@ispras.ru>
Subject: Patches for parser.microsoft.WordExtractor

>Dear Tika developers,
>My name is Denis Kildishev and I work at the Institute for System
>Programming of the Russian Academy of Sciences (ISPRAS). We use Apache
>Tika in our open source project Requality
>(https://forge.ispras.ru/projects/reqdb) for doc->xhtml conversion. One
>of our requirements is an XHTML visual representation that stays close
>to that of the original doc.
>Working with the current version of Tika, we found that some improvements
>could be made to it. I'd like to introduce some modifications that were
>made to the Word Extractor from the parsers package. They include support
>for lists, table borders (according to the 2007 specification) and some
>additional changes to styling and indents. Also, our version of this
>parser has an XHTML commands buffer that helps deal with the problem of
>nested tables. If possible, I'd like to contribute those changes back to
>the Tika project. As the first of the possible patches, I'd like to
>present the changes to table representation.
>This patch changes how tables are represented. The border color
>information follows the specification of the 2007 format. Cell spanning
>is taken from the POI HTML parser.
>Some of the patches, including this one, alter the structure of the
>generated XHTML file, so the existing unit tests were changed to
>account for this. All those changes preserve the original purpose of
>each test, but achieve it in a different way. One example is the check
>that a table is present in the output file. In the trunk version, this
>is checked by looking for a plain "<table>". Once we introduce styling
>on the table, that exact string no longer matches, so we look for
>"<table" instead.
>I will create a corresponding ticket and I will attach my patch there.
>This is my first contribution to an Apache project, so I would appreciate
>it if you could guide me on how to proceed.
>Yours sincerely,
>Denis Kildishev
>Software Engineering Department, ISPRAS
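
The test adjustment described in the quoted message (looking for "<table" rather than the exact string "<table>") can be illustrated with a small standalone sketch. It does not use Tika or its test classes; the class name TableCheckSketch, its method names, and the sample XHTML strings are all hypothetical and exist only for illustration. The third method is an optional stricter variant that the message does not mention.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Tika or its unit tests.
public class TableCheckSketch {

    // Optional stricter variant (an assumption, not from the message):
    // "<table" must be followed by whitespace, "/" or ">", so that a
    // tag such as "<tablet>" does not count as a table.
    private static final Pattern TABLE_OPEN = Pattern.compile("<table[\\s/>]");

    // Trunk-style check: matches only a table tag with no attributes.
    public static boolean hasBareTableTag(String xhtml) {
        return xhtml.contains("<table>");
    }

    // Check proposed in the message: also matches tags carrying attributes.
    public static boolean hasTableTagPrefix(String xhtml) {
        return xhtml.contains("<table");
    }

    // Stricter variant using the pattern above.
    public static boolean hasTableElement(String xhtml) {
        return TABLE_OPEN.matcher(xhtml).find();
    }

    public static void main(String[] args) {
        // Made-up sample output with a styled table.
        String styled = "<table style=\"border: 1px solid #000000\">"
                + "<tr><td>cell</td></tr></table>";

        System.out.println("bare tag check:   " + hasBareTableTag(styled));
        System.out.println("prefix check:     " + hasTableTagPrefix(styled));
        System.out.println("stricter pattern: " + hasTableElement(styled));
    }
}
```

Once the opening tag carries a style attribute, the exact-string check no longer finds it, while the prefix check still does. The stricter pattern is one possible way to avoid accepting unrelated tags that merely begin with "<table".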
