tika-dev mailing list archives

From "Mattmann, Chris A (398J)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: Patches for parser.microsoft.WordExtractor
Date Mon, 01 Jul 2013 22:03:09 GMT
Dear Denis,

Thank you for your contribution to Tika!

Filing an issue would be great. Please head over here:

https://issues.apache.org/jira/browse/TIKA

Please sign up for an account, create an issue
and then attach your patch there. I for one would
welcome the contribution and am happy to help shepherd
it into the sources.

Thank you!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: kildishev <kildishev@ispras.ru>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Monday, July 1, 2013 5:00 AM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Cc: Khoroshilov <khoroshilov@ispras.ru>
Subject: Patches for parser.microsoft.WordExtractor

>Dear Tika developers,
>
>My name is Denis Kildishev and I work for the Institute for System
>Programming of the Russian Academy of Sciences (ISPRAS). We use Apache
>Tika in our open source project Requality
>(https://forge.ispras.ru/projects/reqdb) for doc->xhtml conversion. One
>of our requirements is that the XHTML output look visually close to
>the original doc.
>
>Working with the current version of Tika, we found that some
>improvements can be made to it. I'd like to introduce some modifications
>that we made to WordExtractor in the parsers package. They include
>support for lists, table borders (according to the 2007 specification),
>and some additional changes to styling and indents. Also, our version of
>this parser has an XHTML command buffer that helps deal with the problem
>of nested tables. If possible, I'd like to contribute those changes back
>to the Tika project. As the first of several possible patches, I'd like
>to present changes to table representation.
>
>This patch changes table representation. The border color information
>follows the specification of the 2007 format. Cell spanning is taken
>from the POI HTML parser.
>
>Some of the patches, including this one, alter the structure of the
>generated XHTML file. Several existing unit tests are changed to account
>for this. All those changes preserve the original purpose of each test,
>but achieve it in a different way. One example is the check that a table
>is present in the output. In the current trunk version, this is checked
>by looking for the exact "<table>" string. Once we introduce styling on
>the table, that exact match no longer works, so we can look for "<table"
>instead.
>
>I will create a corresponding ticket and attach my patch there.
>This is my first contribution to an Apache project, so I would
>appreciate your guidance on how to proceed.
>
>Yours sincerely,
>Denis Kildishev
>Software Engineering Department, ISPRAS
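
The test change described in the quoted message (matching "<table" rather than the exact "<table>" string) can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name TableCheckSketch, both helper methods, and the sample XHTML strings are hypothetical and are not taken from Tika's test suite or from the patch itself.

```java
public class TableCheckSketch {

    // Hypothetical helper: the strict check, which only matches a table
    // tag that carries no attributes at all.
    static boolean hasExactTableTag(String xhtml) {
        return xhtml.contains("<table>");
    }

    // Hypothetical helper: the relaxed check, which matches the start of a
    // table tag whether or not attributes such as style follow it.
    static boolean hasTableElement(String xhtml) {
        return xhtml.contains("<table");
    }

    public static void main(String[] args) {
        // Assumed sample output without styling.
        String plain = "<body><table><tr><td>a</td></tr></table></body>";
        // Assumed sample output once border styling is emitted on the table.
        String styled = "<body><table style=\"border: 1px solid #000000\">"
                + "<tr><td>a</td></tr></table></body>";

        System.out.println(hasExactTableTag(plain));   // true
        System.out.println(hasExactTableTag(styled));  // false
        System.out.println(hasTableElement(plain));    // true
        System.out.println(hasTableElement(styled));   // true
    }
}
```

The relaxed check is a plain prefix match, so it would also accept any other element whose name merely begins with "table". A test that needs to be stricter could parse the output and look up the element by name instead.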

