tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] [Reopened] (TIKA-775) Embed Capabilities
Date Mon, 19 Nov 2012 15:26:59 GMT

     [ https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Jukka Zitting reopened TIKA-775:

There are a few problems with the implementation:

* The ExternalEmbedderTest fails in a plain Windows environment since it can't find {{sed}}.
I added a workaround in revision 1411238 that simply disables the test on Windows.
* It would be better if ExternalEmbedderTest was located in {{tika-core}} along with the ExternalEmbedder
class itself. The use of TXTParser in the test case seems unnecessary.
* More generally the test case is quite complicated. Is it being reused elsewhere, or can
we simplify it? I'd just drop all the extra logging, error handling and flag variables.
* The ExternalEmbedder class also seems quite complicated, though I notice much of it comes
from ExternalParser. Can we, for example, refactor the common bits into a shared base class?
* See the ExternalParser class for how you can (and should) use the TemporaryResources class
to avoid all the complex cleanup logic. Used properly, the {{dispose()}} method takes care
of all that.
* It's usually a bad idea to catch InterruptedException and just ignore it. Throwing the
exception (possibly wrapped in a TikaException) is probably a better approach.
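
To illustrate the last point, here is a minimal JDK-only sketch. The class {{InterruptHandling}}, the helper {{waitForProcess}} and the {{WrappedException}} type are hypothetical names; {{WrappedException}} only stands in for TikaException so that the snippet compiles without Tika on the classpath. It shows one possible shape, not the actual ExternalEmbedder code.

```java
/** Hypothetical holder class for the sketch. */
final class InterruptHandling {

    /** Stand-in for TikaException (assumption: keeps the sketch JDK-only). */
    static class WrappedException extends Exception {
        WrappedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Hypothetical helper: waits for an external command and reports an
     * interruption to the caller instead of silently swallowing it.
     */
    static int waitForProcess(Process process) throws WrappedException {
        try {
            return process.waitFor();
        } catch (InterruptedException e) {
            // Restore the interrupt flag so code further up the stack can see it
            Thread.currentThread().interrupt();
            // Do not leave the external command running in the background
            process.destroy();
            throw new WrappedException(
                    "Interrupted while waiting for external command", e);
        }
    }
}
```

Restoring the interrupt flag before rethrowing means a caller that only checks {{Thread.isInterrupted()}} still notices the interruption, even if it never inspects the wrapped cause.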
> Embed Capabilities
> ------------------
>                 Key: TIKA-775
>                 URL: https://issues.apache.org/jira/browse/TIKA-775
>             Project: Tika
>          Issue Type: Improvement
>          Components: general, metadata
>    Affects Versions: 1.0
>         Environment: The default ExternalEmbedder requires that sed be installed.
>            Reporter: Ray Gauss II
>              Labels: embed, patch
>             Fix For: 1.3
>         Attachments: embed_20121029.diff, embed.diff, tika-core-embed-patch.txt, tika-parsers-embed-patch.txt
> This patch defines and implements the concept of embedding Tika metadata into a file stream, the reverse of extraction.
> In the tika-core project, an interface defining an Embedder and a generic sed-based ExternalEmbedder implementation, meant to be extended or configured, are added. These classes are essentially a reverse flow of the existing Parser and ExternalParser classes.
> In the tika-parsers project, an ExternalEmbedderTest unit test is added. It uses the default ExternalEmbedder (which calls sed) to embed a value placed in Metadata.DESCRIPTION, then verifies the operation by parsing the resulting stream.
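
The embed-then-verify flow described above can be sketched roughly as follows. This is a simplified, JDK-only illustration and not the API from the patch: {{SimpleEmbedder}} and {{HeaderLineEmbedder}} are hypothetical names, a plain {{Map}} stands in for Tika's Metadata class, and the metadata is written in pure Java rather than by calling sed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical, simplified embedder contract; a Map stands in for Tika's Metadata. */
interface SimpleEmbedder {
    void embed(Map<String, String> metadata, InputStream original, OutputStream output)
            throws IOException;
}

/** Illustrative only: prepends "name: value" lines to a text stream. */
final class HeaderLineEmbedder implements SimpleEmbedder {
    @Override
    public void embed(Map<String, String> metadata, InputStream original, OutputStream output)
            throws IOException {
        // Write one header line per metadata entry
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String line = entry.getKey() + ": " + entry.getValue() + "\n";
            output.write(line.getBytes(StandardCharsets.UTF_8));
        }
        // Copy the original content unchanged after the embedded lines
        byte[] buffer = new byte[8192];
        int n;
        while ((n = original.read(buffer)) != -1) {
            output.write(buffer, 0, n);
        }
    }
}
```

A test in the spirit of ExternalEmbedderTest would put a description value into the map, call {{embed}}, and then read the output stream back to check that the value is present.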

