tika-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ray Gauss II (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-775) Embed Capabilities
Date Mon, 19 Nov 2012 16:37:59 GMT

    [ https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500351#comment-13500351

Ray Gauss II commented on TIKA-775:

* The ExternalEmbedderTest fails in a plain Windows environment since it can't find {{sed}}.
I added a workaround in revision 1411238 that simply disables the test on Windows.
Great, thanks.

* It would be better if ExternalEmbeddedTest was located in {{tika-core}} along with the ExternalEmbedder
class itself. The use of TXTParser in the test case seems unnecessary.
* More generally the test case is quite complicated. Is it being reused elsewhere, or can
we simplify it? I'd just drop all the extra logging, error handling and flag variables.
The {{ExternalEmbedderTest}} is indeed meant to be extended, an example can be found in the
{{tika-exiftool}} project [1].

The use of {{TXTParser}} in the test without that context does seem like overkill, but in
general I think we'll want to encourage tests that verify through a relevant parser that the
metadata was embedded properly.

The test certainly could be simplified, I kept it on the verbose side since it's introducing
a new concept.

* The ExternalEmbedder class also seems quite complicated, though I notice much of it comes
from ExternalParser. Can we for example refactor the common bits to a shared base class?
Embedders aren't required to be parsers and vice versa, and since {{ExternalParser}} extends
{{AbstractParser}} we can't have a common base class.  Some methods could probably be made
static and moved into utils classes though.

* See the ExternalParser class for how you can (and should) use the TemporaryResources class
to avoid all the complex cleanup logic. Used properly, the {{dispose()}} method takes care
of all that.
Yes, I followed the precedence there, but the timing of access and cleanup across the permutations
of various streams for {{outputFromStdOut}} true or false, input file, output file, stdErr,
etc. was a bit trickier than what the ExternalParser has to handle.  I'm sure this could be
optimized further though.

* It's usually a bad idea to capture InterruptedException and just ignore it. Throwing the
exception (possibly wrapped into a TikaException) is probably a better approach.
Yes, copied and pasted from ExternalParser and did not give it proper diligence, my mistake.
 I'll refactor that and the ExternalParser to wrap in a TikaException which is my approach
in most cases.

I'm on holiday for a few weeks soon and not sure I'll be able to make the changes I mentioned
before then, but I'll try.

[1] https://github.com/Alfresco/tika-exiftool/blob/master/src/test/java/org/apache/tika/embedder/exiftool/ExiftoolExternalEmbedderTest.java
> Embed Capabilities
> ------------------
>                 Key: TIKA-775
>                 URL: https://issues.apache.org/jira/browse/TIKA-775
>             Project: Tika
>          Issue Type: Improvement
>          Components: general, metadata
>    Affects Versions: 1.0
>         Environment: The default ExternalEmbedder requires that sed be installed.
>            Reporter: Ray Gauss II
>              Labels: embed, patch
>             Fix For: 1.3
>         Attachments: embed_20121029.diff, embed.diff, tika-core-embed-patch.txt, tika-parsers-embed-patch.txt
> This patch defines and implements the concept of embedding tika metadata into a file
stream, the reverse of extraction.
> In the tika-core project an interface defining an Embedder and a generic sed ExternalEmbedder
implementation meant to be extended or configured are added.  These classes are essentially
a reverse flow of the existing Parser and ExternalParser classes.
> In the tika-parsers project an ExternalEmbedderTest unit test is added which uses the
default ExternalEmbedder (calls sed) to embed a value placed in Metadata.DESCRIPTION then
verify the operation by parsing the resulting stream.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

View raw message