tika-dev mailing list archives

From "Ray Gauss II (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-775) Embed Capabilities
Date Sun, 28 Oct 2012 23:49:12 GMT

    [ https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485749#comment-13485749 ]

Ray Gauss II commented on TIKA-775:

Hi Jörg,

Note that the embed.diff file attached to the issue is more current and replaces the previous
patch.txt files.  I've also changed just a few things since posting embed.diff, primarily
around error handling.  I'll post another diff soon with Javadoc additions mentioned below.

1) I'm not sure exactly what you mean here.  The Parser interface only guarantees a parse
method and a set of supported types.  It says nothing about requiring the entire content to be
extracted by the implementation.  The Parser interface also does not specify how the given
input stream must be read or processed, so each implementation can do that however it sees
fit.  Similarly, the Embedder.embed method says nothing about requiring or preventing content
from being updated, so if a particular embedder implementation wants to update the content
itself, I suppose there's no reason it couldn't.

2) This is intentionally somewhat vague (perhaps too much so), as each embedder may implement
this slightly differently.  We should still have a suggested approach, and in general I think
that approach should favor preserving the source file's metadata unless a change is explicitly
specified.  I will add some of this to the Javadoc, but for your specific questions I think the
answers would be:

- Q: Does it always update all metadata in the file, i.e. does it delete properties that are
not in the Metadata object?
- A: Embedder implementations should only attempt to update metadata fields present in the
given Metadata object.

- Q: How are empty properties set?
- A: Embedder implementations should set properties as empty when the corresponding field
in the Metadata object is an empty string, i.e. "".

- Q: How do I delete properties?
- A: Embedder implementations should nullify or delete properties corresponding to fields
with a null value in the given Metadata object.

- Q: Where does the embedding take place?
- A: That's up to the embedder implementation and particular file format.

- Q: Does the embed method update properties in all metadata containers?
- A: Embedder implementations should set the property corresponding to a particular field
in the given Metadata object in all metadata containers whenever possible and appropriate
for the file format at the time.  If a particular metadata container falls out of use and/or
is superseded by another (such as IIM vs XMP for IPTC) it is up to the implementation to decide
if and when to cease embedding in the alternate container.

- Q: What happens for properties where the file format specific fields have a fixed length
or different encodings?
- A: Embedder implementations should attempt to embed as much of the metadata as accurately
as possible.  An implementation may choose a strict approach and throw an exception if a value
to be embedded exceeds the length allowed or may choose to truncate the value.

For that last one, we could consider adding a second embed method to Embedder that also accepts
a boolean isStrict parameter, allowing a single implementation to run in a mode where it
throws exceptions on bad data rather than doing something like truncating.  Implementations
could always provide that themselves, so I'm not sure we need it in the interface.
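
To make the suggested rules concrete, here is a minimal sketch of how the answers above and
the isStrict idea could fit together.  It is not the code in embed.diff: plain Maps stand in
for both the Metadata object and the file's property container, and EmbedRulesSketch, apply,
maxLength and the use of IllegalArgumentException are all hypothetical names and choices
made only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: Maps stand in for Tika's Metadata and for a file's properties.
final class EmbedRulesSketch {

    /**
     * Returns a copy of the existing properties with the requested updates applied.
     * Fields absent from updates are left untouched, "" sets an empty property,
     * and a null value deletes the property.
     */
    static Map<String, String> apply(Map<String, String> existing,
                                     Map<String, String> updates,
                                     int maxLength,
                                     boolean isStrict) {
        Map<String, String> result = new LinkedHashMap<>(existing);
        for (Map.Entry<String, String> entry : updates.entrySet()) {
            String name = entry.getKey();
            String value = entry.getValue();
            if (value == null) {
                // null value in the update: delete the property
                result.remove(name);
            } else if (value.length() > maxLength) {
                if (isStrict) {
                    // strict mode: refuse values the (assumed) fixed-length field cannot hold
                    throw new IllegalArgumentException(
                            name + " exceeds " + maxLength + " characters");
                }
                // lenient mode: embed as much of the value as fits
                result.put(name, value.substring(0, maxLength));
            } else {
                // normal case, including "" which sets the property to empty
                result.put(name, value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> existing = new LinkedHashMap<>();
        existing.put("title", "Old title");
        existing.put("author", "Someone");

        Map<String, String> updates = new LinkedHashMap<>();
        updates.put("title", "");       // set empty
        updates.put("author", null);    // delete
        updates.put("description", "A fairly long description");

        System.out.println(apply(existing, updates, 10, false));
    }
}
```

A real embedder would apply the same decisions per metadata container and per field, since
length limits and encodings differ between formats.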

3 and 5) The client is in control of the output stream as the client is responsible for creating
it and passing it to the embed method.  The Embedder needs the given input stream to read
the source data and writes the final data with metadata embedded to the given output stream.
As such, consumers of the embed method dictate what that output stream is, which will
probably be a temp file in most cases, and the client can refrain from writing to the actual
source file if it receives an exception.  See the ExternalEmbedderTest for an example
of creating a temp file output stream for the embedder to write to.
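
As a rough sketch of that client-side pattern (this is not the ExternalEmbedderTest code),
the following writes to a temp file and only replaces the source once the embed call has
returned normally.  StreamEmbedder is a hypothetical, simplified stand-in for the Embedder
interface that omits the Metadata and context arguments, and embedInPlace is likewise an
invented helper name.  It assumes Java 9 or later only in the usage shown afterwards.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class SafeEmbedSketch {

    // Hypothetical stand-in for Embedder.embed, reduced to the two streams.
    interface StreamEmbedder {
        void embed(InputStream original, OutputStream output) throws IOException;
    }

    /**
     * Runs the embedder against a temp file and replaces the source only on success.
     * If the embedder throws, the source file is never written to.
     */
    static void embedInPlace(Path source, StreamEmbedder embedder) throws IOException {
        Path temp = Files.createTempFile("embed-", ".tmp");
        try {
            try (InputStream in = Files.newInputStream(source);
                 OutputStream out = Files.newOutputStream(temp)) {
                embedder.embed(in, out);
            }
            // Reached only when embed returned normally and both streams are closed.
            Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // Cleans up after a failure; a no-op after a successful move.
            Files.deleteIfExists(temp);
        }
    }
}
```

A caller could pass something like `(in, out) -> { out.write(header); in.transferTo(out); }`
as the embedder, where `header` is whatever bytes the caller wants prepended, and would see
the source file change only if that lambda completes without throwing.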

4) Yes, parser implementations could choose to implement the Embedder interface as well.
That was the reason for naming getSupportedEmbedTypes differently than Parser's existing
getSupportedTypes.
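
To illustrate why the distinct name matters, here is a small sketch using simplified stand-in
interfaces rather than the real Tika ones (the real methods take a context argument and
return media type objects; SimpleParser, SimpleEmbedder and ImageHandler are hypothetical).
With different method names, one class can advertise a wider set of types for parsing than
for embedding.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class DualRoleSketch {

    // Simplified stand-ins for the Parser and Embedder interfaces.
    interface SimpleParser {
        Set<String> getSupportedTypes();
    }

    interface SimpleEmbedder {
        Set<String> getSupportedEmbedTypes();
    }

    // One class in both roles: it can read two formats but write metadata to only one.
    static final class ImageHandler implements SimpleParser, SimpleEmbedder {
        @Override
        public Set<String> getSupportedTypes() {
            return new HashSet<>(Arrays.asList("image/jpeg", "image/png"));
        }

        @Override
        public Set<String> getSupportedEmbedTypes() {
            return Collections.singleton("image/jpeg");
        }
    }
}
```

Had both interfaces declared a method with the same name and signature, a class implementing
both would have been limited to a single shared set of types.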

If the above doesn't answer your concerns I'm more than happy to flesh things out further.


> Embed Capabilities
> ------------------
>                 Key: TIKA-775
>                 URL: https://issues.apache.org/jira/browse/TIKA-775
>             Project: Tika
>          Issue Type: Improvement
>          Components: general, metadata
>    Affects Versions: 1.0
>         Environment: The default ExternalEmbedder requires that sed be installed.
>            Reporter: Ray Gauss II
>              Labels: embed, patch
>             Fix For: 1.3
>         Attachments: embed.diff, tika-core-embed-patch.txt, tika-parsers-embed-patch.txt
> This patch defines and implements the concept of embedding tika metadata into a file
> stream, the reverse of extraction.
> In the tika-core project an interface defining an Embedder and a generic sed ExternalEmbedder
> implementation meant to be extended or configured are added.  These classes are essentially
> a reverse flow of the existing Parser and ExternalParser classes.
> In the tika-parsers project an ExternalEmbedderTest unit test is added which uses the
> default ExternalEmbedder (calls sed) to embed a value placed in Metadata.DESCRIPTION then
> verify the operation by parsing the resulting stream.

