tika-dev mailing list archives

From Jörg Ehrlich (JIRA) <j...@apache.org>
Subject [jira] [Commented] (TIKA-775) Embed Capabilities
Date Sun, 28 Oct 2012 17:03:12 GMT

    [ https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485665#comment-13485665 ]

Jörg Ehrlich commented on TIKA-775:

Hi Ray,

I think it would be great if Tika could also write Metadata back to files, and it would be
great to start on this sooner rather than later.
But I have a couple of comments regarding your proposed implementation:

1) Right now the Parsers do both content and metadata extraction. The proposed embedder does
only Metadata embedding, which is fine because updating of content would be out of scope for Tika.
But if we introduce separate APIs to embed just metadata I think it would make sense to also
introduce APIs to only extract metadata. Actually at Adobe we had to stop using Tika to retrieve
Metadata from specific file formats because it always parses the whole content which is simply
too heavy an operation to scale in a larger system.
So I planned to get started on a new API and adjustments to parsers to just retrieve Metadata
from files, but have not had time for this yet. I guess it would make sense to synchronize
these two new APIs, right?
Being able to just parse Metadata from files is actually also very important for the embedding
of it, which I will explain further down.

2) Your documentation does not really specify in detail the behavior of the metadata update
that should happen.
Does it always update all metadata in the file, i.e. does it delete properties that are not
in the Metadata object? Or does it only update those properties that are provided in the Metadata
object? How do I delete properties then? Do I make the property empty? But empty properties
are in most metadata containers a valid property value and should not delete the property.
Where does the embedding take place? A lot of file formats have several metadata containers
with similar properties. Does the embed method update all of them? Or just the ones the parsers
were looking at? What happens in case of inconsistencies? Do you read/write from specific
fields or do you reconcile all of them together?
What happens for properties where the file format specific fields have a fixed length or different
encodings? Do you just write as much as possible and the rest is simply ignored? 

For all such questions, you have to think about whether it makes sense to provide the client
with the ability to either configure the embedder or provide a callback API for the client
to decide if specific scenarios arise or if the embedder should always just do a best guess
for the client.
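
To make the update-versus-delete question concrete, here is a minimal sketch of one possible
answer: the client states explicitly which properties to set and which to remove, so that an
empty value stays a valid value and untouched properties are preserved. The MetadataUpdate
class and its methods are hypothetical and not part of the proposed patch, and a plain Map
stands in for Tika's Metadata object.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical description of a metadata change: explicit sets and explicit removals. */
final class MetadataUpdate {
    private final Map<String, String> toSet = new LinkedHashMap<>();
    private final Set<String> toRemove = new LinkedHashSet<>();

    /** Sets a property; an empty string is stored as a real (empty) value. */
    MetadataUpdate set(String name, String value) {
        toRemove.remove(name);
        toSet.put(name, value);
        return this;
    }

    /** Marks a property for deletion; this is the only way to delete. */
    MetadataUpdate remove(String name) {
        toSet.remove(name);
        toRemove.add(name);
        return this;
    }

    /** Returns a copy of the existing metadata with this update applied. */
    Map<String, String> applyTo(Map<String, String> existing) {
        Map<String, String> result = new LinkedHashMap<>(existing);
        result.keySet().removeAll(toRemove); // explicit deletions
        result.putAll(toSet);                // explicit sets, including empty values
        return result;                       // everything else is left untouched
    }
}
```

With this shape, the embedder needs the existing metadata as input, which is the same
dependency on a read step that is discussed below.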

In any such case, it is usually important for the client to get the original metadata from
the file before writing it back, so that no properties are wrongly deleted or changed. But
even more so it is important for the Embedder as it would in most cases have to read the metadata
anyway, in order to know how to update the file properly. It usually has to check if an in-place
update of metadata can happen or if the whole file has to be restructured because the metadata
chunks have grown too large to fit where they were before.
That's why I think it would be important to have a get-only-metadata API and Parser capabilities
available before starting to write metadata back.

3) This also leads me to the topic of error recovery and safe updating of files. I think the
documentation should be more clear about what the Embedder will do in case of an error and
what is expected by the client. 
There are all sorts of reasons the embedding could fail. If that happens, the original file
usually ends up being corrupt and lost for the user. So it usually makes sense (for smaller
files) to do a safe update, which means writing the update to a new file and then swapping it
with the original one after the update was successful.
But what about scenarios where a partial update is possible? You often have files where just
specific metadata sections are corrupt because some tool did not read the spec and wrote it
wrongly. But the rest of the file is still ok, so other parts could still be updated. Do you
want to provide a callback API for the client to be able to react to error scenarios and decide
what it wants to do? The embedder could do a best guess action, but that is usually quite
dangerous for the user's files.
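
The safe update described above could look roughly like the following sketch, which uses
only java.nio.file from the JDK. The SafeUpdate class and the FileRewriter callback are
hypothetical names for illustration, not part of the proposed patch. The rewriter writes the
updated file into a temporary file in the same directory, and the original is only replaced
once that write has finished without an exception.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical callback: reads the original file and writes the updated one. */
interface FileRewriter {
    void rewrite(InputStream original, OutputStream updated) throws IOException;
}

final class SafeUpdate {
    private SafeUpdate() {}

    /** Rewrites target via a temporary file; the original survives any failure. */
    static void apply(Path target, FileRewriter rewriter) throws IOException {
        // Same directory as the target, so the final move stays on one file system.
        Path dir = target.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "embed-", ".tmp");
        try {
            try (InputStream in = Files.newInputStream(target);
                 OutputStream out = Files.newOutputStream(temp)) {
                rewriter.rewrite(in, out);
            }
            // Only reached if the rewrite completed without an exception.
            Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; cleans up the partial file on failure.
            Files.deleteIfExists(temp);
        }
    }
}
```

This doubles the disk space needed during the update, which is why it is mostly an option
for smaller files. It also does not cover the partial-update case, where some metadata
sections are corrupt but others could still be written.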

4) I take it that the expectation is that all parsers could also potentially implement the
Embedder interface, so that both reading and writing are in one place? Otherwise you probably
end up with all sorts of inconsistencies between the two implementations regarding what metadata
fields are read from where and what should be updated when, etc.

5) Why do you pass in an InputStream? That would mean the Embedder has to open its own OutputStream
to be able to write. That would imply that Tika knows how to properly create OutputStreams
in the client's environment. Wouldn't it be better to leave the client in control here? And
why do you want to return the InputStream?
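
As an illustration of leaving the client in control, here is a sketch of an alternative
signature in which the client supplies both streams and nothing is returned. The
StreamEmbedder interface and the HeaderLineEmbedder class are hypothetical, a plain Map
stands in for Tika's Metadata object, and the toy implementation simply prepends
"name: value" lines to the content; it is not how any real file format stores metadata.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical embedder contract: the client owns and opens both streams. */
interface StreamEmbedder {
    void embed(Map<String, String> metadata, InputStream original, OutputStream output)
            throws IOException;
}

/** Toy implementation for illustration only: metadata lines, blank line, then content. */
final class HeaderLineEmbedder implements StreamEmbedder {
    @Override
    public void embed(Map<String, String> metadata, InputStream original, OutputStream output)
            throws IOException {
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String line = entry.getKey() + ": " + entry.getValue() + "\n";
            output.write(line.getBytes(StandardCharsets.UTF_8));
        }
        output.write('\n'); // blank line separates metadata from content

        // Copy the original content through unchanged.
        byte[] buffer = new byte[8192];
        int read;
        while ((read = original.read(buffer)) != -1) {
            output.write(buffer, 0, read);
        }
        // The streams are neither opened nor closed here; that stays with the client.
    }
}
```

The client can then point the OutputStream at a temporary file, a network connection, or a
buffer in memory without Tika needing to know which.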

6) I also agree with Jukka's comments that for such an important new feature we should give
this some more thought. I think your proposal works ok for the external embedder scenario
but I am not so sure for other scenarios.

Sorry that I did not speak up earlier. This issue has been around for quite a while.
> Embed Capabilities
> ------------------
>                 Key: TIKA-775
>                 URL: https://issues.apache.org/jira/browse/TIKA-775
>             Project: Tika
>          Issue Type: Improvement
>          Components: general, metadata
>    Affects Versions: 1.0
>         Environment: The default ExternalEmbedder requires that sed be installed.
>            Reporter: Ray Gauss II
>              Labels: embed, patch
>             Fix For: 1.3
>         Attachments: embed.diff, tika-core-embed-patch.txt, tika-parsers-embed-patch.txt
> This patch defines and implements the concept of embedding tika metadata into a file
> stream, the reverse of extraction.
> In the tika-core project an interface defining an Embedder and a generic sed ExternalEmbedder
> implementation meant to be extended or configured are added.  These classes are essentially
> a reverse flow of the existing Parser and ExternalParser classes.
> In the tika-parsers project an ExternalEmbedderTest unit test is added which uses the
> default ExternalEmbedder (calls sed) to embed a value placed in Metadata.DESCRIPTION then
> verify the operation by parsing the resulting stream.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
