tika-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Keith R. Bennett (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-31) protected Parser.parse(InputStream stream, Iterable<Content> contents)
Date Tue, 25 Sep 2007 16:19:50 GMT

    [ https://issues.apache.org/jira/browse/TIKA-31?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530149
] 

Keith R. Bennett commented on TIKA-31:
--------------------------------------

Jukka -

I see you also removed the swallowed exceptions.  Thanks!

I have some comments and questions.  Some may not be related to your patch, but the original
code.

----

In the new parse method:

protected String parse(InputStream stream, Iterable<Content> contents)

This means that contents needs to be populated with the metadata keys before being passed
to parse(), right?  Is that the intention?  Wouldn't we want the parser to be responsible
to know what metadata should go into contents and put it there itself?

----

Currently, even with the changes, the parser is stateful.  I'm aware you said that this was
a step *on the way to* making it stateless.  However, the current code has a serious bug that
I think we should fix, even if it's a minimal fix (such as my second suggestion below).

Parser.java has an input stream nonstatic member named 'is'.  This is set via setInputStream().
 The parser can then be called on to return various kinds of content via getStrContent(),
getContents(), and getContent().  It is possible to call setInputStream() to give it a different
stream to parse.  However, any of the parse methods will reuse the parsed content of the previous
stream, because the boolean 'parsed' will not have been reset to false.

Possible alternatives:

1) pass the input stream in only in the parse method, remove the input stream member variable,
and return a document content object that contains the results (the full string content, the
contents map).

2) have setInputStream() reset parsed to false

3) pass the InputStream into the object in its constructor, and make it immutable.  This will
guarantee that each parser will be used for only one document, which is safe, but not very
economical in cases where constructing a parser is nontrivial.

----

In XMLParser, Intellij Idea is reporting that:

} else if (obj instanceof CDATA) {

will never be evaluated to true (since CDATA extends Text, which comes before it in the list
of conditions).

This occurs in the methods:

extractContent()
concatOccurance()

(BTW, "occurance" is a misspelling; it should be "occurrence".)

I can enter a Jira issue and submit a patch if you like.

----


Minor stuff:

In MSExcelParser, "a Excel document" should be "an Excel document".
In RTFParser, "a Excel document" should be "an RTF document".

----

Thanks,
- Keith


> protected Parser.parse(InputStream stream, Iterable<Content> contents)
> ----------------------------------------------------------------------
>
>                 Key: TIKA-31
>                 URL: https://issues.apache.org/jira/browse/TIKA-31
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>         Attachments: TIKA-31.patch
>
>
> In order to push towards a stateless Parser interface, I'd like to propose implementing
the current Parser.getContents() method (as it exists after TIKA-26) in terms of a stateless
abstract method with the following signature:
>     protected abstract String parse(InputStream stream, Iterable<Content> contents)
throws IOException, TikaException;
> This method would return the fulltext content of the given stream as the String return
value and place any extra metadata into the given set of Content instances. With this information
the current Parser.getContents() method could populate a fulltext (and summary) Content entry
and any regexp Content entries universally for any Parser classes. Also, we could centralize
state handling and exception processing to a single class.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message