tika-dev mailing list archives

From "Keith R. Bennett (JIRA)" <j...@apache.org>
Subject [jira] Updated: (TIKA-17) Need to support URL's for input resources.
Date Thu, 13 Sep 2007 21:20:32 GMT

     [ https://issues.apache.org/jira/browse/TIKA-17?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Keith R. Bennett updated TIKA-17:

    Attachment: tika-17.patch

I apologize for the large patch, but it was nearly impossible to avoid.  Here are the issues
addressed by this patch:



1) Changed to use URLs instead of Files.

2) Created a constructor with a Document parameter; this was how the object was being created anyway.

3) In getParserConfig(), added check for null object in list.

4) Added the URL that was being processed when the error occurred to the error message.

5) a) Changed:
  static void populateConfig(Document doc, LiusConfig tc)
to:
  void populateConfig(Document doc)

... and called it in the LiusConfig(Document) constructor.

5) b) Removed static member 'tc'; it was no longer necessary and, given the above change,
leaving it in would have been confusing.
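
As a rough illustration of changes 2 and 5 above, here is a minimal sketch of a config class whose constructor takes the Document and populates the instance itself, with no static member. The class name ConfigSketch, the "parser" element, and its "mime" and "class" attributes are assumptions made for this example; they are not taken from the patch or from the real config.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical stand-in for LiusConfig; element and attribute names are assumed.
class ConfigSketch {
    private final Map<String, String> parserClassByMimeType = new LinkedHashMap<>();

    // The constructor receives the parsed Document and fills in this instance.
    ConfigSketch(Document doc) {
        populateConfig(doc);
    }

    // Instance method: it writes to this object's own state, so neither a
    // static config member nor a config parameter is needed.
    private void populateConfig(Document doc) {
        NodeList nodes = doc.getElementsByTagName("parser");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element element = (Element) nodes.item(i);
            parserClassByMimeType.put(
                    element.getAttribute("mime"), element.getAttribute("class"));
        }
    }

    // Returns null when no parser is configured for the given MIME type.
    String getParserClass(String mimeType) {
        return parserClassByMimeType.get(mimeType);
    }

    // Convenience helper for the example: parses an XML string into a Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse configuration XML", e);
        }
    }
}
```

A caller would then write something like new ConfigSketch(ConfigSketch.parse(xml)) and query the instance directly.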



1) Changed to use URLs instead of Files.

2) Added:
  public static Parser getParser(URL url, LiusConfig tc).

  public static Parser getParser(File file, String tcPath)
  public static Parser getParser(String str, String tcPath)
.. since this could easily be accomplished by instantiating the LiusConfig object and passing
it instead of tcPath... or do we really need it?  

3) Changed the worker method to throw an exception if a parser configuration cannot be found
for a MIME type.  Currently, I think execution would continue and a NullPointerException would
be thrown when 'parser' is dereferenced.

4) Added an error log entry for the case where a parser configuration is not found.
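
The fail-fast behaviour described in items 3 and 4 could look roughly like the sketch below. The class ParserLookupSketch, the method findParserClass, and the plain Map used as a configuration are all hypothetical, and java.util.logging stands in for log4j only to keep the example self-contained.

```java
import java.net.URL;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical lookup helper; not the actual worker method from the patch.
class ParserLookupSketch {
    private static final Logger LOG =
            Logger.getLogger(ParserLookupSketch.class.getName());

    // Returns the configured parser class name for the MIME type, or throws
    // instead of returning null so the failure is reported at its source.
    static String findParserClass(URL url, String mimeType, Map<String, String> config) {
        String parserClass = config.get(mimeType);
        if (parserClass == null) {
            // Include the URL so the log shows which resource was being processed.
            String message = "No parser configuration found for MIME type "
                    + mimeType + " while processing " + url;
            LOG.severe(message);
            throw new IllegalArgumentException(message);
        }
        return parserClass;
    }
}
```

Throwing at the lookup avoids the later NullPointerException and keeps the MIME type and URL in the error message.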



1) Changed to use URLs instead of Files.



1) Changed to use URLs instead of Files.

2) Renamed method testWORDxtraction() to testWORDExtraction().

3) Added output that lists on one line all the content objects, such as:
  Structured Content contains the following 12 items: fullText, title, author, creator, 
  summary, keywords, producer, subject, trapped, creationDate, modificationDate,   

This was because some of the content pieces were many lines long, so it was difficult to find
out the total set of content pieces found.

4) A message is printed to stdout if either the config.xml or the log4j.properties file cannot
be found.

5) log4j.properties is in the repository in src/test/resources/log4j.  I changed the source
code to look for it there.

6) config.xml is in the repository in src/test/resources.  I changed the source code to look
for it there.

7) When exception stack traces are printed, the URL that caused the error is printed immediately
  "Exception getting parser for URL file://...."

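For items 4 through 7 above, a minimal sketch of locating test resources as URLs through the class loader is shown here. It assumes src/test/resources is on the test classpath (as in the standard Maven layout), so the files would be requested as "config.xml" and "log4j/log4j.properties". The class and method names are hypothetical and are not taken from the patch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for finding and reading resources by URL.
class ResourceLocatorSketch {
    // Looks the resource up on the classpath and prints a message to stdout
    // if it cannot be found; returns null in that case.
    static URL findResource(String name) {
        URL url = ResourceLocatorSketch.class.getClassLoader().getResource(name);
        if (url == null) {
            System.out.println("Could not find resource: " + name);
        }
        return url;
    }

    // Reads the whole resource as UTF-8 text. The same code handles file:,
    // jar:, and http: URLs. On failure the URL is included in the message.
    static String readAll(URL url) {
        try (InputStream in = url.openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException("Error reading URL " + url, e);
        }
    }
}
```

A test would call findResource("config.xml") and pass the resulting URL on, rather than building a File path by hand.
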
> Need to support URL's for input resources.
> ------------------------------------------
>                 Key: TIKA-17
>                 URL: https://issues.apache.org/jira/browse/TIKA-17
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 0.1-incubator
>            Reporter: Keith R. Bennett
>             Fix For: 0.1-incubator
>         Attachments: tika-17.patch
> It would be extremely helpful to support URL's instead of just File's for input resources.
> This would enable us to use class loaders to find resources, and in general support resources
> that are not available via the filesystem.
> Patch coming...

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
