tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Updated: (TIKA-26) Use Map<String, Content> instead of List<Content>
Date Sun, 23 Sep 2007 09:48:50 GMT

     [ https://issues.apache.org/jira/browse/TIKA-26?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-26:

    Attachment: TIKA-26.patch

This patch replaces the List<Content> collection in ParserConfig and Parser with a Map<String,
Content> map as described above.
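
To illustrate the kind of change involved, here is a minimal sketch of building the name-keyed map once, up front, instead of re-deriving it from a list in every parser. The Content class below is a hypothetical stand-in with only a name and a value, and ContentMapSketch and toMap() are made-up names for illustration; none of this is taken from the attached patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ContentMapSketch {

    /** Hypothetical stand-in for the Content class; only the name matters here. */
    static class Content {
        private final String name;
        private String value;

        Content(String name) { this.name = name; }

        String getName() { return name; }
        String getValue() { return value; }
        void setValue(String value) { this.value = value; }
    }

    /**
     * Builds the name-keyed map once (assumed to happen in the configuration
     * step), so that parsers can look up Content entries by name directly.
     * A later entry with the same name replaces an earlier one.
     */
    static Map<String, Content> toMap(List<Content> list) {
        // LinkedHashMap keeps the original list order for iteration
        Map<String, Content> map = new LinkedHashMap<>();
        for (Content content : list) {
            map.put(content.getName(), content);
        }
        return map;
    }

    public static void main(String[] args) {
        List<Content> list = new ArrayList<>();
        list.add(new Content("title"));
        list.add(new Content("author"));

        Map<String, Content> contents = toMap(list);
        // Direct lookup by name, no scan over the list needed
        contents.get("title").setValue("Some document");
        System.out.println(contents.get("title").getValue());
    }
}
```

With the map created in one place, a parser only needs contents.get(name) and no longer has to keep its own copy of the list-to-map conversion.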

In addition, the patch makes some minor cleanups, such as using class-specific logger instances
and tracking the state of the parser instances more explicitly (with a separate "parsed" flag).
The patch should not, however, introduce any functional changes.

This patch probably conflicts a bit with Keith's recent work on TIKA-17 and other issues.
I'll give those a look and come up with an updated patch once his changes are committed.

After this patch the basic structure of a parser class is:

    public class SomeParser extends Parser {

        private static final Logger logger = Logger.getLogger(SomeParser.class);

        // true once the input stream has been parsed
        private boolean parsed = false;

        private String contentStr;

        public Map<String,Content> getContents() {
            Map<String,Content> contents = super.getContents();
            if (!parsed) {
                // fill in contents and contentStr with parsed content from getInputStream()
                parsed = true;
            }
            return contents;
        }

        public String getStrContent() {
            return contentStr;
        }
    }

What I'd like to do as a followup step is to pass the InputStream as an argument to getContents()
and to include the full text content as a part of the Content map to make the parser instances
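
One possible shape for the follow-up idea above (the InputStream passed as an argument to getContents(), and the full text returned as part of the map) is sketched below. This is only an illustration under stated assumptions: the class name StreamArgParserSketch, the "fulltext" key, the use of plain String values instead of Content objects, and the UTF-8 decoding are all made up for the example and are not part of the patch or of any planned API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class StreamArgParserSketch {

    /** Hypothetical key under which the full text is stored in the map. */
    static final String FULL_TEXT = "fulltext";

    /**
     * Parses the given stream and returns everything, including the full
     * text, in the result map. The parser object itself keeps no
     * per-document fields such as "parsed" or "contentStr".
     */
    public Map<String, String> getContents(InputStream in) throws IOException {
        // Read the whole stream; a real parser would extract text and metadata here
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        // Assumption for the sketch: the input is UTF-8 plain text
        String text = new String(buffer.toByteArray(), StandardCharsets.UTF_8);

        Map<String, String> contents = new LinkedHashMap<>();
        contents.put(FULL_TEXT, text);
        return contents;
    }
}
```

Because nothing is stored on the instance, the same object could be reused for several streams without the results of one call leaking into the next.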

> Use Map<String, Content> instead of List<Content>
> -------------------------------------------------
>                 Key: TIKA-26
>                 URL: https://issues.apache.org/jira/browse/TIKA-26
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>             Fix For: 0.1-incubator
>         Attachments: TIKA-26.patch
> The current Parser classes take a List<Content> collection from ParserConfig and
> explicitly reformat that collection into an internal Map<String,Content> map keyed by
> the Content names. I don't see any place where a list of Content instances is better
> than a Map keyed by the Content names, so I'd like to simplify things by creating the map
> already in ParserConfig and using it directly from then on.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
