tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Updated: (TIKA-26) Use Map<String, Content> instead of List<Content>
Date Sun, 23 Sep 2007 09:48:50 GMT

     [ https://issues.apache.org/jira/browse/TIKA-26?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-26:
------------------------------

    Attachment: TIKA-26.patch

This patch replaces the List<Content> collection in ParserConfig and Parser with a Map<String,
Content> map as described above.
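
For illustration, the list-to-map conversion that this change centralizes might look roughly like the sketch below. It is not taken from the patch: the Content class here is a hypothetical stand-in that only carries a name, and ContentIndex.byName is an invented helper, not a Tika method.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Tika's Content class; only the name matters here.
class Content {
    private final String name;

    Content(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

// Hypothetical helper, not part of Tika.
class ContentIndex {

    // Index the list by Content name, keeping the configured order.
    // If two entries share a name, the later one replaces the earlier one.
    static Map<String, Content> byName(List<Content> list) {
        Map<String, Content> map = new LinkedHashMap<String, Content>();
        for (Content content : list) {
            map.put(content.getName(), content);
        }
        return map;
    }

    public static void main(String[] args) {
        List<Content> configured =
                Arrays.asList(new Content("title"), new Content("author"));
        Map<String, Content> contents = byName(configured);
        System.out.println(contents.keySet());
    }
}
```

Doing this once where the configuration is loaded means each parser can look up an entry by name without rebuilding its own map.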

In addition, the patch makes some minor cleanups, such as using class-specific logger instances
and tracking the state of the parser instances more explicitly (with a separate "parsed" flag).
It should not, however, introduce any functional changes.

This patch probably conflicts a bit with Keith's recent work on TIKA-17 and other issues.
I'll give those a look and come up with an updated patch once his changes are committed.

After this patch the basic structure of a parser class is:

    public class SomeParser extends Parser {

        // Class-specific logger instance
        private static final Logger logger = Logger.getLogger(SomeParser.class);

        // Set once the input has been parsed, so parsing happens at most once
        private boolean parsed = false;

        // Full text content, filled in together with the contents map
        private String contentStr;

        public Map<String,Content> getContents() {
            Map<String,Content> contents = super.getContents();
            if (!parsed) {
                // fill in contents and contentStr with parsed content from getInputStream()
                parsed = true;
            }
            return contents;
        }

        public String getStrContent() {
            getContents(); // make sure the input has been parsed
            return contentStr;
        }
    }

What I'd like to do as a follow-up step is to pass the InputStream as an argument to getContents()
and to include the full text content in the Content map, which would make the parser instances
stateless.
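
A minimal sketch of what such a stateless parser could look like is below. Everything in it is an assumption made for illustration: ParsedContent is a hypothetical stand-in for Content, PlainTextParser is an invented class, and the "fulltext" key is a made-up name rather than anything defined by the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical value holder standing in for Tika's Content class.
class ParsedContent {
    private final String name;
    private final String value;

    ParsedContent(String name, String value) {
        this.name = name;
        this.value = value;
    }

    String getName() {
        return name;
    }

    String getValue() {
        return value;
    }
}

// Hypothetical parser with no instance fields, so one instance can serve many streams.
class PlainTextParser {

    // Assumed key under which the full text is stored; not a name from the patch.
    static final String FULLTEXT = "fulltext";

    // The stream is an argument and the full text is returned inside the map,
    // so nothing needs to be remembered between calls.
    Map<String, ParsedContent> getContents(InputStream stream) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = stream.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        String text = buffer.toString("UTF-8");

        Map<String, ParsedContent> contents = new LinkedHashMap<String, ParsedContent>();
        contents.put(FULLTEXT, new ParsedContent(FULLTEXT, text));
        return contents;
    }

    public static void main(String[] args) throws IOException {
        PlainTextParser parser = new PlainTextParser();
        InputStream first = new ByteArrayInputStream("hello".getBytes("UTF-8"));
        InputStream second = new ByteArrayInputStream("world".getBytes("UTF-8"));
        // The same parser instance handles both streams independently.
        System.out.println(parser.getContents(first).get(FULLTEXT).getValue());
        System.out.println(parser.getContents(second).get(FULLTEXT).getValue());
    }
}
```

With this shape the "parsed" flag and the contentStr field are no longer needed, since each call works only on its own arguments and return value.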


> Use Map<String, Content> instead of List<Content>
> -------------------------------------------------
>
>                 Key: TIKA-26
>                 URL: https://issues.apache.org/jira/browse/TIKA-26
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>             Fix For: 0.1-incubator
>
>         Attachments: TIKA-26.patch
>
>
> The current Parser classes take a List<Content> collection from ParserConfig, and
> explicitly reformat that collection into an internal Map<String,Content> map keyed by
> the Content names. I don't see any place where using a list of Content instances is better
> than a Map keyed by the Content names, so I'd like to simplify things by creating the map
> already in ParserConfig and using it directly from then on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

