tika-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-980) MicrodataContentHandler for Apache Tika
Date Thu, 12 Nov 2015 15:13:11 GMT

    [ https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002216#comment-15002216 ]

Markus Jelsma commented on TIKA-980:

Hello Nick - the identity mapper is required because without it, tags such as time, meta and
many others are not passed to the content handler, so no properties can be extracted from them.

Regarding mapping microdata properties to regular metadata, keep in mind that microdata is
nested: you can have many identical properties in different nested blocks (see the unit test).
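
To illustrate the nesting point, here is a minimal sketch of a SAX handler that keeps item scopes as a tree. It is not the code from the attached patches. The class and field names (MicrodataSketch, Item, Frame) and the sample markup are made up for illustration. The sketch uses only the JDK's SAX parser, so it assumes well-formed XHTML input, and it reads property values only from the content and datetime attributes or from element text. In the sample, both the outer and the nested item have a "name" property, which a flat key/value mapping could not keep apart.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class MicrodataSketch {

    // Hypothetical sample markup; names and values are placeholders.
    static final String SAMPLE =
        "<div itemscope='' itemtype='http://schema.org/Event'>"
        + "<span itemprop='name'>Example Conference</span>"
        + "<div itemprop='location' itemscope='' itemtype='http://schema.org/Place'>"
        + "<span itemprop='name'>Example City</span>"
        + "</div>"
        + "<time itemprop='startDate' datetime='2015-01-01'>1 January</time>"
        + "</div>";

    /** One item scope; a property value is either a String or a nested Item. */
    static final class Item {
        final String type;
        final Map<String, List<Object>> properties = new LinkedHashMap<>();
        Item(String type) { this.type = type; }
        void add(String name, Object value) {
            properties.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        @Override public String toString() { return type + " " + properties; }
    }

    /** Per-element bookkeeping so endElement knows what startElement began. */
    private static final class Frame {
        boolean opensScope;
        String property; // non-null while element text is captured as the value
        final StringBuilder text = new StringBuilder();
    }

    private static final class Handler extends DefaultHandler {
        final List<Item> topLevel = new ArrayList<>();
        private final Deque<Item> scopes = new ArrayDeque<>();
        private final Deque<Frame> frames = new ArrayDeque<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            Frame frame = new Frame();
            String prop = atts.getValue("itemprop");
            if (atts.getIndex("itemscope") >= 0) {
                Item item = new Item(atts.getValue("itemtype"));
                if (prop != null && !scopes.isEmpty()) {
                    scopes.peek().add(prop, item); // nested item is a property value
                } else {
                    topLevel.add(item);
                }
                scopes.push(item);
                frame.opensScope = true;
            } else if (prop != null && !scopes.isEmpty()) {
                String value = atts.getValue("content");
                if (value == null) value = atts.getValue("datetime");
                if (value != null) scopes.peek().add(prop, value);
                else frame.property = prop; // fall back to the element's text
            }
            frames.push(frame);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            for (Frame f : frames) {
                if (f.property != null) f.text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            Frame frame = frames.pop();
            if (frame.property != null) {
                scopes.peek().add(frame.property, frame.text.toString().trim());
            }
            if (frame.opensScope) scopes.pop();
        }
    }

    static List<Item> parse(String xhtml) throws Exception {
        Handler handler = new Handler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.topLevel;
    }

    public static void main(String[] args) throws Exception {
        for (Item item : parse(SAMPLE)) System.out.println(item);
    }
}
```

Keeping a stack of open scopes is what lets each "name" land in its own block. Flattening to regular metadata would need some naming scheme for the path to each property, which this sketch does not attempt.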

There is also the problem of TIKA-1782: if the top-level itemscope on the body tag is moved to
the html tag, it should still work, but it doesn't appear to.

> MicrodataContentHandler for Apache Tika
> ---------------------------------------
>                 Key: TIKA-980
>                 URL: https://issues.apache.org/jira/browse/TIKA-980
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>            Assignee: Ken Krugler
>             Fix For: 1.12
>         Attachments: TIKA-980-1.3-1.patch, TIKA-980-1.3-2.patch, TIKA-980-1.3-3.patch, TIKA-980-1.3-4.patch, TIKA-980-1.3-5.patch
>
> ContentHandler for Apache Tika capable of building a data structure containing Microdata
> item scopes and item properties. The Item* classes are borrowed from the Apache Any23 project
> and are slightly modified to accommodate this SAX-based extractor vs the original DOM-based one.
> The provided unit test outputs two item scopes about the Europe and NA ApacheCon events,
> and each has a nested property.

This message was sent by Atlassian JIRA
