nutch-dev mailing list archives

From "Albinscode (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1871) Generic xsl parser plugin
Date Sun, 05 Oct 2014 20:01:34 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Albinscode updated NUTCH-1871:
------------------------------
    Attachment: xsl-parse-plugin.patch

As suggested by Sebastian Nagel and Chris Mattmann, I'm providing this small plugin as a patch
to see whether it is valuable to the Nutch community.

To keep the patch ASCII-only, I've disabled the generation of Java classes with JAXB in
build.xml (how to integrate the generated classes is still to be worked out).

As you will see, some of the unit tests are closely tied to the sites I'm crawling. If they are
too specific, I can take the time to crawl some more relevant sites and provide more examples.

> Generic xsl parser plugin
> -------------------------
>
>                 Key: NUTCH-1871
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1871
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>    Affects Versions: 1.9
>            Reporter: Albinscode
>             Fix For: 1.9
>
>         Attachments: xsl-parse-plugin.patch
>
>
> The aim of this plugin is to use XSLT to extract metadata from HTML DOM structures.
> | Your Data | --> | parse-html plugin or Tika plugin | --> | DOM structure | --> | XSLT plugin |
>
> The main advantages are:
> - You don't have to write any Java code, only XSLT and configuration.
> - It can process a DOM structure from a DocumentFragment (see NekoHTML and TagSoup).
> - It is HtmlParseFilter-compatible and can be plugged in like any other plugin (parse-js,
>   parse-swf, etc.).
> This topic has been discussed on http://www.mail-archive.com/dev%40nutch.apache.org/msg15257.html
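
The pipeline described above (parsed page --> DOM structure --> XSLT) can be illustrated with
the JDK's own XSLT API. This is only a minimal sketch, not code from the attached patch: the
class name XsltMetadataSketch, the stylesheet, and the "name=value" output format are
hypothetical, and the JDK XML parser stands in for NekoHTML/TagSoup, so the sample page has to
be well-formed XHTML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XsltMetadataSketch {

    // Hypothetical stylesheet: writes one "name=value" line per extracted field.
    public static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:text>title=</xsl:text><xsl:value-of select='/html/head/title'/>"
      + "<xsl:text>&#10;</xsl:text>"
      + "<xsl:text>author=</xsl:text>"
      + "<xsl:value-of select=\"//meta[@name='author']/@content\"/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Stand-in for the HTML parser: builds a DOM from well-formed XHTML only.
    public static Document parseXhtml(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(xhtml)));
    }

    // Applies the stylesheet to an already-built DOM and returns the text output.
    public static String extract(Document dom, String xslt) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title>Hello</title>"
                    + "<meta name='author' content='Jane'/></head>"
                    + "<body><p>Text</p></body></html>";
        System.out.print(extract(parseXhtml(page), XSLT));
    }
}
```

The point of the sketch is that the extraction rules live entirely in the stylesheet: to pull
out a different field, only the XSLT string changes, not the Java code.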



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
