nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1870) Generic xsl parser plugin
Date Mon, 10 Nov 2014 21:51:34 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205385#comment-14205385 ]

Sebastian Nagel commented on NUTCH-1870:
----------------------------------------

Hi [~Albinscode], simple and funny example :)!

I've added a patch which
* includes boilerplate to build, test, generate javadoc
* makes the tests run (but only from src/plugin/parse-xsl via "ant test")
* various minor changes
* javadoc
** added package.info in org.apache.nutch.parse.xsl
** auto-generated JAXB packages are suppressed. Or do we need javadocs for these classes?
* attribute "filterUrlsWithNoRule" belongs to the element "rules", right? -> changed in
the sample

The plugin is working now! I'll continue testing with more complex transforms (to get the
full power of XSL).

Meanwhile a few points which could require review or rework:
* load all configuration files from class path, e.g.
{code}
Reader reader = conf.getConfResourceAsReader(rulesFile);
{code}
That's important if Nutch is run via Hadoop: class and configuration files are wrapped into
one single job file. There are no "real" files which can be loaded.
This also applies to running the unit tests: we cannot rely on them being executed from a
specific working directory.
* reading config files on demand and multiple times is not really efficient. It's better to
read and parse all configuration files during setConf(). Sorry, maybe my comment before was
not 100% clear at this point, but setConf() should be the best place:
** errors in configuration are caught early, and are less likely to be overlooked than if they
occur somewhere in the middle of parsing a segment
** inside setConf() you do not need to take care of thread-safety
** setConf() is called only once
** parsing should be fast and there is a strict timeout (30 sec. by default)
* regarding thread-safety: the trade-off should be minimal. Making RulesManager a local variable
seems too much and contradicts the previous point (loading config files). Wouldn't
it be sufficient to make only those objects thread-local which are unsafe and need to be used
from filter()? E.g., {{javax.xml.transform.Transformer}} is definitely not thread-safe (we
need to check other javax classes). But it should be possible to get a Transformer without
reading the xsl file again every time.
* what about fields with multiple values? An expression can match multiple times, but it looks
like only the first match is extracted.
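
To illustrate the thread-safety point, here is a minimal sketch and not the code from the patch. It relies on the JAXP contract that a {{javax.xml.transform.Templates}} object is thread-safe and can hand out Transformers without parsing the stylesheet again. The class name CompiledStylesheet and the idea of passing the stylesheet as a String are assumptions made only for this example; in the plugin the stylesheet would be read once, e.g. during setConf().

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper (not part of the patch): compile once, transform from many threads. */
class CompiledStylesheet {

  /** Compiled stylesheet; Templates instances are safe to share between threads. */
  private final Templates templates;

  /** One Transformer per thread, because Transformer itself is not thread-safe. */
  private final ThreadLocal<Transformer> transformer;

  /** Parses and compiles the stylesheet exactly once (the setConf() step in the plugin). */
  CompiledStylesheet(String xsl) throws TransformerConfigurationException {
    templates = TransformerFactory.newInstance()
        .newTemplates(new StreamSource(new StringReader(xsl)));
    transformer = ThreadLocal.withInitial(() -> {
      try {
        // Cheap: creates a Transformer from the compiled form, no re-parsing of the xsl.
        return templates.newTransformer();
      } catch (TransformerConfigurationException e) {
        throw new IllegalStateException("Cannot create transformer", e);
      }
    });
  }

  /** Applies the stylesheet to an XML string using the calling thread's Transformer. */
  String transform(String xml) throws TransformerException {
    StringWriter out = new StringWriter();
    transformer.get().transform(
        new StreamSource(new StringReader(xml)), new StreamResult(out));
    return out.toString();
  }
}
```

With this layout only the Transformer is thread-local, while the parsed configuration stays a shared field. Errors in the stylesheet would surface in the constructor, i.e. early, rather than in the middle of parsing a segment.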

> Generic xsl parser plugin
> -------------------------
>
>                 Key: NUTCH-1870
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1870
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>    Affects Versions: 1.9
>            Reporter: Albinscode
>             Fix For: 1.10
>
>         Attachments: NUTCH-1870-trunk-v3.patch, nutch-site.xml, xsl-parse-plugin.patch, xsl-parse-plugin2.patch
>
>
> The aim of this plugin is to use XSLT to extract metadata from HTML DOM structures.
> | Your Data | --> | Parse-html plugin or TIKA plugin | --> | DOM structure | --> | XSLT plugin |
>                   
>                   
> The main advantages are:
> - You won't have to produce any java code, only XSLT and configuration
> - It can process DOM structure from DocumentFragment (@see NekoHtml and @see TagSoup)
> - It is HtmlParseFilter plugin compatible and can be plugged as any other plugin (parse-js, parse-swf, etc...)
> This topic has been discussed on http://www.mail-archive.com/dev%40nutch.apache.org/msg15257.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
