I'd like to use Julien's approach because I find the scoring filters hard to understand.

My use case is the following:
1. During scoring after parsing, I want to tag the pages that are interesting to me, say with meta="HIT".
2. In the next step (to be created), I would like to prune the NON-HIT content from the segment to save segment space (I use Nutch caching). I typically need to ditch 90% of the segment data.

I am also considering:
3. Focusing recrawls on HIT pages and their outlinks.
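
To make the idea concrete, here is a rough sketch of the tag / prune / recrawl flow in plain Python. It is not Nutch code and does not use the NUTCH-779 patch: Page, tag_hits, prune_segment, recrawl_candidates and the "HIT" key are all made-up names for illustration, and a real implementation would have to work on the segment and crawldb data instead of an in-memory list.

```python
from dataclasses import dataclass, field

HIT_KEY = "HIT"  # hypothetical metadata key, not a Nutch constant


@dataclass
class Page:
    """Toy stand-in for one parsed segment record (not a Nutch class)."""
    url: str
    text: str
    outlinks: list = field(default_factory=list)
    meta: dict = field(default_factory=dict)


def is_hit(page):
    """True if the page carries the hit tag in its metadata."""
    return page.meta.get(HIT_KEY) == "true"


def tag_hits(pages, is_interesting):
    """After parsing: mark interesting pages in their metadata."""
    for page in pages:
        if is_interesting(page):
            page.meta[HIT_KEY] = "true"


def prune_segment(pages):
    """Pruning step: keep only the records tagged as hits."""
    return [page for page in pages if is_hit(page)]


def recrawl_candidates(pages):
    """Recrawl focus: hit pages plus their outlinks, deduplicated in order."""
    urls = []
    for page in pages:
        if is_hit(page):
            urls.append(page.url)
            urls.extend(page.outlinks)
    return list(dict.fromkeys(urls))


# Small usage example with made-up URLs and a made-up "interesting" rule.
pages = [
    Page("http://a.example/", "nutch metadata patch", ["http://a.example/next"]),
    Page("http://b.example/", "unrelated page"),
]
tag_hits(pages, lambda page: "nutch" in page.text)
kept = prune_segment(pages)
to_recrawl = recrawl_candidates(pages)
```

The point of the sketch is only that a single metadata flag set at parse time is enough to drive both the pruning and the recrawl selection later, provided the flag can be read back in those later steps.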

Today I don't really know if and how one can retrieve this metadata. I have managed to avoid storing "text" content for NON-HIT pages, but it is a dirty trick.


2010/1/19 Andrzej Bialecki (JIRA) <jira@apache.org>

   [ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802175#action_12802175 ]

Andrzej Bialecki  commented on NUTCH-779:
-----------------------------------------

Personally I would use ScoringFilters because I'm familiar with the API, but the approach that you propose is certainly more user friendly especially for novice users.

> Mechanism for passing metadata from parse to crawldb
> ----------------------------------------------------
>
>                 Key: NUTCH-779
>                 URL: https://issues.apache.org/jira/browse/NUTCH-779
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>         Attachments: NUTCH-779
>
>
> The patch attached allows to pass parse metadata to the corresponding entry of the crawldb.
> Comments are welcome





--
-MilleBii-