nutch-dev mailing list archives

From MilleBii <mille...@gmail.com>
Subject Re: [jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
Date Wed, 20 Jan 2010 22:50:25 GMT
I'd like to use Julien's approach because I find the scoring filter API
complex to understand.

My use case is the following:
1. During scoring, after parsing, I want to tag the pages that interest me,
say meta="HIT".
2. In a next step (yet to be created), I would like to prune NON-HIT content
from the segment to save segment space (I use Nutch caching). I typically
need to discard 90% of the segment data.

I am also considering:
3. focusing recrawls on HIT pages and their outlinks
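
To make the tag-then-prune idea concrete, here is a minimal sketch in plain
Java. It deliberately does not use any Nutch classes: the metadata is modelled
as a simple String map per URL, and the key "meta", the value "HIT", and the
names HitPruneSketch, tagIfInteresting and pruneNonHits are all hypothetical,
chosen only for illustration. How this would map onto the parse metadata and
the crawldb is exactly the open question.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HitPruneSketch {
    // Hypothetical metadata key and value; these are not Nutch constants.
    static final String TAG_KEY = "meta";
    static final String TAG_HIT = "HIT";

    /** Step 1: tag a page's metadata when the caller considers it interesting. */
    static void tagIfInteresting(Map<String, String> metadata, boolean interesting) {
        if (interesting) {
            metadata.put(TAG_KEY, TAG_HIT);
        }
    }

    /** Step 2: return only the entries whose metadata carries the HIT tag. */
    static Map<String, Map<String, String>> pruneNonHits(
            Map<String, Map<String, String>> pages) {
        Map<String, Map<String, String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> e : pages.entrySet()) {
            if (TAG_HIT.equals(e.getValue().get(TAG_KEY))) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in for a segment: URL -> per-page metadata.
        Map<String, Map<String, String>> pages = new LinkedHashMap<>();
        pages.put("http://example.org/a", new LinkedHashMap<>());
        pages.put("http://example.org/b", new LinkedHashMap<>());

        tagIfInteresting(pages.get("http://example.org/a"), true);
        tagIfInteresting(pages.get("http://example.org/b"), false);

        // Only the tagged URL survives the pruning step.
        System.out.println(pruneNonHits(pages).keySet());
    }
}
```

The same HIT test could in principle also drive the recrawl selection, since
both steps only need to read the tag back.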

Today I don't really know if and how one can retrieve this metadata. I have
managed to avoid storing "text" content for NON-HIT pages, but it is a dirty
trick.


2010/1/19 Andrzej Bialecki (JIRA) <jira@apache.org>

>
>    [
> https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802175#action_12802175]
>
> Andrzej Bialecki  commented on NUTCH-779:
> -----------------------------------------
>
> Personally I would use ScoringFilters because I'm familiar with the API,
> but the approach that you propose is certainly more user friendly especially
> for novice users.
>
> > Mechanism for passing metadata from parse to crawldb
> > ----------------------------------------------------
> >
> >                 Key: NUTCH-779
> >                 URL: https://issues.apache.org/jira/browse/NUTCH-779
> >             Project: Nutch
> >          Issue Type: New Feature
> >            Reporter: Julien Nioche
> >         Attachments: NUTCH-779
> >
> >
> > The patch attached allows to pass parse metadata to the corresponding
> entry of the crawldb.
> > Comments are welcome
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>


-- 
-MilleBii-
