nutch-dev mailing list archives

From Doug Cutting <cutt...@nutch.org>
Subject Re: [Nutch-dev] Adding title and site to scoring
Date Tue, 22 Mar 2005 21:54:57 GMT
Your changes make good sense.  I look forward to seeing the patch.

My preference would be to first apply the patch as proposed and then, 
subsequently, consider your final two points.

Thanks!

Doug

Piotr Kosiorowski wrote:
> Hello,
> 
> I read through the code and implemented some features today, and I want
> to summarize them as I promised Andrzej and Michael. This email is a bit
> long, but I promised some details.
> 
> Status of related features in the current Nutch codebase:
>     - the "site" field added by SiteIndexingFilter cannot be used to
> store the hostname because it is not tokenized and, as I understand the
> purpose of this plugin (limiting answers to a given site), it should not
> be tokenized. We need a tokenized host.
>     - there is a "title" field added by the index-basic plugin, but it is
> not indexed; it is stored only for display purposes.
> 
> There are two sets of changes required to add host and title fields to
> the index and use them during search.
> 
> Indexing changes:
> 
>     - index-basic plugin:
>     I assume index-basic is to be changed to include an indexed,
> tokenized, unstored "host" field and an indexed, tokenized, stored
> "title" field, and to exclude the title from the "anchor" field.
> 
>     - NutchDocumentAnalyzer:
>         - for "host" and "title" use the same analyzer as for "anchor" 
> and "url".
> 
> 
>     - NutchSimilarity:
>         - length normalization should treat host as url and title as 
> anchor for now.
> 
> 
> Searching:
>     - BasicQueryFilter:
>         - add host and title fields, handled exactly like all other
> fields. To start, I will set TITLE_BOOST=1.5 and HOST_BOOST=2 (the host
> would be matched twice, in the "host" and "url" fields, so it will
> influence the score very much). After implementation I will run some
> tests to choose boost values that look OK (at least to me).
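
A minimal sketch of why a host term is effectively counted twice under the boosts proposed above. This is a toy additive model, not Lucene's or Nutch's actual scoring formula: the class name FieldBoostSketch, the score method, and the placeholder value of 1.0 for the "url" boost are assumptions made for illustration. Only TITLE_BOOST=1.5 and HOST_BOOST=2 come from the message.

```java
import java.util.Map;
import java.util.Set;

/** Toy model (hypothetical, not Nutch code): sums the boost of every field a term matched in. */
final class FieldBoostSketch {
    // "title" and "host" values are the ones proposed in the message;
    // the "url" value is a placeholder assumption, not the real Nutch setting.
    static final Map<String, Float> BOOSTS = Map.of(
            "title", 1.5f,
            "host", 2.0f,
            "url", 1.0f);

    /** Returns the summed boost over the matched fields; unlisted fields weigh 1.0. */
    static float score(Set<String> matchedFields) {
        float total = 0f;
        for (String field : matchedFields) {
            total += BOOSTS.getOrDefault(field, 1.0f);
        }
        return total;
    }

    public static void main(String[] args) {
        // A term that appears in the hostname matches both "host" and "url".
        System.out.println("host term:  " + score(Set.of("host", "url")));
        // A term that appears only in the title matches one field.
        System.out.println("title term: " + score(Set.of("title")));
    }
}
```

In this model a hostname match collects both the "host" and the "url" weight, which is the double counting the message warns about when choosing HOST_BOOST.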
> 
> 
> I have already implemented all these changes (not a lot of work once I
> had figured out what to change) and I will do basic tests tomorrow.
> After basic verification of the implementation I will send the patch so
> that others who are interested can try it and comment on the results.
> 
> 
> The changes introduced by this patch modify the index structure
> (addition of a new field) and change the default query. I think it
> should be possible to use the new code with an old index (it should
> behave like the old code, as the new query fields would not be present
> in the documents), but mixing new and old segments might be a problem.
> So I think this change requires reindexing.
> 
> 
> During implementation I came up with two additional ideas:
> 1) Do not index the url (keep it as a stored-only field) and add
> separate host and path fields as indexed. This would not index the
> protocol, port and some other parts of the url, but I am not sure that
> indexing them makes sense. It would be easier to control the effect of
> weights and length normalization if the host is not counted twice, but
> this would require reindexing, as some old fields would be used
> differently in the query, so it would not work as before with an old
> index.
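
A minimal sketch of the host/path split described in idea 1, assuming the standard java.net.URI class for parsing. The class UrlFields and its split method are hypothetical names for illustration and are not taken from the Nutch codebase or from the patch.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper (not Nutch code): extracts the host and path parts of a URL. */
final class UrlFields {
    /** Simple holder for the two parts that would be indexed separately. */
    static final class Parts {
        final String host;
        final String path;

        Parts(String host, String path) {
            this.host = host;
            this.path = path;
        }
    }

    /** Splits a URL; protocol, port, query and fragment are deliberately dropped. */
    static Parts split(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on malformed input
        String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase(Locale.ROOT);
        String path = uri.getPath() == null ? "" : uri.getPath();
        return new Parts(host, path);
    }

    public static void main(String[] args) {
        Parts p = split("http://www.Example.com:8080/docs/index.html?q=1");
        System.out.println("host=" + p.host + " path=" + p.path);
    }
}
```

With the two parts held separately, each could be given its own boost and length normalization, and the host would no longer be matched a second time through the url field.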
> 
> 2) I do not have any evidence yet, but looking at the data I have a
> feeling that the "not host" part of a url is not as important as its
> current boost factor indicates. Probably it should be treated more like
> a title (as it is settable by the page owner and easy to spam). I will
> look at the parameters once I have a tested implementation, so I can
> index the same segments with different parameters and compare results.
> 
> Do you think it makes sense to add such functionality? If so I can 
> change these two additional things before posting a patch.
> 
> Regards,
> Piotr
