nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: [Nutch-dev] Adding title and site to scoring
Date Wed, 23 Mar 2005 00:02:22 GMT
Piotr Kosiorowski wrote:
> Hello,
> 
> I was reading the code and implementing some features today and want to
> summarize it as I promised to Andrzej and Michael - my email is a bit 
> long but I have promised some details.
> 
> Status of related features in current nutch codebase:
>     - "site" field added by SiteIndexingFilter cannot be used for 
> hostname storage as it is not tokenized and as I understand the purpose 
> of this plugin (limiting answers to given site) it should not be 
> tokenized. And we need to tokenize host.
>     - there is a "title" field added by index-basic plugin but it is not 
> indexed - it is stored only for display purposes.

Correct.

One comment though on the value of the "site", something that I intended 
to raise but always kept forgetting... Page.computeDomainID() returns a 
nice, unique long value, which (when encoded with Character.MAX_RADIX) 
is always shorter than the host name. This could be used instead of 
"site", to save some space in the index. The corresponding query plugin 
would compute the domain ID using the same formula.
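
A minimal sketch of that idea, under stated assumptions: hypotheticalDomainId() below is only a stand-in (a 64-bit FNV-1a hash of the lower-cased host) and is not the formula used by Page.computeDomainID(). The only part taken from the text above is the encoding with Character.MAX_RADIX (base 36). Note that a base-36 long takes at most 13 digits plus an optional minus sign, so how much space is saved depends on the length of the host name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class DomainIdSketch {
    // ASSUMPTION: stand-in for Page.computeDomainID(). This is a plain
    // 64-bit FNV-1a hash of the lower-cased host; the real formula may differ.
    static long hypotheticalDomainId(String host) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : host.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    // Encode the ID in base 36 (Character.MAX_RADIX), the largest radix
    // supported by Long.toString.
    static String encode(long id) {
        return Long.toString(id, Character.MAX_RADIX);
    }

    // Inverse of encode(), e.g. for debugging index terms.
    static long decode(String term) {
        return Long.parseLong(term, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        String host = "www.tjorn.se";
        // The indexing filter and the query plugin would both call this.
        String term = encode(hypotheticalDomainId(host));
        System.out.println(host + " -> " + term + " (" + term.length() + " chars)");
    }
}
```

The point of the sketch is only that the indexing side and the query side must derive the term from the host in exactly the same way, otherwise "site" queries would stop matching.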

> 
> There are two sets of changes required to add host and title fields to
> the index and use them during search.
> 
> Indexing changes:
> 
>     -index-basic plugin:
>     I assume index-basic functionality is to be changed to include
> indexed,tokenized,unstored "host" and indexed,tokenized,stored "title"
> fields and exclude title from "anchor" field.

Yes.

> 
>     - NutchDocumentAnalyzer:
>         - for "host" and "title" use the same analyzer as for "anchor" 
> and "url".
> 

Hmmm... It is not clear to me why in the current NutchDocumentAnalyzer 
the AnchorAnalyzer is used for "url". While for the "anchor" field it 
makes sense, because it sets a gap between the terms to prevent matches 
across consecutive anchors, in case of "url" we only ever have a single 
value being added. This results in just adding three empty positions at 
the start of the tokens, e.g. for the url 
"http://www.tjorn.se/kof/kultur/sagor/" we get:

null_3, http|http-www|http-www-tjorn, www|www-tjorn, tjorn, se,
kof, kultur, sagor

So, I would argue that it doesn't make much sense, and we should fix it 
to use the ContentAnalyzer for "url".

The same would go for the new fields, "host" and "title", because there 
is only ever a single value for each of these.
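
To illustrate the point about the gap, here is a small simulation in plain Java (no Lucene involved). It assumes the gap is applied as a fixed number of empty positions before the first token of every value, modelled on the "null_3" prefix in the example above; the tokenizer is a simplistic split on non-word characters and does not produce the combined terms like "http-www". The class and method names are made up for this sketch.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class PositionGapSketch {
    // Assign a position to each token of each value. Before every value the
    // position is advanced by `gap` empty slots (ASSUMED behaviour).
    static TreeMap<Integer, String> index(int gap, String... values) {
        TreeMap<Integer, String> positions = new TreeMap<>();
        int pos = 0;
        for (String value : values) {
            pos += gap;
            for (String token : value.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) {
                    positions.put(pos++, token);
                }
            }
        }
        return positions;
    }

    // A two-term phrase matches only if the terms sit at adjacent positions.
    static boolean phraseMatches(Map<Integer, String> positions, String first, String second) {
        for (Map.Entry<Integer, String> e : positions.entrySet()) {
            if (e.getValue().equals(first) && second.equals(positions.get(e.getKey() + 1))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Multi-valued field (anchors): the gap blocks a match across values.
        System.out.println(phraseMatches(index(0, "red car", "wash service"), "car", "wash"));
        System.out.println(phraseMatches(index(3, "red car", "wash service"), "car", "wash"));
        // Single-valued field (url): the gap only shifts every position by 3.
        System.out.println(index(3, "http://www.tjorn.se/kof/kultur/sagor/"));
    }
}
```

With two anchor values the gap changes the outcome of the phrase match; with a single url value it changes nothing except the starting position, which is the reason for the argument above.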

> 
>     - NutchSimilarity:
>         - length normalization should treat host as url and title as 
> anchor for now.

Yes, probably correct... we need to see the results on some well-known 
cases.

> 
> 
> Searching:
>     - BasicQueryFilter -
>         - add host and title fields handled exactly as all other fields. 
> For start I will set TITLE_BOOST=1.5, and HOST_BOOST=2  (as host would 
> be used in matching two times: in "host" and in "url" fields - it will 
> influence the score very much). After implementation I will do some tests 
> to choose the values for boost that would look ok (at least for me).

Regarding the TITLE_BOOST - well, the boost for anchors is 2.0f, why 
should it be lower for this special anchor, which is the title? You make 
an assumption here that the author of the page is less to be trusted 
with the title than the others who link to his page...

Regarding the HOST_BOOST - if you consider the example I gave in the 
email that started this thread, the reason for treating the host part of 
urls separately was to increase the quality of results, by boosting up 
the scoring for sites that are more likely the "reference" sites for the 
query terms. So, given the query "ikea", and the following urls:

1. "http://www.ikea.se/some/other/name.html"
2. "http://www.some.se/some/other/ikea.html"
3. "http://ikea.some.se/some/other/name.html"

which of the above urls should score the highest?

With the current code, all three would get the same score. With your 
patch applied, only 1. and 3. would get the same score, the 2. would get 
a lower score. Now, the interesting question is this: is there any 
meaningful and generic way to introduce a difference in scoring between 
1. and 3.?
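
A toy model of this, to make the comparison concrete. It is not the Lucene scoring formula: it ignores tf/idf and length normalization and simply adds a boost for each field in which the term occurs. HOST_BOOST is the value proposed above; URL_BOOST = 1.0f is an arbitrary placeholder chosen for the sketch, and the class and method names are made up.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class HostBoostSketch {
    static final float URL_BOOST = 1.0f;   // ASSUMPTION: placeholder value
    static final float HOST_BOOST = 2.0f;  // value proposed in this thread

    // Simplistic tokenizer: lower-case, split on anything not a letter or digit.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"));
    }

    // Add one boost per field ("url", "host") that contains the term.
    static float score(String url, String term) {
        String host = URI.create(url).getHost();
        float score = 0f;
        if (tokenize(url).contains(term)) {
            score += URL_BOOST;
        }
        if (host != null && tokenize(host).contains(term)) {
            score += HOST_BOOST;
        }
        return score;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.ikea.se/some/other/name.html",
            "http://www.some.se/some/other/ikea.html",
            "http://ikea.some.se/some/other/name.html"
        };
        for (String url : urls) {
            System.out.println(score(url, "ikea") + "  " + url);
        }
    }
}
```

In this model urls 1. and 3. tie and 2. scores lower, as described. Separating 1. from 3. would need some extra signal, for example the position of the term within the host name, which the sketch deliberately does not attempt.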

> 
> 
> I have already implemented all these changes (not a lot of work after
> figuring what to change in fact) and I will do basic tests tomorrow, and
> after basic verification of implementation I will send the patch for
> others interested to try - and comment on results.

Great. If the patch is not too large, just send it to the list, 
otherwise you can put it in Bugzilla.

> 
> 
> Changes that are introduced by this patch would modify index structure
> (addition of new field) and will change default query. I think it should
> be possible to use new code with old index (it should behave as old code
> as new fields in query would not be present in document), but mixing new
> and old segments might be a problem. So I think this change requires
> reindexing.

Correct - mixing indexes would be a big no-no. Using the new code with 
old indexes would lead to different absolute score values.

> 
> During implementation I have found two additional ideas:
> 1) Do not index url (keep it as stored only field) - add separate host
> and path fields as indexed  (it will not index protocol, port and some
> other parts of url but I am not sure if indexing them makes sense). It
> will be easier to control effect of weights and length normalization if
> host is not counted twice, but this would require reindexing as some old
> fields would be used differently in query - so it will not work as
> before with old index.

In some cases you want to select a subset of results by protocol (e.g. 
https, file, smb, etc). So, it seems to me that you need to keep the 
protocol around.

Also, I think that in some cases you want to run a phrase match across 
the whole url, so keeping an index of the whole url would be beneficial.

The fact that terms in "host" and "url" overlap can be adjusted by 
boosting and different normalization. Please also remember that terms, 
which are "qualified" with the field name (like in a query 
"anchor:test") would never match the content in other fields.

> 
> 2) I do not have any evidence yet, but looking at the data I have a
> feeling that "not host" part of an url is not as important as current
> boost factor for it indicates. Probably it should be treated more like a
> title (as it is settable by page owner and easy to spam). I will look at
> parameters when I have a tested implementation so I can index the same
> segments with different parameters and compare results.

IMHO it's difficult to say anything general about this. The "path" part 
of the url, in addition to all terms we can get from it, gives us 
important information about the nesting level of the current page, and 
all in all it's somewhat more to be trusted than the page title. Some 
ranking methods give deeply nested pages a lower score than pages 
directly linked to the top of the site.
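
For illustration only, the nesting level could be derived from the url roughly as below. Here "depth" is defined as the number of non-empty path segments (so a trailing file name counts as one level), and the 1 / (1 + depth) damping factor is an arbitrary example that is not taken from Nutch or from any particular ranking method.

```java
import java.net.URI;

public class PathDepthSketch {
    // Number of non-empty segments in the path part of the url.
    static int depth(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return 0;
        }
        int depth = 0;
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                depth++;
            }
        }
        return depth;
    }

    // ASSUMPTION: hypothetical damping factor, 1.0 for the site root and
    // smaller for more deeply nested pages.
    static float depthFactor(String url) {
        return 1.0f / (1 + depth(url));
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.tjorn.se/",
            "http://www.tjorn.se/kof/kultur/sagor/"
        };
        for (String url : urls) {
            System.out.println(depth(url) + "  " + depthFactor(url) + "  " + url);
        }
    }
}
```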

> 
> Do you think it makes sense to add such functionality? If so I can 
> change these two additional things before posting a patch.

I think the changes related to the "host" field are better understood at 
this moment than these two. I think you should limit your patch just to 
the "host" functionality, and we should continue to discuss the other ideas.


-- 
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

