lucene-java-user mailing list archives

From mark harwood <>
Subject Re: Reduction based "more like this"?
Date Fri, 09 Feb 2007 11:52:23 GMT
The distinguishing characteristics you mark out and put in a field may not be so distinguishing
as more content is added to an index (e.g. use of new terminology like "podcast" becomes more
prevalent). Maintaining/regenerating this field in anything other than a static index then
starts to look like a non-trivial overhead.

While we are musing on this, I'm not sure that with things like MoreLikeThis (or the BooleanQuery
scoring?) we have considered the true value of *coincidences* of terms, as opposed to summing
their individual IDFs independently. For example, given the terms "female", "John" and "London", all
3 may have equal IDF, but should a document representing a female in London be given equal
weighting to a document representing the rarer example of a female who happens to be called
"John"? Considering these pairings adds extra complexity/cost but might be an interesting
avenue to explore for some apps when selecting distinguishing characteristics or weighting
query results.
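
A minimal sketch of the "coincidence" idea, in plain Python rather than against the Lucene API. The toy corpus, the helper names (`doc_freq`, `idf`, `pair_idfs`) and the smoothed IDF form `1 + ln(N / (df + 1))` are all assumptions made for illustration; nothing here is how MoreLikeThis or BooleanQuery actually score.

```python
import math
from itertools import combinations

# Hypothetical toy corpus: each document is just a set of terms.
DOCS = [
    {"female", "london"},
    {"female", "london"},
    {"female", "london"},
    {"female", "john"},
    {"john", "male"},
    {"john", "male"},
    {"john", "male", "london"},
    {"male", "paris"},
]

def doc_freq(docs, terms):
    """Count the documents that contain every term in `terms`."""
    wanted = set(terms)
    return sum(1 for doc in docs if wanted <= doc)

def idf(num_docs, df):
    """Smoothed inverse document frequency (an assumed, commonly used form)."""
    return 1.0 + math.log(num_docs / (df + 1))

def pair_idfs(docs, terms):
    """IDF of each term pair, based on how often the two terms co-occur."""
    n = len(docs)
    return {
        pair: idf(n, doc_freq(docs, pair))
        for pair in combinations(sorted(terms), 2)
    }

TERMS = ["female", "john", "london"]

# Each term appears in 4 of the 8 documents, so the single-term IDFs are equal.
single = {t: idf(len(DOCS), doc_freq(DOCS, [t])) for t in TERMS}

# The pairs differ: "female"+"london" co-occur in 3 documents,
# "female"+"john" in only 1, so the latter pair gets the higher weight.
pairs = pair_idfs(DOCS, TERMS)
```

Summing `single` values cannot tell the two cases apart, while `pairs` can. The extra cost is visible too: the number of pairs grows quadratically with the number of candidate terms.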


----- Original Message ----
From: karl wettin <>
Sent: Friday, 9 February, 2007 8:31:05 AM
Subject: Reduction based "more like this"?

I just woke up thinking it would be cool to attempt reducing the data
of all documents using PCA (or so) and store the output in a new
field per dimension introduced in order to find similar documents by
placing a simple proximity query. Did anyone attempt something like

I did not think this through that much. Nor do I need this feature.  
Just think it would be a cool experiment.
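
A rough sketch of the experiment described above, in plain Python with no Lucene calls. It assumes a tiny hypothetical term-count matrix, keeps only the first principal component (found by power iteration) instead of one field per dimension, and uses a simple distance check as a stand-in for a numeric range query on the stored coordinate. All names here are illustrative.

```python
import math

# Hypothetical term counts per document over the vocabulary
# [lucene, search, index, football, goal].
COUNTS = [
    [3, 2, 2, 0, 0],
    [2, 3, 1, 0, 0],
    [0, 0, 0, 3, 2],
    [0, 1, 0, 2, 3],
]

def center(rows):
    """Subtract the per-column mean so every term dimension has zero mean."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    return [[r[j] - means[j] for j in range(m)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of the centered term dimensions."""
    n, m = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(m)]
        for i in range(m)
    ]

def top_component(cov, iterations=200):
    """Power iteration: approximates the leading eigenvector of `cov`."""
    m = len(cov)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # start vector carried no variance; keep the last estimate
        v = [x / norm for x in w]
    return v

def project(centered, component):
    """One coordinate per document: the value that would go in the new field."""
    return [sum(x * c for x, c in zip(row, component)) for row in centered]

def within(scores, doc, radius):
    """Stand-in for a range query: other documents whose coordinate is close."""
    return [
        i for i, s in enumerate(scores)
        if i != doc and abs(s - scores[doc]) <= radius
    ]

centered = center(COUNTS)
scores = project(centered, top_component(covariance(centered)))
similar_to_first = within(scores, 0, 1.0)
```

On this toy data the two "search" documents land close together on the first component and far from the two "football" documents. Note that the component is derived from the whole collection, so adding documents would mean recomputing every stored coordinate.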


To unsubscribe, e-mail:
For additional commands, e-mail:
