lucene-java-user mailing list archives

From mpolzin <>
Subject Re: Limiting search result for web search engine
Date Thu, 04 Feb 2010 03:47:26 GMT

Hi, thanks for the suggestion. I am relatively new to Lucene, so I have a few
more questions on this implementation. I looked at the source code for
Lucene and found the TopDocCollector class. It appears this class derives
from the HitCollector class, so I should be able to extend TopDocCollector
and override the Collect method to check whether a document with the same
base URL has already been collected and inserted. Here are my pseudocode
changes to the Collect method:

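            if (score > 0.0f) {

                // Do something here to get the document base URL.
                // baseUrl and seenBaseUrls are placeholder names, not Lucene ones.

                if ((hq.Size() < numHits || score >= minScore) &&
                    !seenBaseUrls.Contains(baseUrl)) {
                    seenBaseUrls.Add(baseUrl);
                    hq.Insert(new ScoreDoc(doc, score));
                    minScore = ((ScoreDoc) hq.Top()).score; // maintain minScore
                }
            }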

Does this make sense? 
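
One caveat with a plain "already seen" check: as far as I understand, a collector is handed documents in index order rather than score order, so the first page collected for a site is not necessarily its most relevant one. A minimal sketch of the alternative, keeping the best-scoring hit per base URL, is below. It is plain Java with no Lucene dependency; SiteRollupSketch, Hit, and topHits are hypothetical names used only for illustration, and the base URL is assumed to be passed in already.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a custom collector; this is not a Lucene class.
class SiteRollupSketch {

    // One hit: a document id, its score, and the site it belongs to.
    static final class Hit {
        final int doc;
        final float score;
        final String baseUrl;

        Hit(int doc, float score, String baseUrl) {
            this.doc = doc;
            this.score = score;
            this.baseUrl = baseUrl;
        }
    }

    // Best hit seen so far for each base URL.
    private final Map<String, Hit> bestPerSite = new HashMap<>();

    // Called once per matching document, in any order.
    void collect(int doc, float score, String baseUrl) {
        if (score <= 0.0f) {
            return; // same guard as in the pseudocode above
        }
        Hit best = bestPerSite.get(baseUrl);
        if (best == null || score > best.score) {
            bestPerSite.put(baseUrl, new Hit(doc, score, baseUrl));
        }
    }

    // Returns at most numHits hits, one per site, highest score first
    // (ties broken by lower document id).
    List<Hit> topHits(int numHits) {
        List<Hit> hits = new ArrayList<>(bestPerSite.values());
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed()
                .thenComparingInt(h -> h.doc));
        return new ArrayList<>(hits.subList(0, Math.min(numHits, hits.size())));
    }
}
```

The trade-off is memory proportional to the number of distinct sites that match, instead of a queue bounded by numHits.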

How could I tell the search to use my extended version of the
TopDocCollector class? Also, how would I pull the URL from the document
inside of the loop above? I didn't see any good documentation anywhere on
how to do that. There seems to be little information out there on how to
build your own custom collector. 
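
On the base URL itself: assuming the full page URL is available as a string (for example from a stored field, hypothetically named "url"), one option is to reduce it to scheme and host with java.net.URI. The sketch below shows only that step. BaseUrlSketch and baseUrl are illustrative names, and reading the string out of the index is not shown.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper; not part of Lucene.
class BaseUrlSketch {

    // Returns "scheme://host" in lower case for an absolute URL,
    // or null if the string has no scheme or host or cannot be parsed.
    static String baseUrl(String url) {
        try {
            URI uri = new URI(url.trim());
            String scheme = uri.getScheme();
            String host = uri.getHost();
            if (scheme == null || host == null) {
                return null;
            }
            // Port, path, and query are dropped on purpose so that all
            // pages of one site map to the same key.
            return scheme.toLowerCase(Locale.ROOT) + "://" + host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

Whether "www.example.com" and "example.com" should count as the same site is a separate decision that this sketch does not make.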

Thanks again, 

Anshum-2 wrote:
> Hi Mike,
> Not really through queries, but you may do this by writing a custom
> collector. You'd need some supporting data structure to mark/hash the
> occurrence of a domain in your result set.
> --
> Anshum Gupta
> Naukri Labs!
> The facts expressed here belong to everybody, the opinions to me. The
> distinction is yours to draw............
> On Wed, Feb 3, 2010 at 6:56 AM, Mike Polzin <> wrote:
>> I am working on building a web search engine and I would like to build a
>> results page similar to what Google does. The functionality I am looking
>> to include is what I refer to as "rolling up" sites, meaning that even if
>> a particular site (defined by its base URL) has many relevant hits on
>> various pages for the search's keywords, that site is only shown once in
>> the results listing, with a link to the most relevant hit on that site.
>> What I do not want is to have one site dominate a search results page.
>> Does it make sense to just do the search, get the hits list and then
>> programmatically remove the results which, although they meet the search
>> criteria, are not as relevant? Is there a way to do this through queries?
>> Thanks in advance!
>> Mike
