lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Nik Kolev" <>
Subject RE: cluster documents based on fields' values
Date Tue, 17 Aug 2010 17:42:16 GMT
Thanks Grant. I'll take a look at Solr's faceting.

A colleague of mine also discovered solr's clustering component - It's still labeled as
experimental - does anybody have experience with it?

Another option (pointed out by your post:
t-with-apache-lucene-and-solr-part-i-of-3/) is mahout with its baked in
clustering algorithms. Has anybody gotten good mileage out of this

Thanks again,

-----Original Message-----
From: Grant Ingersoll [] 
Sent: Tuesday, August 17, 2010 7:57 AM
Subject: Re: cluster documents based on fields' values

Hi Nik,

Inline below.
On Aug 15, 2010, at 5:01 PM, Nik Kolev wrote:

> Hi,
> I am researching the possibility of using Lucene for discovering
> clusters of documents and since I am new to Lucene I decided to
> ask the community for advice before I poke the APIs and the internals.
> Your input will be invaluable!
> Here's the use case. Documents arrive from different feeds. Each feed
> produces millions and millions of documents. Documents are structured
> and share certain "interesting" fields. For the purpose of
> here's a trivial example. Let's say the documents on each feed
> shoes. A shoe has an ID (uniquely identifying it within its feed) and
> SIZE [1]. I want to be able to ask Lucene (assuming the shoe documents
> are indexed of course) for all of the clusters of shoes that sport the
> size. A cluster of shoes is just the IDs of the shoes that got grouped
> together due to having the same value of the SIZE field.
> I don't think that doing this brute force will perform. Here's what I
> mean when I say brute force. I looked at IndexReader, and I saw that
> I can get the distinct values for an indexed field. So assuming that
> each feed will have its own index (I don't think I can get away with
> a single index for all feeds [2]), I can get the union of all distinct
> values for the interesting field across all indexes (one per each
> document feed). Then for each distinct value I can do a MultiSearcher
> search across these indexes getting the IDs of the documents .
> My gut tells me that the brute force approach won't perform [3]. And
> here's where you guys come in - is it possible to ask lucene to give
> me the groups of records (across indixes) that share the value for a
> given field? Is there something else in API that I can take advantage
> of and get what I need faster? If I am to extend Lucene to allow for
> this sort of thing, where should I start, what documents should I
> read,...?

This feature, called faceting -- amongst other things, comes out of the
box in Solr (

> Thanks very much again for your advise,
> -nik
> [1] Shoe document layout:
>    SHOE
>      |--ID
>      |--SIZE
> [2] The volume of documents I get from each feed is huge. Also, even
>    though a lot of fields are shared the format across feeds varies
>    and a feed may not have some of the "interesting" fields of
>    another feed.
> [3] The dataset I am talking about is rich and the brute force
>    means that I need to be doing tens of millions of searches just to
>    group on one field. Also I most likely will blow my heap up if I
try to
>    load all of the values in memory all at once.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

Grant Ingersoll

Search the Lucene ecosystem using Solr/Lucene:

To unsubscribe, e-mail:
For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message