lucene-java-user mailing list archives

From: Chris Lu
Subject: Re: Announcement: Lucene powering Product Category Listings
Date: Wed, 31 Aug 2005 02:54:33 GMT

Very nice implementation and a great write up.

How large is the index?
And as you keep posting new content to the index, will you optimize it?

Chris Lu
Lucene Search RAD on Any Database

On 8/30/05, Chris Hostetter wrote:
> I'm pleased to announce that for about a month now, CNET's "Product
> Listing" pages are powered by Lucene 1.4.3.  These pages not only allow
> users to browse CNET's catalog of tech products by category, but also to
> "Filter" the lists according to category specific Attribute Filters which
> are displayed along with counts of how many products they will get if they
> apply that Filter.  Multiple Filters can be applied (in any order) to
> rapidly narrow down the list of products.
> Examples of these pages can be seen here...
> Digital Cameras
> Inkjet Printers
> Epson Inkjet Printers
> Epson Inkjet Printers that can print on Transparencies
> These pages work much the same way as I've described in past threads
> regarding "category counts", except that the logic determining which
> filter links to display is not as simple as just pulling out the most
> frequent terms per field, or based on a fixed list.  As you can see from
> the example links, each category has its own unique list of attributes
> (e.g. Price, Manufacturer) and for each of those attributes, there
> is a list of Queries which map one to one with a possible "Find by" link.
> Even if an Attribute is common between two categories, the list of Queries
> to filter by may be different -- note the differences in the "Find by
> price" lists between the various links above.  We have several thousand
> unique categories, and some of these have as many as a thousand unique
> Filter Queries which are needed to determine the counts to display on any
> given page for that category, but with some very aggressive Filter
> caching, the time for a single request stays very manageable.
> For those who are interested, I can elaborate a little more on how these
> pages work....
> At a high level there are four major pieces...
> 1) A Servlet which abstracts away most of the Lucene index modification
> APIs into an HTTP/XML based "web service" by accepting POSTed XML
> documents to add/update in the index.  It also replies to GET search
> requests using query plugins that have access to an IndexReader.
> 2) A ProductData index updater, which is executed as part of our "product
> publishing" process.  Anytime a product is added (or modified) in our
> database the updater creates an XML document describing the product and
> POSTs it to the above mentioned Servlet (which indexes it).
> 3) A Metadata index updater, which is executed as part of our "category
> metadata publishing" process.  Anytime someone decides to change the
> metadata that describes a category, this process creates an XML document
> containing that metadata, and POSTs it to the above mentioned Servlet
> (I'll elaborate more on these category documents in a moment).
> 4) A Query Plugin used by the Servlet specifically to generate the product
> result lists and counts needed for these product listing pages.
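
A minimal sketch of the kind of XML body an index updater (pieces 2 and 3 above) might build before POSTing it to the Servlet. The post does not describe the actual document format, so the class name `ProductXml` and the element and attribute names (`product`, `field`, `id`, `name`) are invented for illustration only. The HTTP POST itself is left out.

```java
import java.io.StringWriter;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical builder for the XML body an index updater might POST. */
final class ProductXml {

    /** Serializes one product; element and attribute names are assumptions. */
    static String toXml(String productId, Map<String, String> fields) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("product");
            xml.writeAttribute("id", productId);
            for (Map.Entry<String, String> field : fields.entrySet()) {
                xml.writeStartElement("field");
                xml.writeAttribute("name", field.getKey());
                xml.writeCharacters(field.getValue()); // escapes &, < and >
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.flush();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not serialize product " + productId, e);
        }
        return out.toString();
    }
}
```

Using a stream writer rather than string concatenation means field values containing markup characters are escaped instead of corrupting the document.
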
> The Category Metadata documents are what really drive the behavior of the
> Plugin.  They contain the following information...
>   * A Query whose results are all products to display in this category
>   * An ordered list of Attributes that can be filtered on
>     - A datatype for each Attribute
>     - An ordered list of "Filters" for each Attribute
>       + A label to display for each Filter
>       + A Query to define what products match that Filter
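
The structure listed above could be modelled in memory roughly as follows. This is only a sketch: the class and field names are hypothetical, and queries are held as plain strings because the post does not say how they are stored in the metadata document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical in-memory view of one Category Metadata document. */
final class CategoryMetadata {

    /** One "Find by" link: a display label plus the query selecting its products. */
    static final class FilterDef {
        final String label;
        final String query; // query text; the real serialization is not stated in the post

        FilterDef(String label, String query) {
            this.label = label;
            this.query = query;
        }
    }

    /** A filterable attribute with its datatype and its ordered filters. */
    static final class Attribute {
        final String name;
        final String datatype;
        final List<FilterDef> filters;

        Attribute(String name, String datatype, List<FilterDef> filters) {
            this.name = name;
            this.datatype = datatype;
            this.filters = Collections.unmodifiableList(new ArrayList<FilterDef>(filters));
        }
    }

    final String categoryQuery;       // selects every product in the category
    final List<Attribute> attributes; // list order is display order

    CategoryMetadata(String categoryQuery, List<Attribute> attributes) {
        this.categoryQuery = categoryQuery;
        this.attributes = Collections.unmodifiableList(new ArrayList<Attribute>(attributes));
    }

    /** Number of filter queries whose counts a page for this category needs. */
    int filterCount() {
        int total = 0;
        for (Attribute attribute : attributes) {
            total += attribute.filters.size();
        }
        return total;
    }
}
```
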
> When a request comes in for a category, the first thing the Plugin does is
> an initial query on the category Id to get the category's Metadata
> document.  From that document, the field containing the Query that defines
> that category is extracted, and a search is issued against it (using
> whatever Sort options have been specified).  This Category Query is also
> used to build a QueryFilter so that a BitSet of every matching product can
> be obtained.  For each Filter in each Attribute found in the Category's
> document, the Query is extracted, and again a QueryFilter is built to
> obtain a BitSet of all products which match.  The intersection of that
> BitSet with the BitSet from the initial Category Query is computed to
> determine the "count" to display next to the Filter label.  Once all of
> this is done the list of products, all of the data from the Category
> Metadata document, and the counts for each of the Category Filters are
> bundled up into an XML response document.  The client which initiated the
> search can then apply additional Business logic to decide which
> attributes/filters to display counts for -- the simple case is to display
> the first N attributes, and for each attribute display the Filter links
> with the highest counts, but in some cases the links may be displayed in
> different orders based on the datatype of the attribute.
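
A minimal sketch of the counting step described above, using plain `java.util.BitSet` values in place of the BitSets a QueryFilter would produce. The class name `FilterCounts` and the use of string labels as map keys are assumptions made for the example.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: per-filter counts as BitSet intersections with the category's documents. */
final class FilterCounts {

    /** Number of documents set in both BitSets; neither argument is modified. */
    static int intersectionCount(BitSet a, BitSet b) {
        BitSet copy = (BitSet) a.clone(); // cached BitSets are shared, so work on a copy
        copy.and(b);
        return copy.cardinality();
    }

    /** Maps each filter label to the number of category documents matching that filter. */
    static Map<String, Integer> countAll(BitSet categoryDocs, Map<String, BitSet> filterDocs) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, BitSet> entry : filterDocs.entrySet()) {
            counts.put(entry.getKey(), intersectionCount(categoryDocs, entry.getValue()));
        }
        return counts;
    }
}
```

The clone matters here: `BitSet.and` modifies its receiver, so intersecting in place would corrupt a BitSet that other requests may be reading from a shared cache.
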
> When a user clicks on a Filter link the process is the same as before, but
> the initial Category Query is augmented by the Filter that has been
> selected -- so the results to display on the first page (and the BitSet of
> all matching products) are correct.  The new counts (which take into
> account the selected Filter) are computed exactly as before -- using
> BitSet intersections.
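
The drill-down step can be sketched at the BitSet level. The post describes augmenting the Category Query itself; this sketch assumes the equivalent set view, where the documents matching "category AND selected Filters" are the intersection of the corresponding BitSets. The class and method names are hypothetical.

```java
import java.util.BitSet;
import java.util.List;

/** Sketch: narrowing a category by selected filters, then counting within the result. */
final class DrillDown {

    /** Documents in the category that also match every selected filter. */
    static BitSet narrow(BitSet categoryDocs, List<BitSet> selectedFilters) {
        BitSet result = (BitSet) categoryDocs.clone(); // never mutate a cached BitSet
        for (BitSet selected : selectedFilters) {
            result.and(selected);
        }
        return result;
    }

    /** Count to show next to a remaining filter link, given the narrowed base set. */
    static int countWithin(BitSet narrowedBase, BitSet filterDocs) {
        BitSet copy = (BitSet) narrowedBase.clone();
        copy.and(filterDocs);
        return copy.cardinality();
    }
}
```

Because intersection is commutative and associative, the narrowed set is the same whatever order the Filters were applied in, which is consistent with Filters being selectable in any order.
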
> What makes all of this feasible to do during a single user request, and
> what keeps the load on our servers manageable, is an aggressive caching
> strategy.
> The Servlet maintains a single IndexReader for use by any requests to
> the Query Plugin.  The Servlet also maintains a fixed size Cache of [
> Filter => BitSet] (This cache currently uses LRU replacement, but ideally
> it would be LFU).  The Servlet keeps track of when it makes modifications
> to the index, and once it's decided that it is time to make those
> modifications visible to the plugin it uses a background thread to open a
> new IndexReader, and create a new Cache instance which it warms up by
> "pre-computing" the BitSets for the top N Filters in the Cache already in
> use.  It then swaps out the "old" IndexReader/Cache pair with the new
> IndexReader/Cache for all subsequent searches.
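
A rough sketch of the cache-and-swap scheme described above, not the actual Servlet code. It assumes string keys for Filters, takes "top N" to mean the N most recently used entries of an LRU cache, and uses a hypothetical `BitSetSource` interface as a stand-in for evaluating a Filter against a newly opened IndexReader. All type names are invented for the example.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/** Fixed-size LRU cache from a filter key to the BitSet of matching documents. */
final class FilterCache {
    private final LinkedHashMap<String, BitSet> map;

    FilterCache(final int maxEntries) {
        // accessOrder=true keeps iteration order least-recently-used first.
        this.map = new LinkedHashMap<String, BitSet>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, BitSet> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized BitSet get(String key) { return map.get(key); }

    synchronized void put(String key, BitSet bits) { map.put(key, bits); }

    /** The n most recently used keys, ordered oldest to newest. */
    synchronized List<String> hottestKeys(int n) {
        List<String> keys = new ArrayList<String>(map.keySet());
        return new ArrayList<String>(keys.subList(Math.max(0, keys.size() - n), keys.size()));
    }
}

/** Stand-in for evaluating a filter against a freshly opened index reader. */
interface BitSetSource {
    BitSet compute(String filterKey);
}

/** A reader-like source and the cache built against it; always swapped as a pair. */
final class Snapshot {
    final BitSetSource source;
    final FilterCache cache;

    Snapshot(BitSetSource source, FilterCache cache) {
        this.source = source;
        this.cache = cache;
    }
}

final class SnapshotHolder {
    private final AtomicReference<Snapshot> current;
    private final int cacheSize;

    SnapshotHolder(Snapshot initial, int cacheSize) {
        this.current = new AtomicReference<Snapshot>(initial);
        this.cacheSize = cacheSize;
    }

    Snapshot current() { return current.get(); }

    /** Warms a new cache from the old one's hottest keys, then swaps. Meant for a background thread. */
    void reopen(BitSetSource fresh, int warmCount) {
        FilterCache warmed = new FilterCache(cacheSize);
        for (String key : current.get().cache.hottestKeys(warmCount)) {
            warmed.put(key, fresh.compute(key)); // oldest-to-newest keeps recency order
        }
        current.set(new Snapshot(fresh, warmed)); // in-flight searches keep the old pair
    }
}
```

Swapping the source and cache together through a single reference means a search never pairs a new reader with BitSets computed against the old one. `warmCount` is the knob for the freshness-versus-latency tradeoff: a larger value delays the swap but leaves fewer cold Filters afterwards.
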
> Given a large amount of RAM, and infrequent updates to the index, page
> hits to most categories rarely involve anything more than Cache lookups.
> But even when we make frequent updates, the Cache warming we do with a
> newly constructed IndexReader prior to actually *using* the IndexReader
> allows us to remain very responsive on our most popular categories.
> Through configuration, we can decide which matters more: opening new
> IndexReaders as fast as possible (so new results display immediately), at
> the expense of not being able to pre-warm the cache very much (resulting
> in slower page loads); or keeping our page load times very low by
> pre-warming our cache with everything and the kitchen sink (which means
> results take a while to update because we are opening new IndexReaders
> less frequently).  Our current configuration results in a ~95% cache hit
> rate for our Filters.
> Hopefully I've explained the overall design of our system well enough that
> people interested in doing "category counts" and "drilling down" can see
> that it is possible, even when you are dealing with a very large number of
> Filters.  I'm sure my familiarity with the system has caused me to write
> something that makes perfect sense to me, but is totally unintelligible to
> everyone else -- if so, please feel free to ask any questions you have, and
> I'll try to answer them as best I can.  Some questions regarding the
> internals of the Servlet and the Caching it does may be beyond my ability
> to answer because they were developed by my coworker -- but he is also an
> active participant on this list, and (time permitting) I'm sure he'd
> happily answer any questions I cannot.
> -Hoss
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
