lucene-java-user mailing list archives

From Paul Allan Hill
Subject RE: Using Lucene to match document sets to each other
Date Mon, 19 Dec 2011 17:17:19 GMT
I'm not sure I understand what your field arrangement would be when you say
"[T]he items I'm pulling in from the web contain large bodies of text (descriptions) whereas
the products in my catalog consist of shorter fields such as product name, manufacturer, product
code, etc. So using the smaller fields from my catalog to build queries against the larger
fields in the items I pull in seems to be the only way to do things (that I can think of)."

I would want to take a vanilla crawl, parse, index approach: (1) find a candidate document,
(2) parse the web document as best I could to locate all the fields of your existing documents
("product name, manufacturer, product code, etc."). But instead of creating a new document
from the parsed fields, I would use them to form a very general query against my existing document set.
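
A minimal sketch of what such a "very general" query might look like, assuming the parsed values are combined with OR so that any one field can produce a match. It only builds a query string in Lucene's classic query syntax and does not call Lucene itself; the class name, the field names (name, manufacturer, code), and the sample values are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class GeneralQuerySketch {

    // Builds an OR query over whichever fields the page parser managed to extract.
    static String buildQuery(Map<String, String> parsedFields) {
        StringJoiner clauses = new StringJoiner(" OR ");
        for (Map.Entry<String, String> e : parsedFields.entrySet()) {
            String value = e.getValue();
            if (value == null || value.trim().isEmpty()) {
                continue; // the parser could not locate this field on the page
            }
            // Escape backslashes and quotes so the value stays a single phrase.
            String phrase = value.trim().replace("\\", "\\\\").replace("\"", "\\\"");
            clauses.add(e.getKey() + ":\"" + phrase + "\"");
        }
        return clauses.toString();
    }

    public static void main(String[] args) {
        // Hypothetical values pulled from one web page; manufacturer was not found.
        Map<String, String> parsed = new LinkedHashMap<>();
        parsed.put("name", "Acme Widget");
        parsed.put("manufacturer", "");
        parsed.put("code", "AW-100");
        System.out.println(buildQuery(parsed));
    }
}
```

Fields the parser could not find are simply left out, so a badly parsed page still yields a usable, if looser, query.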

That sounds good, but if the web documents are tricky to parse, I could see why you might
want to index each web document as a single "text body" field and search it for any of your existing fields.
You'd get good throughput searching for as many of your catalog documents in as many of the web documents
as possible, but of course you would NOT want to wait until you've crawled all of Amazon before checking
for any matches.  This leads me to think about a multi-phase approach, where a crawler builds
an index of "useful" size, closes it, hands it off to the query-for-my-products
phase, and starts another one.  Note that this approach doesn't require your products to be in a
Lucene index, just the web documents.
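
The hand-off between the two phases could be sketched roughly as follows. This is plain Java with an in-memory list standing in for each "useful" size index and a case-insensitive substring test standing in for a real Lucene search; the class name, method names, and batch size are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class PhasedMatchSketch {
    private final int batchSize;
    private final List<String> productTerms;
    // Stand-in for the index currently being built: {url, lower-cased body}.
    private final List<String[]> batch = new ArrayList<>();
    // Product term -> URLs of the web documents that mention it.
    private final Map<String, List<String>> matches = new LinkedHashMap<>();

    PhasedMatchSketch(int batchSize, List<String> productTerms) {
        this.batchSize = batchSize;
        this.productTerms = productTerms;
    }

    // Phase 1: the crawler adds a page; a full batch is handed off at once.
    void add(String url, String body) {
        batch.add(new String[] {url, body.toLowerCase(Locale.ROOT)});
        if (batch.size() >= batchSize) {
            flush();
        }
    }

    // Phase 2: "close" the batch, query it for every product, start a new one.
    void flush() {
        for (String term : productTerms) {
            String needle = term.toLowerCase(Locale.ROOT);
            for (String[] doc : batch) {
                if (doc[1].contains(needle)) {
                    matches.computeIfAbsent(term, k -> new ArrayList<>()).add(doc[0]);
                }
            }
        }
        batch.clear();
    }

    Map<String, List<String>> matches() {
        return matches;
    }
}
```

The products stay in an ordinary list here, which mirrors the point that only the web documents need indexing. Calling flush() once more at the end of the crawl picks up the last partial batch.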

That sounds like a fun and interesting problem.  Good luck.

