lucene-solr-user mailing list archives

From Corey Tisdale <co...@shopperschoice.com>
Subject Re: metadata about result sets?
Date Fri, 10 Mar 2006 15:56:04 GMT
Interesting... I have been looking at Lucene in my spare time at work
for all of 3 days now, so I have to apologize for my lack of
understanding when it comes to how it works specifically :)  We have
a terrible internal search that we are looking to replace, and the
only thing it does well is help you refine a terrible result set with
faceted metadata. The way we build the metadata list is that, after
indexing a product, we actually build a bit sequence that
corresponds to all possible key/value combos for each product and
associate each variation with the product. Then when someone refines
by price range, let's say, we just look for the one that
matches. It gets a bit crazy, but it is the only way we could get
acceptable speed with millions of documents in the index. I did read
that email from CNET the other day, but it didn't really register what
they were talking about until I saw your metadata group XML example
doc here.

For the schema, I just meant the document format. The file is called
schema.xml. I haven't tried it, but it looks like you can change that
to affect the way Solr works without actually affecting the way
Lucene handles it. Is that wrong? I guess it doesn't really matter,
since it looks like your indexable groups make more sense from a
maintainability standpoint (less redundant data).

As for 'scanning the resultset', I can see how I was a little light on
the details. Sorry about that. I meant looking through the results to
see what facets apply to the result set. So if my company sells books
and power tools, when someone searches for 'the little engine who
could', once we know there are no power tools in the result set, I
don't show the refinement facets for power tool metadata (like
wattage or battery operated or blade size). For a big group of
diverse data, we could potentially have several hundred group names,
and it seems like it might be redundant to search for 300 metadata
types when we know that only 5 apply to the result set. If, however,
speed is not impacted noticeably by searching for metadata that does
not exist, then we don't need to worry about this. I am not familiar
enough with Lucene's performance to know which would be faster.
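
A minimal sketch of the "scan the results" idea, in plain Python rather
than Lucene. It assumes each hit carries its metadata as a dict under a
hypothetical "metadata" key; facets_in_results is an illustrative name,
not part of Solr or Lucene. Only facets that actually occur in the hits
come back, so power tool facets never show up for a book search.

```python
from collections import Counter, defaultdict


def facets_in_results(results):
    """Count metadata values seen in the hits, grouped by facet name.

    `results` is a list of dicts, each with a "metadata" dict
    (an assumed shape, used here only for illustration).
    """
    counts = defaultdict(Counter)
    for doc in results:
        for field, value in doc.get("metadata", {}).items():
            counts[field][value] += 1
    # Facets that never occur in the hits are simply absent.
    return {field: dict(values) for field, values in counts.items()}


hits = [
    {"title": "The Little Engine", "metadata": {"binding": "hardcover"}},
    {"title": "Engine Stories", "metadata": {"binding": "paperback"}},
    {"title": "Little Trains", "metadata": {"binding": "hardcover"}},
]
print(facets_in_results(hits))
```

The cost of this approach grows with the number of hits, so for very
large result sets it may lose to intersecting precomputed sets.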

In your example file, how does the name facet know to display only
the names that start with whatever initial was selected? Would that
be built in by modifying our result set first (by applying the
author:a from the initial metadata group) then letting it gather all
author names on the new set? That seems the easiest way to me, but I
don't know how that would affect speed with Lucene.
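
The "restrict first, then gather names" order described above can be
sketched as follows. This is plain Python over a list of dicts with an
assumed "author" key; refine_by_initial and author_names are
hypothetical helpers, and nothing here says this is how the plugin
would actually do it.

```python
def refine_by_initial(results, initial):
    """Keep only hits whose author starts with the chosen initial."""
    prefix = initial.lower()
    return [d for d in results if d["author"].lower().startswith(prefix)]


def author_names(results):
    """Gather the distinct author names present in a result set."""
    return sorted({d["author"] for d in results})


hits = [
    {"title": "Foundation", "author": "Asimov"},
    {"title": "Watership Down", "author": "Adams"},
    {"title": "Fahrenheit 451", "author": "Bradbury"},
    {"title": "I, Robot", "author": "Asimov"},
]
# Apply the "initial" facet first, then build the "name" facet
# from whatever is left.
narrowed = refine_by_initial(hits, "a")
print(author_names(narrowed))
```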

I think what I am starting to understand is that coming from what we
have (an RDBMS-based metadata gathering system), I need to rethink my
process. I've spent so much time training myself to think in terms of
how to make things fast in MySQL that I need to re-open my mind :)

-Corey

On Mar 10, 2006, at 12:44 AM, Chris Hostetter wrote:

>
> : I like the idea of the wiki page; I think I will attempt to set one
> : up after this email, but I wanted to see if I could do a little bit
> : better job of fleshing out how pulling metadata out might work (in my
>
> I finally got a chance to look at your ideas.
>
> first off: as far as i know, there aren't any special edit permissions
> necessary to modify the TaskList ... if the edit link wasn't showing up
> for you after you logged in, it might just be that the page was cached,
> try a force-reload.
>
> Okay, on to the topic at hand..
>
> : We add suggestable metadata as part of the product schema, so we
> : could have something like
>
> There's a difference between the index schema, and the "xml schema/dtd"
> for adding documents.  You seem to be suggesting a change to the xml used
> when adding documents to indicate whether a field should be suggestable or
> not, but that syntax is tied directly to the underlying lucene API for
> Documents/Fields -- where would the suggestable/preceding info be stored?
>
> : Once we reindex, we do a search for 'legal' again and our book is in
> : it. Based on our index,  we can scan the resultset and see that the
> : results have three suggestable fields, two of which do not require a
> : preceding field.
>
> I'm not sure what you mean by "scan the result" to get the
> suggestable fields (and their values) ... can you elaborate?
>
>
> I'm not sure if you read the thread yonik mentioned earlier about how we
> do this at CNET, but the way we store info about which fields we want to
> have facets on (and what those facets should be in the case of range
> queries and such) is to put "metadata documents" into the index.  for a
> single user request, you pull out the metadata document, then use the info
> contained in it to determine facets to search on and intersect with the
> main result.
>
> the format of the metadata docs we use is very custom, but perhaps a
> similar, generalized approach could be implemented?
>
> The plugin could dictate a specific XML format indicating the behavior to
> drive the facets using either of the following mechanisms (more could be
> added as needed)...
>   * make group FF of all indexed values of field F
>   * make group G using queries x, y, and z with labels a, b, and c
> ...users could index one or more metadata documents, containing the XML
> info in any stored field they want defined in the schema -- when
> configuring the plugin, they'd specify the field in the solrconfig.xml.
> at query time, they specify two queries: one to restrict the main results,
> and one to identify the metadata doc they want to use (if it's always the
> same one, a default could be configured in solrconfig as well)
>
> an example of what i mean about XML stored in a field of the metadata
> doc...
>
>    <facets>
>      <group id="price" label="Price">
>        <facet id="0-20"  label="Under $20">price:[0 TO 20]</facet>
>        <facet id="21-40" label="$21 - $40">price:[21 TO 40]</facet>
>        <facet id="41-60" label="$41 - $60">price:[41 TO 60]</facet>
>      </group>
>      <group id="initial" label="Author">
>        <facet id="a" label="A">author:a*</facet>
>        ...
>      </group>
>      <group id="name" label="Author" depends="initial">
>        <facet use-terms-field="author" />
>      </group>
>      ...
>    </facets>
>
>
> -Hoss
>
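
For reference, a rough sketch of how a facet definition like the quoted
one could be read into a usable structure. It uses Python's standard
xml.etree.ElementTree, not any Solr API. The XML is a trimmed copy of
the example with the elided "..." parts left out, and the reading of
"depends" (a group is offered only after a facet from the group it
depends on has been selected) is an assumption, as are the function
names.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example facet definition.
FACETS_XML = """
<facets>
  <group id="price" label="Price">
    <facet id="0-20" label="Under $20">price:[0 TO 20]</facet>
    <facet id="21-40" label="$21 - $40">price:[21 TO 40]</facet>
  </group>
  <group id="initial" label="Author">
    <facet id="a" label="A">author:a*</facet>
  </group>
  <group id="name" label="Author" depends="initial">
    <facet use-terms-field="author" />
  </group>
</facets>
"""


def parse_facet_groups(xml_text):
    """Turn the <facets> XML into a list of plain dicts, one per group."""
    groups = []
    for group in ET.fromstring(xml_text).findall("group"):
        facets = []
        for facet in group.findall("facet"):
            terms_field = facet.get("use-terms-field")
            if terms_field is not None:
                # "All indexed values of field F" style facet.
                facets.append({"terms_field": terms_field})
            else:
                # Explicit query with an id and a display label.
                facets.append({
                    "id": facet.get("id"),
                    "label": facet.get("label"),
                    "query": (facet.text or "").strip(),
                })
        groups.append({
            "id": group.get("id"),
            "label": group.get("label"),
            "depends": group.get("depends"),
            "facets": facets,
        })
    return groups


def active_groups(groups, selected_group_ids):
    """Assumed 'depends' rule: show a group only if its parent is selected."""
    return [g for g in groups
            if g["depends"] is None or g["depends"] in selected_group_ids]


groups = parse_facet_groups(FACETS_XML)
print([g["id"] for g in active_groups(groups, set())])
print([g["id"] for g in active_groups(groups, {"initial"})])
```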

