lucy-user mailing list archives

From Peter Karman <pe...@peknet.com>
Subject Re: [lucy-user] Avoid duplicate docs in hits?
Date Wed, 29 Aug 2012 03:13:27 GMT
Desilets, Alain wrote on 8/28/12 2:47 PM:
> When I started working with Lucy, I expected it to work like a kind of
> relational DB table, where certain fields of an index acted like "unique
> keys" for the records (which in turn would guarantee that there can be only
> one record with a given key). But that's not how Lucy is designed.
> 
> So in the end, we implemented our own class LucyIndex, which adds this kind of
> functionality. When defining the schema for the index, you indicate which
> field will act as the key. From then on, if you add a record whose key value
> is the same as that of an existing record, then the class will erase the
> existing record, and replace it by the one you provide. It wasn't hard to
> implement, but I am surprised this kind of functionality is not standard in
> Lucy.
> 

Alain,

I think you've answered the question in your comments: it wasn't hard to
implement on top of the Lucy core functionality. That's why it isn't in core.
Core aims to do the hard things.

You're right that Lucy doesn't have the concept of a primary key built in. I
expect that's because there are so many app-specific ways to define a PK, it's
not worth trying to build that functionality into core. (I think Marvin might
say the same about QueryParser.)

Instead, methods like delete_by_term() and delete_by_query() make it simple to
add app-specific constraints.

E.g., here's what I do in SWISH::Prog::Lucy::Indexer, which uses 'swishdocpath'
(the URI) as the unique term for each doc:

    # make sure we delete any existing doc with same URI
    $self->{lucy}->delete_by_term(
        field => 'swishdocpath',
        term  => $doc{swishdocpath}
    );

    $self->{lucy}->add_doc( \%doc );
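
For illustration, the delete-then-add pattern above could be wrapped in a small
class along these lines. This is only a sketch: UniqueKeyIndexer and its
key_field argument are hypothetical names, not part of Lucy or SWISH::Prog, and
it assumes the wrapped object provides delete_by_term(), add_doc() and commit()
the way Lucy::Index::Indexer does, and that the key field is an indexed field.

```perl
use strict;
use warnings;

# Hypothetical wrapper (not part of Lucy): keeps at most one doc per key
# value by deleting any existing doc with the same key before adding.
package UniqueKeyIndexer;

use Carp qw(croak);

sub new {
    my ( $class, %args ) = @_;
    croak "indexer required"   unless $args{indexer};
    croak "key_field required" unless defined $args{key_field};
    my $self = {
        indexer   => $args{indexer},      # e.g. a Lucy::Index::Indexer
        key_field => $args{key_field},    # name of the field acting as the key
    };
    return bless $self, $class;
}

sub add_doc {
    my ( $self, $doc ) = @_;
    my $field = $self->{key_field};
    croak "doc is missing key field '$field'" unless defined $doc->{$field};

    # Remove any existing doc with the same key, then add the new one.
    $self->{indexer}->delete_by_term(
        field => $field,
        term  => $doc->{$field},
    );
    $self->{indexer}->add_doc($doc);
    return;
}

# Pass commit through to the wrapped indexer.
sub commit {
    my ($self) = @_;
    return $self->{indexer}->commit;
}

package main;
```

The wrapper croaks when a doc lacks the key field rather than silently adding
it, since such a doc could never be replaced later. Whether that is the right
policy is an app-specific choice.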

All that said, I doubt anyone would be opposed to adding PK functionality into
core, were someone to care enough about the feature to work on it. I imagine a
specific FieldType would be the way to go about it, and then some logic in
add_doc() that checks the field types in %doc and does... what? croak?
delete_by_term (as in your code and my example above)?

Alternately, it might be worth sharing your LucyIndex class on CPAN in the
LucyX::* namespace. Something to consider.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com
