lucene-dev mailing list archives

From karl wettin <>
Subject Search agents
Date Wed, 04 Jan 2006 14:03:28 GMT
Hello list,

I wrote a search agent thingy for Lucene. It was built to handle a
huge number of agents.

Rather than one query per agent to find out if the new document is  
interesting or not, agent trigger queries are stored in an index that  
is queried with the tokens of a new document.
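
A minimal sketch of the reversed lookup described above, using plain Java collections as a stand-in for a real Lucene index. The class name AgentIndexSketch, its methods and the "field:token" key format are hypothetical and only illustrate the idea; they are not taken from the actual code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy stand-in for the agent index: maps "field:token" to the agents listening for it. */
final class AgentIndexSketch {
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Registers one trigger token of an agent under the given field. */
    void addTrigger(String agentId, String field, String token) {
        postings.computeIfAbsent(field + ":" + token, k -> new HashSet<>()).add(agentId);
    }

    /** Looks up every token of the new document once and collects the agents it hits. */
    Set<String> candidates(Map<String, List<String>> documentTokens) {
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, List<String>> field : documentTokens.entrySet()) {
            for (String token : field.getValue()) {
                hits.addAll(postings.getOrDefault(field.getKey() + ":" + token, Set.of()));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        AgentIndexSketch index = new AgentIndexSketch();
        index.addTrigger("agent-1", "name", "hotel");
        index.addTrigger("agent-2", "name", "paradise");
        index.addTrigger("agent-3", "category", "sports");

        Map<String, List<String>> doc = Map.of(
                "category", List.of("media"),
                "name", List.of("hotel", "paradise"));
        System.out.println(index.candidates(doc)); // [agent-1, agent-2]
    }
}
```

The cost of a lookup here depends on the number of tokens in the document rather than on the number of agents, which is the point of turning the index around.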

Since it uses the index a bit backwards, the agent trigger queries
are somewhat limited:

At least one token in an OR or FUZZY OR per agent field must match the
new document.
No NOT token in the agent may match the new document.
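
The two rules can be written down as a plain predicate. This is only a sketch of the semantics, not of how the index evaluates them: the class AgentTriggerSketch and its field names are hypothetical, FUZZY OR is left out, and the tokens are borrowed from the example agents further down.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of one agent: per-field OR tokens plus per-field NOT tokens. */
final class AgentTriggerSketch {
    private final Map<String, Set<String>> anyOf;  // field -> tokens, at least one must occur
    private final Map<String, Set<String>> noneOf; // field -> tokens, none may occur

    AgentTriggerSketch(Map<String, Set<String>> anyOf, Map<String, Set<String>> noneOf) {
        this.anyOf = anyOf;
        this.noneOf = noneOf;
    }

    /** Applies both rules to a document given as field -> set of tokens. */
    boolean triggersOn(Map<String, Set<String>> doc) {
        for (Map.Entry<String, Set<String>> rule : anyOf.entrySet()) {
            Set<String> docTokens = doc.getOrDefault(rule.getKey(), Set.of());
            if (Collections.disjoint(rule.getValue(), docTokens)) {
                return false; // no OR token of this field matched
            }
        }
        for (Map.Entry<String, Set<String>> rule : noneOf.entrySet()) {
            Set<String> docTokens = doc.getOrDefault(rule.getKey(), Set.of());
            if (!Collections.disjoint(rule.getValue(), docTokens)) {
                return false; // a NOT token matched
            }
        }
        return true;
    }

    public static void main(String[] args) {
        AgentTriggerSketch agent = new AgentTriggerSketch(
                Map.of("category", Set.of("media"), "name", Set.of("hotel", "rowanda")),
                Map.of("name", Set.of("paradise")));

        Map<String, Set<String>> hit = Map.of(
                "category", Set.of("media"), "name", Set.of("hotel", "lobby"));
        Map<String, Set<String>> vetoed = Map.of(
                "category", Set.of("media"), "name", Set.of("hotel", "paradise"));

        System.out.println(agent.triggersOn(hit));    // true
        System.out.println(agent.triggersOn(vetoed)); // false
    }
}
```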

It is fairly easy to add more query types, but they are limited to
single-token, non-wildcard types since the query is created from the
new document's tokens.

Agents are clustered by the fields they require, and each cluster is
stored in its own index. When a new document is sent to the
AgentManager, it creates one query per possible cluster. I'm not sure
this actually speeds things up; it's just a gut feeling.
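
One way to picture the clustering, as a sketch only: agents are grouped by the set of fields they require, and a cluster is treated as "possible" for a document when the document contains every one of those fields. That reading of "possible" is an assumption, and AgentClusterSketch with its methods is a hypothetical name, not the AgentManager API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Groups agents by required field set; each group would live in its own index. */
final class AgentClusterSketch {
    private final Map<Set<String>, List<String>> clusters = new HashMap<>();

    /** Adds an agent to the cluster identified by the fields it requires. */
    void addAgent(String agentId, Set<String> requiredFields) {
        clusters.computeIfAbsent(new TreeSet<>(requiredFields), k -> new ArrayList<>())
                .add(agentId);
    }

    /** Returns the clusters whose required fields are all present in the document. */
    List<Set<String>> clustersToQuery(Set<String> documentFields) {
        List<Set<String>> result = new ArrayList<>();
        for (Set<String> required : clusters.keySet()) {
            if (documentFields.containsAll(required)) {
                result.add(required);
            }
        }
        return result;
    }

    /** Returns the agents stored in one cluster, or an empty list if there is none. */
    List<String> agentsIn(Set<String> requiredFields) {
        return clusters.getOrDefault(requiredFields, List.of());
    }

    public static void main(String[] args) {
        AgentClusterSketch manager = new AgentClusterSketch();
        manager.addAgent("agent-1", Set.of("category", "name"));
        manager.addAgent("agent-2", Set.of("name"));
        manager.addAgent("agent-3", Set.of("name", "price"));

        // A document with only category and name never needs the {name, price} cluster.
        for (Set<String> cluster : manager.clustersToQuery(Set.of("category", "name"))) {
            System.out.println(cluster + " -> " + manager.agentsIn(cluster));
        }
    }
}
```

Whether this pays off depends on how many clusters a typical document can skip, which matches the doubt expressed above.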

Example agents in pseudo trigger query language:

Possible agent:

AND (OR ("category","media"))
AND (OR ("name", "hotel") OR ("name","rowanda"))
AND (NOT("name", "paradise"))

Impossible agent:

AND (OR ("category","media"))
AND (("name", "hotel") AND ("name","rowanda"))
AND (NOT("name", "paradise"))

In effect the agents can't trigger on AND queries of the same field.

One could of course run a more complex query against the new document
as the agent trigger, or use some classifier or whatever, if speed is
not a big deal. The agent triggers could then be built from the
original query. I probably won't implement such a thing myself.

Should I post the code to the sandbox when I've tested it? Are there
any restrictions on the code if I do that?

