From Wolfgang Hoschek <>
Subject Re: Search agents
Date Wed, 04 Jan 2006 20:53:11 GMT
If you'd consider using a MemoryIndex for this, I'd recommend also  
having a look at nux.xom.pool.FullTextUtil and  
nux.xom.pool.FullTextPool, adding smart caching for indexes, queries  
and results on top of a MemoryIndex. With some luck this (or some  
variant of it) could help speed up your use cases, at least as far as  
I gather.

[It's part of the Nux download]


Snippet from the javadoc:

  * Thread-safe XQuery/XPath fulltext search utilities; implemented with the
  * Lucene engine and a custom high-performance adapter for
  * on-the-fly main memory indexing with smart caching for indexes, queries
  * and results.
  * <p>
  * Complementing the standard XPath string and regular
  * expression matching functionality, Lucene has a powerful query syntax
  * with support for word stemming, fuzzy searches, similarity searches,
  * approximate searches, boolean operators, wildcards, grouping, range
  * searches, term boosting, etc.
  * For details see the <a target="_blank"
  * href=" 
queryparsersyntax.html">Lucene Query
  * Syntax and Examples</a>.
  * Also see {@link org.apache.lucene.index.memory.MemoryIndex}
  * and {@link PatternAnalyzer} for detailed documentation.
  * <p>
  * Example Java usage:
  * <pre>
  * Analyzer analyzer = PatternAnalyzer.DEFAULT_ANALYZER;
  * float score = FullTextUtil.match(
  *    "Readings about Salmons and other select Alaska fishing Manuals",
  *    "+salmon~ +fish* manual~",
  *    analyzer, analyzer);
  * if (score &gt; 0.0f) {
  *     // query matches text
  * } else {
  *     // query does not match text
  * }
  * </pre>
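
To give a rough idea of what result caching on top of a matcher can look like, here is a minimal sketch. It is not the Nux implementation: the class name ScoreCache, the LRU policy, and the pluggable scorer function are all assumptions made for illustration. In practice the scorer would wrap something like the FullTextUtil.match call above; here it is left as a plain function so the sketch stays self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiFunction;

/** Hypothetical LRU cache of match scores keyed by (query, text); not the Nux code. */
final class ScoreCache {
    private final int maxEntries;
    private final BiFunction<String, String, Float> scorer; // (text, query) -> score
    private final Map<String, Float> cache;
    private int misses = 0;

    ScoreCache(int maxEntries, BiFunction<String, String, Float> scorer) {
        this.maxEntries = maxEntries;
        this.scorer = scorer;
        // accessOrder=true keeps entries ordered from least to most recently used
        this.cache = new LinkedHashMap<String, Float>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Float> eldest) {
                // evict the least recently used entry once the cache is over capacity
                return size() > ScoreCache.this.maxEntries;
            }
        };
    }

    /** Returns the cached score if present, otherwise computes and caches it. */
    synchronized float match(String text, String query) {
        String key = query + '\u0000' + text; // NUL separator keeps the two parts apart
        Float cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        misses++;
        float score = scorer.apply(text, query);
        cache.put(key, score);
        return score;
    }

    /** Number of times the underlying scorer had to be invoked. */
    synchronized int misses() {
        return misses;
    }
}
```

Repeated calls with the same text and query then hit the cache instead of re-indexing and re-searching. Whether that pays off depends on how often the same pairs recur in your workload.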

On Jan 4, 2006, at 6:03 AM, karl wettin wrote:

> Hello list,
> I wrote a search agent thingy for Lucene. It was built to handle  
> huge amounts of agents.
> Rather than one query per agent to find out if the new document is  
> interesting or not, agent trigger queries are stored in an index  
> that is queried with the tokens of a new document.
> Since it uses the index a bit backwards, the agent trigger queries
> are somewhat limited:
> At least one token in an OR or FUZZY OR per agent field must match
> the new document.
> Any NOT token in agent must not match the new document.
> It is fairly easy to add more query types, but it is limited to single
> token and non-wildcard types since the query is created from the
> new document tokens.
> Agents are clustered by the fields each agent requires, and each
> cluster is stored in its own index. When a new document is sent to the
> AgentManager it creates one query per possible cluster. I'm not
> sure this actually speeds things up, just a gut feeling.
> Example agents in pseudo trigger query language:
> Possible agent:
> AND (OR ("category","media"))
> AND (OR ("name", "hotel") OR ("name","rowanda"))
> AND (NOT("name", "paradise"))
> Impossible agent:
> AND (OR ("category","media"))
> AND (("name", "hotel") AND ("name","rowanda"))
> AND (NOT("name", "paradise"))
> In effect the agents can't trigger on AND queries of the same field.
> One could of course place a more complex query on the new document
> as the agent triggers, use some classifier or whatever if speed is
> not a big deal. The agent triggers could then be built from the
> original query. I probably won't implement such a thing myself.
> Should I post the code to the sandbox when I've tested it? Are  
> there any restrictions to the code if I do that?
> -- 
> karl
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
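
For readers following the thread, the reverse matching scheme described in the quoted message can be sketched roughly as below. This is only an illustration under stated assumptions, not karl's code: it uses plain in-memory maps instead of a Lucene index, leaves out FUZZY OR and the clustering by required fields, and the names AgentIndex, Agent, or, not and doc are made up. Agents that have no OR token at all would never become candidates here, which fits the stated rule that at least one OR token per agent field must match.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: agent triggers are indexed, documents act as the query. */
final class AgentIndex {
    static final class Agent {
        final String id;
        final Map<String, Set<String>> anyOf = new HashMap<>();  // field -> OR tokens
        final Map<String, Set<String>> noneOf = new HashMap<>(); // field -> NOT tokens
        Agent(String id) { this.id = id; }
        Agent or(String field, String token) {
            anyOf.computeIfAbsent(field, f -> new HashSet<>()).add(token);
            return this;
        }
        Agent not(String field, String token) {
            noneOf.computeIfAbsent(field, f -> new HashSet<>()).add(token);
            return this;
        }
    }

    // inverted index: "field:token" -> agents using that token as an OR trigger
    private final Map<String, List<Agent>> postings = new HashMap<>();

    void add(Agent agent) {
        for (Map.Entry<String, Set<String>> e : agent.anyOf.entrySet()) {
            for (String token : e.getValue()) {
                postings.computeIfAbsent(e.getKey() + ":" + token, k -> new ArrayList<>()).add(agent);
            }
        }
    }

    /** Returns the ids of all agents triggered by the document (field -> tokens). */
    Set<String> match(Map<String, Set<String>> doc) {
        // Step 1: the document tokens are the query; collect candidate agents
        Set<Agent> candidates = new HashSet<>();
        for (Map.Entry<String, Set<String>> e : doc.entrySet()) {
            for (String token : e.getValue()) {
                List<Agent> hits = postings.get(e.getKey() + ":" + token);
                if (hits != null) candidates.addAll(hits);
            }
        }
        // Step 2: keep candidates whose OR and NOT rules all hold
        Set<String> triggered = new TreeSet<>();
        for (Agent a : candidates) {
            if (satisfied(a, doc)) triggered.add(a.id);
        }
        return triggered;
    }

    private static boolean satisfied(Agent a, Map<String, Set<String>> doc) {
        for (Map.Entry<String, Set<String>> e : a.anyOf.entrySet()) {
            Set<String> tokens = doc.getOrDefault(e.getKey(), Collections.emptySet());
            if (Collections.disjoint(e.getValue(), tokens)) return false; // no OR token matched
        }
        for (Map.Entry<String, Set<String>> e : a.noneOf.entrySet()) {
            Set<String> tokens = doc.getOrDefault(e.getKey(), Collections.emptySet());
            if (!Collections.disjoint(e.getValue(), tokens)) return false; // a NOT token matched
        }
        return true;
    }

    /** Convenience builder: doc("field1", "token1", "field2", "token2", ...). */
    static Map<String, Set<String>> doc(String... fieldTokenPairs) {
        Map<String, Set<String>> doc = new HashMap<>();
        for (int i = 0; i + 1 < fieldTokenPairs.length; i += 2) {
            doc.computeIfAbsent(fieldTokenPairs[i], f -> new HashSet<>()).add(fieldTokenPairs[i + 1]);
        }
        return doc;
    }
}
```

With the "possible agent" from the quoted example registered, a document with category "media" and name "hotel" triggers it. The same document with "paradise" added to the name field does not trigger it, and neither does a document without the category field.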
