lucene-java-user mailing list archives

From Erik Hatcher <>
Subject Re: Searching Textile Documents
Date Wed, 23 Nov 2005 20:30:33 GMT

On 23 Nov 2005, at 14:30, Alan Chandler wrote:
> 1) The Analyser

First you'll have to spell it the US English way :)

> Since the body has some special syntax, I assume I have to extend  
> the analyser
> to skip the special symbols etc.  Has anyone done this already?  Is  
> there a
> standard place to look? If not, do I have to start again from  
> scratch, or can
> I just "configure" an existing one?  (In particular, I have a  
> routine which
> will take a textile input string and produce an html output string  
> - so could
> I use the HTMLParser in the demo - alternatively JavaCC - is that  
> something I
> could use? - just came across it whilst writing this mail)

I don't know of a Textile analyzer - it looks like you could simply
configure all of its special symbols as a list of stop words and hand
it to StandardAnalyzer's constructor.  You could go to the trouble
of converting to HTML and then parsing that, but that would be
overkill and of course slower.
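A minimal sketch of that suggestion, assuming the Lucene 1.4-era
StandardAnalyzer(String[] stopWords) constructor; the stop list here is
hypothetical, and assumes Textile block signatures like "h1." lose their
trailing dot in StandardTokenizer and so arrive as plain words:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class TextileAnalyzerFactory {
    // Hypothetical stop list: Textile block signatures. Punctuation-only
    // markers like * and _ are dropped by StandardTokenizer anyway.
    private static final String[] TEXTILE_STOPS = { "h1", "h2", "h3", "bq", "bc", "p" };

    public static Analyzer create() {
        return new StandardAnalyzer(TEXTILE_STOPS);
    }
}
```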

> I ultimately want to put a summary of the text on the front portion
> of my web site.  In order to calculate where the split is, and
> therefore how many articles to place, it would be useful, as I am
> analysing it, to get some statistics like where the end of the first
> paragraph is.  Is there a "hook" that I can plug into to get that
> information out? (I scanned the javadocs, but I can't find anything
> obvious.)

No, there is nothing special in an analyzer to help with this.  It'd
probably be best to create a parser for Textile that can give you
back the raw text without the markup and also give you back the first
paragraph for your front-page summary.
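Not a full parser, but a crude illustration of the idea in plain Java.
The regexes cover only a few common Textile markers and are an
assumption about the input, not a complete grammar:

```java
public class TextileStripper {
    // Crude, illustrative stripping of a few common Textile markers;
    // a real parser would handle nesting, tables, footnotes, etc.
    static String stripMarkup(String textile) {
        String s = textile;
        s = s.replaceAll("(?m)^(h[1-6]|bq|bc|p)\\.\\s*", "");   // block signatures: "h1. ", "p. "
        s = s.replaceAll("[*_]{1,2}([^*_]+)[*_]{1,2}", "$1");   // *bold* and _emphasis_
        s = s.replaceAll("\"([^\"]+)\":\\S+", "$1");            // links: "text":url
        return s.trim();
    }

    // First paragraph = stripped text up to the first blank line.
    static String firstParagraph(String textile) {
        String plain = stripMarkup(textile);
        int split = plain.indexOf("\n\n");
        return split < 0 ? plain : plain.substring(0, split);
    }
}
```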

> 2) Use of different field types.
> I am struggling to understand what field types I need for my
> different fields.

It really all depends on your searching and results display needs.

> For instance, I will want to index all the body of the article, so
> that the words it contains show up in searches, and I will also want
> to output the snippet around where the text is on a search page.
> However, I can easily retrieve the article from the database given
> its ID.  Would I therefore make the ID of the article a keyword, and
> the body of it unstored?  And would I build a special space-separated
> string of the (undetermined number of) categories and make them
> normal?

All of those options are possible and there is no Lucene "best way"
to do it.  You could even use Lucene itself as the entire blog
storage mechanism, if you like :)

As for categories - it depends on how you need them to be  
incorporated into the search.  You may want to index them  
individually (multiple per document, if desired) as Field.Keyword()  
so they aren't analyzed.
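Under those choices, building a document might look like the sketch
below, using the Lucene 1.4-era Field.Keyword() and Field.UnStored()
factory methods; the field names ("id", "body", "category") are
illustrative, not a fixed schema:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class ArticleIndexer {
    public static Document toDocument(String id, String body, String[] categories) {
        Document doc = new Document();
        doc.add(Field.Keyword("id", id));       // stored, not analyzed: exact DB lookup key
        doc.add(Field.UnStored("body", body));  // analyzed and indexed, but not stored
        for (String category : categories) {
            // one Keyword field per category, so each is matched exactly, unanalyzed
            doc.add(Field.Keyword("category", category));
        }
        return doc;
    }
}
```

Because each category goes in its own Keyword field, a query like
category:java matches the exact category string without the analyzer
splitting or lowercasing it.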

