lucene-java-user mailing list archives

From Alan Chandler <>
Subject Searching Textile Documents
Date Wed, 23 Nov 2005 19:30:23 GMT
I am brand new to Lucene, and I am just figuring out how to include it in an 
application I am building (a personal blog).

In essence I have a set of articles that reside in a database.  Each one has 
an integer ID identifying it, a textual title, some key parameters (such as 
date published, and categories) and a body which is composed of text using 
the "textile" format (see

I have two sets of questions about things I don't quite understand.

1) The Analyser

Since the body has some special syntax, I assume I have to extend the analyser 
to skip the special symbols etc.  Has anyone done this already?  Is there a 
standard place to look?  If not, do I have to start from scratch, or can I 
just "configure" an existing one?  In particular, I have a routine which takes 
a textile input string and produces an HTML output string, so could I use the 
HTMLParser in the demo?  Alternatively, is JavaCC something I could use?  (I 
just came across it whilst writing this mail.)
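
To make the question concrete, here is a rough sketch of the kind of 
pre-processing I imagine doing before the text ever reaches an analyser.  It 
uses only java.util.regex, handles just a small subset of textile (block 
prefixes such as "h1." and "bq.", list markers, links, and *strong* / 
_emphasis_), and the class and method names are made up by me.  It is not 
meant as a complete textile parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips a small subset of textile markup so that only
// the plain words remain. Not a full textile parser.
class TextileStripper {
    // Block prefixes such as "h1. ", "bq. " or "p. " at the start of a line.
    private static final Pattern BLOCK_PREFIX =
            Pattern.compile("(?m)^(?:h[1-6]|bq|p)\\.[ \\t]+");
    // List markers ("* item", "# item") at the start of a line.
    private static final Pattern LIST_MARKER =
            Pattern.compile("(?m)^[*#]+[ \\t]+");
    // Links written as "link text":http://example.org (keep the link text).
    private static final Pattern LINK =
            Pattern.compile("\"([^\"]+)\":\\S+");
    // Inline *strong* and _emphasis_ around a word or phrase on one line.
    private static final Pattern INLINE =
            Pattern.compile("(?<![\\w*_])([*_])(\\S(?:.*?\\S)?)\\1(?![\\w*_])");

    static String strip(String textile) {
        String s = BLOCK_PREFIX.matcher(textile).replaceAll("");
        s = LIST_MARKER.matcher(s).replaceAll("");
        s = LINK.matcher(s).replaceAll("$1");
        s = INLINE.matcher(s).replaceAll("$2");
        return s;
    }

    public static void main(String[] args) {
        System.out.println(strip("h2. Hello\n\nSome *bold* words."));
    }
}
```

The output of strip() could then be handed to one of the existing analysers 
unchanged, which is why I wonder whether extending the analyser is really 
necessary.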

I ultimately want to put a summary of the text on the front portion of my web 
site.  To work out where to split each article, and therefore how many 
articles to place there, it would be useful to collect some statistics while 
I am analysing the text, such as where the first paragraph ends.  Is there a 
"hook" that I can plug into to get that information out?  (I scanned the 
javadocs, but I can't find anything obvious.)
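
For what it is worth, this is roughly the statistic I am after, sketched 
outside Lucene altogether.  It assumes that a paragraph ends at the first 
blank line (or at the end of the text if there is none); the class and method 
names are my own invention.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds where the first paragraph of an article ends,
// assuming paragraphs are separated by a blank line.
class FirstParagraph {
    // A line break, optional spaces or tabs, then another line break.
    private static final Pattern BLANK_LINE =
            Pattern.compile("\\r?\\n[ \\t]*\\r?\\n");

    // Index of the first blank line after any leading whitespace,
    // or the length of the text if there is only one paragraph.
    static int endOfFirstParagraph(String text) {
        int start = 0;
        while (start < text.length()
                && Character.isWhitespace(text.charAt(start))) {
            start++;
        }
        Matcher m = BLANK_LINE.matcher(text);
        return m.find(start) ? m.start() : text.length();
    }

    // The text of the first paragraph, trimmed, for use as a summary.
    static String summary(String text) {
        return text.substring(0, endOfFirstParagraph(text)).trim();
    }

    public static void main(String[] args) {
        System.out.println(summary("Intro line.\n\nRest of the article."));
    }
}
```

What I don't know is whether I can get at the same information from inside 
the analysis step, rather than scanning the text a second time like this.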

2) Use of different field types.

I am struggling to understand what field types I need for my different fields.

For instance, I will want to index the whole body of the article, so that the 
words it contains show up in searches, and I will also want to output the 
snippet around where the matching text is on a search page.  However, I can 
easily retrieve the article from the database given its ID.  Would I 
therefore make the ID of the article a keyword, and the body unstored?  And 
would I build a special space-separated string of the (undetermined number 
of) categories and make that a normal field?
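
To show what I mean by the space-separated category string, here is a small 
sketch in plain Java.  The class and method names are hypothetical, and 
lower-casing each category and replacing spaces inside a category name with 
underscores are only my assumptions, made so that each category stays a 
single token when the string is split on whitespace.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: turns a list of category names into one
// space-separated string with one token per category.
class CategoryField {
    static String toFieldValue(List<String> categories) {
        StringBuilder sb = new StringBuilder();
        for (String category : categories) {
            // Assumption: lower-case, and join multi-word names with "_"
            // so that "Open Source" becomes the single token "open_source".
            String token = category.trim()
                    .toLowerCase(Locale.ROOT)
                    .replaceAll("\\s+", "_");
            if (token.length() == 0) {
                continue; // skip blank entries
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toFieldValue(Arrays.asList("Lucene", "Open Source")));
    }
}
```

The resulting string is what I would put into the categories field of each 
document, if that is a sensible way to do it.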


Alan Chandler
Open Source. It's the difference between trust and antitrust.
