lucene-java-user mailing list archives

From Alan Chandler
Subject Re: Searching Textile Documents
Date Wed, 23 Nov 2005 20:56:38 GMT
On Wednesday 23 Nov 2005 20:30, Erik Hatcher wrote:
> On 23 Nov 2005, at 14:30, Alan Chandler wrote:
> > 1) The Analyser
> First you'll have to spell it the US English way :)

You mean yet another corruption of my language :-)  I am still having trouble
with color rather than colour in all my CSS files.

> I don't know of a Textile analyzer - it looks like you could simply
> configure all of its special symbols as a list of stop words and hand
> it to StandardAnalyzer's constructor. 

That might be possible; the really difficult ones are the URLs etc.

> You could go to the trouble 
> of converting to HTML and then parse that, but that would be overkill
> and of course slower.

Well, I have to put it into HTML to display it on a web page, so it's a form it
will exist in at some stage.

> > I ultimately want to put a summary of the text on the front portion
> > of my web site.  In order to calculate where the split is, and
> > therefore how many articles to place it would be useful as I am
> > analysing it to get some statistics like where is the end of the
> > first paragraph.  Is there a "hook" that I can plug into to get that
> > information out (I scanned the javadocs, but I can't find anything
> > obvious).
> No, there is nothing special in an analyzer to help with this.  It'd
> probably be best to create a parser for Textile that can give you
> back the raw text without the markup and also give you back the first
> paragraph.

I think you are probably right.  I am just looking at the demo HTML parser and
seeing how that's built from the JavaCC stuff - looks to be something I could
usefully study some more.
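
As a rough illustration of the parser idea discussed above, here is a minimal
sketch in plain Java that returns the text without markup and the first
paragraph. It is not a JavaCC grammar and not part of Lucene: the class
TextileSketch and its two methods are hypothetical names, and it only
recognises a small assumed subset of Textile (block prefixes such as "h2. ",
"quoted":url links, !image! references, and tight *strong* / _emphasis_ /
@code@ spans). Paragraphs are assumed to be separated by blank lines, so a
leading heading counts as the first paragraph.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not a Lucene class): strips a small subset of Textile. */
class TextileSketch {

    // Block signatures at the start of a line, e.g. "h2. ", "bq. ", "p. "
    private static final Pattern BLOCK_PREFIX =
            Pattern.compile("(?m)^(h[1-6]|bq|p)\\. ");

    // Links written as "link text":http://example.org keep only the link text
    private static final Pattern LINK =
            Pattern.compile("\"([^\"]+)\":\\S+");

    // Images written as !http://example.org/pic.png! are dropped entirely
    private static final Pattern IMAGE = Pattern.compile("!\\S+!");

    // Inline spans wrapped tightly in *, _ or @ keep only the inner text
    private static final Pattern INLINE =
            Pattern.compile("(?<!\\w)([*_@])(\\S(?:.*?\\S)?)\\1(?!\\w)");

    /** Returns the input with the recognised markup removed. */
    static String stripMarkup(String textile) {
        String text = textile.replace("\r\n", "\n");
        text = BLOCK_PREFIX.matcher(text).replaceAll("");
        text = LINK.matcher(text).replaceAll("$1");   // before INLINE, so URLs are gone
        text = IMAGE.matcher(text).replaceAll("");
        text = INLINE.matcher(text).replaceAll("$2");
        return text.trim();
    }

    /** Returns the first blank-line-separated paragraph of the stripped text. */
    static String firstParagraph(String textile) {
        String[] paragraphs = stripMarkup(textile).split("\\n\\s*\\n");
        return paragraphs.length == 0 ? "" : paragraphs[0].trim();
    }
}
```

The output of stripMarkup could then be handed to whatever analyzer is used
for indexing, while firstParagraph gives a candidate front-page summary.
Lists, tables and footnotes are not handled here; a real grammar would be
needed for those.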

> > 2) Use of different field types.
> All of those options are possible and there is no Lucene "best way"
> to do it.  You could easily use Lucene itself as the entire blog
> storage mechanism if you like, even :)

I hadn't thought about it until you mentioned it.  Indeed, that might be the
right way to go: the database is little more than an article store plus some
verification tables for the article status (published or not) and category.


Alan Chandler
Open Source. It's the difference between trust and antitrust.
