lucene-java-user mailing list archives

From Kevin Burton <>
Subject NGram Language Categorization Source
Date Fri, 19 Aug 2005 21:42:26 GMT
Hey lucene guys.

I know for a fact that a bunch of you have been curious about language
categorization for a long time now and Java has lacked a solid way to
solve this problem.

Anyway.  This new library that I just released should be easy to tie
into your lucene indexers.  Just run the library on the text (strip the
HTML first), store the result in a new Lucene field called LANG (or
something), and then apply a filter at search time so that you only get
hits in JUST that language.
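
To make that workflow concrete, here is a rough, self-contained sketch.
It is NOT the library's API: the class LangSketch and the methods
profile, distance, detect, tag and filterByLang are all made-up names,
the rank-order n-gram comparison is only an assumption about how an
n-gram categorizer might work, and a plain in-memory list stands in for
a real Lucene index.

```java
import java.util.*;

// Hypothetical sketch: detect a language, store it in a LANG field, filter on it.
class LangSketch {
    static final String EN_SAMPLE =
        "the quick brown fox jumps over the lazy dog and the cat sleeps in the house";
    static final String DE_SAMPLE =
        "der schnelle braune fuchs springt und der hund und die katze sind im haus";

    // Rank the character 1..3-grams of the text, most frequent first.
    static List<String> profile(String text, int maxSize) {
        String s = "_" + text.toLowerCase().replaceAll("[^\\p{L}]+", "_") + "_";
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= 3; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> grams = new ArrayList<>(counts.keySet());
        grams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b); // ties: alphabetical
        });
        return grams.subList(0, Math.min(maxSize, grams.size()));
    }

    // "Out of place" distance: sum of rank differences, with a fixed
    // penalty (the language profile size) for n-grams the language lacks.
    static int distance(List<String> doc, List<String> lang) {
        int total = 0;
        for (int i = 0; i < doc.size(); i++) {
            int j = lang.indexOf(doc.get(i));
            total += (j < 0) ? lang.size() : Math.abs(i - j);
        }
        return total;
    }

    // Return the code of the language profile closest to the text.
    static String detect(String text, Map<String, List<String>> profiles) {
        List<String> doc = profile(text, 300);
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> e : profiles.entrySet()) {
            int d = distance(doc, e.getValue());
            if (d < bestDist) {
                bestDist = d;
                best = e.getKey();
            }
        }
        return best;
    }

    // Index time: build a "document" carrying the body plus a LANG field.
    static Map<String, String> tag(String text, Map<String, List<String>> profiles) {
        Map<String, String> doc = new HashMap<>();
        doc.put("body", text);
        doc.put("LANG", detect(text, profiles));
        return doc;
    }

    // Search time: keep only the documents whose LANG field matches.
    static List<Map<String, String>> filterByLang(List<Map<String, String>> docs, String lang) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (lang.equals(doc.get("LANG"))) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, List<String>> profiles = new LinkedHashMap<>();
        profiles.put("en", profile(EN_SAMPLE, 300));
        profiles.put("de", profile(DE_SAMPLE, 300));

        List<Map<String, String>> index = new ArrayList<>();
        for (String text : new String[] {EN_SAMPLE, DE_SAMPLE}) {
            index.add(tag(text, profiles));
        }
        for (Map<String, String> doc : filterByLang(index, "de")) {
            System.out.println(doc.get("LANG") + ": " + doc.get("body"));
        }
    }
}
```

The two one-sentence training samples are far too small for real use;
they only keep the sketch short.  In Lucene itself the LANG value would
presumably go into an untokenized keyword field, with the search-time
restriction done as a filter on a term query for that field.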

I'd love some help with filling out missing languages if anyone has
some spare time.  That would help make up for all the hard work I've
done here (nudge.. nudge)

I did a thorough survey of the language categorization space for Java
and I think this is basically the only library out there.

Good luck

I'm working on a blog post describing how blog search engines like
Technorati, PubSub, and Feedster could/should use language
categorization to help deal with the chaos of tagging and full-text
search. Google has done this for a long time now and Technorati has it
in beta.

 Kevin A. Burton, Location - San Francisco, CA
      AIM/YIM - sfburtonator,  Web -
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
