lucene-java-user mailing list archives

From Vladimir Gubarkov
Subject Two questions on RussianAnalyzer
Date Thu, 19 Apr 2012 11:26:11 GMT

After updating to Lucene 3.6, I've noticed that the new RussianAnalyzer
no longer analyzes text the same way as before.

Please see this example:

    private List<String> getTokens(Analyzer theAnalyzer, String str) throws IOException {
        final TokenStream tokenStream =
                theAnalyzer.tokenStream(MessageFields.BODY, new StringReader(str));

        // Attribute through which the stream exposes the current term text
        final CharTermAttribute termAttribute =
                tokenStream.addAttribute(CharTermAttribute.class);

        List<String> tokens = new LinkedList<String>();

        while (tokenStream.incrementToken()) {
            final String term = new String(termAttribute.buffer(), 0,
                    termAttribute.length());
//            System.out.println(">>" + term);
            tokens.add(term);
        }
        return tokens;
    }

    public void testDots() throws IOException {
        final String str = " " +

        System.out.println("New analyzer:");
        System.out.println(getTokens(new RussianAnalyzer(Version.LUCENE_36), str));

        System.out.println("Old analyzer:");
        System.out.println(getTokens(new RussianAnalyzer(Version.LUCENE_30), str));
    }

This shows:

New analyzer:
[, 8888, a, b, c, d'e, f, g, h, i, j, k, l_m, n, o, p, q,
r, s, t, u, v, z, y, z]
Old analyzer:
[aaa, bbb, com, 8888, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p,
q, r, s, t, u, v, z, y, z]

Please note the differences.
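
Judging from the output above, the old analyzer appears to break the text at every character that is not a letter or a digit. To make that concrete, here is a minimal JDK-only sketch of such a splitting rule. It is not Lucene's tokenizer, only an illustration; the class name LetterDigitSplit and the sample input are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterDigitSplit {
    // Collects maximal runs of letters and digits; any other character
    // (dot, apostrophe, underscore, space, ...) acts as a separator.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical input, chosen only to mirror the kind of text in my test
        System.out.println(split("mail.example.org 8888 d'e l_m"));
        // prints [mail, example, org, 8888, d, e, l, m]
    }
}
```

With this rule every part of a dotted host name becomes its own token, which is what made my subdomain searches work.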

What bothers me most about the new behaviour is that I used to be able
to search by subdomain and get results for the matching host names.
Now I get 0 results.

My questions are: 1) is this change by design (not a mistake), and
2) is the only way to get the old behaviour to create the analyzer
with Version.LUCENE_30?

The other problem with RussianAnalyzer concerns the letter yo (ё), which in
Russian is often replaced by the letter ye (е); words that differ only in
this letter are considered the same.
What I want to achieve is that a search for a word with yo also finds
the same word written with ye (and vice versa).

What I'm currently doing is roughly the following:

// NOTE: I have to define my class in this package, because method
// russianAnalyzer.createComponents is protected

public class RussianAnalyzerImproved extends ReusableAnalyzerBase{
    private RussianAnalyzer russianAnalyzer = new

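    // Wrap the incoming reader so that yo is folded to ye before tokenizing
    protected Reader initReader(Reader reader) {
        return new YoCharFilter(CharReader.get(reader));
    }

    // Delegate the actual analysis chain to the stock RussianAnalyzer
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        return russianAnalyzer.createComponents(fieldName, reader);
    }
}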

public class YoCharFilter extends CharFilter {
    public YoCharFilter(CharStream in) {
        super(in);
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        final int charsRead = super.read(cbuf, off, len);
        if (charsRead > 0) {
            final int end = off + charsRead;
            // Replace yo (either case) with lowercase ye, in place
            while (off < end) {
                if (cbuf[off] == 'ё' || cbuf[off] == 'Ё')
                    cbuf[off] = 'е';
                off++;
            }
        }
        return charsRead;
    }
}

But I'm not sure this is the correct approach.
What do you think?
Maybe it would make sense to add a configuration option to
RussianAnalyzer itself (whether or not to distinguish yo and ye)?
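
To show the folding idea in isolation, here is a minimal sketch that uses only the JDK, with a plain java.io.FilterReader instead of a Lucene CharFilter. It is an illustration under assumptions, not a proposed patch: the names YoFoldingReader, fold and foldAll are made up, and unlike my filter it keeps the case of the letter (capital yo becomes capital ye).

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class YoFoldingReader extends FilterReader {
    public YoFoldingReader(Reader in) {
        super(in);
    }

    // Maps yo to ye and keeps the case; all other characters pass through.
    static char fold(char c) {
        if (c == '\u0451') return '\u0435'; // small yo -> small ye
        if (c == '\u0401') return '\u0415'; // capital yo -> capital ye
        return c;
    }

    // FilterReader forwards the single-char read separately, so fold here too.
    @Override
    public int read() throws IOException {
        int c = super.read();
        return c < 0 ? c : fold((char) c);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int charsRead = super.read(cbuf, off, len);
        // charsRead is -1 at end of stream, in which case the loop does nothing
        for (int i = off; i < off + charsRead; i++) {
            cbuf[i] = fold(cbuf[i]);
        }
        return charsRead;
    }

    // Hypothetical helper: runs a whole string through the reader.
    static String foldAll(String s) {
        try {
            Reader r = new YoFoldingReader(new StringReader(s));
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[64];
            int n;
            while ((n = r.read(buf, 0, buf.length)) > 0) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the replacement is one character for one character, the text length and all offsets stay the same, so no offset correction is needed.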

Sincerely yours,
