From "Spencer, Dave" <>
Subject Is there a StrlenFilter yet?
Date Mon, 21 Oct 2002 23:20:11 GMT

Use case - you want to protect yourself against pathological docs
such as one with a string of a million consecutive characters - any
normal tokenizer will consider this one big token but there's probably
no point in indexing a string that is a million characters long.
One example is indexing a mailing list which could contain uuencoded
attachments - there could be lots of lamo lines 72 or so chars long.

Anyway - I've attached a possible impl.

Discussion question is, let's say the filter is told to only return
tokens <= 5 chars long (note: I think 16 or so would be more realistic
for most docs - this is just for the sake of example).

What if there is a token 6 chars long, i.e. longer than the limit - say
it is "abcdef"?

Then either:

[a] we ignore "abcdef" and assume it is garbage
[b] we return "abcde" and "bcdef" i.e. all 5 char substrings
of it, so that if someone wants to search on the 6 char string they
sort of still can (at least w/ a carefully chosen query...hmmm..).
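
To make the two options concrete, here is a minimal standalone sketch
in plain Java, with no Lucene dependency. The class and method names
(LongTokenPolicy, dropIfTooLong, slidingSubstrings) are made up for
illustration and are not part of the filter in this mail; they only
show what each option would emit for a single term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, for illustration only: shows what options [a] and [b]
// would produce for one term given a maximum length.
final class LongTokenPolicy {

    // Option [a]: a term longer than max is treated as garbage and dropped.
    static List<String> dropIfTooLong(String term, int max) {
        if (term.length() <= max) {
            return Collections.singletonList(term);
        }
        return Collections.emptyList();
    }

    // Option [b]: a term longer than max is replaced by every substring
    // of exactly max characters, in order of starting offset.
    static List<String> slidingSubstrings(String term, int max) {
        if (max <= 0) {
            throw new IllegalArgumentException("max must be positive");
        }
        if (term.length() <= max) {
            return Collections.singletonList(term);
        }
        List<String> parts = new ArrayList<String>();
        for (int start = 0; start + max <= term.length(); start++) {
            parts.add(term.substring(start, start + max));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(dropIfTooLong("abcdef", 5));      // prints []
        System.out.println(slidingSubstrings("abcdef", 5));  // prints [abcde, bcdef]
    }
}
```

One thing the sketch makes visible: under [b] a term of n chars turns
into n - max + 1 tokens, so a million-char string would still put
roughly a million entries into the index. That may argue for [a], or
for capping [b] at some upper length.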

Anyway here's some code.
If popular it could be put into StandardAnalyzer.

package com.tropo.lucene;

import java.io.IOException;

import org.apache.lucene.analysis.*;

/**
 * Removes words that are too long or too short from the stream.
 */
public final class StrlenFilter
	extends TokenFilter
{
	/**
	 * Build a filter that removes words that are too long or too
	 * short from the text.
	 */
	public StrlenFilter(TokenStream in, int min, int max)
	{
		input = in;
		this.min = min;
		this.max = max;
	}

	/** Returns the next input Token whose termText() is the right length. */
	public final Token next() throws IOException
	{
		// return the first token whose length is within [min, max]
		for (Token token = input.next(); token != null; token = input.next())
		{
			final int len = token.termText().length();
			if (len >= min && len <= max)
				return token;
			// note: else we ignore it but should we index each part of it?
		}
		// reached EOS -- return null
		return null;
	}

	final int min;
	final int max;
}

