lucene-java-user mailing list archives

From Danil ŢORIN <torin...@gmail.com>
Subject Re: how to preserve whitespaces etc when tokenizing stream?
Date Mon, 16 Jan 2012 10:50:17 GMT
Maybe you could simply use String.replace()?
Or does the text actually need to be tokenized?
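
If plain String.replace() turns out to be too blunt (it would also rewrite
substrings inside longer words), here is a minimal sketch of the same idea in
plain Java, without Lucene: scan for tokens with a regex and copy everything
between matches through untouched. The class name, the token pattern, and the
dictionary and stop-word arguments are assumptions made up for illustration;
none of them come from the original code.

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
final class PreservingTranslator {

    // Assumed token definition: a run of Unicode letters or digits.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    /**
     * Replaces each token found in the dictionary with its translation and
     * copies everything else (whitespace, punctuation, stop words,
     * unknown tokens) to the output unchanged.
     */
    static String translate(String input,
                            Map<String, String> dictionary,
                            Set<String> stopWords) {
        Matcher m = TOKEN.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0; // end of the previous token in the input

        while (m.find()) {
            // Copy the gap between the previous token and this one verbatim.
            out.append(input, last, m.start());

            String token = m.group();
            // Stop words and unknown tokens pass through as-is.
            String replacement = stopWords.contains(token)
                    ? token
                    : dictionary.getOrDefault(token, token);
            out.append(replacement);

            last = m.end();
        }
        // Copy whatever follows the last token.
        out.append(input, last, input.length());
        return out.toString();
    }
}
```

The gaps between tokens are never parsed, only copied, so line breaks and
punctuation come out exactly as they went in. If the tokens do have to come
from a Lucene analyzer, OffsetAttribute exposes each token's start and end
offsets in the original text, so the same gap-copying loop should carry over,
though that variant is not shown here.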

On Fri, Jan 13, 2012 at 18:44, Ilya Zavorin <izavorin@caci.com> wrote:

> I am trying to perform a "translation" of sorts of a stream of text. More
> specifically, I need to tokenize the input stream, look up every term in a
> specialized dictionary and output the corresponding "translation" of the
> token. However, I also want to preserve all the original whitespace,
> stopwords etc. from the input so that the output is formatted in the same
> way as the input instead of ending up as a stream of translations. So if
> my input is
>
>
>
> <term1>: <term2> <stopword>! <term3>
>
> <term4>
>
>
>
> then I want the output to look like
>
>
>
> <term1'>: <term2'> <stopword>! <term3'>
>
> <term4'>
>
>
>
> (where <termi'> is translation of <termi>) instead of
>
>
>
> <term1'> <term2'> <term3'> <term4'>
>
>
>
> Currently I am doing the following:
>
>
>
> PatternAnalyzer pa = new PatternAnalyzer(Version.LUCENE_31,
>         PatternAnalyzer.WHITESPACE_PATTERN,
>         false,
>         WordlistLoader.getWordSet(new File(stopWordFilePath)));
> TokenStream ts = pa.tokenStream(null, in);
> CharTermAttribute charTermAttribute =
>         ts.getAttribute(CharTermAttribute.class);
>
> while (ts.incrementToken()) { // loop over tokens
>     String termIn = charTermAttribute.toString();
>     ...
> }
>
>
>
> but this, of course, loses all the whitespace etc. How can I modify this
> to be able to re-insert it into the output? Thanks much!
>
>
> Thanks,
>
> Ilya
>
