lucene-dev mailing list archives

From "Pawel Rog (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6991) WordDelimiterFilter bug
Date Mon, 25 Jan 2016 10:45:39 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115029#comment-15115029 ]

Pawel Rog commented on LUCENE-6991:
-----------------------------------

Below are the tokens produced by the first token stream, followed by those produced by the second token stream:

{code}
Jun
0000
GET
products
productskeyphraseextractor
key
phrase
extractor
HTTP
200
3437
http
httpwwwgooglecomurlsatrctjqesrcssourcewebcd
www
google
com
url
sa
t
rct
j
q
esrc
s
source
web
cd
cad
cadrjaved
rja
ved
QFj
QFjAEOAourlhttp
AEOAo
url
http
sematext
sematextcom
com
phrase
phraseextractor
extractor
ei
eiTPOuUbaWMOKiQfGxIGYDwusgAFQjCNGwYAFYgM3EZnp2eEWJzdvRrVPrgsig2oYitONI2EIZ0CQar7Ej8HAbv
TPOu
Uba
WM
OKi
Qf
Gx
IGYDw
usg
AFQj
CNGw
YAFYg
M3EZnp2e
EWJzdv
Rr
VPrg
sig2
o
Yit
ONI2EIZ0CQar7Ej8HA
bv
m
mbv
bv
d
daGc
a
Gc
Mozilla
X11
Ubuntu
Linux
i686
rv
Gecko
Firefox
{code}

{code}
Jun
0000
GET
products
productskeyphraseextractor
key
phrase
extractor
HTTP
200
3437
http
httpwwwgooglecomurlsatrctjqesrcssourcewebcd
www
google
com
url
sa
t
rct
j
q
esrc
s
source
web
cd
cad
cadrjaved
rja
ved
QFj
QFjAEOAourlhttp
AEOAo
url
http
sematext
sematextcom
com
phrase
phraseextractor
extractor
ei
eiTPOuUbaWMOKiQfGxIGYDwusgAFQjCNGwYAFYgM3EZnp2eEWJzdvRrVPrgsig2oYitONI2EIZ0CQar7Ej8HAb
TPOu
Uba
WM
OKi
Qf
Gx
IGYDw
usg
AFQj
CNGw
YAFYg
M3EZnp2e
EWJzdv
Rr
VPrg
sig2
o
Yit
ONI2EIZ0CQar7Ej8HA
b
vm
vmbv
bv
d
daGc
a
Gc
Mozilla
X11
Ubuntu
Linux
i686
rv
Gecko
Firefox
{code}


The only difference between the two input strings is the quotation mark before "http" in the second one. The outputs differ in a few terms:

eiTPOuUbaWMOKiQfGxIGYDwusgAFQjCNGwYAFYgM3EZnp2eEWJzdvRrVPrgsig2oYitONI2EIZ0CQar7Ej8HAbv vs
eiTPOuUbaWMOKiQfGxIGYDwusgAFQjCNGwYAFYgM3EZnp2eEWJzdvRrVPrgsig2oYitONI2EIZ0CQar7Ej8HAb

or 
mbv vs vmbv 
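
To find such differences without reading both listings by eye, the two token lists can be compared position by position. The following is only a minimal sketch and is not part of the original test: the class name TokenDiff and the method diff are hypothetical, and the sample lists are the five tokens around the mismatch copied from the two listings above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenDiff {
    /** Returns one "index: left vs right" entry for every position where the lists differ. */
    public static List<String> diff(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int max = Math.max(a.size(), b.size());
        for (int i = 0; i < max; i++) {
            // Use a placeholder when one list is shorter than the other.
            String x = i < a.size() ? a.get(i) : "<none>";
            String y = i < b.size() ? b.get(i) : "<none>";
            if (!x.equals(y)) {
                out.add(i + ": " + x + " vs " + y);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tokens around the mismatch, taken from the first and second listing.
        List<String> tokens1 = Arrays.asList("ONI2EIZ0CQar7Ej8HA", "bv", "m", "mbv", "bv");
        List<String> tokens2 = Arrays.asList("ONI2EIZ0CQar7Ej8HA", "b", "vm", "vmbv", "bv");
        for (String line : diff(tokens1, tokens2)) {
            System.out.println(line);
        }
        // Prints:
        // 1: bv vs b
        // 2: m vs vm
        // 3: mbv vs vmbv
    }
}
```

In the test from the issue description, passing tokens1 and tokens2 to such a helper before the final assertion would list every differing position instead of one long joined string.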

> WordDelimiterFilter bug
> -----------------------
>
>                 Key: LUCENE-6991
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6991
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.10.4, 5.3.1
>            Reporter: Pawel Rog
>            Priority: Minor
>
> I was preparing an analyzer that contains WordDelimiterFilter and I realized it sometimes gives results different than expected.
> I prepared a short test which shows the problem. I haven't used the Lucene test framework for this, but that doesn't matter for showing the bug.
> {code}
>     // Needs java.util.ArrayList/List, Guava's Joiner, JUnit's assertEquals, and Lucene's
>     // TokenStream, WhitespaceAnalyzer, WordDelimiterFilter and CharTermAttribute.
>     // First input: no quotation mark before "http".
>     String urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +0000] \"GET /products/key-phrase-extractor/ HTTP/1.1\"" +
>             " 200 3437 http://www.google.com/url?sa=t&rct=j&q=&esrc=s&" +
>             "source=web&cd=15&cad=rja&ved=0CEgQFjAEOAo&url=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" +
>             "phrase-extractor%2F&ei=TPOuUbaWM-OKiQfGxIGYDw&usg=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg&sig2" +
>             "=oYitONI2EIZ0CQar7Ej8HA&bvm=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) " +
>             "Gecko/20100101 Firefox/20.0\"";
>     List<String> tokens1 = new ArrayList<String>();
>     List<String> tokens2 = new ArrayList<String>();
>     WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
>     TokenStream tokenStream = analyzer.tokenStream("test", urlIndexed);
>     tokenStream = new WordDelimiterFilter(tokenStream,
>             WordDelimiterFilter.GENERATE_WORD_PARTS |
>             WordDelimiterFilter.CATENATE_WORDS |
>             WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
>         null);
>     CharTermAttribute charAttrib = tokenStream.addAttribute(CharTermAttribute.class);
>     tokenStream.reset();
>     // Collect and print the tokens of the first stream.
>     while (tokenStream.incrementToken()) {
>       tokens1.add(charAttrib.toString());
>       System.out.println(charAttrib.toString());
>     }
>     tokenStream.end();
>     tokenStream.close();
>     // Second input: identical except for the quotation mark before "http".
>     urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +0000] \"GET /products/key-phrase-extractor/ HTTP/1.1\"" +
>         " 200 3437 \"http://www.google.com/url?sa=t&rct=j&q=&esrc=s&" +
>         "source=web&cd=15&cad=rja&ved=0CEgQFjAEOAo&url=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" +
>         "phrase-extractor%2F&ei=TPOuUbaWM-OKiQfGxIGYDw&usg=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg&sig2" +
>         "=oYitONI2EIZ0CQar7Ej8HA&bvm=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) " +
>         "Gecko/20100101 Firefox/20.0\"";
>     System.out.println("\n\n====\n\n");
>     tokenStream = analyzer.tokenStream("test", urlIndexed);
>     tokenStream = new WordDelimiterFilter(tokenStream,
>             WordDelimiterFilter.GENERATE_WORD_PARTS |
>             WordDelimiterFilter.CATENATE_WORDS |
>             WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
>         null);
>     charAttrib = tokenStream.addAttribute(CharTermAttribute.class);
>     tokenStream.reset();
>     // Collect and print the tokens of the second stream.
>     while (tokenStream.incrementToken()) {
>       tokens2.add(charAttrib.toString());
>       System.out.println(charAttrib.toString());
>     }
>     tokenStream.end();
>     tokenStream.close();
>     // Fails: the two streams produce different tokens.
>     assertEquals(Joiner.on(",").join(tokens1), Joiner.on(",").join(tokens2));
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


