cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Pavel Yaskevich (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-11130) [SASI Pre-QA] = semantics not respected when using StandardAnalyzer
Date Sun, 07 Feb 2016 20:30:39 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136422#comment-15136422
] 

Pavel Yaskevich commented on CASSANDRA-11130:
---------------------------------------------

This is what [~jbellis] was talking about, since the index is split into multiple terms from
the index perspective "title1" = "Yesterday" is an exact much, but I guess it should just
give up trying to explain and make it work, even through it's going to be inefficient (going
to return more matches which will need to be filtered out).

> [SASI Pre-QA] = semantics not respected when using StandardAnalyzer
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-11130
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11130
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CQL
>         Environment: Tested from build [CASSANDRA-11067|https://issues.apache.org/jira/browse/CASSANDRA-11067]
>            Reporter: DOAN DuyHai
>            Assignee: Pavel Yaskevich
>
> Tested from build [CASSANDRA-11067|https://issues.apache.org/jira/browse/CASSANDRA-11067]
> {code:sql}
> CREATE KEYSPACE music WITH replication = {'class': 'SimpleStrategy', 'replication_factor':
'1'}  AND durable_writes = true;
> CREATE TABLE music.albums (
>     id int PRIMARY KEY,
>     artist text,
>     title1 text,
>     title2 text
> );
> CREATE CUSTOM INDEX ON music.albums (title1) USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'tokenization_skip_stop_words': 'true', 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
'case_sensitive': 'false', 'mode': 'PREFIX', 'tokenization_enable_stemming': 'true'};
> CREATE CUSTOM INDEX ON music.albums (title2) USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'tokenization_skip_stop_words': 'true', 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
'case_sensitive': 'false', 'mode': 'CONTAINS', 'tokenization_enable_stemming': 'true'};
> INSERT INTO music.albums(id, artist, title1, title2) 
> VALUES(1, 'Superpitcher', 'Yesterday', 'Yesterday');
> INSERT INTO music.albums(id, artist, title1, title2) 
> VALUES(2, 'Hilary Duff', 'So Yesterday', 'So Yesterday');
> INSERT INTO music.albums(id, artist, title1, title2) 
> VALUES(3, 'The Mr. T Experience', 'Yesterday Rules', 'Yesterday Rules');
> SELECT artist,title1 FROM music.albums WHERE title1='Yesterday';
>  artist                 | title1
> ------------------------+----------------
>            Superpitcher |       Yesterday
>             Hilary Duff |    So Yesterday
>    The Mr. T Experience | Yesterday Rules
>  
> (3 rows)
> SELECT artist,title1 FROM music.albums WHERE title2='Yesterday';
> artist                 | title1
> ------------------------+----------------
>            Superpitcher |       Yesterday
>             Hilary Duff |    So Yesterday
>    The Mr. T Experience | Yesterday Rules
>   
> (3 rows)
> {code}
> The semantic of *=* is not respected. SASI should return only 1 row with exact match.
Using *LIKE* would return all 3 rows. It does impact both *PREFIX* and *CONTAINS* mode. Using
*NonTokenizerAnalyzer* return 1 row with exact match.
>  So indeed, the semantics of *=* depends on the chosen analyzer, which is inconsistent.
We should force *=* to be exact match no matter which analyzer is chosen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message