cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-5959) CQL3 support for multi-column insert in a single operation (Batch Insert / Batch Mutate)
Date Mon, 02 Sep 2013 09:13:54 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-5959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755973#comment-13755973
] 

Sylvain Lebresne commented on CASSANDRA-5959:
---------------------------------------------

For what it's worth, I wouldn't be opposed to adding the multi-value INSERT extension from the
description. It can be handy (as in, it minimizes the number of characters to type in cqlsh
to insert multiple rows) and at least both MySQL and PostgreSQL support such a syntax extension.

Though as hinted above, it wouldn't fix the performance problem described here, so it's a
completely different motivation.  The reason such a big batch is slow is parsing (and
possibly also the transport of the large query string, though that part can be solved by using
compression at the transport level). If you want performance on such a big insert, you'll definitely
need to use prepared statements (and batches of them), and that's where the lack of CASSANDRA-4693
in 1.2 hurts.

I'll note however that while C* 1.2 doesn't have CASSANDRA-4693, it can still prepare batch
statements. So a workaround could be to prepare a medium-sized batch of a fixed number of
inserts, say 500 inserts (but some experimentation to find the best number is probably in
order), and use that to insert the 50K columns in batches of 500. It won't be as efficient
as what CASSANDRA-4693 gives you and it's certainly a bit of a pain to implement client side,
but performance-wise, this should (emphasis on should, since I haven't tested it) get you
closer to the Thrift perf numbers.
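
To make the workaround above concrete, here is a rough, untested-against-a-cluster sketch of
the client-side chunking, using the results table from the description. It only builds the
CQL text of a fixed-size batch and slices the values into groups of 500. The names batch_cql,
chunks and insert_row are made up for illustration, and execute(cql, params) is a hypothetical
stand-in for whatever prepare/bind/execute calls your driver offers, not a real driver API.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def batch_cql(n: int) -> str:
    """Return the CQL text of one BATCH holding n parameterized inserts."""
    insert = "  INSERT INTO results (row_id, index, value) VALUES (?, ?, ?);"
    return "\n".join(["BEGIN BATCH"] + [insert] * n + ["APPLY BATCH;"])


def chunks(items: Sequence[Tuple[int, str]], size: int) -> Iterator[Sequence[Tuple[int, str]]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def insert_row(row_id: str,
               values: Sequence[str],
               execute: Callable[[str, List[object]], None],
               batch_size: int = 500) -> int:
    """Insert one wide row in fixed-size batches; return the number of batches sent.

    `execute` is a hypothetical callback: a real client would prepare each
    distinct CQL string once and then bind `params` to it on every call.
    """
    cql_by_size: Dict[int, str] = {}  # one statement text per distinct batch size
    sent = 0
    for chunk in chunks(list(enumerate(values)), batch_size):
        size = len(chunk)
        if size not in cql_by_size:
            cql_by_size[size] = batch_cql(size)
        params: List[object] = []
        for index, value in chunk:
            params.extend((row_id, index, value))
        execute(cql_by_size[size], params)
        sent += 1
    return sent
```

With 50,000 values and batch_size=500 this issues 100 executions of the same statement text.
When the total is not a multiple of the batch size, the last chunk gets its own, smaller
statement, so at most two distinct statements would need preparing.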

                
> CQL3 support for multi-column insert in a single operation (Batch Insert / Batch Mutate)
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-5959
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5959
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core, Drivers
>            Reporter: Les Hazlewood
>              Labels: CQL
>
> h3. Impetus for this Request
> (from the original [question on StackOverflow|http://stackoverflow.com/questions/18522191/using-cassandra-and-cql3-how-do-you-insert-an-entire-wide-row-in-a-single-reque]):
> I want to insert a single row with 50,000 columns into Cassandra 1.2.9. Before inserting,
I have all the data for the entire row ready to go (in memory):
> {code}
> +---------+------+------+------+------+-------+
> |         | 0    | 1    | 2    | ...  | 49999 |
> | row_id  +------+------+------+------+-------+
> |         | text | text | text | ...  | text  |
> +---------+------+------+------+------+-------+
> {code}
> The column names are integers, allowing slicing for pagination. The column values are
a value at that particular index.
> CQL3 table definition:
> {code}
> create table results (
>     row_id text,
>     index int,
>     value text,
>     primary key (row_id, index)
> ) 
> with compact storage;
> {code}
> As I already have the row_id and all 50,000 name/value pairs in memory, I just want to
insert a single row into Cassandra in a single request/operation so it is as fast as possible.
> The only thing I can seem to find is to execute the following 50,000 times:
> {code}
> INSERT INTO results (row_id, index, value) values (my_row_id, ?, ?);
> {code}
> where the first {{?}} is an index counter ({{i}}) and the second {{?}} is the text
value to store at location {{i}}.
> With the Datastax Java Driver client and C* server on the same development machine, this
took a full minute to execute.
> Oddly enough, the same 50,000 insert statements in a [Datastax Java Driver Batch|http://www.datastax.com/drivers/java/apidocs/com/datastax/driver/core/querybuilder/QueryBuilder.html#batch(com.datastax.driver.core.Statement...)]
on the same machine took 7.5 minutes.  I thought batches were supposed to be _faster_ than
individual inserts?
> We tried instead with a Thrift client (Astyanax) and the same insert via a [MutationBatch|http://netflix.github.io/astyanax/javadoc/com/netflix/astyanax/MutationBatch.html].
 This took _235 milliseconds_.
> h3. Feature Request
> As a result of this performance testing, this issue is to request that CQL3 support batch
mutation operations as a single operation (statement) to ensure the same speed/performance
benefits as existing Thrift clients.
> Example suggested syntax (based on the above example table/column family):
> {code}
> insert into results (row_id, (index,value)) values 
>     ((0,text0), (1,text1), (2,text2), ..., (N,textN));
> {code}
> Each value in the {{values}} clause is a tuple.  The first tuple element is the column
name, the second tuple element is the column value.  This seems to be the most simple/accurate
representation of what happens during a batch insert/mutate.
> Not having this CQL feature forced us to remove the Datastax Java Driver (which we liked)
in favor of Astyanax because Astyanax supports this behavior.  We desire feature/performance
parity between Thrift and CQL3/Datastax Java Driver, so we hope this request improves both
CQL3 and the Driver.

