cassandra-commits mailing list archives

From "Jonathan Ellis (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-1278) Make bulk loading into Cassandra less crappy, more pluggable
Date Wed, 04 May 2011 03:36:03 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028574#comment-13028574 ]

Jonathan Ellis commented on CASSANDRA-1278:
-------------------------------------------

bq. One of the main goals of the bulk loading was that no local/temp storage was required
on the client; that has been the plan from the beginning

No, it hasn't.

But we can leave that aside for now; we already have "build everything else from the sstable
bits" code, so we can add "take advantage of local storage to offload that from the server"
later as an optimization.

bq. deprecate sessions altogether

You're going to need some kind of "when all of this is done, run this callback" construct for
bootstrap/node movement. Currently we call that a Session.
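
As a rough illustration of that construct (a minimal sketch only; TransferSession, fileCompleted
and the file names are hypothetical and not the existing streaming classes), a session can be
little more than a set of outstanding files plus a callback that fires once the set drains:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, not Cassandra's actual session implementation.
final class TransferSession {
    // Files that have been scheduled but not yet acknowledged by the receiver.
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    private final Runnable onComplete;
    // Guards against running the callback more than once.
    private final AtomicBoolean fired = new AtomicBoolean(false);

    TransferSession(Iterable<String> files, Runnable onComplete) {
        for (String f : files) {
            pending.add(f);
        }
        this.onComplete = onComplete;
    }

    // Called when one file has been fully transferred.
    // The callback runs only when the last outstanding file completes.
    void fileCompleted(String file) {
        if (pending.remove(file) && pending.isEmpty() && fired.compareAndSet(false, true)) {
            onComplete.run();
        }
    }

    int remaining() {
        return pending.size();
    }
}
```

For bootstrap, the callback would be whatever marks the node as having finished receiving its
ranges. Note that this sketch never fires for a session created with no files.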

bq. When node A wants to send things to node B, it records that fact in the system table.
For each entry it sends the file using the bulk loading protocol and continues retrying until
the file is accepted.

Sounds exactly like what existing streaming does.

bq. The only complex part is preventing removal of the SSTable on the source

Currently we do this by simply maintaining a reference to the SSTR object so the GC doesn't
delete it. There's no need to make it more complicated than that.
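
To illustrate the idea (a sketch under assumptions; TableHandle and PendingTransfers are
hypothetical stand-ins, not the real SSTR or streaming code), holding a strong reference for the
duration of the transfer is enough to keep the object reachable, so any cleanup tied to its
collection cannot run until the reference is released:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for an sstable reader object.
final class TableHandle {
    final String path;

    TableHandle(String path) {
        this.path = path;
    }
}

// Hypothetical registry of tables that are currently being streamed.
final class PendingTransfers {
    // Strong references: anything in this map stays reachable,
    // so the garbage collector cannot reclaim it.
    private final Map<String, TableHandle> held = new ConcurrentHashMap<>();

    // Pin the table before the transfer starts.
    void retain(TableHandle handle) {
        held.put(handle.path, handle);
    }

    // Unpin the table once the receiver has acknowledged it.
    void release(String path) {
        held.remove(path);
    }

    boolean isHeld(String path) {
        return held.containsKey(path);
    }
}
```

Once release is called and no other references remain, the object becomes eligible for
collection as usual.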

I took a look at the patch.  Just superficially, there's a lot of gratuitous change in there,
e.g., refactoring test_thrift_server.py.  Those changes also need to be moved to a separate
patch (again, I suggest git) so reviewers can easily distinguish refactoring from ticket-specific
changes.

> Make bulk loading into Cassandra less crappy, more pluggable
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-1278
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1278
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Jeremy Hanna
>            Assignee: Matthew F. Dennis
>             Fix For: 0.8.1
>
>         Attachments: 1278-cassandra-0.7-v2.txt, 1278-cassandra-0.7.1.txt, 1278-cassandra-0.7.txt
>
>   Original Estimate: 40h
>          Time Spent: 40h 40m
>  Remaining Estimate: 0h
>
> Currently bulk loading into Cassandra is a black art.  People are either directed to
just do it responsibly with thrift or a higher level client, or they have to explore the contrib/bmt
example - http://wiki.apache.org/cassandra/BinaryMemtable  That contrib module requires delving
into the code to find out how it works and then applying it to the given problem.  Using either
method, the user also needs to keep in mind that overloading the cluster is possible - which
will hopefully be addressed in CASSANDRA-685
> This improvement would be to create a contrib module or set of documents dealing with
bulk loading.  Perhaps it could include code in the Core to make it more pluggable for external
clients of different types.
> It is just that this is something that many that are new to Cassandra need to do - bulk
load their data into Cassandra.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
