cassandra-commits mailing list archives

From "Stefania (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance
Date Fri, 18 Mar 2016 07:52:33 GMT


Stefania commented on CASSANDRA-11053:

[~aholmber] the patch is ready, are you still available to review it?

||2.1||2.2||2.2 win||3.0||3.5||trunk||

The 2.1 patch merges cleanly up to 3.5, then there is a simple conflict into trunk.

The issue reported above by [~jjordan] occurred because the machine has only one core:
a typo caused the number of worker processes to be zero. This was easy to fix. However,
I then added a bulk copy test that simulates a single-core machine, see
[this pull request|], and this exposed a more serious deadlock in COPY TO. To fix it
I had to introduce a new thread in the COPY TO worker processes.
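
As an illustration of the single-core failure mode only (this is not the cqlsh code), here is a minimal Python sketch. It assumes the worker count is derived from the core count minus one core reserved for the parent process. The helper name `default_num_workers` and the cap of 16 are hypothetical.

```python
import multiprocessing


def default_num_workers(cpu_count=None, cap=16):
    """Return how many worker processes to start.

    Hypothetical helper: the name and the cap of 16 are assumptions
    made for this sketch, not values taken from cqlsh.
    """
    if cpu_count is None:
        cpu_count = multiprocessing.cpu_count()
    # Reserve one core for the parent process and respect the cap,
    # but never go below one worker: on a single-core machine
    # cpu_count - 1 is 0, which would leave no process to do the work.
    return max(1, min(cap, cpu_count - 1))


for cores in (1, 2, 8, 64):
    print(cores, "core(s) ->", default_num_workers(cores), "worker(s)")
```

Without the `max(1, ...)` floor, a one-core machine would get zero workers, which matches the symptom described above.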

Incidentally, this bug means that the performance measurements taken above were running one
worker process fewer than indicated.

> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>                 Key: CASSANDRA-11053
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>              Labels: doc-impacting
>             Fix For: 2.1.14, 2.2.6, 3.0.5, 3.5
>         Attachments: copy_from_large_benchmark.txt, copy_from_large_benchmark_2.txt,
>                      parent_profile.txt, parent_profile_2.txt, worker_profiles.txt, worker_profiles_2.txt
> h5. Description
> Running COPY FROM on a large dataset (20 GB divided into 20M records) revealed two issues:
> * The progress report is incorrect: it advances very slowly until almost the end of the test,
>   at which point it catches up extremely quickly.
> * The performance in rows per second is similar to that of smaller tests run against a smaller
>   local cluster (approx. 35,000 rows per second). By comparison, cassandra-stress manages
>   50,000 rows per second under the same set-up, making it about 1.5 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.
> h5. Doc-impacting changes to COPY FROM options
> * A new option was added: PREPAREDSTATEMENTS. It indicates whether prepared statements should
>   be used; it defaults to true.
> * The default value of CHUNKSIZE changed from 1000 to 5000.
> * The default value of MINBATCHSIZE changed from 2 to 10.
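
To make the three options above concrete, here is a small Python sketch that assembles a cqlsh COPY FROM command with the new defaults written out explicitly. The helper `build_copy_from`, the table name `ks.events` and the file name `events.csv` are hypothetical; only the option names and default values come from the list above.

```python
def format_value(value):
    """Render a Python value the way it is written in a COPY option."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_copy_from(table, csv_path, **options):
    """Build a cqlsh COPY FROM command string (hypothetical helper)."""
    clauses = " AND ".join(
        "{} = {}".format(name.upper(), format_value(value))
        for name, value in options.items()
    )
    statement = "COPY {} FROM '{}'".format(table, csv_path)
    if clauses:
        statement += " WITH " + clauses
    return statement + ";"


# Table and file names are placeholders; the values are the new defaults.
command = build_copy_from(
    "ks.events",
    "events.csv",
    preparedstatements=True,
    chunksize=5000,
    minbatchsize=10,
)
print(command)
```

The printed command can be pasted into cqlsh. Since these values are the defaults, spelling them out is only needed when overriding them, for example to restore the previous CHUNKSIZE of 1000.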

This message was sent by Atlassian JIRA
