spark-user mailing list archives

From CCInCharge <charles.l.chen....@gmail.com>
Subject Custom Catalyst Optimizer Strategy for DataFrame Writes?
Date Sat, 27 Jan 2018 23:17:21 GMT
I've been working with DataStax's spark-cassandra-connector and have noticed
that, when it creates batches of DataFrame Rows to write to the database,
write throughput increases substantially and overall task completion time
decreases if the user sorts the DataFrame on the Cassandra partition key
before writing.
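
To illustrate why row order could matter here, below is a toy model in plain
Python. It is only a sketch under an assumption of mine: that a writer puts
rows into the same batch only when they are consecutive and share a partition
key, up to a maximum batch size. This is a simplification, not the connector's
actual batching logic, and count_batches and max_batch_size are hypothetical
names.

```python
from itertools import groupby


def count_batches(rows, key, max_batch_size=3):
    """Count batches for a simplified writer (an assumed model, not the
    connector's real algorithm): only consecutive rows with the same key
    share a batch, and a batch holds at most max_batch_size rows."""
    batches = 0
    for _, run in groupby(rows, key=key):
        run_length = sum(1 for _ in run)
        # Ceiling division: a run longer than max_batch_size spills over.
        batches += -(-run_length // max_batch_size)
    return batches


# Each row is (partition_key, value); the keys are interleaved on purpose.
rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5), ("b", 6)]


def partition_key(row):
    return row[0]


unsorted_batches = count_batches(rows, partition_key)
sorted_batches = count_batches(sorted(rows, key=partition_key), partition_key)
```

Under this model the interleaved input needs one batch per row, while the
sorted input needs one batch per partition key, which is the kind of
difference I would expect to show up as fewer, larger writes.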

With the connector, a DataFrame is saved from Spark to Cassandra by calling
the DataFrame API's write method and setting the output format to
"org.apache.spark.sql.cassandra", which makes the DataFrameWriter write the
data to Cassandra through the connector.
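
For reference, the manual version of this (sort first, then write) might look
roughly like the PySpark sketch below. The helper name
write_sorted_to_cassandra and its parameters are hypothetical, the append
save mode is my assumption, and the caller has to supply the partition key
columns; the format string is the one mentioned above, and "keyspace" and
"table" are the connector's usual options.

```python
def write_sorted_to_cassandra(df, keyspace, table, partition_key_columns):
    """Sort a DataFrame on the given columns, then write it to Cassandra.

    Hypothetical helper: df is expected to be a pyspark.sql.DataFrame and
    partition_key_columns a list of column names chosen by the caller.
    """
    (
        df.sort(*partition_key_columns)             # sort on the partition key first
        .write
        .format("org.apache.spark.sql.cassandra")   # route the write through the connector
        .options(keyspace=keyspace, table=table)
        .mode("append")                             # assumption: append to an existing table
        .save()
    )
```

A call would then be something like
write_sorted_to_cassandra(df, "my_keyspace", "my_table", ["id"]), where the
keyspace, table, and column names are placeholders.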

I'm thinking that the spark-cassandra-connector could automatically sort a
DataFrame by Cassandra partition key before it writes the data to the
database. I am not very familiar with Catalyst, but one possibility I had in
mind is to register a custom Catalyst extension (extraStrategies or
extraOptimizations) in the connector that does this automatically. Is this
possible/valid, or am I misunderstanding what custom Catalyst optimizations
can do?
