spark-user mailing list archives

From maddenpj <madde...@gmail.com>
Subject Spark Streaming: No parallelism in writing to database (MySQL)
Date Thu, 25 Sep 2014 20:56:14 GMT
I posted yesterday about a related issue but resolved it shortly after. I'm
using Spark Streaming to summarize event data from Kafka and save it to a
MySQL table. Currently the bottleneck is in writing to MySQL and I'm puzzled
as to how to speed it up. I've tried repartitioning with several different
values, but it looks like only one worker is actually writing to MySQL.
This is not ideal, because I need the parallelism to insert this data in a
timely manner.

Here's the code: https://gist.github.com/maddenpj/5032c76aeb330371a6e6

I'm running this on a cluster of 6 Spark nodes (2 cores, 7.5 GB memory) and
have tried repartition values of 6, 12 and 48. How do I ensure that the
writes to the database actually happen in parallel?
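
For reference, the pattern usually suggested for this kind of write is to open one connection per partition and insert that partition's rows as a single batch, so that no single connection becomes the funnel for every row. The sketch below is not the code from the gist. It is a plain-Python stand-in in which a thread pool plays the role of Spark tasks, and the names split_into_partitions, write_partition, write_all, connect and insert_sql are all hypothetical. In Spark itself the corresponding calls would be foreachRDD on the DStream and foreachPartition on each RDD.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Bucket rows by the hash of their first field (a stand-in for repartition)."""
    buckets = [[] for _ in range(num_partitions)]
    for row in rows:
        buckets[hash(row[0]) % num_partitions].append(row)
    return buckets


def write_partition(rows, connect, insert_sql):
    """Write one partition using its own connection and a single batched insert.

    `connect` is an assumed zero-argument factory returning a DB-API style
    connection; `insert_sql` is an assumed parameterized INSERT statement.
    """
    if not rows:
        return 0  # nothing to write, so do not open a connection
    conn = connect()
    try:
        cur = conn.cursor()
        cur.executemany(insert_sql, rows)  # one batch instead of row-by-row inserts
        conn.commit()
    finally:
        conn.close()
    return len(rows)


def write_all(rows, num_partitions, connect, insert_sql):
    """Write every partition concurrently and return the number of rows written."""
    buckets = split_into_partitions(rows, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        written = pool.map(
            lambda bucket: write_partition(bucket, connect, insert_sql), buckets
        )
        return sum(written)
```

The point of the design is that the connection is created inside the per-partition function, so each task owns its connection and nothing has to be shared or serialized across workers. If all rows still end up in one bucket (for example because every record has the same key), only one writer will be busy regardless of the partition count.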


