spark-user mailing list archives

From dgoldenberg <>
Subject Spark and Morphlines, parallelization, multithreading
Date Wed, 18 Mar 2015 22:19:18 GMT
Still a Spark noob grappling with the concepts...

I'm trying to grok the idea of integrating something like the Morphlines
pipelining library with Spark (or SparkStreaming). The Kite/Morphlines doc
states that "runtime executes all commands of a given morphline in the same
thread...  there are no queues, no handoffs among threads, no context
switches and no serialization between commands, which minimizes performance
overheads."

Further: "There is no need for a morphline to manage multiple processes,
nodes, or threads because this is already addressed by host systems such as
MapReduce, Flume, Spark or Storm."

My question is, how exactly does Spark manage parallelization and
multi-threading aspects of RDD processing?  As I understand it, each
collection of data is split into partitions and each partition is sent over
to a slave machine to perform computations. So, for each data partition, how
many processes are created? And for each process, how many threads?

Knowing that would help me understand how to structure the following:

		JavaPairInputDStream<String, String> messages =


		JavaDStream<String> messageBodies = messages.map(new
Function<Tuple2<String, String>, String>() {
			public String call(Tuple2<String, String> tuple2) {
				return tuple2._2();
			}
		});

Would I want to create a morphline in a 'messages.foreachRDD' block, and then
invoke the morphline on each messageBody?
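
To make the question concrete, here is a toy sketch in plain Java of what I
have in mind: build the pipeline once per partition and reuse it for every
record in that partition, rather than building one per record. ToyPipeline
and PerPartitionSketch are hypothetical names and are NOT the Kite Morphlines
or Spark APIs; in real Spark code I assume this would sit inside something
like foreachPartition or mapPartitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a morphline: a chain of commands executed one
// after another in the caller's thread. This is not the Kite Morphlines API.
class ToyPipeline {
    // Counts constructed pipelines (not thread-safe; this sketch is sequential).
    static int instancesBuilt = 0;

    private final List<UnaryOperator<String>> commands;

    ToyPipeline(List<UnaryOperator<String>> commands) {
        this.commands = commands;
        instancesBuilt++;
    }

    // Runs every command in order on the calling thread: no queues, no handoffs.
    String process(String record) {
        String out = record;
        for (UnaryOperator<String> command : commands) {
            out = command.apply(out);
        }
        return out;
    }
}

class PerPartitionSketch {
    // Builds one pipeline for the whole partition and reuses it per record.
    static List<String> processPartition(List<String> partition) {
        List<UnaryOperator<String>> commands = new ArrayList<>();
        commands.add(s -> s.trim());
        commands.add(s -> s.toUpperCase());
        ToyPipeline pipeline = new ToyPipeline(commands);

        List<String> results = new ArrayList<>();
        for (String messageBody : partition) {
            results.add(pipeline.process(messageBody));
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> partition = Arrays.asList(" foo ", "bar ");
        System.out.println(processPartition(partition));
        System.out.println("pipelines built: " + ToyPipeline.instancesBuilt);
    }
}
```

The point of the sketch is only the shape: pipeline setup cost is paid once
per partition, and each record then costs just a few method calls.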

What will Spark be doing behind the scenes as far as multiple processes and
multiple threads? Should I rely on it to optimize performance with multiple
threads and not worry about plugging in a multi-threaded pipelining engine?
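
For what it's worth, my current mental model of "behind the scenes" is
roughly the toy sketch below: one task per partition, with the tasks run on a
fixed pool of threads inside a single worker process, and each partition
handled start to finish by one thread. This is plain Java with no Spark
classes, PartitionThreadSketch is a hypothetical name, and the model itself is
my assumption, so please correct it if it's wrong.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy model only: plain Java, no Spark classes. All names are hypothetical.
class PartitionThreadSketch {

    // Splits data into numPartitions slices and submits one task per slice
    // to a pool of `cores` threads. Each result is "<thread name> sum=<sum>".
    static List<String> run(List<Integer> data, int numPartitions, int cores) {
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int p = 0; p < numPartitions; p++) {
                int start = p * data.size() / numPartitions;
                int end = (p + 1) * data.size() / numPartitions;
                List<Integer> partition = data.subList(start, end);
                futures.add(pool.submit(() -> {
                    // The whole partition is processed by a single thread.
                    long sum = 0;
                    for (int x : partition) {
                        sum += x;
                    }
                    return Thread.currentThread().getName() + " sum=" + sum;
                }));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            data.add(i);
        }
        // 8 partitions on 2 threads: at most 2 partitions run at a time.
        for (String line : run(data, 8, 2)) {
            System.out.println(line);
        }
    }
}
```

If that model holds, a single-threaded pipeline per task would already be
running in parallel across partitions, which is what I'm trying to confirm.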

