spark-user mailing list archives

From "Evo Eftimov" <evo.efti...@isecc.com>
Subject RE: Creating topology in spark streaming
Date Wed, 06 May 2015 10:42:42 GMT
The “abstraction level” of Storm, or shall we call it its architecture, is effectively Pipelines
of Nodes/Agents. Pipelines is one of the standard Parallel Programming Patterns, which you
can use on multicore CPUs as well as on Distributed Systems. The chaps from Storm simply
implemented it as a reusable framework for distributed systems and offered it for free. Effectively,
you have a set of independent Agents chained in a pipeline, where the output of the previous
Agent feeds into the input of the next Agent.
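
As a rough single-machine illustration of the pipeline pattern described above (this is not Storm code; the names `run_stage` and `run_pipeline` are made up for this sketch), each agent can be modelled as a thread that reads from one queue and writes to the next:

```python
import queue
import threading

_STOP = object()  # sentinel that tells a stage to shut down


def run_stage(func, inbox, outbox):
    """One agent: read an item from inbox, apply func, write the result to outbox."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # pass the shutdown signal downstream
            return
        outbox.put(func(item))


def run_pipeline(stages, items):
    """Chain the stages with queues and push the items through them in order."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_STOP)

    # Drain the last queue until the shutdown signal arrives.
    results = []
    while True:
        out = queues[-1].get()
        if out is _STOP:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

For example, `run_pipeline([str.strip, str.upper], [" a ", "b "])` passes every item through both agents in turn. The stages run concurrently and no central scheduler is involved; the queues alone connect one agent to the next.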

 

Spark Streaming (which is essentially Batch Spark but with some optimizations for Streaming),
on the other hand, is more like a Map Reduce framework, where you always have a Central
Job/Task Manager scheduling and submitting tasks to remote distributed nodes, collecting the
results/statuses, and then scheduling and sending some more tasks, and so on.

 

“Map Reduce” is simply another Parallel Programming pattern, known as Data Parallelism
or Data Parallel Programming, although you can also have Data Parallelism without a Central
Scheduler.
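
A minimal sketch of the data-parallel pattern, for contrast with a pipeline: the same function is applied to independent slices of the data, and the partial results are then combined. The helper names `split` and `map_reduce` are hypothetical. A thread pool stands in for the remote worker nodes, so this shows only the structure of the pattern, not real distributed execution. It assumes `reduce_func` is associative and commutative and that `initial` is its identity value.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split(data, num_partitions):
    """Cut data into num_partitions strided slices of roughly equal size."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def map_reduce(data, map_func, reduce_func, initial, num_partitions=4):
    """Apply map_func to every element, partition by partition, then combine."""
    partitions = split(data, num_partitions)

    def process(partition):
        # Each partition is mapped and reduced independently of the others.
        return reduce(reduce_func, (map_func(x) for x in partition), initial)

    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(process, partitions))

    # Final step: merge the per-partition results into a single value.
    return reduce(reduce_func, partials, initial)
```

Calling `map_reduce(list(range(10)), lambda x: x * x, lambda a, b: a + b, 0)` sums the squares of 0 to 9. Here the caller plays the role of the central coordinator: it hands out the partitions and collects the partial results.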

 

From: Juan Rodríguez Hortalá [mailto:juan.rodriguez.hortala@gmail.com] 
Sent: Wednesday, May 6, 2015 11:20 AM
To: Evo Eftimov
Cc: anshu shukla; ayan guha; user@spark.apache.org
Subject: Re: Creating topology in spark streaming

 

Hi, 

 

I agree with Evo: Spark works at a different abstraction level than Storm, and there is no
direct translation from Storm topologies to Spark Streaming jobs. I think something remotely
close is the notion of the lineage of DStreams or RDDs, which is similar to the logical plan of
an engine like Apache Pig. Here https://github.com/JerryLead/SparkInternals/blob/master/pdf/2-JobLogicalPlan.pdf
is a diagram of a Spark logical plan by a third party. I would suggest reading the book
"Learning Spark" https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/foreword01.html
for more on this. But in general I think that Storm has an abstraction level closer to MapReduce,
and Spark has an abstraction level closer to Pig, so the correspondence between Storm and
Spark notions cannot be perfect.

 

Greetings, 

 

Juan 

 

 

 

 

2015-05-06 11:37 GMT+02:00 Evo Eftimov <evo.eftimov@isecc.com>:

What is called a Bolt in Storm is essentially a combination of [Transformation/Action and DStream
RDD] in Spark. So, to achieve higher parallelism for a specific Transformation/Action on a
specific DStream RDD, simply repartition it to the required number of partitions, which directly
relates to the corresponding number of Threads.
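
A hedged sketch of how the three-bolt example from the original question might be wired up in PySpark, with a `repartition` placed before the expensive step. Only `filter`, `repartition`, `map`, `foreachRDD` and `foreachPartition` are real DStream/RDD methods; everything else (the hashtag set, the toy word lists, `save_partition`, `build_topology`) is a hypothetical placeholder for the real logic. Reading "emit tweet using given hashtags" as a filter is also an assumption.

```python
HASHTAGS = {"#spark", "#storm"}        # hypothetical hashtags to keep
POSITIVE = {"good", "great", "love"}   # toy lexicon standing in for the
NEGATIVE = {"bad", "slow", "hate"}     # real sentiment-analysis logic


def has_hashtag(tweet):
    """BOLT1 stand-in: True if the tweet mentions one of the wanted hashtags."""
    return any(word.lower() in HASHTAGS for word in tweet.split())


def score_sentiment(tweet):
    """BOLT2 stand-in: return (tweet, positive word count minus negative word count)."""
    words = [w.strip(".,!?").lower() for w in tweet.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return (tweet, score)


def save_partition(rows):
    """BOLT3 stand-in: placeholder sink, to be replaced by a real database write."""
    for row in rows:
        print(row)


def build_topology(tweets, sentiment_partitions):
    """Chain the three steps on a DStream of raw tweet strings."""
    tagged = tweets.filter(has_hashtag)
    # Spread the data over more partitions so more tasks can score in parallel.
    spread = tagged.repartition(sentiment_partitions)
    scored = spread.map(score_sentiment)
    scored.foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))
    return scored
```

Each call returns a new DStream, so "emitting" to the next step is simply chaining the next transformation; nothing has to be written to HDFS in between. Whether the extra shuffle introduced by `repartition` pays off depends on how expensive the scoring step really is.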

 

From: anshu shukla [mailto:anshushukla0@gmail.com] 
Sent: Wednesday, May 6, 2015 9:33 AM
To: ayan guha
Cc: user@spark.apache.org; dev@spark.apache.org
Subject: Re: Creating topology in spark streaming

 

But the main problem is how to increase the level of parallelism for any particular bolt's logic.

 

Suppose I want this type of topology:

 

https://storm.apache.org/documentation/images/topology.png

 

How can we manage it?

 

On Wed, May 6, 2015 at 1:36 PM, ayan guha <guha.ayan@gmail.com> wrote:

Every transformation on a DStream will create another DStream. You may want to take a look
at foreachRDD. Also, kindly share your code so people can help better.

On 6 May 2015 17:54, "anshu shukla" <anshushukla0@gmail.com> wrote:

Please help, guys. Even after going through all the examples given, I have not understood
how to pass the DStreams from one bolt/logic to another (without writing them to HDFS etc.),
just like the emit function in Storm.

Suppose I have a topology with 3 bolts (say):

 

BOLT1 (parse the tweets and emit tweets using the given hashtags) =====> BOLT2 (complex logic for
sentiment analysis over tweets) =====> BOLT3 (submit tweets to the SQL database using Spark
SQL)


 

 

Now, since sentiment analysis will take most of the time, we have to increase its level of
parallelism to tune latency. How do we increase the level of parallelism, given that the logic
of the topology is not clear?

 

-- 

Thanks & Regards,
Anshu Shukla

Indian Institute of Sciences





 

-- 

Thanks & Regards,
Anshu Shukla

 

