spark-user mailing list archives

From Jörn Franke <>
Subject Re: How do you perform blocking IO in apache spark job?
Date Mon, 08 Sep 2014 15:40:01 GMT

What does the external service provide? Data? Calculations? Can the
service push data to you via Kafka and Spark Streaming? Can you fetch the
necessary data from the service beforehand? The solution to your question
depends on your answers.

I would not recommend connecting to a blocking service during Spark job
execution. What do you do if a node crashes? Is the order of service calls
relevant for you?
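If the data can be fetched beforehand, one pattern is to pull everything the job will need in a single driver-side pass and ship the result to executors, so tasks never block on the service. A minimal sketch with a stubbed service (`fetchFromService` is hypothetical; in a real Spark job you would wrap `prefetched` in `sc.broadcast` and do pure lookups inside the transformation):

```scala
// Hypothetical blocking service stub; in a real job this would be an HTTP/RPC call.
def fetchFromService(key: Int): Double = key * 2.0

// Prefetch everything the job will need in one driver-side pass.
// In Spark you would then broadcast it: val bc = sc.broadcast(prefetched)
val keys = Seq(1, 2, 3)
val prefetched: Map[Int, Double] = keys.map(k => k -> fetchFromService(k)).toMap

// Executors then do pure in-memory lookups instead of blocking calls:
val results = keys.map(prefetched)
```

This also sidesteps the node-crash question above: a re-executed task just re-reads the broadcast map instead of repeating service calls.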

Best regards
On 8 Sep 2014 at 17:31, "DrKhu" <> wrote:

> What if, when I traverse an RDD, I need to calculate values in the dataset by
> calling an external (blocking) service? How do you think that could be
> achieved?
> val values: Future[RDD[Double]] = Future sequence tasks
> I've tried to create a list of Futures, but since RDD is not Traversable,
> Future.sequence is not suitable.
> I just wonder if anyone has had such a problem, and how did you solve it? What
> I'm trying to achieve is parallelism on a single worker node, so that I
> can call that external service 3000 times per second.
> Probably there is another solution more suitable for Spark, such as having
> multiple worker nodes on a single host.
> It's interesting to know how you cope with such a challenge. Thanks.
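For in-node parallelism against a blocking service, an approach that fits Spark better than `Future.sequence` over the whole RDD is to parallelize inside each partition: fire a bounded batch of Futures per partition and await each batch, so at most a fixed number of calls are in flight per task. A sketch with a stubbed service (`callService` and `batchSize` are placeholders; in Spark, `processPartition` would be the body of `rdd.mapPartitions { iter => ... }`):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Stub for the external blocking call (hypothetical).
def callService(x: Int): Double = { Thread.sleep(10); x * 1.5 }

// Candidate body for rdd.mapPartitions { iter => ... }:
// group the partition into batches, start a Future per element, and
// await each batch so at most batchSize calls are in flight at once.
def processPartition(iter: Iterator[Int], batchSize: Int): Iterator[Double] =
  iter.grouped(batchSize).flatMap { batch =>
    val futures = batch.map(x => Future(callService(x)))
    Await.result(Future.sequence(futures), 1.minute)
  }

val out = processPartition(Iterator(1, 2, 3, 4), batchSize = 2).toList
```

In a real job you would use a dedicated, sized `ExecutionContext` rather than the global one, so the per-executor call rate stays under the service's limit.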
