spark-user mailing list archives

From Juan Martín Guillén
Subject Re: RDD-like API for entirely local workflows?
Date Sat, 04 Jul 2020 15:49:34 GMT
Hi Antonin.

It seems you are confusing Standalone mode with Local mode. They are two different modes.

From the Spark in Action book: "In local mode, there is only one executor in the same client
JVM as the driver, but this executor can spawn several threads to run tasks.
In local mode, Spark uses your client process as the single executor in the cluster,
and the number of threads specified determines how many tasks can be executed in parallel."

I am pretty sure this is the mode your use case is best suited to.

What you are referring to, I think, is running a Standalone cluster locally, something that
does not make much sense resource-wise and may be considered only for testing.

Running Spark in Local mode is totally fine and supported for non-cluster (local) environments.

Here are the options you have to connect your Spark application to:

Regards,
Juan Martín.

    On Saturday, 4 July 2020, 12:17:01 ART, Antonin Delpeuch (lists) wrote:

I am working on revamping the architecture of OpenRefine, an ETL tool,
to execute workflows on datasets which do not fit in RAM.

Spark's RDD API is a great fit for the tool's operations, and provides
everything we need: partitioning and lazy evaluation.

However, OpenRefine is a lightweight tool that runs locally, on the
user's machine, and we want to preserve this use case. Running Spark in
standalone mode works, but I have read in a couple of places that the
standalone mode is only intended for development and testing. This is
confirmed by my experience with it so far:
- the overhead added by task serialization and scheduling is significant
even in standalone mode. This makes sense for testing, since you want to
test serialization as well, but to run Spark in production locally, we
would need to bypass serialization, which is not possible as far as I know;
- some bugs that manifest themselves only in local mode are not getting
a lot of attention, so it seems dangerous to base a production system on
standalone Spark.

So, we cannot use Spark as the default runner in the tool. Do you know of
any alternative which would be designed for local use? A library which would
provide something similar to the RDD API, but for parallelization with
threads in the same JVM, not machines in a cluster?

If there is no such thing, it should not be too hard to write our
homegrown implementation, which would basically be Java streams with
partitioning. I have looked at Apache Beam's direct runner, but it is
also designed for testing, so it does not fit the bill for the same reasons.
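
For illustration, here is a minimal sketch of what such a homegrown implementation might look like, assuming plain Java streams and nothing from Spark. The names `LocalDataset`, `fromList`, `map`, `filter`, `collect`, `count` and `numPartitions` are hypothetical and not taken from OpenRefine or any existing library. Each partition is held as a supplier of a stream, so transformations are only recorded, and actions evaluate the partitions in parallel on threads of the same JVM.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch: a partitioned, lazily evaluated collection built on Java streams. */
final class LocalDataset<T> {
    // Each partition is a recipe for a stream, so nothing runs until an action is called.
    private final List<Supplier<Stream<T>>> partitions;

    private LocalDataset(List<Supplier<Stream<T>>> partitions) {
        this.partitions = partitions;
    }

    /** Splits a list into numPartitions contiguous slices (some may be empty). */
    static <T> LocalDataset<T> fromList(List<T> data, int numPartitions) {
        if (numPartitions < 1) {
            throw new IllegalArgumentException("numPartitions must be >= 1");
        }
        List<Supplier<Stream<T>>> parts = new ArrayList<>();
        int size = data.size();
        for (int i = 0; i < numPartitions; i++) {
            int from = (int) ((long) i * size / numPartitions);
            int to = (int) ((long) (i + 1) * size / numPartitions);
            List<T> slice = data.subList(from, to);
            parts.add(slice::stream);
        }
        return new LocalDataset<>(parts);
    }

    /** Lazy transformation: wraps each partition's recipe without running it. */
    <R> LocalDataset<R> map(Function<? super T, ? extends R> f) {
        List<Supplier<Stream<R>>> mapped = new ArrayList<>();
        for (Supplier<Stream<T>> p : partitions) {
            mapped.add(() -> p.get().map(f));
        }
        return new LocalDataset<>(mapped);
    }

    /** Lazy transformation: keeps only the elements matching the predicate. */
    LocalDataset<T> filter(Predicate<? super T> keep) {
        List<Supplier<Stream<T>>> filtered = new ArrayList<>();
        for (Supplier<Stream<T>> p : partitions) {
            filtered.add(() -> p.get().filter(keep));
        }
        return new LocalDataset<>(filtered);
    }

    /** Action: evaluates partitions in parallel and concatenates them in partition order. */
    List<T> collect() {
        return partitions.parallelStream()
                .map(p -> p.get().collect(Collectors.toList()))
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    /** Action: counts the elements of all partitions in parallel. */
    long count() {
        return partitions.parallelStream().mapToLong(p -> p.get().count()).sum();
    }

    int numPartitions() {
        return partitions.size();
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        LocalDataset<Integer> evenSquares = LocalDataset.fromList(numbers, 3)
                .map(x -> x * x)
                .filter(x -> x % 2 == 0);
        System.out.println(evenSquares.collect()); // prints [4, 16, 36, 64, 100]
    }
}
```

In this sketch the data starts in an in-memory list for brevity. For datasets that do not fit in RAM, each supplier would presumably open a stream over one on-disk chunk instead, which the same structure allows. There is no task serialization here, since everything stays in one JVM.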

We plan to offer a Spark-based runner in any case - but I do not think
it can be used as the default runner.

