spark-user mailing list archives

From Stephen Boesch
Subject Re: RDD-like API for entirely local workflows?
Date Sat, 04 Jul 2020 15:38:23 GMT
Spark in local mode (which is different from standalone mode) is a solution
for many use cases. I use it in conjunction with (and sometimes instead of)
pandas/pandasql because of its much wider ETL-related capabilities. On the
JVM side it is an even more obvious choice, given that there is no
equivalent to pandas and Spark performs even better there.

It is also a strong candidate due to the expressiveness of its SQL dialect,
including support for analytical/windowing functions. There is a latency
hit, on the order of a couple of seconds to start the SparkContext, but
pandas is not a high-performance tool in any case.

I see that OpenRefine is implemented in Java, so Spark in local mode should
be a very good complement to it.

On Sat, 4 Jul 2020 at 08:17, Antonin Delpeuch (lists) wrote:

> Hi,
> I am working on revamping the architecture of OpenRefine, an ETL tool,
> to execute workflows on datasets which do not fit in RAM.
> Spark's RDD API is a great fit for the tool's operations, and provides
> everything we need: partitioning and lazy evaluation.
> However, OpenRefine is a lightweight tool that runs locally, on the
> user's machine, and we want to preserve this use case. Running Spark in
> standalone mode works, but I have read in a couple of places that the
> standalone mode is only intended for development and testing. This is
> confirmed by my experience with it so far:
> - the overhead added by task serialization and scheduling is significant
> even in standalone mode. This makes sense for testing, since you want to
> test serialization as well, but to run Spark in production locally, we
> would need to bypass serialization, which is not possible as far as I know;
> - some bugs that manifest themselves only in local mode are not getting
> a lot of attention, so it seems dangerous to base a production system on
> standalone Spark.
> So, we cannot use Spark as the default runner in the tool. Do you know any
> alternative which would be designed for local use? A library which would
> provide something similar to the RDD API, but for parallelization with
> threads in the same JVM, not machines in a cluster?
> If there is no such thing, it should not be too hard to write our
> homegrown implementation, which would basically be Java streams with
> partitioning. I have looked at Apache Beam's direct runner, but it is
> also designed for testing, so it does not fit the bill for the same reasons.
> We plan to offer a Spark-based runner in any case - but I do not think
> it can be used as the default runner.
> Cheers,
> Antonin
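
For illustration only, the "Java streams with partitioning" idea from the quoted message might look roughly like the sketch below. It is not taken from Spark or OpenRefine: the class name LocalDataset and its methods (parallelize, map, filter, collect) are hypothetical and only loosely mirror RDD method names. Transformations are recorded lazily, and partitions are processed by threads of the same JVM through a parallel stream. Partitions are kept as in-memory sublists here, which is an assumption made for brevity; a real tool would presumably read them from disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch: a partitioned, lazily evaluated dataset run inside one JVM. */
final class LocalDataset<S, T> {
    // Source partitions (in-memory sublists here, purely for illustration).
    private final List<List<S>> partitions;
    // Deferred per-partition pipeline; nothing executes until collect() is called.
    private final Function<Stream<S>, Stream<T>> pipeline;

    private LocalDataset(List<List<S>> partitions, Function<Stream<S>, Stream<T>> pipeline) {
        this.partitions = partitions;
        this.pipeline = pipeline;
    }

    /** Splits the data into at most numPartitions contiguous chunks. */
    static <S> LocalDataset<S, S> parallelize(List<S> data, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        List<List<S>> parts = new ArrayList<>();
        int size = (data.size() + numPartitions - 1) / numPartitions; // ceiling division
        for (int i = 0; i < data.size(); i += size) {
            parts.add(data.subList(i, Math.min(i + size, data.size())));
        }
        Function<Stream<S>, Stream<S>> identity = stream -> stream;
        return new LocalDataset<>(parts, identity);
    }

    /** Records a map step without running it. */
    <R> LocalDataset<S, R> map(Function<? super T, ? extends R> f) {
        Function<Stream<S>, Stream<R>> next = pipeline.andThen(s -> s.map(f));
        return new LocalDataset<>(partitions, next);
    }

    /** Records a filter step without running it. */
    LocalDataset<S, T> filter(Predicate<? super T> p) {
        Function<Stream<S>, Stream<T>> next = pipeline.andThen(s -> s.filter(p));
        return new LocalDataset<>(partitions, next);
    }

    /** Runs the pipeline on each partition in parallel and concatenates results in order. */
    List<T> collect() {
        return partitions.parallelStream()
                .flatMap(part -> pipeline.apply(part.stream()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        List<Integer> result = LocalDataset.parallelize(rows, 4)
                .filter(x -> x % 2 == 0)
                .map(x -> x * x)
                .collect();
        System.out.println(result); // prints [4, 16, 36, 64, 100]
    }
}
```

Because each step only composes functions, no task serialization is involved, which is the overhead the quoted message wants to avoid. The sketch does not attempt shuffles, caching, or spilling to disk, which a real runner would need.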