spark-user mailing list archives

From Davies Liu <dav...@databricks.com>
Subject Re: Best strategy for Pandas -> Spark
Date Tue, 02 Jun 2015 00:38:32 GMT
The second one sounds reasonable, I think.
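In case it helps, a minimal sketch of what that second option could look
like: keep the light Pandas steps as they are and hand only the heavy
computation to Spark. The column names and the aggregation itself are
hypothetical placeholders, and it uses the SparkSession entry point.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partial-port").getOrCreate()

# Light, fast steps stay in plain Pandas (toy stand-in data here).
pdf = pd.DataFrame({"key": [1, 1, 2], "value": [10.0, 20.0, 30.0]})

# Hand the heavy (~30 min) step to Spark: createDataFrame accepts a
# Pandas DataFrame directly.
sdf = spark.createDataFrame(pdf)

# The expensive computation, expressed against the Spark DataFrame API.
heavy_result = sdf.groupBy("key").sum("value")

# Pull the (now small) aggregate back into Pandas for the remaining steps.
result_pdf = heavy_result.toPandas()

The boundary between Pandas and Spark is then just createDataFrame on the
way in and toPandas on the way out, so only the slow parts have to be
rewritten against the Spark API.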

On Thu, Apr 30, 2015 at 1:42 AM, Olivier Girardot
<o.girardot@lateral-thoughts.com> wrote:
> Hi everyone,
> Let's assume I have a complex workflow with more than 10 data sources as
> input and about 20 computations (some creating intermediary datasets and
> some merging everything for the final computation), some taking on average
> 1 minute to complete and some taking more than 30 minutes.
>
> What would be, in your view, the best strategy for porting this to Apache Spark?
>
> 1. Transform the whole flow into a Spark job (PySpark or Scala)
> 2. Transform only part of the flow (the heavy-lifting, ~30 min parts) using
>    the same language (PySpark)
> 3. Transform only part of the flow and pipe the rest from Scala to Python
>
> Regards,
>
> Olivier.

