spark-user mailing list archives

From Olivier Girardot <o.girar...@lateral-thoughts.com>
Subject Re: Best strategy for Pandas -> Spark
Date Tue, 02 Jun 2015 20:13:38 GMT
Thanks for the answer; I'm currently doing exactly that.
I'll try to sum up the usual Pandas <=> Spark DataFrame caveats soon.

Regards,

Olivier.

On Tue, 2 Jun 2015 at 02:38, Davies Liu <davies@databricks.com> wrote:

> The second one sounds reasonable, I think.
>
> On Thu, Apr 30, 2015 at 1:42 AM, Olivier Girardot
> <o.girardot@lateral-thoughts.com> wrote:
> > Hi everyone,
> > Let's assume I have a complex workflow of more than 10 datasources as input
> > - 20 computations (some creating intermediary datasets and some merging
> > everything for the final computation) - some taking on average 1 minute to
> > complete and some taking more than 30 minutes.
> >
> > What would you consider the best strategy for porting this to Apache Spark?
> >
> > 1. Transform the whole flow into a Spark Job (PySpark or Scala)
> > 2. Transform only part of the flow (the heavy lifting ~30 min parts) using the
> >    same language (PySpark)
> > 3. Transform only part of the flow and pipe the rest from Scala to Python
> >
> > Regards,
> >
> > Olivier.
>
