spark-user mailing list archives

From ayan guha <guha.a...@gmail.com>
Subject Re: Best way to process lookup ETL with Dataframes
Date Fri, 30 Dec 2016 05:06:13 GMT
How about this -

select a.*, nvl(b.col,nvl(c.col,'some default'))
from driving_table a
left outer join lookup1 b on a.id=b.id
left outer join lookup2 c on a.id=c.id

?
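
Roughly the same thing with the DataFrame API, if you'd rather avoid the SQL
string (just a sketch; driving, lookup1, lookup2 and the id/col names are
placeholders matching the query above):

    import org.apache.spark.sql.functions.col

    // Two left outer joins in one pass; coalesce picks the first
    // non-null lookup value, falling back to the default.
    val result = driving.as("a")
      .join(lookup1.as("b"), col("a.id") === col("b.id"), "left_outer")
      .join(lookup2.as("c"), col("a.id") === col("c.id"), "left_outer")
      .selectExpr("a.*", "coalesce(b.col, c.col, 'some default') as col")

Either way you get one flat plan instead of a split-and-union tree.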

On Fri, Dec 30, 2016 at 9:55 AM, Sesterhenn, Mike <msesterhenn@cars.com>
wrote:

> Hi all,
>
>
> I'm writing an ETL process with Spark 1.5, and I was wondering about the
> best way to do something.
>
>
> A lot of the fields I am processing require an algorithm similar to this:
>
>
> Join input dataframe to a lookup table.
>
> if (that lookup fails (the joined fields are null)) {
>
>     Lookup into some other table to join some other fields.
>
> }
>
>
> With DataFrames, it seems the only way to do this is something like the
> following:
>
>
> Join input dataframe to a lookup table.
>
> if (that lookup fails (the joined fields are null)) {
>
>    SPLIT the dataframe into two DFs via DataFrame.filter()
>
>       (one group with a successful lookup, the other failed).
>
>    For failed lookup:  {
>
>        Lookup into some other table to grab some other fields.
>
>    }
>
>    MERGE the dataframe splits back together via DataFrame.unionAll().
> }
>
>
> As you might imagine, I'm seeing some really large execution plans in the
> Spark UI, and the processing time seems way out of proportion to the size
> of the dataset (~250GB in 9 hours).
>
>
> Is this the best approach to implement an algorithm like this?  Note also
> that some fields I am implementing require multiple staged split/merge
> steps due to cascading lookup joins.
>
>
> Thanks,
>
>
> Michael Sesterhenn
>
> msesterhenn@cars.com
>
>
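
For reference, the split/filter/unionAll pattern described above might look
like this in Spark 1.5 Scala (a minimal sketch; input, lookup1, lookup2 and
their id/data/col columns are all hypothetical):

    import org.apache.spark.sql.functions.col

    // Placeholder schemas: input(id, data), lookup1(id, col), lookup2(id, col).
    val joined = input
      .join(lookup1, input("id") === lookup1("id"), "left_outer")
      .select(input("id"), input("data"), lookup1("col"))

    // Split on whether the first lookup succeeded.
    val hit  = joined.filter(col("col").isNotNull)
    val miss = joined.filter(col("col").isNull).drop("col")

    // Second-chance lookup for the misses only.
    val recovered = miss
      .join(lookup2, miss("id") === lookup2("id"), "left_outer")
      .select(miss("id"), miss("data"), lookup2("col"))

    // Merge the halves back together (column order must line up).
    val result = hit.unionAll(recovered)

Each split/merge stage adds another pair of branches that each re-read the
upstream plan, which is likely where the huge execution plans come from; the
coalesce-over-two-joins version keeps it to a single pass.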


-- 
Best Regards,
Ayan Guha
