spark-user mailing list archives

From Nicholas Hakobian <nicholas.hakob...@rallyhealth.com>
Subject Re: Best way to process lookup ETL with Dataframes
Date Fri, 30 Dec 2016 20:12:40 GMT
It looks like Spark 1.5 has the coalesce function, which is like NVL, but a
bit more flexible. From Ayan's example you should be able to use:
coalesce(b.col, c.col, 'some default')

If that doesn't have the flexibility you want, you can always use nested
case or if statements, but it's just harder to read.
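
Roughly, using the same placeholder names as Ayan's SQL below (driving_table
a, lookup1 b, lookup2 c, joined on id, with a looked-up column col), the
DataFrame version of that would be something like this untested sketch:

import org.apache.spark.sql.functions.{coalesce, col, lit}

val result = drivingTable.as("a")
  .join(lookup1.as("b"), col("a.id") === col("b.id"), "left_outer")
  .join(lookup2.as("c"), col("a.id") === col("c.id"), "left_outer")
  .select(
    col("a.id"),
    coalesce(col("b.col"), col("c.col"), lit("some default")).as("col"))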

Nicholas Szandor Hakobian, Ph.D.
Senior Data Scientist
Rally Health
nicholas.hakobian@rallyhealth.com



On Fri, Dec 30, 2016 at 7:46 AM, Sesterhenn, Mike <msesterhenn@cars.com>
wrote:

> Thanks, but is nvl() in Spark 1.5?  I can't find it in spark.sql.functions
> (http://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.functions$)
>
>
> Reading about the Oracle nvl function, it seems similar to the na
> functions.  Not sure it will help though, because what I need is to do a
> second join after the first join fails.
>
> ------------------------------
> *From:* ayan guha <guha.ayan@gmail.com>
> *Sent:* Thursday, December 29, 2016 11:06 PM
> *To:* Sesterhenn, Mike
> *Cc:* user@spark.apache.org
> *Subject:* Re: Best way to process lookup ETL with Dataframes
>
> How about this -
>
> select a.*, nvl(b.col,nvl(c.col,'some default'))
> from driving_table a
> left outer join lookup1 b on a.id=b.id
> left outer join lookup2 c on a.id=c.id
>
> ?
>
> On Fri, Dec 30, 2016 at 9:55 AM, Sesterhenn, Mike <msesterhenn@cars.com>
> wrote:
>
>> Hi all,
>>
>>
>> I'm writing an ETL process with Spark 1.5, and I was wondering about the
>> best way to do something.
>>
>>
>> A lot of the fields I am processing require an algorithm similar to this:
>>
>>
>> Join input dataframe to a lookup table.
>>
>> if (that lookup fails (the joined fields are null)) {
>>
>>     Lookup into some other table to join some other fields.
>>
>> }
>>
>>
>> With DataFrames, it seems the only way to do this is something like the
>> following:
>>
>>
>> Join input dataframe to a lookup table.
>>
>> if (that lookup fails (the joined fields are null)) {
>>
>>    *SPLIT the dataframe into two DFs via DataFrame.filter()
>>       (one group with a successful lookup, the other failed).*
>>
>>    For failed lookup:  {
>>
>>        Lookup into some other table to grab some other fields.
>>
>>    }
>>
>>    *MERGE the dataframe splits back together via DataFrame.unionAll().*
>> }
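>>
>> (Concretely, each such stage ends up as something like the following
>> sketch, with made-up table and column names:)
>>
>> val joined = input.join(lookup1, input("key") === lookup1("key"), "left_outer")
>> // rows where the first lookup succeeded
>> val hits = joined.filter(lookup1("value").isNotNull)
>>   .select(input("key"), lookup1("value").as("value"))
>> // rows where it failed: try the second lookup table
>> val misses = joined.filter(lookup1("value").isNull)
>>   .join(lookup2, input("key") === lookup2("key"), "left_outer")
>>   .select(input("key"), lookup2("value").as("value"))
>> val merged = hits.unionAll(misses)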
>>
>>
>> As you might imagine, I'm seeing some really large execution plans in the
>> Spark UI, and the processing time seems way out of proportion to the size
>> of the dataset (~250GB in 9 hours).
>>
>>
>> Is this the best approach to implement an algorithm like this?  Note also
>> that some fields I am implementing require multiple staged split/merge
>> steps due to cascading lookup joins.
>>
>>
>> Thanks,
>>
>>
>> *Michael Sesterhenn*
>>
>>
>> *msesterhenn@cars.com*
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
