spark-user mailing list archives

From James Starks <suse...@protonmail.com.INVALID>
Subject Re: Newbie question on how to extract column value
Date Tue, 07 Aug 2018 16:12:57 GMT
Because of some legacy issues I can't immediately upgrade the Spark version. But I tried
filtering the data before loading it into Spark, based on the suggestion:

     val df = sparkSession.read.format("jdbc")
       .option(...)
       .option("dbtable", "(select .. from ... where url <> '') table_name")
       ...
       .load()
     df.createOrReplaceTempView("new_table")

Then performing the custom operation does the trick:

    sparkSession.sql("select id, url from new_table").as[(String, String)].map { case (id,
url) =>
       val derived_data = ... // operation on url
       (id, derived_data)
    }.show()
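
Just to make the example concrete, one hypothetical version of the derived operation could extract the path component of the URL with java.net.URI (the real operation is elided above):

    import java.net.URI
    import sparkSession.implicits._

    sparkSession.sql("select id, url from new_table").as[(String, String)]
      .map { case (id, url) =>
        // Hypothetical derived operation: keep only the path part of the URL.
        (id, new URI(url).getPath)
      }.show()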

Thanks for the advice, it's really helpful!
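
For completeness, the UDF route Gourav suggests could look roughly like the sketch below; isValid here is a hypothetical predicate standing in for the real URL validation:

    import org.apache.spark.sql.functions.{col, udf}

    // Hypothetical predicate; replace with the real validation logic.
    def isValid(url: String): Boolean = url.startsWith("http")
    val isValidUdf = udf(isValid _)

    sparkSession.sql("select id, url from new_table")
      .filter(isValidUdf(col("url")))
      .show()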

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On August 7, 2018 5:33 PM, Gourav Sengupta <gourav.sengupta@gmail.com> wrote:

> Hi James,
>
> It is always advisable to use the latest Spark version. That said, can you please give
> DataFrames and UDFs a try if possible? I think that would be a much more scalable way to
> address the issue.
>
> Also, where possible, it is always advisable to apply the filter before fetching the data
> into Spark.
>
> Thanks and Regards,
> Gourav
>
> On Tue, Aug 7, 2018 at 4:09 PM, James Starks <suserft@protonmail.com.invalid> wrote:
>
>> I am very new to Spark. I have just successfully set up Spark SQL to connect to a PostgreSQL
>> database, and am able to display a table with the code
>>
>>     sparkSession.sql("SELECT id, url from table_a where col_b <> '' ").show()
>>
>> Now I want to perform filter and map functions on the col_b value. In plain Scala it would
>> be something like
>>
>>     Seq((1, "http://a.com/a"), (2, "http://b.com/b"), (3, "unknown"))
>>       .filter { case (_, url) => isValid(url) }
>>       .map { case (id, url) => (id, pathOf(url)) }
>>
>> where filter removes invalid URLs, and map transforms (id, url) into (id, path of url).
>>
>> However, when applying this concept to Spark SQL with the code snippet
>>
>>     sparkSession.sql("...").filter(isValid($"url"))
>>
>> the compiler complains about a type mismatch because $"url" is of type ColumnName. How can I
>> extract the column value, i.e. http://..., for the url column in order to apply the filter
>> function?
>>
>> Thanks
>>
>> Java 1.8.0
>> Scala 2.11.8
>> Spark 2.1.0