spark-user mailing list archives

From ZHANG Wei <wezh...@outlook.com>
Subject Re: Filtering on multiple columns in spark
Date Wed, 29 Apr 2020 08:51:48 GMT
AFAICT, Spark SQL built-in functions[1] can help here: pass the whole condition to filter() as a single SQL expression string, as below:

scala> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+


scala> df.filter("length(name) == 4 or substring(name, 1, 1) == 'J'").show()
+---+------+
|age|  name|
+---+------+
| 30|  Andy|
| 19|Justin|
+---+------+
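
If you would rather stay with the Column API used in the original question, the following is a minimal sketch of the same idea. As far as I can tell, the "value || is not a member of Int" error comes from Scala operator precedence: an operator such as !== ends in "=" without starting with "=", so Scala gives it the lowest (assignment-like) precedence and evaluates 10 || substring(...) first. Column's =!= operator does not have that problem. The sample rows and the local SparkSession below are assumptions for illustration only and stand in for the real newDF.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, length, substring}

// Assumption: a local session for illustration; in spark-shell, getOrCreate()
// simply returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("filter-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for newDF from the question.
// The column is already a string here, so no cast to StringType is needed.
val newDF = Seq("7123456789", "8123456789", "71234").toDF("target_mobile_no")

// =!= is Column's "not equal" operator. It starts with '=', so it binds
// tighter than ||; the parentheses are only there for readability.
val rejectedDF = newDF.filter(
  (length(col("target_mobile_no")) =!= 10) ||
  (substring(col("target_mobile_no"), 1, 1) =!= "7")
)

// Keeps the rows that are not 10 characters long or do not start with "7".
rejectedDF.show()
```

Both conditions end up in one filter, so there should be no need to build a separate dataframe per condition.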


-- 
Cheers,
-z
[1] https://spark.apache.org/docs/latest/api/sql/index.html

On Wed, 29 Apr 2020 08:45:26 +0100
Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:

> Hi,
>
> I am trying to filter a dataframe with multiple conditions using OR "||", as below:
>
>   val rejectedDF = newDF.withColumn("target_mobile_no", col("target_mobile_no").cast(StringType)).
>                    filter(length(col("target_mobile_no")) !== 10 || substring(col("target_mobile_no"),1,1) !== "7")
>
> This throws the following error:
>
> res12: org.apache.spark.sql.DataFrame = []
> <console>:49: error: value || is not a member of Int
>                           filter(length(col("target_mobile_no")) !== 10 || substring(col("target_mobile_no"),1,1) !== "7")
>
> Trying another way:
>
>   val rejectedDF = newDF.withColumn("target_mobile_no", col("target_mobile_no").cast(StringType)).
>                    filter(length(col("target_mobile_no")) !=== 10 || substring(col("target_mobile_no"),1,1) !=== "7")
>   rejectedDF.createOrReplaceTempView("tmp")
>
> I tried a few options but I am still getting this error:
>
> <console>:49: error: value !=== is not a member of org.apache.spark.sql.Column
>                           filter(length(col("target_mobile_no")) !=== 10 || substring(col("target_mobile_no"),1,1) !=== "7")
>                                                                  ^
> <console>:49: error: value || is not a member of Int
>                           filter(length(col("target_mobile_no")) !=== 10 || substring(col("target_mobile_no"),1,1) !=== "7")
>
> I can create a dataframe for each filter, but that does not look efficient to me?
>
> Thanks
>
> Dr Mich Talebzadeh
