spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
Date Mon, 07 May 2018 02:02:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465356#comment-16465356 ]

Hyukjin Kwon commented on SPARK-23291:
--------------------------------------

[~felixcheung], sure, I agree with that in general. However, we could probably also look at
this case specifically in the following way:

It has been wrong for 3 years, and it requires odd code specific to R compared with the
other language APIs. IMHO, it's a bit subtle, and users might have adapted to this bug
rather than sorting it out (with some nuisance, I guess). Consider that
expr("substr(...)") and substr work differently. I have also seen [expr("substr(...)")
suggested as an alternative to substr|https://stackoverflow.com/questions/37413122/use-of-substr-on-dataframe-column-in-sparkr?rq=1].
If it's clearly documented in the migration guide, I think it can be fine.

Also, this substr case is pretty well understood and isolated.

As a reference, I recall a case: https://github.com/apache/spark/pull/20499#issuecomment-363863660.
It sounds like a pretty similar case. I was hesitant at that time too, but after thinking
for a while, I ended up more or less agreeing that the backport is okay. It wasn't a regression
at that time either.


> SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23291
>                 URL: https://issues.apache.org/jira/browse/SPARK-23291
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0
>            Reporter: Narendra
>            Assignee: Liang-Chi Hsieh
>            Priority: Major
>             Fix For: 2.4.0
>
>
> Defect Description :
> -----------------------------
> For example, an input string "2017-12-01" is read into a SparkR dataframe "df" with column
> name "col1".
> The target is to create a new column named "col2" with the value "12", which is inside
> the string. "12" can be extracted with "starting position" as "6" and "ending position" as
> "7"
> (the starting position of the first character is considered as "1").
> But the current code that needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 7, 8))
> Observe that the first argument in the "substr" API, which indicates the 'starting position',
> is given as "7".
> Also, observe that the second argument in the "substr" API, which indicates the 'ending
> position', is given as "8",
> i.e. the number that has to be given to indicate a position is the "actual
> position + 1".
> Expected behavior :
> ----------------------------
> The code that needs to be written is :
>  
>  df <- withColumn(df, "col2", substr(df$col1, 6, 7))
> Note :
> -----------
>  This defect is observed only when the starting position is greater than 1.
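
The positions described in the report above can be checked against base R, whose substr(x, start, stop) is 1-based and inclusive at both ends. The sketch below uses base R only and does not start a Spark session. The helper reported_substr is a hypothetical model of the behaviour described in this issue (positions acting as if shifted by one when the start is greater than 1); it is an assumption made for illustration and is not SparkR's actual implementation.

```r
# Base R reference semantics: 1-based, inclusive start and stop.
x <- "2017-12-01"
expected <- substr(x, 6, 7)
print(expected)  # [1] "12"

# Hypothetical model of the behaviour reported in this issue (NOT SparkR code):
# when start is greater than 1, both positions act as if shifted down by one.
reported_substr <- function(x, start, stop) {
  shift <- if (start > 1) 1 else 0
  substr(x, start - shift, stop - shift)
}

print(reported_substr(x, 7, 8))  # [1] "12"   (the workaround from the report)
print(reported_substr(x, 6, 7))  # [1] "-1"   (what the intuitive call would give)
print(reported_substr(x, 1, 4))  # [1] "2017" (unaffected when start is 1)
```

Under this model, the call a user would naturally write, with positions 6 and 7, returns "-1" rather than "12", which matches the report that the defect only shows up when the starting position is greater than 1.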



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

