spark-issues mailing list archives

From "Narine Kokhlikyan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions
Date Mon, 14 Dec 2015 23:01:46 GMT

     [ https://issues.apache.org/jira/browse/SPARK-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Narine Kokhlikyan updated SPARK-12325:
--------------------------------------
    Affects Version/s: 1.5.2

> Inappropriate error messages in DataFrame StatFunctions 
> --------------------------------------------------------
>
>                 Key: SPARK-12325
>                 URL: https://issues.apache.org/jira/browse/SPARK-12325
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: Narine Kokhlikyan
>            Priority: Critical
>
> Hi there,
> I mentioned this issue earlier in one of my pull requests for the SQL component, but I have never received feedback on any of them.
> https://github.com/apache/spark/pull/9366#issuecomment-155171975
> Although this has been very frustrating, I'll try to list certain facts again:
> 1. I call the DataFrame correlation method and the error says that the covariance calculation is not supported.
> I do not think that this is an appropriate message to show here.
> scala> df.stat.corr("rating", "income")
> java.lang.IllegalArgumentException: requirement failed: Covariance calculation for columns with dataType StringType not supported.
>     at scala.Predef$.require(Predef.scala:233)
>     at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)
> 2. The biggest issue here is not the message shown, but the design.
> A class called CovarianceCounter does the computations for both correlation and covariance.
> This might be convenient from a certain perspective; however, something like this is harder to understand and extend, especially if you want to add another algorithm, e.g. Spearman correlation.
> There are many possible solutions here, starting from:
> 1. just fixing the message
> 2. fixing the message and renaming CovarianceCounter and the corresponding methods
> 3. creating a CorrelationCounter and splitting the computations for correlation and covariance
> and many more ...
> Since I'm not getting any response and, according to GitHub, all five of you have been working on this, I'll try again:
> [~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]
> Can any of you please explain this behavior of the stat functions, or communicate more about it?
> If you are planning to remove it or do something else, we'd truly appreciate it if you let us know.
> In fact, I would like to open a pull request for this, but since my pull requests in the SQL/ML components are just sitting there without any response, I'll wait for your response first.
> cc: [~shivaram], [~mengxr]
> Thank you,
> Narine
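
Options 1 and 3 from the list of possible solutions in the report could be combined roughly as follows. This is only a minimal sketch in plain Scala with no Spark dependency: every name in it (StatSketch, Column, PairStatistic, compute) is hypothetical and none of it is Spark's actual StatFunctions code. It assumes each statistic carries its own name, so that a shared type check can report "Correlation" or "Covariance" depending on which method the user called, and so that another algorithm such as Spearman correlation could be added as one more PairStatistic.

```scala
// Hypothetical sketch; these types do not exist in Spark.
object StatSketch {
  // Minimal stand-in for a typed DataFrame column.
  sealed trait Column { def name: String }
  final case class NumericColumn(name: String, values: Seq[Double]) extends Column
  final case class StringColumn(name: String, values: Seq[String]) extends Column

  // One implementation per statistic, each with its own user-facing name.
  trait PairStatistic {
    def name: String
    def apply(xs: Seq[Double], ys: Seq[Double]): Double
  }

  // Sample covariance (two-pass, divides by n - 1).
  object Covariance extends PairStatistic {
    val name = "Covariance"
    def apply(xs: Seq[Double], ys: Seq[Double]): Double = {
      require(xs.length == ys.length && xs.length > 1,
        "need two equally sized samples with at least two points")
      val meanX = xs.sum / xs.length
      val meanY = ys.sum / ys.length
      xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum / (xs.length - 1)
    }
  }

  // Pearson correlation, kept separate from covariance but reusing it.
  object PearsonCorrelation extends PairStatistic {
    val name = "Correlation"
    def apply(xs: Seq[Double], ys: Seq[Double]): Double =
      Covariance(xs, ys) / math.sqrt(Covariance(xs, xs) * Covariance(ys, ys))
  }

  // Shared validation: the message names the statistic that was requested.
  def compute(stat: PairStatistic, a: Column, b: Column): Double = (a, b) match {
    case (NumericColumn(_, xs), NumericColumn(_, ys)) => stat(xs, ys)
    case _ =>
      val bad = Seq(a, b).collect { case StringColumn(n, _) => n }
      throw new IllegalArgumentException(
        s"${stat.name} calculation for columns ${bad.mkString(", ")} " +
          "with dataType StringType not supported.")
  }
}
```

With this shape, calling compute(PearsonCorrelation, rating, income) on a string-typed rating column would fail with a message that starts with "Correlation" rather than "Covariance".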




