spark-issues mailing list archives

From "Nick Pentreath (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs
Date Mon, 27 Feb 2017 08:00:53 GMT

    [ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885315#comment-15885315 ]

Nick Pentreath commented on SPARK-19714:
----------------------------------------

I also agree that the naming of {{splits}} could be better, but for now we're stuck with it.
We could deprecate it and introduce a new param, but to me the param doc is clear and unambiguous
about what it actually does, so that option seems more confusing to users than it's worth.

Of course {{QuantileDiscretizer}} is different, but its result is exactly a {{Bucketizer}}:
the discretizer computes what the actual values of the splits should be. My point is that
if you want to include values outside of the splits (the bucket boundaries), you need to be
explicit and put -Inf/Inf in {{splits}}, as in the sketch below.
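
For concreteness, a minimal sketch (the {{id}} column and the split values come from the
example below; the output column name is my own):

{code}
import org.apache.spark.ml.feature.Bucketizer

// Bracketing the splits with -Infinity/Infinity means every possible value
// falls into some bucket, so no input is out of range by construction.
val coveringSplits =
  Array(Double.NegativeInfinity, 5.0, 10.0, 250.0, 500.0, Double.PositiveInfinity)

val bucketizer = new Bucketizer()
  .setSplits(coveringSplits)
  .setInputCol("id")
  .setOutputCol("bucket")
// Values below 5.0 land in bucket 0 and values above 500.0 in the last bucket.
{code}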

If you believe that the "invalid" handling should also cover values outside of the split
range, that can be discussed. Do you propose to include all values outside the range in the
special bucket (as is done for {{NaN}} now)?
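
For reference, here is a rough sketch of the current {{NaN}} behaviour (this assumes a
{{SparkSession}} named {{spark}} with {{spark.implicits._}} imported; the column names are
illustrative):

{code}
import org.apache.spark.ml.feature.Bucketizer
import spark.implicits._

// With handleInvalid = "keep", NaN values go into an extra bucket whose
// index equals the number of regular buckets (here: 3).
val df = Seq(5.0, 12.0, Double.NaN).toDF("id")

new Bucketizer()
  .setSplits(Array(5.0, 10.0, 250.0, 500.0))
  .setInputCol("id")
  .setOutputCol("bucket")
  .setHandleInvalid("keep")
  .transform(df)
  .show()
// 5.0 -> bucket 0.0, 12.0 -> bucket 1.0, NaN -> bucket 3.0 (the special bucket)
{code}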

> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---------------------------------------------------
>
>                 Key: SPARK-19714
>                 URL: https://issues.apache.org/jira/browse/SPARK-19714
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.1.0
>            Reporter: Bill Chambers
>
> {code}
> import org.apache.spark.ml.feature.Bucketizer
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect that this would skip the invalid (out-of-range) inputs. However, it fails:
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of Bucketizer bounds [5.0, 500.0]. Check your features, or loosen the lower/upper bound constraints.
> {code}
> It seems strange that {{handleInvalid}} doesn't actually handle invalid inputs.
> Thoughts, anyone?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
