spark-issues mailing list archives

From "Nick Pentreath (JIRA)" <>
Subject [jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs
Date Fri, 24 Feb 2017 08:34:44 GMT


Nick Pentreath commented on SPARK-19714:

I agree that the parameter naming is perhaps misleading. At a minimum the doc should be updated,
because "invalid" here actually means {{NaN}} or {{null}} values, not out-of-range values.

However, {{Bucketizer}} is doing what you tell it to, since the splits are specified by you. Note
that if you use {{QuantileDiscretizer}} to construct the {{Bucketizer}}, it adds {{-Infinity}} and
{{+Infinity}} as the lower and upper bounds of the splits. You can do the same if you want anything
below the lower bound to fall into the first bucket, and anything above the upper bound to fall
into the last bucket.
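For illustration, here is a minimal pure-Scala sketch (no Spark session needed) of how splits that include infinite bounds give every value a bucket. {{bucketIndex}} is a hypothetical helper that mirrors {{Bucketizer}}'s split semantics (bucket {{i}} covers {{splits(i) <= v < splits(i+1)}}, with the last bucket closed on the right); it is not Spark's actual implementation:

```scala
// Hypothetical helper mirroring Bucketizer's split semantics.
// Returns the index of the bucket containing v, assuming splits is
// sorted and v lies within [splits.head, splits.last].
def bucketIndex(splits: Array[Double], v: Double): Int = {
  if (v == splits.last) splits.length - 2 // last bucket includes its upper bound
  else splits.indexWhere(s => v < s) - 1  // bucket i: splits(i) <= v < splits(i + 1)
}

// With -Infinity and +Infinity added (as QuantileDiscretizer does),
// out-of-range values no longer fail: they land in the outer buckets.
val splits = Array(Double.NegativeInfinity, 5.0, 10.0, 250.0, 500.0, Double.PositiveInfinity)
println(bucketIndex(splits, 0.0))   // below 5.0 -> first bucket
println(bucketIndex(splits, 750.0)) // above 500.0 -> last bucket
```

With the original finite splits {{Array(5.0, 10.0, 250.0, 500.0)}}, a value like {{0.0}} has no bucket at all, which is exactly why the transform in the report below throws.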

> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---------------------------------------------------
>                 Key: SPARK-19714
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.1.0
>            Reporter: Bill Chambers
> {code}
> import org.apache.spark.ml.feature.Bucketizer
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setOutputCol("result")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect this to handle the out-of-range (invalid) inputs. However, it fails:
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of Bucketizer bounds
> [5.0, 500.0].  Check your features, or loosen the lower/upper bound constraints.
> {code} 
> It seems strange that {{handleInvalid}} doesn't actually handle invalid inputs.
> Thoughts anyone?

This message was sent by Atlassian JIRA
