spark-issues mailing list archives

From "Sean R. Owen (Jira)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-31671) Wrong error message in VectorAssembler when column lengths can not be inferred
Date Mon, 11 May 2020 23:25:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-31671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen resolved SPARK-31671.
----------------------------------
    Fix Version/s: 2.4.7
                   3.0.0
         Assignee: YijieFan
       Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/28487

> Wrong error message in VectorAssembler when column lengths can not be inferred
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-31671
>                 URL: https://issues.apache.org/jira/browse/SPARK-31671
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.4
>         Environment: macOS Catalina
>            Reporter: YijieFan
>            Assignee: YijieFan
>            Priority: Minor
>             Fix For: 3.0.0, 2.4.7
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In VectorAssembler, when input column lengths cannot be inferred and handleInvalid = "keep",
> a runtime exception is thrown with a message like the following:
> _Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint_
> _to add metadata for columns: [column1, column2]._
> However, even after you set a vector size hint for *column1*, the message stays the same
> instead of changing to list only *[column2]*. This is inconsistent with what the error
> message describes.
> This makes the exception hard to resolve, because the message does not tell me which
> columns still require a VectorSizeHint. It is especially troublesome when there are many
> columns to deal with.
> Here is a simple example:
>  
> {code:java}
> // assumes a spark-shell session where `spark` is the active SparkSession
> import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}
> import org.apache.spark.ml.linalg.Vectors
> import spark.implicits._
> // create a df without vector size metadata
> val df = Seq(
>   (Vectors.dense(1.0), Vectors.dense(2.0))
> ).toDF("n1", "n2")
> // only set vector size hint for the n1 column
> val hintedDf = new VectorSizeHint()
>   .setInputCol("n1")
>   .setSize(1)
>   .transform(df)
> // assemble n1, n2
> val output = new VectorAssembler()
>   .setInputCols(Array("n1", "n2"))
>   .setOutputCol("features")
>   .setHandleInvalid("keep")
>   .transform(hintedDf)
> // because only n1 has a vector size, the error message should tell us to set
> // the vector size for n2 too
> output.show()
> {code}
> Expected error message:
>  
> {code:java}
> Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2].
> {code}
> Actual error message:
> {code:java}
> Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2].
> {code}
> I changed one line in VectorAssembler.scala so that it works as expected.
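
The one-line change itself is not shown in this message. As a rough sketch of the intended behavior (not the actual patch), the snippet below models column sizes as a plain map where -1 means "size unknown", and builds the message from only the columns that still lack a size. The object and method names (`MissingSizeMessage`, `missingColumns`, `message`) are hypothetical and do not come from VectorAssembler.scala.

```scala
// Hypothetical, Spark-free model of the expected behavior.
// Assumption: a size of -1 (or no entry at all) means the vector size is unknown.
object MissingSizeMessage {

  /** Columns whose vector size is unknown, in their original order. */
  def missingColumns(columns: Seq[String], sizes: Map[String, Int]): Seq[String] =
    columns.filter(c => sizes.getOrElse(c, -1) == -1)

  /** Error text that names only the columns still lacking size metadata. */
  def message(columns: Seq[String], sizes: Map[String, Int]): String = {
    val missing = missingColumns(columns, sizes)
    "Can not infer column lengths with handleInvalid = \"keep\". " +
      "Consider using VectorSizeHint to add metadata for columns: " +
      missing.mkString("[", ", ", "]") + "."
  }

  def main(args: Array[String]): Unit = {
    // n1 has a size hint of 1; n2 has none, so only n2 should be reported.
    println(message(Seq("n1", "n2"), Map("n1" -> 1, "n2" -> -1)))
  }
}
```

With the inputs from the example above, only `n2` is reported, which matches the expected error message in the report.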


