spark-issues mailing list archives

From "Steffen Herbold (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-18301) VectorAssembler does not support StructTypes
Date Mon, 02 Jan 2017 10:13:58 GMT

    [ https://issues.apache.org/jira/browse/SPARK-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15792615#comment-15792615 ]

Steffen Herbold commented on SPARK-18301:
-----------------------------------------

I think whether this is a bug or a feature request depends on the point of view.

Since structured types are natively supported by Spark, the natural assumption is that they
are supported by all features of Spark. If specific features (e.g., transformers) do not
support them, then there should either be a good reason for this, or it is a bug.

If there is a reason, it should be part of the documentation and this issue should be changed
to a feature request. If not, then this constitutes a bug, and if other transformers are also
unable to work with structured types, it might actually be major instead of minor.

> VectorAssembler does not support StructTypes
> --------------------------------------------
>
>                 Key: SPARK-18301
>                 URL: https://issues.apache.org/jira/browse/SPARK-18301
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.0.1
>         Environment: Windows Standalone Mode, Java
>            Reporter: Steffen Herbold
>            Priority: Minor
>
> I tried to transform a structured type using the VectorAssembler as follows:
> {code:java}
> VectorAssembler va = new VectorAssembler().setInputCols(new String[]
>             { "metrics.Line", "metrics.McCC" }).setOutputCol("features");
>         dataframe= va.transform(dataframe);
> {code}
> This yields the following exception:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: Field "metrics.McCC" does not exist.
> 	at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
> 	at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
> 	at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
> 	at scala.collection.AbstractMap.getOrElse(Map.scala:59)
> 	at org.apache.spark.sql.types.StructType.apply(StructType.scala:227)
> 	at org.apache.spark.ml.feature.VectorAssembler$$anonfun$5.apply(VectorAssembler.scala:116)
> 	at org.apache.spark.ml.feature.VectorAssembler$$anonfun$5.apply(VectorAssembler.scala:116)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> 	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> 	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> 	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> 	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
> 	at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:116)
> 	at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
> 	at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54)
> 	at de.ugoe.cs.smartshark.jobs.DefectPredictionExample.main(DefectPredictionExample.java:53)
> {code}
> The schema of the dataframe is:
> {noformat}
>  |-- metrics: struct (nullable = true)
>  |    |-- Line: double (nullable = true)
>  |    |-- McCC: double (nullable = true)
> ...
> {noformat}
> The transformation works if I first use withColumn to make "metrics.Line" and "metrics.McCC" into columns of the dataframe:
> {code:java}
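> // withColumn returns a new dataframe, so the result must be reassigned
> dataframe = dataframe.withColumn("Line", dataframe.col("metrics.Line"));
> dataframe = dataframe.withColumn("McCC", dataframe.col("metrics.McCC"));
> // reference the new top-level columns instead of the nested fields
> VectorAssembler va = new VectorAssembler().setInputCols(new String[]
>             { "Line", "McCC" }).setOutputCol("features");
>         dataframe = va.transform(dataframe);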
> {code}
> However, this workaround is quite costly, and direct support for accessing the nested values would be very helpful.
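
Judging from the stack trace in the quoted report (StructType.apply calling getOrElse on a map), the failing lookup appears to treat "metrics.McCC" as a single top-level field name rather than as a path into the struct. The following is a plain-Java sketch of that difference, not Spark code: the class NestedFieldLookup, its methods, and the toy map-based schema are all hypothetical and only mirror the schema shown in the report.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not how Spark is implemented.
public class NestedFieldLookup {

    // Flat lookup: the whole string is taken as one top-level field name,
    // which is the behaviour the stack trace suggests.
    public static Object flatLookup(Map<String, Object> schema, String name) {
        if (!schema.containsKey(name)) {
            throw new IllegalArgumentException("Field \"" + name + "\" does not exist.");
        }
        return schema.get(name);
    }

    // Path lookup: split on '.' and descend one struct level per segment.
    @SuppressWarnings("unchecked")
    public static Object pathLookup(Map<String, Object> schema, String path) {
        Object current = schema;
        for (String part : path.split("\\.")) {
            if (!(current instanceof Map)
                    || !((Map<String, Object>) current).containsKey(part)) {
                throw new IllegalArgumentException("Field \"" + path + "\" does not exist.");
            }
            current = ((Map<String, Object>) current).get(part);
        }
        return current;
    }

    // Toy schema mirroring the report: metrics is a struct with two doubles.
    public static Map<String, Object> exampleSchema() {
        Map<String, Object> metrics = new LinkedHashMap<>();
        metrics.put("Line", "double");
        metrics.put("McCC", "double");
        Map<String, Object> schema = new LinkedHashMap<>();
        schema.put("metrics", metrics);
        return schema;
    }

    public static void main(String[] args) {
        Map<String, Object> schema = exampleSchema();
        // Resolving the dotted name as a path finds the nested field.
        System.out.println(pathLookup(schema, "metrics.McCC"));
        try {
            // Looking it up as a flat name fails, as in the reported exception.
            flatLookup(schema, "metrics.McCC");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch only "metrics" exists at the top level, so the flat lookup fails while the path lookup succeeds. That matches why copying the nested fields into top-level columns makes the assembler work in the report.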



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org