spark-user mailing list archives

From Hyukjin Kwon <gurwls...@gmail.com>
Subject Re: java.lang.RuntimeException: Unsupported type: vector
Date Mon, 25 Jul 2016 06:52:12 GMT
I just wonder what your CSV data structure looks like.

If my understanding is correct, the SQL type of VectorUDT is StructType,
and the CSV data source does not support ArrayType or StructType.

Anyhow, it seems the CSV data source does not support UDTs for now anyway.

https://github.com/apache/spark/blob/e1dc853737fc1739fbb5377ffe31fb2d89935b1f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L241-L293
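
One possible workaround, assuming the feature values in your CSV are plain numeric columns, is to read them with types the CSV data source does support and assemble the vector afterwards with VectorAssembler. A rough, untested sketch (the column names C2 and C3 are just placeholders for however your file is actually laid out):

import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.types.*;

// sqlContext and filename as in your code below.
// Read only types the CSV data source supports ...
StructType schema = new StructType(new StructField[] {
    new StructField("C0", DataTypes.StringType, false, Metadata.empty()),
    new StructField("C1", DataTypes.IntegerType, false, Metadata.empty()),
    new StructField("C2", DataTypes.DoubleType, false, Metadata.empty()),
    new StructField("C3", DataTypes.DoubleType, false, Metadata.empty()) });

DataFrame raw = sqlContext.read().format("com.databricks.spark.csv")
    .schema(schema).option("header", "false").load(filename);

// ... then build the "features" vector column on the DataFrame side.
DataFrame withFeatures = new VectorAssembler()
    .setInputCols(new String[] { "C2", "C3" })
    .setOutputCol("features")
    .transform(raw);

VectorAssembler just concatenates the given numeric columns into a single vector column, so the CSV reader never has to handle the UDT itself.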



2016-07-25 1:50 GMT+09:00 Jean Georges Perrin <jgp@jgp.net>:

> I'm trying to build a simple DataFrame that can be used for ML:
>
>
> SparkConf conf = new SparkConf().setAppName("Simple prediction from Text File").setMaster("local");
> SparkContext sc = new SparkContext(conf);
> SQLContext sqlContext = new SQLContext(sc);
>
> sqlContext.udf().register("vectorBuilder", new VectorBuilder(), new VectorUDT());
>
> String filename = "data/tuple-data-file.csv";
> StructType schema = new StructType(new StructField[] {
>     new StructField("C0", DataTypes.StringType, false, Metadata.empty()),
>     new StructField("C1", DataTypes.IntegerType, false, Metadata.empty()),
>     new StructField("features", new VectorUDT(), false, Metadata.empty()) });
>
> DataFrame df = sqlContext.read().format("com.databricks.spark.csv")
>     .schema(schema).option("header", "false").load(filename);
> df = df.withColumn("label", df.col("C0")).drop("C0");
> df = df.withColumn("value", df.col("C1")).drop("C1");
> df.printSchema();
> Returns:
> root
>  |-- features: vector (nullable = false)
>  |-- label: string (nullable = false)
>  |-- value: integer (nullable = false)
> df.show();
> Returns:
>
> java.lang.RuntimeException: Unsupported type: vector
> at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:76)
> at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:194)
> at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:173)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
> at scala.collection.AbstractIterator.to(Iterator.scala:1194)
> at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
> at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 16/07/24 12:46:01 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.RuntimeException: Unsupported type: vector
> at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:76)
> at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:194)
> at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:173)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
> at scala.collection.AbstractIterator.to(Iterator.scala:1194)
> at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
> at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> 16/07/24 12:46:01 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
> 16/07/24 12:46:01 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
> 16/07/24 12:46:01 INFO TaskSchedulerImpl: Cancelling stage 0
> 16/07/24 12:46:01 INFO DAGScheduler: ResultStage 0 (show at SimplePredictionFromTextFile.java:50) failed in 0.979 s
> 16/07/24 12:46:01 INFO DAGScheduler: Job 0 failed: show at SimplePredictionFromTextFile.java:50, took 1.204903 s
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.RuntimeException: Unsupported type: vector
> at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:76)
> at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:194)
> at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:173)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
> at scala.collection.AbstractIterator.to(Iterator.scala:1194)
> at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
> at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
> at scala.Option.foreach(Option.scala:257)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
> at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
> at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
> at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
> at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
> at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
> at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
> at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
> at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
> at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
> at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
> at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
> at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
> at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
> at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
> at org.apache.spark.sql.DataFrame.show(DataFrame.scala:350)
> at org.apache.spark.sql.DataFrame.show(DataFrame.scala:311)
> at org.apache.spark.sql.DataFrame.show(DataFrame.scala:319)
> at net.jgp.labs.spark.SimplePredictionFromTextFile.start(SimplePredictionFromTextFile.java:50)
> at net.jgp.labs.spark.SimplePredictionFromTextFile.main(SimplePredictionFromTextFile.java:29)
> Caused by: java.lang.RuntimeException: Unsupported type: vector
> at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:76)
> at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:194)
> at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:173)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
> at scala.collection.AbstractIterator.to(Iterator.scala:1194)
> at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
> at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 16/07/24 12:46:01 INFO SparkContext: Invoking stop() from shutdown hook
> 16/07/24 12:46:01 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:52970 in memory (size: 9.8 KB, free: 1140.4 MB)
> 16/07/24 12:46:01 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:52970 in memory (size: 5.4 KB, free: 1140.4 MB)
> 16/07/24 12:46:01 INFO ContextCleaner: Cleaned accumulator 3
> 16/07/24 12:46:01 INFO ContextCleaner: Cleaned accumulator 2
> 16/07/24 12:46:01 INFO SparkUI: Stopped Spark web UI at http://10.0.100.24:4040
> 16/07/24 12:46:01 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
> 16/07/24 12:46:01 INFO MemoryStore: MemoryStore cleared
> 16/07/24 12:46:01 INFO BlockManager: BlockManager stopped
> 16/07/24 12:46:01 INFO BlockManagerMaster: BlockManagerMaster stopped
> 16/07/24 12:46:01 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
> 16/07/24 12:46:01 INFO SparkContext: Successfully stopped SparkContext
> 16/07/24 12:46:01 INFO ShutdownHookManager: Shutdown hook called
> 16/07/24 12:46:01 INFO ShutdownHookManager: Deleting directory /private/var/folders/vs/kl6qlcvx30707d07txrm_xnw0000gn/T/spark-bc6bb3dc-abc5-4db0-b00c-a17f5b8f9637
>
>
> Any clue on why it says it can't dump the vector? I saw other examples
> where it did not seem to have a problem...
>
>
>
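
By the way, since you already register a "vectorBuilder" UDF, another untested option might be to declare the features column as a plain StringType in the schema and apply that UDF after the load, so the CSV reader only ever deals with supported primitive types. A sketch, assuming the third column of the file is a string your VectorBuilder can parse:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.*;

// Assumes sqlContext, filename, and the "vectorBuilder" UDF registration
// from your code above.
StructType schema = new StructType(new StructField[] {
    new StructField("C0", DataTypes.StringType, false, Metadata.empty()),
    new StructField("C1", DataTypes.IntegerType, false, Metadata.empty()),
    new StructField("C2", DataTypes.StringType, false, Metadata.empty()) });

DataFrame df = sqlContext.read().format("com.databricks.spark.csv")
    .schema(schema).option("header", "false").load(filename);

// Convert the raw string into a vector with the registered UDF,
// then drop the intermediate string column.
df = df.withColumn("features", functions.callUDF("vectorBuilder", df.col("C2"))).drop("C2");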
