spark-user mailing list archives

From Jean Georges Perrin <...@jgp.net>
Subject java.lang.RuntimeException: Unsupported type: vector
Date Sun, 24 Jul 2016 16:50:32 GMT
I'm trying to build a simple DataFrame that can be used for ML:


		SparkConf conf = new SparkConf().setAppName("Simple prediction from Text File").setMaster("local");
		SparkContext sc = new SparkContext(conf);
		SQLContext sqlContext = new SQLContext(sc);

		sqlContext.udf().register("vectorBuilder", new VectorBuilder(), new VectorUDT());

		String filename = "data/tuple-data-file.csv";
		StructType schema = new StructType(new StructField[] {
				new StructField("C0", DataTypes.StringType, false, Metadata.empty()),
				new StructField("C1", DataTypes.IntegerType, false, Metadata.empty()),
				new StructField("features", new VectorUDT(), false, Metadata.empty()) });

		DataFrame df = sqlContext.read().format("com.databricks.spark.csv")
				.schema(schema).option("header", "false").load(filename);
		df = df.withColumn("label", df.col("C0")).drop("C0");
		df = df.withColumn("value", df.col("C1")).drop("C1");
		df.printSchema();
Returns:
root
 |-- features: vector (nullable = false)
 |-- label: string (nullable = false)
 |-- value: integer (nullable = false)
		df.show();
Throws:

java.lang.RuntimeException: Unsupported type: vector
	at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:76)
	at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:194)
	at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:173)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
	at scala.collection.Iterator$class.foreach(Iterator.scala:742)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
	at scala.collection.AbstractIterator.to(Iterator.scala:1194)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
16/07/24 12:46:01 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.RuntimeException: Unsupported type: vector
	[identical stack trace snipped]

16/07/24 12:46:01 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
16/07/24 12:46:01 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
16/07/24 12:46:01 INFO TaskSchedulerImpl: Cancelling stage 0
16/07/24 12:46:01 INFO DAGScheduler: ResultStage 0 (show at SimplePredictionFromTextFile.java:50) failed in 0.979 s
16/07/24 12:46:01 INFO DAGScheduler: Job 0 failed: show at SimplePredictionFromTextFile.java:50, took 1.204903 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.RuntimeException: Unsupported type: vector
	[identical stack trace snipped]

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
	at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
	at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
	at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
	at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
	at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
	at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
	at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
	at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
	at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
	at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
	at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
	at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
	at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
	at org.apache.spark.sql.DataFrame.show(DataFrame.scala:350)
	at org.apache.spark.sql.DataFrame.show(DataFrame.scala:311)
	at org.apache.spark.sql.DataFrame.show(DataFrame.scala:319)
	at net.jgp.labs.spark.SimplePredictionFromTextFile.start(SimplePredictionFromTextFile.java:50)
	at net.jgp.labs.spark.SimplePredictionFromTextFile.main(SimplePredictionFromTextFile.java:29)
Caused by: java.lang.RuntimeException: Unsupported type: vector
	[identical stack trace snipped]
16/07/24 12:46:01 INFO SparkContext: Invoking stop() from shutdown hook
16/07/24 12:46:01 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:52970 in memory (size: 9.8 KB, free: 1140.4 MB)
16/07/24 12:46:01 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:52970 in memory (size: 5.4 KB, free: 1140.4 MB)
16/07/24 12:46:01 INFO ContextCleaner: Cleaned accumulator 3
16/07/24 12:46:01 INFO ContextCleaner: Cleaned accumulator 2
16/07/24 12:46:01 INFO SparkUI: Stopped Spark web UI at http://10.0.100.24:4040
16/07/24 12:46:01 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/07/24 12:46:01 INFO MemoryStore: MemoryStore cleared
16/07/24 12:46:01 INFO BlockManager: BlockManager stopped
16/07/24 12:46:01 INFO BlockManagerMaster: BlockManagerMaster stopped
16/07/24 12:46:01 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/07/24 12:46:01 INFO SparkContext: Successfully stopped SparkContext
16/07/24 12:46:01 INFO ShutdownHookManager: Shutdown hook called
16/07/24 12:46:01 INFO ShutdownHookManager: Deleting directory /private/var/folders/vs/kl6qlcvx30707d07txrm_xnw0000gn/T/spark-bc6bb3dc-abc5-4db0-b00c-a17f5b8f9637


Any clue why it says it can't display the vector? I've seen other examples where this did not seem to be a problem...
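
In case it helps to see what I am after, here is an untested sketch of the workaround I am considering: have spark-csv parse only plain types, and build the vector column afterwards with org.apache.spark.ml.feature.VectorAssembler. I am assuming here that the third CSV field holds a single numeric value; the column name C2 is a placeholder:

		// Untested sketch: spark-csv only ever sees plain types here, and the
		// vector is assembled after the load. Assumes the third CSV field is a
		// single numeric value (the column name C2 is a placeholder).
		StructType plainSchema = new StructType(new StructField[] {
				new StructField("C0", DataTypes.StringType, false, Metadata.empty()),
				new StructField("C1", DataTypes.IntegerType, false, Metadata.empty()),
				new StructField("C2", DataTypes.DoubleType, false, Metadata.empty()) });

		DataFrame raw = sqlContext.read().format("com.databricks.spark.csv")
				.schema(plainSchema).option("header", "false").load(filename);

		// VectorAssembler packs numeric columns into a single vector column,
		// so nothing has to be cast to VectorUDT during the CSV scan itself.
		VectorAssembler assembler = new VectorAssembler()
				.setInputCols(new String[] { "C2" })
				.setOutputCol("features");
		DataFrame withFeatures = assembler.transform(raw)
				.withColumnRenamed("C0", "label")
				.withColumnRenamed("C1", "value");
		withFeatures.printSchema();
		withFeatures.show();

Would that be the recommended approach, or is there a way to have spark-csv populate a VectorUDT column directly?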


