spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From: Cheng Lian <lian.cs....@gmail.com>
Subject: Re: Spark Sql with python udf fail
Date: Mon, 23 Mar 2015 09:00:49 GMT
I suspect there is a malformed row in your input dataset. Could you try 
something like this to confirm:

    sql("SELECT * FROM <your-table>").foreach(println)

If a malformed line does exist, you should see a similar exception, 
and the printed output should help you locate it. Note that the 
messages are printed to stdout on the executor side.
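
If eyeballing the full output is impractical, a variation that prints 
only the suspicious rows may help. This is just a sketch, written in 
PySpark since your UDF is in Python; it assumes a 10-column schema (as 
in your case) and that a malformed row comes back narrower than the 
schema rather than failing outright:

    EXPECTED_COLS = 10  # assumed column count, per your description

    def report_short_rows(row):
        # A well-formed Row supports len(); anything narrower than the
        # schema is a candidate for ArrayIndexOutOfBoundsException: 9.
        if len(row) != EXPECTED_COLS:
            print("malformed row (%d cols): %r" % (len(row), row))

    sqlContext.sql("SELECT * FROM <your-table>").foreach(report_short_rows)

As before, the output goes to stdout on the executors, not the driver.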

On 3/23/15 4:36 PM, lonely Feb wrote:

> I caught exceptions in the Python UDF code, flushed them into a 
> single file, and made sure the column count of the output lines 
> matches the SQL schema.
>
> Something interesting: each output line of my UDF has exactly 10 
> columns, yet the exception above is 
> java.lang.ArrayIndexOutOfBoundsException: 9. Any ideas?
>
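> For what it's worth, index 9 would be the tenth (zero-based) column, 
> so the failing row presumably has fewer than 10 fields. One made-up 
> example of how a 10-column line can lose fields (tab-separated data 
> assumed; my real delimiter may differ): trailing empty fields survive 
> a plain split but not a strip.
>
>     line = "a\tb\tc" + "\t" * 7           # 10 fields, the last 7 empty
>     print(len(line.split("\t")))          # 10
>     print(len(line.strip().split("\t")))  # 3: strip() removed the tabs
>
> A row like that has no index 9, which would match the exception.
>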
> 2015-03-23 16:24 GMT+08:00 Cheng Lian <lian.cs.zju@gmail.com>:
>
>     Could you elaborate on the UDF code?
>
>
>     On 3/23/15 3:43 PM, lonely Feb wrote:
>
>         Hi all, I tried to migrate some Hive jobs to Spark SQL.
>         When I ran a SQL job with a Python UDF I got an exception:
>
>         java.lang.ArrayIndexOutOfBoundsException: 9
>                 at
>         org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142)
>                 at
>         org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37)
>                 at
>         org.apache.spark.sql.catalyst.expressions.EqualTo.eval(predicates.scala:166)
>                 at
>         org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$anonfun$apply$1.apply(predicates.scala:30)
>                 at
>         org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$anonfun$apply$1.apply(predicates.scala:30)
>                 at
>         scala.collection.Iterator$anon$14.hasNext(Iterator.scala:390)
>                 at
>         scala.collection.Iterator$anon$11.hasNext(Iterator.scala:327)
>                 at
>         org.apache.spark.sql.execution.Aggregate$anonfun$execute$1$anonfun$7.apply(Aggregate.scala:156)
>                 at
>         org.apache.spark.sql.execution.Aggregate$anonfun$execute$1$anonfun$7.apply(Aggregate.scala:151)
>                 at
>         org.apache.spark.rdd.RDD$anonfun$13.apply(RDD.scala:601)
>                 at
>         org.apache.spark.rdd.RDD$anonfun$13.apply(RDD.scala:601)
>                 at
>         org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>                 at
>         org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>                 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>                 at
>         org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>                 at
>         org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>                 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>                 at
>         org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>                 at
>         org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>                 at org.apache.spark.scheduler.Task.run(Task.scala:56)
>                 at
>         org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
>                 at
>         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>                 at
>         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>                 at java.lang.Thread.run(Thread.java:744)
>
>         I suspected there was an odd line in the input file, but the
>         file is so large that I could not find any abnormal lines
>         even after running several jobs to check. How can I locate
>         the abnormal line here?
>
>
>
