spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-14901) java exception when showing join
Date Thu, 12 Jan 2017 02:20:16 GMT

    [ https://issues.apache.org/jira/browse/SPARK-14901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15819920#comment-15819920 ]

Hyukjin Kwon commented on SPARK-14901:
--------------------------------------

Would it be possible to provide a self-contained reproducer? I can't reproduce this with
plain Spark, using the code below:

{code}
dispute_df = spark.range(10).toDF("COMMENTID")
comments_df = spark.range(5).toDF("COMMENTID")
dispute_df.join(comments_df, dispute_df.COMMENTID == comments_df.COMMENTID).first()
{code}

Do you mind if I ask where I can download Netezza?
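
For what it's worth, the root exception in the quoted trace ({{EOF whilst processing escape sequence}}, thrown from {{org.apache.commons.csv.Lexer.readEscape}}) suggests that the connector's CSV parser hit the end of its input right after an escape character. A minimal sketch for spotting such values in exported data follows. It assumes the escape character is a backslash, which the report does not confirm. The helper name {{ends_with_dangling_escape}} and the sample rows are made up for illustration, and only the column names come from the printed schema.

```python
def ends_with_dangling_escape(value, escape="\\"):
    """Return True if `value` ends with an unpaired escape character.

    Hypothetical helper: assumes a single-character escape (backslash by
    default), which may not match the connector's actual settings.
    """
    if value is None:
        return False
    # Length of the run of escape characters at the very end of the string.
    trailing = len(value) - len(value.rstrip(escape))
    # An odd run means the last escape character has nothing left to escape.
    return trailing % 2 == 1


# Made-up sample rows; only the column names are taken from the schema.
rows = [
    {"COMMENTID": "1", "COMMENTTEXT": "plain text"},
    {"COMMENTID": "2", "COMMENTTEXT": "path C:\\temp\\"},
    {"COMMENTID": "3", "COMMENTTEXT": "escaped backslash \\\\"},
]

suspects = [r["COMMENTID"] for r in rows
            if ends_with_dangling_escape(r["COMMENTTEXT"])]
print(suspects)
```

If a value like this turns up in one of the source tables, it might help narrow the problem down to the connector's record parsing rather than the join itself.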

> java exception when showing join
> --------------------------------
>
>                 Key: SPARK-14901
>                 URL: https://issues.apache.org/jira/browse/SPARK-14901
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.1
>            Reporter: Brent Elmer
>
> I am using pyspark with netezza.  I am getting a java exception when trying to show the
first row of a join.  I can show the first row of each of the two dataframes separately but not
the result of a join.  I get the same error for any action I take (first, collect, show).
Am I doing something wrong?
> from pyspark.sql import SQLContext
> sqlContext = SQLContext(sc)
> dispute_df = sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://***:5480/db',
user='***', password='***', dbtable='table1', driver='com.ibm.spark.netezza').load()
> dispute_df.printSchema()
> comments_df = sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://***:5480/db',
user='***', password='***', dbtable='table2', driver='com.ibm.spark.netezza').load()
> comments_df.printSchema()
> dispute_df.join(comments_df, dispute_df.COMMENTID == comments_df.COMMENTID).first()
> root
>  |-- COMMENTID: string (nullable = true)
>  |-- EXPORTDATETIME: timestamp (nullable = true)
>  |-- ARTAGS: string (nullable = true)
>  |-- POTAGS: string (nullable = true)
>  |-- INVTAG: string (nullable = true)
>  |-- ACTIONTAG: string (nullable = true)
>  |-- DISPUTEFLAG: string (nullable = true)
>  |-- ACTIONFLAG: string (nullable = true)
>  |-- CUSTOMFLAG1: string (nullable = true)
>  |-- CUSTOMFLAG2: string (nullable = true)
> root
>  |-- COUNTRY: string (nullable = true)
>  |-- CUSTOMER: string (nullable = true)
>  |-- INVNUMBER: string (nullable = true)
>  |-- INVSEQNUMBER: string (nullable = true)
>  |-- LEDGERCODE: string (nullable = true)
>  |-- COMMENTTEXT: string (nullable = true)
>  |-- COMMENTTIMESTAMP: timestamp (nullable = true)
>  |-- COMMENTLENGTH: long (nullable = true)
>  |-- FREEINDEX: long (nullable = true)
>  |-- COMPLETEDFLAG: long (nullable = true)
>  |-- ACTIONFLAG: long (nullable = true)
>  |-- FREETEXT: string (nullable = true)
>  |-- USERNAME: string (nullable = true)
>  |-- ACTION: string (nullable = true)
>  |-- COMMENTID: string (nullable = true)
> ---------------------------------------------------------------------------
> Py4JJavaError                             Traceback (most recent call last)
> <ipython-input-19-0cb9eb943052> in <module>()
>       5 comments_df = sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://***:5480/db',
user='***', password='***', dbtable='table2', driver='com.ibm.spark.netezza').load()
>       6 comments_df.printSchema()
> ----> 7 dispute_df.join(comments_df, dispute_df.COMMENTID == comments_df.COMMENTID).first()
> /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in first(self)
>     802         Row(age=2, name=u'Alice')
>     803         """
> --> 804         return self.head()
>     805 
>     806     @ignore_unicode_prefix
> /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in head(self,
n)
>     790         """
>     791         if n is None:
> --> 792             rs = self.head(1)
>     793             return rs[0] if rs else None
>     794         return self.take(n)
> /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in head(self,
n)
>     792             rs = self.head(1)
>     793             return rs[0] if rs else None
> --> 794         return self.take(n)
>     795 
>     796     @ignore_unicode_prefix
> /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in take(self,
num)
>     304         with SCCallSiteSync(self._sc) as css:
>     305             port = self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(
> --> 306                 self._jdf, num)
>     307         return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
>     308 
> /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
in __call__(self, *args)
>     811         answer = self.gateway_client.send_command(command)
>     812         return_value = get_return_value(
> --> 813             answer, self.gateway_client, self.target_id, self.name)
>     814 
>     815         for temp_arg in temp_args:
> /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/utils.pyc in deco(*a,
**kw)
>      43     def deco(*a, **kw):
>      44         try:
> ---> 45             return f(*a, **kw)
>      46         except py4j.protocol.Py4JJavaError as e:
>      47             s = e.java_exception.toString()
> /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py
in get_return_value(answer, gateway_client, target_id, name)
>     306                 raise Py4JJavaError(
>     307                     "An error occurred while calling {0}{1}{2}.\n".
> --> 308                     format(target_id, ".", name), value)
>     309             else:
>     310                 raise Py4JError(
> Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage
59.0 failed 1 times, most recent failure: Lost task 2.0 in stage 59.0 (TID 1406, localhost):
java.io.IOException: EOF whilst processing escape sequence
>     at org.apache.commons.csv.Lexer.readEscape(Lexer.java:346)
>     at org.apache.commons.csv.Lexer.parseSimpleToken(Lexer.java:200)
>     at org.apache.commons.csv.Lexer.nextToken(Lexer.java:161)
>     at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
>     at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
>     at com.ibm.spark.netezza.NetezzaRecordParser.parse(NetezzaRecordParser.scala:43)
>     at com.ibm.spark.netezza.NetezzaDataReader.next(NetezzaDataReader.scala:136)
>     at com.ibm.spark.netezza.NetezzaRDD$$anon$1.getNext(NetezzaRDD.scala:77)
>     at com.ibm.spark.netezza.NetezzaRDD$$anon$1.hasNext(NetezzaRDD.scala:106)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>     at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1143)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:618)
>     at java.lang.Thread.run(Thread.java:785)
> Driver stacktrace:
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
>     at scala.Option.foreach(Option.scala:236)
>     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>     at java.lang.Thread.getStackTrace(Thread.java:1117)
>     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
>     at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
>     at org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply$mcI$sp(python.scala:126)
>     at org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply(python.scala:124)
>     at org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply(python.scala:124)
>     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>     at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>     at org.apache.spark.sql.execution.EvaluatePython$.takeAndServe(python.scala:124)
>     at org.apache.spark.sql.execution.EvaluatePython.takeAndServe(python.scala)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:95)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
>     at java.lang.reflect.Method.invoke(Method.java:507)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
>     at py4j.Gateway.invoke(Gateway.java:259)
>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>     at py4j.GatewayConnection.run(GatewayConnection.java:209)
>     at java.lang.Thread.run(Thread.java:785)
> Caused by: java.io.IOException: EOF whilst processing escape sequence
>     at org.apache.commons.csv.Lexer.readEscape(Lexer.java:346)
>     at org.apache.commons.csv.Lexer.parseSimpleToken(Lexer.java:200)
>     at org.apache.commons.csv.Lexer.nextToken(Lexer.java:161)
>     at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
>     at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
>     at com.ibm.spark.netezza.NetezzaRecordParser.parse(NetezzaRecordParser.scala:43)
>     at com.ibm.spark.netezza.NetezzaDataReader.next(NetezzaDataReader.scala:136)
>     at com.ibm.spark.netezza.NetezzaRDD$$anon$1.getNext(NetezzaRDD.scala:77)
>     at com.ibm.spark.netezza.NetezzaRDD$$anon$1.hasNext(NetezzaRDD.scala:106)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>     at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1143)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:618)
>     ... 1 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

