spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sean Owen <sro...@gmail.com>
Subject Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka
Date Tue, 06 Apr 2021 16:39:28 GMT
Gabor's point is that these are not libraries you typically install in your
cluster itself. You package them with your app.

On Tue, Apr 6, 2021 at 11:35 AM Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> Hi G
>
> Thanks for the heads-up.
>
> In a thread on 3rd of March I reported that 3.1.1 works in yarn mode
>
> Spark 3.1.1 Preliminary results (mainly to do with Spark Structured
> Streaming) (mail-archive.com)
> <https://www.mail-archive.com/user@spark.apache.org/msg75979.html>
>
> From that mail
>
>
> The needed jar files for version 3.1.1 to read from Kafka and write to
> BigQuery for 3.1.1 are as follows:
>
> All under $SPARK_HOME/jars on all nodes. These are the latest available jar
> files
>
>
>    - commons-pool2-2.9.0.jar
>    - spark-token-provider-kafka-0-10_2.12-3.1.0.jar
>    - spark-sql-kafka-0-10_2.12-3.1.0.jar
>    - kafka-clients-2.7.0.jar
>    - spark-bigquery-latest_2.12.jar
>
>
>
> I just tested it and in local mode single JVM it works fine without the
> addition of package --> --packages
> org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1
>  BUT including all the above jars files
>
> Batch: 17
> -------------------------------------------
> +--------------------+------+-------------------+------+
> |              rowkey|ticker|         timeissued| price|
> +--------------------+------+-------------------+------+
> |54651f0d-1be0-4d7...|   IBM|2021-04-06 17:17:04| 91.92|
> |8aa1ad79-4792-466...|   SAP|2021-04-06 17:17:04| 34.93|
> |8567f327-cfec-43d...|  TSCO|2021-04-06 17:17:04| 324.5|
> |138a1278-2f54-45b...|   VOD|2021-04-06 17:17:04| 241.4|
> |e02793c3-8e78-47e...|  ORCL|2021-04-06 17:17:04|  17.6|
> |0ab456fb-bd22-465...|  SBRY|2021-04-06 17:17:04|350.45|
> |74588e92-a3e2-48c...|  MSFT|2021-04-06 17:17:04| 44.58|
> |1e7203c6-6938-4ea...|    BP|2021-04-06 17:17:04| 588.0|
> |1e55021a-148d-4aa...|   MRW|2021-04-06 17:17:04|171.21|
> |229ad6f9-e4ed-475...|   MKS|2021-04-06 17:17:04|439.17|
> +--------------------+------+-------------------+------+
>
> However, if I exclude the jar file spark-sql-kafka-0-10_2.12-3.1.0.jar and
> include the packages as suggested in the link
>
>
> spark-submit --master local[4] --conf
> spark.pyspark.virtualenv.enabled=true --conf
> spark.pyspark.virtualenv.type=native --conf
> spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt
> --conf
> spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv
> --conf
> spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 *--packages
> org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1* xyz.py
>
> It cannot fetch the data
>
> root
>  |-- parsed_value: struct (nullable = true)
>  |    |-- rowkey: string (nullable = true)
>  |    |-- ticker: string (nullable = true)
>  |    |-- timeissued: timestamp (nullable = true)
>  |    |-- price: float (nullable = true)
>
> {'message': 'Initializing sources', 'isDataAvailable': False,
> 'isTriggerActive': False}
> -------------------------------------------
> Batch: 0
> -------------------------------------------
> +------+------+----------+-----+
> |rowkey|ticker|timeissued|price|
> +------+------+----------+-----+
> +------+------+----------+-----+
>
> 2021-04-06 17:20:11,492 ERROR util.Utils: Aborting task
> java.lang.NoSuchMethodError:
> org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z
>         at
> org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.getOrRetrieveConsumer(KafkaDataConsumer.scala:549)
>         at
> org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:291)
>         at
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>         at
> org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:604)
>         at
> org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.get(KafkaDataConsumer.scala:287)
>         at
> org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.next(KafkaBatchPartitionReader.scala:63)
>         at
> org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
>         at
> org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
>         at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>         at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>         at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
> Source)
>         at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
>         at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>         at
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:413)
>         at
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
>         at
> org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:452)
>         at
> org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:360)
>         at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:131)
>         at
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>         at
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>         at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2021-04-06 17:20:11,492 ERROR util.Utils: Aborting task
> java.lang.NoSuchMethodError:
> org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z
>
>
> Now I deleted ~/.ivy2 directory and ran the job again
>
> Ivy Default Cache set to: /home/hduser/.ivy2/cache
> The jars for the packages stored in: /home/hduser/.ivy2/jars
> org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
> :: resolving dependencies ::
> org.apache.spark#spark-submit-parent-2bab6bd2-3136-4783-b044-810f0800ef0e;1.0
>
> let us go and have a look at the directory .ivy2/jars
>
>  /home/hduser/.ivy2/jars> ltr
> total 13108
> -rw-r--r-- 1 hduser hadoop    2777 Oct 22  2014
> org.spark-project.spark_unused-1.0.0.jar
> -rw-r--r-- 1 hduser hadoop  129174 Apr  6  2019
> org.apache.commons_commons-pool2-2.6.2.jar
> -rw-r--r-- 1 hduser hadoop   41472 Dec 16  2019
> org.slf4j_slf4j-api-1.7.30.jar
> -rw-r--r-- 1 hduser hadoop  649950 Jan 18  2020 org.lz4_lz4-java-1.7.1.jar
> -rw-r--r-- 1 hduser hadoop 3754508 Jul 28  2020
> org.apache.kafka_kafka-clients-2.6.0.jar
> -rw-r--r-- 1 hduser hadoop 1969177 Nov 28 18:10
> org.xerial.snappy_snappy-java-1.1.8.2.jar
> -rw-r--r-- 1 hduser hadoop 6407352 Dec 19 13:14
> com.github.luben_zstd-jni-1.4.8-1.jar
> -rw-r--r-- 1 hduser hadoop  387494 Feb 22 03:57
> org.apache.spark_spark-sql-kafka-0-10_2.12-3.1.1.jar
> -rw-r--r-- 1 hduser hadoop   55766 Feb 22 03:58
> org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.1.1.jar
> drwxr-xr-x 4 hduser hadoop    4096 Apr  6 17:25 ..
> drwxr-xr-x 2 hduser hadoop    4096 Apr  6 17:25 .
>
> Strangely these jar files like org.apache.kafka_kafka-clients-2.6.0.jar
> and org.apache.commons_commons-pool2-2.6.2.jar seem to be out of date.
>
> Very confusing. Sounds like we have changed something in the cluster that
> as reported on 3rd March it  used to work with those jar files and now not
> working.
>
> So in summary *without those jar files added to $SPARK_HOME/jars i*t
> fails totally even with the packages added.
>
> Cheers
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 6 Apr 2021 at 15:44, Gabor Somogyi <gabor.g.somogyi@gmail.com>
> wrote:
>
>> > Anyway I unzipped the tarball for Spark-3.1.1 and there is
>> no spark-sql-kafka-0-10_2.12-3.0.1.jar even
>>
>> Please see how Structured Streaming app with Kafka needs to be deployed
>> here:
>> https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying
>> I don't see the --packages option...
>>
>> G
>>
>>
>> On Tue, Apr 6, 2021 at 2:40 PM Mich Talebzadeh <mich.talebzadeh@gmail.com>
>> wrote:
>>
>>> OK thanks for that.
>>>
>>> I am using spark-submit with PySpark as follows
>>>
>>>  spark-submit --version
>>> Welcome to
>>>       ____              __
>>>      / __/__  ___ _____/ /__
>>>     _\ \/ _ \/ _ `/ __/  '_/
>>>    /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
>>>       /_/
>>>
>>> Using Scala version 2.12.9, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_201
>>> Branch HEAD
>>> Compiled by user ubuntu on 2021-02-22T01:33:19Z
>>>
>>>
>>> spark-submit --master yarn --deploy-mode client --conf
>>> spark.pyspark.virtualenv.enabled=true --conf
>>> spark.pyspark.virtualenv.type=native --conf
>>> spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt
>>> --conf
>>> spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv
>>> --conf
>>> spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3
>>> --driver-memory 16G --executor-memory 8G --num-executors 4 --executor-cores
>>> 2 xyz.py
>>>
>>> enabling with virtual environment
>>>
>>>
>>> That works fine with any job that does not do structured streaming in a
>>> client mode.
>>>
>>>
>>> Running on local  node with
>>>
>>>
>>> spark-submit --master local[4] --conf
>>> spark.pyspark.virtualenv.enabled=true --conf
>>> spark.pyspark.virtualenv.type=native --conf
>>> spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt
>>> --conf
>>> spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv
>>> --conf
>>> spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3
>>> xyz.py
>>>
>>>
>>> works fine with the same spark version and $SPARK_HOME/jars
>>>
>>>
>>> Cheers
>>>
>>>
>>>
>>>    view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 6 Apr 2021 at 13:20, Sean Owen <srowen@gmail.com> wrote:
>>>
>>>> You may be compiling your app against 3.0.1 JARs but submitting to
>>>> 3.1.1.
>>>> You do not in general modify the Spark libs. You need to package libs
>>>> like this with your app at the correct version.
>>>>
>>>> On Tue, Apr 6, 2021 at 6:42 AM Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>> Thanks Gabor.
>>>>>
>>>>> All nodes are running Spark /spark-3.1.1-bin-hadoop3.2
>>>>>
>>>>> So $SPARK_HOME/jars contains all the required jars on all nodes
>>>>> including the jar file commons-pool2-2.9.0.jar as well.
>>>>>
>>>>> They are installed identically on all nodes.
>>>>>
>>>>> I have looked at the Spark environment for classpath. Still I don't
>>>>> see the reason why Spark 3.1.1 fails with spark-sql-kafka-0-10_2.
>>>>> 12-3.1.1.jar
>>>>> but works ok with  spark-sql-kafka-0-10_2.12-3.1.0.jar
>>>>>
>>>>> Anyway I unzipped the tarball for Spark-3.1.1 and there is
>>>>> no spark-sql-kafka-0-10_2.12-3.0.1.jar even
>>>>>
>>>>> I had to add spark-sql-kafka-0-10_2.12-3.0.1.jar to make it work. Then
>>>>> I enquired the availability of new version from Maven that pointed to
>>>>> *spark-sql-kafka-0-10_2.12-3.1.1.jar*
>>>>>
>>>>> So to confirm Spark out of the tarball does not have any
>>>>>
>>>>> ltr spark-sql-kafka-*
>>>>> ls: cannot access spark-sql-kafka-*: No such file or directory
>>>>>
>>>>>
>>>>> For SSS, I had to add these
>>>>>
>>>>> add commons-pool2-2.9.0.jar. The one shipped is
>>>>>  commons-pool-1.5.4.jar!
>>>>>
>>>>> add kafka-clients-2.7.0.jar  Did not have any
>>>>>
>>>>> add  spark-sql-kafka-0-10_2.12-3.0.1.jar  Did not have any
>>>>>
>>>>> I gather from your second mail, there seems to be an issue with
>>>>> spark-sql-kafka-0-10_2.12-3.*1*.1.jar ?
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>>
>>>>>    view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, 6 Apr 2021 at 11:54, Gabor Somogyi <gabor.g.somogyi@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Since you've not shared too much details I presume you've updated
the spark-sql-kafka
>>>>>> jar only.
>>>>>> KafkaTokenUtil is in the token provider jar.
>>>>>>
>>>>>> As a general note if I'm right, please update Spark as a whole on
all
>>>>>> nodes and not just jars independently.
>>>>>>
>>>>>> BR,
>>>>>> G
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 6, 2021 at 10:21 AM Mich Talebzadeh <
>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>> Any chance of someone testing  the latest spark-sql-kafka-0-10_2.12-3.1.1.jar
>>>>>>> for Spark. It throws
>>>>>>>
>>>>>>>
>>>>>>> java.lang.NoSuchMethodError:
>>>>>>> org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z
>>>>>>>
>>>>>>>
>>>>>>> However, the previous version spark-sql-kafka-0-10_2.12-3.0.1.jar
>>>>>>> works fine
>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>    view my Linkedin profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property
which may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary
damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>

Mime
View raw message