spark-user mailing list archives

From Anwar AliKhan <anwaralikhan...@gmail.com>
Subject Re: Where are all the jars gone ?
Date Thu, 25 Jun 2020 11:17:39 GMT
I know I can arrive at the same result with this code:

      val range100 = spark.range(1,101).agg((sum('id) as "sum")).first.get(0)
      println(f"sum of range100 =  $range100")

so I am not stuck.
I was just curious 😯 why the code breaks when using the current link
libraries.

spark.range(1,101).reduce(_+_)

spark-submit test

/opt/spark/spark-submit

spark.range(1,101).reduce(_+_)
<console>:24: error: overloaded method value reduce with alternatives:
  (func: org.apache.spark.api.java.function.ReduceFunction[java.lang.Long])java.lang.Long <and>
  (func: (java.lang.Long, java.lang.Long) => java.lang.Long)java.lang.Long
 cannot be applied to ((java.lang.Long, java.lang.Long) => scala.Long)
       spark.range(1,101).reduce(_+_)
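
For reference, the value both Scala expressions are meant to produce can be checked without Spark. This is only a plain-Python sketch of the arithmetic, relying on spark.range(1,101) covering the ids 1 to 100 (start inclusive, end exclusive, like Python's range); the variable names are illustrative.

```python
from functools import reduce
from operator import add

# Same half-open interval as spark.range(1, 101): the ids 1, 2, ..., 100.
ids = range(1, 101)

# Pairwise reduction, analogous in spirit to reduce(_ + _).
total_by_reduce = reduce(add, ids)

# Aggregation, analogous in spirit to agg(sum('id)).
total_by_sum = sum(ids)

# Closed form n * (n + 1) / 2 for n = 100.
closed_form = 100 * 101 // 2

print(total_by_reduce, total_by_sum, closed_form)
```

All three values agree, so the aggregate version and a working reduce should report the same sum.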
<http://www.backbutton.co.uk/>


On Wed, 24 Jun 2020, 19:54 Anwar AliKhan, <anwaralikhanuae@gmail.com> wrote:

>
> I am using the method described on this page for Scala development in
> Eclipse.
>
> https://data-flair.training/blogs/create-spark-scala-project/
>
>
> in the middle of the page you will find
>
>
> *“you will see lots of error due to missing libraries.*
> viii. Add Spark Libraries”
>
>
> Now that I have my own build I will be pointing to the jars (spark
> libraries)
>
> in directory /opt/spark/assembly/target/scala-2.12/jars
>
>
> This way I know exactly which jar libraries I am using to remove the
> aforementioned errors.
>
>
> At the same time I am trying to set up a template environment as shown here
>
>
> https://medium.com/@faizanahemad/apache-spark-setup-with-gradle-scala-and-intellij-2eeb9f30c02a
>
>
> so that I can have the variables sc and spark in the Eclipse editor, the same
> way you would have the spark and sc variables in the spark-shell.
>
>
> I used the word "trying" because the following code is broken
>
>
> spark.range(1,101).reduce(_ + _)
>
> with the latest Spark.
>
>
> If I use the gradle method as described then the code does work because
> it is pulling the libraries from maven repository as stipulated in
> gradle.properties
> <https://github.com/faizanahemad/spark-gradle-template/blob/master/gradle.properties>
> .
>
>
> In my previous post I *forgot* that with a Maven pom.xml you can actually
> specify the version number of the jar you want to pull from the Maven
> repository using the *mvn clean package* command.
>
>
> So even if I use Maven with Eclipse, any new libraries uploaded to the
> Maven repository by developers will have newer version numbers, so they
> will not affect my project.
>
> Can you please tell me why the code spark.range(1,101).reduce(_ + _) is
> broken with the latest Spark?
>
>
> <http://www.backbutton.co.uk/>
>
>
> On Wed, 24 Jun 2020, 17:07 Jeff Evans, <jeffrey.wayne.evans@gmail.com>
> wrote:
>
>> If I'm understanding this correctly, you are building Spark from source
>> and using the built artifacts (jars) in some other project.  Correct?  If
>> so, then why are you concerning yourself with the directory structure that
>> Spark, internally, uses when building its artifacts?  It should be a black
>> box to your application, entirely.  You would pick the profiles (ex: Scala
>> version, Hadoop version, etc.) you need, then the install phase of Maven
>> will take care of building the jars and putting them in your local Maven
>> repo.  After that, you can resolve them from your other project seamlessly
>> (simply by declaring the org/artifact/version).
>>
>> Maven artifacts are immutable, at least released versions in Maven
>> central.  If "someone" (unclear who you are talking about) is "swapping
>> out" jars in a Maven repo then they're doing something extremely strange
>> and broken, unless they're simply replacing snapshot versions, which is a different
>> beast entirely
>> <https://maven.apache.org/guides/getting-started/index.html#What_is_a_SNAPSHOT_version>
>> .
>>
>> On Wed, Jun 24, 2020 at 10:39 AM Anwar AliKhan <anwaralikhanuae@gmail.com>
>> wrote:
>>
>>> THANKS
>>>
>>>
>>> It appears the directory containing the jars has been moved between the
>>> download version and the source version.
>>>
>>> In the download version it is a directory called jars, just below the
>>> parent directory (level 1).
>>>
>>> In the git source version it is 4 levels down, in the directory
>>>  /spark/assembly/target/scala-2.12/jars
>>>
>>> The issue I have with using Maven is that the link libraries can be
>>> changed in the Maven repository without my knowledge.
>>> So an application that compiled and worked previously could just break.
>>>
>>> It is not like the developers run a change to the link libraries
>>> by me first 😢, they just upload it to the Maven repository without
>>> asking me if their change
>>> is going to impact my app.
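
A minimal sketch of a lookup that copes with both layouts described above: jars directly under the Spark directory in the download version, and assembly/target/scala-<version>/jars in a source build. The function name find_spark_jars_dir is hypothetical and only illustrates the idea; it is not part of Spark.

```python
from pathlib import Path
from typing import Optional


def find_spark_jars_dir(spark_home: str) -> Optional[Path]:
    """Return the directory holding Spark's jars, or None if neither layout matches."""
    home = Path(spark_home)

    # Download (binary) layout: jars sit directly under the Spark directory.
    release_dir = home / "jars"
    if release_dir.is_dir():
        return release_dir

    # Source-build layout: assembly/target/scala-<version>/jars.
    # The Scala version is globbed instead of being hard-coded to 2.12.
    for candidate in sorted(home.glob("assembly/target/scala-*/jars")):
        if candidate.is_dir():
            return candidate

    return None
```

Calling find_spark_jars_dir('/opt/spark') on a source build like the one in this thread would be expected to return the assembly path, which could then be added to the Eclipse build path as external jars.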
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, 24 Jun 2020, 16:07 ArtemisDev, <artemis@dtechspace.com> wrote:
>>>
>>>> If you are using Maven to manage your jar dependencies, the jar files
>>>> are located in the local Maven repository in your home directory, usually
>>>> under the .m2 directory.
>>>>
>>>> Hope this helps.
>>>>
>>>> -ND
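
To make the .m2 location concrete, here is a small sketch of how Maven's standard repository layout maps a group/artifact/version triple to a jar path. The helper name local_repo_jar and the example coordinates (org.apache.spark, spark-sql_2.12, 3.0.0) are assumptions for illustration; the thread does not say which version is in use.

```python
from pathlib import Path
from typing import Optional


def local_repo_jar(group_id: str, artifact_id: str, version: str,
                   repo_root: Optional[Path] = None) -> Path:
    """Build the expected jar path for a coordinate in a local Maven repository."""
    if repo_root is None:
        # Default location of the local repository: ~/.m2/repository
        repo_root = Path.home() / ".m2" / "repository"

    # Dots in the groupId become directory separators.
    group_path = Path(*group_id.split("."))

    # Layout: <repo>/<group path>/<artifactId>/<version>/<artifactId>-<version>.jar
    return repo_root / group_path / artifact_id / version / f"{artifact_id}-{version}.jar"


# Example coordinates, assumed for illustration only.
print(local_repo_jar("org.apache.spark", "spark-sql_2.12", "3.0.0"))
```

Because the version is part of the path, two different versions of the same artifact end up in separate directories.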
>>>> On 6/23/20 3:21 PM, Anwar AliKhan wrote:
>>>>
>>>> Hi,
>>>>
>>>> I prefer to do most of my projects in Python and for that I use Jupyter.
>>>> I have been downloading the compiled version of spark.
>>>>
>>>> I do not normally like the source code version because the build
>>>> process makes me nervous.
>>>> You know, with lines of stuff scrolling up the screen.
>>>> What am I going to do if a build fails? I am a user!
>>>>
>>>> I decided to risk it and it was only one mvn command to build (45
>>>> minutes later).
>>>> Everything is great. Success.
>>>>
>>>> I removed all jvms except jdk8 for compilation.
>>>>
>>>> I used jdk8 so I know which libraries were linked in the build process.
>>>> I also used my local version of Maven, not the apt install version.
>>>>
>>>> I used jdk8 because if you go to this Scala site:
>>>>
>>>> http://scala-ide.org/download/sdk.html
>>>>
>>>> they state jdk8 as the requirement for the IDE, even for Scala 2.12.
>>>> They don't say JDK 8 or higher, just jdk8.
>>>>
>>>> So anyway, once in a while I do Spark projects in Scala with Eclipse.
>>>>
>>>> For that I don't use Maven or anything. I prefer to make use of the build
>>>> path and external jars. This way I know exactly which libraries I am
>>>> linking to.
>>>>
>>>> Creating a jar in Eclipse for spark-submit is straightforward.
>>>>
>>>>
>>>> Anyway, as you can see (below), I am pointing Jupyter at Spark with
>>>> findspark.init('/opt/spark').
>>>> That's OK, everything is fine.
>>>>
>>>> With the compiled version of Spark there is a jars directory which I
>>>> have been using in Eclipse.
>>>>
>>>>
>>>>
>>>> With my own compiled-from-source version there is no jars directory.
>>>>
>>>>
>>>> Where are all the jars gone?
>>>>
>>>>
>>>>
>>>> I am not sure how findspark.init('/opt/spark') is locating the
>>>> libraries unless it is finding them from
>>>> Anaconda.
>>>>
>>>>
>>>> import findspark
>>>> findspark.init('/opt/spark')
>>>> from pyspark.sql import SparkSession
>>>> spark = SparkSession \
>>>>     .builder \
>>>>     .appName('Titanic Data') \
>>>>     .getOrCreate()
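
On how findspark.init('/opt/spark') can work without a top-level jars directory: the Python side mainly needs the PySpark sources and the bundled Py4J zip on sys.path, both of which live under the Spark directory in the standard layout rather than in Anaconda. The following is a rough sketch of that idea, not findspark's actual code; the function name pyspark_path_entries is hypothetical.

```python
import glob
import os
from typing import List


def pyspark_path_entries(spark_home: str) -> List[str]:
    """Return the sys.path entries that would make `import pyspark` resolvable."""
    # PySpark's Python sources live under <spark_home>/python.
    spark_python = os.path.join(spark_home, "python")

    # Py4J ships as a zip under python/lib; the version in the file name varies.
    py4j_zips = sorted(glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip")))

    if not os.path.isdir(spark_python) or not py4j_zips:
        raise FileNotFoundError(f"no PySpark sources found under {spark_home}")

    return [spark_python, py4j_zips[0]]
```

Under these assumptions, prepending the returned entries to sys.path and setting the SPARK_HOME environment variable is enough for the import to succeed; the JVM side is then started through the scripts under SPARK_HOME, which locate the jars themselves.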
>>>>
>>>>
