spark-dev mailing list archives

From Ryan Blue <rb...@netflix.com.INVALID>
Subject Re: [VOTE] Apache Spark 2.2.0 (RC1)
Date Mon, 01 May 2017 19:33:07 GMT
Michael, I think that the problem is with your classpath.

Spark has a dependency on Avro 1.7.7, which can't be changed. Your project is
what pulls in parquet-avro and transitively Avro 1.8. Spark has no runtime
dependency on Avro 1.8. It is understandably annoying that using the same
version of Parquet for your parquet-avro dependency is what causes your
project to depend on Avro 1.8, but Spark's dependencies aren't a problem
because its Parquet dependency doesn't bring in Avro.

There are a few ways around this:
1. Make sure Avro 1.8 is found in the classpath first
2. Shade Avro 1.8 in your project (assuming Avro classes aren't shared)
3. Use parquet-avro 1.8.1 in your project, which I think should work with
Parquet 1.8.2 and avoid the Avro change
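
For the first option, it can help to confirm which jar a class is actually
loaded from at runtime. The following is a minimal sketch, not something taken
from Spark or Parquet: the class name WhichJar and the helper locate are
hypothetical, and org.apache.avro.Schema is only assumed here as a
representative Avro class to probe. It relies on standard JDK calls only.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports where the JVM loaded a class from.
final class WhichJar {

    // Returns the jar or directory that supplied the class, or a short note
    // when the class is missing or the loader exposes no code source.
    static String locate(String className) {
        try {
            Class<?> cls = Class.forName(className);
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return "(no code source: bootstrap class or unknown loader)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(class not found)";
        }
    }

    public static void main(String[] args) {
        // Assumption: probe an Avro class by default; pass another name to override.
        String name = args.length > 0 ? args[0] : "org.apache.avro.Schema";
        System.out.println(name + " <- " + locate(name));
    }
}
```

Calling locate from inside the application, with the same classpath the job
runs with, should show which jar supplies Avro. If that turns out to be an
Avro 1.7 jar, then Avro 1.8 is not being found first.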

The work-around in Spark is for tests, which do use parquet-avro. We can
look at a Parquet 1.8.3 that avoids this issue, but I think this is
reasonable for the 2.2.0 release.

rb

On Mon, May 1, 2017 at 12:08 PM, Michael Heuer <heuermh@gmail.com> wrote:

> Please excuse me if I'm misunderstanding -- the problem is not with our
> library or our classpath.
>
> There is a conflict within Spark itself, in that Parquet 1.8.2 expects to
> find Avro 1.8.0 on the runtime classpath and sees 1.7.7 instead.  Spark
> already has to work around this for unit tests to pass.
>
>
>
> On Mon, May 1, 2017 at 2:00 PM, Ryan Blue <rblue@netflix.com> wrote:
>
>> Thanks for the extra context, Frank. I agree that it sounds like your
>> problem comes from the conflict between your Jars and what comes with
>> Spark. It's the same concern that makes everyone shudder when anything has a
>> public dependency on Jackson. :)
>>
>> What we usually do to get around situations like this is to relocate the
>> problem library inside the shaded Jar. That way, Spark uses its version of
>> Avro and your classes use a different version of Avro. This works if you
>> don't need to share classes between the two. Would that work for your
>> situation?
>>
>> rb
>>
>> On Mon, May 1, 2017 at 11:55 AM, Koert Kuipers <koert@tresata.com> wrote:
>>
>>> sounds like you are running into the fact that you cannot really put
>>> your classes before spark's on classpath? spark's switches to support this
>>> never really worked for me either.
>>>
>>> inability to control the classpath + inconsistent jars => trouble ?
>>>
>>> On Mon, May 1, 2017 at 2:36 PM, Frank Austin Nothaft <
>>> fnothaft@berkeley.edu> wrote:
>>>
>>>> Hi Ryan,
>>>>
>>>> We do set Avro to 1.8 in our downstream project. We also set Spark as a
>>>> provided dependency, and build an überjar. We run via spark-submit, which
>>>> builds the classpath with our überjar and all of the Spark deps. This leads
>>>> to avro 1.7.7 getting picked off of the classpath at runtime, which causes
>>>> the no such method exception to occur.
>>>>
>>>> Regards,
>>>>
>>>> Frank Austin Nothaft
>>>> fnothaft@berkeley.edu
>>>> fnothaft@eecs.berkeley.edu
>>>> 202-340-0466 <(202)%20340-0466>
>>>>
>>>> On May 1, 2017, at 11:31 AM, Ryan Blue <rblue@netflix.com> wrote:
>>>>
>>>> Frank,
>>>>
>>>> The issue you're running into is caused by using parquet-avro with Avro
>>>> 1.7. Can't your downstream project set the Avro dependency to 1.8? Spark
>>>> can't update Avro because it is a breaking change that would force users to
>>>> rebuild specific Avro classes in some cases. But you should be free to use
>>>> Avro 1.8 to avoid the problem.
>>>>
>>>> On Mon, May 1, 2017 at 11:08 AM, Frank Austin Nothaft <
>>>> fnothaft@berkeley.edu> wrote:
>>>>
>>>>> Hi Ryan et al,
>>>>>
>>>>> The issue we’ve seen using a build of the Spark 2.2.0 branch from a
>>>>> downstream project is that parquet-avro uses one of the new Avro 1.8.0
>>>>> methods, and you get a NoSuchMethodError since Spark puts Avro 1.7.7 as a
>>>>> dependency. My colleague Michael (who posted earlier on this thread)
>>>>> documented this in SPARK-19697
>>>>> <https://issues.apache.org/jira/browse/SPARK-19697>. I know that
>>>>> Spark has unit tests that check this compatibility issue, but it looks like
>>>>> there was a recent change that sets a test scope dependency on Avro
>>>>> 1.8.0
>>>>> <https://github.com/apache/spark/commit/0077bfcb93832d93009f73f4b80f2e3d98fd2fa4>,
>>>>> which masks this issue in the unit tests. With this error, you can’t use
>>>>> the ParquetAvroOutputFormat from an application running on Spark 2.2.0.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Frank Austin Nothaft
>>>>> fnothaft@berkeley.edu
>>>>> fnothaft@eecs.berkeley.edu
>>>>> 202-340-0466 <(202)%20340-0466>
>>>>>
>>>>> On May 1, 2017, at 10:02 AM, Ryan Blue <rblue@netflix.com.INVALID
>>>>> <rblue@netflix.com.invalid>> wrote:
>>>>>
>>>>> I agree with Sean. Spark only pulls in parquet-avro for tests. For
>>>>> execution, it implements the record materialization APIs in Parquet to go
>>>>> directly to Spark SQL rows. This doesn't actually leak an Avro 1.8
>>>>> dependency into Spark as far as I can tell.
>>>>>
>>>>> rb
>>>>>
>>>>> On Mon, May 1, 2017 at 8:34 AM, Sean Owen <sowen@cloudera.com> wrote:
>>>>>
>>>>>> See discussion at https://github.com/apache/spark/pull/17163 -- I
>>>>>> think the issue is that fixing this trades one problem for a slightly
>>>>>> bigger one.
>>>>>>
>>>>>>
>>>>>> On Mon, May 1, 2017 at 4:13 PM Michael Heuer <heuermh@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Version 2.2.0 bumps the dependency version for parquet to 1.8.2 but
>>>>>>> does not bump the dependency version for avro (currently at 1.7.7).  Though
>>>>>>> perhaps not clear from the issue I reported [0], this means that Spark is
>>>>>>> internally inconsistent, in that a call through parquet (which depends on
>>>>>>> avro 1.8.0 [1]) may throw errors at runtime when it hits avro 1.7.7 on the
>>>>>>> classpath.  Avro 1.8.0 is not binary compatible with 1.7.7.
>>>>>>>
>>>>>>> [0] - https://issues.apache.org/jira/browse/SPARK-19697
>>>>>>> [1] - https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/pom.xml#L96
>>>>>>>
>>>>>>> On Sun, Apr 30, 2017 at 3:28 AM, Sean Owen <sowen@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I have one more issue that, if it needs to be fixed, needs to be
>>>>>>>> fixed for 2.2.0.
>>>>>>>>
>>>>>>>> I'm fixing build warnings for the release and noticed that
>>>>>>>> checkstyle actually complains there are some Java methods named in
>>>>>>>> TitleCase, like `ProcessingTimeTimeout`:
>>>>>>>>
>>>>>>>> https://github.com/apache/spark/pull/17803/files#r113934080
>>>>>>>>
>>>>>>>> Easy enough to fix and it's right, that's not conventional. However
>>>>>>>> I wonder if it was done on purpose to match a class name?
>>>>>>>>
>>>>>>>> I think this is one for @tdas
>>>>>>>>
>>>>>>>> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust <
>>>>>>>> michael@databricks.com> wrote:
>>>>>>>>
>>>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>>>> version 2.2.0. The vote is open until Tues, May 2nd, 2017 at
>>>>>>>>> 12:00 PST and passes if a majority of at least 3 +1 PMC votes are
>>>>>>>>> cast.
>>>>>>>>>
>>>>>>>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> To learn more about Apache Spark, please see
>>>>>>>>> http://spark.apache.org/
>>>>>>>>>
>>>>>>>>> The tag to be voted on is v2.2.0-rc1
>>>>>>>>> <https://github.com/apache/spark/tree/v2.2.0-rc1>
>>>>>>>>> (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)
>>>>>>>>>
>>>>>>>>> List of JIRA tickets resolved can be found with this filter
>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>.
>>>>>>>>>
>>>>>>>>> The release files, including signatures, digests, etc. can be
>>>>>>>>> found at:
>>>>>>>>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
>>>>>>>>>
>>>>>>>>> Release artifacts are signed with the following key:
>>>>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>>>>
>>>>>>>>> The staging repository for this release can be found at:
>>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1235/
>>>>>>>>>
>>>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *FAQ*
>>>>>>>>>
>>>>>>>>> *How can I help test this release?*
>>>>>>>>>
>>>>>>>>> If you are a Spark user, you can help us test this release by
>>>>>>>>> taking an existing Spark workload and running on this release candidate,
>>>>>>>>> then reporting any regressions.
>>>>>>>>>
>>>>>>>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>>>>>>>
>>>>>>>>> Committers should look at those and triage. Extremely important
>>>>>>>>> bug fixes, documentation, and API tweaks that impact compatibility should
>>>>>>>>> be worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>>>>>>>
>>>>>>>>> *But my bug isn't fixed!??!*
>>>>>>>>>
>>>>>>>>> In order to make timely releases, we will typically not hold the
>>>>>>>>> release unless the bug in question is a regression from 2.1.1.
>>>>>>>>>


-- 
Ryan Blue
Software Engineer
Netflix
