spark-dev mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))
Date Mon, 09 Mar 2015 21:44:12 GMT
Does the Apache project team have any ability to measure download counts of
the various releases?  That data could be useful when it comes time to
sunset vendor-specific releases, like CDH4 for example.

On Mon, Mar 9, 2015 at 5:34 AM, Mridul Muralidharan <mridul@gmail.com> wrote:

> In an ideal situation, +1 on removing all vendor-specific builds and
> making them just Hadoop-version-specific - that is what we should depend
> on anyway.
> Though I hope Sean is correct in assuming that vendor-specific builds
> for Hadoop 2.4 are just that, and not 2.4- or 2.4+ variants which cause
> incompatibilities for us or our users!
>
> Regards,
> Mridul
>
>
> On Mon, Mar 9, 2015 at 2:50 AM, Sean Owen <sowen@cloudera.com> wrote:
> > Yes, you should always find working bits at Apache no matter what --
> > though 'no matter what' really means 'as long as you use a Hadoop distro
> > compatible with upstream Hadoop'. Even distros have a strong interest
> > in that, since the market, the 'pie', is made large by this kind of
> > freedom at the core.
> >
> > If so, then no vendor-specific builds are needed, only some
> > Hadoop-release-specific ones. So a Hadoop 2.6-specific build could be
> > good (although I'm not yet clear if there's something about 2.5 or 2.6
> > that needs a different build.)
> >
> > I take it that we already believe that, say, the "Hadoop 2.4" build
> > works with CDH5, so no CDH5-specific build is provided by Spark.
> >
> > If a distro doesn't work with stock Spark, then it's either something
> > Spark should fix (e.g. use of a private YARN API or something), or
> > it's something the distro should really fix because it's incompatible.
> >
> > Could we maybe rename the "CDH4" build then, as it doesn't really work
> > with all CDH4, to be a "Hadoop 2.0.x build"? That's been floated
> > before. And can we remove the MapR builds -- or else can someone
> > explain why these exist separately from a Hadoop 2.3 build? I hope it
> > is not *because* they are somehow non-standard. And shall we first run
> > down why Spark doesn't fully work on HDP and see if it's something
> > that Spark or HDP needs to tweak, rather than contemplate another
> > binary? Or, if so, can it simply be called a "Hadoop 2.7 + YARN
> > whatever" build and not made specific to a vendor, even if the project
> > has to field another tarball combo for a vendor?
> >
> > Maybe we are saying almost the same thing.
> >
> >
> > On Mon, Mar 9, 2015 at 1:33 AM, Matei Zaharia <matei.zaharia@gmail.com> wrote:
> >> Yeah, my concern is that people should get Apache Spark from *Apache*,
> >> not from a vendor. It helps everyone use the latest features no matter
> >> where they are. In the Hadoop distro case, Hadoop made all this effort
> >> to have standard APIs (e.g. YARN), so it should be easy. But it is a
> >> problem if we're not packaging for the newest versions of some distros;
> >> I think we just fell behind at Hadoop 2.4.
> >>
> >> Matei
> >>
> >>> On Mar 8, 2015, at 8:02 PM, Sean Owen <sowen@cloudera.com> wrote:
> >>>
> >>> Yeah it's not much overhead, but here's an example of where it causes
> >>> a little issue.
> >>>
> >>> I like that reasoning. However, the released builds don't track the
> >>> later versions of Hadoop that vendors would be distributing -- there's
> >>> no Hadoop 2.6 build for example. CDH4 is here, but not the
> >>> far-more-used CDH5. HDP isn't present at all. The CDH4 build doesn't
> >>> actually work with many CDH4 versions.
> >>>
> >>> I agree with the goal of maximizing the reach of Spark, but I don't
> >>> know how much these builds advance that goal.
> >>>
> >>> Anyone can roll their own exactly-right build, and the docs and build
> >>> have been set up to make that as simple as can be expected. So these
> >>> aren't *required* to let me use the latest Spark on distribution X.
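> >>>
> >>> For example, roughly per the building-spark docs (a sketch; substitute
> >>> the hadoop.version your distro actually ships):
> >>>
> >>>   mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive \
> >>>     -DskipTests clean package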
> >>>
> >>> I had thought these existed to sorta support 'legacy' distributions,
> >>> like CDH4, and that build was justified as a
> >>> quasi-Hadoop-2.0.x-flavored build. But then I don't understand what
> >>> the MapR profiles are for.
> >>>
> >>> I think it's too much work to correctly maintain, in parallel, any
> >>> customizations necessary for every major distro, and it might be
> >>> better not to do it at all than to do it incompletely. You could say
> >>> it's also an enabler for distros to vary in ways that require special
> >>> customization.
> >>>
> >>> Maybe there's a concern that, if lots of people consume Spark on
> >>> Hadoop, and most people consume Hadoop through distros, and distros
> >>> alone manage Spark distributions, then you de facto 'have to' go
> >>> through a distro instead of getting bits from Spark? That's a different
> >>> conversation, but I think this sort of effect does not end up being a
> >>> negative.
> >>>
> >>> Well anyway, I like the idea of seeing how far Hadoop-provided
> >>> releases can help. It might kill several birds with one stone.
> >>>
> >>> On Sun, Mar 8, 2015 at 11:07 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:
> >>>> Our goal is to let people use the latest Apache release even if
> >>>> vendors fall behind or don't want to package everything, so that's why
> >>>> we put out releases for vendors' versions. It's fairly low overhead.
> >>>>
> >>>> Matei
> >>>>
> >>>>> On Mar 8, 2015, at 5:56 PM, Sean Owen <sowen@cloudera.com> wrote:
> >>>>>
> >>>>> Ah. I misunderstood that Matei was referring to the Scala 2.11
> >>>>> tarball at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and
> >>>>> not the Maven artifacts.
> >>>>>
> >>>>> Patrick, I see you just commented on SPARK-5134 and will follow up
> >>>>> there. Sounds like this may accidentally not be a problem.
> >>>>>
> >>>>> On binary tarball releases, I wonder if anyone has an opinion on my
> >>>>> opinion that these shouldn't be distributed for specific Hadoop
> >>>>> *distributions* to begin with. (Won't repeat the argument here yet.)
> >>>>> That resolves this n x m explosion too.
> >>>>>
> >>>>> Vendors already provide their own distribution, yes, that's their job.
> >>>>>
> >>>>>
> >>>>> On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar <ksankar42@gmail.com> wrote:
> >>>>>> Yep, otherwise this will become an N^2 problem - Scala versions X
> >>>>>> Hadoop Distributions X ...
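> >>>>>>
> >>>>>> (Concretely: even 5 Hadoop/distro flavors times 2 Scala versions is
> >>>>>> already 10 tarballs per release, before any Hive/no-Hive variants.)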
> >>>>>>
> >>>>>> Maybe one option is to have a minimum basic set (which I know is
> >>>>>> what we are discussing) and move the rest to spark-packages.org.
> >>>>>> There the vendors can add the latest downloads - for example when
> >>>>>> 1.4 is released, HDP can build a release of an HDP Spark 1.4 bundle.
> >>>>>>
> >>>>>> Cheers
> >>>>>> <k/>
> >>>>>>
> >>>>>> On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell <pwendell@gmail.com> wrote:
> >>>>>>>
> >>>>>>> We probably want to revisit the way we do binaries in general for
> >>>>>>> 1.4+. IMO, something worth forking a separate thread for.
> >>>>>>>
> >>>>>>> I've been hesitating to add new binaries because people
> >>>>>>> (understandably) complain if you ever stop packaging older ones,
> >>>>>>> but on the other hand the ASF has complained that we have too many
> >>>>>>> binaries already and that we need to pare it down because of the
> >>>>>>> large volume of files. Doubling the number of binaries we produce
> >>>>>>> for Scala 2.11 seemed like it would be too much.
> >>>>>>>
> >>>>>>> One solution potentially is to actually package "Hadoop provided"
> >>>>>>> binaries and encourage users to use these by simply setting
> >>>>>>> HADOOP_HOME, or have instructions for specific distros. I've heard
> >>>>>>> that our existing packages don't work well on HDP for instance,
> >>>>>>> since there are some configuration quirks that differ from the
> >>>>>>> upstream Hadoop.
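> >>>>>>>
> >>>>>>> A rough sketch of what that could look like (assuming the existing
> >>>>>>> hadoop-provided Maven profile, and that the launch scripts can pick
> >>>>>>> up a distro classpath via something like SPARK_DIST_CLASSPATH --
> >>>>>>> details to be verified):
> >>>>>>>
> >>>>>>>   # build a tarball without Hadoop's jars bundled in
> >>>>>>>   ./make-distribution.sh --name hadoop-provided --tgz \
> >>>>>>>     -Pyarn -Phadoop-provided -Phadoop-2.4
> >>>>>>>
> >>>>>>>   # on the cluster, point Spark at the distro's own Hadoop jars
> >>>>>>>   export SPARK_DIST_CLASSPATH=$(hadoop classpath)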
> >>>>>>>
> >>>>>>> If we cut down on the cross building for Hadoop versions, then it
> >>>>>>> is more tenable to cross build for Scala versions without exploding
> >>>>>>> the number of binaries.
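> >>>>>>>
> >>>>>>> For reference, the 2.11 cross-build today is roughly (per the docs;
> >>>>>>> a sketch, not verified against this RC):
> >>>>>>>
> >>>>>>>   dev/change-version-to-2.11.sh
> >>>>>>>   mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package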
> >>>>>>>
> >>>>>>> - Patrick
> >>>>>>>
> >>>>>>> On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen <sowen@cloudera.com> wrote:
> >>>>>>>> Yeah, interesting question of what is the better default for the
> >>>>>>>> single set of artifacts published to Maven. I think there's an
> >>>>>>>> argument for Hadoop 2 and perhaps Hive for the 2.10 build too.
> >>>>>>>> Pros and cons discussed more at
> >>>>>>>>
> >>>>>>>> https://issues.apache.org/jira/browse/SPARK-5134
> >>>>>>>> https://github.com/apache/spark/pull/3917
> >>>>>>>>
> >>>>>>>> On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:
> >>>>>>>>> +1
> >>>>>>>>>
> >>>>>>>>> Tested it on Mac OS X.
> >>>>>>>>>
> >>>>>>>>> One small issue I noticed is that the Scala 2.11 build is using
> >>>>>>>>> Hadoop 1 without Hive, which is kind of weird because people will
> >>>>>>>>> more likely want Hadoop 2 with Hive. So it would be good to
> >>>>>>>>> publish a build for that configuration instead. We can do it if
> >>>>>>>>> we do a new RC, or it might be that binary builds may not need to
> >>>>>>>>> be voted on (I forgot the details there).
> >>>>>>>>>
> >>>>>>>>> Matei
> >>>>>>>
