spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Pros and Cons
Date Fri, 27 May 2016 15:44:17 GMT
Hi Ted,

Do you mean Hive 2 with a Spark 2 snapshot build as the execution engine,
using just the snapshot binaries (all OK)?

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 27 May 2016 at 16:39, Ted Yu <yuzhihong@gmail.com> wrote:

> Teng:
> Why not try out the 2.0 SNAPSHOT build?
>
> Thanks
>
> > On May 27, 2016, at 7:44 AM, Teng Qiu <tengqiu@gmail.com> wrote:
> >
> > ah, yes, the version is another mess!... no vendor's product
> >
> > i tried hadoop 2.6.2, hive 1.2.1 with spark 1.6.1, doesn't work.
> >
> > hadoop 2.6.2, hive 2.0.1 with spark 1.6.1, works, but need to fix this
> > from hive side https://issues.apache.org/jira/browse/HIVE-13301
> >
> > the jackson-databind lib from calcite-avatica.jar is too old.
> >
> > will try hadoop 2.7, hive 2.0.1 and spark 2.0.0, when spark 2.0.0 is
> > released.
> >
> >
> > 2016-05-27 16:16 GMT+02:00 Mich Talebzadeh <mich.talebzadeh@gmail.com>:
> >> Hi Teng,
> >>
> >>
> >> what version of spark are you using as the execution engine? are you using a
> >> vendor's product here?
> >>
> >> thanks
> >>
> >> Dr Mich Talebzadeh
> >>
> >>
> >>
> >> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>
> >>
> >>
> >> http://talebzadehmich.wordpress.com
> >>
> >>
> >>
> >>
> >>> On 27 May 2016 at 13:05, Teng Qiu <tengqiu@gmail.com> wrote:
> >>>
> >>> I agree with Koert and Reynold, spark works well with large datasets
> >>> now.
> >>>
> >>> back to the original discussion: comparing SparkSQL vs Hive on Spark vs
> >>> Spark API.
> >>>
> >>> SparkSQL vs Spark API: you can simply imagine you are in the RDBMS world,
> >>> where SparkSQL is pure SQL and the Spark API is the language for writing
> >>> stored procedures.
> >>>
> >>> Hive on Spark is similar to SparkSQL: it is a pure SQL interface that
> >>> uses spark as the execution engine. SparkSQL uses Hive's syntax,
> >>> so as a language, i would say they are almost the same.
> >>>
> >>> but Hive on Spark has much better support for hive features,
> >>> especially hiveserver2 and security features. hive features in
> >>> SparkSQL are really buggy: there is a hiveserver2 impl in SparkSQL, but
> >>> in the latest release version (1.6.x), hiveserver2 in SparkSQL doesn't
> >>> work with the hivevar and hiveconf arguments anymore, and the username for
> >>> login via jdbc doesn't work either...
> >>> see https://issues.apache.org/jira/browse/SPARK-13983
> >>>
> >>> i believe hive support in spark project is really very low priority
> >>> stuff...
> >>>
> >>> sadly Hive on spark integration is not that easy, there are a lot of
> >>> dependency conflicts... such as
> >>> https://issues.apache.org/jira/browse/HIVE-13301
> >>>
> >>> our requirement is using spark with hiveserver2 in a secure way (with
> >>> authentication and authorization), currently SparkSQL alone can not
> >>> provide this, we are using ranger/sentry + Hive on Spark.
> >>>
> >>> hope this can help you to get a better idea which direction you should
> >>> go.
> >>>
> >>> Cheers,
> >>>
> >>> Teng
> >>>
> >>>
> >>> 2016-05-27 2:36 GMT+02:00 Koert Kuipers <koert@tresata.com>:
> >>>> We do disk-to-disk iterative algorithms in spark all the time, on
> >>>> datasets that do not fit in memory, and it works well for us. I usually
> >>>> have to do some tuning of number of partitions for a new dataset but
> >>>> that's about it in terms of inconveniences.
> >>>>
> >>>> On May 26, 2016 2:07 AM, "Jörn Franke" <jornfranke@gmail.com> wrote:
> >>>>
> >>>>
> >>>> Spark can handle this, true, but it is optimized for the idea that it
> >>>> works on the same full dataset in-memory due to the underlying nature
> >>>> of machine learning algorithms (iterative). Of course, you can spill
> >>>> over, but you should avoid that.
> >>>>
> >>>> That being said you should have read my final sentence about this.
> >>>> Both systems develop and change.
> >>>>
> >>>>
> >>>> On 25 May 2016, at 22:14, Reynold Xin <rxin@databricks.com> wrote:
> >>>>
> >>>>
> >>>> On Wed, May 25, 2016 at 9:52 AM, Jörn Franke <jornfranke@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Spark is more for machine learning working iteratively over the same
> >>>>> whole dataset in memory. Additionally it has streaming and graph processing
> >>>>> capabilities that can be used together.
> >>>>
> >>>>
> >>>> Hi Jörn,
> >>>>
> >>>> The first part is actually not true. Spark can handle data far greater
> >>>> than the aggregate memory available on a cluster. The more recent
> >>>> versions (1.3+) of Spark have external operations for almost all
> >>>> built-in operators, and while things may not be perfect, those external
> >>>> operators are becoming more and more robust with each version of Spark.
