spark-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Pros and Cons
Date Fri, 27 May 2016 15:39:20 GMT
Teng:
Why not try out the 2.0 SNAPSHOT build?

Thanks

> On May 27, 2016, at 7:44 AM, Teng Qiu <tengqiu@gmail.com> wrote:
> 
> ah, yes, the version is another mess!... no vendor's product
> 
> i tried hadoop 2.6.2, hive 1.2.1 with spark 1.6.1, doesn't work.
> 
> hadoop 2.6.2, hive 2.0.1 with spark 1.6.1 works, but this needs to be fixed
> on the hive side: https://issues.apache.org/jira/browse/HIVE-13301
> 
> the jackson-databind lib from calcite-avatica.jar is too old.
> 
> will try hadoop 2.7, hive 2.0.1 and spark 2.0.0 once spark 2.0.0 is released.
> 
> 
> 2016-05-27 16:16 GMT+02:00 Mich Talebzadeh <mich.talebzadeh@gmail.com>:
>> Hi Teng,
>> 
>> 
>> what version of spark are you using as the execution engine? are you using a
>> vendor's product here?
>> 
>> thanks
>> 
>> Dr Mich Talebzadeh
>> 
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> 
>> http://talebzadehmich.wordpress.com
>> 
>>> On 27 May 2016 at 13:05, Teng Qiu <tengqiu@gmail.com> wrote:
>>> 
>>> I agree with Koert and Reynold, spark works well with large datasets now.
>>> 
>>> back to the original discussion: comparing SparkSQL vs Hive on Spark vs
>>> Spark API.
>>> 
>>> SparkSQL vs Spark API: simply imagine you are in the RDBMS world.
>>> SparkSQL is pure SQL, and Spark API is the language for writing stored
>>> procedures.
>>> 
>>> Hive on Spark is similar to SparkSQL: it is a pure SQL interface that
>>> uses spark as its execution engine. SparkSQL uses Hive's syntax,
>>> so as a language, i would say they are almost the same.
>>> 
>>> but Hive on Spark has much better support for hive features,
>>> especially hiveserver2 and security features. hive features in
>>> SparkSQL are really buggy: there is a hiveserver2 impl in SparkSQL, but
>>> in the latest release version (1.6.x), hiveserver2 in SparkSQL doesn't
>>> work with the hivevar and hiveconf arguments anymore, and the username for
>>> login via jdbc doesn't work either...
>>> see https://issues.apache.org/jira/browse/SPARK-13983
>>> 
>>> i believe hive support in spark project is really very low priority
>>> stuff...
>>> 
>>> sadly Hive on spark integration is not that easy, there are a lot of
>>> dependency conflicts... such as
>>> https://issues.apache.org/jira/browse/HIVE-13301
>>> 
>>> our requirement is using spark with hiveserver2 in a secure way (with
>>> authentication and authorization), currently SparkSQL alone can not
>>> provide this, we are using ranger/sentry + Hive on Spark.
>>> 
>>> hope this can help you to get a better idea which direction you should go.
>>> 
>>> Cheers,
>>> 
>>> Teng
>>> 
>>> 
>>> 2016-05-27 2:36 GMT+02:00 Koert Kuipers <koert@tresata.com>:
>>>> We do disk-to-disk iterative algorithms in spark all the time, on datasets
>>>> that do not fit in memory, and it works well for us. I usually have to do
>>>> some tuning of number of partitions for a new dataset but that's about it
>>>> in terms of inconveniences.
>>>> 
>>>> On May 26, 2016 2:07 AM, "Jörn Franke" <jornfranke@gmail.com> wrote:
>>>> 
>>>> 
>>>> Spark can handle this, true, but it is optimized for the idea that it
>>>> works on the same full dataset in-memory, due to the underlying nature of
>>>> machine learning algorithms (iterative). Of course, you can spill over,
>>>> but you should avoid that.
>>>> 
>>>> That being said you should have read my final sentence about this. Both
>>>> systems develop and change.
>>>> 
>>>> 
>>>> On 25 May 2016, at 22:14, Reynold Xin <rxin@databricks.com> wrote:
>>>> 
>>>> 
>>>> On Wed, May 25, 2016 at 9:52 AM, Jörn Franke <jornfranke@gmail.com> wrote:
>>>>> 
>>>>> Spark is more for machine learning working iteratively over the same
>>>>> whole dataset in memory. Additionally it has streaming and graph
>>>>> processing capabilities that can be used together.
>>>> 
>>>> 
>>>> Hi Jörn,
>>>> 
>>>> The first part is actually not true. Spark can handle data far greater than
>>>> the aggregate memory available on a cluster. The more recent versions (1.3+)
>>>> of Spark have external operations for almost all built-in operators, and
>>>> while things may not be perfect, those external operators are becoming more
>>>> and more robust with each version of Spark.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
> 
