spark-user mailing list archives

From Teng Qiu <teng...@gmail.com>
Subject Re: Pros and Cons
Date Fri, 27 May 2016 15:58:05 GMT
I tried the spark 2.0.0 preview, but there is no assembly jar in it... so I just gave up... :p

2016-05-27 17:39 GMT+02:00 Ted Yu <yuzhihong@gmail.com>:
> Teng:
> Why not try out the 2.0 SNAPSHOT build?
>
> Thanks
>
>> On May 27, 2016, at 7:44 AM, Teng Qiu <tengqiu@gmail.com> wrote:
>>
>> ah, yes, the versions are another mess!... and no, not a vendor's product
>>
>> i tried hadoop 2.6.2, hive 1.2.1 with spark 1.6.1, doesn't work.
>>
>> hadoop 2.6.2, hive 2.0.1 with spark 1.6.1 works, but this needs to be
>> fixed on the hive side: https://issues.apache.org/jira/browse/HIVE-13301
>>
>> the jackson-databind lib from calcite-avatica.jar is too old.
>>
>> will try hadoop 2.7, hive 2.0.1 and spark 2.0.0, once spark 2.0.0 is released.
>>
>>
>> 2016-05-27 16:16 GMT+02:00 Mich Talebzadeh <mich.talebzadeh@gmail.com>:
>>> Hi Teng,
>>>
>>>
>>> what version of spark are you using as the execution engine? are you using a
>>> vendor's product here?
>>>
>>> thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>>
>>>> On 27 May 2016 at 13:05, Teng Qiu <tengqiu@gmail.com> wrote:
>>>>
>>>> I agree with Koert and Reynold, spark works well with large datasets now.
>>>>
>>>> back to the original discussion: comparing SparkSQL vs Hive on Spark vs
>>>> Spark API.
>>>>
>>>> SparkSQL vs Spark API: simply imagine you are in the RDBMS world, where
>>>> SparkSQL is pure SQL and the Spark API is the language for writing stored
>>>> procedures.
>>>>
>>>> Hive on Spark is similar to SparkSQL: it is a pure SQL interface that
>>>> uses spark as its execution engine. SparkSQL uses Hive's syntax,
>>>> so as a language, i would say they are almost the same.
>>>>
>>>> but Hive on Spark has much better support for hive features,
>>>> especially hiveserver2 and security features. hive features in
>>>> SparkSQL are really buggy: there is a hiveserver2 impl in SparkSQL, but
>>>> in the latest release version (1.6.x), hiveserver2 in SparkSQL doesn't
>>>> work with the hivevar and hiveconf arguments anymore, and the username
>>>> for login via jdbc doesn't work either...
>>>> see https://issues.apache.org/jira/browse/SPARK-13983
>>>>
>>>> i believe hive support in the spark project is really low priority
>>>> stuff...
>>>>
>>>> sadly Hive on spark integration is not that easy, there are a lot of
>>>> dependency conflicts... such as
>>>> https://issues.apache.org/jira/browse/HIVE-13301
>>>>
>>>> our requirement is to use spark with hiveserver2 in a secure way (with
>>>> authentication and authorization). currently SparkSQL alone cannot
>>>> provide this, so we are using ranger/sentry + Hive on Spark.
>>>>
>>>> hope this helps you get a better idea of which direction you should go.
>>>>
>>>> Cheers,
>>>>
>>>> Teng
>>>>
>>>>
>>>> 2016-05-27 2:36 GMT+02:00 Koert Kuipers <koert@tresata.com>:
>>>>> We do disk-to-disk iterative algorithms in spark all the time, on
>>>>> datasets that do not fit in memory, and it works well for us. I usually
>>>>> have to do some tuning of number of partitions for a new dataset but
>>>>> that's about it in terms of inconveniences.
>>>>>
>>>>> On May 26, 2016 2:07 AM, "Jörn Franke" <jornfranke@gmail.com> wrote:
>>>>>
>>>>>
>>>>> Spark can handle this, true, but it is optimized for the idea that it
>>>>> works on the same full dataset in memory, due to the underlying nature of
>>>>> machine learning algorithms (iterative). Of course, you can spill over,
>>>>> but you should avoid that.
>>>>>
>>>>> That being said you should have read my final sentence about this. Both
>>>>> systems develop and change.
>>>>>
>>>>>
>>>>> On 25 May 2016, at 22:14, Reynold Xin <rxin@databricks.com> wrote:
>>>>>
>>>>>
>>>>> On Wed, May 25, 2016 at 9:52 AM, Jörn Franke <jornfranke@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Spark is more for machine learning, working iteratively over the same
>>>>>> whole dataset in memory. Additionally it has streaming and graph
>>>>>> processing capabilities that can be used together.
>>>>>
>>>>>
>>>>> Hi Jörn,
>>>>>
>>>>> The first part is actually not true. Spark can handle data far greater
>>>>> than the aggregate memory available on a cluster. The more recent
>>>>> versions (1.3+) of Spark have external operations for almost all built-in
>>>>> operators, and while things may not be perfect, those external operators
>>>>> are becoming more and more robust with each version of Spark.
>>
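For anyone unsure what the hiveserver2 points in the thread refer to (the hivevar / hiveconf arguments and the username for login via jdbc), here is a minimal sketch of how such a client call is usually put together. It only builds a beeline command line and does not run anything. `-u`, `-n`, `--hivevar` and `--hiveconf` are standard beeline options; the helper name `beeline_command`, the JDBC URL, the username and the variable values are made up for illustration and are not taken from the thread.

```python
import shlex


def beeline_command(jdbc_url, username, hivevars=None, hiveconfs=None):
    """Build an argument list for the beeline CLI (hypothetical helper).

    jdbc_url  -- HiveServer2 JDBC URL, passed with -u
    username  -- login name, passed with -n
    hivevars  -- dict of substitution variables, one --hivevar name=value each
    hiveconfs -- dict of config properties, one --hiveconf property=value each
    """
    args = ["beeline", "-u", jdbc_url, "-n", username]
    # Sort the keys so the generated command line is deterministic.
    for name, value in sorted((hivevars or {}).items()):
        args += ["--hivevar", f"{name}={value}"]
    for prop, value in sorted((hiveconfs or {}).items()):
        args += ["--hiveconf", f"{prop}={value}"]
    return args


# Placeholder values only: host, port, user and settings are assumptions.
cmd = beeline_command(
    "jdbc:hive2://localhost:10000/default",
    "etl_user",
    hivevars={"run_date": "2016-05-27"},
    hiveconfs={"hive.exec.parallel": "true"},
)

# shlex.join (Python 3.8+) quotes each argument for safe copy/paste into a shell.
print(shlex.join(cmd))
```

Inside a query the variable would then be referenced as `${hivevar:run_date}`. Per the thread, this kind of call is what stopped working against the SparkSQL hiveserver2 in 1.6.x (see SPARK-13983).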
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
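On the "tuning of number of partitions" mentioned in the thread: a rough sketch of one common starting point, which is to size partitions by a target number of bytes. This is an assumption for illustration, not the method described by anyone in the thread; the function name `estimate_partitions` and the 128 MiB default are hypothetical choices, and the right value depends on the job and the cluster.

```python
def estimate_partitions(dataset_bytes, target_partition_bytes=128 * 1024 * 1024,
                        min_partitions=1):
    """Return a starting partition count for a dataset of the given size.

    Assumption: aim for roughly target_partition_bytes per partition.
    This is only a first guess to tune from, not a rule from Spark itself.
    """
    if dataset_bytes < 0:
        raise ValueError("dataset_bytes must be non-negative")
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    # Integer ceiling division avoids floating point rounding on large sizes.
    needed = (dataset_bytes + target_partition_bytes - 1) // target_partition_bytes
    return max(min_partitions, needed)


# Example: a 500 GiB dataset with the default 128 MiB target.
num_partitions = estimate_partitions(500 * 1024 ** 3)
print(num_partitions)
```

The resulting number would typically be passed to something like `rdd.repartition(num_partitions)` or used for `spark.sql.shuffle.partitions`, and then adjusted after watching how the job actually behaves.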


