spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: How to insert data for 100 partitions at a time using Spark SQL
Date Sun, 22 May 2016 19:47:17 GMT
Two alternatives for this ETL or ELT:


   1. There is only one external ORC table, and you do an INSERT OVERWRITE
   into that external table through Spark SQL, or
   2. 14k files are loaded into a staging area/read directory, followed by an
   INSERT OVERWRITE into an ORC table and th
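
Either alternative can be combined with the original request to insert 100 partitions at a time. What follows is only a rough sketch, not a tested recipe: it splits the list of partition values into batches and builds one INSERT OVERWRITE statement per batch. The table names user_records_orc and user_records_staging are hypothetical, and the sketch assumes the target is partitioned only by dtPartitioner, that the staging table has that column last (dynamic partition inserts take the partition value from the last selected column), and that dynamic partitioning is enabled in nonstrict mode.

```python
from typing import Iterator, List, Sequence


def chunked(values: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `values`, each holding at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(values), size):
        yield list(values[start:start + size])


def batched_insert_statements(
    partition_values: Sequence[str],
    batch_size: int = 100,
    target_table: str = "user_records_orc",      # hypothetical table name
    source_table: str = "user_records_staging",  # hypothetical table name
    partition_column: str = "dtPartitioner",     # column name taken from the thread
) -> Iterator[str]:
    """Yield one INSERT OVERWRITE statement per batch of partition values.

    Assumes simple values such as dates; nothing here escapes quotes.
    """
    for batch in chunked(partition_values, batch_size):
        in_list = ", ".join("'{}'".format(value) for value in batch)
        yield (
            f"INSERT OVERWRITE TABLE {target_table} "
            f"PARTITION ({partition_column}) "
            f"SELECT * FROM {source_table} "
            f"WHERE {partition_column} IN ({in_list})"
        )


# Assumed usage: pass each statement to an existing SQL entry point,
# for example sqlContext.sql(statement) on a HiveContext.
# for statement in batched_insert_statements(all_partition_values):
#     sqlContext.sql(statement)
```

With 14,000 partition values and the default batch size this produces 140 statements, so each job only has to write 100 partitions at once. Whether that actually relieves the memory pressure would need to be measured.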



Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com



On 22 May 2016 at 20:38, swetha kasireddy <swethakasireddy@gmail.com> wrote:

> Around 14000 partitions need to be loaded every hour. Yes, I tested this,
> and it's taking a lot of time to load. A partition would look something like
> the following, which is further partitioned by userId, with all the
> userRecords for that date inside it.
>
> 5 2016-05-20 16:03 /user/user/userRecords/dtPartitioner=2012-09-12
>
> On Sun, May 22, 2016 at 12:30 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> By partition, do you mean 14000 files loaded in each batch session (say,
>> daily)?
>>
>> Have you actually tested this?
>>
>>
>> On 22 May 2016 at 20:24, swetha kasireddy <swethakasireddy@gmail.com>
>> wrote:
>>
>>> The data is not very big: say, 1 MB to 10 MB at most per partition. What
>>> is the best way to insert these 14k partitions with decent performance?
>>>
>>> On Sun, May 22, 2016 at 12:18 PM, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> The acid question is: how many rows are you going to insert in a batch
>>>> session? By the way, if this is purely an SQL operation, then you can do
>>>> all of that in Hive running on the Spark engine. It will be very fast as
>>>> well.
>>>>
>>>>
>>>> On 22 May 2016 at 20:14, Jörn Franke <jornfranke@gmail.com> wrote:
>>>>
>>>>> 14000 partitions seem to be way too many to be performant (except for
>>>>> large data sets). How much data does one partition contain?
>>>>>
>>>>> > On 22 May 2016, at 09:34, SRK <swethakasireddy@gmail.com> wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > In my Spark SQL query to insert data, I have around 14,000 partitions
>>>>> > of data, which seems to be causing memory issues. How can I insert the
>>>>> > data for 100 partitions at a time to avoid any memory issues?
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-for-100-partitions-at-a-time-using-Spark-SQL-tp26997.html
>>>>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>> >
>>>>> > ---------------------------------------------------------------------
>>>>> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>> > For additional commands, e-mail: user-help@spark.apache.org
>>>>> >
>>>>>
>>>>
>>>
>>
>
