spark-user mailing list archives

From swetha kasireddy <swethakasire...@gmail.com>
Subject Re: How to insert data for 100 partitions at a time using Spark SQL
Date Sun, 22 May 2016 19:49:30 GMT
I am currently doing option 1 using the following, and it takes a lot of time.
What's the advantage of doing option 2, and how do I do it?

// Create the partitioned external ORC table if it does not exist yet.
sqlContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS records (id STRING, record STRING)
    |PARTITIONED BY (datePartition STRING, idPartition STRING)
    |STORED AS ORC LOCATION '/user/users'""".stripMargin)
// Request Snappy compression for ORC output.
sqlContext.sql("SET orc.compress=SNAPPY")
// Dynamic-partition insert from the temporary table.
sqlContext.sql(
  """FROM recordsTemp ps
    |INSERT OVERWRITE TABLE users PARTITION (datePartition, idPartition)
    |SELECT ps.id, ps.record, ps.datePartition, ps.idPartition""".stripMargin)
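
One possible way to approach the "100 partitions at a time" question from the subject line is to split the insert into several smaller statements, each restricted to a batch of partition values. The following is only a minimal sketch under stated assumptions: `BatchedInsert`, `quote`, and `statements` are hypothetical names, the table and column names are copied from the snippet above, and it assumes that a dynamic-partition INSERT OVERWRITE only replaces the partitions present in the selected rows. It only builds the SQL strings; it has not been measured against this workload.

```scala
// Hypothetical helper (not part of Spark or Hive): builds one INSERT
// statement per batch of idPartition values.
object BatchedInsert {
  // Quote a value as a SQL string literal, doubling embedded single quotes.
  def quote(value: String): String = "'" + value.replace("'", "''") + "'"

  // Split the distinct idPartition values into groups of `batchSize` and
  // emit one dynamic-partition INSERT per group, filtered with IN (...).
  // Table and column names are taken from the message above.
  def statements(idPartitions: Seq[String], batchSize: Int = 100): Seq[String] = {
    require(batchSize > 0, "batchSize must be positive")
    idPartitions.distinct.grouped(batchSize).map { batch =>
      val inList = batch.map(quote).mkString(", ")
      s"""FROM recordsTemp ps
         |INSERT OVERWRITE TABLE users PARTITION (datePartition, idPartition)
         |SELECT ps.id, ps.record, ps.datePartition, ps.idPartition
         |WHERE ps.idPartition IN ($inList)""".stripMargin
    }.toList
  }
}
```

Assuming the distinct idPartition values have been collected into a `Seq[String]` named `ids` (for example from a SELECT DISTINCT on recordsTemp), each generated statement could then be passed to `sqlContext.sql` in turn, e.g. `BatchedInsert.statements(ids).foreach(s => sqlContext.sql(s))`.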

On Sun, May 22, 2016 at 12:47 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:

> Two alternatives for this ETL or ELT:
>
>    1. There is only one external ORC table and you do insert overwrite
>    into that external table through Spark SQL
>    or
>    2. 14k files loaded into staging area/read directory and then insert
>    overwrite into an ORC table and th
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 22 May 2016 at 20:38, swetha kasireddy <swethakasireddy@gmail.com>
> wrote:
>
>> Around 14000 partitions need to be loaded every hour. Yes, I tested this
>> and it's taking a lot of time to load. A partition would look something like
>> the following, which is further partitioned by userId with all the
>> userRecords for that date inside it.
>>
>> 5 2016-05-20 16:03 /user/user/userRecords/dtPartitioner=2012-09-12
>>
>> On Sun, May 22, 2016 at 12:30 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> By partition, do you mean 14000 files loaded in each batch session (say
>>> daily)?
>>>
>>> Have you actually tested this?
>>>
>>> On 22 May 2016 at 20:24, swetha kasireddy <swethakasireddy@gmail.com>
>>> wrote:
>>>
>>>> The data is not very big. Say 1 MB to 10 MB at most per partition. What
>>>> is the best way to insert these 14k partitions with decent performance?
>>>>
>>>> On Sun, May 22, 2016 at 12:18 PM, Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>> The acid question is: how many rows are you going to insert in a batch
>>>>> session? By the way, if this is purely an SQL operation then you can do all
>>>>> that in Hive running on the Spark engine. It will be very fast as well.
>>>>>
>>>>> On 22 May 2016 at 20:14, Jörn Franke <jornfranke@gmail.com> wrote:
>>>>>
>>>>>> 14000 partitions seem to be way too many to be performant (except for
>>>>>> large data sets). How much data does one partition contain?
>>>>>>
>>>>>> > On 22 May 2016, at 09:34, SRK <swethakasireddy@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi,
>>>>>> >
>>>>>> > In my Spark SQL query to insert data, I have around 14,000 partitions of
>>>>>> > data which seems to be causing memory issues. How can I insert the data
>>>>>> > for 100 partitions at a time to avoid any memory issues?
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-for-100-partitions-at-a-time-using-Spark-SQL-tp26997.html
>>>>>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>>> >
>>>>>> >
>>>>>> ---------------------------------------------------------------------
>>>>>> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>> > For additional commands, e-mail: user-help@spark.apache.org
>>>>>> >
>>>>>>
