spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: How to insert data for 100 partitions at a time using Spark SQL
Date Sun, 22 May 2016 18:51:16 GMT
OK, is the staging table used only for staging?

You can create a staging *directory* and put your data there (it can hold
hundreds of files), then do an insert/select that loads the data from those
files into your main ORC table.

I have an example of an insert/select that loads hundreds of CSV files from a
staging external table into an ORC table.
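As a rough sketch of that staging pattern (not the example referred to above), the two statements might be built as below. The table names, the staging path, the comma delimiter, and the column layout (borrowed from the query quoted further down) are all assumptions for illustration.

```scala
object StagingToOrc {
  // All names here are hypothetical: adjust tables, columns and path to your data.

  // External text table that simply points at the directory holding the CSV files.
  def createStagingTable(table: String, location: String): String =
    s"""CREATE EXTERNAL TABLE IF NOT EXISTS $table (
       |  id STRING, record STRING, datePartition STRING, idPartition STRING)
       |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
       |STORED AS TEXTFILE
       |LOCATION '$location'""".stripMargin

  // Dynamic-partition insert from staging into the (already created) ORC table.
  // Hive normally needs hive.exec.dynamic.partition.mode=nonstrict for this form.
  def insertSelect(staging: String, target: String): String =
    s"""INSERT INTO TABLE $target PARTITION (datePartition, idPartition)
       |SELECT id, record, datePartition, idPartition FROM $staging""".stripMargin

  def main(args: Array[String]): Unit = {
    println(createStagingTable("records_staging", "/tmp/staging/records"))
    println(insertSelect("records_staging", "records_orc"))
  }
}
```

Each generated string would then be passed to sqlContext.sql(...) on a Hive-enabled context.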

My point is that you are more likely to be interested in doing analysis on the
ORC table (read: internal) than on the staging table.
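On the question of taking 100 partitions at a time, one possible approach (an assumption, not something confirmed in this thread) is to collect the distinct partition keys from the staging table and issue one insert/select per batch of keys. The helper below is a minimal sketch; the object, function, table, and column names are hypothetical.

```scala
object PartitionBatches {
  // One (datePartition, idPartition) pair identifies one target partition.
  type PartitionKey = (String, String)

  // Wrap a value in single quotes; values containing quotes are rejected
  // rather than escaped, to keep the sketch simple.
  private def quote(v: String): String = {
    require(!v.contains("'"), s"unsupported partition value: $v")
    "'" + v + "'"
  }

  // Build one INSERT ... SELECT per batch of at most batchSize partitions.
  def batchInserts(keys: Seq[PartitionKey], staging: String, target: String,
                   batchSize: Int = 100): Seq[String] =
    keys.distinct.grouped(batchSize).map { batch =>
      val predicate = batch.map { case (d, i) =>
        s"(datePartition = ${quote(d)} AND idPartition = ${quote(i)})"
      }.mkString(" OR ")
      s"""INSERT INTO TABLE $target PARTITION (datePartition, idPartition)
         |SELECT id, record, datePartition, idPartition FROM $staging
         |WHERE $predicate""".stripMargin
    }.toList

  def main(args: Array[String]): Unit = {
    // 250 made-up partition keys, split into batches of 100, 100 and 50.
    val keys = for (d <- 1 to 5; i <- 1 to 50) yield (s"d$d", s"i$i")
    val stmts = batchInserts(keys, "records_staging", "records_orc")
    println(s"${stmts.size} statements")
  }
}
```

The keys could come from something like SELECT DISTINCT datePartition, idPartition on the staging table, and each statement would be run in turn with sqlContext.sql(stmt), so only one batch of partitions is being written at any time.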

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 22 May 2016 at 19:43, swetha kasireddy <swethakasireddy@gmail.com> wrote:

> But, how do I take 100 partitions at a time from staging table?
>
> On Sun, May 22, 2016 at 11:26 AM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> OK, so you still keep the data as ORC in Hive for further analysis.
>>
>> What I have in mind is to have an external table as the staging table and
>> do an insert into an ORC internal table that is bucketed and partitioned.
>>
>> On 22 May 2016 at 19:11, swetha kasireddy <swethakasireddy@gmail.com>
>> wrote:
>>
>>> I am looking at ORC. I insert the data using the following query.
>>>
>>> sqlContext.sql("  CREATE EXTERNAL TABLE IF NOT EXISTS records (id STRING,
>>> record STRING) PARTITIONED BY (datePartition STRING, idPartition STRING)
>>> stored as ORC LOCATION '/user/users' ")
>>>       sqlContext.sql("SET orc.compress=SNAPPY")
>>>       sqlContext.sql(
>>>         """ from recordsTemp ps   insert overwrite table users
>>> partition(datePartition , idPartition )  select ps.id, ps.record ,
>>> ps.datePartition, ps.idPartition  """.stripMargin)
>>>
>>> On Sun, May 22, 2016 at 12:37 AM, Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> Where is your base table, and what format is it (Parquet, ORC, etc.)?
>>>>
>>>> On 22 May 2016 at 08:34, SRK <swethakasireddy@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> In my Spark SQL query to insert data, I have around 14,000 partitions of
>>>>> data, which seems to be causing memory issues. How can I insert the data
>>>>> for 100 partitions at a time to avoid any memory issues?
