spark-user mailing list archives

From swetha kasireddy <swethakasire...@gmail.com>
Subject Re: How to insert data for 100 partitions at a time using Spark SQL
Date Sun, 22 May 2016 18:59:02 GMT
So, if I insert 1000 records at a time and the next 1000 records include some
records that belong to the same partition as earlier records, then that data
will be overwritten. How can I prevent overwriting valid data in this case?
Could you post the example that you are talking about?

What I am doing is that, in the final insert into the ORC table, I
insert/overwrite the data. So I need a way to insert all the data
related to one partition at a time, so that it is not overwritten when I
insert the next set of records.
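
One way to get "all the data for one partition in a single insert" is to batch by partition key rather than by record count. The following is a minimal sketch, not a tested solution: `PartitionBatches` and its methods are hypothetical names, the table and column names are copied from the query quoted further down, and it assumes that a dynamic-partition INSERT OVERWRITE replaces only the partitions that the statement actually writes to (Hive's behavior, which should be verified for the Spark version in use).

```scala
// Hypothetical helper; none of these names come from the thread.
object PartitionBatches {
  // One (datePartition, idPartition) pair identifies one target partition.
  type PartitionKey = (String, String)

  // De-duplicate the keys, then split them into fixed-size batches.
  // Because the keys are distinct, each partition lands in exactly one batch.
  def batches(keys: Seq[PartitionKey], size: Int): List[Seq[PartitionKey]] =
    keys.distinct.grouped(size).toList

  // Build the INSERT OVERWRITE for one batch, restricting the SELECT to the
  // partitions in that batch so that no other partition is written.
  def insertSql(batch: Seq[PartitionKey]): String = {
    val predicate = batch
      .map { case (d, i) => s"(ps.datePartition = '$d' AND ps.idPartition = '$i')" }
      .mkString(" OR ")
    s"""FROM recordsTemp ps
       |INSERT OVERWRITE TABLE users PARTITION (datePartition, idPartition)
       |SELECT ps.id, ps.record, ps.datePartition, ps.idPartition
       |WHERE $predicate""".stripMargin
  }
}
```

The distinct keys could be collected with something like `sqlContext.sql("SELECT DISTINCT datePartition, idPartition FROM recordsTemp").collect()`, and each batch run with `PartitionBatches.batches(keys, 100).foreach(b => sqlContext.sql(PartitionBatches.insertSql(b)))`. Since every row of a given partition is selected in the same statement, a later batch never touches a partition that an earlier batch wrote.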

On Sun, May 22, 2016 at 11:51 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:

> OK, is the staging table used for staging only?
>
> You can create a staging *directory* where you put your data (you
> can put 100s of files there) and do an insert/select that will take the data
> from those 100 files into your main ORC table.
>
> I have an example of 100s of CSV files loaded via insert/select from a
> staging external table into an ORC table.
>
> My point is that you are more likely interested in doing analysis on the ORC
> table (read: internal) rather than on the staging table.
>
> HTH
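
The example mentioned above was not posted in this message, so the following is only a rough sketch of what a staging-directory flow could look like, not the one being referred to. All names are assumptions: `records_staging`, `records_orc`, and the path `/data/staging/records` are hypothetical, and the CSV layout (comma-delimited, four string columns) is guessed from the table definition quoted below. It uses INSERT INTO, which appends in Hive, rather than INSERT OVERWRITE.

```scala
// Hypothetical names throughout: records_staging, records_orc and the
// staging path are placeholders, not taken from the thread.
object StagingToOrc {
  val stagingDir = "/data/staging/records"

  // External table over a directory of CSV files; dropping files into
  // stagingDir makes them visible to queries on this table.
  val createStaging: String =
    s"""CREATE EXTERNAL TABLE IF NOT EXISTS records_staging (
       |  id STRING, record STRING, datePartition STRING, idPartition STRING)
       |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
       |STORED AS TEXTFILE
       |LOCATION '$stagingDir'""".stripMargin

  // Internal (managed) ORC table, partitioned like the table in the thread.
  val createOrc: String =
    """CREATE TABLE IF NOT EXISTS records_orc (id STRING, record STRING)
      |PARTITIONED BY (datePartition STRING, idPartition STRING)
      |STORED AS ORC
      |TBLPROPERTIES ('orc.compress'='SNAPPY')""".stripMargin

  // Hive settings that allow all partition columns to be dynamic.
  val dynamicPartitionSettings: Seq[String] = Seq(
    "SET hive.exec.dynamic.partition=true",
    "SET hive.exec.dynamic.partition.mode=nonstrict")

  // Append the staged rows to the ORC table; the partition columns
  // come last in the SELECT list.
  val insertSelect: String =
    """INSERT INTO TABLE records_orc PARTITION (datePartition, idPartition)
      |SELECT id, record, datePartition, idPartition
      |FROM records_staging""".stripMargin

  // Statements in the order they would be executed.
  val statements: Seq[String] =
    dynamicPartitionSettings ++ Seq(createStaging, createOrc, insertSelect)
}
```

With a Hive-enabled context, the statements could be run as `StagingToOrc.statements.foreach(s => sqlContext.sql(s))`. Whether appending is acceptable depends on whether the same records can be staged twice, which the thread does not say.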
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 22 May 2016 at 19:43, swetha kasireddy <swethakasireddy@gmail.com>
> wrote:
>
>> But, how do I take 100 partitions at a time from staging table?
>>
>> On Sun, May 22, 2016 at 11:26 AM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> OK, so you still keep the data as ORC in Hive for further analysis.
>>>
>>> What I have in mind is to use an external table as the staging table and
>>> insert from it into an ORC internal table that is bucketed and partitioned.
>>>
>>> On 22 May 2016 at 19:11, swetha kasireddy <swethakasireddy@gmail.com>
>>> wrote:
>>>
>>>> I am looking at ORC. I insert the data using the following query.
>>>>
>>>> // Compression is set as a table property; a bare "orc.compress= SNAPPY"
>>>> // string is not a valid SQL statement on its own.
>>>> sqlContext.sql(
>>>>   """CREATE EXTERNAL TABLE IF NOT EXISTS records (id STRING, record STRING)
>>>>     |PARTITIONED BY (datePartition STRING, idPartition STRING)
>>>>     |STORED AS ORC LOCATION '/user/users'
>>>>     |TBLPROPERTIES ('orc.compress'='SNAPPY')""".stripMargin)
>>>> // Note: the table created above is `records`, but the insert targets `users`.
>>>> sqlContext.sql(
>>>>   """FROM recordsTemp ps
>>>>     |INSERT OVERWRITE TABLE users PARTITION (datePartition, idPartition)
>>>>     |SELECT ps.id, ps.record, ps.datePartition, ps.idPartition""".stripMargin)
>>>>
>>>> On Sun, May 22, 2016 at 12:37 AM, Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>> Where is your base table, and what format is it in (Parquet, ORC, etc.)?
>>>>>
>>>>> On 22 May 2016 at 08:34, SRK <swethakasireddy@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> In my Spark SQL query to insert data, I have around 14,000 partitions of
>>>>>> data, which seems to be causing memory issues. How can I insert the data
>>>>>> for 100 partitions at a time to avoid any memory issues?
>>>>>>
