spark-user mailing list archives

From Davies Liu <dav...@databricks.com>
Subject Re: Multiple DataFrames per Parquet file?
Date Sun, 17 May 2015 08:18:58 GMT
You can union all the DataFrames together, then call repartition() on the
result to control how many files are written.
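
A minimal sketch of that suggestion, assuming Spark 2.0 or later, where the
DataFrame method is `union()` and output goes through `df.write` (on the 1.3
release current at the time of this thread, the equivalents were `unionAll()`
and `saveAsParquetFile()`). The helper names `union_all` and
`combine_and_save`, the `num_files` parameter, and the overwrite mode are
illustrative choices, not something from the thread:

```python
from functools import reduce


def union_all(dfs):
    """Fold a non-empty sequence of same-schema DataFrames into one.

    Only relies on each object having a .union(other) method, which
    pyspark.sql.DataFrame provides in Spark 2.0+.
    """
    dfs = list(dfs)
    if not dfs:
        raise ValueError("need at least one DataFrame to union")
    return reduce(lambda left, right: left.union(right), dfs)


def combine_and_save(dfs, path, num_files=1):
    """Union the DataFrames, then repartition so that roughly
    num_files Parquet part files are written under path.

    Assumption: overwriting any existing output at path is acceptable.
    """
    combined = union_all(dfs)
    combined.repartition(num_files).write.mode("overwrite").parquet(path)
    return combined
```

With a live `SparkSession` and a list of DataFrames `dfs` (both hypothetical
here), this would be called as `combine_and_save(dfs, "combined.parquet")`.
If the individual DataFrames need to be recovered later, one option is to add
a column identifying the source file to each one before the union and filter
on that column after reading the data back.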

On Sun, May 10, 2015 at 8:34 AM, Peter Aberline
<peter.aberline@gmail.com> wrote:
> Hi
>
> Thanks for the quick response.
>
> No, I'm not using Streaming. Each DataFrame represents tabular data read from
> a CSV file. They all have the same schema.
>
> There is also the option of appending each DF to the Parquet file, but then
> I can't maintain them as separate DFs when reading back in without filtering.
>
> I'll rethink maintaining each CSV file as a single DF.
>
> Thanks,
> Peter
>
>
> On 10 May 2015 at 15:51, ayan guha <guha.ayan@gmail.com> wrote:
>>
>> How did you end up with thousands of DataFrames? Are you using streaming? In
>> that case you can use foreachRDD and keep merging incoming RDDs into a single
>> RDD, then save it through your own checkpoint mechanism.
>>
>> If not, please share your use case.
>>
>> On 11 May 2015 00:38, "Peter Aberline" <peter.aberline@gmail.com> wrote:
>>>
>>> Hi
>>>
>>> I have many thousands of small DataFrames that I would like to save to
>>> a single Parquet file to avoid the HDFS 'small files' problem. My
>>> understanding is that there is a 1:1 relationship between DataFrames and
>>> Parquet files if a single partition is used.
>>>
>>> Is it possible to have multiple DataFrames within a single Parquet file
>>> using PySpark?
>>> Or is the only way to achieve this to union the DataFrames into one?
>>>
>>> Thanks,
>>> Peter
>>>
>>>
>