spark-user mailing list archives

From ayan guha <guha.a...@gmail.com>
Subject Re: Avro file question
Date Mon, 04 Nov 2019 19:22:03 GMT
Assuming you always read the data together, one large file is fine; that is
the basic HDFS use case.
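
A rough way to see why one large splittable file is not a problem: the
number of read tasks depends on total bytes divided by the split size, not
on the number of files. The sketch below is only back-of-the-envelope
arithmetic. The function name estimate_splits and the 128 MiB split size
are assumptions for illustration; the real split size depends on your
cluster and job settings (HDFS block size, or
spark.sql.files.maxPartitionBytes in Spark), and actual split planning
takes more factors into account.

```python
MIB = 1024 * 1024


def estimate_splits(file_size_bytes: int, split_size_bytes: int = 128 * MIB) -> int:
    """Rough count of input splits for one splittable file.

    Assumes every split is about split_size_bytes; this is a simplification
    of what Hadoop/Spark actually do when planning splits.
    """
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division: a trailing partial chunk still needs a split.
    return -(-file_size_bytes // split_size_bytes)


# One 10 GiB file vs. eighty 128 MiB files holding the same data.
one_large_file = estimate_splits(10 * 1024 * MIB)
many_block_sized_files = sum(estimate_splits(128 * MIB) for _ in range(80))

# A file much smaller than the split size still costs a whole task.
tiny_file = estimate_splits(1 * MIB)
```

Under these assumptions both layouts give the same parallelism, while files
smaller than a block each still take one task, which is the reason to avoid
going below the block size.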

On Tue, 5 Nov 2019 at 4:28 am, Yaniv Harpaz <yaniv.harpaz@gmail.com> wrote:

> It depends on your usage (when and how you read).
> Are the smaller files you were thinking about also larger than the HDFS
> block size?
> I would not go for anything smaller than a block.
>
> Usually (if relevant to the way you read the data) the partitioning helps
> determine that.
>
>
> Yaniv Harpaz
> [ yaniv.harpaz at gmail.com ]
>
>
> On Mon, Nov 4, 2019 at 7:03 PM Sam <games2013.sam@gmail.com> wrote:
>
>> Hi,
>>
>> How do we choose between a single large Avro file (size much larger than
>> HDFS block size) vs. multiple smaller Avro files (close to HDFS block size)?
>>
>> Since Avro is splittable, is there even a need to split a very large Avro
>> file into smaller files?
>>
>> I’m assuming that a single large avro file can also be split into
>> multiple mappers/reducers/executors during processing.
>>
>> Thanks.
>>
> --
Best Regards,
Ayan Guha
