spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: Reading TB of JSON file
Date Thu, 18 Jun 2020 16:34:13 GMT
Hi,
So does each JSON record span multiple lines?
And is all 50 GB in a single file?
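
If you are not sure, a quick check along these lines might help. It is only a sketch: the function name looks_like_json_lines and its parameters are made up for illustration, and it assumes you have a local UTF-8 copy (or a sample) of the file and that each record is expected to be a JSON object.

```python
import json


def looks_like_json_lines(path, sample_lines=5, max_line_chars=1_000_000):
    """Return True if the first non-empty lines each parse as one JSON object.

    Hypothetical helper for illustration only; it inspects a small sample
    and never loads the whole file into memory.
    """
    checked = 0
    with open(path, "r", encoding="utf-8") as fh:
        while checked < sample_lines:
            # readline(limit) caps memory use if the file is one giant line.
            line = fh.readline(max_line_chars)
            if not line:
                break  # end of file
            if not line.endswith("\n") and len(line) >= max_line_chars:
                # Conservative: a line at or over the cap is treated as
                # "not line-delimited" rather than read any further.
                return False
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                return False  # e.g. a pretty-printed array starting with "["
            if not isinstance(record, dict):
                return False  # a whole array on one line is not one record per line
            checked += 1
    return checked > 0
```

If this returns True, spark.read.json(path) should be able to split the input by line. If not, the file probably needs the multiLine option, and as far as I know Spark then has to parse each file as a whole rather than in line-sized splits.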

Regards,
Gourav

On Thu, 18 Jun 2020, 14:34 Chetan Khatri, <chetan.opensource@gmail.com>
wrote:

> It is generated dynamically and written to an S3 bucket; it is not
> historical data, so I guess it is not in JSON Lines format.
>
> On Thu, Jun 18, 2020 at 9:16 AM Jörn Franke <jornfranke@gmail.com> wrote:
>
>> Depends on the data types you use.
>>
>> Is it in JSON Lines format? If so, the amount of memory matters much
>> less.
>>
>> Otherwise, if it is one large object or array, I would not recommend it.
>>
>> > On 18.06.2020 at 15:12, Chetan Khatri <chetan.opensource@gmail.com> wrote:
>> >
>> >
>> > Hi Spark Users,
>> >
>> > I have a 50 GB JSON file that I would like to read and persist to HDFS so
>> > that it can be used in the next transformation. I am trying to read it with
>> > spark.read.json(path), but this gives an out-of-memory error on the driver.
>> > Obviously, I can't afford 50 GB of driver memory. In general, what is the
>> > best practice for reading a large JSON file, such as one of 50 GB?
>> >
>> > Thanks
>>
>
