spark-user mailing list archives

From Anastasios Zouzias <zouz...@gmail.com>
Subject Re: Dataframe from 1.5G json (non JSONL)
Date Tue, 05 Jun 2018 18:55:12 GMT
Are you sure that your JSON file has the right format?

spark.read.json(...) expects a file where *each line is a json object*.
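For reference (illustrative sample only, reusing the field names from the question below), the line-delimited form that Spark's default reader expects looks like:

```
{"id":"1","type":"11"}
{"id":"2","type":"11"}
```

whereas the file in question is one whole JSON document, `[{"id":"1",...},{"id":"2",...},...]`, which the default line-oriented reader cannot split into records.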

My wild guess is that

val hdf=spark.read.json("/user/tmp/hugedatafile")
hdf.show(2) or hdf.take(1) gives OOM

tries to fetch all the data into the driver. Can you reformat your input
file and try again?
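One way to do the suggested reformatting without ever holding the 1.5 GB array in memory is to stream-decode the top-level array one element at a time and emit each element as its own line. The sketch below is not from this thread: it is a hedged illustration in Python (standard library only, rather than the thread's Scala/Spark), and the function name `array_to_jsonl` and the chunk size are my own inventions.

```python
import json

def array_to_jsonl(src_path, dst_path, chunk_size=1 << 20):
    """Stream a file holding one big top-level JSON array ([{...},{...},...])
    into JSON Lines output, decoding one element at a time so the whole
    array never has to sit in memory at once."""
    decoder = json.JSONDecoder()
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        buf = src.read(chunk_size)
        pos = buf.index("[") + 1  # skip the opening bracket
        while True:
            # Skip whitespace/commas between elements, refilling as needed.
            while pos < len(buf) and buf[pos] in " \t\r\n,":
                pos += 1
            if pos >= len(buf):
                more = src.read(chunk_size)
                if not more:
                    break  # input exhausted
                buf, pos = more, 0
                continue
            if buf[pos] == "]":
                break  # closing bracket of the top-level array
            try:
                obj, pos = decoder.raw_decode(buf, pos)
            except json.JSONDecodeError:
                more = src.read(chunk_size)
                if not more:
                    raise  # element is truly truncated/malformed
                # Element straddles the chunk boundary: extend and retry.
                buf, pos = buf[pos:] + more, 0
                continue
            dst.write(json.dumps(obj) + "\n")
```

The resulting file can then be read with a plain `spark.read.json(...)`, which treats each line as one record and parallelizes across executors.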

Best,
Anastasios



On Tue, Jun 5, 2018 at 8:39 PM, raksja <shanmugkraja@gmail.com> wrote:

> I have a JSON file that is one continuous array of objects of similar type,
> [{},{}...], about 1.5 GB uncompressed and 33 MB gzip compressed.
>
> I uploaded it to HDFS as hugedatafile. It is not a JSONL file; it is one
> whole regular JSON document.
>
>
> [{"id":"1","entityMetadata":{"lastChange":"2018-05-11 01:09:18.0",
> "createdDateTime":"2018-05-11 01:09:18.0",
> "modifiedDateTime":"2018-05-11 01:09:18.0"},"type":"11"},
> {"id":"2","entityMetadata":{"lastChange":"2018-05-11 01:09:18.0",
> "createdDateTime":"2018-05-11 01:09:18.0",
> "modifiedDateTime":"2018-05-11 01:09:18.0"},"type":"11"},
> {"id":"3","entityMetadata":{"lastChange":"2018-05-11 01:09:18.0",
> "createdDateTime":"2018-05-11 01:09:18.0",
> "modifiedDateTime":"2018-05-11 01:09:18.0"},"type":"11"}..................]
>
>
> I get OOM on the executors whenever I try to load this into Spark.
>
> Try 1
> val hdf=spark.read.json("/user/tmp/hugedatafile")
> hdf.show(2) or hdf.take(1) gives OOM
>
> Try 2
> Took a small sampledatafile and extracted its schema to avoid schema
> inference on the huge file
> val sampleSchema=spark.read.json("/user/tmp/sampledatafile").schema
> val hdf=spark.read.schema(sampleSchema).json("/user/tmp/hugedatafile")
> hdf.show(2) or hdf.take(1) hangs for 1.5 hrs, then gives OOM
>
> Try 3
> Repartitioned it before performing the action
> gives OOM
>
> Try 4
> Read https://issues.apache.org/jira/browse/SPARK-20980 completely
> val hdf = spark.read.option("multiLine",
> true).schema(sampleSchema).json("/user/tmp/hugedatafile")
> hdf.show(1) or hdf.take(1) gives OOM
>
>
> Can anyone help me here?
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>


-- 
-- Anastasios Zouzias
<azo@zurich.ibm.com>
