spark-user mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: pyspark read json file with high dimensional sparse data
Date Wed, 30 Mar 2016 20:08:09 GMT
You can force the data to be loaded as a sparse map, assuming the key/value
types are consistent. Here is an example:
<https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1863598192220754/2840265927289860/latest.html>

On Wed, Mar 30, 2016 at 8:17 AM, Yavuz Nuzumlalı <manuyavuz@gmail.com>
wrote:

> Hi all,
>
> I'm trying to read data from a JSON file using the
> `SQLContext.read.json()` method.
>
> However, the read operation does not finish. My data has 290000x3100
> dimensions, but it is actually very sparse, so if there is a way to
> read the JSON directly into a sparse dataframe, that would work
> perfectly for me.
>
> What are the alternatives for reading such data into Spark?
>
> P.S.: When I try to load only the first 50000 rows, the read operation
> completes in ~2 minutes.
>
