spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Load json format dataset as RDD
Date Mon, 17 Nov 2014 03:42:16 GMT
Spark SQL gives you an RDD of Row objects that you can query much as you would with most JSON object libraries. For example, you can use row(0) to access field 0 and then cast it to something like a String, an Int, a Seq, or another Row if it's a nested object. If you have nested fields, you can also select just the fields you want using SQL syntax (e.g. "select name, location.x, location.y from dataset") and work only with those; you'll then get Rows containing just those fields.
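A minimal sketch of what this looks like in Spark 1.x Scala code (the API current when this thread was written). The file path and the "name"/"location" fields are assumptions taken from the example query above:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("json-example"))
val sqlContext = new SQLContext(sc)

// jsonFile infers the schema from the data and returns a SchemaRDD
// (an RDD of Row objects).
val dataset = sqlContext.jsonFile("/path/to/data.json")
dataset.registerTempTable("dataset")

// Select only the fields you need, including nested ones:
val rows = sqlContext.sql("select name, location.x, location.y from dataset")

// Each Row can be indexed positionally and cast, as described above:
rows.map(row => (row(0).asInstanceOf[String], row(1), row(2)))
    .collect()
    .foreach(println)
```

Note that jsonFile expects one JSON record per line; for records that span multiple lines, one common workaround is to read each file whole with SparkContext.wholeTextFiles and parse it with a JSON library such as json4s.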

Matei

> On Nov 16, 2014, at 7:34 PM, J <johnkonet@gmail.com> wrote:
> 
> Hi, 
> 
> I am new to Spark and ran into a problem while trying to load a dataset.
> 
> I have a dataset in JSON format and I'd like to load it as an RDD.
> 
> Since one record may span multiple lines, SparkContext.textFile() won't work. I also
> tried using json4s to parse the JSON manually and merge the results into an RDD one by
> one, but that approach is inconvenient and inefficient.
> 
> It seems that Spark SQL has a JsonRDD, but it appears to be for querying only.
> 
> Could anyone give me some suggestions on how to load JSON-format data as an RDD? For
> example, given a file path, load the dataset as an RDD[JObject].
> 
> Thank you very much!
> 
> Regards,
> J


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

