spark-user mailing list archives

From Felix Cheung <felixcheun...@hotmail.com>
Subject Re: GraphX build from JSON input
Date Mon, 15 Aug 2016 21:36:39 GMT
You would need to unpack the fields from each Row.
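
As a rough sketch of what that unpacking might look like for the DataFrames described in the question below: GraphX needs a Long VertexId, so the string ids have to be mapped to numbers somehow. The helper toVertexId and the function buildGraph are hypothetical names, and the sketch assumes every id is "osgb" followed by digits that fit in a Long, which the sample records suggest but do not guarantee. The choice of attributes (the first two coordinates of 'point' for vertices, 'length' for edges) is only illustrative.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// Hypothetical helper. ASSUMPTION: ids look like "osgb" + digits that fit in a Long,
// e.g. "osgb4000000031043205" becomes 4000000031043205L. Other id formats will throw.
def toVertexId(toid: String): VertexId = toid.stripPrefix("osgb").toLong

// Hypothetical function: builds a graph whose vertex attribute is the (x, y) point
// and whose edge attribute is the 'length' field.
def buildGraph(vertices: DataFrame, edges: DataFrame): Graph[(Double, Double), Double] = {
  // Unpack each Row by column name into (VertexId, attribute).
  val vertexRdd: RDD[(VertexId, (Double, Double))] = vertices.rdd.map { row =>
    val point = row.getAs[scala.collection.Seq[Double]]("point")
    (toVertexId(row.getAs[String]("toid")), (point(0), point(1)))
  }

  // 'positiveNode' is used as the source and 'negativeNode' as the destination,
  // following the mapping given in the question.
  val edgeRdd: RDD[Edge[Double]] = edges.rdd.map { row =>
    Edge(
      toVertexId(row.getAs[String]("positiveNode")),
      toVertexId(row.getAs[String]("negativeNode")),
      row.getAs[Double]("length"))
  }

  Graph(vertexRdd, edgeRdd)
}
```

Calling buildGraph(vertices, edges) on the two DataFrames from the question should then give a Graph that the usual GraphX operators accept. Rows with null fields would need filtering first.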

Check out GraphFrames. With it you can operate directly on DataFrames.

https://github.com/graphframes/graphframes
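
A minimal sketch of the GraphFrames route, assuming the graphframes package is on the classpath. GraphFrames expects the vertex DataFrame to have an "id" column and the edge DataFrame to have "src" and "dst" columns, so the columns from the question are renamed to match. The function name buildGraphFrame is hypothetical. String ids can be kept as they are here, so no conversion to Long is needed.

```scala
import org.apache.spark.sql.DataFrame
import org.graphframes.GraphFrame

// Hypothetical function: renames the question's columns to the names
// GraphFrames requires ("id" for vertices, "src"/"dst" for edges).
def buildGraphFrame(vertices: DataFrame, edges: DataFrame): GraphFrame = {
  val v = vertices.withColumnRenamed("toid", "id")
  val e = edges
    .withColumnRenamed("positiveNode", "src")
    .withColumnRenamed("negativeNode", "dst")
  GraphFrame(v, e)
}
```

The remaining columns (point, length, nature, term and so on) stay on the DataFrames as vertex and edge attributes.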




On Mon, Aug 15, 2016 at 1:01 PM -0700, "Gerard Casey" <gerardhughcasey@gmail.com>
wrote:

Dear all,

I am looking for some guidance. I am trying to build a graph from a vertices.json.gz file
and an edges.json.gz file.

A sample vertices record:  {"toid": "osgb4000000031043205", "index": 1, "point": [508180.748,
195333.973]}
A sample edges record:  {"index": 1, "term": "Private Road - Restricted Access", "nature":
"Single Carriageway", "negativeNode": "osgb4000000023183407", "toid": "osgb4000000023296573",
"length": 112.8275895775762, "polyline": [492019.481, 156567.076, 492028, 156567, 492041.667,
156570.536, 492063.65, 156578.067, 492126.5, 156602], "positiveNode": "osgb4000000023183409"}

For the vertices data, 'toid' will be the VertexId. For the edges data, 'positiveNode' will
be the srcId and 'negativeNode' will be the dstId. The other fields are the respective attributes.

I can read each file using the sqlContext.read.json method

scala> val vertices = sqlContext.read.json("/Users/gerardcasey/spark-abm/vertices.json.gz")

vertices: org.apache.spark.sql.DataFrame = [index: bigint, point: array<double>, toid:
string]

Using vertices.show() I can inspect the data. The toid field is a unique identifier that
I wish to use as the VertexId.


Again, I can read the edges file using the same method.

scala> val edges = sqlContext.read.json("/Users/gerardcasey/spark-abm/edges.json.gz")

edges: org.apache.spark.sql.DataFrame = [index: bigint, length: double, nature: string, negativeNode:
string, polyline: array<double>, positiveNode: string, term: string, toid: string]

And using edges.show() I can inspect the data.


I now wish to build a graph using GraphX. I know that the JSON files I have read in are now
in DataFrame format, but GraphX requires RDDs.

I thus created a vertices_rdd and an edges_rdd using the .rdd method:

scala> val vertices_rdd = vertices.rdd

vertices_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[24] at
rdd at <console>:37

scala> val edges_rdd =  edges.rdd
edges_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[29] at rdd
at <console>:37

However, these are of type 'org.apache.spark.sql.Row', and the Graph.apply method does not
accept them.

How can I convert from these DataFrames to VertexRDD and EdgeRDD classes? Or, is there a simpler
way to input JSON for building a graph?

Many thanks for reading if you have got this far!

Best wishes,

G

