giraph-dev mailing list archives

From "Paolo Castagna (Commented) (JIRA)" <>
Subject [jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph
Date Sun, 08 Apr 2012 19:54:18 GMT


Paolo Castagna commented on GIRAPH-170:

Pig and Pig Latin can certainly be used to create adjacency lists from RDF in N-Triples or
N-Quads format. I tend to use plain MapReduce jobs written in Java, but I found a very old
example (it was using Pig version 0.6) of how one might write an [NQuadsStorage|] which
implements LoadFunc and StoreFunc for Pig. I shared it, even though it no longer compiles,
just to show how straightforward that is.
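The grouping step described here can be sketched without Pig at all. The snippet below is a simplified illustration (it assumes URI-only terms, no literals, and no embedded spaces, which full N-Triples does not guarantee): it parses N-Triples lines and groups objects by subject, which is what a Pig GROUP BY on the subject column, or a MapReduce shuffle keyed on subject, would produce.

```java
import java.util.*;

public class NTriplesToAdjacency {
    // Parse one N-Triples line into subject, predicate, object.
    // Simplified parser: assumes URI-only terms with no embedded spaces.
    static String[] parseLine(String line) {
        String trimmed = line.trim();
        // Drop the trailing " ." statement terminator.
        trimmed = trimmed.substring(0, trimmed.lastIndexOf('.')).trim();
        return trimmed.split("\\s+", 3);
    }

    // Group triples by subject, as a Pig GROUP BY (or a MapReduce
    // shuffle on the subject key) would.
    static Map<String, List<String>> adjacency(List<String> lines) {
        Map<String, List<String>> adj = new LinkedHashMap<>();
        for (String line : lines) {
            String[] spo = parseLine(line);
            adj.computeIfAbsent(spo[0], k -> new ArrayList<>()).add(spo[2]);
        }
        return adj;
    }

    public static void main(String[] args) {
        List<String> triples = Arrays.asList(
            "<http://example.org/a> <http://example.org/knows> <http://example.org/b> .",
            "<http://example.org/a> <http://example.org/knows> <http://example.org/c> .",
            "<http://example.org/b> <http://example.org/knows> <http://example.org/c> ."
        );
        for (Map.Entry<String, List<String>> e : adjacency(triples).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

A real Pig LoadFunc would do the per-line parsing and leave the grouping to the Pig Latin script; the HashMap here just stands in for that shuffle.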

It is my intention, in the next few weeks, to create a small library to support people wanting
to use Pig, HBase, MapReduce and Giraph to process RDF data.
For Pig, the first (and only?) thing to do is to implement LoadFunc and StoreFunc for RDF data.
It seems possible (although not easy) to map the SPARQL algebra onto Pig Latin physical operators
(and SPARQL property paths onto Giraph jobs? ;-)); that would provide a good, scalable batch
processing solution for those into SPARQL.
For HBase, the first step is to store RDF data; even a plain [(G)|S|P|O] layout would do.
For MapReduce, blank nodes can be painful; I have some tricks to share here (input/output
formats, record readers/writers, etc.).
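On the blank-node pain: one common workaround (a generic trick, not necessarily the ones alluded to above) is to rewrite each blank-node label with a prefix unique to the input file or split, since labels such as _:b0 are scoped to a single document and would otherwise collide when triples from different inputs meet in the shuffle. A minimal sketch, where the split identifier is hypothetical:

```java
public class BlankNodeRenamer {
    // Blank node labels (e.g. _:b0) are only unique within one file, so
    // two input splits may reuse the same label for different nodes.
    // Prefixing each label with a split-unique id before the shuffle
    // keeps them distinct. The splitId value is hypothetical; in a real
    // MapReduce job it would be derived from the InputSplit.
    static String rename(String term, String splitId) {
        if (term.startsWith("_:")) {
            return "_:" + splitId + "_" + term.substring(2);
        }
        return term;
    }

    public static void main(String[] args) {
        System.out.println(rename("_:b0", "split0"));                   // prints _:split0_b0
        System.out.println(rename("<http://example.org/a>", "split0")); // URIs pass through unchanged
    }
}
```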

In relation to Giraph, and to bring the discussion back on topic: until I am proven wrong, I
am going with the adjacency-list approach discussed above, doing graph processing as other
'usual' Giraph jobs do.

The question of which RDF processing use cases are a good fit for Giraph is still open for me
(and I'll find out soon).
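For the adjacency-list approach, a hypothetical one-vertex-per-line text layout might look like `<vertexId>\t<target1> <target2> ...`. The standalone sketch below parses such a line the way a custom text vertex input format would; the line format itself is an assumption for illustration, not an existing Giraph format.

```java
import java.util.*;

public class AdjacencyLineParser {
    // Hypothetical one-line-per-vertex text format:
    //   <vertexId>\t<target1> <target2> ...
    // A custom text-based VertexInputFormat reader would split each line
    // the same way; this standalone sketch just returns (id, targets).
    static Map.Entry<String, List<String>> parse(String line) {
        String[] parts = line.split("\t", 2);
        List<String> targets = (parts.length > 1 && !parts[1].isEmpty())
                ? Arrays.asList(parts[1].split(" "))
                : Collections.<String>emptyList();
        return new AbstractMap.SimpleEntry<>(parts[0], targets);
    }

    public static void main(String[] args) {
        Map.Entry<String, List<String>> v =
            parse("<http://example.org/a>\t<http://example.org/b> <http://example.org/c>");
        System.out.println(v.getKey() + " -> " + v.getValue());
    }
}
```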
> Workflow for loading RDF graph data into Giraph
> -----------------------------------------------
>                 Key: GIRAPH-170
>                 URL:
>             Project: Giraph
>          Issue Type: New Feature
>            Reporter: Dan Brickley
>            Priority: Minor
> W3C RDF provides a family of Web standards for exchanging graph-based data. RDF uses
sets of simple binary relationships, labeling nodes and links with Web identifiers (URIs).
Many public datasets are available as RDF, including the "Linked Data" cloud (see
). Many such datasets are listed at
> RDF has several standard exchange syntaxes. The oldest is RDF/XML. A simple line-oriented
format is N-Triples. A format aligned with RDF's SPARQL query language is Turtle. Apache Jena
and Any23 provide software to handle all of these.
> This JIRA leaves open the strategy for loading RDF data into Giraph. There are various
possibilities, including exploitation of intermediate Hadoop-friendly stores, or pre-processing
with e.g. Pig-based tools into a more Giraph-friendly form, or writing custom loaders. Even
a HOWTO document or implementor notes here would be an advance on the current state of the
art. The BluePrints Graph API (Gremlin etc.) has also been aligned with various RDF datasources.
> Related topics: multigraphs touches
on the issue (we can't currently easily represent fully general RDF graphs, because two
nodes might be connected by more than one typed edge). Even without multigraphs it ought to
be possible to bring RDF-sourced data
> into Giraph, e.g. perhaps some app is only interested in say the Movies + People subset
of a big RDF collection.
> From Avery in email: "a helper VertexInputFormat (and maybe VertexOutputFormat) would
certainly [despite GIRAPH-141] still help"

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
For more information on JIRA, see:

