whirr-user mailing list archives

From Paolo Castagna <castagna.lists@googlemail.com>
Subject Re: An experiment with tdbloader3 (i.e. how to create TDB indexes with MapReduce)
Date Thu, 03 Nov 2011 16:33:29 GMT
Hi Andrei,

Andrei Savu wrote:
> Great work! Thanks for sharing with us. 

Thanks, I am happy to share and to point more people at Apache Whirr:
it's a great project, and it enables others (students, for example)
to learn more about all the services which Whirr makes trivial to
install on EC2.

Unfortunately, for serious usage (and serious performance) you need
slightly bigger clusters, and that comes at a considerable cost on
EC2 or other cloud services.

I am still looking for someone with a Hadoop cluster I could use
to run a couple of bigger experiments. Not really "big data":
~120 GB of uncompressed RDF files (in N-Quads format), a few hours
per run.

I'd like to see how many nodes are needed to achieve a 10x
improvement over what we have right now. :-)

> I hope others will find your work useful. I'm definitely impressed.

Next is inference. ;-)

For that, MapReduce can work, but maybe Pregel would be a better fit.
I need to give Giraph [1] a proper look and check what its requirements
are; probably just a vanilla Hadoop cluster (but I am not sure right now).

Also, I have a few experiments on how best to store RDF in HBase;
an incomplete prototype is here [2]. Vaibhav Khadilkar did most of
the work there, but more experiments are necessary.

If your data access patterns are more like "give me everything you
have in your RDF data about <this>", then something like Cassandra
or another key-value store might be a better fit.

The good thing for me is that Apache Whirr has support for all of these
things, so I can focus on the algorithms or my specific implementation
and run the tests on a small-to-medium size cluster (without having to
own a real one 24/7).

I'd love to be more helpful with Whirr (but time is limited at the
moment and I have not had a chance to dig into the sources yet!)... and
it's not only Whirr, it's also jclouds. :-/

Cheers,
Paolo

  [1] http://incubator.apache.org/giraph/
  [2] https://github.com/castagna/hbase-rdf

> 
> Cheers,
> 
> -- Andrei Savu
> 
> On Thu, Nov 3, 2011 at 2:04 PM, Paolo Castagna
> <castagna.lists@googlemail.com> wrote:
> 
>     Hi,
>     I wanted to share this (use case) with you. I wrote "users" instead
>     of "user" in the address, so I am re-sending it to the right list now.
> 
>     Paolo
> 
>     -------- Original Message --------
>     Subject: An experiment with tdbloader3 (i.e. how to create TDB
>     indexes with MapReduce)
>     Date: Thu, 03 Nov 2011 11:38:20 +0000
>     From: Paolo Castagna <castagna.lists@googlemail.com>
>     To: jena-users@incubator.apache.org
>     CC: users@whirr.apache.org
> 
>     Hi,
>     I want to share the results of an experiment I did using tdbloader3
>     with a Hadoop cluster (provisioned via Apache Whirr) running on EC2.
> 
>     tdbloader3 generates TDB indexes with a sequence of four MapReduce jobs.
>     tdbloader3 is ASL-licensed, but it is not an official Apache Jena module.
> 
>     At the moment it's a prototype/experiment and this is the reason why
>     its sources are on GitHub: https://github.com/castagna/tdbloader3
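> 
>     In outline, the four jobs chain like this (a purely illustrative
>     sketch: the jar name, job names and HDFS paths below are placeholders,
>     not tdbloader3's actual command line):
> 
>     ---------[ illustrative pipeline ]----------
>     # four chained MapReduce jobs; each reads the previous job's output
>     hadoop jar tdbloader3.jar first  /data/nquads /out/first
>     hadoop jar tdbloader3.jar second /out/first   /out/second
>     hadoop jar tdbloader3.jar third  /out/second  /out/third
>     hadoop jar tdbloader3.jar fourth /out/third   /out/indexes
>     ---------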
> 
>     *If* it turns out to be useful to people and *if* there is demand,
>     I have no problem contributing this to Apache Jena as a separate
>     module (and supporting it). But I suspect there aren't many people with
>     Hadoop clusters of ~100 nodes, RDF datasets of billions of triples
>     or quads, and the need to generate TDB indexes. :-)
> 
>     For now GitHub works well. More testing and experiments are necessary.
> 
>     The dataset I used in my experiment is a version of Open Library in RDF:
>     962,071,613 triples. I found it somewhere in the office in our S3
>     account, and I am not sure if it is publicly available. Sorry.
> 
>     To run the experiment I used a 17-node Hadoop cluster on EC2.
> 
>     This is the Apache Whirr recipe I used to provision the cluster:
> 
>     ---------[ hadoop-ec2.properties ]----------
>     whirr.cluster-name=hadoop
>     whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,16 hadoop-datanode+hadoop-tasktracker
>     whirr.instance-templates-max-percent-failures=100 hadoop-namenode+hadoop-jobtracker,80 hadoop-datanode+hadoop-tasktracker
>     whirr.max-startup-retries=1
>     whirr.provider=aws-ec2
>     whirr.identity=${env:AWS_ACCESS_KEY_ID_LIVE}
>     whirr.credential=${env:AWS_SECRET_ACCESS_KEY_LIVE}
>     whirr.hardware-id=c1.xlarge
>     whirr.location-id=us-east-1
>     whirr.image-id=us-east-1/ami-1136fb78
>     whirr.private-key-file=${sys:user.home}/.ssh/whirr
>     whirr.public-key-file=${whirr.private-key-file}.pub
>     whirr.hadoop.version=0.20.204.0
>     whirr.hadoop.tarball.url=http://archive.apache.org/dist/hadoop/core/hadoop-${whirr.hadoop.version}/hadoop-${whirr.hadoop.version}.tar.gz
>     ---------
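> 
>     With the recipe saved as hadoop-ec2.properties and the two AWS_*_LIVE
>     environment variables exported, launching and destroying the cluster
>     is just two Whirr CLI calls, along these lines:
> 
>     ---------[ cluster lifecycle ]----------
>     export AWS_ACCESS_KEY_ID_LIVE=...      # your AWS access key id
>     export AWS_SECRET_ACCESS_KEY_LIVE=...  # your AWS secret access key
>     whirr launch-cluster --config hadoop-ec2.properties
>     # ... run the jobs ...
>     whirr destroy-cluster --config hadoop-ec2.properties
>     ---------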
> 
>     The recipe specifies a 17-node cluster; it also specifies that a
>     20% startup failure rate among the datanodes/tasktrackers is
>     acceptable (the "80" in whirr.instance-templates-max-percent-failures).
>     Only 15 datanodes/tasktrackers started successfully and joined
>     the Hadoop cluster. Moreover, during the processing, 2 datanodes/
>     tasktrackers died. Hadoop coped with that without any problems.
> 
>     As you can see, I used c1.xlarge instances, which have 7 GB of RAM,
>     20 "EC2 compute units" (8 virtual cores with 2.5 EC2 compute units
>     each) and 1.6 TB of local disk space. The price of a c1.xlarge
>     instance in US East (Virginia) is $0.68 per hour.
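> 
>     (As a back-of-the-envelope cost check: the elapsed times below add up
>     to roughly 9.5 hours, so 17 instances at $0.68/hour come to about
>     17 x 0.68 x 9.5 = ~$110 for a complete run, ignoring S3 storage and
>     bandwidth charges.)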
> 
>     These are the elapsed times for the jobs I ran on the cluster:
> 
>                 distcp: 1hrs, 19mins (1)
>       tdbloader3-stats: 1hrs, 22mins (2)
>       tdbloader3-first: 1hrs, 48mins (3)
>      tdbloader3-second: 2hrs, 40mins
>       tdbloader3-third: 49mins
>      tdbloader3-fourth: 1hrs, 36mins
>               download: -            (4)
> 
>     (1) In my case, distcp copied 112 GB (uncompressed) from S3 onto HDFS.
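> 
>     A minimal sketch of that copy (the bucket and paths are made-up
>     placeholders; s3n:// is Hadoop's S3 native filesystem scheme):
> 
>     ---------[ distcp from S3 to HDFS ]----------
>     # assumes S3 credentials are configured for the cluster
>     hadoop distcp s3n://my-bucket/openlibrary-rdf/ \
>                   hdfs:///user/hadoop/input/
>     ---------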
> 
>     (2) tdbloader3-stats is optional and not strictly required to build the
>     TDB indexes, but it is useful because it produces the stats.opt file.
>     It could probably be merged into the tdbloader3-first or -second job.
> 
>     (3) considering the tdbloader3-{first..fourth} jobs only, they took
>     ~7 hours (1:48 + 2:40 + 0:49 + 1:36), which for 962,071,613 triples
>     is an overall speed of roughly 38,000 triples per second. Not
>     impressive, and a bit too expensive to run on a cluster in the cloud
>     (it's a good business for Amazon though).
> 
>     (4) it depends on the available bandwidth. This step could be improved
>     by compressing the nodes.dat files, and part of what the download step
>     does could be done as a map-only job.
> 
>     For those lucky enough to already have a 40-100 node Hadoop cluster,
>     tdbloader3 can be an option alongside tdbloader and tdbloader2.
>     For the rest of us, it is certainly not a replacement for those two
>     great tools (plus as much RAM as you can get). :-)
> 
>     Last but not least: if you are reading this, you are one of those
>     lucky enough to have a (sometimes idle) 40-100 node Hadoop cluster,
>     and you would like to help me test tdbloader3 or run a few
>     experiments with it, please get in touch with me.
> 
>     Cheers,
>     Paolo
> 
> 
>     PS:
>     Apache Whirr (http://whirr.apache.org/) is a great project for quickly
>     getting a Hadoop cluster up and running for testing or for running
>     experiments. Moreover, Whirr is not limited to Hadoop: it also has
>     support for ZooKeeper, HBase, Cassandra, ElasticSearch, etc.
> 
>     EC2 is great for short-lived bursts and for its easy/elastic
>     provisioning. However, it is probably not the best choice for
>     long-running clusters. If you use m1.small instances to run your
>     Hadoop cluster, you are likely to have problems with RAM.
> 
> 

