whirr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrei Savu <savu.and...@gmail.com>
Subject Re: An experiment with tdbloader3 (i.e. how to create TDB indexes with MapReduce)
Date Thu, 03 Nov 2011 13:01:02 GMT
Great work! Thanks for sharing with us.

I hope others will find your work useful. I'm definitely impressed.


-- Andrei Savu

On Thu, Nov 3, 2011 at 2:04 PM, Paolo Castagna <
castagna.lists@googlemail.com> wrote:

> Hi,
> I wanted to share with this (use case) with you, I wrote "users" instead
> of "user" so I am sending this to the right address now.
> Paolo
> -------- Original Message --------
> Subject: An experiment with tdbloader3 (i.e. how to create TDB indexes
> with MapReduce)
> Date: Thu, 03 Nov 2011 11:38:20 +0000
> From: Paolo Castagna <castagna.lists@googlemail.com**>
> To: jena-users@incubator.apache.**org <jena-users@incubator.apache.org>
> CC: users@whirr.apache.org
> Hi,
> I want to share the results of an experiment I did using tdbloader3
> with an Hadoop cluster (provisioned via Apache Whirr) running on EC2.
> tdbloader3 generates TDB indexes with a sequence of four MapReduce jobs.
> tdbloader3 is ASL but it is not an official module for Apache Jena.
> At the moment it's a prototype/experiment and this is the reason why
> its sources are on GitHub: https://github.com/castagna/**tdbloader3<https://github.com/castagna/tdbloader3>
> *If* it turns out to be useful to people and *if* there will be demand,
> I have no problems in contributing this to Apache Jena as a separate
> module (and support it). But, I suspect there aren't many people with
> Hadoop clusters of ~100 nodes and RDF datasets of billion of triples
> or quads and the need to generate TDB indexes. :-)
> For now GitHub works well. More testing and experiments are necessary.
> The dataset I used in my experiment is an Open Library version in RDF:
> 962,071,613 triples, I found it somewhere in the office in our S3
> account. I am not sure if it is publicly available. Sorry.
> To run the experiment I used a 17 nodes Hadoop cluster on EC2.
> This is the Apache Whirr recipe I used to provision the cluster:
> ---------[ hadoop-ec2.properties ]----------
> whirr.cluster-name=hadoop
> whirr.instance-templates=1 hadoop-namenode+hadoop-**jobtracker,16
> hadoop-datanode+hadoop-**tasktracker
> whirr.instance-templates-max-**percent-failures=100
> hadoop-namenode+hadoop-**jobtracker,80
> hadoop-datanode+hadoop-**tasktracker
> whirr.max-startup-retries=1
> whirr.provider=aws-ec2
> whirr.identity=${env:AWS_**ACCESS_KEY_ID_LIVE}
> whirr.credential=${env:AWS_**SECRET_ACCESS_KEY_LIVE}
> whirr.hardware-id=c1.xlarge
> whirr.location-id=us-east-1
> whirr.image-id=us-east-1/ami-**1136fb78
> whirr.private-key-file=${sys:**user.home}/.ssh/whirr
> whirr.public-key-file=${whirr.**private-key-file}.pub
> whirr.hadoop.version=0.20.204.**0
> whirr.hadoop.tarball.url=http:**//archive.apache.org/dist/**
> hadoop/core/hadoop-${whirr.**hadoop.version}/hadoop-${**
> whirr.hadoop.version}.tar.gz<http://archive.apache.org/dist/hadoop/core/hadoop-$%7Bwhirr.hadoop.version%7D/hadoop-$%7Bwhirr.hadoop.version%7D.tar.gz>
> ---------
> The recipe specifies a 17 nodes cluster, it also specifies that a
> 20% failure rate during the startup of the cluster is acceptable.
> Only 15 datanodes/tasktrackers were successfully started and joined
> the Hadoop cluster. Moreover, during the processing, 2 datanodes/
> tasktrackers died. Hadoop coped with that with no problems.
> As you can see, I used c1.xlarge instances which have 7 GB of RAM,
> 20 "EC2 compute units" (8 virtual cores with 2.5 EC2 compute units
> each) and 1.6 GB of dist space. Price of c1.xlarge instances in
> US East (Virginia) is $0.68 per hour.
> These are the elapsed times for the jobs I run on the cluster:
>            distcp: 1hrs, 19mins (1)
>  tdbloader3-stats: 1hrs, 22mins (2)
>  tdbloader3-first: 1hrs, 48mins (3)
>  tdbloader3-second: 2hrs, 40mins
>  tdbloader3-third: 49mins
>  tdbloader3-fourth: 1hrs, 36mins
>          download: -            (4)
> (1) In my case, distcp copied 112 GB (uncompressed) from S3 onto HDFS.
> (2) tdbloader3-stats is optional and it is not strictly required for
> the TDB indexes, but it is useful to produce the stats.opt file.
> It can probably be merged into one of tdbloader3-first|second jobs.
> (3) considering the tdbloader3-{first-fourth} jobs only, they took ~7
> hours, therefore an overall speed of about 38,000 triples per second.
> Not impressive and a bit too expensive to be run on a cluster running
> on the cloud (it's a good business for Amazon though).
> (4) it depends on the bandwidth available. This step can be improved
> compressing the nodes.dat files and part of what download does can be
> done as a 10 map only job.
> For those lucky enough to already have a 40-100 nodes Hadoop cluster,
> tdbloader3 can be an option alongside tdbloader and tdbloader2.
> Certainly not a replacement of these two great tools + as much RAM as
> you can, for the rest of us. :-)
> Last but not least, if you are reading this and you are one of those
> lucky enough to have a (sometimes idle) Hadoop cluster of 40-100 nodes
> and you would like to help me out with the testing of tdbloader3 or run
> a few experiment with it, please, get in touch with me.
> Cheers,
> Paolo
> PS:
> Apache Whirr (http://whirr.apache.org/) is a great project to quickly
> have an Hadoop cluster up and running for testing or running experiments.
> Moreover, Whirr is not limited to Hadoop, it has support for: ZooKeeper,
> HBase, Cassandra, ElasticSearch, etc.)
> EC2 is great for short live, bursts and the easy/elastic provisioning.
> However, it is probably not the best choice for long-running clusters.
> If you use m1.small instances to run you Hadoop cluster you are likely
> to have problems with RAM.

View raw message