gora-dev mailing list archives

From Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: Spark Backend Support for Gora (GORA-386) Midterm Report
Date Wed, 01 Jul 2015 07:35:36 GMT
This is fantastic.
Needless to say, the project will be progressing through the midterm.
Your blogging is very positive for the dissemination of your work.
I'd also like to extend a personal thank you to Talat. Excellent job, and on
behalf of the community here, an exceptionally potent effort to drive this GSoC
project so far, only halfway through :).
Looking forward to committing the initial patches into the master branch and
also your LogManagerSpark, which will lower the barrier to adopting the

On Wednesday, July 1, 2015, Furkan KAMACI <furkankamaci@gmail.com> wrote:

> Hi,
> First of all, I would like to thank you all. As you know, I've been
> accepted to GSoC 2015 with my proposal for developing Spark Backend
> Support for Gora (GORA-386), and it is time for the midterm evaluations. I
> want to share the current progress of my project and my midterm report as
> well.
> During my GSoC period, I've blogged at my personal website (
> http://furkankamaci.com/) and created a fork from Apache Gora's master
> branch and worked on it: https://github.com/kamaci/gora
> During the community bonding period, I've read the Apache Gora documentation and
> Apache Gora source code to become more familiar
> with the project. I've analyzed related projects, including Apache Flink and
> Apache Crunch, to see how to implement a Spark backend for Apache Gora. I've picked up
> an issue from Jira (https://issues.apache.org/jira/browse/GORA-262) and
> fixed it.
> During the coding period, since implementing this project requires an
> understanding of Apache Spark, I've started by analyzing Spark's first papers. I've
> analyzed “Spark: Cluster Computing with Working Sets” (
> http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf) and
> “Resilient
> Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster
> Computing”
> (https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf). I've
> published two posts about Spark and Cluster Computing
> (http://furkankamaci.com/spark-and-cluster-computing/) and Resilient
> Distributed Datasets (
> http://furkankamaci.com/resilient-distributed-datasets-rdds/) at my
> personal blog. I've followed Apache Spark documentation and developed
> examples to analyze RDDs.
> I've analyzed Apache Gora's GoraInputFormat class and Spark's newHadoopRDD
> method. I've implemented an example application to read data from HBase.
> Apache Gora supports reading/writing data from/to Hadoop files. Spark has
> a method for generating an RDD compatible with Hadoop files. So, I designed
> an architecture that creates a bridge between GoraInputFormat and
> RDDs, since both of them support Hadoop files.
> I've created a base class for the Apache Gora and Spark integration, named
> GoraSparkEngine. It has initialize methods that take a Spark context, a data
> store, and an optional Hadoop configuration, and return an RDD.
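The bridge described above could be sketched roughly as follows. This is an illustrative sketch only: the `Pageview` key/value types are borrowed from Gora's log tutorial, the data-store wiring in the Hadoop configuration is elided, and the snippet assumes Spark's `JavaSparkContext.newAPIHadoopRDD` API, which accepts any Hadoop `InputFormat`.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.gora.mapreduce.GoraInputFormat;
import org.apache.gora.tutorial.log.generated.Pageview;

public class GoraSparkSketch {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("gora-spark-sketch"));

    // Hadoop configuration carrying the Gora data-store settings;
    // the key-class/value-class/query wiring is assumed to be done here.
    Configuration conf = new Configuration();

    // GoraInputFormat is a Hadoop InputFormat, so newAPIHadoopRDD can
    // turn a Gora query result directly into a Spark pair RDD.
    JavaPairRDD<Long, Pageview> rdd = sc.newAPIHadoopRDD(
        conf, GoraInputFormat.class, Long.class, Pageview.class);

    System.out.println("rows: " + rdd.count());
  }
}
```

This mirrors the point made in the report: both GoraInputFormat and Spark's Hadoop-RDD constructor speak the Hadoop InputFormat contract, so the bridge needs no extra serialization glue beyond the configuration.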
> After implementing a base for the GoraSpark engine, I've developed a new
> example, aligned to LogAnalytics, named
> LogAnalyticsSpark. I've developed its map and reduce parts (except for writing
> results into the database), which do the same thing as
> LogAnalytics plus something more, i.e. printing the number of lines in the
> tables.
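The map/reduce logic described above boils down to keying each log entry by the hour of its timestamp and summing the counts. A minimal plain-Java sketch of that aggregation, without Spark or Gora (the class and method names here are illustrative, not Gora's):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LogAnalyticsSketch {

    // Map step: key each log timestamp (ms since epoch) by its hour of day;
    // Reduce step: sum the per-hour counts, as LogAnalytics does over pageviews.
    static Map<Long, Long> countByHour(List<Long> timestampsMs) {
        return timestampsMs.stream()
            .collect(Collectors.groupingBy(
                ts -> (ts / 3_600_000L) % 24,   // hour of day (UTC)
                Collectors.counting()));
    }

    public static void main(String[] args) {
        // Two hits in hour 1, one hit in hour 2.
        List<Long> logs = Arrays.asList(3_600_000L, 3_700_000L, 7_200_000L);
        System.out.println(countByHour(logs));
    }
}
```

In the Spark version the same grouping and counting would run as RDD transformations, with the input records coming from the Gora-backed RDD instead of an in-memory list.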
> When we get an RDD from the GoraSpark engine, we can run operations over it
> just as on any other RDD that was not created via
> Apache Gora. The whole code can be checked in the code base:
> https://github.com/kamaci/gora
> Project progress is ahead of the proposed timeline so far.
> The GoraInputFormat-to-RDD transformation is done, and it has been shown that map,
> reduce and other methods work properly on that kind of RDD.
> Before the next steps, I am planning to design the overall architecture
> according to feedback from the community (there are some
> prerequisites when designing the architecture, e.g. the configuration of a
> Spark context cannot be changed after the context has been initialized).
> When the necessary functionality is implemented, examples, tests and
> documentation will be done. After that, if I have extra time, I'm planning
> to run a performance benchmark comparing Apache Gora with Hadoop MapReduce,
> plain Hadoop MapReduce, plain Apache Spark, and Apache Gora with Spark.
> Special thanks to Lewis and Talat. I should also mention that it is a real
> opportunity to be able to talk with your mentor face to face. Talat and I met
> many times, and he helped me a lot with how Hadoop and Apache Gora work.
> PS: I've attached my midterm report and my previous reports can be found
> here:
> https://cwiki.apache.org/confluence/display/GORA/Spark+Backend+Support+for+Gora+%28GORA-386%29+Reports
> Kind Regards,
> Furkan KAMACI

