gora-dev mailing list archives

From Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: Spark Backend Support for Gora (GORA-386) Midterm Report
Date Wed, 01 Jul 2015 07:35:36 GMT
This is fantastic.
Needless to say, the project will be progressing through the midterm.
Your blogging is very positive for the dissemination of your work.
I'd also like to extend a personal thank you to Talat. Excellent job, and
on behalf of the community here, an excellent effort to drive this GSoC
project so far, only halfway through :).
Looking forward to committing the initial patches into the master branch,
and also your LogManagerSpark, which will lower the barrier to adopting
the module.
Thanks
Lewis

On Wednesday, July 1, 2015, Furkan KAMACI <furkankamaci@gmail.com> wrote:

> Hi,
>
> First of all, I would like to thank you all. As you know, I've been
> accepted to GSoC 2015 with my proposal for developing Spark Backend
> Support for Gora (GORA-386), and it is time for the midterm evaluations.
> I want to share the current progress of my project and my midterm report
> as well.
>
> During my GSoC period, I've blogged at my personal website (
> http://furkankamaci.com/) and created a fork of Apache Gora's master
> branch to work on: https://github.com/kamaci/gora
>
> During the community bonding period, I read the Apache Gora documentation
> and source code to become more familiar with the project. I analyzed
> related projects, including Apache Flink and Apache Crunch, to learn how
> to implement a Spark backend for Apache Gora. I also picked up an issue
> from Jira (https://issues.apache.org/jira/browse/GORA-262) and fixed it.
>
> During the coding period, since implementing this project requires an
> understanding of Apache Spark, I started by studying Spark's first
> papers. I've analyzed “Spark: Cluster Computing with Working Sets” (
> http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf) and
> “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for
> In-Memory Cluster Computing” (
> https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf). I've
> published two posts, on Spark and Cluster Computing
> (http://furkankamaci.com/spark-and-cluster-computing/) and Resilient
> Distributed Datasets (
> http://furkankamaci.com/resilient-distributed-datasets-rdds/), at my
> personal blog. I've also followed the Apache Spark documentation and
> developed examples to explore RDDs.
>
> I've analyzed Apache Gora's GoraInputFormat class and Spark's
> newAPIHadoopRDD method, and implemented an example application that
> reads data from HBase.
>
> Apache Gora supports reading/writing data from/to Hadoop, and Spark has a
> method for generating an RDD from a Hadoop input format. So, the
> architecture is designed as a bridge between GoraInputFormat and RDDs,
> since both sides support the Hadoop InputFormat interface.
>
> I've created a base class for the Apache Gora and Spark integration,
> named GoraSparkEngine. It has initialize methods that take a Spark
> context, a data store, and an optional Hadoop configuration, and return
> an RDD.
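> As a sketch of how this API might be used (the Pageview bean, the data
> store setup, the key type, and the local master are illustrative
> assumptions based on this report, not the final design):

```java
// Hypothetical usage of the GoraSparkEngine described above.
// Pageview is assumed to be a Gora-generated persistent bean.
import org.apache.gora.spark.GoraSparkEngine;
import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class GoraSparkExample {
  public static void main(String[] args) throws Exception {
    // Hadoop configuration tells Gora which backing store to use
    // (e.g. HBase), via gora.properties and the mapping file.
    Configuration hadoopConf = new Configuration();
    DataStore<Long, Pageview> store =
        DataStoreFactory.getDataStore(Long.class, Pageview.class, hadoopConf);

    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("gora-spark").setMaster("local"));

    // initialize(...) is the bridge: it wires GoraInputFormat into
    // Spark and hands back an RDD of the store's key/value pairs.
    GoraSparkEngine<Long, Pageview> engine =
        new GoraSparkEngine<>(Long.class, Pageview.class);
    JavaPairRDD<Long, Pageview> rdd = engine.initialize(sc, store);

    System.out.println("Records: " + rdd.count());
    sc.stop();
  }
}
```

> (Requires the Gora and Spark dependencies on the classpath; shown only to
> illustrate the shape of the integration.)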
>
> After implementing a base for the GoraSpark engine, I developed a new
> example, aligned with LogAnalytics, named LogAnalyticsSpark. I developed
> the map and reduce parts (except for writing the results into the
> database), which do the same thing as LogAnalytics, plus something more,
> i.e. printing the number of lines in the tables.
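> The map/reduce logic above can be sketched in miniature with plain Java
> streams, no Spark required (the URLs and the field layout are made up for
> illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LogAnalyticsSketch {
    // Map step: emit (url, 1) for each log line; reduce step: sum the
    // counts per url. groupingBy + counting collapses both into one call.
    static Map<String, Long> pageViews(List<String> urls) {
        return urls.stream()
                .collect(Collectors.groupingBy(u -> u, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("/index", "/about", "/index");
        Map<String, Long> views = pageViews(hits);
        System.out.println(views.get("/index")); // 2
        // The "something more": the total number of lines, analogous to
        // printing the number of rows in the table.
        System.out.println(hits.size()); // 3
    }
}
```

> With Spark, the same shape is mapToPair(url -> (url, 1)) followed by
> reduceByKey(sum); the stream version just makes the aggregation concrete.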
>
> Once we get an RDD from the GoraSpark engine, we can perform operations
> on it just as on any other RDD that was not created via Apache Gora. The
> whole code can be checked out from the code base:
> https://github.com/kamaci/gora
>
> The project is ahead of the proposed timeline so far. The
> GoraInputFormat-to-RDD transformation is done, and it has been shown
> that map, reduce, and other methods work properly on that kind of RDD.
>
> Before the next steps, I am planning to design the overall architecture
> according to feedback from the community (there are some prerequisites
> when designing the architecture, e.g. the configuration of a Spark
> context cannot be changed after the context has been initialized).
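> Concretely, this means every setting has to go onto the SparkConf before
> the context is constructed; a minimal sketch (the property value chosen
> here is just an example):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ContextConfigExample {
  public static void main(String[] args) {
    // All configuration must be set on the SparkConf *before* the
    // context is created; changes made afterwards are not picked up,
    // which constrains how a Gora/Spark module can expose configuration.
    SparkConf conf = new SparkConf()
        .setAppName("gora-spark")
        .setMaster("local")
        .set("spark.serializer",
             "org.apache.spark.serializer.KryoSerializer");

    JavaSparkContext sc = new JavaSparkContext(conf);
    // ... run jobs ...
    sc.stop();
  }
}
```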
>
> When the necessary functionality is implemented, examples, tests, and
> documentation will follow. After that, if I have extra time, I'm planning
> to run a performance benchmark comparing Apache Gora with Hadoop
> MapReduce, plain Hadoop MapReduce, plain Apache Spark, and Apache Gora
> with Spark.
>
> Special thanks to Lewis and Talat. I should also mention that it is a
> real privilege to be able to talk with your mentor face to face. Talat
> and I met many times, and he helped me a lot in understanding how Hadoop
> and Apache Gora work.
>
> PS: I've attached my midterm report; my previous reports can be found
> here:
>
> https://cwiki.apache.org/confluence/display/GORA/Spark+Backend+Support+for+Gora+%28GORA-386%29+Reports
>
> Kind Regards,
> Furkan KAMACI
>


-- 
*Lewis*
