spark-user mailing list archives

From Yang <teddyyyy...@gmail.com>
Subject Re: how to start reading the spark source code?
Date Mon, 20 Jul 2015 07:38:57 GMT
ok, got a bit of a head start:

check out the git source at commit 14719b93ff4ea7c3234a9389621be3c97fa278b9
(the first release, so that I could at least build it)

then build it according to README.md,
then set up Eclipse with Scala IDE,
then create a new Scala project and set the project directory to
SCALA_SOURCE_HOME/core instead of the default

in Eclipse, remove the tests from the source path,

copy all the jars from SCALA_SOURCE_HOME/lib_managed into a separate dir,
then in Eclipse add all of them as external jars.

set your Scala project runtime to 2.10.5 (the one that comes with Spark seems
to be 2.10.4; the Eclipse default is 2.9-something).
there will be 2 compile errors: one is due to Tuple(), change it to Tuple2;
the other is "currentThread", change it to Thread.currentThread()
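
A minimal sketch of what those two fixes look like in isolation. The object and method names here (CompileFixes, pair, threadName) are hypothetical and only illustrate the pattern; the actual call sites in the Spark source are not shown in this message.

```scala
// Hypothetical illustration of the two source-level fixes; not Spark code.
object CompileFixes {
  // Before (fails to compile on Scala 2.10): Tuple(key, value)
  // After: use Tuple2 explicitly (or the literal form (key, value)).
  def pair(key: String, value: Int): (String, Int) = Tuple2(key, value)

  // Before (fails to compile on Scala 2.10): currentThread
  // After: call the Java API directly.
  def threadName(): String = Thread.currentThread().getName
}
```

Both replacements are plain Scala/Java calls, so they should not change behaviour, only make the old source compile on the newer compiler.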

then it will build fine

I pasted the hello-world from the docs; since the "getting started" doc is for
the latest version, I had to make some minor changes:



package spark



import spark.SparkContext
import spark.SparkContext._

object Tryout {
  def main(args: Array[String]) {
    val logFile = "../README.md" // Should be some file on your system
    val sc = new SparkContext("local", "tryout", ".",
      List(System.getenv("SPARK_EXAMPLES_JAR")))
    val logData = sc.textFile(logFile, 2).cache()

//    val logData = scala.io.Source.fromFile(args(0)).getLines().toArray

    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
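
For comparison, the commented-out line in the listing above hints at a plain-Scala version of the same count with no Spark involved. A minimal sketch of that, assuming the file fits in memory; PlainCount and countContaining are hypothetical names, not part of Spark:

```scala
import scala.io.Source

// Hypothetical helper, not part of Spark: counts lines containing a substring
// with ordinary Scala collections, so the result can be compared with the
// numbers printed by the Spark job.
object PlainCount {
  def countContaining(lines: Seq[String], needle: String): Int =
    lines.count(line => line.contains(needle))

  def main(args: Array[String]): Unit = {
    // args(0) is the path of the file to scan, e.g. ../README.md
    val source = Source.fromFile(args(0))
    try {
      val lines = source.getLines().toList
      val numAs = countContaining(lines, "a")
      val numBs = countContaining(lines, "b")
      println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    } finally {
      source.close()
    }
  }
}
```

Running both on the same file should print the same two numbers, which is a quick check that the Spark build is behaving before stepping through it in the debugger.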




then I debugged through this and it became fairly clear

On Sun, Jul 19, 2015 at 10:13 PM, Yang <teddyyyy123@gmail.com> wrote:

> thanks, my point is that earlier versions are normally much simpler, so
> they're easier to follow, and the basic structure should at least bear great
> similarity to the latest version
>
> On Sun, Jul 19, 2015 at 9:27 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>
>> e5c4cd8a5e188592f8786a265 was from 2011.
>>
>> Not sure why you started with such an early commit.
>>
>> Spark project has evolved quite fast.
>>
>> I suggest you clone Spark project from github.com/apache/spark/ and
>> start with core/src/main/scala/org/apache/spark/rdd/RDD.scala
>>
>> Cheers
>>
>> On Sun, Jul 19, 2015 at 7:44 PM, Yang <teddyyyy123@gmail.com> wrote:
>>
>>> I'm trying to understand how spark works under the hood, so I tried to
>>> read the source code.
>>>
>>> as I normally do, I downloaded the git source code, reverted to the very
>>> first version ( actually e5c4cd8a5e188592f8786a265c0cd073c69ac886 since the
>>> first version even lacked the definition of RDD.scala)
>>>
>>> but the code looks "too simple" and I can't find where the "magic"
>>> happens, i.e. where a transformation/computation is scheduled on a machine,
>>> bytes are stored, etc.
>>>
>>> it would be great if someone could show me a path in which the different
>>> source files are involved, so that I could read each of them in turn.
>>>
>>> thanks!
>>> yang
>>>
>>
>>
>
