spark-user mailing list archives

From Saisai Shao <sai.sai.s...@gmail.com>
Subject Re: Apache Spark toDebugString producing different output for python and scala repl
Date Tue, 16 Aug 2016 03:16:57 GMT
The Python and Scala RDD APIs are implemented slightly differently, so the
difference in the RDD lineage you printed is expected.
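
One way to picture the difference is the toy model below. It is a sketch, not Spark's code: the classes `Node`, `ScalaLikeRDD`, `PythonLikeRDD` and the helper `text_file` are hypothetical names made up for illustration. The assumption it encodes is that PySpark pipelines consecutive Python functions (here `map` and `filter`) into a single `PythonRDD` on the JVM side, while the Scala API adds one RDD per transformation.

```python
# Toy model only: hypothetical classes, not Spark's implementation.

class Node:
    """One entry in a lineage chain, pointing at its parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def lineage(self):
        """Return node names from this node back to the source."""
        names, node = [], self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names


class ScalaLikeRDD(Node):
    """Every transformation adds its own node to the chain."""

    def map(self, f):
        return ScalaLikeRDD("MapPartitionsRDD (map)", self)

    def filter(self, f):
        return ScalaLikeRDD("MapPartitionsRDD (filter)", self)


class PythonLikeRDD(Node):
    """Consecutive Python functions are fused into one node."""

    def __init__(self, name, parent=None, funcs=()):
        super().__init__(name, parent)
        self.funcs = funcs

    def _pipe(self, f):
        if self.funcs:
            # Already a Python stage: extend it instead of adding a node.
            return PythonLikeRDD("PythonRDD", self.parent, self.funcs + (f,))
        # First Python function: start a new stage on top of the parent.
        return PythonLikeRDD("PythonRDD", self, (f,))

    def map(self, f):
        return self._pipe(f)

    def filter(self, f):
        return self._pipe(f)


def text_file(cls):
    """Model textFile as a HadoopRDD wrapped by a MapPartitionsRDD."""
    return cls("MapPartitionsRDD (textFile)", cls("HadoopRDD"))


scala_lineage = (
    text_file(ScalaLikeRDD).map(str.upper).filter(lambda x: "THE" in x).lineage()
)
python_lineage = (
    text_file(PythonLikeRDD).map(str.upper).filter(lambda x: "THE" in x).lineage()
)

print(len(scala_lineage), scala_lineage)
print(len(python_lineage), python_lineage)
```

In this model the Scala-like chain has four entries and the Python-like chain has three, which is the same shape as the two toDebugString outputs quoted below.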

On Tue, Aug 16, 2016 at 10:58 AM, DEEPAK SHARMA <deepak_dehradun@outlook.com> wrote:

> Hi All,
>
>
> Below is a small piece of code run in the Scala and Python REPLs of Apache
> Spark. However, I am getting different output in the two languages when I
> execute toDebugString. I am using the Cloudera QuickStart VM.
>
> PYTHON
>
> rdd2 = sc.textFile('file:/home/training/training_materials/data/frostroad.txt').map(lambda x: x.upper()).filter(lambda x: 'THE' in x)
>
> print rdd2.toDebugString()
> (1) PythonRDD[56] at RDD at PythonRDD.scala:42 []
>  |  file:/home/training/training_materials/data/frostroad.txt MapPartitionsRDD[55] at textFile at NativeMethodAccessorImpl.java:-2 []
>  |  file:/home/training/training_materials/data/frostroad.txt HadoopRDD[54] at textFile at ......
>
> SCALA
>
> val rdd2 = sc.textFile("file:/home/training/training_materials/data/frostroad.txt").map(x => x.toUpperCase()).filter(x => x.contains("THE"))
>
>
>
> rdd2.toDebugString
> res1: String = (1) MapPartitionsRDD[3] at filter at <console>:21 []
>  |  MapPartitionsRDD[2] at map at <console>:21 []
>  |  file:/home/training/training_materials/data/frostroad.txt MapPartitionsRDD[1] at textFile at <console>:21 []
>  |  file:/home/training/training_materials/data/frostroad.txt HadoopRDD[0] at textFile at <
>
>
> Also, one of the Cloudera slides says that the default number of partitions
> is 2; however, it is 1 (looking at the output of toDebugString).
>
>
> Appreciate any help.
>
>
> Thanks
>
> Deepak Sharma
>
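
On the partition count asked about in the quoted message: as far as I know, when textFile is called without minPartitions, Spark uses min(defaultParallelism, 2) as the default, so 2 is an upper bound on that default hint rather than a guaranteed count. If the VM runs Spark with a single core (an assumption; the thread does not say), the default would be 1, which would be consistent with the (1) in both outputs. The helper name `default_min_partitions` below is hypothetical and only restates that rule; the real partition count also depends on how the input format splits the file.

```python
def default_min_partitions(default_parallelism):
    """Restate the rule min(defaultParallelism, 2); hypothetical helper."""
    return min(default_parallelism, 2)


# Default minPartitions hint for a few assumed core counts.
for cores in (1, 2, 4):
    print(cores, default_min_partitions(cores))
```

In the PySpark shell, sc.defaultParallelism and rdd2.getNumPartitions() should show the actual values for a given setup.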
