spark-user mailing list archives

From John Omernik <>
Subject Transform Functions and Python Modules
Date Mon, 08 Jun 2015 13:27:41 GMT
I am learning more about Spark (and in this case Spark Streaming) and am
gathering that a function like map takes a function and applies it to each
element of the RDD, which in turn returns a new RDD based on the original.
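To make sure I have the mental model right, here is the behavior I mean,
sketched with plain Python's built-in map rather than an actual RDD (the
RDD version would be rdd.map(lambda x: x * x), which I haven't run here):

```python
# map applies the function to each element and produces a new
# sequence; the original input is left untouched, which is the
# same one-in, one-out shape as rdd.map in Spark.
data = [1, 2, 3]
squared = list(map(lambda x: x * x, data))
# squared -> [1, 4, 9]; data is unchanged
```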

That's cool for the simple map functions in the examples, where a lambda is
used to take x and do x * x, but what happens in Python (specifically)
with more complex functions? Especially those that use modules (that ARE
built in on all nodes).

For example, instead of a simple map, I want to take a line of data and
regex-parse it into fields. It's still a map (not a flatMap) in that
it's a one-to-one return. (One record of the RDD, a line of text, would
return one parsed record as a Python dict.)

In my Spark Streaming job, I have import re in the "main" part of the file,
and this all seems to work, but I want to ensure I am not "by default"
forcing computation onto the driver rather than distributing it.

This is "working" in that it's returning the expected data; however, I want
to ensure I am not doing something weird by having a transform function use
a module that's imported only at the driver. (Should I be calling import
re IN the function?)

If there are any good docs on this, I'd love to understand it more.
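For what it's worth, here are the two placements I'm comparing, sketched
as plain Python (parse_inside is a made-up name, and the re.sub call is
just a stand-in for real parsing). My understanding, which I'd like
confirmed, is that Python caches modules in sys.modules, so an import
inside the function is essentially free after the first call:

```python
import re  # module-level: the usual placement


def parse_inside(line):
    # Alternative placement: importing inside the function. After the
    # first call in a given process this only re-binds a local name,
    # since the module is already cached in sys.modules.
    import re
    return re.sub(r"\s+", " ", line)
```

Both placements return the same result locally; my question is whether
module-level imports behave the same way once the function is shipped to
the executors.

```python
parse_inside("a   b  c")  # -> "a b c"
```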




def parseLine(line):
    restr = r"^(\w\w\w  ?\d\d? \d\d:\d\d:\d\d) ([^ ]+) "
    logre = re.compile(restr)
    m = logre.match(line[1])  # Why does every record of the RDD have a None
                              # value in the first position of the tuple?
    rec = {}
    if m:
        rec['field1'] =
        rec['field2'] =
        return rec

fwlog_dstream = KafkaUtils.createStream(ssc, zkQuorum,
    "sparkstreaming-fwlog_parsed", {kafka_src_topic: 1})

recs =
