spark-user mailing list archives

From Diana Carroll <dcarr...@cloudera.com>
Subject Re: logging in pyspark
Date Wed, 07 May 2014 13:32:01 GMT
foreach vs. map isn't the issue.  Both require serializing the called
function, so the pickle error would still apply, yes?
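
(A quick sanity check outside Spark, using only the stdlib pickle, shows the same failure mode: a logging handler owns a thread lock, and pickle refuses to serialize it. This is just an illustration of the error, not Spark's actual closure serializer.)

```python
import logging
import pickle

# A StreamHandler acquires a threading lock for thread safety; stdlib
# pickle cannot serialize that lock, which mirrors the failure Spark's
# closure serializer hits when a task function drags a logger along.
try:
    pickle.dumps(logging.StreamHandler())
    picklable = True
except (TypeError, pickle.PicklingError):
    picklable = False

print(picklable)  # the handler's internal lock blocks pickling
```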

And at the moment, I'm just testing.  Definitely wouldn't want to log
something for each element, but may want to detect something and log for
SOME elements.

So my question is: how are other people doing logging from distributed
tasks, given the serialization issues?

The same issue actually exists in Scala, too.  I could work around it by
creating a small serializable object that provides a logger, but it seems
kind of kludgy to me, so I'm wondering if other people are logging from
tasks, and if so, how?
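
For what it's worth, the least kludgy pattern I've seen is to build the logger lazily *inside* the task function, so the closure captures only code and never a handler or its lock. Roughly like this (a sketch, untested on a real cluster; get_task_logger is just an illustrative helper name):

```python
import logging

def get_task_logger():
    # Build (or fetch) the logger on the worker, after deserialization,
    # so no handler or lock ever travels through pickle.
    logger = logging.getLogger("task")
    if not logger.handlers:
        logger.setLevel(logging.INFO)
        logger.addHandler(logging.StreamHandler())
    return logger

def log_test_map(a):
    # Nothing unpicklable is referenced in this closure; the logger
    # is looked up fresh each time the task runs on an executor.
    get_task_logger().info("test")
    return a

# myrdd.map(log_test_map).count()  # should now pickle cleanly
```

The output lands in each executor's stderr rather than the driver's, but that's inherent to logging from distributed tasks.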

Diana


On Tue, May 6, 2014 at 6:24 PM, Nicholas Chammas <nicholas.chammas@gmail.com> wrote:

> I think you're looking for RDD.foreach() <http://spark.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#foreach>.
>
> According to the programming guide <http://spark.apache.org/docs/latest/scala-programming-guide.html>:
>
> Run a function func on each element of the dataset. This is usually done
>> for side effects such as updating an accumulator variable (see below) or
>> interacting with external storage systems.
>
>
> Do you really want to log something for each element of your RDD?
>
> Nick
>
>
> On Tue, May 6, 2014 at 3:31 PM, Diana Carroll <dcarroll@cloudera.com> wrote:
>
>> What should I do if I want to log something as part of a task?
>>
>> This is what I tried.  To set up a logger, I followed the advice here:
>> http://py4j.sourceforge.net/faq.html#how-to-turn-logging-on-off
>>
>> logger = logging.getLogger("py4j")
>> logger.setLevel(logging.INFO)
>> logger.addHandler(logging.StreamHandler())
>>
>> This works fine when I call it from my driver (i.e., pyspark):
>> logger.info("this works fine")
>>
>> But I want to try logging within a distributed task so I did this:
>>
>> def logTestMap(a):
>>     logger.info("test")
>>     return a
>>
>> myrdd.map(logTestMap).count()
>>
>> and got:
>> PicklingError: Can't pickle 'lock' object
>>
>> So it's trying to serialize my function and can't, because of a lock
>> object used in the logger, presumably for thread safety.  But then...how
>> would I do it?  Or is this just a really bad idea?
>>
>> Thanks
>> Diana
>>
>
>
