spark-issues mailing list archives

From "Shivaram Venkataraman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-3963) Support getting task-scoped properties from TaskContext
Date Tue, 02 Dec 2014 02:44:12 GMT

    [ https://issues.apache.org/jira/browse/SPARK-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230878#comment-14230878 ]

Shivaram Venkataraman commented on SPARK-3963:
----------------------------------------------

[~pwendell] This looks pretty useful -- was this postponed from 1.2? I have a use case that
needs Hadoop file names, and I was wondering if there is a workaround before this is implemented.
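
One workaround that should be possible today is to drop down to the HadoopRDD level and read the file name from the InputSplit backing each partition. A minimal sketch, assuming a spark-shell {{sc}}, an illustrative input path, and a Spark version that has the (DeveloperApi) HadoopRDD.mapPartitionsWithInputSplit method:

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.HadoopRDD

// Use the low-level hadoopFile API so we keep a handle on the underlying HadoopRDD.
val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat]("s3n://.../2014/*/*/*.json")

// mapPartitionsWithInputSplit hands us the InputSplit for each partition; for
// file-based input formats it is a FileSplit carrying the path of the backing file.
val linesWithFileName = rdd.asInstanceOf[HadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit { (split, iter) =>
    val fileName = split.asInstanceOf[FileSplit].getPath.toString
    // Copy the reused Text objects to plain Strings before handing them on.
    iter.map { case (_, text) => (fileName, text.toString) }
  }
{code}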

> Support getting task-scoped properties from TaskContext
> -------------------------------------------------------
>
>                 Key: SPARK-3963
>                 URL: https://issues.apache.org/jira/browse/SPARK-3963
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Patrick Wendell
>
> This is a proposal for a minor feature. Given the stabilization of the TaskContext API, it
would be nice to have a mechanism for Spark jobs to access properties that are defined with
task-level scope by Spark RDDs. I'd like to propose adding a simple properties hash map
with some standard Spark properties that users can access. Later it would be nice to support
users setting these properties, but to keep things simple in 1.2 I'd prefer that users not
be able to set them for now.
> The main use case is providing the file name from Hadoop RDDs, a very common request,
but I'd imagine us using this for other things later on. We could also use it to expose
some of the taskMetrics, such as the input bytes.
> {code}
> val data = sc.textFile("s3n://.../2014/*/*/*.json")
> data.mapPartitions { iter =>
>   val tc = TaskContext.get
>   val fileName = tc.getProperty(TaskContext.HADOOP_FILE_NAME)
>   val parts = fileName.split("/")
>   val (year, month, day) = (parts(3), parts(4), parts(5))
>   ...
> }
> {code}
> Internally we'd have a method called setProperty, but this wouldn't be exposed initially.
This is structured as a simple (String, String) hash map for ease of porting to Python.
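
For concreteness, here is a minimal sketch of what such a task-scoped (String, String) map could look like; the getProperty/setProperty names follow the description above, while the package, class, and key names are purely illustrative and not the actual Spark API:

{code}
package org.apache.spark

import java.util.concurrent.ConcurrentHashMap

// Illustrative standard keys that Spark itself would populate per task.
object TaskProperties {
  val HADOOP_FILE_NAME = "spark.task.hadoopFileName"
}

// Sketch of the proposed task-scoped property map, kept as plain strings
// so the same model ports easily to Python.
class TaskProperties {
  private val properties = new ConcurrentHashMap[String, String]()

  // Read side, exposed to user code running inside the task.
  def getProperty(key: String): String = properties.get(key)

  // Write side, internal for now: only Spark's RDDs (e.g. HadoopRDD) would set values.
  private[spark] def setProperty(key: String, value: String): Unit = {
    properties.put(key, value)
  }
}
{code}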



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


