spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: setting inputMetrics in HadoopRDD#compute()
Date Sat, 26 Jul 2014 17:47:30 GMT
There is one piece of information that'd be useful to know, which is the
source of the input. Even in the presence of an IOException, the input
metrics still specify that the task is reading from Hadoop.

However, I'm slightly confused by this -- I think we'd usually want to
report the number of bytes actually read, rather than the total input size. For
example, if there is a limit (say, only the first 5 records are read), the
actual number of bytes read is much smaller than the total split size.

Kay, am I mis-interpreting this?
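
To make that distinction concrete, here is a minimal, self-contained sketch
(illustration only, not Spark or Hadoop code; Record and CountingIterator are
made-up names) of tallying bytes as records are actually consumed. With a
limit of 5 records the tally is far smaller than the total input size, which
is what the split length would report:

    object BytesReadSketch {
      // Hypothetical record type carrying a payload and its size in bytes.
      final case class Record(payload: String) {
        def sizeInBytes: Long = payload.getBytes("UTF-8").length.toLong
      }

      // Iterator wrapper that tallies bytes as the consumer pulls records,
      // rather than assuming the whole input is read.
      final class CountingIterator(underlying: Iterator[Record]) extends Iterator[Record] {
        var bytesRead: Long = 0L
        override def hasNext: Boolean = underlying.hasNext
        override def next(): Record = {
          val r = underlying.next()
          bytesRead += r.sizeInBytes
          r
        }
      }

      def main(args: Array[String]): Unit = {
        val allRecords = (1 to 1000).map(i => Record(s"record-$i-payload"))
        // Analogous to split.inputSplit.value.getLength(): the size of the whole input.
        val totalInputSize = allRecords.map(_.sizeInBytes).sum

        val iter = new CountingIterator(allRecords.iterator)
        val firstFive = iter.take(5).toList   // e.g. a query with a limit of 5

        println(s"consumed ${firstFive.size} records")
        println(s"total input size:    $totalInputSize bytes")
        println(s"bytes actually read: ${iter.bytesRead} bytes")
      }
    }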



On Sat, Jul 26, 2014 at 7:42 AM, Ted Yu <yuzhihong@gmail.com> wrote:

> Hi,
> Starting at line 203:
>       try {
>         /* bytesRead may not exactly equal the bytes read by a task: split boundaries aren't
>          * always at record boundaries, so tasks may need to read into other splits to complete
>          * a record. */
>         inputMetrics.bytesRead = split.inputSplit.value.getLength()
>       } catch {
>         case e: java.io.IOException =>
>           logWarning("Unable to get input size to set InputMetrics for task", e)
>       }
>       context.taskMetrics.inputMetrics = Some(inputMetrics)
>
> If there is an IOException, context.taskMetrics.inputMetrics is still set by
> wrapping inputMetrics - as if no error had occurred.
>
> I wonder if the above code should distinguish the error condition.
>
> Cheers
>
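
For reference, one way to surface the error condition Ted is asking about
would be to set taskMetrics.inputMetrics only when the split length could
actually be obtained. A rough sketch, reusing the names from the snippet above
(inputMetrics, split, context, logWarning) rather than the actual HadoopRDD
code:

      // Sketch only: on IOException, leave taskMetrics.inputMetrics unset so that
      // "size unknown" is distinguishable from "0 bytes read".
      val maybeInputMetrics: Option[InputMetrics] =
        try {
          inputMetrics.bytesRead = split.inputSplit.value.getLength()
          Some(inputMetrics)
        } catch {
          case e: java.io.IOException =>
            logWarning("Unable to get input size to set InputMetrics for task", e)
            None
        }
      context.taskMetrics.inputMetrics = maybeInputMetrics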
