spark-issues mailing list archives

From "Yuanjian Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-26225) Scan: track decoding time for row-based data sources
Date Tue, 25 Dec 2018 08:37:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-26225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16728620#comment-16728620 ]

Yuanjian Li commented on SPARK-26225:
-------------------------------------

We define decoding time here as the time the system spends converting data from its
storage format into Spark's 'InternalRow'. I list the relevant decoding source code
below, divided into two parts.
1. Row-based data sources
For row-based data sources, all decoding work happens in the 'buildReader' function, which
overrides FileFormat.buildReader (a per-source instrumentation sketch follows the table).
||Data Source||Decode Logic||Code Link||
|Json-TextInputJsonDataSource|FailureSafeParser.parse|[JsonDataSource.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L231-L232]|
|Json-MultiLineJsonDataSource|FailureSafeParser.parse|[JsonDataSource.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L145]|
|CSV-TextInputCSVDataSource|UnivocityParser.parseIterator|[CSVDataSource.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L105]|
|CSV-MultiLineCSVDataSource|UnivocityParser.parseStream|[CSVDataSource.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L178-L182]|
|Avro|AvroDeserializer.deserialize|[AvroFileFormat.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L238]|
|Text|UnsafeRowWriter.write|[TextFileFormat.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala#L128-L134]|
|ORC-hive|OrcFileFormat.unwrapOrcStructs|[hive/orc/OrcFileFormat.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala#L174-L179]|
|Image|RowEncoder.toRow|[ImageFileFormat.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala#L95]|
|LibSVM|RowEncoder.toRow|[LibSVMRelation.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala#L175-L179]|
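
To make the per-source option concrete, here is a minimal sketch (plain Scala; the 'decodingTimeNs' accumulator is a hypothetical stand-in for a real metric) of instrumenting a single decode call site such as the FailureSafeParser.parse call in the JSON path. Every row in the table above would need an analogous change:
{code:scala}
import java.util.concurrent.atomic.AtomicLong

// Hypothetical per-call-site instrumentation: each source wraps its own
// decode call (FailureSafeParser.parse, UnivocityParser.parseIterator,
// AvroDeserializer.deserialize, ...) like this.
def timedDecode[IN, OUT](record: IN,
                         decode: IN => Iterator[OUT],
                         decodingTimeNs: AtomicLong): Iterator[OUT] = {
  val start = System.nanoTime()
  // Note: if `decode` returns a lazy iterator, only the eager part of the
  // parsing is timed here; the iterator itself would also need wrapping.
  val rows = decode(record)
  decodingTimeNs.addAndGet(System.nanoTime() - start)
  rows
}
{code}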

Instead of handling each scenario separately, we can handle them uniformly by timing FileFormat.buildReader,
if we can accept that initialization work (reader initialization, schema preparation, etc.)
is counted as decoding time. That keeps the code and logic cleaner and minimizes the instrumentation overhead.
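
A minimal sketch of this uniform approach, assuming we wrap the Iterator[InternalRow] that FileFormat.buildReader returns; a plain AtomicLong stands in for the real metric sink (e.g. an SQLMetric):
{code:scala}
import java.util.concurrent.atomic.AtomicLong

// Wraps the Iterator[InternalRow] returned by FileFormat.buildReader and
// charges the time spent in hasNext()/next() to `decodingTimeNs`. The
// wrapping happens in one place, so no per-source changes are needed, but
// reader initialization and schema preparation are counted as decoding.
class TimingIterator[T](underlying: Iterator[T], decodingTimeNs: AtomicLong)
  extends Iterator[T] {

  private def timed[A](body: => A): A = {
    val start = System.nanoTime()
    try body finally decodingTimeNs.addAndGet(System.nanoTime() - start)
  }

  // hasNext is timed too, because lazy sources do part of the decoding there.
  override def hasNext: Boolean = timed(underlying.hasNext)
  override def next(): T = timed(underlying.next())
}
{code}
The scan node would then build the iterator as usual and wrap it once, e.g. new TimingIterator(readerFunc(file), decodingTimeNs), reporting the accumulated value per task.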

2. Column-based data sources

All decoding work is triggered in buildReaderWithPartitionValues (overridden from FileFormat); this
needs to be discussed separately depending on whether batch read mode is enabled or disabled (a per-batch timing sketch follows below).
||Data Source||Batch Read||Decode Logic||Code Link||
|ORC-native|false|OrcDeserializer.deserialize|[datasources/orc/OrcFileFormat.scala\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala#L229-L234]|
|ORC-native|true|Fill column vectors in OrcColumnarBatchReader.nextBatch|[OrcColumnarBatchReader.java\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java#L252-L259]|
|Parquet|false|InternalParquetRecordReader|This code is not in Spark; the decoding work is done in Parquet's RecordMaterializer|
|Parquet|true|Fill column vectors in VectorizedColumnReader.readBatch|[VectorizedParquetRecordReader.java\|https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L259-L262]|

I list the decoding logic of the column-based data sources here in case further work is needed later.
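
For the batch-read rows in the table above, per-record timing does not apply; the natural unit is one nextBatch call. A minimal sketch, with 'nextBatch'/'numRowsInBatch' standing in for the real OrcColumnarBatchReader / VectorizedParquetRecordReader API:
{code:scala}
import java.util.concurrent.atomic.AtomicLong

// Times one whole-batch decode (e.g. OrcColumnarBatchReader.nextBatch or
// VectorizedParquetRecordReader.nextBatch) and counts the rows it produced,
// so an average per-row decoding cost can be derived if needed.
def timedNextBatch(nextBatch: () => Boolean,   // fills the ColumnarBatch
                   numRowsInBatch: () => Int,  // rows in the filled batch
                   decodingTimeNs: AtomicLong,
                   decodedRows: AtomicLong): Boolean = {
  val start = System.nanoTime()
  val hasBatch = nextBatch()
  decodingTimeNs.addAndGet(System.nanoTime() - start)
  if (hasBatch) decodedRows.addAndGet(numRowsInBatch())
  hasBatch
}
{code}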

> Scan: track decoding time for row-based data sources
> ----------------------------------------------------
>
>                 Key: SPARK-26225
>                 URL: https://issues.apache.org/jira/browse/SPARK-26225
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Reynold Xin
>            Priority: Major
>
> Scan node should report decoding time for each record, if it is not too much overhead.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

