spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-14480) Remove meaningless StringIteratorReader for CSV data source for better performance
Date Thu, 23 Feb 2017 00:58:44 GMT

    [ https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879629#comment-15879629
] 

Hyukjin Kwon commented on SPARK-14480:
--------------------------------------

This does not seem to be blocked by any of those [~pes2009k]. I sent a PR for multi-line support:
https://github.com/apache/spark/pull/16976

> Remove meaningless StringIteratorReader for CSV data source for better performance
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-14480
>                 URL: https://issues.apache.org/jira/browse/SPARK-14480
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>             Fix For: 2.1.0
>
>
> Currently, the CSV data source reads and parses CSV data byte by byte rather than line by line.
> In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think it was made like this for better performance. However, there appear to be two problems.
> Firstly, it was actually not faster than processing line by line with the {{Iterator}}, due to the additional logic needed to wrap the {{Iterator}} in a {{Reader}}.
> Secondly, this brought a bit of complexity, because additional logic is needed to allow every line to be read byte by byte. So, it was pretty difficult to figure out parsing issues (e.g. SPARK-14103). Actually, almost all of the code in {{CSVParser}} might not be needed.
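To make the comparison concrete, here is a minimal sketch of the two designs being contrasted. The class and method names below are illustrative only, not the actual Spark implementation: a simplified {{Reader}} wrapper over an {{Iterator[String]}}, versus handing each line to the parser directly.

```scala
import java.io.Reader

// Hypothetical simplified wrapper in the spirit of the one the issue
// describes: exposes an Iterator[String] of lines as a java.io.Reader,
// so a char-oriented parser can consume it. Note the bookkeeping needed
// to re-append newlines and track a position within the current line.
class StringIteratorReader(iter: Iterator[String]) extends Reader {
  private var current: String = ""
  private var pos = 0

  override def read(cbuf: Array[Char], off: Int, len: Int): Int = {
    // Refill from the iterator when the current line is exhausted,
    // restoring the newline the line iterator stripped.
    while (pos >= current.length) {
      if (!iter.hasNext) return -1
      current = iter.next() + "\n"
      pos = 0
    }
    val n = math.min(len, current.length - pos)
    current.getChars(pos, pos + n, cbuf, off)
    pos += n
    n
  }

  override def close(): Unit = ()
}

// The alternative the issue proposes: no wrapper, just parse each line
// as it comes. `split('|')` stands in for parser.parseLine(...).
def parseDirectly(iter: Iterator[String]): Iterator[Array[String]] =
  iter.map(_.split('|'))
```

The wrapper's extra buffering and position tracking is the "additional logic" the description refers to; the direct version has none of it.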
> I made a rough patch and tested this. The test results for the first problem are below:
> h4. Results
> - Original codes with {{Reader}} wrapping {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 14116265034 | 2008277960 |
> - New codes with {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 13451699644 | 1549050564 |
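As a sanity check (not part of the original report), the numbers in the two tables above work out to roughly a 4.7% end-to-end improvement and about a 23% improvement in parse time:

```scala
// Figures copied from the tables above (nanoseconds).
val endToEndBefore = 14116265034L
val endToEndAfter  = 13451699644L
val parseBefore    = 2008277960L
val parseAfter     = 1549050564L

// Percentage improvement of `after` relative to `before`.
def improvementPct(before: Long, after: Long): Double =
  (before - after) * 100.0 / before

println(f"end-to-end: ${improvementPct(endToEndBefore, endToEndAfter)}%.1f%%")
println(f"parse time: ${improvementPct(parseBefore, parseAfter)}%.1f%%")
```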
> In more detail:
> h4. Method
> - The TPC-H lineitem table was used.
> - Only the first 1,000,000 rows are collected ({{df.take(1000000)}}).
> - End-to-end and parsing-time tests were each performed 10 times, and the averages were calculated.
> h4. Environment
> - Machine: MacBook Pro Retina
> - CPU: 4
> - Memory: 8GB
> h4. Dataset
> - [TPC-H|http://www.tpc.org/tpch/] lineitem table created with scale factor 1 ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
> - Size: 724.66 MB
> h4. Test Code
> - Function to measure time
> {code}
> def time[A](f: => A) = {
>   val s = System.nanoTime
>   val ret = f
>   println("time: "+(System.nanoTime-s)/1e6+"ms")
>   ret
> }
> {code}
> - End-to-end test
> {code}
> val path = "lineitem.tbl"
> val df = sqlContext
>       .read
>       .format("csv")
>       .option("header", "false")
>       .option("delimiter", "|")
>       .load(path)
> time(df.take(1000000))
> {code}
> - Parsing time test for original (in {{BulkCsvParser}})
> {code}
> ...
> // `reader` is a wrapper for an Iterator.
> private val reader = new StringIteratorReader(iter)
> parser.beginParsing(reader)
> ...
> time(parser.parseNext())
> ...
> {code}
> - Parsing time test for new (in {{BulkCsvParser}})
> {code}
> ...
> time(parser.parseLine(iter.next()))
> ...
> {code}
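The {{time}} helper above measures a single run, while the Method section averages 10 runs. A hypothetical helper in the same style (my addition, not from the original report) that averages repeated runs could look like:

```scala
// Hypothetical helper: runs the block `n` times and reports the
// average wall-clock time in milliseconds, returning the last result.
def timeAvg[A](n: Int)(f: => A): A = {
  var ret: A = null.asInstanceOf[A]
  var total = 0L
  for (_ <- 1 to n) {
    val s = System.nanoTime
    ret = f
    total += System.nanoTime - s
  }
  println("avg time: " + total / n / 1e6 + "ms")
  ret
}
```

Usage would mirror the end-to-end test above, e.g. `timeAvg(10)(df.take(1000000))`.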



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

