spark-user mailing list archives

From Jacek Laskowski <ja...@japila.pl>
Subject Re: [SQL] Why does spark.read.csv.cache give me a WARN about cache but not text?!
Date Wed, 17 Aug 2016 06:15:44 GMT
Hi Michael,

Thanks a lot for your help. The explain output for csv and text is
below. Do you see anything worth investigating?

scala> spark.read.csv("people.csv").cache.explain(extended = true)
== Parsed Logical Plan ==
Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv

== Analyzed Logical Plan ==
_c0: string, _c1: string, _c2: string, _c3: string
Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv

== Optimized Logical Plan ==
InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, Format: CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c0:string,_c1:string,_c2:string,_c3:string>

== Physical Plan ==
InMemoryTableScan [_c0#39, _c1#40, _c2#41, _c3#42]
   +- InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, Format: CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_c0:string,_c1:string,_c2:string,_c3:string>


scala> spark.read.text("people.csv").cache.explain(extended = true)
== Parsed Logical Plan ==
Relation[value#24] text

== Analyzed Logical Plan ==
value: string
Relation[value#24] text

== Optimized Logical Plan ==
InMemoryRelation [value#24], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *FileScan text [value#24] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>

== Physical Plan ==
InMemoryTableScan [value#24]
   +- InMemoryRelation [value#24], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *FileScan text [value#24] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>

The only "interesting" difference I could find is that the text scan
prints its Format as an object reference (TextFileFormat@262e2c8c),
whereas the csv scan prints Format: CSV. Do you see anything special?
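
For reference, a minimal sketch of how the WARN can be avoided in the csv case. It assumes the warning is logged because each spark.read.csv("people.csv").cache builds a new Dataset whose plan matches an entry already held by CacheManager. It does not explain why the text case stays silent. The object name CacheOnce and the local master setting are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical standalone app; in spark-shell the `spark` session already exists.
object CacheOnce {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cache-once")
      .getOrCreate()

    // Build the Dataset once and keep a reference to the cached instance,
    // instead of calling spark.read.csv(...).cache again for every action.
    val people = spark.read.csv("people.csv").cache()

    people.show() // first action materializes the cache
    people.show() // reuses the same Dataset; cache() is not called a second time

    // Dataset.storageLevel (available from Spark 2.1) reports the persistence level.
    println(people.storageLevel)

    people.unpersist()
    spark.stop()
  }
}
```

Holding on to the cached Dataset means CacheManager is asked to cache the plan only once, so the "Asked to cache already cached data" message should not be triggered by repeated actions.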

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Aug 16, 2016 at 7:24 PM, Michael Armbrust
<michael@databricks.com> wrote:
> try running explain on each of these.  my guess would be caching is broken
> in some cases.
>
> On Tue, Aug 16, 2016 at 6:05 PM, Jacek Laskowski <jacek@japila.pl> wrote:
>>
>> Hi,
>>
>> Can anyone explain why spark.read.csv("people.csv").cache.show ends up
>> with a WARN while spark.read.text("people.csv").cache.show does not?
>> It happens in 2.0 and today's build.
>>
>> scala> sc.version
>> res5: String = 2.1.0-SNAPSHOT
>>
>> scala> spark.read.csv("people.csv").cache.show
>> +---------+---------+-------+----+
>> |      _c0|      _c1|    _c2| _c3|
>> +---------+---------+-------+----+
>> |kolumna 1|kolumna 2|kolumn3|size|
>> |    Jacek| Warszawa| Polska|  40|
>> +---------+---------+-------+----+
>>
>> scala> spark.read.csv("people.csv").cache.show
>> 16/08/16 18:01:52 WARN CacheManager: Asked to cache already cached data.
>> +---------+---------+-------+----+
>> |      _c0|      _c1|    _c2| _c3|
>> +---------+---------+-------+----+
>> |kolumna 1|kolumna 2|kolumn3|size|
>> |    Jacek| Warszawa| Polska|  40|
>> +---------+---------+-------+----+
>>
>> scala> spark.read.text("people.csv").cache.show
>> +--------------------+
>> |               value|
>> +--------------------+
>> |kolumna 1,kolumna...|
>> |Jacek,Warszawa,Po...|
>> +--------------------+
>>
>> scala> spark.read.text("people.csv").cache.show
>> +--------------------+
>> |               value|
>> +--------------------+
>> |kolumna 1,kolumna...|
>> |Jacek,Warszawa,Po...|
>> +--------------------+
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> ----
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>
