spark-user mailing list archives

From Rohit Kumar Prusty <Rohit_Pru...@infosys.com>
Subject RE: After calling persist, why the size in sparkui is not matching with the actual file size
Date Tue, 30 Aug 2016 04:27:20 GMT
Thanks Denis, for your quick response.

My original file is not compressed. It is just a text log file.

Regards
Rohit Kumar Prusty
+91-9884070075

From: Denis Bolshakov [mailto:bolshakov.denis@gmail.com]
Sent: Monday, August 29, 2016 9:03 PM
To: Rohit Kumar Prusty <Rohit_Prusty@infosys.com>
Cc: user@spark.apache.org
Subject: Re: After calling persist, why the size in sparkui is not matching with the actual file size

Hello,

Spark uses Snappy by default. Is your original file compressed?
Also, it keeps data in its own representation format (column-based), which is not the same as text.
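
To illustrate the point about representation: the number of bytes a cache holds depends on how the records are serialized and whether they are compressed, so it need not match the size of the text file. The sketch below is plain Python, not Spark. It uses pickle and zlib from the standard library as stand-ins (Snappy is not in the standard library), and both the log lines and the size_report helper are made up for illustration; it does not reproduce Spark's actual storage format.

```python
import pickle
import zlib


def size_report(lines):
    """Return byte sizes of the same lines in three representations."""
    raw = "\n".join(lines).encode("utf-8")      # what a text file would hold
    pickled = pickle.dumps(lines, protocol=2)   # one possible serialized form
    compressed = zlib.compress(pickled)         # zlib as a stand-in codec
    return {
        "raw_text": len(raw),
        "pickled": len(pickled),
        "compressed": len(compressed),
    }


# Hypothetical, repetitive log lines standing in for a real log file.
lines = [
    "2016-08-29 16:52:%02d %s request handled"
    % (i % 60, "ERROR" if i % 5 == 0 else "INFO")
    for i in range(1000)
]

report = size_report(lines)
for name, size in sorted(report.items()):
    print("%-10s %d bytes" % (name, size))
```

The three numbers differ even though the content is identical, which is why a size reported for a stored representation is not directly comparable to the size of the file on disk.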

Best regards,
Denis

On 29 August 2016 at 16:52, Rohit Kumar Prusty <Rohit_Prusty@infosys.com> wrote:
Hi Team,
I am new to Spark and have a basic question. After calling persist, why does the size shown
in the Spark UI not match the actual file size?

Actual file size for “/user/rohit_prusty/application2.log” – 39 KB

Code snippet:
===========
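# Assumes the PySpark shell, where `sc` (the SparkContext) is already defined.
logData = sc.textFile("/user/rohit_prusty/application2.log")
logData.persist()   # marks the RDD for caching; nothing is stored yet
logData.count()     # first action, which materializes the cache
errors = logData.filter(lambda line: "ERROR" in line)
errors.persist()
errors.count()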

Output in SparkUI
==============
logData RDD takes 2.1 KB
errors RDD takes 1.3 KB

Regards
Rohit Kumar Prusty
+91-9884070075




--
//with Best Regards
--Denis Bolshakov
e-mail: bolshakov.denis@gmail.com