spark-user mailing list archives

From neeraj <neeraj_gar...@infosys.com>
Subject Re: Help required on exercise Data Exploratin using Spark SQL
Date Fri, 17 Oct 2014 10:32:31 GMT
Hi,

When I run the Spark SQL commands given in the exercise, they return
unexpected results. I'm describing the results below for quick reference:
1. The query wikiData.count() returns a row count for the file.

2. After running the following query:
sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE
username <> '' GROUP BY username ORDER BY cnt DESC LIMIT
10").collect().foreach(println)

I get output like the below. Only the last few lines of the output are shown
here. It doesn't show the actual results of the query. I tried increasing the
driver memory as suggested in the exercise; however, it doesn't work. The
output is almost the same.
14/10/17 15:29:39 INFO executor.Executor: Finished task 199.0 in stage 2.0
(TID 401). 2170 bytes result sent to driver
14/10/17 15:29:39 INFO executor.Executor: Finished task 198.0 in stage 2.0
(TID 400). 2170 bytes result sent to driver
14/10/17 15:29:39 INFO scheduler.TaskSetManager: Finished task 198.0 in
stage 2.0 (TID 400) in 13 ms on localhost (199/200)
14/10/17 15:29:39 INFO scheduler.TaskSetManager: Finished task 199.0 in
stage 2.0 (TID 401) in 10 ms on localhost (200/200)
14/10/17 15:29:39 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0,
whose tasks have all completed, from pool
14/10/17 15:29:39 INFO scheduler.DAGScheduler: Stage 2 (takeOrdered at
basicOperators.scala:171) finished in 1.296 s
14/10/17 15:29:39 INFO spark.SparkContext: Job finished: takeOrdered at
basicOperators.scala:171, took 3.150021634 s

3. I tried some other Spark SQL commands as below:
*sqlContext.sql("SELECT username FROM wikiData LIMIT
10").collect().foreach(println)*
*output is*:
[[B@787cf559]
[[B@53cfe3db]
[[B@757869d9]
[[B@346d61cf]
[[B@793077ec]
[[B@5d11651c]
[[B@21054100]
[[B@5fee77ef]
[[B@21041d1d]
[[B@15136bda]


*sqlContext.sql("SELECT * FROM wikiData LIMIT
10").collect().foreach(println)*
*output is*:
[12140913,[B@1d74e696,1394582048,[B@65ce90f5,[B@5c8ef90a]
[12154508,[B@2e802eff,1393177457,[B@618d7f32,[B@1099dda7]
[12165267,[B@65a70774,1398418319,[B@38da84cf,[B@12454f32]
[12184073,[B@45264fd,1395243737,[B@3d642042,[B@7881ec8a]
[12194348,[B@19d095d5,1372914018,[B@4d1ce030,[B@22c296dd]
[12212394,[B@153e98ff,1389794332,[B@40ae983e,[B@68d2f9f]
[12224899,[B@1f317315,1396830262,[B@677a77b2,[B@19487c31]
[12240745,[B@65d181ee,1389890826,[B@1da9647b,[B@5c03d673]
[12258034,[B@7ff44736,1385050943,[B@7e6f6bda,[B@4511f60f]
[12279301,[B@1e317636,1382277991,[B@4147e2b6,[B@56753c35]
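
Editorial note on the values above: a token such as [B@1d74e696 is how the JVM prints a byte array by default ("[B" is the array-of-bytes type tag, followed by a hash), so these columns appear to come back as raw binary rather than as strings. The sketch below is plain Scala with no Spark dependency; the object name BinaryColumnDemo, the decode helper, and the sample value are hypothetical, and it assumes the bytes hold UTF-8 text, which the post does not confirm.

```scala
import java.nio.charset.StandardCharsets

object BinaryColumnDemo {
  // Decode a raw byte array as text. UTF-8 is an assumption here;
  // the actual encoding of the Parquet column is not stated in the post.
  def decode(bytes: Array[Byte]): String =
    new String(bytes, StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit = {
    // Hypothetical value standing in for one binary `username` cell.
    val cell: Array[Byte] = "example_user".getBytes(StandardCharsets.UTF_8)

    // Default toString of a JVM byte array: "[B@" followed by a hex hash.
    println(cell.toString)

    // The same bytes decoded into readable text.
    println(decode(cell))
  }
}
```

In the Spark shell, the same decoding could be applied to each collected row, for example by casting the binary field to Array[Byte] before printing. The Spark SQL documentation also describes a spark.sql.parquet.binaryAsString setting for Parquet files whose string columns are stored as binary; whether it is available depends on the Spark version in use, so treat that as something to check rather than a confirmed fix.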

I'm sure the above output of the queries is not the actual content of the
Parquet file. I'm not able to read the content of the Parquet file directly.

How can I validate the output of these queries against the actual content of
the Parquet file?
What is the workaround for this issue?
How can I read the file through Spark SQL?
Is there a need to change the queries? What changes can be made to the
queries to get the exact result?

Regards,
Neeraj



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Help-required-on-exercise-Data-Exploratin-using-Spark-SQL-tp16569p16673.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
