spark-user mailing list archives

From Nicolas Paris <nipari...@gmail.com>
Subject Re: Hive From Spark: Jdbc VS sparkContext
Date Sun, 15 Oct 2017 14:07:21 GMT
Hi Gourav

> what if the table has partitions and sub-partitions? 

Well, this also works with multiple ORC files sharing the same schema:
val people = sqlContext.read.format("orc").load("hdfs://cluster/people*")
Am I missing something?
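
For partitioned tables, the same direct-read approach can also recover the partition columns. A sketch (the path and partition layout here are assumptions, following the earlier example; requires a running Spark with HDFS access): the "basePath" option tells Spark where the Hive-style directory layout starts, so year=.../month=... directories become columns:

```scala
// Hedged sketch: reading a subset of a Hive-style partitioned ORC layout.
// "basePath" makes Spark treat year=.../month=... directories as partition
// columns instead of opaque path fragments.
val people = sqlContext.read
  .format("orc")
  .option("basePath", "hdfs://cluster/orc_people")
  .load("hdfs://cluster/orc_people/year=2017/*")
// people now carries "year" (and any sub-partition) as regular columns,
// and only the matching directories are scanned.
```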

> And you do not want to access the entire data?

This works for static datasets, or when new data arrives through batch
processes; the Spark application then has to reload the folder to pick
up the new files.
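
One way around that constraint (a sketch, assuming the path and view name from the example below; a real deployment needs a running Spark context) is simply to re-run the read and re-register the view whenever a batch lands:

```scala
// Hedged sketch: re-reading the folder so files added since the last load
// become visible. The path and view name are assumptions for illustration.
def reloadPeople(sqlContext: org.apache.spark.sql.SQLContext): Unit = {
  val people = sqlContext.read.format("orc").load("hdfs://cluster/orc_people")
  people.createOrReplaceTempView("people") // replaces the stale view in place
}
```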


>> On Sun, Oct 15, 2017 at 12:55 PM, Nicolas Paris <niparisco@gmail.com> wrote:
> 
>     Le 03 oct. 2017 à 20:08, Nicolas Paris écrivait :
>     > I wonder the differences accessing HIVE tables in two different ways:
>     > - with jdbc access
>     > - with sparkContext
> 
>     Well there is also a third way to access the hive data from spark:
>     - with direct file access (here ORC format)
> 
> 
>     For example:
> 
>     val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>     sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
>     val people = sqlContext.read.format("orc").load("hdfs://cluster//orc_people")
>     people.createOrReplaceTempView("people")
>     sqlContext.sql("SELECT count(1) FROM people WHERE ...").show()
> 
> 
>     This method looks much faster than both:
>     - with jdbc access
>     - with sparkContext
> 
>     Any experience with that?
> 
> 
>     ---------------------------------------------------------------------
>     To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> 
> 
> 
