spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Spark - “min key = null, max key = null” while reading ORC file
Date Mon, 20 Jun 2016 05:00:21 GMT
Hi,

To start with, when you store the data in the ORC file, can you verify that
the data is there?

For example, register it as a temp table:

processDF.registerTempTable("tmp")
sql("select count(1) from tmp").show

Also, what do you mean by an index file in ORC?

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com



On 20 June 2016 at 05:01, Mohanraj Ragupathiraj <mohanaugust@gmail.com>
wrote:

> I am trying to join a DataFrame (say 100 records) with an ORC file with 500
> million records through Spark (this can increase to 4-5 billion records, 25
> bytes each).
>
> I am using the Spark HiveContext API.
>
> *ORC File Creation Code*
>
> //fsdtRdd is JavaRDD, fsdtSchema is StructType schema
> DataFrame fsdtDf = hiveContext.createDataFrame(fsdtRdd,fsdtSchema);
> fsdtDf.write().mode(SaveMode.Overwrite).orc("orcFileToRead");
>
> *ORC File Reading Code*
>
> HiveContext hiveContext = new HiveContext(sparkContext);
> DataFrame orcFileData= hiveContext.read().orc("orcFileToRead");
> // allRecords is dataframe
> DataFrame processDf = allRecords.join(orcFileData, allRecords.col("id").equalTo(orcFileData.col("id")), "left_outer");
> processDf.show();
>
> When I read the ORC file, I get the following in my Spark logs:
>
> Input split: file:/C:/AOD_PID/PVP.vincir_frst_seen_tran_dt_ORC/part-r-00024-b708c946-0d49-4073-9cd1-5cc46bd5972b.orc:0+3163348 *min key = null, max key = null*
> Reading ORC rows from file:/C:/AOD_PID/PVP.vincir_frst_seen_tran_dt_ORC/part-r-00024-b708c946-0d49-4073-9cd1-5cc46bd5972b.orc with {include: [true, true, true], offset: 0, length: 9223372036854775807}
> Finished task 55.0 in stage 2.0 (TID 59). 2455 bytes result sent to driver
> Starting task 56.0 in stage 2.0 (TID 60, localhost, partition 56, PROCESS_LOCAL, 2220 bytes)
> Finished task 55.0 in stage 2.0 (TID 59) in 5846 ms on localhost (56/84)
> Running task 56.0 in stage 2.0 (TID 60)
>
> Although the Spark job completes successfully, I think it is not able to
> utilize the ORC index capability and thus scans through the entire block of
> ORC data before moving on.
>
> *Question*
>
> -- Is this normal behaviour, or do I have to set some configuration before
> saving the data in ORC format?
>
> -- If it is *NORMAL*, what is the best way to join so that non-matching
> records are discarded at the disk level (ideally so that only the index
> portion of the ORC data is loaded)?
>
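For what it is worth, with only ~100 records on one side, a broadcast join is the usual way to avoid shuffling the large ORC side. The following is a sketch only (not from the original thread), assuming Spark 1.5+ where org.apache.spark.sql.functions.broadcast exists, and using an inner join since the stated goal is to discard non-matching records; "allRecords" and "hiveContext" are taken from the code above.

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;
import static org.apache.spark.sql.functions.broadcast;

// Sketch (Spark 1.5/1.6-era API): broadcast the small DataFrame
// (allRecords, ~100 rows) to every executor so the 500M-row ORC side is
// streamed locally and never shuffled; non-matching ORC rows are dropped
// as each split is scanned.
DataFrame orcFileData = hiveContext.read().orc("orcFileToRead");
DataFrame processDf = orcFileData.join(
        broadcast(allRecords),
        orcFileData.col("id").equalTo(allRecords.col("id")));
processDf.show();
```

Note that this only removes the shuffle; whether the ORC reader can also skip stripes via its min/max statistics is a separate question, since predicate pushdown applies to filter predicates rather than join keys.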
