spark-user mailing list archives

From Mohanraj Ragupathiraj <>
Subject Spark - “min key = null, max key = null” while reading ORC file
Date Mon, 20 Jun 2016 04:01:42 GMT
I am trying to join a DataFrame (say 100 records) with an ORC file of 500
million records through Spark (the ORC side can grow to 4-5 billion records,
25 bytes each).

I used the Spark HiveContext API.

*ORC File Creation Code*

// fsdtRdd is a JavaRDD<Row>, fsdtSchema is a StructType schema
DataFrame fsdtDf = hiveContext.createDataFrame(fsdtRdd, fsdtSchema);
// write the DataFrame out as ORC (the path below matches the read side)
fsdtDf.write().format("orc").save("orcFileToRead");
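
For reference, a minimal sketch of sorting on the join key before the
write, so that each ORC stripe carries a narrow, selective min/max range
for that column (the column name "tran_key" is only illustrative):

// a sketch, not what I ran: sort on the join key before writing so the
// per-stripe ORC min/max statistics become selective for that column
DataFrame sortedDf = fsdtDf.sort("tran_key"); // "tran_key" is illustrative
sortedDf.write().format("orc").save("orcFileToRead");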

*ORC File Reading Code*

HiveContext hiveContext = new HiveContext(sparkContext);
DataFrame orcFileData = hiveContext.read().format("orc").load("orcFileToRead");
// allRecords is the small DataFrame (~100 records); the join column name is illustrative
DataFrame processDf = allRecords.join(orcFileData,
    allRecords.col("tran_key").equalTo(orcFileData.col("tran_key")));

When I read the ORC file, I get the following in my Spark logs:

Input split: file:/C:/AOD_PID/PVP.vincir_frst_seen_tran_dt_ORC/part-r-00024-b708c946-0d49-4073-9cd1-5cc46bd5972b.orc:0+3163348
*min key = null, max key = null*
Reading ORC rows from
with {include: [true, true, true], offset: 0, length:
Finished task 55.0 in stage 2.0 (TID 59). 2455 bytes result sent to driver
Starting task 56.0 in stage 2.0 (TID 60, localhost, partition 56, PROCESS_LOCAL, 2220 bytes)
Finished task 55.0 in stage 2.0 (TID 59) in 5846 ms on localhost (56/84)
Running task 56.0 in stage 2.0 (TID 60)

Although the Spark job completes successfully, I think it is not able to
utilize the ORC index capability and therefore scans through the entire
block of ORC data before moving on.


-- Is this normal behaviour, or do I have to set some configuration before
saving the data in ORC format?
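
For what it's worth, the only related setting I know of is
spark.sql.orc.filterPushdown, which is off by default in Spark 1.x; a
minimal sketch (the filter column and value are illustrative):

// a sketch: enable ORC predicate pushdown before reading
hiveContext.setConf("spark.sql.orc.filterPushdown", "true");
// pushdown only applies when the query carries a filter on the column;
// I believe a bare join does not push a SearchArgument into the ORC reader
DataFrame filtered = orcFileData.filter(
    orcFileData.col("tran_key").equalTo("someKeyValue"));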

-- If it is *NORMAL*, what is the best way to do the join so that
non-matching records are discarded at the disk level (ideally so that only
the index portion of the ORC data is loaded)?
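
For illustration, a minimal sketch of the join I have in mind, broadcasting
the small side so the large ORC side is not shuffled (this only changes the
join strategy, it does not by itself skip ORC stripes; the column name is
again illustrative):

import static org.apache.spark.sql.functions.broadcast;

// a sketch: broadcast the ~100-record DataFrame to every executor so the
// 500-million-row ORC side is streamed through a hash join, not shuffled
DataFrame joinedDf = orcFileData.join(broadcast(allRecords),
    orcFileData.col("tran_key").equalTo(allRecords.col("tran_key")));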
