spark-user mailing list archives

From Michael Segel <msegel_had...@hotmail.com>
Subject Re: Reading Hive RCFiles?
Date Mon, 29 Jan 2018 15:32:37 GMT
Just to follow up…

I was able to create an RDD from the file. However, diving into the RDD is a bit weird,
and I’m working through it. My test file seems to be one block of about 3K rows. When I tried
to get the first column of the first row, I ended up getting the values of the first column
for all of the rows, comma delimited. The other issue is converting numeric fields back
from their byte representation. I have the schema, so I can do that. (This is also an issue
with RCFileCat (sorry if I messed that name up…); things work great if you’re using strings only.)
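
A minimal sketch of the unpacking step described above, in plain Python rather than Spark. It assumes each column arrives as a single comma-delimited byte string whose fields are text-encoded; a binary SerDe would need real byte unpacking instead of `int()` and `float()`. The schema, the sample data, and the names `SCHEMA`, `split_column`, and `columns_to_rows` are all hypothetical and not taken from the RCFile or Spark APIs.

```python
# Hypothetical schema: (column name, converter applied to each decoded field).
SCHEMA = [("id", int), ("name", str), ("price", float)]


def split_column(blob, sep=b","):
    """Split one column's comma-delimited payload into per-row raw fields."""
    return blob.split(sep)


def columns_to_rows(column_blobs, schema=SCHEMA):
    """Turn per-column byte blobs into a list of typed row dicts.

    Assumes every field is UTF-8 text (an assumption, not a property of
    every RCFile) and that every column holds the same number of rows.
    """
    if len(column_blobs) != len(schema):
        raise ValueError("expected one blob per schema column")

    typed_columns = []
    for blob, (_name, convert) in zip(column_blobs, schema):
        # Decode the raw bytes, then apply the schema's converter.
        fields = [convert(raw.decode("utf-8")) for raw in split_column(blob)]
        typed_columns.append(fields)

    if len({len(col) for col in typed_columns}) > 1:
        raise ValueError("columns have different row counts")

    names = [name for name, _ in schema]
    # zip(*columns) transposes column-major data into rows.
    return [dict(zip(names, values)) for values in zip(*typed_columns)]


rows = columns_to_rows([b"1,2,3", b"a,b,c", b"9.5,10,0.25"])
print(rows[0])
```

The transpose is what recovers "the first column of the first row": the first element of the first column, once that column has been split.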

I guess this could be the start of a project (time permitting) to make reading older
file formats as easy as reading Parquet and ORC files.

Will have to follow up in Dev.

Thanks everyone for the pointers.


On Jan 20, 2018, at 5:55 PM, Jörn Franke <jornfranke@gmail.com> wrote:

Forgot to add the mailing list.

On 18. Jan 2018, at 18:55, Jörn Franke <jornfranke@gmail.com> wrote:

Well, you can use:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopRDD-org.apache.hadoop.mapred.JobConf-java.lang.Class-java.lang.Class-java.lang.Class-int-

with the following inputformat:
https://hive.apache.org/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/io/RCFileInputFormat.html

(Note: the version of the Javadoc does not matter; this has been possible for a long time.)

Writing works similarly, with a PairRDD and RCFileOutputFormat.

On Thu, Jan 18, 2018 at 5:02 PM, Michael Segel <msegel_hadoop@hotmail.com> wrote:
No idea on how that last line of garbage got in the message.


> On Jan 18, 2018, at 9:32 AM, Michael Segel <msegel_hadoop@hotmail.com> wrote:
>
> Hi,
>
> I’m trying to find out if there’s a simple way for Spark to be able to read an RCFile.
>
> I know I can create a table in Hive, then drop the files into that directory and use
> a SQL context to read the file from Hive; however, I wanted to read the file directly.
>
> Not a lot of details to go on… even the Apache site’s links are broken.
> See:
> https://cwiki.apache.org/confluence/display/Hive/RCFile
>
> Then try to follow the Javadoc link.
>
>
> Any suggestions?
>
> Thx
>
> -Mike
>
>

