Just to follow up…
I was able to create an RDD from the file. Digging into the RDD is a bit weird, though, and I’m still working through it. My test file seems to be a single block of about 3K rows, so when I tried to get the first column of the first row, I got back the first column of every row, comma-delimited. The other issue is converting numeric fields back from their byte representation. I have the schema, so I can do that. (This is also an issue with RCFileCat (sorry if I messed that name up…): things work great if you’re only using strings.)
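As a rough illustration of the two issues above (pulling a single cell out of a column buffer that holds many rows, then converting it using the schema), here is a minimal Java sketch. It does not use the Hive or Spark APIs: the buffer layout, the offsets, the `ColType` enum, and the `decodeCell` helper are all hypothetical stand-ins, and it assumes the values are stored as UTF-8 text. A file written with a binary SerDe would need different decoding.

```java
import java.nio.charset.StandardCharsets;

final class ColumnSliceDecoder {
    /** Hypothetical column types, standing in for the real table schema. */
    public enum ColType { STRING, INT, LONG, DOUBLE }

    /**
     * Decode one cell from a buffer shared by many rows.
     * Only the bytes in [start, start + length) belong to this cell;
     * reading the whole buffer would return every row of the column.
     * Assumption: the cell is UTF-8 text, not a binary encoding.
     */
    public static Object decodeCell(byte[] buf, int start, int length, ColType type) {
        String text = new String(buf, start, length, StandardCharsets.UTF_8);
        switch (type) {
            case INT:    return Integer.parseInt(text);
            case LONG:   return Long.parseLong(text);
            case DOUBLE: return Double.parseDouble(text);
            default:     return text;
        }
    }

    public static void main(String[] args) {
        // Made-up column buffer holding three rows of one numeric column.
        byte[] column = "10,20,30".getBytes(StandardCharsets.UTF_8);

        // The whole buffer: all rows of the column at once.
        System.out.println(new String(column, StandardCharsets.UTF_8));

        // Just the first row's cell, converted according to the schema.
        System.out.println(decodeCell(column, 0, 2, ColType.INT));
    }
}
```

The point of the sketch is only that the slice bounds and the schema type both have to be applied per cell; where those bounds come from in a real reader depends on the input format's record type.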
I guess this could be the start of a project (time permitting) to make reading older file formats as easy as reading Parquet and ORC files.
Will have to follow up in Dev.
Thanks everyone for the pointers.
Forgot to add the mailing list.
Well, you can use:
with the following inputformat:
(Note that the version of the Javadoc does not matter; this has been possible for a long time.)
Writing works similarly, with a PairRDD and RCFileOutputFormat.