spark-user mailing list archives

From Bob Tiernay <btier...@hotmail.com>
Subject RE: Directory / File Reading Patterns
Date Sun, 18 Jan 2015 15:51:04 GMT
Also, I used the following pattern to extract information from a file path and add it to the
output of a transformation:
https://gist.github.com/btiernay/1ad5e3dea08904fe07d9
You may find it useful as well.

Cheers,
Bob

From: btiernay@hotmail.com
To: sowen@cloudera.com; snunez@hortonworks.com
CC: user@spark.apache.org
Subject: RE: Directory / File Reading Patterns
Date: Sun, 18 Jan 2015 15:41:53 +0000

You may also want to keep an eye on SPARK-5182 / SPARK-5302, which may help if you are using
Spark SQL. Note that this is already possible with HiveContext today.

Cheers,
Bob

Date: Sun, 18 Jan 2015 08:47:06 +0000
Subject: Re: Directory / File Reading Patterns
From: sowen@cloudera.com
To: snunez@hortonworks.com
CC: user@spark.apache.org

I think that putting part of the data only in a filename is an anti-pattern, but sometimes we
have to play these where they lie.
You can list all the directory paths containing the CSV files, map each of them to an RDD with
textFile, transform the RDDs to include the info from the path, and then simply union them.
Performance should be fine, too.
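
A minimal sketch of the list / textFile / transform / union approach described above, written
in Python. It is an illustration under assumptions, not code from this thread: the helper names
(parse_trade_path, tag_lines, load_trades) are hypothetical, the layout is assumed to be
<bank>/<trader>/Day-<n>.csv as in the original question, and it reads one RDD per file rather
than per directory so that the day can be recovered from the filename. `sc` is an existing
SparkContext and `paths` is a list of file paths you have already collected.

```python
import posixpath


def parse_trade_path(path):
    """Split '.../<bank>/<trader>/Day-<n>.csv' into (bank, trader, day).

    Assumes the last three path components are bank, trader and file name.
    """
    directory, filename = posixpath.split(path.rstrip("/"))
    directory, trader = posixpath.split(directory)
    bank = posixpath.basename(directory)
    day = posixpath.splitext(filename)[0]  # 'Day-1.csv' -> 'Day-1'
    return bank, trader, day


def tag_lines(sc, path):
    """Read one CSV file and attach the fields taken from its path to every line."""
    bank, trader, day = parse_trade_path(path)
    return sc.textFile(path).map(lambda line: (bank, trader, day, line))


def load_trades(sc, paths):
    """Build one RDD per file, then union them into a single RDD."""
    return sc.union([tag_lines(sc, p) for p in paths])
```

Each record here is (bank, trader, day, line). Grouping by the first three fields would be one
way to get closer to the (bank, trader, day, <list-of-trades>) shape asked for below.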
On Jan 17, 2015 11:48 PM, "Steve Nunez" <snunez@hortonworks.com> wrote:

Hello Users,

I’ve got a real-world use case that seems common enough that its pattern would be documented
somewhere, but I can’t find any references to a simple solution. The challenge is that data
is getting dumped into a directory structure, and that directory structure itself contains
features that I need in my model. For example:

bank_code
    Trader
        Day-1.csv
        Day-2.csv
        …

Each CSV file contains a list of all the trades made by that individual each day. The problem
is that the bank & trader should be part of the feature set, i.e., we need the RDD to look
like:
(bank, trader, day, <list-of-trades>)

Anyone got any elegant solutions for doing this?



Cheers,
- SteveN