spark-user mailing list archives

From Eric Bell <>
Subject Re: Spark newbie desires feedback on first program
Date Mon, 16 Feb 2015 21:42:29 GMT
Thanks Charles. I just realized a few minutes ago that I neglected to 
show the step where I generated the key on the person ID. Thanks for the 
pointer on the HDFS URL.

Next step is to process data from multiple RDDs. My data originates from 
7 tables in a MySQL database. I used sqoop to create avro files from 
these tables, and in turn created RDDs using SparkSQL from the avro 
files. Since the groupByKey only operates on a single RDD, I'm not quite 
sure yet how I'm going to process 7 tables as a transformation to get 
all the data I need into my objects.
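One common pattern for this (not something from the thread, just a sketch) is to key each of the seven RDDs on the person ID and then combine them with `join`, or with `cogroup` (PySpark's `groupWith` accepts several RDDs at once) when the tables are one-to-many. The grouping logic itself looks like this in plain Python; the `demographics` and `visits` datasets are made-up stand-ins for two of the seven sqoop-ed tables:

```python
from collections import defaultdict

def cogroup(*keyed_datasets):
    """Plain-Python sketch of Spark's cogroup/groupWith: for each key,
    collect one list of values per input dataset."""
    merged = defaultdict(lambda: tuple([] for _ in keyed_datasets))
    for i, dataset in enumerate(keyed_datasets):
        for key, value in dataset:
            merged[key][i].append(value)
    return dict(merged)

# Hypothetical stand-ins for two tables, each already keyed on person ID.
demographics = [(1, {"name": "Ann"}), (2, {"name": "Bob"})]
visits = [(1, {"visit": "2015-01-03"}), (1, {"visit": "2015-02-01"})]

people = cogroup(demographics, visits)
# person 1 gets one demographics row and two visit rows;
# person 2 gets one demographics row and an empty visits list
```

In PySpark the equivalent would be something like `rdd1.groupWith(rdd2, rdd3, ...)` on seven keyed RDDs, followed by a map that builds the Person object from the per-table value lists.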

I'm vacillating on whether I should be doing it this way, or if it 
would be a lot simpler to query MySQL to get all the Person IDs, 
parallelize them, and have my Person class make queries directly to the 
MySQL database. Since in theory I only have to do this once, I'm not 
sure there's much to be gained in moving the data from MySQL to Spark first.
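If you do go the direct-query route, the usual trick is `mapPartitions`, so each partition of person IDs shares one MySQL connection rather than opening one per ID. Here is a rough sketch of that shape, run locally over a single "partition"; `fetch_person` and the dict standing in for a connection are placeholders for the real MySQLdb/connector calls:

```python
def fetch_person(conn, pid):
    # Hypothetical placeholder for the real per-table MySQL queries.
    return ["row-for-%d" % pid]

def load_partition(person_ids):
    """Sketch of a mapPartitions function: open one connection per
    partition, query each person ID, yield assembled records."""
    conn = {"open": True}  # stand-in for MySQLdb.connect(...)
    try:
        for pid in person_ids:
            yield {"person_id": pid, "rows": fetch_person(conn, pid)}
    finally:
        conn["open"] = False  # conn.close() in real code

# In Spark this would be sc.parallelize(ids, n).mapPartitions(load_partition);
# here it just runs over one partition locally.
records = list(load_partition([1, 2, 3]))
```

Whether this beats the sqoop/avro route probably comes down to how well MySQL handles the concurrent queries from the executors; for a one-time load it may well be simpler.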

I have yet to find any non-trivial examples of ETL logic on the web ... 
it seems like it's mostly word-count map-reduce replacements.

On 02/16/2015 01:32 PM, Charles Feduke wrote:
> I cannot comment about the correctness of Python code. I will assume 
> your caper_kv is keyed on something that uniquely identifies all the 
> rows that make up the person's record so your group by key makes 
> sense, as does the map. (I will also assume all of the rows that 
> comprise a single person's record will always fit in memory. If not 
> you will need another approach.)
> You should be able to get away with removing the "localhost:9000" from 
> your HDFS URL, i.e., "hdfs:///sma/processJSON/people" and let your 
> HDFS configuration for Spark supply the missing pieces.
