Hi, I am a beginner too, but from what I have learned, Hadoop works better with big files: at least 64 MB or 128 MB (the common HDFS block sizes), or even more. I think you need to aggregate all the files into one new big file. Then you must copy it to HDFS using a command such as:

hadoop fs -put MYFILE <hdfs-destination>

This just copies MYFILE into the Hadoop Distributed File System (HDFS).
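
To show why one big file tends to suit Hadoop better than many small ones, here is a rough sketch of the arithmetic. It is plain Java, not Hadoop code. As far as I understand, HDFS cuts each file into fixed-size blocks and spreads those blocks across the servers, and even a tiny file takes up at least one block entry. The class name BlockCount, its methods, and the file sizes are made up for illustration. The 128 MB block size is an assumption (the default in Hadoop 2.x and later, configurable through dfs.blocksize).

```java
// A sketch only: estimates how many HDFS blocks a set of files would occupy.
// BlockCount and its methods are hypothetical helpers, not part of Hadoop.
class BlockCount {
    // Assumed block size: 128 MB (Hadoop 2.x+ default; set by dfs.blocksize).
    static final long BLOCK_BYTES = 128L * 1024 * 1024;

    // Blocks needed for one file: ceiling of fileBytes / blockBytes.
    // An empty file needs no blocks.
    static long blocksFor(long fileBytes, long blockBytes) {
        if (fileBytes <= 0) {
            return 0;
        }
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Total blocks for a list of files; each file is split on its own,
    // so blocks are never shared between files.
    static long totalBlocks(long[] fileSizes, long blockBytes) {
        long total = 0;
        for (long size : fileSizes) {
            total += blocksFor(size, blockBytes);
        }
        return total;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;

        // 1000 small files of 1 MB each (illustrative sizes).
        long[] smallFiles = new long[1000];
        java.util.Arrays.fill(smallFiles, 1 * mb);

        // The same data aggregated into one 1000 MB file.
        long[] oneBigFile = { 1000 * mb };

        System.out.println("1000 files of 1 MB: "
                + totalBlocks(smallFiles, BLOCK_BYTES) + " blocks");
        System.out.println("1 file of 1000 MB: "
                + totalBlocks(oneBigFile, BLOCK_BYTES) + " blocks");
    }
}
```

Under these assumptions the same 1000 MB of data ends up as 1000 blocks when stored as small files, but only 8 blocks when stored as one file, so there is far less for the cluster to keep track of.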

Can I recommend what I did? Go to BigDataUniversity.com and take the Hadoop Fundamentals I course. It is free and very well documented.


I am sorry for the beginner's question, but...
I have Spark Java code that reads a file (c:\my-input.csv), processes it, and writes an output file (my-output.csv).
Now I want to run it on Hadoop in a distributed environment.
1) Should my input file be one big file or several smaller files?
2) If we use smaller files, how does my code need to change to process all of the input files?

Will Hadoop just copy the files to different servers or will it also split their content among servers?

Any example would be great!
