spark-user mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: Setting up Spark/flume/? to Ingest 10TB from FTP
Date Mon, 17 Aug 2015 17:35:17 GMT

With the right FTP client JAR on your classpath (I forget which), you can use ftp:// as a source
for a Hadoop FS operation. You may even be able to use it as an input for a Spark (non-streaming)
job directly.
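
For example, a minimal sketch of that second idea, assuming Hadoop's FTPFileSystem (backed by
commons-net) is reachable on the classpath; the host, credentials, and paths below are all
placeholder values:

  import org.apache.spark.{SparkConf, SparkContext}

  object FtpToHdfs {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("ftp-ingest"))
      // Read straight off the FTP server as a Hadoop-compatible filesystem...
      val lines = sc.textFile("ftp://user:password@ftp.example.com/data/bigfile.csv")
      // ...and write the result into HDFS.
      lines.saveAsTextFile("hdfs:///ingest/bigfile")
      sc.stop()
    }
  }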


On 14 Aug 2015, at 14:11, Varadhan, Jawahar <varadhan@yahoo.com.INVALID>
wrote:

Thanks Marcelo. But our problem is a little more complicated.

We have 10+ FTP sites that we will be transferring data from. The FTP server info, filename,
and credentials all arrive via Kafka messages. So I want to read those Kafka messages, dynamically
connect to each FTP site, download those fat files, and store them in HDFS.

And hence, I was planning to use Spark Streaming with Kafka, or Flume with Kafka. But Flume
runs in a JVM and may not be the best option, as a huge file could create memory issues. Please
suggest some way to run this inside the cluster.
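
One way to keep memory bounded is to stream each file from the FTP socket straight into HDFS
instead of buffering it in the JVM. A minimal sketch of that idea, assuming Spark 1.x with the
spark-streaming-kafka artifact on the classpath; the broker address, topic name, comma-separated
message layout, and HDFS target directory are all hypothetical:

  import kafka.serializer.StringDecoder
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.hadoop.io.IOUtils
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  object KafkaFtpIngest {
    def main(args: Array[String]): Unit = {
      val ssc = new StreamingContext(new SparkConf().setAppName("kafka-ftp-ingest"), Seconds(30))
      // Hypothetical broker and topic; each message names one file to fetch.
      val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, Map("metadata.broker.list" -> "broker:9092"), Set("ftp-jobs"))

      messages.foreachRDD { rdd =>
        rdd.foreach { case (_, msg) =>
          // Hypothetical message layout: host,remotePath,user,password
          val Array(host, remotePath, user, password) = msg.split(",")
          // Java's built-in ftp: URL handler gives an InputStream over the remote file.
          val in = new java.net.URL(s"ftp://$user:$password@$host$remotePath").openStream()
          // Assumes executors can see the cluster's Hadoop configuration.
          val fs = FileSystem.get(new Configuration())
          val out = fs.create(new Path("/ingest/" + new Path(remotePath).getName))
          // Streamed copy through a fixed 4 KB buffer, so heap usage stays flat
          // regardless of file size; 'true' closes both streams when done.
          IOUtils.copyBytes(in, out, 4096, true)
        }
      }
      ssc.start()
      ssc.awaitTermination()
    }
  }

Because each Kafka message is handled inside rdd.foreach, the downloads run on the executors,
so the transfers are spread across the cluster rather than funneled through a single JVM.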




________________________________
From: Marcelo Vanzin <vanzin@cloudera.com>
To: "Varadhan, Jawahar" <varadhan@yahoo.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Sent: Friday, August 14, 2015 3:23 PM
Subject: Re: Setting up Spark/flume/? to Ingest 10TB from FTP

Why do you need to use Spark or Flume for this?

You can just use curl and hdfs:

  curl ftp://blah | hdfs dfs -put - /blah

On Fri, Aug 14, 2015 at 1:15 PM, Varadhan, Jawahar <varadhan@yahoo.com.invalid>
wrote:
What is the best way to bring such a huge file from an FTP server into Hadoop and persist it
in HDFS? Since a single JVM process might run out of memory, I was wondering if I could use
Spark or Flume to do this. Any help on this matter is appreciated.

I would prefer an application/process running inside Hadoop that does this transfer.

Thanks.



--
Marcelo





