sqoop-user mailing list archives

From Jarek Jarcec Cecho <jar...@apache.org>
Subject Re: /tmp dir for import configurable?
Date Sat, 06 Apr 2013 05:05:31 GMT
Hi Christian,
thank you very much for sharing the log, and please accept my apologies for the late response.


Looking closely at your exception, I can confirm that it's the S3 file system that is creating
the files in /tmp, not Sqoop itself.

> [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
> OutputStream for key 'some_table/_SUCCESS' writing to tempfile
> '/tmp/hadoop-jenkins/s3/output-1400873345908825433.tmp'

Taking a brief look at the source code [1], it seems that it's the method newBackupFile(),
defined on line 195, that is responsible for creating the temporary file. It also seems
that its behaviour can be altered using the fs.s3.buffer.dir property. Would you mind
trying it in your Sqoop execution?

  sqoop import -Dfs.s3.buffer.dir=/custom/path ...
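
Alternatively, if you'd like to make this the default for every job rather than passing
it on the command line, the same property could be set in your Hadoop configuration.
A minimal sketch, assuming your site settings live in core-site.xml and that /custom/path
exists and is writable on the machine running the job:

  <!-- core-site.xml: redirect the S3 file system's local buffer
       away from /tmp (the default is ${hadoop.tmp.dir}/s3) -->
  <property>
    <name>fs.s3.buffer.dir</name>
    <value>/custom/path</value>
  </property>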

I've also noticed that you're using the LocalJobRunner, which suggests that Sqoop is executing
all jobs locally on your machine rather than on your Hadoop cluster. I would recommend checking
your Hadoop configuration if your intention is to run the data transfer in parallel.
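
A quick way to verify is to look at mapred-site.xml on the machine where you run Sqoop.
A sketch, assuming the classic MR1 setup that ships with CDH4 (jobtracker-host below is
a placeholder for your actual jobtracker address):

  <!-- mapred-site.xml: when this property is absent or set to
       "local", Hadoop falls back to the LocalJobRunner -->
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker-host:8021</value>
  </property>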

Jarcec

Links:
1: http://hadoop.apache.org/docs/r2.0.3-alpha/api/src-html/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html

On Tue, Apr 02, 2013 at 11:38:35AM +0100, Christian Prokopp wrote:
> Hi Jarcec,
> 
> I am running the command on the CLI of a cluster node. It appears to run a
> local MR job writing the results to /tmp before sending it to S3:
> 
> [..]
> [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> Beginning mysqldump fast path import
> [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> Performing import of table image from database some_db
> [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> Converting data to use specified delimiters.
> [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper: (For
> the fastest possible import, use
> [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> --mysql-delimiters to specify the same field
> [hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
> delimiters as are used by mysqldump.)
> [hostaddress] out: 13/04/02 01:52:54 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:52:55 INFO mapred.JobClient:  map 100%
> reduce 0%
> [hostaddress] out: 13/04/02 01:52:57 INFO mapred.LocalJobRunner:
> [..]
> [hostaddress] out: 13/04/02 01:53:03 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:54:42 INFO mapreduce.MySQLDumpMapper:
> Transfer loop complete.
> [hostaddress] out: 13/04/02 01:54:42 INFO mapreduce.MySQLDumpMapper:
> Transferred 668.9657 MB in 113.0105 seconds (5.9195 MB/sec)
> [hostaddress] out: 13/04/02 01:54:42 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:54:42 INFO s3native.NativeS3FileSystem:
> OutputStream for key
> 'some_table/_temporary/_attempt_local555455791_0001_m_000000_0/part-m-00000'
> closed. Now beginning upload
> [hostaddress] out: 13/04/02 01:54:42 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:54:45 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:55:31 INFO s3native.NativeS3FileSystem:
> OutputStream for key
> 'some_table/_temporary/_attempt_local555455791_0001_m_000000_0/part-m-00000'
> upload complete
> [hostaddress] out: 13/04/02 01:55:31 INFO mapred.Task:
> Task:attempt_local555455791_0001_m_000000_0 is done. And is in the process
> of commiting
> [hostaddress] out: 13/04/02 01:55:31 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:55:31 INFO mapred.Task: Task
> attempt_local555455791_0001_m_000000_0 is allowed to commit now
> [hostaddress] out: 13/04/02 01:55:36 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:56:03 WARN output.FileOutputCommitter:
> Failed to delete the temporary output directory of task:
> attempt_local555455791_0001_m_000000_0 - s3n://secret@bucketsomewhere
> /some_table/_temporary/_attempt_local555455791_0001_m_000000_0
> [hostaddress] out: 13/04/02 01:56:03 INFO output.FileOutputCommitter: Saved
> output of task 'attempt_local555455791_0001_m_000000_0' to
> s3n://secret@bucketsomewhere/some_table
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner:
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.Task: Task
> 'attempt_local555455791_0001_m_000000_0' done.
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner: Finishing
> task: attempt_local555455791_0001_m_000000_0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner: Map task
> executor complete.
> [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
> OutputStream for key 'some_table/_SUCCESS' writing to tempfile
> '/tmp/hadoop-jenkins/s3/output-1400873345908825433.tmp'
> [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
> OutputStream for key 'some_table/_SUCCESS' closed. Now beginning upload
> [hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
> OutputStream for key 'some_table/_SUCCESS' upload complete
> [...deleting cached jars...]
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient: Job complete:
> job_local555455791_0001
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient: Counters: 23
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:   File System
> Counters
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> Number of bytes read=6471451
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> Number of bytes written=6623109
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> Number of read operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> Number of large read operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
> Number of write operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> Number of bytes read=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> Number of bytes written=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> Number of read operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> Number of large read operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
> Number of write operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
> of bytes read=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
> of bytes written=773081963
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
> of read operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
> of large read operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
> of write operations=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:   Map-Reduce
> Framework
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Map input
> records=1
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Map output
> records=14324124
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Input split
> bytes=87
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Spilled
> Records=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     CPU time
> spent (ms)=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Physical
> memory (bytes) snapshot=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Virtual
> memory (bytes) snapshot=0
> [hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Total
> committed heap usage (bytes)=142147584
> [hostaddress] out: 13/04/02 01:56:03 INFO mapreduce.ImportJobBase:
> Transferred 0 bytes in 201.4515 seconds (0 bytes/sec)
> [hostaddress] out: 13/04/02 01:56:03 INFO mapreduce.ImportJobBase:
> Retrieved 14324124 records.
> 
> On Thu, Mar 28, 2013 at 9:49 PM, Jarek Jarcec Cecho <jarcec@apache.org> wrote:
> 
> > Hi Christian,
> > would you mind describing in a bit more detail the behaviour you're observing?
> >
> > Sqoop should be touching /tmp only on the machine where you executed it, for
> > generating and compiling code (<1MB!). The data transfer itself is done on
> > your Hadoop cluster from within a mapreduce job, and the output is stored
> > directly in your destination folder. I'm not familiar with the S3 file system
> > implementation, but could it be that it's the S3 library which is storing
> > the data in /tmp?
> >
> > Jarcec
> >
> > On Thu, Mar 28, 2013 at 03:54:11PM +0000, Christian Prokopp wrote:
> > > Thanks for the idea, Alex. I considered this, but it would mean changing
> > > my cluster setup for Sqoop (a last-resort solution). I'd much rather
> > > point Sqoop to existing large disks.
> > >
> > > Cheers,
> > > Christian
> > >
> > >
> > > On Thu, Mar 28, 2013 at 3:50 PM, Alexander Alten-Lorenz
> > > <wget.null@gmail.com> wrote:
> > >
> > > > You could mount a bigger disk on /tmp, or symlink /tmp to another
> > > > directory which has enough space.
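> > > >
> > > > For example, a rough sketch, assuming a larger volume is mounted at
> > > > /data and that nothing is writing to /tmp while you switch it over:
> > > >
> > > >   # create a world-writable, sticky tmp dir on the big volume
> > > >   mkdir -p /data/tmp && chmod 1777 /data/tmp
> > > >   # replace /tmp with a symlink to it
> > > >   mv /tmp /tmp.old && ln -s /data/tmp /tmp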
> > > >
> > > > Best
> > > > - Alex
> > > >
> > > > On Mar 28, 2013, at 4:35 PM, Christian Prokopp
> > > > <christian@rangespan.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am using sqoop to copy data from MySQL to S3:
> > > > >
> > > > > (Sqoop 1.4.2-cdh4.2.0)
> > > > > $ sqoop import --connect jdbc:mysql://server:port/db --username user
> > > > > --password pass --table tablename --target-dir
> > > > > s3n://xyz@somewhere/a/b/c --fields-terminated-by='\001' -m 1 --direct
> > > > >
> > > > > My problem is that Sqoop temporarily stores the data in /tmp, which
> > > > > is not big enough for the data. I am unable to find a configuration
> > > > > option to point Sqoop to a bigger partition/disk. Any suggestions?
> > > > >
> > > > > Cheers,
> > > > > Christian
> > > > >
> > > >
> > > > --
> > > > Alexander Alten-Lorenz
> > > > http://mapredit.blogspot.com
> > > > German Hadoop LinkedIn Group: http://goo.gl/N8pCF
> > > >
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > > *Christian Prokopp*
> > > Data Scientist, PhD
> > > Rangespan Ltd. <http://www.rangespan.com/>
> >
> 
> 
> 
> -- 
> Best regards,
> 
> *Christian Prokopp*
> Data Scientist, PhD
> Rangespan Ltd. <http://www.rangespan.com/>
