sqoop-user mailing list archives

From: Christian Prokopp <christ...@rangespan.com>
Subject: Re: /tmp dir for import configurable?
Date: Tue, 02 Apr 2013 10:38:35 GMT
Hi Jarcec,

I am running the command on the CLI of a cluster node. It appears to run a
local MR job that writes the results to /tmp before sending them to S3:

[..]
[hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
Beginning mysqldump fast path import
[hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
Performing import of table image from database some_db
[hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
Converting data to use specified delimiters.
[hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper: (For
the fastest possible import, use
[hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
--mysql-delimiters to specify the same field
[hostaddress] out: 13/04/02 01:52:49 INFO mapreduce.MySQLDumpMapper:
delimiters as are used by mysqldump.)
[hostaddress] out: 13/04/02 01:52:54 INFO mapred.LocalJobRunner:
[hostaddress] out: 13/04/02 01:52:55 INFO mapred.JobClient:  map 100%
reduce 0%
[hostaddress] out: 13/04/02 01:52:57 INFO mapred.LocalJobRunner:
[..]
[hostaddress] out: 13/04/02 01:53:03 INFO mapred.LocalJobRunner:
[hostaddress] out: 13/04/02 01:54:42 INFO mapreduce.MySQLDumpMapper:
Transfer loop complete.
[hostaddress] out: 13/04/02 01:54:42 INFO mapreduce.MySQLDumpMapper:
Transferred 668.9657 MB in 113.0105 seconds (5.9195 MB/sec)
[hostaddress] out: 13/04/02 01:54:42 INFO mapred.LocalJobRunner:
[hostaddress] out: 13/04/02 01:54:42 INFO s3native.NativeS3FileSystem:
OutputStream for key
'some_table/_temporary/_attempt_local555455791_0001_m_000000_0/part-m-00000'
closed. Now beginning upload
[hostaddress] out: 13/04/02 01:54:42 INFO mapred.LocalJobRunner:
[hostaddress] out: 13/04/02 01:54:45 INFO mapred.LocalJobRunner:
[hostaddress] out: 13/04/02 01:55:31 INFO s3native.NativeS3FileSystem:
OutputStream for key
'some_table/_temporary/_attempt_local555455791_0001_m_000000_0/part-m-00000'
upload complete
[hostaddress] out: 13/04/02 01:55:31 INFO mapred.Task:
Task:attempt_local555455791_0001_m_000000_0 is done. And is in the process
of commiting
[hostaddress] out: 13/04/02 01:55:31 INFO mapred.LocalJobRunner:
[hostaddress] out: 13/04/02 01:55:31 INFO mapred.Task: Task
attempt_local555455791_0001_m_000000_0 is allowed to commit now
[hostaddress] out: 13/04/02 01:55:36 INFO mapred.LocalJobRunner:
[hostaddress] out: 13/04/02 01:56:03 WARN output.FileOutputCommitter:
Failed to delete the temporary output directory of task:
attempt_local555455791_0001_m_000000_0 - s3n://secret@bucketsomewhere
/some_table/_temporary/_attempt_local555455791_0001_m_000000_0
[hostaddress] out: 13/04/02 01:56:03 INFO output.FileOutputCommitter: Saved
output of task 'attempt_local555455791_0001_m_000000_0' to
s3n://secret@bucketsomewhere/some_table
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner:
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.Task: Task
'attempt_local555455791_0001_m_000000_0' done.
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner: Finishing
task: attempt_local555455791_0001_m_000000_0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.LocalJobRunner: Map task
executor complete.
[hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
OutputStream for key 'some_table/_SUCCESS' writing to tempfile
'*/tmp/hadoop-jenkins/s3/output-1400873345908825433.tmp*'
[hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
OutputStream for key 'some_table/_SUCCESS' closed. Now beginning upload
[hostaddress] out: 13/04/02 01:56:03 INFO s3native.NativeS3FileSystem:
OutputStream for key 'some_table/_SUCCESS' upload complete
[...deleting cached jars...]
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient: Job complete:
job_local555455791_0001
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient: Counters: 23
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:   File System
Counters
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
Number of bytes read=6471451
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
Number of bytes written=6623109
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
Number of read operations=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
Number of large read operations=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     FILE:
Number of write operations=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
Number of bytes read=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
Number of bytes written=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
Number of read operations=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
Number of large read operations=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     HDFS:
Number of write operations=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
of bytes read=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
of bytes written=773081963
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
of read operations=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
of large read operations=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     S3N: Number
of write operations=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:   Map-Reduce
Framework
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Map input
records=1
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Map output
records=14324124
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Input split
bytes=87
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Spilled
Records=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     CPU time
spent (ms)=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Physical
memory (bytes) snapshot=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Virtual
memory (bytes) snapshot=0
[hostaddress] out: 13/04/02 01:56:03 INFO mapred.JobClient:     Total
committed heap usage (bytes)=142147584
[hostaddress] out: 13/04/02 01:56:03 INFO mapreduce.ImportJobBase:
Transferred 0 bytes in 201.4515 seconds (0 bytes/sec)
[hostaddress] out: 13/04/02 01:56:03 INFO mapreduce.ImportJobBase:
Retrieved 14324124 records.
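
The _SUCCESS tempfile at the end of the log is the clearest hint I have: its
path matches the default fs.s3.buffer.dir of ${hadoop.tmp.dir}/s3, so my
working assumption is that it is the s3n output stream (not Sqoop itself)
buffering the whole table to local disk before the upload. If that assumption
holds, something along these lines might move the buffering to a bigger disk
(untested sketch; /data/bigdisk is just a placeholder path):

# assumption: the local buffering comes from the s3n output buffer
# (fs.s3.buffer.dir, which defaults to ${hadoop.tmp.dir}/s3); the generic
# -D option has to come before the Sqoop-specific arguments
$ sqoop import -D fs.s3.buffer.dir=/data/bigdisk/s3 \
    --connect jdbc:mysql://server:port/db --username user --password pass \
    --table tablename --target-dir s3n://xyz@somewhere/a/b/c \
    --fields-terminated-by='\001' -m 1 --direct

If the local job runner's own intermediate files turn out to be a problem as
well, hadoop.tmp.dir and mapred.local.dir could presumably be redirected in
the same way.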

On Thu, Mar 28, 2013 at 9:49 PM, Jarek Jarcec Cecho <jarcec@apache.org> wrote:

> Hi Christian,
> would you mind describing a bit more the behaviour you're observing?
>
> Sqoop should only be touching /tmp on the machine where you've executed it,
> for generating and compiling code (<1MB!). The data transfer itself is done
> on your Hadoop cluster from within a mapreduce job, and the output is stored
> directly in your destination folder. I'm not familiar with the S3 filesystem
> implementation, but could it be the S3 library that is storing the data in
> /tmp?
>
> Jarcec
>
> On Thu, Mar 28, 2013 at 03:54:11PM +0000, Christian Prokopp wrote:
> > Thanks for the idea, Alex. I considered this, but that would mean I have to
> > change my cluster setup for Sqoop (a last-resort solution). I'd much rather
> > point Sqoop to existing large disks.
> >
> > Cheers,
> > Christian
> >
> >
> > On Thu, Mar 28, 2013 at 3:50 PM, Alexander Alten-Lorenz <wget.null@gmail.com> wrote:
> >
> > > You could mount a bigger disk on /tmp, or symlink /tmp to another
> > > directory which has enough space.
> > >
> > > Best
> > > - Alex
> > >
> > > On Mar 28, 2013, at 4:35 PM, Christian Prokopp <
> christian@rangespan.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I am using sqoop to copy data from MySQL to S3:
> > > >
> > > > (Sqoop 1.4.2-cdh4.2.0)
> > > > $ sqoop import --connect jdbc:mysql://server:port/db --username user
> > > > --password pass --table tablename --target-dir s3n://xyz@somewhere/a/b/c
> > > > --fields-terminated-by='\001' -m 1 --direct
> > > >
> > > > My problem is that sqoop temporarily stores the data on /tmp, which is
> > > > not big enough for the data. I am unable to find a configuration option
> > > > to point sqoop to a bigger partition/disk. Any suggestions?
> > > >
> > > > Cheers,
> > > > Christian
> > > >
> > >
> > > --
> > > Alexander Alten-Lorenz
> > > http://mapredit.blogspot.com
> > > German Hadoop LinkedIn Group: http://goo.gl/N8pCF
> > >
> > >
> >
> >
> > --
> > Best regards,
> >
> > *Christian Prokopp*
> > Data Scientist, PhD
> > Rangespan Ltd. <http://www.rangespan.com/>
>



-- 
Best regards,

*Christian Prokopp*
Data Scientist, PhD
Rangespan Ltd. <http://www.rangespan.com/>
