spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Davies Liu <dav...@databricks.com>
Subject Re: python : Out of memory: Kill process
Date Thu, 26 Mar 2015 17:29:20 GMT
Could you try to remove the line `log2.cache()` ?

On Thu, Mar 26, 2015 at 10:02 AM, Eduardo Cusa
<eduardo.cusa@usmediaconsulting.com> wrote:
> I running on ec2 :
>
> 1 Master : 4 CPU 15 GB RAM  (2 GB swap)
>
> 2 Slaves  4 CPU 15 GB RAM
>
>
> the uncompressed dataset size is 15 GB
>
>
>
>
> On Thu, Mar 26, 2015 at 10:41 AM, Eduardo Cusa
> <eduardo.cusa@usmediaconsulting.com> wrote:
>>
>> Hi Davies, I upgrade to 1.3.0 and still getting Out of Memory.
>>
>> I ran the same code as before, I need to make any changes?
>>
>>
>>
>>
>>
>>
>> On Wed, Mar 25, 2015 at 4:00 PM, Davies Liu <davies@databricks.com> wrote:
>>>
>>> With batchSize = 1, I think it will become even worse.
>>>
>>> I'd suggest to go with 1.3, have a taste for the new DataFrame API.
>>>
>>> On Wed, Mar 25, 2015 at 11:49 AM, Eduardo Cusa
>>> <eduardo.cusa@usmediaconsulting.com> wrote:
>>> > Hi Davies, I running 1.1.0.
>>> >
>>> > Now I'm following this thread that recommend use batchsize parameter =
>>> > 1
>>> >
>>> >
>>> >
>>> > http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html
>>> >
>>> > if this does not work I will install  1.2.1 or  1.3
>>> >
>>> > Regards
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Wed, Mar 25, 2015 at 3:39 PM, Davies Liu <davies@databricks.com>
>>> > wrote:
>>> >>
>>> >> What's the version of Spark you are running?
>>> >>
>>> >> There is a bug in SQL Python API [1], it's fixed in 1.2.1 and 1.3,
>>> >>
>>> >> [1] https://issues.apache.org/jira/browse/SPARK-6055
>>> >>
>>> >> On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa
>>> >> <eduardo.cusa@usmediaconsulting.com> wrote:
>>> >> > Hi Guys, I running the following function with spark-submmit and
de
>>> >> > SO
>>> >> > is
>>> >> > killing my process :
>>> >> >
>>> >> >
>>> >> >   def getRdd(self,date,provider):
>>> >> >     path='s3n://'+AWS_BUCKET+'/'+date+'/*.log.gz'
>>> >> >     log2= self.sqlContext.jsonFile(path)
>>> >> >     log2.registerTempTable('log_test')
>>> >> >     log2.cache()
>>> >>
>>> >> You only visit the table once, cache does not help here.
>>> >>
>>> >> >     out=self.sqlContext.sql("SELECT user, tax from log_test where
>>> >> > provider =
>>> >> > '"+provider+"'and country <> ''").map(lambda row: (row.user,
>>> >> > row.tax))
>>> >> >     print "out1"
>>> >> >     return  map((lambda (x,y): (x, list(y))),
>>> >> > sorted(out.groupByKey(2000).collect()))
>>> >>
>>> >> 100 partitions (or less) will be enough for 2G dataset.
>>> >>
>>> >> >
>>> >> >
>>> >> > The input dataset has 57 zip files (2 GB)
>>> >> >
>>> >> > The same process with a smaller dataset completed successfully
>>> >> >
>>> >> > Any ideas to debug is welcome.
>>> >> >
>>> >> > Regards
>>> >> > Eduardo
>>> >> >
>>> >> >
>>> >
>>> >
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Mime
View raw message