spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Re: Re: Re: Re: Re: How big the spark stream window could be ?
Date Wed, 11 May 2016 05:18:10 GMT
what does jps returning?

jps
16738 ResourceManager
14786 Worker
17059 JobHistoryServer
12421 QuorumPeerMain
9061 RunJar
9286 RunJar
5190 SparkSubmit
16806 NodeManager
16264 DataNode
16138 NameNode
16430 SecondaryNameNode
22036 SparkSubmit
9557 Jps
13240 Kafka
2522 Master

and

ps -awx | grep -i spark | grep java


Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 11 May 2016 at 03:01, 李明伟 <kramer2009@126.com> wrote:

> Hi Mich
>
> From the ps command. I can find four process. 10409 is the master and
> 10603 is the worker. 12420 is the driver program and 12578 should be the
> executor (worker). Am I right?
> So you mean the 12420 is actually running both the driver and the worker
> role?
>
> [root@ES01 ~]# ps -awx | grep spark | grep java
> 10409 ?        Sl     1:40 java -cp
> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
> -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master
> --ip ES01 --port 7077 --webui-port 8080
> 10603 ?        Sl     6:00 java -cp
> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
> -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker
> --webui-port 8081 spark://ES01:7077
> 12420 ?        Sl     6:34 java -cp
> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
> -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit
> --master spark://ES01:7077 --conf spark.storage.memoryFraction=0.2
> --executor-memory 4G --num-executors 1 --total-executor-cores 1
> /opt/flowSpark/sparkStream/ForAsk01.py
> 12578 ?        Sl    13:16 java -cp
> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
> -Xms4096M -Xmx4096M -Dspark.driver.port=52931 -XX:MaxPermSize=256m
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://
> CoarseGrainedScheduler@10.79.148.184:52931 --executor-id 0 --hostname
> 10.79.148.184 --cores 1 --app-id app-20160511080701-0013 --worker-url
> spark://Worker@10.79.148.184:52660
>
>
>
>
>
>
>
> At 2016-05-11 09:03:21, "Mich Talebzadeh" <mich.talebzadeh@gmail.com>
> wrote:
>
> hm,
>
> This is a standalone mode.
>
> When you are running Spark in Standalone mode, you only have one worker
> that lives within the driver JVM process that you start when you start
> spark-shell or spark-submit.
>
> However, since driver-memory setting encapsulates the JVM, you will need
> to set the amount of *driver memory *for any non-default value *before
> starting JVM by providing the new value:*
>
>
>
>
> ${SPARK_HOME}/bin/spark-submit --driver-memory 5g
>
>
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 11 May 2016 at 01:22, 李明伟 <kramer2009@126.com> wrote:
>
>> I actually provided them in submit command here:
>>
>> nohup ./bin/spark-submit   --master spark://ES01:7077 --executor-memory
>> 4G --num-executors 1 --total-executor-cores 1 --conf
>> "spark.storage.memoryFraction=0.2"  ./mycode.py 1>a.log 2>b.log &
>>
>>
>>
>>
>>
>>
>>
>> At 2016-05-10 21:19:06, "Mich Talebzadeh" <mich.talebzadeh@gmail.com>
>> wrote:
>>
>> Hi Mingwei,
>>
>> In your Spark conf setting what are you providing for these parameters. *Are
>> you capping them?*
>>
>> For example
>>
>>   val conf = new SparkConf().
>>                setAppName("AppName").
>>                setMaster("local[2]").
>>                set("spark.executor.memory", "4G").
>>                set("spark.cores.max", "2").
>>                set("spark.driver.allowMultipleContexts", "true")
>>   val sc = new SparkContext(conf)
>>
>> I assume you are running in standalone mode so each worker/aka
>> slave grabs all the available cores and allocates the remaining memory on
>> each host. Do not provide these in
>>
>> Do not provide new values for these parameter meaning overwrite them in
>>
>> *${SPARK_HOME}/bin/spark-submit  --*
>>
>>
>> HTH
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 10 May 2016 at 03:12, 李明伟 <kramer2009@126.com> wrote:
>>
>>> Hi Mich
>>>
>>> I added some more infor (the spark-env.sh setting and top command output
>>> in that thread.) Can you help to check pleas?
>>>
>>> Regards
>>> Mingwei
>>>
>>>
>>>
>>>
>>>
>>> At 2016-05-09 23:45:19, "Mich Talebzadeh" <mich.talebzadeh@gmail.com>
>>> wrote:
>>>
>>> I had a look at the thread.
>>>
>>> This is what you have which I gather a standalone box in other words one
>>> worker node
>>>
>>> bin/spark-submit   --master spark://ES01:7077 --executor-memory 4G
>>> --num-executors 1 --total-executor-cores 1 ./latest5min.py 1>a.log 2>b.log
>>>
>>> But what I don't understand why is using 80% of your RAM as opposed to
>>> 25% of it (4GB/16GB) right?
>>>
>>> Where else have you set up these parameters for example in
>>> $SPARK_HOME/con/spark-env.sh?
>>>
>>> Can you send the output of /usr/bin/free and top
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 9 May 2016 at 16:19, 李明伟 <kramer2009@126.com> wrote:
>>>
>>>> Thanks for all the information guys.
>>>>
>>>> I wrote some code to do the test. Not using window. So only calculating
>>>> data for each batch interval. I set the interval to 30 seconds also reduce
>>>> the size of data to about 30 000 lines of csv.
>>>> Means my code should calculation on 30 000 lines of CSV in 30 seconds.
>>>> I think it is not a very heavy workload. But my spark stream code still
>>>> crash.
>>>>
>>>> I send another post to the user list here
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Why-I-have-memory-leaking-for-such-simple-spark-stream-code-td26904.html
>>>>
>>>> Is it possible for you to have a look please? Very appreciate.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> At 2016-05-09 17:49:22, "Saisai Shao" <sai.sai.shao@gmail.com> wrote:
>>>>
>>>> Pease see the inline comments.
>>>>
>>>>
>>>> On Mon, May 9, 2016 at 5:31 PM, Ashok Kumar <ashok34668@yahoo.com>
>>>> wrote:
>>>>
>>>>> Thank you.
>>>>>
>>>>> So If I create spark streaming then
>>>>>
>>>>>
>>>>>    1. The streams will always need to be cached? It cannot be stored
>>>>>    in persistent storage
>>>>>
>>>>> You don't need to cache the stream explicitly if you don't have
>>>> specific requirement, Spark will do it for you depends on different
>>>> streaming sources (Kafka or socket).
>>>>
>>>>>
>>>>>    1. The stream data cached will be distributed among all nodes of
>>>>>    Spark among executors
>>>>>    2. As I understand each Spark worker node has one executor that
>>>>>    includes cache. So the streaming data is distributed among these work
node
>>>>>    caches. For example if I have 4 worker nodes each cache will have
a quarter
>>>>>    of data (this assumes that cache size among worker nodes is the same.)
>>>>>
>>>>> Ideally, it will distributed evenly across the executors, also this is
>>>> target for tuning. Normally it depends on several conditions like receiver
>>>> distribution, partition distribution.
>>>>
>>>>
>>>>>
>>>>> The issue raises if the amount of streaming data does not fit into
>>>>> these 4 caches? Will the job crash?
>>>>>
>>>>>
>>>>> On Monday, 9 May 2016, 10:16, Saisai Shao <sai.sai.shao@gmail.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>> No, each executor only stores part of data in memory (it depends on
>>>>> how the partition are distributed and how many receivers you have).
>>>>>
>>>>> For WindowedDStream, it will obviously cache the data in memory, from
>>>>> my understanding you don't need to call cache() again.
>>>>>
>>>>> On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar <ashok34668@yahoo.com>
>>>>> wrote:
>>>>>
>>>>> hi,
>>>>>
>>>>> so if i have 10gb of streaming data coming in does it require 10gb of
>>>>> memory in each node?
>>>>>
>>>>> also in that case why do we need using
>>>>>
>>>>> dstream.cache()
>>>>>
>>>>> thanks
>>>>>
>>>>>
>>>>> On Monday, 9 May 2016, 9:58, Saisai Shao <sai.sai.shao@gmail.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>> It depends on you to write the Spark application, normally if data is
>>>>> already on the persistent storage, there's no need to be put into memory.
>>>>> The reason why Spark Streaming has to be stored in memory is that streaming
>>>>> source is not persistent source, so you need to have a place to store
the
>>>>> data.
>>>>>
>>>>> On Mon, May 9, 2016 at 4:43 PM, 李明伟 <kramer2009@126.com>
wrote:
>>>>>
>>>>> Thanks.
>>>>> What if I use batch calculation instead of stream computing? Do I
>>>>> still need that much memory? For example, if the 24 hour data set is
100
>>>>> GB. Do I also need a 100GB RAM to do the one time batch calculation ?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> At 2016-05-09 15:14:47, "Saisai Shao" <sai.sai.shao@gmail.com>
wrote:
>>>>>
>>>>> For window related operators, Spark Streaming will cache the data into
>>>>> memory within this window, in your case your window size is up to 24
hours,
>>>>> which means data has to be in Executor's memory for more than 1 day,
this
>>>>> may introduce several problems when memory is not enough.
>>>>>
>>>>> On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>> ok terms for Spark Streaming
>>>>>
>>>>> "Batch interval" is the basic interval at which the system with
>>>>> receive the data in batches.
>>>>> This is the interval set when creating a StreamingContext. For
>>>>> example, if you set the batch interval as 300 seconds, then any input
>>>>> DStream will generate RDDs of received data at 300 seconds intervals.
>>>>> A window operator is defined by two parameters -
>>>>> - WindowDuration / WindowsLength - the length of the window
>>>>> - SlideDuration / SlidingInterval - the interval at which the window
>>>>> will slide or move forward
>>>>>
>>>>>
>>>>> Ok so your batch interval is 5 minutes. That is the rate messages are
>>>>> coming in from the source.
>>>>>
>>>>> Then you have these two params
>>>>>
>>>>> // window length - The duration of the window below that must be
>>>>> multiple of batch interval n in = > StreamingContext(sparkConf, Seconds(n))
>>>>> val windowLength = x =  m * n
>>>>> // sliding interval - The interval at which the window operation is
>>>>> performed in other words data is collected within this "previous interval'
>>>>> val slidingInterval =  y l x/y = even number
>>>>>
>>>>> Both the window length and the slidingInterval duration must be
>>>>> multiples of the batch interval, as received data is divided into batches
>>>>> of duration "batch interval".
>>>>>
>>>>> If you want to collect 1 hour data then windowLength =  12 * 5 * 60
>>>>> seconds
>>>>> If you want to collect 24 hour data then windowLength =  24 * 12 * 5
*
>>>>> 60
>>>>>
>>>>> You sliding window should be set to batch interval = 5 * 60 seconds.
>>>>> In other words that where the aggregates and summaries come for your
report.
>>>>>
>>>>> What is your data source here?
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>> On 9 May 2016 at 04:19, kramer2009@126.com <kramer2009@126.com>
wrote:
>>>>>
>>>>> We have some stream data need to be calculated and considering use
>>>>> spark
>>>>> stream to do it.
>>>>>
>>>>> We need to generate three kinds of reports. The reports are based on
>>>>>
>>>>> 1. The last 5 minutes data
>>>>> 2. The last 1 hour data
>>>>> 3. The last 24 hour data
>>>>>
>>>>> The frequency of reports is 5 minutes.
>>>>>
>>>>> After reading the docs, the most obvious way to solve this seems to
>>>>> set up a
>>>>> spark stream with 5 minutes interval and two window which are 1 hour
>>>>> and 1
>>>>> day.
>>>>>
>>>>>
>>>>> But I am worrying that if the window is too big for one day and one
>>>>> hour. I
>>>>> do not have much experience on spark stream, so what is the window
>>>>> length in
>>>>> your environment?
>>>>>
>>>>> Any official docs talking about this?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899.html
>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>> Nabble.com.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>
>
>
>
>

Mime
View raw message