spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alessandro Solimando <alessandro.solima...@gmail.com>
Subject Re: Spark job got stuck and no active tasks
Date Wed, 30 Jan 2019 13:07:58 GMT
Hello Pola,
sorry for the late reply.

I haven't checked in depth the method in error, but from the stack trace it
looks like you have problems with the cardinality of one of your columns in
the dataset. Did you try different datasets?

Does the Spark Job get stuck when no errors are raised from xgboost?

For me when an error happens at xgboost level, the job catches it and
fails, it does not become stuck, that's why I am asking.

I have been experimenting with xgboost 0.72, xgboost 0.8 and xgboost 0.81,
which version are you using?

Best regards,
Alessandro

On Fri, 25 Jan 2019 at 19:16, Pola Yao <pola.yao@gmail.com> wrote:

> Hi Alessandro,
>
> Thanks for your advice.
>
> I have pasted the parameters used for the XGBoost:
> '''
>
> val xgbParam = Map("eta" -> 0.1f,
>    "max_depth" -> 5,
>    "objective" -> "binary:logistic",
>    "num_round" -> 200,
>    //"tracker_conf" -> TrackerConf(30000L, "scala"),
>    "num_workers" -> 2,
>    "nthread" -> 1,
>    "verbosity" -> 3
> )
>
> '''
>
> Logs:
> [10:13:15] Tree method is automatically selected to be 'approx' for
> distributed training.[10:13:15] Tree method is automatically selected to be
> 'approx' for distributed training.
>
> terminate called after throwing an instance of 'dmlc::Error'
>   what():  [10:13:16]
> /xgboost/include/xgboost/../../src/common/span.h:402: Check failed: _count
> >= 0
>
> Stack trace returned 10 entries:
> [bt] (0) /tmp/libxgboost4j7047870347452838018.so(dmlc::StackTrace()+0x19d)
> [0x7ff3349a5d6d]
> [bt] (1)
> /tmp/libxgboost4j7047870347452838018.so(xgboost::common::Span<xgboost::Entry
> const, -1l>::Span(xgboost::Entry const*, long)+0x5f0) [0x7ff3349eef00]
> [bt] (2) /tmp/libxgboost4j7047870347452838018.so(+0x1e9793)
> [0x7ff334aec793]
> [bt] (3)
> /tmp/libxgboost4j7047870347452838018.so(xgboost::tree::GlobalProposalHistMaker<xgboost::tree::GradStats>::CreateHist(std::vector<xgboost::detail::GradientPairInternal<float>,
> std::allocator<xgboost::detail::GradientPairInternal<float> > > const&,
> xgboost::DMatrix*, std::vector<unsigned int, std::allocator<unsigned int> >
> const&, xgboost::RegTree const&)+0x995) [0x7ff334afa925]
> [bt] (4)
> /tmp/libxgboost4j7047870347452838018.so(xgboost::tree::HistMaker<xgboost::tree::GradStats>::Update(std::vector<xgboost::detail::GradientPairInternal<float>,
> std::allocator<xgboost::detail::GradientPairInternal<float> > > const&,
> xgboost::DMatrix*, xgboost::RegTree*)+0x2d9) [0x7ff334afce79]
> [bt] (5)
> /tmp/libxgboost4j7047870347452838018.so(xgboost::tree::HistMaker<xgboost::tree::GradStats>::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float>
> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*,
> std::allocator<xgboost::RegTree*> > const&)+0xa3) [0x7ff334aee933]
> [bt] (6)
> /tmp/libxgboost4j7047870347452838018.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float>
> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree,
> std::default_delete<xgboost::RegTree> >,
> std::allocator<std::unique_ptr<xgboost::RegTree,
> std::default_delete<xgboost::RegTree> > > >*)+0x809) [0x7ff334a295a9]
> [bt] (7)
> /tmp/libxgboost4j7047870347452838018.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*,
> xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*,
> xgboost::ObjFunction*)+0x885) [0x7ff334a2a035]
> [bt] (8)
> /tmp/libxgboost4j7047870347452838018.so(xgboost::LearnerImpl::UpdateOneIter(int,
> xgboost::DMatrix*)+0x3e0) [0x7ff334a37560]
> [bt] (9)
> /tmp/libxgboost4j7047870347452838018.so(XGBoosterUpdateOneIter+0x35)
> [0x7ff3349ac185]
>
>
>
> On Fri, Jan 25, 2019 at 2:16 AM Alessandro Solimando <
> alessandro.solimando@gmail.com> wrote:
>
>> Hello Pola,
>> do you mind providing us the full XGBoost parameter map?
>>
>> Did you try activating the "verbose mode" of XGBoost and see if something
>> wrong is happening in the native code?
>>
>> I usually increase verbosity from XGBoost by passing "silent = 0" in the
>> param map, and calling
>> Logger.getLogger("ml.dmlc.xgboost4j").setLevel(Level.INFO) (not sure if
>> that's the right way to do it though).
>>
>> Best regards,
>> Alessandro
>>
>>
>> On Fri, 25 Jan 2019 at 07:05, Pola Yao <pola.yao@gmail.com> wrote:
>>
>>> I was trying to train a ML pipeline (including a XGBoost classifier) on
>>> a very small data (i.e., 10Mb). During the training, certain
>>> MissingOutputLocation exceptions occurred, however, the Spark job got stuck
>>> with no active tasks, and never quit.
>>> [image: 1.png]
>>> [image: 2.png]
>>> [image: 3.png]
>>>
>>> Spark env:
>>> Version 2.3.0, yarn cluster, #executors: 4, #executor-cores: 2, executor-memory:
>>> 4g, executor-memory-overhead: 2g
>>>
>>> Spark code:
>>>  '''
>>>
>>> def train(...): PipelineModel = {
>>>
>>> val data = readData(...)
>>>
>>> val pipeline = new Pipeline().setStages(...) // the last stage is an
>>> XGBoostClassifier
>>>
>>> val model = pipeline.fit(data)
>>>
>>> val result = model.transform(data)
>>>
>>> model
>>>
>>> }
>>>
>>> def main(String[] args): Unit = {
>>>
>>> val spark = //create a sparksession
>>>
>>> ...
>>>
>>> val a = Try(train(...))
>>>
>>> a match {
>>>
>>> case Success(model) => println("Successfully trained")
>>>
>>> case Failure(ex) => println("Failed to train " + ex)
>>>
>>> }
>>>
>>> spark.stop()
>>>
>>> '''
>>>
>>> Does anybody have a clue about this scenario? Is it due to the yarn
>>> cluster? Or some weird exceptions/errors occurred, but my code failed to
>>> catch them?
>>>
>>> Thanks!
>>>
>>

Mime
View raw message