2.0.0-SNAPSHOT dmlc/XGBoost train FAILED on "Invalid data ordinal" #9409

Closed · NvTimLiu opened this issue Jul 24, 2023 · 2 comments · Fixed by #9412
Comments

NvTimLiu commented Jul 24, 2023

xgb-driver.txt
xgb-executor.txt

2.0.0-SNAPSHOT dmlc/XGBoost training FAILED on "Invalid device ordinal" since 2023-07-17 on an NGC cluster with multi-GPU hosts.

Invalid device ordinal. Data is associated with a different device ordinal than the booster. The device ordinal of the data is: 2; the device ordinal of the Booster is: 0

PASS JAR:
xgboost4j-gpu_2.12-2.0.0-20230716.001501-545.jar

FAILED JAR:
xgboost4j-gpu_2.12-2.0.0-20230717.191954-548.jar
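
For context, the check that fires here is MetaInfo::Validate in data.cc: the DMatrix built by the Spark task lives on GPU ordinal 2, while the Booster is bound to ordinal 0. The actual app code is not attached, so the following is only an illustrative sketch of the kind of xgboost4j-spark setup this job runs (standard xgboost4j-spark parameter names; values guessed from the logs):

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Hypothetical reproducer sketch: several barrier tasks on one multi-GPU host,
// each task pinned to its own GPU ordinal by Spark resource scheduling.
val xgbParams = Map(
  "tree_method" -> "gpu_hist", // deprecated 2.0 alias, see the warning in the executor log
  "num_round"   -> 100,        // assumed value, not taken from the logs
  "num_workers" -> 4           // matches the 4 barrier slots requested in the driver log
)

val classifier = new XGBoostClassifier(xgbParams)
  .setFeaturesCol("features")
  .setLabelCol("label")

// With the 20230717 snapshot, the task running on GPU 2 builds its training
// data on ordinal 2, but the Booster ends up on ordinal 0, so
// MetaInfo::Validate rejects the data during the first training iteration.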

-------- driver --------

cat /raid/tmp/driver-agaricus-Main-GPU.log
23/07/22 09:36:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
23/07/22 09:36:46 INFO SparkContext: Running Spark version 3.1.2
23/07/22 09:36:46 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
23/07/22 09:36:46 INFO ResourceUtils: ==============================================================
23/07/22 09:36:46 INFO ResourceUtils: No custom resources configured for spark.driver.
23/07/22 09:36:46 INFO ResourceUtils: ==============================================================
23/07/22 09:36:46 INFO SparkContext: Submitted application: Agaricus-Mai-csv
23/07/22 09:36:46 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 8, script: , vendor: , memory -> name: memory, amount: 32768, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: , gpu -> name: gpu, amount: 1, script: , vendor: nvidia.com), task resources: Map(cpus -> name: cpus, amount: 1.0, gpu -> name: gpu, amount: 1.0)

....

23/07/22 09:37:54 INFO TaskSchedulerImpl: Skip current round of resource offers for barrier stage 3 because the barrier taskSet requires 4 slots, while the total number of available slots is 1.
23/07/22 09:37:54 WARN TaskSetManager: Lost task 2.0 in stage 3.0 (TID 5) (127.0.0.1 executor 0): ml.dmlc.xgboost4j.java.XGBoostError: [09:37:54] /workspace/src/data/data.cc:730: Invalid device ordinal. Data is associated with a different device ordinal than the booster. The device ordinal of the data is: 2; the device ordinal of the Booster is: 0
Stack trace:
  [bt] (0) /raid/tmp/libxgboost4j8219882498856405938.so(+0x59993a) [0x7f4af479b93a]
  [bt] (1) /raid/tmp/libxgboost4j8219882498856405938.so(+0x599a43) [0x7f4af479ba43]
  [bt] (2) /raid/tmp/libxgboost4j8219882498856405938.so(xgboost::MetaInfo::Validate(int) const+0x2b2) [0x7f4af47b0802]
  [bt] (3) /raid/tmp/libxgboost4j8219882498856405938.so(xgboost::LearnerConfiguration::InitBaseScore(xgboost::DMatrix const*)+0xdd) [0x7f4af48fe28d]
  [bt] (4) /raid/tmp/libxgboost4j8219882498856405938.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x74) [0x7f4af4902e84]
  [bt] (5) /raid/tmp/libxgboost4j8219882498856405938.so(XGBoosterUpdateOneIter+0x70) [0x7f4af45b4e20]
  [bt] (6) [0x7f5ad83a03e7]


        at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
        at ml.dmlc.xgboost4j.java.Booster.update(Booster.java:218)
        at ml.dmlc.xgboost4j.java.XGBoost.trainAndSaveCheckpoint(XGBoost.java:219)
        at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:306)
        at ml.dmlc.xgboost4j.scala.XGBoost$.$anonfun$trainAndSaveCheckpoint$5(XGBoost.scala:68)
        at scala.Option.getOrElse(Option.scala:189)
        at ml.dmlc.xgboost4j.scala.XGBoost$.trainAndSaveCheckpoint(XGBoost.scala:65)
        at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:108)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:344)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$3(XGBoost.scala:418)
        at scala.Option.map(Option.scala:230)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$2(XGBoost.scala:417)
        at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
        at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

......

real    1m10.856s
user    0m49.489s
sys     0m4.209s
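
For reference, the ResourceProfile in the driver log corresponds to roughly the following standalone-mode settings (a sketch read off the log line, not taken from the actual job script). Note that with 1 GPU per executor and a task GPU amount of 1.0, each executor offers exactly one barrier slot, which is why the scheduler reports only 1 of the 4 required slots above.

// Assumed SparkConf implied by the logged ResourceProfile:
// 8 cores, 32 GiB, 1 GPU per executor; 1 CPU and 1 GPU per task.
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.cores", "8")
  .set("spark.executor.memory", "32g")
  .set("spark.executor.resource.gpu.amount", "1")
  .set("spark.executor.resource.gpu.vendor", "nvidia.com")
  .set("spark.task.cpus", "1")
  .set("spark.task.resource.gpu.amount", "1")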

-------- executor --------

cat 0/stderr 
Spark Executor Command: "/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java" "-cp" "/test/rapids-4-spark.jar:/spark-3.1.2-bin-hadoop3.2/conf/:/spark-3.1.2-bin-hadoop3.2/jars/*" "-Xmx32768M" "-Dspark.driver.port=41657" "-Dspark.network.timeout=600s" "-Djava.io.tmpdir=/raid/tmp" "-Dai.rapids.cudf.prefer-pinned=true" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://[email protected]:41657" "--executor-id" "0" "--hostname" "127.0.0.1" "--cores" "8" "--app-id" "app-20230722093648-0000" "--worker-url" "spark://[email protected]:45319" "--resourcesFile" "/spark-3.1.2-bin-hadoop3.2/work/app-20230722093648-0000/0/resource-executor-7056249029782452200.json"
========================================

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
23/07/22 09:36:50 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 1100@4902560
23/07/22 09:36:50 INFO SignalUtils: Registering signal handler for TERM
23/07/22 09:36:50 INFO SignalUtils: Registering signal handler for HUP
23/07/22 09:36:50 INFO SignalUtils: Registering signal handler for INT

......

23/07/22 09:37:39 INFO TransportClientFactory: Successfully created connection to /127.0.0.1:42277 after 2 ms (0 ms spent in bootstraps)
[09:37:49] task 2 got new rank 0
[09:37:50] WARNING: /workspace/src/common/error_msg.cc:25: The tree method `gpu_hist` is deprecated since 2.0.0. To use GPU training, set the `device` parameter to CUDA instead.

    E.g. tree_method = "hist", device = "cuda"

23/07/22 09:37:54 ERROR XGBoostSpark: XGBooster worker 2 has failed 0 times due to 
ml.dmlc.xgboost4j.java.XGBoostError: [09:37:54] /workspace/src/data/data.cc:730: Invalid device ordinal. Data is associated with a different device ordinal than the booster. The device ordinal of the data is: 2; the device ordinal of the Booster is: 0
Stack trace:
  [bt] (0) /raid/tmp/libxgboost4j8219882498856405938.so(+0x59993a) [0x7f4af479b93a]
  [bt] (1) /raid/tmp/libxgboost4j8219882498856405938.so(+0x599a43) [0x7f4af479ba43]
  [bt] (2) /raid/tmp/libxgboost4j8219882498856405938.so(xgboost::MetaInfo::Validate(int) const+0x2b2) [0x7f4af47b0802]
  [bt] (3) /raid/tmp/libxgboost4j8219882498856405938.so(xgboost::LearnerConfiguration::InitBaseScore(xgboost::DMatrix const*)+0xdd) [0x7f4af48fe28d]
  [bt] (4) /raid/tmp/libxgboost4j8219882498856405938.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x74) [0x7f4af4902e84]
  [bt] (5) /raid/tmp/libxgboost4j8219882498856405938.so(XGBoosterUpdateOneIter+0x70) [0x7f4af45b4e20]
  [bt] (6) [0x7f5ad83a03e7]


        at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
        at ml.dmlc.xgboost4j.java.Booster.update(Booster.java:218)
        at ml.dmlc.xgboost4j.java.XGBoost.trainAndSaveCheckpoint(XGBoost.java:219)
        at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:306)
        at ml.dmlc.xgboost4j.scala.XGBoost$.$anonfun$trainAndSaveCheckpoint$5(XGBoost.scala:68)
        at scala.Option.getOrElse(Option.scala:189)
        at ml.dmlc.xgboost4j.scala.XGBoost$.trainAndSaveCheckpoint(XGBoost.scala:65)
        at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:108)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:344)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$3(XGBoost.scala:418)
        at scala.Option.map(Option.scala:230)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$2(XGBoost.scala:417)
        at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
        at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
23/07/22 09:37:54 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 5)
ml.dmlc.xgboost4j.java.XGBoostError: [09:37:54] /workspace/src/data/data.cc:730: Invalid device ordinal. Data is associated with a different device ordinal than the booster. The device ordinal of the data is: 2; the device ordinal of the Booster is: 0
Stack trace:
  [bt] (0) /raid/tmp/libxgboost4j8219882498856405938.so(+0x59993a) [0x7f4af479b93a]
  [bt] (1) /raid/tmp/libxgboost4j8219882498856405938.so(+0x599a43) [0x7f4af479ba43]
  [bt] (2) /raid/tmp/libxgboost4j8219882498856405938.so(xgboost::MetaInfo::Validate(int) const+0x2b2) [0x7f4af47b0802]
  [bt] (3) /raid/tmp/libxgboost4j8219882498856405938.so(xgboost::LearnerConfiguration::InitBaseScore(xgboost::DMatrix const*)+0xdd) [0x7f4af48fe28d]
  [bt] (4) /raid/tmp/libxgboost4j8219882498856405938.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x74) [0x7f4af4902e84]
  [bt] (5) /raid/tmp/libxgboost4j8219882498856405938.so(XGBoosterUpdateOneIter+0x70) [0x7f4af45b4e20]
  [bt] (6) [0x7f5ad83a03e7]


        at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
        at ml.dmlc.xgboost4j.java.Booster.update(Booster.java:218)
        at ml.dmlc.xgboost4j.java.XGBoost.trainAndSaveCheckpoint(XGBoost.java:219)
        at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:306)
        at ml.dmlc.xgboost4j.scala.XGBoost$.$anonfun$trainAndSaveCheckpoint$5(XGBoost.scala:68)
        at scala.Option.getOrElse(Option.scala:189)
        at ml.dmlc.xgboost4j.scala.XGBoost$.trainAndSaveCheckpoint(XGBoost.scala:65)
        at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:108)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:344)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$3(XGBoost.scala:418)
        at scala.Option.map(Option.scala:230)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$2(XGBoost.scala:417)
        at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
        at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
23/07/22 09:37:54 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
23/07/22 09:37:54 INFO RapidsBufferCatalog: Closing storage
23/07/22 09:37:54 ERROR CoarseGrainedExecutorBackend: RECEIV
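
One note on the warning in the executor log: `gpu_hist` is a deprecated alias in 2.0, and independent of this bug the migration it suggests would look roughly like this in xgboost4j-spark terms (a sketch, using the same params-Map style as above):

// Replace the deprecated alias with the split 2.0-style parameters:
val updatedParams = Map(
  "tree_method" -> "hist",
  "device"      -> "cuda"  // or "cuda:<ordinal>" to pin a specific GPU
)
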
NvTimLiu (Author)

cc @wbo4958

wbo4958 (Contributor) commented Jul 24, 2023

sure, checking.
