2.0.0-SNAPSHOT dmlc/XGBoost train FAILED on "Invalid device ordinal" since 2023-07-17 on NGC cluster with multi-GPU hosts.
Invalid device ordinal. Data is associated with a different device ordinal than the booster. The device ordinal of the data is: 2; the device ordinal of the Booster is: 0
-------- driver -----
cat /raid/tmp/driver-agaricus-Main-GPU.log
23/07/22 09:36:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
23/07/22 09:36:46 INFO SparkContext: Running Spark version 3.1.2
23/07/22 09:36:46 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
23/07/22 09:36:46 INFO ResourceUtils: ==============================================================
23/07/22 09:36:46 INFO ResourceUtils: No custom resources configured for spark.driver.
23/07/22 09:36:46 INFO ResourceUtils: ==============================================================
23/07/22 09:36:46 INFO SparkContext: Submitted application: Agaricus-Mai-csv
23/07/22 09:36:46 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 8, script: , vendor: , memory -> name: memory, amount: 32768, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: , gpu -> name: gpu, amount: 1, script: , vendor: nvidia.com), task resources: Map(cpus -> name: cpus, amount: 1.0, gpu -> name: gpu, amount: 1.0)
....
23/07/22 09:37:54 INFO TaskSchedulerImpl: Skip current round of resource offers for barrier stage 3 because the barrier taskSet requires 4 slots, while the total number of available slots is 1.
23/07/22 09:37:54 WARN TaskSetManager: Lost task 2.0 in stage 3.0 (TID 5) (127.0.0.1 executor 0): ml.dmlc.xgboost4j.java.XGBoostError: [09:37:54] /workspace/src/data/data.cc:730: Invalid device ordinal. Data is associated with a different device ordinal than the booster. The device ordinal of the data is: 2; the device ordinal of the Booster is: 0
Stack trace:
[bt] (0) /raid/tmp/libxgboost4j8219882498856405938.so(+0x59993a) [0x7f4af479b93a]
[bt] (1) /raid/tmp/libxgboost4j8219882498856405938.so(+0x599a43) [0x7f4af479ba43]
[bt] (2) /raid/tmp/libxgboost4j8219882498856405938.so(xgboost::MetaInfo::Validate(int) const+0x2b2) [0x7f4af47b0802]
[bt] (3) /raid/tmp/libxgboost4j8219882498856405938.so(xgboost::LearnerConfiguration::InitBaseScore(xgboost::DMatrix const*)+0xdd) [0x7f4af48fe28d]
[bt] (4) /raid/tmp/libxgboost4j8219882498856405938.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x74) [0x7f4af4902e84]
[bt] (5) /raid/tmp/libxgboost4j8219882498856405938.so(XGBoosterUpdateOneIter+0x70) [0x7f4af45b4e20]
[bt] (6) [0x7f5ad83a03e7]
at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
at ml.dmlc.xgboost4j.java.Booster.update(Booster.java:218)
at ml.dmlc.xgboost4j.java.XGBoost.trainAndSaveCheckpoint(XGBoost.java:219)
at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:306)
at ml.dmlc.xgboost4j.scala.XGBoost$.$anonfun$trainAndSaveCheckpoint$5(XGBoost.scala:68)
at scala.Option.getOrElse(Option.scala:189)
at ml.dmlc.xgboost4j.scala.XGBoost$.trainAndSaveCheckpoint(XGBoost.scala:65)
at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:108)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:344)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$3(XGBoost.scala:418)
at scala.Option.map(Option.scala:230)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$2(XGBoost.scala:417)
at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
......
real 1m10.856s
user 0m49.489s
sys 0m4.209s
-------- executor -----
cat 0/stderr
Spark Executor Command: "/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java" "-cp" "/test/rapids-4-spark.jar:/spark-3.1.2-bin-hadoop3.2/conf/:/spark-3.1.2-bin-hadoop3.2/jars/*" "-Xmx32768M" "-Dspark.driver.port=41657" "-Dspark.network.timeout=600s" "-Djava.io.tmpdir=/raid/tmp" "-Dai.rapids.cudf.prefer-pinned=true" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://[email protected]:41657" "--executor-id" "0" "--hostname" "127.0.0.1" "--cores" "8" "--app-id" "app-20230722093648-0000" "--worker-url" "spark://[email protected]:45319" "--resourcesFile" "/spark-3.1.2-bin-hadoop3.2/work/app-20230722093648-0000/0/resource-executor-7056249029782452200.json"
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
23/07/22 09:36:50 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 1100@4902560
23/07/22 09:36:50 INFO SignalUtils: Registering signal handler for TERM
23/07/22 09:36:50 INFO SignalUtils: Registering signal handler for HUP
23/07/22 09:36:50 INFO SignalUtils: Registering signal handler for INT
......
23/07/22 09:37:39 INFO TransportClientFactory: Successfully created connection to /127.0.0.1:42277 after 2 ms (0 ms spent in bootstraps)
[09:37:49] task 2 got new rank 0
[09:37:50] WARNING: /workspace/src/common/error_msg.cc:25: The tree method `gpu_hist` is deprecated since 2.0.0. To use GPU training, set the `device` parameter to CUDA instead.
E.g. tree_method = "hist", device = "cuda"
23/07/22 09:37:54 ERROR XGBoostSpark: XGBooster worker 2 has failed 0 times due to
ml.dmlc.xgboost4j.java.XGBoostError: [09:37:54] /workspace/src/data/data.cc:730: Invalid device ordinal. Data is associated with a different device ordinal than the booster. The device ordinal of the data is: 2; the device ordinal of the Booster is: 0
Stack trace:
[bt] (0) /raid/tmp/libxgboost4j8219882498856405938.so(+0x59993a) [0x7f4af479b93a]
[bt] (1) /raid/tmp/libxgboost4j8219882498856405938.so(+0x599a43) [0x7f4af479ba43]
[bt] (2) /raid/tmp/libxgboost4j8219882498856405938.so(xgboost::MetaInfo::Validate(int) const+0x2b2) [0x7f4af47b0802]
[bt] (3) /raid/tmp/libxgboost4j8219882498856405938.so(xgboost::LearnerConfiguration::InitBaseScore(xgboost::DMatrix const*)+0xdd) [0x7f4af48fe28d]
[bt] (4) /raid/tmp/libxgboost4j8219882498856405938.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x74) [0x7f4af4902e84]
[bt] (5) /raid/tmp/libxgboost4j8219882498856405938.so(XGBoosterUpdateOneIter+0x70) [0x7f4af45b4e20]
[bt] (6) [0x7f5ad83a03e7]
at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
at ml.dmlc.xgboost4j.java.Booster.update(Booster.java:218)
at ml.dmlc.xgboost4j.java.XGBoost.trainAndSaveCheckpoint(XGBoost.java:219)
at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:306)
at ml.dmlc.xgboost4j.scala.XGBoost$.$anonfun$trainAndSaveCheckpoint$5(XGBoost.scala:68)
at scala.Option.getOrElse(Option.scala:189)
at ml.dmlc.xgboost4j.scala.XGBoost$.trainAndSaveCheckpoint(XGBoost.scala:65)
at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:108)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:344)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$3(XGBoost.scala:418)
at scala.Option.map(Option.scala:230)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$2(XGBoost.scala:417)
at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
23/07/22 09:37:54 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 5)
ml.dmlc.xgboost4j.java.XGBoostError: [09:37:54] /workspace/src/data/data.cc:730: Invalid device ordinal. Data is associated with a different device ordinal than the booster. The device ordinal of the data is: 2; the device ordinal of the Booster is: 0
Stack trace:
[bt] (0) /raid/tmp/libxgboost4j8219882498856405938.so(+0x59993a) [0x7f4af479b93a]
[bt] (1) /raid/tmp/libxgboost4j8219882498856405938.so(+0x599a43) [0x7f4af479ba43]
[bt] (2) /raid/tmp/libxgboost4j8219882498856405938.so(xgboost::MetaInfo::Validate(int) const+0x2b2) [0x7f4af47b0802]
[bt] (3) /raid/tmp/libxgboost4j8219882498856405938.so(xgboost::LearnerConfiguration::InitBaseScore(xgboost::DMatrix const*)+0xdd) [0x7f4af48fe28d]
[bt] (4) /raid/tmp/libxgboost4j8219882498856405938.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x74) [0x7f4af4902e84]
[bt] (5) /raid/tmp/libxgboost4j8219882498856405938.so(XGBoosterUpdateOneIter+0x70) [0x7f4af45b4e20]
[bt] (6) [0x7f5ad83a03e7]
at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
at ml.dmlc.xgboost4j.java.Booster.update(Booster.java:218)
at ml.dmlc.xgboost4j.java.XGBoost.trainAndSaveCheckpoint(XGBoost.java:219)
at ml.dmlc.xgboost4j.java.XGBoost.train(XGBoost.java:306)
at ml.dmlc.xgboost4j.scala.XGBoost$.$anonfun$trainAndSaveCheckpoint$5(XGBoost.scala:68)
at scala.Option.getOrElse(Option.scala:189)
at ml.dmlc.xgboost4j.scala.XGBoost$.trainAndSaveCheckpoint(XGBoost.scala:65)
at ml.dmlc.xgboost4j.scala.XGBoost$.train(XGBoost.scala:108)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:344)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$3(XGBoost.scala:418)
at scala.Option.map(Option.scala:230)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$2(XGBoost.scala:417)
at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
23/07/22 09:37:54 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
23/07/22 09:37:54 INFO RapidsBufferCatalog: Closing storage
23/07/22 09:37:54 ERROR CoarseGrainedExecutorBackend: RECEIV
root@4902560:/spark-3.1.2-bin-hadoop3.2/work
xgb-driver.txt
xgb-executor.txt
PASS JAR:
xgboost4j-gpu_2.12-2.0.0-20230716.001501-545.jar
FAILED JAR:
xgboost4j-gpu_2.12-2.0.0-20230717.191954-548.jar
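For reference, the deprecation warning in the executor log ("The tree method `gpu_hist` is deprecated since 2.0.0...") describes the XGBoost 2.0 parameter style that this code path now goes through. A minimal stdlib-only sketch of that parameter migration (the `migrate_gpu_params` helper is hypothetical, not part of XGBoost; it only illustrates how the old `gpu_hist` setting maps onto `tree_method="hist"` plus an explicit `device` ordinal, which is the value the error message says disagrees between the data and the booster):

```python
# Illustration of the parameter change flagged by the 2.0.0 warning:
# `tree_method="gpu_hist"` is deprecated; GPU training is requested with
# `tree_method="hist"` plus an explicit `device` such as "cuda:N".

deprecated_params = {"tree_method": "gpu_hist"}  # pre-2.0 style


def migrate_gpu_params(params, ordinal=0):
    """Hypothetical helper: rewrite the deprecated gpu_hist setting into the
    2.0 style, pinning the booster to an explicit CUDA ordinal so that it
    matches the device the task's data lives on."""
    out = dict(params)
    if out.get("tree_method") == "gpu_hist":
        out["tree_method"] = "hist"
        out["device"] = f"cuda:{ordinal}"
    return out


# Task 2 in the log was assigned GPU ordinal 2:
print(migrate_gpu_params(deprecated_params, ordinal=2))
# {'tree_method': 'hist', 'device': 'cuda:2'}
```

In the failing run above, the data sits on ordinal 2 while the booster defaults to ordinal 0, which is exactly the mismatch `MetaInfo::Validate` rejects; the sketch only shows the parameter shape, not a fix for the plugin.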