22.05.11: Regression: Node (offline) state not persisted across server restart #2579
Comments
Looking further at the code and differences in behaviour between 20.0.1 and 22.05.11 it appears that previously the state was saved to the DB twice: Once as integer in the In 22.05.11 the node state attribute is no longer saved to the database at all. Therefore
diff --git a/src/server/pbsd_init.c b/src/server/pbsd_init.c
index 63b18fb5..e8d90c50 100644
--- a/src/server/pbsd_init.c
+++ b/src/server/pbsd_init.c
@@ -1557,6 +1557,8 @@ pbsd_init_node(pbs_db_node_info_t *dbnode, int type)
 add_mom_to_pool(np->nd_moms[0]);
 }
 }
+np->nd_state = dbnode->nd_state;
 } else {
 if (rc == PBSE_NODEEXIST)
 sprintf(log_buffer, "duplicate node \"%s\"", dbnode->nd_name);
Would this be a valid workaround/fix?
Hello mw-a-san,
Hi @CRAY-Hoshina, Thanks for testing! I fully expected there to be corner cases like that. That's likely why the old state was previously applied incrementally through what looks like a state machine to take all of these into account. Since I only care about the offline state, it could also be as simple as copying only that single flag, like so (untested):
diff --git a/src/server/pbsd_init.c b/src/server/pbsd_init.c
index 63b18fb5..e8d90c50 100644
--- a/src/server/pbsd_init.c
+++ b/src/server/pbsd_init.c
@@ -1557,6 +1557,8 @@ pbsd_init_node(pbs_db_node_info_t *dbnode, int type)
 add_mom_to_pool(np->nd_moms[0]);
 }
 }
+np->nd_state |= dbnode->nd_state & INUSE_OFFLINE;
 } else {
 if (rc == PBSE_NODEEXIST)
 sprintf(log_buffer, "duplicate node \"%s\"", dbnode->nd_name);
Would love to get feedback from PBS devs on the issue.
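For readers comparing the two patches above, here is a minimal standalone sketch of the difference between assigning the whole saved state and merging only the offline bit. It is not PBS code: the `DEMO_*` flag values and the function names are hypothetical placeholders chosen for illustration, and the real flag definitions live in the PBS headers.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical flag values for illustration only; not taken from the PBS headers. */
#define DEMO_INUSE_OFFLINE 0x01u
#define DEMO_INUSE_DOWN    0x02u
#define DEMO_INUSE_BUSY    0x80u

/* First patch: overwrite the in-memory state with everything that was saved. */
unsigned restore_full(unsigned current, unsigned saved)
{
    (void)current; /* the freshly initialised state is discarded */
    return saved;
}

/* Second patch: keep the fresh state and OR in only the saved offline bit. */
unsigned restore_offline_only(unsigned current, unsigned saved)
{
    return current | (saved & DEMO_INUSE_OFFLINE);
}

/* Small demonstration: node was saved as offline+busy, restarts as down. */
void demo_restore(void)
{
    unsigned saved = DEMO_INUSE_OFFLINE | DEMO_INUSE_BUSY;
    unsigned fresh = DEMO_INUSE_DOWN;

    /* Full copy brings back the busy bit and loses the fresh down bit. */
    assert(restore_full(fresh, saved) == (DEMO_INUSE_OFFLINE | DEMO_INUSE_BUSY));
    /* Masked merge keeps down, adds offline, and ignores busy. */
    assert(restore_offline_only(fresh, saved) == (DEMO_INUSE_OFFLINE | DEMO_INUSE_DOWN));

    printf("full copy:    %#x\n", restore_full(fresh, saved));
    printf("offline only: %#x\n", restore_offline_only(fresh, saved));
}
```

Calling `demo_restore()` from a `main` exercises both variants. The masked version only ever adds the offline bit, so any other state the server computes at startup is left alone, which is the reason for preferring it when only the offline flag matters.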
Hello mw-a-san, Thanks for simplifying the code. I will use this code at my own risk until it is officially adopted. Thank you again,
Hello,
Hi @vchlum, thanks for getting back on this! Can you say which upcoming release this will be in and how far off it is?
Sorry @mw-a, I am not from Altair. I am just a contributor. I have no idea about upcoming official releases. We do our own releases based on the public code for our needs.
Hello,
it seems with version 22.05.11 node offline state is no longer persisted across server restart. Reproducer:
Expected behaviour:
Node is offline and jobs remain queued.
Actual behaviour:
Node is open after restart and jobs execute.
From a look at the source code I think this regression may have been introduced in 4bbecab, where node_recov_db_raw() was replaced with node_recov_db(). But while the call to node_recov_db_raw() was removed from setup_nodes() in node_func.c, no call to node_recov_db() was added anywhere.