Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

22.05.11: Regression: Node (offline) state not persisted across server restart #2579

Open
mw-a opened this issue Mar 31, 2023 · 8 comments
Open

Comments

@mw-a
Copy link

mw-a commented Mar 31, 2023

Hello,

it seems with version 22.05.11 node offline state is no longer persisted across server restart. Reproducer:

Expected behaviour:

[root@rockypbsmaster spool]# pbsnodes --version
pbs_version = 20.0.1
[root@rockypbsmaster spool]# pbsnodes -o rockypbsnode 
[root@rockypbsmaster spool]# pbsnodes -l
rockypbsnode         offline 
root@rockypbsmaster spool]# systemctl restart pbs 
[root@rockypbsmaster spool]# pbsnodes -l
rockypbsnode         offline 

Node is offline and jobs remain queued queued.

Actual behaviour:

[root@rockypbsmaster openpbs_22.05.11.rockylinux_8.5]# pbsnodes --version
pbs_version = 22.05.11
[root@rockypbsmaster openpbs_22.05.11.rockylinux_8.5]# pbsnodes -o rockypbsnode
[root@rockypbsmaster openpbs_22.05.11.rockylinux_8.5]# pbsnodes -l
rockypbsnode         offline 
[root@rockypbsmaster openpbs_22.05.11.rockylinux_8.5]# systemctl restart pbs 
[root@rockypbsmaster openpbs_22.05.11.rockylinux_8.5]# pbsnodes -l
[root@rockypbsmaster openpbs_22.05.11.rockylinux_8.5]#          

Node is open after restart and jobs execute.

From a look at the source code I think this regression may have been introduced in 4bbecab where node_recov_db_raw() was replaced with node_recov_db() but while the call to node_recov_db_raw() was removed from setup_nodes() in node_func.c, no call to node_recov_db() was added anywhere.

@mw-a
Copy link
Author

mw-a commented Apr 5, 2023

Looking further at the code and differences in behaviour between 20.0.1 and 22.05.11 it appears that previously the state was saved to the DB twice: Once as integer in the nd_state column of the pbs.node table and again as attribute in the attributes array column. It seems node state was not recovered from the nd_state column but incrementally by applying the state attribute through setup_nodes()/create_pbs_node2()/mgr_set_attr2().

In 22.05.11 the node state attribute is no longer saved to the database at all. Therefore setup_nodes() cannot restore the state that way. Is this intentional?

nd_state is also not recovered from the DB. The unused node_recov_db()/db_to_node() code path would do it but it isn't called anywhere. The actually active setup_nodes()/pbs_db_search()/recov_node_cb()/pbsd_init_node() codepath does not seem to do it. Adding it in the latter like so seems to make node state survive server restart:

diff --git a/src/server/pbsd_init.c b/src/server/pbsd_init.c
index 63b18fb5..e8d90c50 100644
--- a/src/server/pbsd_init.c
    b/src/server/pbsd_init.c
@@ -1557,6  1557,8 @@ pbsd_init_node(pbs_db_node_info_t *dbnode, int type)
                                add_mom_to_pool(np->nd_moms[0]);
                        }
                }
 
                np->nd_state = dbnode->nd_state;
        } else {
                if (rc == PBSE_NODEEXIST)
                        sprintf(log_buffer, "duplicate node \"%s\"", dbnode->nd_name);

Would this be a valid workaround/fix?

@CRAY-Hoshina
Copy link

CRAY-Hoshina commented May 19, 2023

Hello mw-a-san,
I tried your patch. So, I think it is better to copy only OFFLINE, due to even nodes that have not started MOM will inherit the past status and become FREE.

diff -u a/src/server/pbsd_init.c b/src/server/pbsd_init.c
--- a/src/server/pbsd_init.c    2023-04-25 11:35:24.837031589  0900
    b/src/server/pbsd_init.c    2023-05-19 13:21:29.025573419  0900
@@ -1558,6  1558,9 @@
                                add_mom_to_pool(np->nd_moms[0]);
                        }
                }
                if (dbnode->nd_state & INUSE_OFFLINE) {
                        np->nd_state = dbnode->nd_state;
                }
        } else {
                if (rc == PBSE_NODEEXIST)
                        sprintf(log_buffer, "duplicate node \"%s\"", dbnode->nd_name);

@mw-a
Copy link
Author

mw-a commented May 22, 2023

Hi @CRAY-Hoshina,

Thanks for testing! I fully expected there to be corner cases like that. That's likely why the old state was previously applied incrementally trough what looks like a state machine to take all these into account. Since I only care about the offline state, it could also be as simple as only copying the single flag like so (untested):

diff --git a/src/server/pbsd_init.c b/src/server/pbsd_init.c
index 63b18fb5..e8d90c50 100644
--- a/src/server/pbsd_init.c
    b/src/server/pbsd_init.c
@@ -1557,6  1557,8 @@ pbsd_init_node(pbs_db_node_info_t *dbnode, int type)
                                add_mom_to_pool(np->nd_moms[0]);
                        }
                }
 
                np->nd_state |= dbnode->nd_state & INUSE_OFFLINE;
        } else {
                if (rc == PBSE_NODEEXIST)
                        sprintf(log_buffer, "duplicate node \"%s\"", dbnode->nd_name);

Would love to get feedback from PBS devs on the issue.

@CRAY-Hoshina
Copy link

Hello mw-a-san,

Thanks for simplifying the code. I will use this code at my own risk until it is officially adopted.

Thank you again,

@fnevgeny
Copy link
Contributor

fnevgeny commented Sep 2, 2024

Hello,
I can confirm the bug is still present in 23.06.06. Could somebody from the developers please comment on this?

@vchlum
Copy link
Contributor

vchlum commented Sep 2, 2024

Hi! This problem has been resolved by 49b6967. If you apply this patch, you should also apply 9f77977.

@mw-a
Copy link
Author

mw-a commented Sep 2, 2024

Hi @vchlum, thanks for getting back on this! Can you say which upcoming release this will be in and how far off it is?

@vchlum
Copy link
Contributor

vchlum commented Sep 2, 2024

Sorry @mw-a, I am not from Altair. I am just a contributor. I have no idea about upcoming official releases. We do our own releases based on the public code for our needs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants
@fnevgeny @vchlum @mw-a @CRAY-Hoshina and others