cylc clean's advice on which database file to remove to fix issues could be better #6165

Closed
jarich opened this issue Jun 21, 2024 · 4 comments · Fixed by #6234

jarich commented Jun 21, 2024

Description

Sometimes we end up with corrupted databases. Running cylc clean on an affected workflow gives us this error:

[user@host Suite]$ cylc clean workflow_pp
WARNING - This database is either corrupted or not compatible
    with this version of "cylc clean".
    Try using the version of Cylc the workflow was last ran with to
    remove it.
    Otherwise please delete the database file.
CylcError: Clean failed:
Workflow: workflow_pp
Error: Cannot clean workflow_pp - no such table: task_jobs

Deleting $HOME/cylc-run/workflow_pp/runN/log/db doesn't fix the error.

Delving into the code shows that it specifically references $HOME/cylc-run/workflow_pp/runN/.service/db. Deleting this db resolves the issue and the workflow is cleaned up.

It appears there are two (usually identical?) copies of the database. These definitely aren't the same file:

[user@host ~]$  if [ "$(stat -L -c %d:%i $HOME/cylc-run/workflow_pp/runN/log/db)" = "$(stat -L -c %d:%i  $HOME/cylc-run/workflow_pp/runN/.service/db)" ]; then
>   echo "FILE1 and FILE2 refer to a single file, with one inode, on one device."
> else
>   echo "no match"
> fi
no match

Deliberately corrupting $HOME/cylc-run/workflow_pp/runN/.service/db suggests that I can "recover" the database by copying $HOME/cylc-run/workflow_pp/runN/log/db over the top of $HOME/cylc-run/workflow_pp/runN/.service/db.

[user@host ~]$ > cylc-run/workflow_pp/runN/.service/db   # truncate to empty file
[user@host ~]$ cylc clean workflow_pp
WARNING - This database is either corrupted or not compatible
    with this version of "cylc clean".
    Try using the version of Cylc the workflow was last ran with to
    remove it.
    Otherwise please delete the database file.
CylcError: Clean failed:
Workflow: workflow_pp
Error: Cannot clean workflow_pp - no such table: task_jobs
[user@host ~]$ cp cylc-run/workflow_pp/run1/log/db cylc-run/workflow_pp/run1/.service/
[user@host ~]$ cylc clean workflow_pp
Would clean the following workflows:
  workflow_pp/run1
Remove these workflows (y/n): y
INFO - Cleaning workflow_pp/run1 on install target: user:c:kit
INFO - [user:c:kit]
    INFO - Removing symlink and its target directory: ...
    INFO - Removing symlink and its target directory: ...
    INFO - Removing directory: ...
    INFO - Removing directory: ...
    INFO - Removing directory: ...
    INFO - Removing directory: ...
INFO - Removing directory: ...
INFO - Removing directory: ...
INFO - Removing directory: ...

Reproducible Example

  1. Truncate .service/db: > $HOME/cylc-run/workflow_pp/runN/.service/db
  2. cylc clean workflow_pp # Note it fails
  3. Remove "obvious" database: rm $HOME/cylc-run/workflow_pp/runN/log/db
  4. cylc clean workflow_pp # Note it continues to fail

Expected Behaviour

The error message should state that $HOME/cylc-run/workflow_pp/runN/.service/db, specifically, is the file that needs to be deleted. For example:

WARNING - This database is either corrupted or not compatible
    with this version of "cylc clean".
    Try using the version of Cylc the workflow was last ran with to
    remove it.
    Otherwise please delete the database file: workflow_pp/runN/.service/db
CylcError: Clean failed:
Workflow: workflow_pp
Error: Cannot clean workflow_pp - no such table: task_jobs
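
A rough Python sketch of the requested behaviour, assuming the code that emits the warning already knows the database path. The function name `corrupt_db_warning` is hypothetical and this is not the change that was eventually merged; it only shows the path being appended to the existing message.

```python
from pathlib import PurePosixPath


def corrupt_db_warning(db_path: PurePosixPath) -> str:
    """Build the warning text, ending with the exact file to delete.

    Hypothetical helper; the wording is copied from the current message.
    """
    return (
        "This database is either corrupted or not compatible\n"
        '    with this version of "cylc clean".\n'
        "    Try using the version of Cylc the workflow was last ran with to\n"
        "    remove it.\n"
        f"    Otherwise please delete the database file: {db_path}"
    )


message = corrupt_db_warning(PurePosixPath("workflow_pp/runN/.service/db"))
print(message)
```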
jarich added the bug label ("Something is wrong :(") Jun 21, 2024
@oliver-sanders (Member)

> Sometimes we end up with corrupted databases

Bigger question: why are you getting corrupted databases? This should not happen and hints at a deeper problem.

Filesystem locks and the SQLite implementation should make this impossible. This page outlines the circumstances under which corruption can happen: https://sqlite.org/howtocorrupt.html

> It appears there are two (usually identical?) copies of the database

Yes, Cylc maintains two databases: a "private" database in the .service directory for use by the Cylc scheduler, and a "public" database in the log directory which may be used by downstream services such as cylc review, rose_prune, fcm_make, etc.

Due to the nature of sqlite, parallel access can potentially result in DB locking issues. The Cylc scheduler will detect a locked public database and recover it from the private database. So the public database serves to isolate the scheduler's database from external interference.
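
To illustrate the private-to-public recovery described above, here is a minimal Python sketch. It is not the scheduler's actual implementation: the function name, the file names and the one-column `task_jobs` table are invented for the demo, and it uses the standard library's `sqlite3.Connection.backup` to take a consistent snapshot.

```python
import sqlite3
import tempfile
from pathlib import Path


def restore_public_from_private(private_db: Path, public_db: Path) -> None:
    """Replace public_db with a consistent snapshot of private_db.

    Hypothetical helper for illustration, not part of Cylc.
    """
    public_db.unlink(missing_ok=True)  # discard the unusable copy
    src = sqlite3.connect(private_db)
    dst = sqlite3.connect(public_db)
    try:
        src.backup(dst)  # page-by-page copy of the whole database
    finally:
        dst.close()
        src.close()


with tempfile.TemporaryDirectory() as tmp:
    private = Path(tmp) / "private.db"
    public = Path(tmp) / "public.db"

    # A healthy "private" database with a toy table.
    conn = sqlite3.connect(private)
    conn.execute("CREATE TABLE task_jobs (name TEXT)")
    conn.execute("INSERT INTO task_jobs VALUES ('foo')")
    conn.commit()
    conn.close()

    # A damaged "public" copy.
    public.write_bytes(b"not a database")

    restore_public_from_private(private, public)

    check = sqlite3.connect(public)
    restored_rows = check.execute("SELECT name FROM task_jobs").fetchall()
    check.close()

print(restored_rows)  # [('foo',)]
```

Note the direction: the private database is treated as the source of truth, so this does not help when the private copy itself is the damaged one.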

oliver-sanders (Member) commented Jun 21, 2024

The cylc clean command uses the workflow's database to determine which remote filesystems Cylc has installed the workflow onto, so that it can locate and remove those installations. Sadly, if the database has become corrupted, cylc clean cannot do this. The only thing you can do is get Cylc to remove the local files (cylc clean --local) and remove the remote files manually. Removing the corrupted database has a similar effect (make sure you clean those remote files!).
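
As a rough illustration of why a damaged database blocks remote cleaning, here is a Python sketch of the kind of lookup involved. The `task_jobs` table name comes from the error message in this issue; the `platform_name` column, the function name and the return convention are assumptions made for the example and may not match Cylc's real schema or code.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import Optional, Set


def platforms_used(db_path: Path) -> Optional[Set[str]]:
    """Return the platforms recorded in task_jobs, or None if unreadable.

    Hypothetical helper; `platform_name` is an assumed column name.
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT DISTINCT platform_name FROM task_jobs"
        ).fetchall()
    except sqlite3.DatabaseError:
        # Missing table or unreadable file: remote targets are unknown.
        return None
    finally:
        conn.close()
    return {row[0] for row in rows}


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.db"
    conn = sqlite3.connect(good)
    conn.execute("CREATE TABLE task_jobs (name TEXT, platform_name TEXT)")
    conn.executemany(
        "INSERT INTO task_jobs VALUES (?, ?)",
        [("a", "localhost"), ("b", "hpc"), ("c", "hpc")],
    )
    conn.commit()
    conn.close()

    empty = Path(tmp) / "empty.db"
    empty.touch()  # truncated database, as in the report

    healthy = platforms_used(good)
    broken = platforms_used(empty)

print(healthy, broken)
```

When the lookup returns None there is nothing to tell the tool where the remote copies live, which is why only a local clean plus manual remote cleanup remains.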

oliver-sanders self-assigned this Jul 15, 2024
oliver-sanders added this to the 8.3.3 milestone Jul 15, 2024
wxtim modified the milestones: 8.3.3, 8.3.4 Jul 23, 2024
@MetRonnie (Member)

Closed by #6234?

@oliver-sanders (Member)

@jarich, we've changed the error message slightly as requested.

However, this isn't something that should be possible; corrupted databases are a cause for concern. I'll close this issue now, but feel free to follow up on the database issue.

MetRonnie linked a pull request Aug 12, 2024 that will close this issue
MetRonnie modified the milestones: 8.3.4, 8.3.3 Aug 12, 2024