Page MenuHomePhabricator

Moving a Flow discussion page can render it inaccessible
Open, HighPublic

Description

In T371582: [Flow] The Forum des nouveaux is broken, fr.wikipedia encountered an issue with the Forum des nouveaux page. They archived it to https://fr.wikipedia.org/wiki/Wikipédia:Forum_des_nouveaux/Archives_Flow, which rendered the page inaccessible with the following error:

The Structured Discussions workflow is not associated with this page.
[d9ceb316-1237-49b9-8fe3-91c477eb9976] 2024-08-01 08:58:23: Fatal exception of type "Flow\Exception\InvalidDataException"

This happened due to a mismatch of the workflow title (which was the old one, Forum des nouveaux) and the title reported by MediaWiki Core (Forum des nouveaux/Archives Flow). Such mismatches can be fixed by running FlowFixInconsistentBoards.php, but doing so requires a sysadmin, and definitely should not be a regular part of moving a Flow discussion page.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

This task has a high potential of impacting T332022: [Epic] Undeploying StructuredDiscussions (Flow), as it might happen with archiving other Flow-based discussion pages, slowing down the archiving process.

@KStoller-WMF @ppelberg @Trizek-WMF, leaving it up to you to prioritise as needed.

KStoller-WMF moved this task from Inbox to Up Next on the Growth-Team board.

This task has a high potential of impacting T332022: [Epic] Undeploying StructuredDiscussions (Flow), as it might happen with archiving other Flow-based discussion pages, slowing down the archiving process.

@KStoller-WMF @ppelberg @Trizek-WMF, leaving it up to you to prioritise as needed.

Great spot – thank you for identifying this, @Urbanecm_WMF.

It sounds like to avoid the issue captured in this ticket, the Editing Team needs to ensure:

  1. Any manual instructions we offer people manually migrating active Flow pages to sub-pages/archives mention ____
  2. The migration script we will be writing, and are investigating in T371738, will need to ____

Martin: can you please fill in the two blanks (____) in "1)" and "2)" above? Of course, if there is anything about the two points above that you think should be modified, I'd value knowing as much.

Zooming out a bit, per what @KStoller-WMF and I talked about offline a short-time ago, if issues like this emerge again in the time between now and when we implement what "1)" and "2)" in T371769#10043631 describe, the Editing and Growth Teams will share the responsibility for fixing them with the guiding principle being, whoever finds it first fixes it.

It sounds like to avoid the issue captured in this ticket, the Editing Team needs to ensure:

  1. Any manual instructions we offer people manually migrating active Flow pages to sub-pages/archives mention ____
  2. The migration script we will be writing, and are investigating in T371738, will need to ____

Martin: can you please fill in the two blanks (____) in "1)" and "2)" above? Of course, if there is anything about the two points above that you think should be modified, I'd value knowing as much.

@Urbanecm_WMF, gently nudging this to ask: could you please help us fill in the blank within item "2." above?

Thanks for the reminder, @ppelberg! Here are my answers:

Ad 1) Verify the archived page can still be accessed (by visiting it manually). That way, the archiver verifies the purpose of archiving was fulfilled. Since this bug pretty much takes an effect immediately, it should be enough to visit the page right after the archiving and see if it is still readable.

Ad 2) Ideally, the script would do the same thing as users would in (1) – visit the page and ensure there is no error. Unfortunately, that looks to be fairly tricky to do – especially since the error was not logged in Logstash (T371586), which means we somehow rely on user reports (at least momentarily). However, it is fairly easy to check for the symptoms of this task: both MediaWiki and Flow keeps track of the page's name separately. Under normal circumstances, both MediaWiki and Flow have the same page name in their records. When those names get out of sync, this bug occurs. As such, the script can verify Flow and MediaWiki report the same page title. If that is the case, then this bug shouldn't occur. However, with Flow having as many bugs as it does, a bug with a different cause might surprise us fairly easily, so this name check might not prevent everything.

Hope this clarification makes sense. Let me know if not!

The clarification you shared in T371769#10135854 very much helps...thank you, @Urbanecm_WMF.

I (or someone else from the Editing Team) will be in touch if/when new questions emerge.

@Urbanecm_WMF do you know what process was followed to move the page?

@Urbanecm_WMF do you know what process was followed to move the page?

No, I only know what happened after the move (which I believe should be described in the description). Based on the on-wiki logs (link), I presume they used Special:MovePage to trigger the move, but I do not know that for certain. @Trizek-WMF might recall more details, as they reported the breakage that led to this (T371582). If we want to confirm with certainty, it might be a good idea to contact the admin who did the move (if this is needed, I believe @Trizek-WMF would be better positioned to do that than I would). Hope this helps!

Special:MovePage being the only public way of doing this, I believe it was used. I asked for confirmation.

Previous cases of pages being inaccessible where after moves I performed using Special:MovePage.

Following some quick investigation: Flow is certainly trying to handle this when Special:MovePage is used. It hooks onTitleMoveStarting to prepare moving the page to the new name, and onPageMoveCompleting to actually commit that change. This is trying to do the same thing as the FlowFixInconsistentBoards.php script is, and that clearly works. As such, I suspect it's something to do with the state of the page during the move.

A quick check of the file's history leaves me somewhat suspicious of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Flow/ /607398 which switched the onTitleMoveCompleting hook to onPageMoveCompleting and added $newTitle = Title::newFromLinkTarget( $newTitle ); leaving us with:

	public function onPageMoveCompleting(
		$oldTitle,
		$newTitle,
		$user,
		$pageid,
		$redirid,
		$reason,
		$revisionRecord
	) {
		$newTitle = Title::newFromLinkTarget( $newTitle );
		if ( $newTitle->getContentModel() === CONTENT_MODEL_FLOW_BOARD ) {
			Container::get( 'board_mover' )->commit();
		}
	}

The conversion is needed because a LinkTarget doesn't know about content models. However, I'm fairly sure that this switch means that when that $newTitle->getContentModel() happens it'll try to fetch the contentmodel from the DB... while we're in the middle of doing the move, and explicitly before the call to endAtomic in MovePage that'll actually commit the changes. (Alternately, it would be available, but it doesn't know to fetch from the primary DB in this situation.) I'm insufficiently familiar with mw-core's atomic database commits to know whether this would be causing other problems via it not ending/canceling that section -- I presume either it's not, or this isn't really the issue.

I assume that the reason for checking the content model in that hook is just to make sure there's an in-progress move of a page happening before commit is called. If I'm right about this, we could probably just switch it out for a check on the BoardMover to see whether it has an open transaction.

Having gotten Flow running locally and stepped through this more closely, that LinkTarget conversion looks like a red herring -- it's passing in a Title, and newFromLinkTarget has explicit code to notice that and give back the Title rather than trying to refetch anything.

Also, locally I can't reproduce the error -- which admittedly could be because my local setup is insufficiently like the production setup with db replication.

I'd speculate that this might be a cache invalidation issue. Martin unfortunately only checked the database and tried purging the cache in T371582#10034787 after he’d already run the fixer maintenance script, but a plausible explanation of what’s going on is that the database is being properly updated when the page move happens and then the cache is occasionally either not being properly cleared or is being immediately filled with stale data.

The move process does do things that look like they should be putting the new data into the cache, so I’d suspect the latter. If so the mechanism could perhaps be a request hitting the old page at just the right moment such that the page move hasn't finished replicating out from the primary, so the cache is refilled with the data from the secondary database. I don't have a great grasp of Flow's caching layer, so I don't know how plausible this is as an explanation.

A possible approach if this theory is correct would just be to catch the error in Flow's ViewAction and purge that workflow from the cache when we see it. The drawback of this would be that if it's some other issue, we'd just be increasing database load slightly when that broken board is visited.

From a meeting: we're not going to do anything to proactively fix this right now. Instead we're going to just run the fixup maintenance script and purge the cache after our migration script, since we know that fixed the previous broken board. Then if we discover a broken board after a later manual move, we might dive in to investigate it and see whether we can learn more.

From a meeting: we're not going to do anything to proactively fix this right now. Instead we're going to just run the fixup maintenance script and purge the cache after our migration script, since we know that fixed the previous broken board.

Note for our future-selves: "run the fixup maintenance script" is documented as requirement "3." in T371738:

Once script executes steps "1." and "2.", run the fix inconsistent boards maintenance script and purge the workflow cache

I'd speculate that this might be a cache invalidation issue. Martin unfortunately only checked the database and tried purging the cache in T371582#10034787 after he’d already run the fixer maintenance script, but a plausible explanation of what’s going on is that the database is being properly updated when the page move happens and then the cache is occasionally either not being properly cleared or is being immediately filled with stale data.

Let's check it before then! Growth gets tickets about broken boards from time to time. Today, another one (T377360) came, so I checked this theory and posted results as T377360#10239075. Unfortunately, it seems this theory doesn't appear to be correct – in this case, the DB appears to be not updated.

It's worth pointing out that the code that threw an exception here was specific to the Beta feature disabling, whereas the previous case was a manual move.

Yeah, this seems to be the exact opposite of the previous issue: it's a failed move that did update the workflow page title, as opposed to a successful move that didn't update the workflow page title.

Next steps
If/when this issue recurs, we're going to revisit this task with clearer reproductions steps in-hand.

In the time between now and then, we're going to move this to the Editing Team backlog.