Moving a Flow discussion page can render it inaccessible
Open, HighPublic
Actions

Assigned To

Authored By

	Urbanecm_WMF
	Aug 3 2024, 10:51 PM

Description

In T371582: [Flow] The Forum des nouveaux is broken, fr.wikipedia encountered an issue with the Forum des nouveaux page. They archived it to https://fr.wikipedia.org/wiki/Wikipédia:Forum_des_nouveaux/Archives_Flow, which rendered the page inaccessible with the following error:

The Structured Discussions workflow is not associated with this page.
[d9ceb316-1237-49b9-8fe3-91c477eb9976] 2024-08-01 08:58:23: Fatal exception of type "Flow\Exception\InvalidDataException"

This happened due to a mismatch of the workflow title (which was the old one, Forum des nouveaux) and the title reported by MediaWiki Core (Forum des nouveaux/Archives Flow). Such mismatches can be fixed by running FlowFixInconsistentBoards.php, but doing so requires a sysadmin, and definitely should not be a regular part of moving a Flow discussion page.

Related Objects
Search...

Status	Assigned	Task
Open	None	T335670 Disable Flow/Structured Discussions from all namespaces, except for user talk pages, on Catalan Wikipedia
Open	None	T106123 Extensions needing to be removed from Wikimedia wikis
Open	None	T332022 [Epic] Undeploying StructuredDiscussions (Flow)
Open	None	T377050 Phase #1: Archive all active Flow boards
Resolved	DLynch	T371738 Write a script to automatically migrate Flow boards to sub-pages
Open	DLynch	T371769 Moving a Flow discussion page can render it inaccessible

Event Timeline

Urbanecm_WMF created this task.Aug 3 2024, 10:51 PM

Restricted Application added a project: Growth-Team. · View Herald TranscriptAug 3 2024, 10:51 PM

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

This task has a high potential of impacting T332022: [Epic] Undeploying StructuredDiscussions (Flow), as it might happen with archiving other Flow-based discussion pages, slowing down the archiving process.

@KStoller-WMF @ppelberg @Trizek-WMF, leaving it up to you to prioritise as needed.

Urbanecm_WMF mentioned this in T371582: [Flow] The Forum des nouveaux is broken.Aug 3 2024, 11:00 PM

KStoller-WMF triaged this task as High priority.Aug 5 2024, 3:36 AM

KStoller-WMF moved this task from Inbox to Up Next on the Growth-Team board.

ppelberg added a parent task: T371738: Write a script to automatically migrate Flow boards to sub-pages.Aug 5 2024, 7:46 PM

ppelberg added a project: Editing-team.

In T371769#10040614, @Urbanecm_WMF wrote:

This task has a high potential of impacting T332022: [Epic] Undeploying StructuredDiscussions (Flow), as it might happen with archiving other Flow-based discussion pages, slowing down the archiving process.

@KStoller-WMF @ppelberg @Trizek-WMF, leaving it up to you to prioritise as needed.

Great spot – thank you for identifying this, @Urbanecm_WMF.

It sounds like to avoid the issue captured in this ticket, the Editing Team needs to ensure:

Any manual instructions we offer people manually migrating active Flow pages to sub-pages/archives mention ____
The migration script we will be writing, and are investigating in T371738, will need to ____

Martin: can you please fill in the two blanks (____) in "1)" and "2)" above? Of course, if there is anything about the two points above that you think should be modified, I'd value knowing as much.

Zooming out a bit, per what @KStoller-WMF and I talked about offline a short-time ago, if issues like this emerge again in the time between now and when we implement what "1)" and "2)" in T371769#10043631 describe, the Editing and Growth Teams will share the responsibility for fixing them with the guiding principle being, whoever finds it first fixes it.

Novem_Linguae subscribed.Aug 6 2024, 3:29 PM

KStoller-WMF moved this task from Up Next to Needs Discussion on the Growth-Team board.Aug 6 2024, 4:05 PM

In T371769#10043631, @ppelberg wrote:

It sounds like to avoid the issue captured in this ticket, the Editing Team needs to ensure:

Any manual instructions we offer people manually migrating active Flow pages to sub-pages/archives mention ____

The migration script we will be writing, and are investigating in T371738, will need to ____

Martin: can you please fill in the two blanks (____) in "1)" and "2)" above? Of course, if there is anything about the two points above that you think should be modified, I'd value knowing as much.

@Urbanecm_WMF, gently nudging this to ask: could you please help us fill in the blank within item "2." above?

Thanks for the reminder, @ppelberg! Here are my answers:

Ad 1) Verify the archived page can still be accessed (by visiting it manually). That way, the archiver verifies the purpose of archiving was fulfilled. Since this bug pretty much takes an effect immediately, it should be enough to visit the page right after the archiving and see if it is still readable.

Ad 2) Ideally, the script would do the same thing as users would in (1) – visit the page and ensure there is no error. Unfortunately, that looks to be fairly tricky to do – especially since the error was not logged in Logstash (T371586), which means we somehow rely on user reports (at least momentarily). However, it is fairly easy to check for the symptoms of this task: both MediaWiki and Flow keeps track of the page's name separately. Under normal circumstances, both MediaWiki and Flow have the same page name in their records. When those names get out of sync, this bug occurs. As such, the script can verify Flow and MediaWiki report the same page title. If that is the case, then this bug shouldn't occur. However, with Flow having as many bugs as it does, a bug with a different cause might surprise us fairly easily, so this name check might not prevent everything.

Hope this clarification makes sense. Let me know if not!

The clarification you shared in T371769#10135854 very much helps...thank you, @Urbanecm_WMF.

I (or someone else from the Editing Team) will be in touch if/when new questions emerge.

ppelberg mentioned this in T371738: Write a script to automatically migrate Flow boards to sub-pages.Sep 13 2024, 5:48 PM

KStoller-WMF moved this task from Needs Discussion to Triaged on the Growth-Team board.Sep 18 2024, 9:07 PM

@Urbanecm_WMF do you know what process was followed to move the page?

ppelberg edited projects, added Editing-team (Kanban Board); removed Editing-team.Sep 25 2024, 4:44 PM

ppelberg moved this task from Incoming to Ready to Be Worked On on the Editing-team (Kanban Board) board.

ppelberg added a parent task: T370722: Set Flow and LQT sunsetting timeline and sequence.Sep 25 2024, 7:15 PM

In T371769#10176460, @DLynch wrote:

@Urbanecm_WMF do you know what process was followed to move the page?

No, I only know what happened after the move (which I believe should be described in the description). Based on the on-wiki logs (link), I presume they used Special:MovePage to trigger the move, but I do not know that for certain. @Trizek-WMF might recall more details, as they reported the breakage that led to this (T371582). If we want to confirm with certainty, it might be a good idea to contact the admin who did the move (if this is needed, I believe @Trizek-WMF would be better positioned to do that than I would). Hope this helps!

Special:MovePage being the only public way of doing this, I believe it was used. I asked for confirmation.

Previous cases of pages being inaccessible where after moves I performed using Special:MovePage.

ppelberg assigned this task to DLynch.Sep 30 2024, 5:26 PM

Following some quick investigation: Flow is certainly trying to handle this when Special:MovePage is used. It hooks onTitleMoveStarting to prepare moving the page to the new name, and onPageMoveCompleting to actually commit that change. This is trying to do the same thing as the FlowFixInconsistentBoards.php script is, and that clearly works. As such, I suspect it's something to do with the state of the page during the move.

A quick check of the file's history leaves me somewhat suspicious of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Flow/ /607398 which switched the onTitleMoveCompleting hook to onPageMoveCompleting and added $newTitle = Title::newFromLinkTarget( $newTitle ); leaving us with:

	public function onPageMoveCompleting(
		$oldTitle,
		$newTitle,
		$user,
		$pageid,
		$redirid,
		$reason,
		$revisionRecord
	) {
		$newTitle = Title::newFromLinkTarget( $newTitle );
		if ( $newTitle->getContentModel() === CONTENT_MODEL_FLOW_BOARD ) {
			Container::get( 'board_mover' )->commit();
		}
	}

The conversion is needed because a LinkTarget doesn't know about content models. However, I'm fairly sure that this switch means that when that $newTitle->getContentModel() happens it'll try to fetch the contentmodel from the DB... while we're in the middle of doing the move, and explicitly before the call to endAtomic in MovePage that'll actually commit the changes. (Alternately, it would be available, but it doesn't know to fetch from the primary DB in this situation.) I'm insufficiently familiar with mw-core's atomic database commits to know whether this would be causing other problems via it not ending/canceling that section -- I presume either it's not, or this isn't really the issue.

I assume that the reason for checking the content model in that hook is just to make sure there's an in-progress move of a page happening before commit is called. If I'm right about this, we could probably just switch it out for a check on the BoardMover to see whether it has an open transaction.

Infopetal subscribed.Oct 7 2024, 5:19 PM

Having gotten Flow running locally and stepped through this more closely, that LinkTarget conversion looks like a red herring -- it's passing in a Title, and newFromLinkTarget has explicit code to notice that and give back the Title rather than trying to refetch anything.

Also, locally I can't reproduce the error -- which admittedly could be because my local setup is insufficiently like the production setup with db replication.

I'd speculate that this might be a cache invalidation issue. Martin unfortunately only checked the database and tried purging the cache in T371582#10034787 after he’d already run the fixer maintenance script, but a plausible explanation of what’s going on is that the database is being properly updated when the page move happens and then the cache is occasionally either not being properly cleared or is being immediately filled with stale data.

The move process does do things that look like they should be putting the new data into the cache, so I’d suspect the latter. If so the mechanism could perhaps be a request hitting the old page at just the right moment such that the page move hasn't finished replicating out from the primary, so the cache is refilled with the data from the secondary database. I don't have a great grasp of Flow's caching layer, so I don't know how plausible this is as an explanation.

A possible approach if this theory is correct would just be to catch the error in Flow's ViewAction and purge that workflow from the cache when we see it. The drawback of this would be that if it's some other issue, we'd just be increasing database load slightly when that broken board is visited.

ppelberg moved this task from Ready to Be Worked On to Doing on the Editing-team (Kanban Board) board.Oct 8 2024, 3:59 AM

From a meeting: we're not going to do anything to proactively fix this right now. Instead we're going to just run the fixup maintenance script and purge the cache after our migration script, since we know that fixed the previous broken board. Then if we discover a broken board after a later manual move, we might dive in to investigate it and see whether we can learn more.

In T371769#10215115, @DLynch wrote:

From a meeting: we're not going to do anything to proactively fix this right now. Instead we're going to just run the fixup maintenance script and purge the cache after our migration script, since we know that fixed the previous broken board.

Note for our future-selves: "run the fixup maintenance script" is documented as requirement "3." in T371738:

Once script executes steps "1." and "2.", run the fix inconsistent boards maintenance script and purge the workflow cache

ppelberg added a parent task: T377050: Phase #1: Archive all active Flow boards.Oct 12 2024, 12:03 AM

ppelberg removed parent tasks: T377050: Phase #1: Archive all active Flow boards, T370722: Set Flow and LQT sunsetting timeline and sequence.

TheDJ mentioned this in T377360: Run fixInconsistentBoards on Wikidata.Oct 17 2024, 1:28 PM

In T371769#10209219, @DLynch wrote:

I'd speculate that this might be a cache invalidation issue. Martin unfortunately only checked the database and tried purging the cache in T371582#10034787 after he’d already run the fixer maintenance script, but a plausible explanation of what’s going on is that the database is being properly updated when the page move happens and then the cache is occasionally either not being properly cleared or is being immediately filled with stale data.

Let's check it before then! Growth gets tickets about broken boards from time to time. Today, another one (T377360) came, so I checked this theory and posted results as T377360#10239075. Unfortunately, it seems this theory doesn't appear to be correct – in this case, the DB appears to be not updated.

In T377360#10239309, @DLynch wrote:

In T377360#10239276, @Pppery wrote:

It's worth pointing out that the code that threw an exception here was specific to the Beta feature disabling, whereas the previous case was a manual move.

Yeah, this seems to be the exact opposite of the previous issue: it's a failed move that did update the workflow page title, as opposed to a successful move that didn't update the workflow page title.