Clean up the EditAttemptStep schema and its implementations
Open, Medium, Public

Assigned To

None

Authored By

	Krenair
	Nov 7 2015, 2:23 AM

Description

The EditAttemptStep schema (previously the Edit schema) has been around since 2013 and has provided data for important projects like the various rollouts of the visual editor.

However, over that time it has also accumulated many bugs, oddities, and unanswered strategic questions. This task tracks the resolution of those issues, and will be complete when (eons hence) the schema's stakeholders agree on its purpose and scope, the schema has been modified to implement that purpose and scope, and when all the necessary implementations conform to that schema.

Use cases

Edit completion rate
Time to loaded and time to interactive
Overall edit duration
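
As a rough illustration of the first use case, edit completion rate can be sketched as the share of edit attempts with an init event that also reach a saveSuccess event. The action names come from this task; the dictionary keys (editing_session_id, action) and the sample events are assumptions made for the example, not the actual table layout.

```python
from collections import defaultdict


def edit_completion_rate(events):
    """Share of edit attempts (sessions with an init event) that reach saveSuccess."""
    actions_by_session = defaultdict(set)
    for event in events:
        actions_by_session[event["editing_session_id"]].add(event["action"])

    # Only sessions that logged an init count as started attempts.
    started = [actions for actions in actions_by_session.values() if "init" in actions]
    if not started:
        return None
    completed = sum(1 for actions in started if "saveSuccess" in actions)
    return completed / len(started)


# Hypothetical events: session "a" saves, session "b" aborts.
sample_events = [
    {"editing_session_id": "a", "action": "init"},
    {"editing_session_id": "a", "action": "ready"},
    {"editing_session_id": "a", "action": "saveSuccess"},
    {"editing_session_id": "b", "action": "init"},
    {"editing_session_id": "b", "action": "abort"},
]
rate = edit_completion_rate(sample_events)
```

Any real calculation would also have to decide how to treat sampling and sessions that span several interfaces, which this sketch ignores.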

Scope

Interfaces currently logging to the schema:

visual editor (phone and desktop)
2010 wikitext (phone and desktop)
2017 wikitext
ContentTranslation (but without a separate value of editor_interface to distinguish it from desktop VE)

Should the app editors start using this schema? What about the Flow editor? What about the Wikidata description tool on Android?

Having a common schema increases the probability that you can get comparable data across all these interfaces (because it forces teams to collaborate), but it doesn't ensure it.
We should only incur the collaboration overhead if the benefits of more comparability are worth it—there's not much point in comparing, say, edit completion rate of Wikidata description editing with general page editing, because their contexts are so very different.

Session identification

The schema defines editingSessionId as "a string of 32 alphanumeric characters, unique to the current page view session; used for grouping events".
- mw.user provides a number of different methods for generating session IDs.
- MobileFrontend uses sessionId().
- The visual editor uses generateRandomSessionId().
- The 2010 wikitext editor uses MWCryptRand::generateHex(32).
Our current implementation of editing sessions is tightly coupled to a page view. However, this doesn't map very well to what we think of as a single edit session: on desktop, switching between the visual editor and the wikitext editor while retaining changes causes a new page view, while on MobileFrontend, aborting an edit using the back button and then reopening the editor (which doesn't preserve your changes) all happens in one page view.
We don't use the core EventLogging code for client-side session token generation and sampling.
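
To make the "32 alphanumeric characters" definition concrete, here is a minimal sketch of generating and validating such an ID. It uses Python's secrets module as a stand-in and is not the code behind any of the MediaWiki methods listed above; the function names are hypothetical.

```python
import re
import secrets

SESSION_ID_LENGTH = 32


def generate_editing_session_id():
    """Return 32 random hexadecimal characters (16 random bytes)."""
    return secrets.token_hex(SESSION_ID_LENGTH // 2)


def is_valid_editing_session_id(value):
    """Check the schema's stated shape: exactly 32 alphanumeric characters."""
    return re.fullmatch(r"[0-9A-Za-z]{32}", value) is not None


session_id = generate_editing_session_id()
```

A shared check like this would at least catch implementations that emit IDs of a different length or alphabet, although it says nothing about how long an ID persists.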

Timings

There's no reason we should have a separate timing field for each event type when we can have a single one whose meaning varies by event type (T207803#4790039)
init_timing is currently not logged, but the information described in the schema ("timing information about action=init – time in milliseconds since the page was loaded") does not seem useful.
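
A minimal sketch of the single-timing-field idea follows. It assumes that the per-action fields are named after the action in snake case with a _timing suffix (init_timing and save_success_timing appear in this task; the general pattern and the helper names are assumptions).

```python
import re


def per_action_timing_field(action):
    """Map an action such as 'saveSuccess' to the assumed field 'save_success_timing'."""
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", action).lower()
    return f"{snake}_timing"


def unify_timing(event):
    """Return a copy of the event with one 'timing' field instead of per-action fields."""
    field = per_action_timing_field(event["action"])
    unified = {key: value for key, value in event.items() if not key.endswith("_timing")}
    # The meaning of 'timing' now depends on the event's action.
    unified["timing"] = event.get(field)
    return unified


# Hypothetical event using the assumed per-action field name.
old_event = {"editing_session_id": "a", "action": "saveSuccess", "save_success_timing": 1200}
new_event = unify_timing(old_event)
```

Events that never logged a timing simply end up with timing set to None, so the gap stays visible in analysis.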

Other issues

The new ability to switch back-and-forth between the visual editor and wikitext invalidates some key assumptions (for example, we probably want to update action.init.mechanism)
How should we account for "micro-editing experiences" like Flow? Should they be included in this schema at all?
Even with T124676 resolved, the table is still quite large. Consider whether to drop mostly unused fields like page.title or normalize the schema (T123958)
Do our action.saveFailure.type values cover all the options?
- For example, T197499 deals with a save failure because the wiki is in read-only mode, which isn't covered.
The switch* values of abort_type are probably unnecessary now because we started logging switches as VisualEditorFeatureUse events (T221191#5290393), and in any case it doesn't seem right to consider switches as aborts because logically they are just one intermediate step in a single edit attempt.
We have started discarding ready and loaded events that occur after a switch (T220697), but it's not clear if we're doing that everywhere
We need a standard way to deal with multi-interface sessions in analysis—which interface, if any, do we attribute them to?
The fact that the 2010 wikitext editor logs saveSuccess and init events on the server side, unlike every other event in the schema, creates significant inconsistencies (T214132)
Should we log the user name rather than the user ID? On one hand, the user ID is immutable; on the other hand, the user name is the main global user identifier and easier for humans to use.
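
On the question of multi-interface sessions, one possible rule (an illustration only, not a decision recorded in this task) is to attribute a session to an interface only when every event in it shares that interface, and to label everything else "multiple". The dictionary keys and the interface values in the sample data are assumptions.

```python
def attribute_sessions(events):
    """Map each session ID to its single interface, or 'multiple' if it used several."""
    interfaces_by_session = {}
    for event in events:
        seen = interfaces_by_session.setdefault(event["editing_session_id"], set())
        seen.add(event["editor_interface"])
    return {
        session_id: next(iter(seen)) if len(seen) == 1 else "multiple"
        for session_id, seen in interfaces_by_session.items()
    }


# Hypothetical events: session "a" stays in one editor, session "b" switches.
sample_events = [
    {"editing_session_id": "a", "editor_interface": "visualeditor"},
    {"editing_session_id": "a", "editor_interface": "visualeditor"},
    {"editing_session_id": "b", "editor_interface": "visualeditor"},
    {"editing_session_id": "b", "editor_interface": "wikitext"},
]
attribution = attribute_sessions(sample_events)
```

Alternatives such as attributing to the first or the last interface are equally easy to express; the point is that the rule should be written down once and reused.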

Data tidiness

We should separate this into two tables: EditAttempt (containing data that applies to all steps in an attempt, such as platform, user agent, and user name; this won't include editor_interface because that can differ within a single edit attempt because of switching) and EditAttemptStep (not containing that attempt-wide data).
We should probably merge VisualEditorFeatureUse into EditAttemptStep with a featureUse action. The observational unit is the same, and it's much easier to subset data from one table than to union data from two tables.
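
The proposed two-table split can be sketched as follows. The attempt-wide fields are the examples named above (platform, user agent, user name); the exact key names, the use of the session ID as the join key, and the sample events are assumptions.

```python
ATTEMPT_FIELDS = ("platform", "user_agent", "user_name")


def split_events(events):
    """Split flat events into attempt-wide rows and per-step rows."""
    attempts = {}
    steps = []
    for event in events:
        session_id = event["editing_session_id"]
        # Attempt-wide data is stored once per session, taken from its first event.
        attempts.setdefault(session_id, {f: event.get(f) for f in ATTEMPT_FIELDS})
        # Steps keep everything else, including editor_interface.
        steps.append({k: v for k, v in event.items() if k not in ATTEMPT_FIELDS})
    return attempts, steps


# Hypothetical flat events for a single attempt.
flat_events = [
    {"editing_session_id": "a", "platform": "desktop", "user_agent": "UA", "user_name": "Example",
     "editor_interface": "visualeditor", "action": "init"},
    {"editing_session_id": "a", "platform": "desktop", "user_agent": "UA", "user_name": "Example",
     "editor_interface": "wikitext", "action": "saveSuccess"},
]
attempts, steps = split_events(flat_events)
```

Joining the two back together on the session ID recovers the original flat rows whenever the attempt-wide fields really are constant within an attempt.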

Related Objects

Status	Assigned	Task
Open	None	T118063 Clean up the EditAttemptStep schema and its implementations
Resolved	nshahquinn-wmf	T124845 Sample edit events in MobileFrontend at 6.25%
Resolved	Jdforrester-WMF	T125598 Sample edit events in desktop visual editor at 6.25%
Declined	None	T123958 Consider scrapping Schema:PageContentSaveComplete and Schema:NewEditorEdit, given we have Schema:Edit
Resolved	Jdforrester-WMF	T116718 Invalid value None for integer property "action.abort.timing" (schema:Edit)
Resolved	Jdforrester-WMF	T116717 EventLogging validation error: 'user.id' is a required property (schema:Edit)
Resolved	None	T204779 Mobile editors do not log inits on direct URL initiation
Open	None	T205161 2010 wikitext editor does not log timings for most Edit events
Resolved	nshahquinn-wmf	T214931 Consider how the EditAttemptStep schema can apply to ContentTranslation
Resolved	MNeisler	T231024 Some save success events do not contain a new revision ID
Open	None	T227931 Oversampling changes edit completion rate
Declined	None	T234535 Draft a concrete plan for the EditAttemptStep cleanup
Resolved	DLynch	T270636 Mark EditAttemptStep events from VisualEditor reusers like Content Translation

Event Timeline

Jdforrester-WMF closed subtask T116718: Invalid value None for integer property "action.abort.timing" (schema:Edit) as Resolved.Aug 9 2016, 7:39 PM

nshahquinn-wmf moved this task from Next up to Backlog on the Contributors-Analysis board.Feb 7 2017, 10:22 PM

We need to have a long hard think about analytics and metrics at some point. There's not much point doing that before there is a data analyst assigned to Contributors.

• Deskana moved this task from TR0: Interrupt to Freezer on the VisualEditor board.Sep 12 2017, 7:42 PM

In T118063#3601883, @Deskana wrote:

We need to have a long hard think about analytics and metrics at some point. There's not much point doing that before there is a data analyst assigned to Contributors.

But yeah, I know exactly what you mean.

In T118063#3602513, @Neil_P._Quinn_WMF wrote:

In T118063#3601883, @Deskana wrote:

We need to have a long hard think about analytics and metrics at some point. There's not much point doing that before there is a data analyst assigned to Contributors.

:(

Hey, if you want to dive into this, I'd love to do it with you! But I assumed you were busy with New Editors. :-)

In T118063#3601883, @Deskana wrote:

Hey, if you want to dive into this, I'd love to do it with you! But I assumed you were busy with New Editors. :-)

No, your instincts are totally right; NEX is taking up about 75% of my time right now, and the rest is mostly going to high-level stuff like movement metrics (monthly active editors, new user retention, etc.) and the editor survey. It just makes me sad that I have to neglect all this important stuff. Thanks for being understanding :)

• Deskana moved this task from Backlog to Freezer on the Contributors-Analysis board.Oct 31 2017, 3:04 PM

• Phabricator_maintenance added a project: Product-Analytics.Apr 18 2018, 11:18 PM

nshahquinn-wmf moved this task from Triage to Backlog on the Product-Analytics board.Apr 19 2018, 8:21 PM

nshahquinn-wmf renamed this task from Revise the Edit schema and its use to Overhaul and vet the Edit event log.Aug 21 2018, 6:11 PM

nshahquinn-wmf added a project: Epic.

nshahquinn-wmf updated the task description. (Show Details)

nshahquinn-wmf removed the point value for this task.

nshahquinn-wmf updated the task description. (Show Details)Aug 21 2018, 6:20 PM

nshahquinn-wmf renamed this task from Overhaul and vet the Edit event log to Reconsider the schema of the Edit event log.Aug 27 2018, 8:30 PM

nshahquinn-wmf removed a subtask: T202437: Identify and fix data quality problems in the Edit event log.

nshahquinn-wmf updated the task description. (Show Details)

@Deskana I've split apart the two tasks about the Edit logs. T202437 is now about actual data quality problems, where we're not logging the data we expect. This task is about a broader reconsideration of the schema; there are important issues here, but they're less urgent than the data problems in the other task and don't block fixing them.

nshahquinn-wmf updated the task description. (Show Details)Aug 27 2018, 11:19 PM

nshahquinn-wmf updated the task description. (Show Details)Aug 29 2018, 8:25 PM

nshahquinn-wmf updated the task description. (Show Details)

nshahquinn-wmf updated the task description. (Show Details)Aug 30 2018, 7:54 PM

nshahquinn-wmf added subscribers: phuedx, • Tbayer.

• Tbayer updated the task description. (Show Details)Aug 30 2018, 11:05 PM

To add detail re T118063#4546661: @phuedx looked into that not long ago and I understand from him that mw.user.sessionId() may not persist through the entire browser session, but rather only for pageviews and other actions in a particular tab (and perhaps other tabs opened from that tab). I.e. the documentation is at least misleading in that regard ("This ID is ephemeral for everyone, staying in their browser only until they close their browsing session"). A more precise description of the token's behavior would be great.

In T118063#4547178, @Tbayer wrote:

To add detail re T118063#4546661: @phuedx looked into that not long ago and I understand from him that mw.user.sessionId() may not persist through the entire browser session, but rather only for pageviews and other actions in a particular tab (and perhaps other tabs opened from that tab). I.e. the documentation is at least misleading in that regard ("This ID is ephemeral for everyone, staying in their browser only until they close their browsing session"). A more precise description of the token's behavior would be great.

Ah, yeah, I was looking into this as well. The sessionId is stored in sessionStorage, which according to MDN behaves as follows:

Data stored in sessionStorage gets cleared when the page session ends. A page session lasts for as long as the browser is open and survives over page reloads and restores. Opening a page in a new tab or window will cause a new session to be initiated with the value of the top-level browsing context, which differs from how session cookies work.

I tested a bit, and this is what happens when you start on a wiki page and do the following in Firefox 62.0b20:

Follow a link to another wiki page: ID persists
Reload the page: ID persists
Hard reload the page (cmd shift R on Mac): ID persists
Navigate to another wiki page by typing the address in the same tab: ID persists
Navigate to another website, then back to the wiki, in the same tab: ID persists
Quit and reopen the browser with session restoration enabled: ID persists
Open a link to another page on the same wiki in a new tab (cmd click on Mac): ID changes

Also, naturally, the session ID is specific to a single MediaWiki site, so navigating between sites in the same tab results in an ID change.

I don't know where exactly this documentation should go, but yeah, it would be good to keep this straight.

I can confirm that I see the same behaviour described in T118063#4547350 – it's always good when browsers behave as specified!

In T118063#4547350, @Neil_P._Quinn_WMF wrote:

I don't know where exactly this documentation should go, but yeah, it would be good to keep this straight.

My advice is to always start with the source (https://doc.wikimedia.org/mediawiki-core/master/js/source/mediawiki.user.html#mw-user-method-sessionId) as it's closest to the truth.

I can't find any reference to "sessionid" or "page token" on mediawikiwiki, so I'd also recommend that we create a high-level documentation page there that covers both mw.user.sessionId and the pageview token generated by EventLogging.

JTannerWMF moved this task from Freezer to Needs Discussion/Analysis on the VisualEditor board.Sep 5 2018, 4:46 PM

nshahquinn-wmf mentioned this in T203620: Implementations of the Edit schema generate session IDs differently.Sep 5 2018, 11:33 PM

nshahquinn-wmf updated the task description. (Show Details)Sep 21 2018, 12:47 AM

nshahquinn-wmf added a subtask: T205161: 2010 wikitext editor does not log timings for most Edit events.Oct 2 2018, 1:38 AM

nshahquinn-wmf added a subtask: T205166: Mobile editors do not log timings for any Edit events.

nshahquinn-wmf updated the task description. (Show Details)Oct 2 2018, 1:40 AM

nshahquinn-wmf removed a subtask: T205166: Mobile editors do not log timings for any Edit events.Oct 2 2018, 6:15 PM

nshahquinn-wmf updated the task description. (Show Details)Oct 2 2018, 10:19 PM

phuedx unsubscribed.Oct 3 2018, 8:32 AM

nshahquinn-wmf added subscribers: Catrope, nettrom_WMF, kostajh.Oct 4 2018, 7:25 PM

In T118063#4548060, @phuedx wrote:

I can confirm that I see the same behaviour described in T118063#4547350 – it's always good when browsers behave as specified!

In T118063#4547350, @Neil_P._Quinn_WMF wrote:

I don't know where exactly this documentation should go, but yeah, it would be good to keep this straight.

My advice is to always start with the source (https://doc.wikimedia.org/mediawiki-core/master/js/source/mediawiki.user.html#mw-user-method-sessionId) as it's closest to the truth.

I can't find any reference to "sessionid" or "page token" on mediawikiwiki, so I'd also recommend that we create a high-level documentation page there that covers both mw.user.sessionId and the pageview token generated by EventLogging.

Sounds like a good idea! In the meantime, I have submitted a patch to at least add a caveat to the existing documentation.

• Tbayer removed a subscriber: phuedx.Oct 5 2018, 9:33 PM

By the way, we have some data on how often links are being opened in a new tab (or window), i.e. how frequently a new mw.user.sessionId() is generated in the course of a browser session (in the usual sense that aligns with session cookie storage).

It looks like tabbed browsing is not very popular, with around 90%[1] of clicks on internal links on desktop opening them in the same tab (probably even a bit more than 90%[2]). Presumably this rate is even higher on mobile web, but I don't know whether we have data there too.

[1]

SELECT event_action, COUNT(*)
FROM log.Popups_15906495
WHERE wiki IN ('huwiki', 'itwiki', 'ruwiki')
AND event_isAnon = 1
AND event_popupEnabled = 0
AND LEFT(timestamp, 8) >= '20160925'
AND LEFT(timestamp, 8) < '20161030'
AND event_action LIKE 'opened%'
GROUP BY event_action;

+----------------------+----------+
| event_action         | COUNT(*) |
+----------------------+----------+
| opened in new tab    |   170203 |
| opened in new window |      406 |
| opened in same tab   |  1496078 |
+----------------------+----------+
3 rows in set (10 min 1.30 sec)

[2] This comes from an old version of the Popups schema, which had several bugs that we fixed afterwards. In particular, T175918 meant that data from pageviews that were the first in a (sessionId-based) session are over-represented in this data. Assuming that users are more likely to open links in a new tab during the first pageview than on later pageviews, that would mean that the true percentage of same-tab clicks is even higher than 90%. (Also it was limited to these three wikis and modern - sendBeacon-capable - browsers.)
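
For reference, the "around 90%" figure can be reproduced from the counts in the query result above. This is plain arithmetic on those three numbers, with the same-tab share coming out at roughly 89.8%.

```python
# Counts copied from the query result in footnote [1].
counts = {
    "opened in new tab": 170203,
    "opened in new window": 406,
    "opened in same tab": 1496078,
}

total = sum(counts.values())
shares = {action: count / total for action, count in counts.items()}

for action, share in shares.items():
    print(f"{action}: {share:.1%}")
```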

PS: My apologies regarding the abuse of this ticket for this topic - if someone wants to open a new one focusing on sessionId and document it all clearly there, please do.

MMiller_WMF subscribed.Oct 16 2018, 7:27 PM

nshahquinn-wmf mentioned this in T202348: Resume refinement of edit events in Data Lake.Oct 18 2018, 7:45 PM

Sorry @phuedx, I don't know why you keep getting resubscribed.

Damn it!

Oh, I see, it's because his username was in the task description.

nshahquinn-wmf updated the task description. (Show Details)Dec 19 2018, 10:42 PM

nshahquinn-wmf added a subscriber: Halfak.

nshahquinn-wmf mentioned this in T207803: Update EventLogging code to facilitate move to EditAttemptStep schema.Dec 19 2018, 11:14 PM

nshahquinn-wmf updated the task description. (Show Details)Jan 29 2019, 6:06 PM

ppelberg subscribed.Feb 14 2019, 4:54 PM

• bmansurov mentioned this in T220413: session_token changes when opening new tabs.Apr 9 2019, 2:56 PM

nshahquinn-wmf moved this task from Needs Discussion/Analysis to Analysis on the VisualEditor board.May 8 2019, 8:45 PM

nshahquinn-wmf updated the task description. (Show Details)Jul 2 2019, 1:53 PM

nshahquinn-wmf updated the task description. (Show Details)Jul 3 2019, 9:19 PM

nshahquinn-wmf updated the task description. (Show Details)Jul 4 2019, 4:50 PM

MMiller_WMF unsubscribed.Jul 8 2019, 9:41 PM

nshahquinn-wmf updated the task description. (Show Details)Jul 12 2019, 5:09 PM

nshahquinn-wmf updated the task description. (Show Details)Jul 12 2019, 5:13 PM

DLynch subscribed.Jul 15 2019, 4:31 PM

• marcella subscribed.Jul 15 2019, 6:05 PM

nshahquinn-wmf updated the task description. (Show Details)Aug 22 2019, 2:12 PM

nshahquinn-wmf added a subtask: T231024: Some save success events do not contain a new revision ID.Aug 22 2019, 3:42 PM

nshahquinn-wmf added a subtask: T227931: Oversampling changes edit completion rate.Aug 22 2019, 4:54 PM

We had a meeting about this earlier this week (notes). Generally, people felt that this wasn't urgent but would still be good to do relatively soon (perhaps Jan-Mar 2020) while we have a lot of accumulated experience with the data stream.

I believe there will be more discussions soon, but it's already clear I should put the plan that's already formed in my mind down into writing. That's T234535.

nshahquinn-wmf changed the task status from Stalled to Open.Oct 3 2019, 3:27 PM

• mmodell edited projects, added Product-Analytics (Kanban); removed Product-Analytics.Oct 16 2019, 5:47 PM

Restricted Application edited projects, added Product-Analytics; removed Product-Analytics (Kanban). · View Herald TranscriptOct 16 2019, 5:47 PM

Mayakp.wiki subscribed.Oct 22 2019, 5:51 PM

MNeisler closed subtask T231024: Some save success events do not contain a new revision ID as Resolved.Oct 23 2019, 4:28 PM

ppelberg mentioned this in T244498: Replies v2.0: determine what additional instrumentation is needed.May 15 2020, 11:21 PM

nshahquinn-wmf closed subtask T214931: Consider how the EditAttemptStep schema can apply to ContentTranslation as Resolved.Dec 21 2020, 5:41 PM

kzimmerman closed subtask T234535: Draft a concrete plan for the EditAttemptStep cleanup as Declined.Mar 15 2021, 4:56 PM

ldelench_wmf closed subtask T123958: Consider scrapping Schema:PageContentSaveComplete and Schema:NewEditorEdit, given we have Schema:Edit as Declined.Apr 19 2021, 4:56 PM

MNeisler mentioned this in T290931: Log save_success_timing in DiscussionTools.Sep 27 2021, 3:30 PM

matmarex closed subtask T204779: Mobile editors do not log inits on direct URL initiation as Resolved.Dec 28 2021, 3:39 PM

ppelberg closed subtask T270636: Mark EditAttemptStep events from VisualEditor reusers like Content Translation as Resolved.Feb 14 2022, 5:44 PM