Clean up the EditAttemptStep schema and its implementations
Open, Medium, Public

Description

The EditAttemptStep schema (previously the Edit schema) has been around since 2013 and provided data for important projects like the various rollouts of the visual editor.

However, over that time it has also accumulated many bugs, oddities, and unanswered strategic questions. This task tracks the resolution of those issues, and will be complete when (eons hence) the schema's stakeholders agree on its purpose and scope, the schema has been modified to implement that purpose and scope, and all the necessary implementations conform to that schema.

Use cases

  • Edit completion rate
  • Time to loaded and time to interactive
  • Overall edit duration
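
As a rough illustration of the first use case, edit completion rate can be read as the share of editing sessions with an init event that also reach a saveSuccess event. The sketch below assumes events arrive as plain dictionaries with `editing_session_id` and `action` keys; those key names and the sample data are assumptions for illustration, not the schema's exact field names.

```python
from collections import defaultdict


def edit_completion_rate(events):
    """Share of sessions with an init event that also have a saveSuccess event.

    `events` is an iterable of dicts with (assumed) keys
    `editing_session_id` and `action`. Returns None if no session started.
    """
    actions_by_session = defaultdict(set)
    for event in events:
        actions_by_session[event["editing_session_id"]].add(event["action"])

    # Only sessions that logged an init count as started attempts.
    started = [actions for actions in actions_by_session.values() if "init" in actions]
    if not started:
        return None
    completed = sum(1 for actions in started if "saveSuccess" in actions)
    return completed / len(started)


# Hypothetical sample: session "a" saves, session "b" is abandoned.
sample_events = [
    {"editing_session_id": "a", "action": "init"},
    {"editing_session_id": "a", "action": "ready"},
    {"editing_session_id": "a", "action": "saveSuccess"},
    {"editing_session_id": "b", "action": "init"},
    {"editing_session_id": "b", "action": "abort"},
]
print(edit_completion_rate(sample_events))
```

Note that this inherits every session-identification problem described below: if one logical attempt spans two session IDs, it is counted as one abandoned and one completed session.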

Scope

Interfaces currently logging to the schema:

  • visual editor (phone and desktop)
  • 2010 wikitext (phone and desktop)
  • 2017 wikitext
  • ContentTranslation (but without a separate value of editor_interface to distinguish it from desktop VE)

Should the app editors start using this schema? What about the Flow editor? What about the Wikidata description tool on Android?

  • Having a common schema increases the probability that you can get comparable data across all these interfaces (because it forces teams to collaborate), but it doesn't ensure it.
  • We should only incur the collaboration overhead if the benefits of more comparability are worth it—there's not much point in comparing, say, edit completion rate of Wikidata description editing with general page editing, because their contexts are so very different.

Session identification

  • The schema defines editingSessionId as "a string of 32 alphanumeric characters, unique to the current page view session; used for grouping events".
  • Our current implementation of editing sessions is tightly coupled to a page view. However, this doesn't map very well to what we think of as a single edit session: on desktop, switching between the visual editor and the wikitext editor while retaining changes causes a new page view, while on MobileFrontend, aborting an edit using the back button and then reopening the editor (which doesn't preserve your changes) all happens in one page view.
  • We don't use the core EventLogging code for client-side session token generation and sampling.
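
To make the MobileFrontend case above concrete, here is a minimal sketch of one possible way to recover separate attempts from a single page view: start a new attempt at every init event. It assumes the events are already in time order and carry an `action` key; the function name and sample data are hypothetical. It does not address the desktop case, where one attempt is spread across two page views.

```python
def split_into_attempts(events):
    """Split one page view's time-ordered events into attempts.

    A new attempt starts at each init event. Events seen before the
    first init (if any) are kept together in the first group.
    """
    attempts = []
    for event in events:
        if event["action"] == "init" or not attempts:
            attempts.append([])
        attempts[-1].append(event)
    return attempts


# Hypothetical page view: the user aborts, reopens the editor, then saves.
page_view = [
    {"action": "init"},
    {"action": "ready"},
    {"action": "abort"},
    {"action": "init"},
    {"action": "ready"},
    {"action": "saveSuccess"},
]
attempts = split_into_attempts(page_view)
print(len(attempts))
```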

Timings

  • There's no reason we should have a separate timing field for each event type when we can have a single one whose meaning varies by event type (T207803#4790039)
  • init_timing is currently not logged, but the information described in the schema ("timing information about action=init – time in milliseconds since the page was loaded") does not seem useful.
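
The single-field idea can be sketched as a small transformation over existing events. This assumes the per-action fields follow the `<action>_timing` naming pattern seen in init_timing; `ready_timing` and the output field name `timing` are assumptions for illustration only.

```python
def collapse_timing(event):
    """Return a copy of `event` with per-action timing fields folded into `timing`.

    The value is taken from the field matching the event's own action
    (for example `ready_timing` for action "ready"); it is None if that
    field is absent.
    """
    collapsed = {key: value for key, value in event.items() if not key.endswith("_timing")}
    collapsed["timing"] = event.get(f"{event['action']}_timing")
    return collapsed


# Hypothetical event with one populated and one empty per-action field.
legacy_event = {"action": "ready", "ready_timing": 1200, "init_timing": None}
print(collapse_timing(legacy_event))
```

With this shape, the meaning of `timing` has to be documented per action, but consumers no longer need to pick the right column for each event type.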

Other issues

  • The new ability to switch back-and-forth between the visual editor and wikitext invalidates some key assumptions (for example, we probably want to update action.init.mechanism)
  • How should we account for "micro-editing experiences" like Flow? Should they be included in this schema at all?
  • Even with T124676 resolved, the table is still quite large. Consider whether to drop mostly unused fields like page.title or normalize the schema (T123958)
  • Do our action.saveFailure.type values cover all the options?
    • For example, T197499 deals with a save failure because the wiki is in read-only mode, which isn't covered.
  • The switch* values of abort_type are probably unnecessary now because we started logging switches as VisualEditorFeatureUse events (T221191#5290393), and in any case it doesn't seem right to consider switches as aborts because logically they are just one intermediate step in a single edit attempt.
  • We have started discarding ready and loaded events that occur after a switch (T220697), but it's not clear if we're doing that everywhere
  • We need a standard way to deal with multi-interface sessions in analysis—which interface, if any, do we attribute them to?
  • The fact that the 2010 wikitext editor logs saveSuccess and init events on the server side, unlike every other event in the schema, creates significant inconsistencies (T214132)
  • Should we log the user name rather than the user ID? On one hand, the user ID is immutable; on the other hand, the user name is the main global user identifier and easier for humans to use.
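
On the multi-interface question, one possible convention (not a decision made in this task) is to attribute a session to an interface only when every event agrees, and to label everything else explicitly. The sketch assumes each event carries an `editor_interface` key; the label "multiple" and the sample values are illustrative.

```python
def attribute_interface(events):
    """Attribute a session to one interface, or label it as mixed.

    Returns the single editor_interface value if all events share it,
    "multiple" if they differ, and None for an empty session.
    """
    interfaces = {event["editor_interface"] for event in events}
    if not interfaces:
        return None
    if len(interfaces) == 1:
        return next(iter(interfaces))
    return "multiple"


# Hypothetical session where the user switched editors part-way through.
switched_session = [
    {"editor_interface": "visualeditor"},
    {"editor_interface": "wikitext"},
]
print(attribute_interface(switched_session))
```

Keeping mixed sessions in their own bucket avoids silently crediting them to whichever interface happened to log first or last.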

Data tidiness

  • We should separate this into two tables: EditAttempt (containing data that applies to all steps in an attempt, such as platform, user agent, and user name; this won't include editor_interface because that can differ within a single edit attempt because of switching) and EditAttemptStep (not containing that attempt-wide data).
  • We should probably merge VisualEditorFeatureUse into EditAttemptStep with a featureUse action. The observational unit is the same, and it's much easier to subset data from one table than to union data from two tables.
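
A minimal sketch of the two-table split, assuming flat events keyed by an `editing_session_id` field and attempt-wide fields named `platform`, `user_agent`, and `user_name` (all of these names are assumptions based on the description above):

```python
# Fields assumed to be constant across every step of one attempt.
ATTEMPT_FIELDS = ("platform", "user_agent", "user_name")


def split_event(event):
    """Split a flat event into (attempt_row, step_row).

    Both rows keep editing_session_id so they can be joined again.
    editor_interface stays on the step row because it can change
    within one attempt when the user switches editors.
    """
    session_id = event["editing_session_id"]
    attempt_row = {"editing_session_id": session_id}
    attempt_row.update({field: event[field] for field in ATTEMPT_FIELDS if field in event})
    step_row = {key: value for key, value in event.items() if key not in ATTEMPT_FIELDS}
    return attempt_row, step_row


# Hypothetical flat event as it might be logged today.
flat_event = {
    "editing_session_id": "abc",
    "action": "ready",
    "editor_interface": "visualeditor",
    "platform": "desktop",
    "user_agent": "ExampleBrowser/1.0",
    "user_name": "ExampleUser",
}
attempt_row, step_row = split_event(flat_event)
print(attempt_row)
print(step_row)
```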

See also

  • @Halfak's 2016 proposal for splitting this into five separate schemas:
    • EditingSession (one per page edit session)
    • EditingStage (one per editing stage)
    • EditingAbort (one per aborted edit)
    • EditingSaveFailure (one per save failure)
    • PageContentSaveComplete (note that this schema already exists)

Event Timeline

Deskana changed the task status from Open to Stalled. Sep 12 2017, 7:41 PM
Deskana subscribed.

We need to have a long hard think about analytics and metrics at some point. There's not much point doing that before there is a data analyst assigned to Contributors.

:(

But yeah, I know exactly what you mean.

In T118063#3602513, @Neil_P._Quinn_WMF wrote:

We need to have a long hard think about analytics and metrics at some point. There's not much point doing that before there is a data analyst assigned to Contributors.

:(

Hey, if you want to dive into this, I'd love to do it with you! But I assumed you were busy with New Editors. :-)

In T118063#3601883, @Deskana wrote:
Hey, if you want to dive into this, I'd love to do it with you! But I assumed you were busy with New Editors. :-)

No, your instincts are totally right; NEX is taking up about 75% of my time right now, and the rest is mostly going to high-level stuff like movement metrics (monthly active editors, new user retention, etc.) and the editor survey. It just makes me sad that I have to neglect all this important stuff. Thanks for being understanding :)

nshahquinn-wmf renamed this task from Revise the Edit schema and its use to Overhaul and vet the Edit event log. Aug 21 2018, 6:11 PM
nshahquinn-wmf added a project: Epic.
nshahquinn-wmf updated the task description.
nshahquinn-wmf removed the point value for this task.
nshahquinn-wmf renamed this task from Overhaul and vet the Edit event log to Reconsider the schema of the Edit event log. Aug 27 2018, 8:30 PM
nshahquinn-wmf updated the task description.
nshahquinn-wmf updated the task description.

@Deskana I've split apart the two tasks about the Edit logs. T202437 is now about actual data quality problems, where we're not logging the data we expect. This task is about a broader reconsideration of the schema; there are important issues here, but they're less urgent than the data problems in the other task and don't block fixing them.

To add detail re T118063#4546661: @phuedx looked into that not long ago and I understand from him that mw.user.sessionId() may not persist through the entire browser session, but rather only for pageviews and other actions in a particular tab (and perhaps other tabs opened from that tab). I.e. the documentation is at least misleading in that regard ("This ID is ephemeral for everyone, staying in their browser only until they close their browsing session"). A more precise description of the token's behavior would be great.

Ah, yeah, I was looking into this as well. The sessionId is stored in sessionStorage, which according to MDN behaves as follows:

Data stored in sessionStorage gets cleared when the page session ends. A page session lasts for as long as the browser is open and survives over page reloads and restores. Opening a page in a new tab or window will cause a new session to be initiated with the value of the top-level browsing context, which differs from how session cookies work.

I tested a bit, and this is what happens when you start on a wiki page and do the following in Firefox 62.0b20:

  • Follow a link to another wiki page: ID persists
  • Reload the page: ID persists
  • Hard reload the page (cmd shift R on Mac): ID persists
  • Navigate to another wiki page by typing the address in the same tab: ID persists
  • Navigate to another website, then back to the wiki, in the same tab: ID persists
  • Quit and reopen the browser with session restoration enabled: ID persists
  • Open a link to another page on the same wiki in a new tab (cmd click on Mac): ID changes

Also, naturally, the session ID is specific to a single MediaWiki site, so navigating between sites in the same tab results in an ID change.

I don't know where exactly this documentation should go, but yeah, it would be good to keep this straight.

I can confirm that I see the same behaviour described in T118063#4547350 – it's always good when browsers behave as specified!

In T118063#4547350, @Neil_P._Quinn_WMF wrote:

I don't know where exactly this documentation should go, but yeah, it would be good to keep this straight.

My advice is to always start with the source (https://doc.wikimedia.org/mediawiki-core/master/js/source/mediawiki.user.html#mw-user-method-sessionId) as it's closest to the truth.

I can't find any reference to "sessionid" or "page token" on mediawikiwiki, so I'd also recommend that we create a high-level documentation page there that covers both mw.user.sessionId and the pageview token generated by EventLogging.

Sounds like a good idea! In the meantime, I have submitted a patch to at least add a caveat to the existing documentation.

By the way, we have some data on how often links are being opened in a new tab (or window), i.e. how frequently a new mw.user.sessionId() is generated in the course of a browser session (in the usual sense that aligns with session cookie storage).

It looks like tabbed browsing is not very popular, with around 90%[1] of clicks on internal links on desktop opening them in the same tab (probably even a bit more than 90%[2]). Presumably this rate is even higher on mobile web, but I don't know whether we have data there too.

[1]
SELECT event_action, COUNT(*)
FROM log.Popups_15906495
WHERE wiki IN ('huwiki', 'itwiki', 'ruwiki')
AND event_isAnon = 1
AND event_popupEnabled = 0
AND LEFT(timestamp, 8) >= '20160925'
AND LEFT(timestamp, 8) < '20161030'
AND event_action LIKE 'opened%'
GROUP BY event_action;

+----------------------+----------+
| event_action         | COUNT(*) |
+----------------------+----------+
| opened in new tab    |   170203 |
| opened in new window |      406 |
| opened in same tab   |  1496078 |
+----------------------+----------+
3 rows in set (10 min 1.30 sec)
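
For reference, the "around 90%" figure follows directly from the three counts returned by the query. A small sketch of the arithmetic, using only those counts:

```python
# Counts copied from the query result above.
counts = {
    "opened in new tab": 170203,
    "opened in new window": 406,
    "opened in same tab": 1496078,
}

total = sum(counts.values())
shares = {action: 100 * n / total for action, n in counts.items()}

for action, share in shares.items():
    print(f"{action}: {share:.2f}%")
```

The same-tab share comes out just under 90%, with nearly all of the remainder being new-tab opens.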

[2] This comes from an old version of the Popups schema, which had several bugs that we fixed afterwards. In particular, T175918 meant that data from pageviews that were the first in a (sessionId-based) session are over-represented in this data. Assuming that users are more likely to open links in a new tab during the first pageview than on later pageviews, that would mean that the true percentage of same-tab clicks is even higher than 90%. (Also it was limited to these three wikis and modern - sendBeacon-capable - browsers.)

PS: My apologies regarding the abuse of this ticket for this topic - if someone wants to open a new one focusing on sessionId and document it all clearly there, please do.

nshahquinn-wmf renamed this task from Reconsider the schema of the Edit event log to Clean up the EditAttemptStep schema and its implementations. Nov 30 2018, 11:51 PM
nshahquinn-wmf updated the task description.
nshahquinn-wmf added a subscriber: phuedx.
nshahquinn-wmf removed a subscriber: phuedx.

Sorry @phuedx, I don't know why you keep getting resubscribed.

Oh, I see, it's because his username was in the task description.

We had a meeting about this earlier this week (notes). Generally, people felt that this wasn't urgent but would still be good to do relatively soon (perhaps Jan-Mar 2020) while we have a lot of accumulated experience with the data stream.

I believe there will be more discussions soon, but it's already clear that I should write down the plan that has formed in my mind. That's T234535.

nshahquinn-wmf changed the task status from Stalled to Open. Oct 3 2019, 3:27 PM