[A/B Test] Run an A/B test to evaluate impact of Usability Improvements
Closed, ResolvedPublic

Assigned To
Authored By
ppelberg
Feb 23 2022, 12:44 AM
Referenced Files
F37122362: time_to_engage_sc_test.png
Jun 28 2023, 7:54 PM
F37122360: time_to_engage_sc_control.png
Jun 28 2023, 7:54 PM
F37122348: time_to_engage_test.png
Jun 28 2023, 7:54 PM
F37122346: time_to_engage_control.png
Jun 28 2023, 7:54 PM
F37122258: pct_engagement_overall.png
Jun 28 2023, 6:10 PM
F37122256: ua_revert_rate_exp.png
Jun 28 2023, 6:10 PM
Restricted File
Jun 21 2023, 6:55 PM
Restricted File
Jun 21 2023, 6:55 PM

Description

Topic Containers (T269950), Clearer Talking Affordances (T255560 T267444), and Page Frame (T269963) are all interventions designed to help:

  1. Junior Contributors to quickly recognize talk pages as places to communicate with other volunteers and locate the tools available to do so
  2. Senior Contributors to be able to quickly assess which conversations on a given talk page are worth focusing on

This task involves the work of running an A/B test to evaluate the extent to which this set of Usability Improvements has been effective at impacting Junior and Senior Contributors in the ways described above.

Decision to be Made

This A/B test will help us make the following decision: Is the set of Usability Improvements on desktop fit to be made available to everyone, at all Wikimedia wikis, by default?

Hypotheses

| ID | Hypothesis | Metric(s) for Evaluation |
| KPI | Volunteers across experience levels and account states (logged in and out) will use talk pages in ways that align with their consented upon purpose (read: collaborate to improve Wikipedia) | 1) Proportion of published talk page edits, of all types, that are reverted, 2) Proportion of talk page edits that are started and successfully published and 3) Proportion of people who publish talk page edits and are subsequently blocked |
| Curiosity #1 | Junior Contributors will intuitively understand talk pages, across namespaces, as tools they can use to communicate with other volunteers because they will recognize other people using these pages to talk and identify the affordances they can use to do the same. | 1) Of all Junior Contributors that post on a talk page, the average time duration from when a Junior Contributor views a talk page to when they engage on the page in some way (for example, click an affordance to comment or start a conversation), 2) Of all Junior Contributors that visit a talk page, the proportion who engage with the page in some way |
| Curiosity #2 | A greater percentage of Junior Contributors who visit talk pages will publish at least one non-reverted comment or new discussion because they will be more clear and confident about what talk pages are used for and how to use them. | 1) The proportion of Junior Contributors who click an affordance on a talk page (defined as an init event being emitted) and successfully publish at least one comment or new discussion (defined as saveSuccess event being emitted) that is not reverted within 48 hours, 2) The proportion of Junior Contributors that visit a talk page and successfully publish at least one comment or new discussion (not reverted) |
| Curiosity #3 | Senior Contributors will be able to more quickly and easily decide where to focus their attention when arriving on a talk page because they will be able to see, at a glance, what new comments and/or discussions have been added since they last visited. | Of all Senior Contributors that post on a talk page, the average time from when a Senior Contributor visits a talk page to starting any new edit on a talk page. |
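To make the first Curiosity #2 metric concrete, here is a minimal sketch of how it could be computed. Only the `init` and `saveSuccess` event names and the 48-hour revert window come from the task description; the record layout (`user`, `events`, `saved_at`, `reverted_at`) and the sample rows are hypothetical and chosen purely for illustration, not taken from the actual instrumentation.

```python
from datetime import datetime, timedelta

# The 48-hour revert window comes from the metric definition above.
REVERT_WINDOW = timedelta(hours=48)


def publish_rate(sessions):
    """Share of users who emitted `init` and also had at least one
    `saveSuccess` that was not reverted within 48 hours.

    `sessions` is a list of dicts with hypothetical keys:
    user, events, saved_at, reverted_at.
    """
    started = set()
    published = set()
    for s in sessions:
        if "init" not in s["events"]:
            continue
        started.add(s["user"])
        if "saveSuccess" not in s["events"]:
            continue
        reverted_at = s.get("reverted_at")
        # Count the save unless it was reverted inside the window.
        if reverted_at is None or reverted_at - s["saved_at"] > REVERT_WINDOW:
            published.add(s["user"])
    return len(published) / len(started) if started else 0.0


# Made-up sample data, for illustration only.
t0 = datetime(2023, 5, 8, 12, 0)
sessions = [
    {"user": "a", "events": ["init", "saveSuccess"], "saved_at": t0},
    {"user": "b", "events": ["init"]},
    {"user": "c", "events": ["init", "saveSuccess"], "saved_at": t0,
     "reverted_at": t0 + timedelta(hours=2)},   # reverted within 48h
    {"user": "d", "events": ["init", "saveSuccess"], "saved_at": t0,
     "reverted_at": t0 + timedelta(days=3)},    # reverted after the window
]

rate = publish_rate(sessions)
```

In this toy data, users a and d count as having published, b never saved, and c was reverted inside the window.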

Guardrails

| ID | Name | Metric(s) for Evaluation |
| Guardrail #1 | Regressions in bounce rate | Bounce rate |

Decision Matrix

| ID | Scenario | Plan of Action |

Done

Related Objects

Event Timeline

ppelberg moved this task from Backlog to Triaged on the DiscussionTools board.
ppelberg moved this task from Untriaged to This Fiscal Year on the Editing-team board.
ppelberg updated the task description.

Per what @MNeisler and I talked about offline today, I've updated the task description's "Hypotheses" section to reflect the purpose of this A/B test: to evaluate the extent to which the suite of Usability Improvements is causing any regressions in how people understand and use talk pages.

Previously, we'd positioned this A/B test as an effort to prove the usefulness of these design changes.

The change to the scope of this A/B test is built on the Editing Team taking the results of the usability tests we ran [i][ii][iii][iv][v], and the way people who are already using the new design are reacting to it, to mean that these changes are in fact useful. The key question that remains is whether they have any unintended negative consequences, which this test is now scoped to investigate.


i. T307840: Run Test A (senior visits Junior's empty user talk) and synthesize feedback
ii. T307842: Run Test B (Senior visits Junior's empty user talk to share guidance) and synthesize findings
iii. T307843: Run Test C (Empty Article Talk) and synthesize findings
iv. T307845: Run Test D (a busy article talk page) and synthesize findings
v. T307846: Run Test E (Senior User Talk Page) and Synthesize Findings

MNeisler triaged this task as Medium priority.
MNeisler added a project: Product-Analytics.
MNeisler moved this task from Triage to Current Quarter on the Product-Analytics board.

Change 916903 had a related patch set uploaded (by DLynch; author: DLynch):

[operations/mediawiki-config@master] Enable DiscussionTools visual enhancements a/b test

https://gerrit.wikimedia.org/r/916903

Change 916903 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable DiscussionTools visual enhancements a/b test

https://gerrit.wikimedia.org/r/916903

Mentioned in SAL (#wikimedia-operations) [2023-05-08T20:08:19Z] <taavi@deploy1002> Started scap: Backport for [[gerrit:917160|Update a/b test code for visual enhancements a/b test (T333715)]], [[gerrit:916903|Enable DiscussionTools visual enhancements a/b test (T302358)]]

Mentioned in SAL (#wikimedia-operations) [2023-05-08T20:09:49Z] <taavi@deploy1002> kemayo and taavi: Backport for [[gerrit:917160|Update a/b test code for visual enhancements a/b test (T333715)]], [[gerrit:916903|Enable DiscussionTools visual enhancements a/b test (T302358)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-05-08T20:20:13Z] <taavi@deploy1002> Finished scap: Backport for [[gerrit:917160|Update a/b test code for visual enhancements a/b test (T333715)]], [[gerrit:916903|Enable DiscussionTools visual enhancements a/b test (T302358)]] (duration: 11m 54s)

The AB test has now been running for a little over 3 weeks (it was started on 8 May 2023). I ran a check to confirm that a sufficient number of events have been logged to complete the analysis including per wiki and per desktop skin analyses.

@ppelberg Based on the number of events logged to date (as summarized below), I can confirm we have sufficient data to begin the analysis and the AB test can be turned off.

AB test talk page edit attempts and distinct users by experiment group

| Experiment Group | Number of edit attempts | Number of users |
| control | 42334 | 4863 |
| test | 42056 | 4671 |

There are also sufficient events logged to complete a per-wiki analysis for all the wikis included in the AB test (azwiki, bnwiki, hiwiki, kowiki and thwiki are some of the smaller wikis in the test, but each has still had over 100 distinct users who completed at least one desktop talk page edit).

AB test talk page edit attempts by experiment group and desktop skin
Note: To obtain info on the editor's desktop skin, we currently need to join to the DesktopWebUIActionsTracking schema, which is sampled. As a result, there are fewer logged edit attempts to review when looking at AB test data by desktop skin type (as shown below). However, there's still enough data recorded to complete the analysis. Note: There was a patch submitted to add a skin field to EditAttemptStep but there's an issue causing only NULL events to be returned (this is being looked at in T337270).

| Experiment group | Skin | Number of edit attempts |
| control | vector | 13529 |
| control | vector-2022 | 8788 |
| test | vector | 14699 |
| test | vector-2022 | 11064 |
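The effect of joining to a sampled schema, as described in the note above, can be sketched as follows. This is an illustration only: the join key (`session_id`), the field names, and the rows are all hypothetical, and the real analysis queries may join on something else entirely. The point is just that an inner join against a sampled table drops every edit attempt whose session was not sampled.

```python
from collections import Counter

# Hypothetical, made-up rows standing in for edit-attempt events.
edit_attempts = [
    {"session_id": "s1", "group": "control"},
    {"session_id": "s2", "group": "test"},
    {"session_id": "s3", "group": "test"},
    {"session_id": "s4", "group": "control"},
]

# Hypothetical sampled skin events: only some sessions are present.
skin_events = [
    {"session_id": "s1", "skin": "vector"},
    {"session_id": "s3", "skin": "vector-2022"},
]


def attempts_by_group_and_skin(attempts, skins):
    """Inner-join attempts to skin events and count by (group, skin)."""
    skin_by_session = {row["session_id"]: row["skin"] for row in skins}
    counts = Counter()
    for attempt in attempts:
        skin = skin_by_session.get(attempt["session_id"])
        if skin is None:
            # Session was not sampled, so the attempt drops out of the join.
            continue
        counts[(attempt["group"], skin)] += 1
    return counts


counts = attempts_by_group_and_skin(edit_attempts, skin_events)
```

Here four edit attempts go in but only two survive the join, which is the same reason the by-skin totals above are smaller than the overall edit attempt counts.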

> Based on the number of events logged to date (as summarized below), I can confirm we have sufficient data to begin the analysis and the AB test can be turned off.
>
> There are also sufficient events logged to complete a per-wiki analysis for all the wikis included in the AB test (azwiki, bnwiki, hiwiki, kowiki and thwiki are some of the smaller wikis in the test, but each has still had over 100 distinct users who completed at least one desktop talk page edit).

Excellent and understood.

> AB test talk page edit attempts by experiment group and desktop skin
> Note: To obtain info on the editor's desktop skin, we currently need to join to the DesktopWebUIActionsTracking schema, which is sampled. As a result, there are fewer logged edit attempts to review when looking at AB test data by desktop skin type (as shown below). However, there's still enough data recorded to complete the analysis. Note: There was a patch submitted to add a skin field to EditAttemptStep but there's an issue causing only NULL events to be returned (this is being looked at in T337270).

Resulting question:

  • Would it be accurate for me to understand the above to mean: "While the issue T337270 describes does mean we need to join to the DesktopWebUIActionsTracking schema, and thus cope with the sampling implications that result, it does NOT impact our ability to split people's behavior based on the skin they were using to access the site."

Assuming what I described above is accurate, then let's do what you proposed, "...begin the analysis and the AB test can be turned off."

> Resulting question:
>
> Would it be accurate for me to understand the above to mean: "While the issue T337270 describes does mean we need to join to the DesktopWebUIActionsTracking schema, and thus cope with the sampling implications that result, it does NOT impact our ability to split people's behavior based on the skin they were using to access the site."

@ppelberg Yes that's accurate

@ppelberg
Here is a summary of the initial results and key findings from the AB test analysis for review. Please let me know if you have any questions.

Methodology
We reviewed AB test data recorded from 08 May 2023 through 28 May 2023 for this analysis. Data was limited to desktop talk page edits completed by logged-in users bucketed in the AB test.

KPI: Volunteers across experience levels will use talk pages in ways that align with their consented upon purpose

Revert Rate
Defined as: Proportion of published talk page edits, of all types, that are reverted within 48 hours

  • We observed an 11% decrease (-0.4 percentage points) in the revert rate of talk page edits by users shown the set of usability improvements across all participating Wikipedias and all editor experience levels.
  • There was a slightly larger decrease in revert rate for talk page edits by Junior Contributors (12.5% decrease) than for talk page edits by Senior Contributors (7.4% decrease).

ua_revert_rate_exp.png (559×963 px, 48 KB)

  • Different trends were observed when broken down by desktop skin type.
    • Vector: 6% increase in revert rate (3.2% → 3.4%)
    • Vector 2022: 30% decrease in revert rate (3.9% → 2.7%)
  • We also observed a decrease on each of the wikis except the Azerbaijani, Bangla, Dutch and Hindi Wikipedias. All of these increases were under 1.5 percentage points, except for Hindi, which had a significant increase in revert rate (2.2% → 26.6%). This seems likely to have been caused by another event that occurred on that wiki during the AB test, but further investigation is needed to confirm. Note: In T332946, hiwiki was identified as a wiki that might switch default desktop skins during the test. We need to confirm whether this occurred and, if so, on what date.
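Since the findings above mix relative changes ("6% increase") with percentage-point changes ("-0.4 percentage points"), a small worked example may help keep the two apart. This is a sketch using the rounded per-skin revert rates quoted above; because the inputs are already rounded, the outputs only approximate the figures in the analysis.

```python
def changes(control_rate, test_rate):
    """Return (percentage-point change, relative change in percent)."""
    pp_change = (test_rate - control_rate) * 100
    relative_change = (test_rate - control_rate) / control_rate * 100
    return pp_change, relative_change


# Rounded revert rates quoted in the per-skin breakdown above.
vector = changes(0.032, 0.034)       # 3.2% -> 3.4%
vector_2022 = changes(0.039, 0.027)  # 3.9% -> 2.7%

for name, (pp, rel) in [("vector", vector), ("vector-2022", vector_2022)]:
    print(f"{name}: {pp:+.1f} percentage points, {rel:+.1f}% relative")
```

A small percentage-point change can correspond to a large relative change when the baseline rate is low, which is why both numbers are reported.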

Edit Completion Rate
Defined as the proportion of talk page edits that are started and successfully published (not reverted within 48 hours).

  • People shown the set of usability improvements were slightly more likely to complete an edit that they started. Overall, there was a 3.3% increase (2 percentage points) in edit completion rate across all participating Wikipedias.

Overall Edit Completion Rate by Experiment Group

| Experiment Group | Number of Edit Attempts | Number of Edit Saves | Completion Rate |
| control | 11487 | 7049 | 61.5% |
| test | 13252 | 8410 | 63.5% |

  • We observed increases in edit completion rate on talk pages for both Junior and Senior contributors. Increases are shown below:
    • Junior Contributors: 1.6% increase (43.8%→44.5%)
    • Senior Contributors: 3.6% increase (66.8%→69.2%)
  • Similar increases were also observed for each desktop skin type and editing workflow (discussiontools and page).
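For readers who want a rough sense of scale for the overall completion rate difference, here is a naive two-proportion z-test on the counts in the table above. This is an assumption on my part and not the method used in the analysis: it treats every edit attempt as independent, which is unlikely to hold since one person can make many attempts, so the result should be read as a back-of-the-envelope check only.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Naive pooled two-proportion z-test (two-sided p-value)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Two-sided tail probability of a standard normal at |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a, p_b, z, p_value


# Edit saves and edit attempts from the table above (control, then test).
p_control, p_test, z, p_value = two_proportion_z(7049, 11487, 8410, 13252)

print(f"control={p_control:.3f} test={p_test:.3f} z={z:.2f} p={p_value:.4f}")
```

A model that accounts for repeated attempts per user and per wiki would be a better fit for this data than the independence assumption used here.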

Block Rate
Defined as the proportion of people who publish talk page edits and are subsequently blocked

  • No significant changes in block rate.
  • Overall, there were 11 total editors (0.6% of talk page editors) blocked in the control group after making a talk page edit vs 10 total users (0.5% of talk page editors) in the test group.
  • No significant changes observed for a particular wiki, editing interface, or desktop skin type.

Curiosity 1: Junior Contributors will intuitively understand talk pages, across namespaces, as tools they can use to communicate with other volunteers because they will recognize other people using these pages to talk and identify the affordances they can use to do the same.

The proportion of talk page views by Junior Contributors that includes at least one edit attempt

  • Overall, more talk page views by Junior Contributors in the test group included at least one attempt to engage with the page. We observed a 16.7% increase (1.4 percentage points) in talk page engagement by Junior Contributors.

pct_engagement_overall.png (560×949 px, 50 KB)

  • We observed increases for users of both desktop skin types, with a slightly higher increase observed for users of Vector 2022.
  • We observed increases across all wikis except for 5: Azerbaijani, Dutch, Hebrew, Persian, and Ukrainian Wikipedias.

Curiosity 2: A greater percentage of Junior Contributors who visit talk pages will publish at least one non-reverted comment or new discussion because they will be more clear and confident about what talk pages are used for and how to use them.

The proportion of Junior Contributors that start and successfully complete a talk page edit

  • There was no significant change in the overall proportion of Junior Contributors that started and completed an edit across all participating Wikipedias.
  • No significant changes by various breakdowns.

The proportion of talk page views by Junior Contributors that included at least one successfully completed talk page edit (not reverted)

  • Junior Contributors were slightly more likely to save an edit after visiting a talk page when shown the usability improvements. We observed a 19% increase (0.7 percentage points) in the proportion of talk page views by Junior Contributors that included a saved edit.
  • This was consistent for each desktop skin type.
    • Vector: 537 edits saved in the control group; 730 edits saved in the test group
    • Vector 2022: 623 edits saved in the control group; 643 edits saved in the test group

Remaining to-dos:

  • Investigate Hindi Wikipedia revert rate
  • Prepare full report for publishing

A couple updates:

Revert Rate at Hindi Wikipedia

As mentioned in T302358#8953722, we observed an unusually high desktop talk page edit revert rate (26.6%) at Hindi Wikipedia for the test group. I investigated this revert rate further to confirm whether this increase was due to a single outlier incident or indicative of disruption caused by the test.

Results
Data indicates that this revert rate was due to 30 reverted edits (out of 33 total talk page edits) by only two distinct users, made on May 19th and May 20th during the AB test. These edits were made using the Vector desktop skin. Outside of these dates, there was an average of 6 desktop talk page edits per day on Hindi Wikipedia, with only 0 or 1 edits reverted per day. We have also not seen any other spikes in daily revert rate over the duration of the test.
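The kind of check described above, looking at revert rate per day to spot an isolated spike, can be sketched like this. Everything here is illustrative: the record layout, the day labels, the per-day counts, and the 50% threshold are all made up and are not the values or query used in the investigation.

```python
from collections import defaultdict


def daily_revert_rates(edits):
    """Map each day to the share of that day's edits that were reverted."""
    totals = defaultdict(int)
    reverted = defaultdict(int)
    for edit in edits:
        totals[edit["day"]] += 1
        if edit["reverted"]:
            reverted[edit["day"]] += 1
    return {day: reverted[day] / totals[day] for day in totals}


def spike_days(rates, threshold=0.5):
    """Days whose revert rate exceeds an (arbitrary) threshold."""
    return sorted(day for day, rate in rates.items() if rate > threshold)


def make_edits(day, total, n_reverted):
    """Build made-up edit rows: the first n_reverted are marked reverted."""
    return [{"day": day, "reverted": i < n_reverted} for i in range(total)]


# Invented data: two quiet days around two days with a burst of reverts.
edits = (
    make_edits("day-1", 6, 1)
    + make_edits("day-2", 16, 15)
    + make_edits("day-3", 17, 15)
    + make_edits("day-4", 6, 0)
)

rates = daily_revert_rates(edits)
spikes = spike_days(rates)
```

Grouping the flagged days by user as a follow-up step would show whether a spike comes from a handful of accounts, as was the case here.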

Findings/Impact
Based on this, I think we can confirm that this 26.6% revert rate at Hindi Wikipedia was due to a single outlier event and likely not indicative of general disruption caused by the test. Additionally, there were no spikes identified at any of the other 14 participating wikis.

Even with this outlier event included, we observed an 11% decrease (-0.4 percentage points) in revert rate overall across all participating Wikipedias. If this outlier event at Hindi Wikipedia is excluded, we observed a 20% decrease (-0.7 percentage points) in the revert rate.

Time to Talk Page Engagement

As part of the analysis, I also reviewed the average time duration from when a Junior Contributor views a talk page to when they engage on the page in some way (for example, click an affordance to comment or start a conversation). Note: For the purpose of this analysis, I am defining engagement as any attempt to start either a reply or new topic on a talk page.

We observed similar distributions in time to engagement for both experiment groups, indicating no significant changes in the overall time to engagement. Overall, there was just a 2-second increase in the median time to engage, from 7 to 9 seconds.

time_to_engage_control.png (547×903 px, 56 KB)

time_to_engage_test.png (539×919 px, 57 KB)

The median time to start a new topic was 2 seconds for both groups, while the median time to add a comment using the reply tool increased by just one second from 21 to 22 seconds.
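As a rough illustration of the time-to-engagement measure described above, the sketch below takes the time from a talk page view to the first attempt to start a reply or new topic, then reports the median across sessions. The record layout (`view_ts`, `init_ts`, both in seconds) and the sample values are hypothetical; the real analysis works from logged events and may define sessions differently.

```python
from statistics import median


def seconds_to_first_engagement(view_ts, init_timestamps):
    """Seconds from page view to the first later engagement, or None."""
    later = [ts for ts in init_timestamps if ts >= view_ts]
    if not later:
        return None  # the page was viewed but never engaged with
    return min(later) - view_ts


def median_time_to_engage(sessions):
    """Median over sessions that had at least one engagement."""
    durations = []
    for session in sessions:
        duration = seconds_to_first_engagement(
            session["view_ts"], session["init_ts"]
        )
        if duration is not None:
            durations.append(duration)
    return median(durations)


# Made-up sessions, timestamps in seconds.
sessions = [
    {"view_ts": 0, "init_ts": [7, 30]},   # first engagement after 7s
    {"view_ts": 100, "init_ts": [102]},   # 2s
    {"view_ts": 50, "init_ts": []},       # no engagement, excluded
    {"view_ts": 10, "init_ts": [31]},     # 21s
]

result = median_time_to_engage(sessions)
```

Using the median rather than the mean keeps a few very long sessions (for example, a tab left open) from dominating the figure.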

We observed very similar trends when reviewing the duration to talk page engagement for Senior Contributors as well.

time_to_engage_sc_control.png (544×915 px, 56 KB)

time_to_engage_sc_test.png (547×840 px, 55 KB)

Note: I'll provide a link to the full report with more details once finalized

@ppelberg
Here is a link to the full AB test report which includes additional details on the queries used in the analysis and results.

This has been reviewed by legal and can be shared on the project page.

> @ppelberg
> Here is a link to the full AB test report which includes additional details on the queries used in the analysis and results.
>
> This has been reviewed by legal and can be shared on the project page.

Wonderful – thank you, @MNeisler.

A draft of these findings is now available on the project page [i] and ready for you to review. Please boldly edit the project page as you see fit.

In the meantime, I'm going to mark this task as resolved.


i. https://www.mediawiki.org/wiki/Talk_pages_project/Usability#Analysis_#2:_Impact