Implement A/B test bucketing
Closed, Resolved, Public

Description

This task is about implementing the changes necessary to ensure the "right people" are included in the Reply Tool A/B test in the "right way."

Bucketing criteria

People who meet all of the "Conditions" listed below ought to be enrolled in the A/B test, with a 50% chance of being bucketed into the control group and a 50% chance of being bucketed into the test group.
Conditions

  • Editing at the following Wikipedias (source: T267379):
    Wiki                        Code
    French Wikipedia            frwiki
    Spanish Wikipedia           eswiki
    Italian Wikipedia           itwiki
    Japanese Wikipedia          jawiki
    Persian Wikipedia           fawiki
    Polish Wikipedia            plwiki
    Hebrew Wikipedia            hewiki
    Dutch Wikipedia             nlwiki
    Hindi Wikipedia             hiwiki
    Korean Wikipedia            kowiki
    Vietnamese Wikipedia        viwiki
    Thai Wikipedia              thwiki
    Portuguese Wikipedia        ptwiki
    Bengali Wikipedia           bnwiki
    Egyptian Arabic Wikipedia   arzwiki
    Swahili Wikipedia           swwiki
    Chinese Wikipedia           zhwiki
    Ukrainian Wikipedia         ukwiki
    Indonesian Wikipedia        idwiki
    Amharic Wikipedia           amwiki
    Oromo Wikipedia             omwiki
    Afrikaans Wikipedia         afwiki
  • Have not used the Reply Tool before.
    • In this context, "not used" is being defined as people whose discussiontools-editmode preference is empty.
  • People who are logged in

Two additional notes:

  1. Bucketing ought to be done on a per-wiki basis.
  2. The software will need to remember that someone was included in the A/B test so they are not mistakenly removed from it.
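
The criteria and notes above can be sketched roughly as follows. This is a minimal illustration, not the DiscussionTools implementation: `assign_bucket`, its parameters, and the `stored_bucket` argument are hypothetical names introduced here, and only the wiki codes and the `discussiontools-editmode` preference come from the task description.

```python
import random
from typing import Callable, Optional

# Wiki codes copied from the task description's list.
PARTICIPATING_WIKIS = {
    "frwiki", "eswiki", "itwiki", "jawiki", "fawiki", "plwiki", "hewiki",
    "nlwiki", "hiwiki", "kowiki", "viwiki", "thwiki", "ptwiki", "bnwiki",
    "arzwiki", "swwiki", "zhwiki", "ukwiki", "idwiki", "amwiki", "omwiki",
    "afwiki",
}


def assign_bucket(
    wiki: str,
    logged_in: bool,
    editmode_pref: Optional[str],
    stored_bucket: Optional[str] = None,
    rng: Callable[[], float] = random.random,
) -> Optional[str]:
    """Return "test", "control", or None (not enrolled) for one user on one wiki.

    Hypothetical helper: editmode_pref stands in for the user's
    discussiontools-editmode preference on this wiki, and stored_bucket
    for whatever the software remembered from an earlier assignment.
    """
    # Note 2: once enrolled, a person keeps their bucket.
    if stored_bucket in ("test", "control"):
        return stored_bucket
    # Note 1: the decision is made per wiki, and only on participating wikis.
    if wiki not in PARTICIPATING_WIKIS:
        return None
    if not logged_in:
        return None
    # A non-empty preference means the person has used the Reply Tool before.
    if editmode_pref:
        return None
    # 50% chance of either group.
    return "test" if rng() < 0.5 else "control"
```

Checking `stored_bucket` first is what keeps someone enrolled after their `discussiontools-editmode` preference becomes non-empty; without it, using the Reply Tool once would make an enrolled person look ineligible on their next visit.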

Open questions

  • 1. Does this task need to be blocked on having defined the list of Wikipedias that will be participating in the test? See: T267379.
    • Moot point as T267379 has been resolved.
  • 2. What should happen when someone from the control group manually enables the Reply Tool in Special:Preferences? Should changing their preference cause them to be added to the test group? Should their usage of the Reply Tool, which they would have manually enabled, be considered part of the control group?
    • When someone from the control group manually enables the Reply Tool in Special:Preferences, they should remain in the control group. See: T268191#6746113.
  • 3. Can the bucketing be deployed to test.wikipedia.org to conduct the QA that will happen in T268193? Additional context: T268191#6746113.

Done

  • The "Bucketing criteria" above have been implemented
  • Verify on Beta that the code required to assign people to the test and control groups is working as expected.

Note: verification that people are being bucketed as expected will happen in T268193.

Event Timeline

Task description update
Today, @MNeisler and I talked about the A/B test's bucketing criteria [i], which are now reflected in the task description's "Bucketing criteria" section.


i. Bucketing criteria

  • People editing at the following Wikipedias: TBD; see T267379.
  • Have not used the Reply Tool before.
    • In this context, "used" is being defined as people who have initiated (read: reached init) the Reply Tool before.
  • People who are logged in

Task description update

Have not used the Reply Tool before.

  • In this context, "used" is being defined as people who have initiated (read: reached init) the Reply Tool before.

During yesterday's team meeting, @DLynch and @Esanders shared that it is not possible to use whether someone has caused an init event to be emitted as a proxy for whether they have used the Reply Tool before or not.

The reason: these init events are stored in such a way that it is difficult for the software to "look up" whether a given person has triggered said init event in a performant way.

Instead, David and Ed shared we could look at whether someone has a discussiontools-editmode preference set. If said preference is empty, we can infer a given account has not opened the tool on that wiki before. Note: this only applies to people who are logged in.

I've updated the task description to reflect the above.

Task description update

  • Changes to the "Done" section:
    • REMOVED @MNeisler's data check as she will verify test buckets are balanced in T268193.
    • ADDED a check to verify that the code required to assign people to the test and control groups is working as expected on the Beta cluster.

Meeting notes: 21-December meeting notes
During today's team standup, @DLynch raised the following points...

  1. We need to be sure the software remembers that someone was included in the A/B test so they are not mistakenly removed from it.
  2. @MNeisler: What should happen when someone from the control group manually enables the Reply Tool in Special:Preferences? Should changing their preference cause them to be added to the test group? Our instinct: "No." Should their usage of the Reply Tool, which they would have manually enabled, be considered part of the control group? Our instinct: "Yes."

"1." and "2." above have been added to the task description's "Bucketing criteria" and "Open questions" sections respectively.

A few limitations to bear in mind:

We're going to be remembering what bucket someone is in based on their cookies. If they use the reply tool and then clear their cookies, they're going to stop being in the test and so won't get re-bucketed.

If they use the reply tool and then log in on a different machine, they won't be in the a/b test on that machine, but will be on their original machine.

We could make it into a user-option instead, and thus stored in the db, but we'd want to clean that up after the test is done.
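
To make the tradeoff concrete, here is a small simulation of the two storage choices. It is only a sketch under stated assumptions: `CookieStore` and `UserOptionStore` are hypothetical classes invented for illustration, with a dictionary per device standing in for browser cookies and a single dictionary standing in for the database.

```python
from typing import Dict, Optional


class CookieStore:
    """Hypothetical: bucket memory kept per browser/device, like a cookie jar."""

    def __init__(self) -> None:
        self._jars: Dict[str, Dict[str, str]] = {}

    def get(self, user: str, device: str) -> Optional[str]:
        return self._jars.get(device, {}).get(user)

    def set(self, user: str, device: str, bucket: str) -> None:
        self._jars.setdefault(device, {})[user] = bucket

    def clear(self, device: str) -> None:
        # Clearing cookies on one device forgets every bucket stored there.
        self._jars.pop(device, None)


class UserOptionStore:
    """Hypothetical: bucket memory kept per account, like a user option in the db."""

    def __init__(self) -> None:
        self._options: Dict[str, str] = {}

    def get(self, user: str, device: str) -> Optional[str]:
        # The device is ignored: the value follows the account.
        return self._options.get(user)

    def set(self, user: str, device: str, bucket: str) -> None:
        self._options[user] = bucket

    def cleanup(self) -> None:
        # The extra step needed once the test is done.
        self._options.clear()


cookies, options = CookieStore(), UserOptionStore()
for store in (cookies, options):
    store.set("alice", "laptop", "test")

# Same account, different machine.
on_phone_with_cookies = cookies.get("alice", "phone")
on_phone_with_options = options.get("alice", "phone")

# Same machine after clearing cookies.
cookies.clear("laptop")
on_laptop_after_clear = cookies.get("alice", "laptop")
```

With cookies, the bucket is missing on the second machine and after clearing; with a user option it follows the account, at the cost of the `cleanup` step after the test ends.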

@DLynch would bucketing people based on a user-option, as you describe above, relieve us of the cookie-based limitations you listed? If so, what tradeoffs, if any, should we be mindful of before committing to bucketing people based on user-options instead of cookies?

I hate it when software behaves differently based on the device (or browser profile) I’m using. It has happened to me quite a number of times on YouTube, where I intentionally use different profiles on the same machine (to limit how much Google tracks me). It’s annoying for me, but it may look like a bug to someone who doesn’t know what A/B testing is. One of the worst ways to design software is to intentionally make something that seems like a bug.

Change 655861 had a related patch set uploaded (by DLynch; owner: DLynch):
[mediawiki/extensions/DiscussionTools@master] A/B test bucketing for beta enrollment

https://gerrit.wikimedia.org/r/655861

Meeting notes

These are notes from the conversation @MNeisler and I had today.

Deployment

  • @DLynch: are we able to deploy the bucketing patch to https://test.wikipedia.org/wiki/Main_Page during Monday's (18-Jan) backport window instead of deploying it to the candidate wikis (T267379) which are now listed in the task description?
    • Reason: we'd prefer to QA the bucketing (T268193) on a test wiki rather than risk needing to stop the test in production to address potential bug(s) and then restart it.

Remaining open questions

  • We confirmed that when someone from the control group manually enables the Reply Tool in Special:Preferences, they should remain in the control group. I've updated the task description to reflect this.
    • Related: if we notice during analysis (see: T252057) that people in the test and control groups are engaging with talk pages in similar ways, we might explore the percentage of people within the control group who manually enabled the Reply Tool in Special:Preferences.
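
The follow-up check mentioned above could be computed along these lines. This is a hedged sketch only: the record shape (`bucket` and `manually_enabled` keys) and the function name are assumptions made for illustration, not the schema used in the actual analysis.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[str, bool]]


def manual_enable_share(users: List[Record]) -> float:
    """Fraction of control-group users who manually enabled the Reply Tool.

    Hypothetical record shape: {"bucket": "control" | "test",
    "manually_enabled": bool}. Returns 0.0 when the control group is empty.
    """
    control = [u for u in users if u["bucket"] == "control"]
    if not control:
        return 0.0
    enabled = sum(1 for u in control if u["manually_enabled"])
    return enabled / len(control)


# Made-up sample records, purely for illustration.
sample = [
    {"bucket": "control", "manually_enabled": True},
    {"bucket": "control", "manually_enabled": False},
    {"bucket": "control", "manually_enabled": False},
    {"bucket": "control", "manually_enabled": False},
    {"bucket": "test", "manually_enabled": False},
]
share = manual_enable_share(sample)
```

Because these people remain in the control group, a high share would dilute any difference between the two groups, which is why it is worth measuring before interpreting similar engagement.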

Change 655861 merged by jenkins-bot:
[mediawiki/extensions/DiscussionTools@master] A/B test bucketing for beta enrollment

https://gerrit.wikimedia.org/r/655861

@ppelberg you mean backport the entire A/B-test patch to every-wiki, and then deploy the config to the test wiki? It might be easier to just put the config patch on beta, since it already has the bucketing.

@MNeisler: a question for you came up as we were talking about the point @DLynch is raising above and the QA we have planned in T268193:

  • Would you feel comfortable verifying the A/B test bucketing is working as expected on the beta cluster [i] instead of the test wiki [ii]? Engineering shared the former would be more straightforward and we all suspected usage of the two non-production wikis would be comparable [iii].

i. https://en.wikipedia.beta.wmflabs.org/
ii. https://test.wikipedia.org/
iii. Note: if more scale is needed than what we think we'll have access to on a non-production wiki, @Whatamidoing-WMF and I talked and think id.wiki would be a good place to deploy and test the bucketing in production.

Unlike the test wiki, there's not a good way for me to query the eventlogging data from the beta cluster to confirm the buckets are balanced. I can check the log file, but it is restricted to a certain size and events usually only stay there for a couple of hours, which does not provide the scale we need to confirm the buckets are balanced.

Based on this and given the complexity of trying to change deployment to the test wiki, I'd recommend using id.wiki to test the bucketing in production. That should provide the scale and data accessibility I need to confirm the buckets are balanced.
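
To illustrate why scale matters for this check, here is one way a balance test could look: a two-sided z-test (normal approximation to the binomial) of the observed split against the expected 50/50. This is a sketch and not necessarily the method used in T268193; the function name and the example counts are made up.

```python
import math


def bucket_balance_p_value(test_count: int, control_count: int) -> float:
    """Two-sided p-value for the hypothesis that each bucket has probability 0.5.

    Uses the normal approximation to the binomial, so it is only
    reasonable when the total number of bucketed users is fairly large.
    """
    n = test_count + control_count
    if n == 0:
        raise ValueError("no bucketed users to evaluate")
    # Under a fair split: mean n/2, standard deviation sqrt(n * 0.5 * 0.5).
    z = (test_count - n / 2) / math.sqrt(n * 0.25)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


# Illustrative counts only.
p_small_skew = bucket_balance_p_value(510, 490)
p_large_skew = bucket_balance_p_value(600, 400)
```

A small p-value suggests the split is unlikely to be a fair 50/50. With only a handful of events, as on the beta cluster, even a badly skewed assignment would often go undetected.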

Thank you for thinking this through, @MNeisler; let's do as you are suggesting above and deploy the A/B test to id.wiki for the purposes of verifying people are being bucketed in the way we expect.

I'm going to consult the team on the next steps needed to make the above happen; I will follow up here once I know what they are.

Change 657690 had a related patch set uploaded (by DLynch; owner: DLynch):
[mediawiki/extensions/DiscussionTools@master] A/B test output when a specific feature is being tested

https://gerrit.wikimedia.org/r/657690

Change 657691 had a related patch set uploaded (by DLynch; owner: DLynch):
[operations/mediawiki-config@master] Enroll idwiki in the DiscussionTools a/b test

https://gerrit.wikimedia.org/r/657691

Change 657690 merged by jenkins-bot:
[mediawiki/extensions/DiscussionTools@master] A/B test output when a specific feature is being tested

https://gerrit.wikimedia.org/r/657690

Change 657653 had a related patch set uploaded (by DLynch; owner: DLynch):
[mediawiki/extensions/DiscussionTools@wmf/1.36.0-wmf.27] A/B test output when a specific feature is being tested

https://gerrit.wikimedia.org/r/657653

To close this loop, @MNeisler: we have everything ready on our end to start the A/B test on id.wiki. Though, this depends on the train rolling, which is currently blocked on T272638. Handing off to @DLynch to comment on this ticket when the A/B test reaches id.wiki.

Change 657653 merged by jenkins-bot:
[mediawiki/extensions/DiscussionTools@wmf/1.36.0-wmf.27] A/B test output when a specific feature is being tested

https://gerrit.wikimedia.org/r/657653

Change 657691 merged by jenkins-bot:
[operations/mediawiki-config@master] Enroll idwiki in the DiscussionTools a/b test

https://gerrit.wikimedia.org/r/657691

Mentioned in SAL (#wikimedia-operations) [2021-01-22T01:14:53Z] <urbanecm@deploy1001> Synchronized php-1.36.0-wmf.27/extensions/DiscussionTools/: 513a7861bbcf06a8ac5c29e1b9838640cbd7c628: A/B test output when a specific feature is being tested (T268191) (duration: 00m 55s)

Mentioned in SAL (#wikimedia-operations) [2021-01-22T01:16:39Z] <urbanecm@deploy1001> Synchronized wmf-config/InitialiseSettings.php: 376cba1b33dd68d40490a1498c59a4d430318ab1: Enroll idwiki in the DiscussionTools a/b test (T268191) (duration: 00m 55s)

This is deployed to idwiki, and @MNeisler will be verifying the test/control split looks even once data comes in over the weekend. I did verify with a new account that I was assigned appropriately to a bucket as part of the deploy process.

ppelberg updated the task description.