
[EPIC] Run A/B test on page issues (Farsi, Japanese, Russian, English)
Closed, Resolved · Public · 5 Estimated Story Points

Description

Background

We would like to run an A/B test on the page issues feature for two weeks. The test will display the new page issues code to one group and the older version to the remainder (as per the bucketing code built in T193584: Prepare feature flagging gateway for mobile issues work and the instrumentation built in T191532: Mobile page issues - instrument page issues). We will collect data for 4 weeks, plus a run-up time of 1-2 days to limit caching and novelty effects. A separate task has been set up for turning the test off: T200793. Suggested sampling ratio: 20% (per T200792#4489268).

Questions we are trying to answer

See T200794: Analyze results of page issues A/B test

Acceptance criteria

  • Notify the Analytics Engineering team as described here to have both ReadingDepth and PageIssues blacklisted from being stored in MariaDB (see T200792#4489268).

(done, see corresponding AC in T191532)

Japanese: https://phabricator.wikimedia.org/T204090#4626354
Russian: https://phabricator.wikimedia.org/T204090#4626381
Persian: https://phabricator.wikimedia.org/T204090#4626510

  • Roll out using several deploy windows in a single day - take care rolling this out in case there are any unexpected spikes in errors or event logging:
  • Progressively roll out page issues to the target projects at 5%
  • Progressively roll out page issues to the target projects at 10%
  • Set A/B test for page issues for select projects, at the A/B test sampling ratio of 20% for enwiki (per T200792#4489268) and 100% for the other three, smaller wikis (per T200792#4632005 ).
  • Increase sampling rate on jawiki, ruwiki, fawiki to 100%

Tentative dates:

  • Latvian wiki: Sept 19
  • English, Russian, Japanese, Farsi: Oct 1 (switched from Catalan to Farsi due to some smaller issues on Catalan and so that we can also have an RTL language included in the test)

Sign off steps

  • Wait a few days to get confirmation from Tilman that the A/B test instrumentation is working as expected.
  • If necessary, create follow-up tasks (for bugs, inconsistencies, etc)

--> T204143 in particular

  • Set up a task to analyse the A/B test

--> T200794

  • Prepare deploy/remove code task that will be carried out based on A-B test results.
  • Add a note to the project page
  • Add a note to the release timeline

TODO

  1. Use @phuedx or @pmiazga's "bucket breaker" script to find a session ID that'll put you in the correct bucket for testing

Run this code to opt into the PageIssues A/B test:

// Look up the A/B test class shipped with the Minerva skin.
var M = mw.mobileFrontend,
    AB = M.require( 'skins.minerva.scripts/AB' );

var t = 0, abTest;
function check() {
    abTest = new AB( {
        testName: 'WME.PageIssuesAB',
        // Run AB only on article namespace, otherwise set samplingRate to 0,
        // forcing user into control (i.e. ignored/not logged) group.
        samplingRate: mw.config.get( 'wgMinervaABSamplingRate', 0 ),
        sessionId: t
    } );
}
check();
// Try successive session IDs until one falls into bucket B.
while ( !abTest.isB() ) {
    t += 1;
    check();
}
// Persist the matching session ID for the current browser session.
mw.storage.session.set( 'mwuser-sessionId', t );
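
The opt-in snippet above is a brute-force search: it tries candidate session IDs one after another until the bucketing code places one in group B. As a rough illustration that can be run outside MediaWiki (e.g. in Node), the sketch below reproduces the same idea. Note that `toyBucket` and `findSessionId` are hypothetical names, the FNV-1a hash is a stand-in and not the hash MediaWiki uses, and the split into control (`1 - samplingRate`), A and B (`samplingRate / 2` each) is an assumption rather than something stated in this task.

```javascript
// Hypothetical stand-in for the AB class: maps a session ID to a bucket.
// Both the hash (FNV-1a) and the bucket split are assumptions for illustration.
function toyBucket( sessionId, samplingRate ) {
    var s = String( sessionId ),
        hash = 2166136261, // FNV-1a 32-bit offset basis
        i, fraction;
    for ( i = 0; i < s.length; i++ ) {
        hash = Math.imul( hash ^ s.charCodeAt( i ), 16777619 ) >>> 0;
    }
    fraction = hash / 4294967296; // scale the 32-bit hash to [0, 1)
    if ( fraction < 1 - samplingRate ) {
        return 'control';
    }
    return fraction < 1 - samplingRate / 2 ? 'A' : 'B';
}

// Try session IDs 0, 1, 2, ... until one lands in the wanted bucket.
function findSessionId( wantedBucket, samplingRate, maxTries ) {
    var t;
    for ( t = 0; t < maxTries; t++ ) {
        if ( toyBucket( t, samplingRate ) === wantedBucket ) {
            return t;
        }
    }
    return null; // nothing found within maxTries
}

var id = findSessionId( 'B', 0.2, 10000 );
console.log( 'Session ID in bucket B:', id );
```

Unlike the snippet above, this version caps the number of attempts, so a sampling rate of 0 (where no session ID can ever reach bucket B) returns `null` instead of looping forever.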

Related Objects

Status	Assigned
Resolved	Jdlrobson
Resolved	Jdlrobson
Resolved	ovasileva
Resolved	alexhollender_WMF
Duplicate	None
Resolved	Niedzielski
Resolved	ovasileva
Resolved	ovasileva
Resolved	phuedx
Resolved	Tbayer
Resolved	ovasileva
Resolved	Niedzielski
Resolved	Tbayer
Resolved	Tbayer
Resolved	Tbayer
Resolved	Tbayer
Resolved	ovasileva
Resolved	Tbayer
Resolved	phuedx
Resolved	ovasileva
Resolved	Jdlrobson
Resolved	Jdlrobson
Resolved	Tbayer
Resolved	Tbayer
Resolved	Niedzielski
Declined	None
Resolved	ovasileva

Event Timeline

Jdlrobson renamed this task from Run A/B test on page issues to Run A/B test on page issues (Catalan, Japanese, Russian, English). Sep 26 2018, 9:40 PM
Jdlrobson updated the task description.
Jdlrobson updated the task description.
ovasileva renamed this task from Run A/B test on page issues (Catalan, Japanese, Russian, English) to Run A/B test on page issues (Farsi, Japanese, Russian, English). Sep 28 2018, 9:51 PM
ovasileva updated the task description.

Change 463805 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[operations/mediawiki-config@master] Enable Page issues A/B test set rate to 5%

https://gerrit.wikimedia.org/r/463805

Change 463806 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[operations/mediawiki-config@master] Minerv page issues A/B test to 10%

https://gerrit.wikimedia.org/r/463806

Change 463807 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[operations/mediawiki-config@master] Page issues A/B test to 20% of users (Start the a/b test!)

https://gerrit.wikimedia.org/r/463807

Change 463805 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable Page issues A/B test set rate to 5%

https://gerrit.wikimedia.org/r/463805

Mentioned in SAL (#wikimedia-operations) [2018-10-01T18:12:53Z] <catrope@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Enable page issues A/B test at 5% rate (T200792) (duration: 00m 59s)

Change 463806 merged by jenkins-bot:
[operations/mediawiki-config@master] Minerv page issues A/B test to 20%

https://gerrit.wikimedia.org/r/463806

Mentioned in SAL (#wikimedia-operations) [2018-10-01T18:33:40Z] <catrope@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Enable page issues A/B test at 20% rate (T200792) (duration: 00m 56s)

A/B test is up and running for 20% of users. Let's verify the data coming in is sound.
We're seeing around 60 events per second (and no errors[1])

[1] kafkacat -C -b kafka-jumbo1001.eqiad.wmnet -t eventlogging_EventError | grep PageIssues

We now have two full hours of data. As a first check, here is the ratio of pageloaded events to all mobile web pageviews for enwiki (analogously to T204609#4607546 for lvwiki). 2.8% at a sampling ratio of 20% would extrapolate to 16% of pageviews, which is still consistent with 19% of enwiki mainspace pages having Ambox issues (T201123#4494446).

date	hour	pageloaded_events	all_pageviews	ratio
2018-10-01	18h	73531	5644969	0.013
2018-10-01	19h	158528	5693531	0.0278
2018-10-01	20h	156180	5598923	0.0279

Cool!

16% of pageviews, which is still consistent with 19% of enwiki mainspace pages having Ambox issues

I assume the missing 3% is users without EventLogging support/grade C browsers?

How is ReadingDepth impacted?

And to follow up on T204609#4630216, the newly added wikis appear to exhibit an issues clickthrough rate that is about as low as on lvwiki (it's a bit higher on fawiki with 0.69% so far). This looks like a good reason to increase the sampling ratio to 100% on the smaller (non-enwiki) wikis, in order to have a better chance of detecting changes (if any) with statistical significance. E.g. jawiki will have about 5 million mobile views during the two weeks of the test; if perhaps 1 million of these views are to pages with issues, that would still not be enough to detect a 5% increase at a 0.3% clickthrough rate.

wiki	issuesclickratio	pageloaded_events
enwiki	0.0024778551356252204	388239
fawiki	0.006883072739632903	11768
jawiki	0.0030875358126338627	35951
lvwiki	0.003582900914257475	8094
ruwiki	0.0030957126500148423	47162

Data via

SELECT wiki, 
SUM(IF(event.action = 'issueClicked', 1, 0)) / SUM(IF(event.action = 'pageLoaded', 1, 0)) AS issuesclickratio,
SUM(IF(event.action = 'pageLoaded', 1, 0)) AS pageloaded_events
FROM event.pageissues 
WHERE year = 2018 AND month = 10 AND day = 1 AND ((hour >= 19) OR (hour <= 20))
GROUP BY wiki;
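
To make the sample-size remark above more concrete, here is a minimal sketch of the usual normal-approximation formula for comparing two proportions. It is not the analysis used for this test: the function name is hypothetical, and the significance level (5%, two-sided, z = 1.96) and power (80%, z = 0.8416) are assumptions chosen for illustration.

```javascript
// Approximate number of pageviews needed per group to detect a relative
// lift in a clickthrough rate (two-proportion z-test, normal approximation).
function sampleSizePerGroup( baseline, relativeLift, zAlpha, zBeta ) {
    var p1 = baseline,
        p2 = baseline * ( 1 + relativeLift ),
        variance = p1 * ( 1 - p1 ) + p2 * ( 1 - p2 ),
        delta = p2 - p1;
    return Math.ceil( Math.pow( zAlpha + zBeta, 2 ) * variance / ( delta * delta ) );
}

// Assumed: 0.3% baseline clickthrough, 5% relative increase,
// z = 1.96 (5% two-sided significance), z = 0.8416 (80% power).
var perGroup = sampleSizePerGroup( 0.003, 0.05, 1.96, 0.8416 );
console.log( 'Views needed per group:', perGroup );
console.log( 'Views needed in total (A + B):', perGroup * 2 );
```

Under these assumptions the result is on the order of 2 million views per group, i.e. several times the roughly 1 million views to pages with issues estimated for jawiki.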

Change 463875 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[operations/mediawiki-config@master] smaller wiki a/b tests are bumped to 100%

https://gerrit.wikimedia.org/r/463875

Change 463807 abandoned by Jdlrobson:
Page issues A/B test to 20% of users (Start the a/b test!)

Reason:
not needed

https://gerrit.wikimedia.org/r/463807

There was a lack of clarity about the expected event increase from https://gerrit.wikimedia.org/r/463875 , causing some misunderstanding with Analytics Engineering and the postponing of the deployment earlier today:

[11:15:24] <elukey> raynor: from https://grafana.wikimedia.org/dashboard/db/eventlogging?orgId=1&panelId=5&fullscreen ReadingDepth is ~668 events/second
[11:15:58] <elukey> so it should jump to something like 900/s ?
[11:16:29] <raynor> yup, but that is most probably overall reading depth
[11:16:53] <elukey> oh yes this is the number that I care about basically

It looks like @JAllemandou already figured out things afterwards:

[12:26:52] <joal> elukey: just checked the patches - The bump to 20% included enwiki while the bump to 100% doesn't - I feel safe now :)

...but to further clarify just in case: The bulk of the events in ReadingDepth comes from a different sample that is not affected by the change here. The events contributed by this A/B test only make up roughly 10% of the rate currently (could vary a bit by time of day).[1] Also, as Joseph pointed out, the change only affects the smaller wikis in this experiment, which currently contribute roughly 30% of that sample (or 3% of the overall ReadingDepth events).[2] Thus we would expect the total rate of ReadingDepth events to increase by about 12%.

[1]

SELECT 
SUM(IF(event.page_issues_a_sample OR event.page_issues_b_sample, 1, 0))/SUM(1) AS pageissues_ratio,
SUM(IF(event.default_sample, 1, 0))/SUM(1) AS default_ratio,
SUM(1) AS all_events
FROM event.readingdepth
WHERE year = 2018 AND month = 10 AND day = 2;

pageissues_ratio	default_ratio	all_events
0.098912038873353	0.9110088650385093	63154830
[2]

SET hive.mapred.mode=nonstrict;
WITH allevents AS
  (SELECT COUNT(*) AS total_pi_events FROM event.readingdepth 
  WHERE year = 2018 AND month = 10 AND day = 2
  AND (event.page_issues_a_sample OR event.page_issues_b_sample))
SELECT wiki, COUNT(*) AS events, COUNT(*)/allevents.total_pi_events AS share_of_events
FROM (SELECT wiki FROM event.readingdepth 
  WHERE year = 2018 AND month = 10 AND day = 2
  AND (event.page_issues_a_sample OR event.page_issues_b_sample)) AS pi_events_by_wiki
JOIN allevents 
GROUP BY total_pi_events, wiki;

wiki	events	share_of_events
enwiki	4463002	0.7144492044132227
fawiki	88175	0.014115288005503001
jawiki	1222014	0.19562324419344196
lvwiki	12255	0.0019618129232485317
ruwiki	461327	0.07385045046458387

We basically care about the following rates:

https://grafana.wikimedia.org/dashboard/db/eventlogging?orgId=1&panelId=5&fullscreen

Before each sampling change it would be great to alert us with some numbers about the increase that one or more schemas will get, so we (analytics) can estimate whether our current capacity is enough to sustain the traffic. Eventlogging is currently deployed only on one host (eventlog1002) and I think that we may end up overloading it if we keep doing these kinds of tests in the future (without thinking about hw limits). Knowing numbers beforehand helps us figure out if we need to scale out Kafka/Eventlogging and plan accordingly :)

@elukey can we proceed with Swatting this patch or do you need anything more?

Thanks for the technical background, @elukey! I think it would be useful to add some guidance to the documentation. Developers might find concrete rate limits particularly useful (like the one we stated earlier about the old MySQL system). Especially since there was a sense earlier that the new Hadoop infrastructure would basically relieve us of worrying about throughput limitations.

@elukey can we proceed with Swatting this patch or do you need anything more?

Yep I think we are fine, but please alert us on IRC first :)

Especially since there was a sense earlier that the new Hadoop infrastructure would basically relieve us of worrying about throughput limitations.

That we can scale horizontally when handling events does not mean that it happens automagically; communicating about changes ensures that capacity is in place to handle them.

Change 463875 merged by jenkins-bot:
[operations/mediawiki-config@master] smaller wiki Minerva a/b tests are bumped to 100%

https://gerrit.wikimedia.org/r/463875

Mentioned in SAL (#wikimedia-operations) [2018-10-03T23:12:28Z] <catrope@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Bump Minerva A/B test rates to 100% on jawiki, ruwiki, fawiki (T200792) (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2018-10-22T22:12:09Z] <pmiazga@deploy1001> Synchronized wmf-config//InitialiseSettings-labs.php: SWAT: [[gerrit:469121|beta: Disable page issues A/B test on beta cluster only (T200792)]] (duration: 00m 46s)

For the record: We extended this test to run four weeks (at the current sampling rate) instead of two weeks, i.e. until the end of this week. This enables us to get more data in particular regarding the questions added recently in T200794#4661887 - it might still not be enough to detect changes reliably for these small groups, but we'll have a better chance.

Jdlrobson renamed this task from Run A/B test on page issues (Farsi, Japanese, Russian, English) to [EPIC] Run A/B test on page issues (Farsi, Japanese, Russian, English). Oct 31 2018, 5:13 PM
Jdlrobson added a project: Epic.
ovasileva updated the task description.

All done. Resolving!