
Set up basic send and receive wiring between a MW instance and a Statsig cloud instance
Closed, Resolved · Public · 8 Estimated Story Points

Assigned To
Authored By
WDoranWMF
Jul 11 2024, 4:30 PM

Description

In order to determine possible blockers and caveats to setting up and integrating Statsig for PoC user testing, we first need to establish a connection to receive configuration from a Statsig instance and then to construct a RESTful proxy that allows us to send event data to a Statsig instance.

Acceptance Criteria

  • A MW instance set up in Toolforge is able to receive configuration data from a Statsig instance
  • A RESTful API is up and running that receives events and proxies these to a Statsig instance
  • Sequence diagrams
  • Use cases examined
  • Shortcomings

Per a private conversation with @VirginiaPoundstone and @WDoranWMF:

  • Can we successfully bucket users based on traffic?
  • Can we successfully generate data based on variant assignment to those user buckets?
  • Is the data of good quality, does it contain errors?
  • Plus performance and scalability evaluation, to the best of our ability.

Sequence Diagrams

mermaid-diagram-2024-07-24-142913.png (822×1 px, 92 KB)

https://mermaid.live/edit#pako:eNqVVMFOwzAM_RUrV8aBaw-Tpg0OaEOICU1CvZjE7SLapCQuaJr276Ttuq5F3cCnOHl5fvFTvBfSKhKR8PRZkpG00Jg6zGMDIQp0zaKKV08OZikZhtvp9AZWpDRu9IeOYGlRhZydlh6eM-TEuhwe1yAzHfAdh3SatcQMHojlFtaM7HUK0ppEpx2sihN9U20ze4I5yi1F0MedDn6p6nC2YG3NEZZr7y-WmltHZ3sLZHxHT35YeQx3Qcig2vBljcCxrpBRsenSgerOngheyBfWKPjWvL3qC1a0LWlmbQH3X-R2cBc6ZUqmQbNaC_u7VbS657XuxuKgqVZ3PIv-YPzRrAboSymJlB-vdk4_KD3s_DWR48ZvwqPpoupL3EObr7P1epCgzvw_qy1tWn3LtL5cOupfp3PLT0m9EBORk8tRqzAU9tVBLHhLOcUitFMoSrDMOBaxOQQolmzXOyNFxK6kiSgLhdzOkHazQPNmbUgTzHzIQ4PZulUzd-rxc_gB03dabg

Event Timeline

WDoranWMF created this task.

Change #1054352 had a related patch set uploaded (by Phuedx; author: Phuedx):

[mediawiki/extensions/MetricsPlatform@master] DNM: Example Statsig integration

https://gerrit.wikimedia.org/r/1054352

  • A RESTful API is up and running that receives events and proxies these to a Statsig instance

For now, this is a simple proxy with a RESTful-looking URL, /rest.php/v1/metrics-platform/statsig/rgstr. It forwards the request, preserving the Statsig- headers but not the X-Client-IP or X-Forwarded-For headers.
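For illustration only, here is a minimal sketch of that forwarding behaviour, written as a standalone Node.js handler. The actual implementation is a MediaWiki REST handler, and the upstream Statsig endpoint shown here is an assumption:

```js
// Minimal sketch of the proxy behaviour described above (the real code is a
// MediaWiki REST handler, not Node.js; the upstream URL is an assumption).
const http = require( 'http' );

http.createServer( ( req, res ) => {
	// Only forward Statsig-* headers; drop X-Client-IP and X-Forwarded-For so
	// no client PII reaches Statsig.
	const headers = { 'content-type': req.headers[ 'content-type' ] || 'application/json' };
	for ( const [ name, value ] of Object.entries( req.headers ) ) {
		if ( name.toLowerCase().startsWith( 'statsig-' ) ) {
			headers[ name ] = value;
		}
	}

	let body = '';
	req.on( 'data', ( chunk ) => { body += chunk; } );
	req.on( 'end', async () => {
		const upstream = await fetch( 'https://events.statsigapi.net/v1/rgstr', {
			method: 'POST',
			headers,
			body
		} );
		res.writeHead( upstream.status );
		res.end();
	} );
} ).listen( 8080 );
```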

Notes

  1. I've had to use the maintenance version of the JavaScript SDK because:
    1. It allows you to inject the Statsig config fetched by the PHP SDK directly rather than fetching it itself. During testing, I found that this wasn't the case for the newer SDK
    2. The new SDK requires that you use a specific build to do local on-device evaluation. Per https://docs.statsig.com/client/js-on-device-eval-client, you require an Enterprise or Pro license to do on-device evaluation with this SDK
  2. The "standard" API provided by the PHP SDK returns null when the Statsig config has expired. This is not configurable. I've had to use an internal API to fetch the Statsig config directly
  3. The Statsig config is delivered in a separate RL virtual file. The PHP SDK returns null under certain conditions (see above), which makes RL throw an exception. In the case where the PHP SDK returns null, it's converted to the empty string to satisfy RL. Fortunately, the JavaScript SDK can handle this

Re. 1: Generally, it's important that we minimise the number of requests made by the browser. Specifically, it's important that we make no requests prior to showing features to users to avoid Flash(es) of Unstyled Content or features popping in, both of which could introduce bias. It's also important that we don't make requests from the browser to Statsig, which would lead to PII (the user's IP and current page) being sent to Statsig unintentionally.

@mpopov: I'm currently thinking about how to wire up the Statsig SDK and a general mechanism for T368326: Update Metrics Platform Client Libraries to accept experiment membership and would appreciate your input.

The Statsig and GrowthBook SDKs operate in the same way – they evaluate whether a feature is enabled for a user/a user should be enrolled in an experiment when asked, e.g. mw.experiments.isFeatureEnabled( 'foo' );. Because various JavaScript modules execute at different times and their execution order isn't strictly guaranteed between pageviews, we could very well have one or more instruments send an event with N enrollments only to send an event later with M enrollments. Would this be problematic or am I overthinking things?
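To make the N/M enrollments concern concrete, here is a small self-contained sketch; none of these names are the actual Metrics Platform API, they only illustrate the ordering problem:

```js
// Illustrative only: enrollment is recorded lazily, at the moment a gate is
// checked, so events sent at different times carry different enrollment sets.
const enrollments = [];

function isFeatureEnabled( name ) {
	enrollments.push( name );
	return true;
}

function sendEvent( event ) {
	console.log( 'event:', JSON.stringify( event ) );
}

// Instrument A executes early and sends an event with N = 1 enrollments.
isFeatureEnabled( 'experiment-foo' );
sendEvent( { action: 'impression', enrollments: [ ...enrollments ] } );

// Instrument B is lazily loaded and executes later, so its event carries
// M = 2 enrollments even though both events belong to the same pageview.
setTimeout( () => {
	isFeatureEnabled( 'experiment-bar' );
	sendEvent( { action: 'click', enrollments: [ ...enrollments ] } );
}, 1000 );
```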

My Current Mental Model

{F56639742}

https://mermaid.live/edit#pako:eNqNkk1Ow0AMha8y8pZygSzKohQpCyREdigbk3HKiPkJjicqqnp3nKSlLSDAq7H13mdbnh00yRIU0NNbptjQrcMNY6ij0eiQ58cYd4SSmcxKDeZ6ubwy621H7AJFMQ8epU0cClP2pj0oKeKzJ3tzYvzgmFGVoPRu84f9oPql-z9arYdLyyM1ia124-T9qJ8ZGO0JVsZeOE-ws_W_gCqKihk-CdMUw-UAZ8YyCr6SunhwDX23awoLCMQBndUL7cZyDfJCgWrQVcFSi9lLDXXcqxSzpOo9NlDorLSA3FmU40GPxQ7jU0qatuh7zck6SXw_f4LpL-w_AET6rc8

(The file attachment – a diagram I suspect? – is missing.)

I think this is only problematic if a metric depends on data collected from multiple independently, lazily loaded instruments, e.g. if there's an instrument that sends an event when a user sees content (a page load?) and then the instrument for the feature initializes and sends user interactions (say, clicks) with said feature. If we're calculating a clickthrough rate for the experiment but the impressions aren't tagged as being in the experiment, then we have a problem. But if the same instrument that's locked behind the flag sends both the click AND the impression event, then we're good.

Any pair of impression & click events coming from the same instrument would be guaranteed to be experiment-tagged correctly. It would mean we would have multiple impression (or init, whatever) events per pageview (and I think we already do with some instruments/features?), but the origin (the instrument ID?) would be critical here.

So a potential "standalone" guideline for analysts and developers could be: when instrumenting user interactions for measuring an interaction-type metric, the instrument should be sufficient on its own and should collect all, not parts of, the data that's necessary to calculate the metric.
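To illustrate the guideline, a sketch of a single instrument that is sufficient on its own (again, the names are illustrative, not the real API):

```js
// One instrument, gated behind the experiment flag, collects everything the
// clickthrough-rate metric needs, so both events share the same tagging.
function sendEvent( event ) {
	console.log( 'event:', JSON.stringify( event ) );
}

function initFeatureInstrument( isEnrolled ) {
	const context = { experiment: 'feature-x', enrolled: isEnrolled };

	// The same instrument sends the impression…
	sendEvent( { action: 'impression', ...context } );

	// …and the clicks, so numerator and denominator are tagged identically.
	document.querySelector( '#feature-x-button' )
		?.addEventListener( 'click', () => {
			sendEvent( { action: 'click', ...context } );
		} );
}

initFeatureInstrument( true );
```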

(The file attachment – a diagram I suspect? – is missing.)

I've corrected the permissions.

So a potential "standalone" guideline for analysts and developers could be: when instrumenting user interactions for measuring an interaction-type metric, the instrument should be sufficient on its own and should collect all, not parts of, the data that's necessary to calculate the metric.

Perfect!

I think I need to speak with Sam and understand this N / M enrollments problem some more because I'm not getting it right now. With that caveat, thoughts:

  1. I'm assuming we have to build the Statsig Config Fetcher you mentioned in the diagram? I was thinking we could kind of tack it onto the GrowthBook proxy. It would just be another route that doesn't have to know much about GrowthBook; we'd just be reusing that setup to save SRE work. And maybe we can more tightly integrate it with the way the GrowthBook proxy works, but that wouldn't be necessary. We'd just be using the Statsig Node.js SDK.
  2. I'm not sure what you were thinking beyond the Event Intake phase in the sequence diagram, but I was looking at the Data Warehouse integrations and, of all of them, I think the only non-proprietary one would be MinIO with its S3 compatibility. But that would be a big thing to set up. So I think if we're biting off this bit of proprietary software, it might require us to bite off more to do that integration.

P.S. I'm not sure I follow how we're meant to send data to Statsig; we usually collect stuff that we don't want leaving our servers.

☝️ I've hit some snags (plural).

I can send Statsig events via a proxy but I can only initialise the SDK via Statsig's endpoint after the page has loaded.

The snags are:

  1. The JS SDK can't be initialized with the config fetched and stored by the PHP SDK (by mimicking what the PHP SDK does); it silently fails. You can check feature gates etc., and events are sent by the Statsig SDK, but they are all marked as "unrecognized" (see https://docs.statsig.com/sdk/debugging)
    1. I was able to initialize the JS SDK from Statsig's API directly. However, the version of the SDK I'm using is expecting initialization values evaluated for a user rather than for all users.
  1. I've had to use the maintenance version of the JavaScript SDK because:
    1. It allows you to inject the Statsig config fetched by the PHP SDK directly rather than fetching it itself. During testing, I found that this wasn't the case for the newer SDK
    2. The new SDK requires that you use a specific build to do local on-device evaluation. Per https://docs.statsig.com/client/js-on-device-eval-client, you require an Enterprise or Pro license to do on-device evaluation with this SDK

I'm going to attempt to use the new SDK and see how far I can get with this.

  2. I tried to skirt around doing the above by getting the SDK to initialize via a proxy. However, the request that the SDK sends is invalid and confuses MediaWiki's REST framework – it has Content-Type: application/json but the body isn't JSON…

I'm going to attempt to use the new SDK and see how far I can get with this.

At the end of last week, I was able to initialise the newer version of the Local Evaluation JS SDK from config fetched on the server, have it evaluate a feature gate, and submit an event to Statsig via a proxy 🎉
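For reference, here is roughly what that wiring looks like. Treat this as a sketch rather than the working patch: the config variable, the option and data-adapter names, and the logEvent signature are assumptions to be checked against the SDK documentation and the actual change.

```js
// Sketch only: option/adapter names and signatures are assumptions.
import { StatsigOnDeviceEvalClient } from '@statsig/js-on-device-eval-client';

// Gate/experiment definitions fetched on the server and delivered to the
// client (e.g. via an RL virtual file); the config var name is hypothetical.
const serverConfig = mw.config.get( 'wgMetricsPlatformStatsigConfig' );
const CLIENT_SDK_KEY = 'client-…'; // placeholder

const client = new StatsigOnDeviceEvalClient( CLIENT_SDK_KEY, {
	// Point event logging at the MediaWiki proxy instead of Statsig directly
	// (option name assumed).
	networkConfig: {
		logEventUrl: '/rest.php/v1/metrics-platform/statsig/rgstr'
	}
} );

// Bootstrap from the server-fetched config rather than letting the SDK fetch
// its own (adapter API assumed).
client.dataAdapter.setData( JSON.stringify( serverConfig ) );
client.initializeSync();

const user = { customIDs: { stableID: mw.user.sessionId() } };

if ( client.checkGate( 'hello_world', user ) ) {
	console.log( 'Hello, World!' );
}
client.logEvent( { eventName: 'hello_world_seen' }, user );
```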


Notes

  1. The Statsig Local Evaluation SDK isn't as well-documented as its counterparts. The documentation that I'm referring to is here: https://docs.statsig.com/client/jsLocalEvaluationSDK. The documentation appears to reuse a section about the Statsig Options object used to initialize the SDK, which doesn't quite correspond to the code. I say this because the documentation states that the on-device evaluation SDKs are only for Enterprise and Pro Tier accounts, but I've been able to use it.

With the above in mind, I'll reach out to Statsig and ask about the following con:

Does not support IP or User Agent based checks (Browser Version/Name, OS Version/Name, IP, Country)

You never know…

  2. You can't disable diagnostics logging. In the other SDK I attempted to integrate, you could disable all logging or specific types of logging, whereas the Statsig Local Evaluation SDK is all or nothing.

Diagnostics logging happens when an error occurs within the SDK, including if it's provided with a faulty config. If we introduce an error in the config or in its delivery, then user information (including PII like their IP address) will be sent to Statsig.

Up until now, I've assumed that minimising requests to the server is paramount. If I relax that assumption a little, I can see an alternative that's more flexible (it uses the remote-evaluation SDK, which has fewer cons) with the same control over what gets sent where and the same separation from Statsig:

We introduce a private (uncacheable), URL-loadable RL module, which generates an ID, uses it to bootstrap the client SDK, and sets the ID in a cookie for reuse. The ext.metricsPlatform RL module, which is public (cacheable) and URL-loadable, would depend on this module.

However, this codepath would have to withstand a request per pageview (~6,000 requests/s?) and so would Statsig, although they were previously confident that this request rate would be acceptable. That said, the codepath is stateless and can therefore be scaled horizontally.
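For concreteness, a sketch of what the private RL module above could look like (the module layout and cookie name are assumptions):

```js
// Private, uncacheable module: establish a stable ID once per browser and
// expose it for the public ext.metricsPlatform module to bootstrap the SDK.
( function () {
	var COOKIE_NAME = 'mpStatsigStableId'; // hypothetical cookie name
	var stableId = mw.cookie.get( COOKIE_NAME );

	if ( !stableId ) {
		// 80 bits of randomness from MediaWiki core.
		stableId = mw.user.generateRandomSessionId();
		mw.cookie.set( COOKIE_NAME, stableId, { path: '/' } );
	}

	module.exports = {
		stableId: stableId
	};
}() );
```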

phuedx updated the task description.

Use cases examined

I chose the simplest use case that I could think of: 75% of all sessions should see "Hello, World!" in their browser's console. This proves that the Statsig SDK is at least being initialized with a valid config and allowed me to validate whether gate exposure events are being logged.

I also explored a server-side equivalent: Conditionally showing a button on a special page.

  • Can we successfully bucket users based on traffic?

In short, yes.

Pre-Varnish-provided identifier (Edge Unique ID (EUID)):

We update the MetricsPlatform extension to read a UID from a session cookie or generate one and write it to a session cookie. The UID would be used on both the server and the client as the Statsig SDK's stable ID.

This would allow us to evaluate feature gates and experiments for logged-in users on both the server and the client, and for logged-out users on the client.

"Based on traffic" here would mean per-session. It's worth noting that a session can last… a while: https://wikitech.wikimedia.org/wiki/Data_Platform/Sessions#Web

Post-EUID:

We can update the MetricsPlatform extension to use the EUID both on the server and on the client.

  • Can we successfully generate data based on variant assignment to those user buckets?

Yes.

I have been able to configure both the remote- and local-evaluation versions of the Statsig SDK to generate and transmit events via a MediaWiki REST API-based proxy.

  • Shortcomings
  • Plus performance and scalability evaluation, to the best of our ability.

Performance is where things get interesting and the shortcomings of the different solutions are highlighted.

We have to think about minimising Time To Load (TTL) overall (which includes minimising TTL for Metrics Platform) and minimising bytes transferred.

If we choose to use the Local Evaluation SDK, then we will increase page weight overall because the config includes all feature gate and experiment definitions for all users everywhere. As the WMF begins to run more and more experiments, the size of this config will increase. We can partially mitigate this by only sending the definitions for those feature gates and experiment definitions that need to be evaluated on the client. However, because the config can be used for any user, it can be cached at the edge, keeping TTL for Metrics Platform low.

If we choose to use remote evaluation (the default), then we increase the TTL for Metrics Platform and anything that depends on it, which could lead to flashes of unstyled content and a reduction in user-perceived performance. We can partially mitigate this by caching the result in the user agent for ~5 minutes, which would decrease the TTL for Metrics Platform on subsequent pageviews. However, because the config is for a specific user, it only contains the feature gate and experiment definitions for that user, which will decrease page weight.

Other considerations that are related to performance:

  • There is no way of preventing tampering with the config after it has been received. We can make it difficult to do so but it would require work on the server and the client, e.g. we could sign a digest of the config with a private key and send the signature with the config for the client to verify during initialisation (see the sketch after this list)
  • There is no way of totally obfuscating the contents of the config that we send to the client. The PHP and JS SDKs support hashing the names of the feature gates and experiments for transport but this requires additional work in the browser every time a feature gate or experiment is checked
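A minimal sketch of the signed-config idea from the first bullet above (the key format, algorithm choice, and how the signature is delivered are assumptions):

```js
// Verify a server-produced signature over the config before initialising the
// SDK with it. Uses the Web Crypto API; RSASSA-PKCS1-v1_5/SHA-256 is assumed.
async function verifyConfig( configJson, signatureBase64, publicKeyJwk ) {
	const key = await crypto.subtle.importKey(
		'jwk',
		publicKeyJwk,
		{ name: 'RSASSA-PKCS1-v1_5', hash: 'SHA-256' },
		false,
		[ 'verify' ]
	);
	const signature = Uint8Array.from(
		atob( signatureBase64 ),
		( c ) => c.charCodeAt( 0 )
	);
	return crypto.subtle.verify(
		'RSASSA-PKCS1-v1_5',
		key,
		signature,
		new TextEncoder().encode( configJson )
	);
}

// Usage: refuse to initialise the Statsig SDK if verification fails.
// if ( !( await verifyConfig( config, sig, publicKey ) ) ) { return; }
```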

Note well that all of the above also relates to how we would deploy other systems (e.g. GrowthBook) as well as a system that we built ourselves. Regardless of the direction we take, we must commit to measuring and monitoring these performance metrics (TTL, time to initialise, config size, etc.) for real users.

Change #1054352 abandoned by Phuedx:

[mediawiki/extensions/MetricsPlatform@master] DNM: Example Statsig integration

https://gerrit.wikimedia.org/r/1054352