
Set up basic send and receive wiring between a MW instance and a Statsig cloud instance
Closed, Resolved · Public · 8 Estimated Story Points

Assigned To
Authored By
WDoranWMF
Jul 11 2024, 4:30 PM

Description

In order to determine possible blockers and caveats to setting up and integrating Statsig for PoC user testing, we first need to establish a connection to receive configuration from a Statsig instance and then to construct a RESTful proxy that allows us to send event data to a Statsig instance.

Acceptance Criteria

  • A MW instance set up in Toolforge is able to receive configuration data from a Statsig instance
  • A RESTful API is up and running that receives events and proxies these to a Statsig instance
  • Sequence diagrams
  • Use cases examined
  • Shortcomings

Per a private conversation with @VirginiaPoundstone and @WDoranWMF:

  • Can we successfully bucket users based on traffic?
  • Can we successfully generate data based on variant assignment to those user buckets?
  • Is the data of good quality, does it contain errors?
  • Plus performance and scalability evaluation, to the best of our ability.

Sequence Diagrams

mermaid-diagram-2024-07-24-142913.png (822×1 px, 92 KB)

https://mermaid.live/edit#pako:eNqVVMFOwzAM_RUrV8aBaw-Tpg0OaEOICU1CvZjE7SLapCQuaJr276Ttuq5F3cCnOHl5fvFTvBfSKhKR8PRZkpG00Jg6zGMDIQp0zaKKV08OZikZhtvp9AZWpDRu9IeOYGlRhZydlh6eM-TEuhwe1yAzHfAdh3SatcQMHojlFtaM7HUK0ppEpx2sihN9U20ze4I5yi1F0MedDn6p6nC2YG3NEZZr7y-WmltHZ3sLZHxHT35YeQx3Qcig2vBljcCxrpBRsenSgerOngheyBfWKPjWvL3qC1a0LWlmbQH3X-R2cBc6ZUqmQbNaC_u7VbS657XuxuKgqVZ3PIv-YPzRrAboSymJlB-vdk4_KD3s_DWR48ZvwqPpoupL3EObr7P1epCgzvw_qy1tWn3LtL5cOupfp3PLT0m9EBORk8tRqzAU9tVBLHhLOcUitFMoSrDMOBaxOQQolmzXOyNFxK6kiSgLhdzOkHazQPNmbUgTzHzIQ4PZulUzd-rxc_gB03dabg

Event Timeline

WDoranWMF created this task.

Change #1054352 had a related patch set uploaded (by Phuedx; author: Phuedx):

[mediawiki/extensions/MetricsPlatform@master] DNM: Example Statsig integration

https://gerrit.wikimedia.org/r/1054352

  • A RESTful API is up and running that receives events and proxies these to a Statsig instance

For now, this is a simple proxy with a RESTful-looking URL, /rest.php/v1/metrics-platform/statsig/rgstr. It forwards the request, preserving the Statsig- headers but not the X-Client-IP or X-Forwarded-For headers.
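For illustration only, here is a minimal sketch of that forwarding behaviour, written as a standalone Node.js handler. The actual implementation is a MediaWiki REST handler, and the upstream Statsig endpoint shown here is an assumption:

```js
// Minimal sketch of the proxy behaviour described above (the real code is a
// MediaWiki REST handler, not Node.js; the upstream URL is an assumption).
const http = require( 'http' );

http.createServer( ( req, res ) => {
	// Only forward Statsig-* headers; drop X-Client-IP and X-Forwarded-For so
	// no client PII reaches Statsig.
	const headers = { 'content-type': req.headers[ 'content-type' ] || 'application/json' };
	for ( const [ name, value ] of Object.entries( req.headers ) ) {
		if ( name.toLowerCase().startsWith( 'statsig-' ) ) {
			headers[ name ] = value;
		}
	}

	let body = '';
	req.on( 'data', ( chunk ) => { body += chunk; } );
	req.on( 'end', async () => {
		const upstream = await fetch( 'https://events.statsigapi.net/v1/rgstr', {
			method: 'POST',
			headers,
			body
		} );
		res.writeHead( upstream.status );
		res.end();
	} );
} ).listen( 8080 );
```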

Notes

  1. I've had to use the maintenance version of the JavaScript SDK because:
    1. It allows you to inject the Statsig config fetched by the PHP SDK directly rather than fetching it itself. During testing, I found that this wasn't the case for the newer SDK
    2. The new SDK requires that you use a specific build to do local on-device evaluation. Per https://docs.statsig.com/client/js-on-device-eval-client, you require an Enterprise or Pro license to do on-device evaluation with this SDK
  2. The "standard" API provided by the PHP SDK returns null when the Statsig config has expired. This is not configurable. I've had to use an internal API to fetch the Statsig config directly
  3. The Statsig config is delivered in a separate RL virtual file. The PHP SDK returns null under certain conditions (see above), which makes RL throw an exception. In the case where the PHP SDK returns null, it's converted to the empty string to satisfy RL. Fortunately, the JavaScript SDK can handle this

Re. 1: Generally, it's important that we minimise the number of requests made by the browser. Specifically, it's important that we make no requests prior to showing features to users to avoid Flash(es) of Unstyled Content or features popping in, both of which could introduce bias. It's also important that we don't make requests from the browser to Statsig, which would lead to PII (the user's IP and current page) being sent to Statsig unintentionally.

@mpopov: I'm currently thinking about how to wire up the Statsig SDK and a general mechanism for T368326: Update Metrics Platform Client Libraries to accept experiment membership and would appreciate your input.

The Statsig and GrowthBook SDKs operate in the same way – they evaluate whether a feature is enabled for a user/a user should be enrolled in an experiment when asked, e.g. mw.experiments.isFeatureEnabled( 'foo' );. Because various JavaScript modules execute at different times and their execution order isn't strictly guaranteed between pageviews, we could very well have one or more instruments send an event with N enrollments only to send an event later with M enrollments. Would this be problematic or am I overthinking things?
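To make the N/M enrollments concern concrete, here is a small self-contained sketch; none of these names are the actual Metrics Platform API, they only illustrate the ordering problem:

```js
// Illustrative only: enrollment is recorded lazily, at the moment a gate is
// checked, so events sent at different times carry different enrollment sets.
const enrollments = [];

function isFeatureEnabled( name ) {
	enrollments.push( name );
	return true;
}

function sendEvent( event ) {
	console.log( 'event:', JSON.stringify( event ) );
}

// Instrument A executes early and sends an event with N = 1 enrollments.
isFeatureEnabled( 'experiment-foo' );
sendEvent( { action: 'impression', enrollments: [ ...enrollments ] } );

// Instrument B is lazily loaded and executes later, so its event carries
// M = 2 enrollments even though both events belong to the same pageview.
setTimeout( () => {
	isFeatureEnabled( 'experiment-bar' );
	sendEvent( { action: 'click', enrollments: [ ...enrollments ] } );
}, 1000 );
```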

My Current Mental Model

{F56639742}

https://mermaid.live/edit#pako:eNqNkk1Ow0AMha8y8pZygSzKohQpCyREdigbk3HKiPkJjicqqnp3nKSlLSDAq7H13mdbnh00yRIU0NNbptjQrcMNY6ij0eiQ58cYd4SSmcxKDeZ6ubwy621H7AJFMQ8epU0cClP2pj0oKeKzJ3tzYvzgmFGVoPRu84f9oPql-z9arYdLyyM1ia124-T9qJ8ZGO0JVsZeOE-ws_W_gCqKihk-CdMUw-UAZ8YyCr6SunhwDX23awoLCMQBndUL7cZyDfJCgWrQVcFSi9lLDXXcqxSzpOo9NlDorLSA3FmU40GPxQ7jU0qatuh7zck6SXw_f4LpL-w_AET6rc8

(The file attachment – a diagram I suspect? – is missing.)

I think this is only problematic if a metric depends on data collected from multiple independently, lazily loaded instruments, e.g. if there's an instrument that sends an event when a user sees content (a page load?) and then the instrument for the feature initializes and sends user interactions (say, clicks) with said feature. If we're calculating a clickthrough rate for the experiment but the impressions aren't tagged as being in the experiment, then we have a problem. But if the same instrument that's locked behind the flag sends both the click AND the impression event, then we're good.

Any pair of impression & click events coming from the same instrument would be guaranteed to be experiment-tagged correctly. It would mean we would have multiple impression (or init, whatever) events per pageview (and I think we already do with some instruments/features?), but the origin (the instrument ID?) would be critical here.

So a potential "standalone" guideline for analysts and developers could be: when instrumenting user interactions for measuring an interaction-type metric, the instrument should be sufficient on its own and should collect all, not parts of, the data that's necessary to calculate the metric.
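To illustrate the guideline, a sketch of a single instrument that is sufficient on its own (again, the names are illustrative, not the real API):

```js
// One instrument, gated behind the experiment flag, collects everything the
// clickthrough-rate metric needs, so both events share the same tagging.
function sendEvent( event ) {
	console.log( 'event:', JSON.stringify( event ) );
}

function initFeatureInstrument( isEnrolled ) {
	const context = { experiment: 'feature-x', enrolled: isEnrolled };

	// The same instrument sends the impression…
	sendEvent( { action: 'impression', ...context } );

	// …and the clicks, so numerator and denominator are tagged identically.
	document.querySelector( '#feature-x-button' )
		?.addEventListener( 'click', () => {
			sendEvent( { action: 'click', ...context } );
		} );
}

initFeatureInstrument( true );
```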

(The file attachment – a diagram I suspect? – is missing.)

I've corrected the permissions.

So a potential "standalone" guideline for analysts and developers could be: when instrumenting user interactions for measuring an interaction-type metric, the instrument should be sufficient on its own and should collect all, not parts of, the data that's necessary to calculate the metric.

Perfect!

I think I need to speak with Sam and understand this N / M enrollments problem some more because I'm not getting it right now. With that caveat, thoughts:

  1. I'm assuming we have to build the Statsig Config Fetcher you mentioned in the diagram? I was thinking we could kind of tack it onto the GrowthBook proxy. It would just be another route that doesn't have to know much about GrowthBook; we'd just be reusing that setup to save SRE work. And maybe we can more tightly integrate it with the way the GrowthBook proxy works, but that wouldn't be necessary. We'd just be using the Statsig Node.js SDK.
  2. I'm not sure what you were thinking beyond the Event Intake phase in the sequence diagram, but I was looking at the Data Warehouse integrations and, of all of them, I think the only non-proprietary one would be MinIO with its S3 compatibility. But that would be a big thing to set up. So I think if we're biting off this bit of proprietary software, it might require us to bite off more to do that integration.

P.S. I'm not sure I follow how we're meant to send data to Statsig; we usually collect stuff that we don't want leaving our servers.

☝️ I've hit some snags (plural).

I can send Statsig events via a proxy but I can only initialise the SDK via Statsig's endpoint after the page has loaded.

The snags are:

  1. The JS SDK can't be initialized with the config fetched and stored by the PHP SDK (by mimicking what the PHP SDK does); it silently fails. You can check feature gates etc., and events are sent by the Statsig SDK, but they are all marked as "unrecognized" (see https://docs.statsig.com/sdk/debugging)
    1. I was able to initialize the JS SDK from Statsig's API directly. However, the version of the SDK I'm using is expecting initialization values evaluated for a user rather than for all users.
  1. I've had to use the maintenance version of the JavaScript SDK because:
    1. It allows you to inject the Statsig config fetched by the PHP SDK directly rather than fetching it itself. During testing, I found that this wasn't the case for the newer SDK
    2. The new SDK requires that you use a specific build to do local on-device evaluation. Per https://docs.statsig.com/client/js-on-device-eval-client, you require an Enterprise or Pro license to do on-device evaluation with this SDK

I'm going to attempt to use the new SDK and see how far I can get with this.

  2. I tried to skirt around doing the above by getting the SDK to initialize via a proxy. However, the request that the SDK sends is invalid and confuses MediaWiki's REST framework – it has Content-Type: application/json but the body isn't JSON…

I'm going to attempt to use the new SDK and see how far I can get with this.

At the end of last week, I was able to initialise the newer version of the Local Evaluation JS SDK from config fetched on the server, have it evaluate a feature gate, and submit an event to Statsig via a proxy 🎉
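For reference, here is roughly what that wiring looks like. Treat this as a sketch rather than the working patch: the config variable, the option and data-adapter names, and the logEvent signature are assumptions to be checked against the SDK documentation and the actual change.

```js
// Sketch only: option/adapter names and signatures are assumptions.
import { StatsigOnDeviceEvalClient } from '@statsig/js-on-device-eval-client';

// Gate/experiment definitions fetched on the server and delivered to the
// client (e.g. via an RL virtual file); the config var name is hypothetical.
const serverConfig = mw.config.get( 'wgMetricsPlatformStatsigConfig' );
const CLIENT_SDK_KEY = 'client-…'; // placeholder

const client = new StatsigOnDeviceEvalClient( CLIENT_SDK_KEY, {
	// Point event logging at the MediaWiki proxy instead of Statsig directly
	// (option name assumed).
	networkConfig: {
		logEventUrl: '/rest.php/v1/metrics-platform/statsig/rgstr'
	}
} );

// Bootstrap from the server-fetched config rather than letting the SDK fetch
// its own (adapter API assumed).
client.dataAdapter.setData( JSON.stringify( serverConfig ) );
client.initializeSync();

const user = { customIDs: { stableID: mw.user.sessionId() } };

if ( client.checkGate( 'hello_world', user ) ) {
	console.log( 'Hello, World!' );
}
client.logEvent( { eventName: 'hello_world_seen' }, user );
```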


Notes

  1. The Statsig Local Evaluation SDK isn't as well-documented as its counterparts. The documentation that I'm referring to is here: https://docs.statsig.com/client/jsLocalEvaluationSDK. The documentation appears to reuse a section about the Statsig Options object used to initialize the SDK, which doesn't quite correspond to the code. I say this because the documentation states that the on-device evaluation SDKs are only for Enterprise and Pro Tier accounts, but I've been able to use it.

With the above in mind, I'll reach out to Statsig and ask about the following con:

Does not support IP or User Agent based checks (Browser Version/Name, OS Version/Name, IP, Country)

You never know…

  2. You can't disable diagnostics logging. In the other SDK I attempted to integrate, you could disable all logging or specific types of logging, whereas the Statsig Local Evaluation SDK is all or nothing.

Diagnostics logging happens when an error occurs within the SDK, including if it's provided with a faulty config. If we introduce an error in the config or in its delivery, then user information (including PII like their IP address) will be sent to Statsig.

Up until now, I've assumed that minimising requests to the server is paramount. If I relax that assumption a little, I can see an alternative that's more flexible (it uses the remote-evaluation SDK, which has fewer cons) with the same control over what gets sent where and the same separation from Statsig:

We introduce a private (uncacheable), URL-loadable RL module, which generates an ID, uses it to bootstrap the client SDK, and sets the ID in a cookie for reuse. The ext.metricsPlatform RL module, which is public (cacheable) and URL-loadable, would depend on this module.

However, this codepath would have to withstand a request per pageview (~6,000 requests/s?) and so would Statsig, although they were previously confident that this request rate would be acceptable. That said, the codepath is stateless and can therefore be scaled horizontally.
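For concreteness, a sketch of what the private RL module above could look like (the module layout and cookie name are assumptions):

```js
// Private, uncacheable module: establish a stable ID once per browser and
// expose it for the public ext.metricsPlatform module to bootstrap the SDK.
( function () {
	var COOKIE_NAME = 'mpStatsigStableId'; // hypothetical cookie name
	var stableId = mw.cookie.get( COOKIE_NAME );

	if ( !stableId ) {
		// 80 bits of randomness from MediaWiki core.
		stableId = mw.user.generateRandomSessionId();
		mw.cookie.set( COOKIE_NAME, stableId, { path: '/' } );
	}

	module.exports = {
		stableId: stableId
	};
}() );
```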

phuedx updated the task description.

Use cases examined

I chose the simplest use case that I could think of: 75% of all sessions should see "Hello, World!" in their browser's console. This proves that the Statsig SDK is at least being initialized with a valid config and allowed me to validate whether gate exposure events are being logged.

I also explored a server-side equivalent: Conditionally showing a button on a special page.

  • Can we successfully bucket users based on traffic?

In short, yes.

Pre-Varnish-provided identifier (Edge Unique ID (EUID)):

We update the MetricsPlatform extension to read a UID from a session cookie or generate one and write it to a session cookie. The UID would be used on both the server and the client as the Statsig SDK's stable ID.

This would allow us to evaluate feature gates and experiments for logged-in users on both the server and the client, and for logged-out users on the client.

"Based on traffic" here would mean per-session. It's worth noting that a session can last… a while: https://wikitech.wikimedia.org/wiki/Data_Platform/Sessions#Web

Post-EUID:

We can update the MetricsPlatform extension to use the EUID both on the server and on the client.

  • Can we successfully generate data based on variant assignment to those user buckets?

Yes.

I have been able to configure both the remote- and local-evaluation versions of the Statsig SDK to generate and transmit events via a MediaWiki REST API-based proxy.

  • Shortcomings
  • Plus performance and scalability evaluation, to the best of our ability.

Performance is where things get interesting and the shortcomings of the different solutions are highlighted.

We have to think about minimising Time To Load (TTL) overall (which includes minimising TTL for Metrics Platform) and minimising bytes transferred.

If we choose to use the Local Evaluation SDK, then we will increase page weight overall because the config includes all feature gate and experiment definitions for all users everywhere. As the WMF begins to run more and more experiments, the size of this config will increase. We can partially mitigate this by only sending the definitions for those feature gates and experiment definitions that need to be evaluated on the client. However, because the config can be used for any user, it can be cached at the edge, keeping TTL for Metrics Platform low.

If we choose to use remote evaluation (the default), then we increase the TTL for Metrics Platform and anything that depends on it, which could lead to flashes of unstyled content and a reduction in user-perceived performance. We can partially mitigate this by caching the result in the user agent for ~5 minutes, which would decrease the TTL for Metrics Platform on subsequent pageviews. However, because the config is for a specific user, it only contains the feature gate and experiment definitions for that user, which will decrease page weight.

Other considerations that are related to performance:

  • There is no way of preventing tampering with the config after it has been received. We can make it difficult to do so but it would require work on the server and the client, e.g. we could sign a digest of the config with a private key and send the signature with the config for the client to verify during initialisation (see the sketch after this list)
  • There is no way of totally obfuscating the contents of the config that we send to the client. The PHP and JS SDKs support hashing the names of the feature gates and experiments for transport but this requires additional work in the browser every time a feature gate or experiment is checked
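A minimal sketch of the signed-config idea from the first bullet above (the key format, algorithm choice, and how the signature is delivered are assumptions):

```js
// Verify a server-produced signature over the config before initialising the
// SDK with it. Uses the Web Crypto API; RSASSA-PKCS1-v1_5/SHA-256 is assumed.
async function verifyConfig( configJson, signatureBase64, publicKeyJwk ) {
	const key = await crypto.subtle.importKey(
		'jwk',
		publicKeyJwk,
		{ name: 'RSASSA-PKCS1-v1_5', hash: 'SHA-256' },
		false,
		[ 'verify' ]
	);
	const signature = Uint8Array.from(
		atob( signatureBase64 ),
		( c ) => c.charCodeAt( 0 )
	);
	return crypto.subtle.verify(
		'RSASSA-PKCS1-v1_5',
		key,
		signature,
		new TextEncoder().encode( configJson )
	);
}

// Usage: refuse to initialise the Statsig SDK if verification fails.
// if ( !( await verifyConfig( config, sig, publicKey ) ) ) { return; }
```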

Note well that all of the above also relates to how we would deploy other systems (e.g. GrowthBook) as well as a system that we built ourselves. Regardless of the direction we take, we must commit to measuring and monitoring these performance metrics (TTL, time to initialise, config size, etc.) for real users.

Change #1054352 abandoned by Phuedx:

[mediawiki/extensions/MetricsPlatform@master] DNM: Example Statsig integration

https://gerrit.wikimedia.org/r/1054352