Page MenuHomePhabricator

Report client-side JavaScript errors in MobileFrontend practically
Closed, ResolvedPublic8 Estimated Story Points

Description

This task encompasses the work to report client-side JavaScript failures in Minerva MobileFrontend in a stopgap way that's practical in today's system albeit not ideal.

Precursors

  • Ensure a first version of the schema has been setup for MinervaErrorLog

Schema

Create a new MinervaErrorLog schema. The schema should include the following fields:

PropertyTypeRequiredDescription
urlstringtrueComplete URI including query parameters and fragment.
oldidnumbertrueThe revision ID for the page accessed when the report was generated.
msgstringtrueA short but informative summary.
linenumbertrueThe line number of where the report originated
columnnumbertrueThe character offset of where the report originated
stackstringtrueA partial stacktrace that hopefully includes the module and function name where the report originated in MobileFrontend or MinervaNeue.
anonbooleantrueFalse if the user is logged in, true otherwise.
uselangstringtrueThe wiki user interface language code.
useskinstringtrueThe skin used such as "minerva", "vector", or "monobook".
metaObjecttrueA JSON dictionary of extra information that was considered pertinent at the time of the report.
gitMobileFrontendstringtrueThe MobileFrontend short Git commit hash. E.g., "deadc0de".
gitSkinstringtrueThe skin's short Git commit hash. E.g., "deadc0de".

The fields will be transmitted in the standard event capsule.

Note: this data is added by the server and doesn't count against the URL limit.

Code

  • A new ErrorLogger module is built as a distinct file.
  • The new ErrorLogger module is just EventLogging logger that sends MobileFrontendErrorLogs.
  • The ErrorLogger subscribes to global.error as early into loading as possible. The logger should not interfere with any other listeners such as the upload error reporter.
  • All uncaught errors thrown should be reported.
  • It should be possible to report an error without throwing. E.g., instead of throw new Error('foo') use ErrorLogger.log('foo').
  • Reports that exceed 2000 characters should trim stack and then url properties until within limits. Creative abbreviation or compression may be used to make the fields more applicable and concise.

Configs

The following new configs shall be added:

ConfigTypeDefault valueDescription
mfErrorLoggerEnabledboolfalseWhether or not to enable ErrorLogger. When false, ErrorLogger is replaced with a console.log implementation. This config defaults to a sensible value for third-party installations, false.
mfMobileFrontendErrorLogSamplingRatenumber1The rate at which reports are sampled. 1 means report all, .5 means report half, 0 means report none. This throttling mechanism is provided to ensure the backend isn't overwhelmed. The config defaults to a sensible value when mfErrorLoggerEnabled is enabled, 1.

Notes

Acceptance criteria

  • The Minerva skin should install a single error handler inside skins.minerva.scripts that forwards any client side errors to EventLogging. Alternatively, we might put this error handler inside mobile.init (which will also log events for other skins operating in mobile mode - currently theoretical)
  • Configuration exists to change sampling rate and disable/enable the feature
  • Configuration is not subject to caching - can be enabled/disabled within 5s (any config variables should make use of onResourceLoaderGetConfigVars to avoid being baked into the HTML)
  • By default the error logger is disabled and sampling rate set to 0
  • Errors are not logged to the schema on desktop Vector or any other skin.

Sign off steps

  • Failures are to be reported in Grafana and usable for investigation via SSH. Set up a task to make that happen.

Failures reported in https://grafana-labs-admin.wikimedia.org/dashboard/db/reading-web-beta-dashboard?orgId=1

  • Chores will need to be updated to include checking the new Grafana reports.
  • A task is created to evaluate the future of JavaScript error reporting at the end of the MobileFrontend investment project: do we keep our hacky solution, graduate to Sentry, delete the hack, or something else?

JR: Tracked in T106915

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Won't all skins want error reporting abilities? I thought MobileFrontend was supposed to eventually be a generic extension that could support multiple skins. Based on prior discourse, I thought that MobileFrontend kindof sortof had to support at least Vector so we should report when a MobileFrontend Vector error occurs. Also, most of our dicey JavaScript changes over the next year will be in MobileFrontend, not Minerva. mw.trackSubscribe() will queue global.error failures that occurred prior to invocation but don't you think that subscribing earlier is safer to ensure the invocation is actually run? mobile.init or mobile.startup seem like good possibilities.

Stephen and I discussed MobileFrontend or Minerva - I've updated task description to include this distinction/decision.

ovasileva set the point value for this task to 8.Aug 21 2018, 4:29 PM

We estimated this as a pessimistic 8. This might be a 5 as most of the risk can be deferred to the deploy task, but given some of the team's inexperience with EventLogging stack we settled higher.

Change 455064 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/skins/MinervaNeue@master] Client side error reporting for Minerva MobileFrontend

https://gerrit.wikimedia.org/r/455064

Time to get excited! First patch is up and mostly done. I just need to wire up the configuration.

I've defined https://meta.wikimedia.org/wiki/Schema:ClientSideError and I've made a few editorial changes to the names provided in the description.

Given gitMobileFrontend and gitSkin are difficult to obtain, I'm sending wgVersion which should be good enough for production error logging.

uselang

I'm not sure what kind of errors we are looking to obtain here. I'm not aware of any client side errors that would be localised to a certain language. Can you elaborate why this is important?

meta

I'm not sure what was intended here. Does somebody (@Niedzielski ?) want to elaborate on what this is?

mfErrorLoggerEnabled - Whether or not to enable ErrorLogger. When false, ErrorLogger is replaced with a console.log implementation. This config defaults to a sensible value for third-party installations, false.

This doesn't really make sense given default mediawiki behaviour is to console.warn these errors.

I've defined https://meta.wikimedia.org/wiki/Schema:ClientSideError and I've made a few editorial changes to the names provided in the description.

Any reason why you chose that name over MinervaErrorLog, suggested in the description?


By including the URL as data in the schema, do we run the risk of generating events that are too long for the EventLogging processors to process like we did with the VirtualPageview schema (see T196904)?

ClientSideError

Isn't all EventLogging client-side?

gitMobileFrontend and gitSkin

I think we'll have an easier time injecting these post-Webpack.

uselang

It seems like it would be very useful to easily filter errors by language codes. My recollection was that a similar field in the Android app crash reporter was invaluable for reproduction given the number of wikis we support.

meta

This is an arbitrary map we can use to report more detail as needed, particularly when an error is known to occur but devs are unable to repro. For example, let's say we're consistently failing in WatchstarGateway but we're not sure what the conditions are. We can add meta.watchstarPageIndex and meta.watchstarPageIDs or something like that to better characterize the error.

By including the URL as data in the schema, do we run the risk of generating events that are too long for the EventLogging processors to process

This is hidden deep in the task description but is considered:

Reports that exceed 2000 characters should trim stack and then url properties until within limits. Creative abbreviation or compression may be used to make the fields more applicable and concise.

Note that uselang does not have much to do with the language of the error, unless you are throwing error messages you get from the API. The error message language is determined by navigator.language, probably.

Per language, the event logging capsule contains domain and project so default language is available. Since it's not possible to change UI language I dont think this is needed.

Per name: by using something generic I thought it would be useful since we could reserve the right to add it into other places e.g. Popups
There is nothing Minerva specific inside it.

Thanks for additional information on meta. I'll add it to schema but sounds like don't need to make use of it right now.

In terms of length I think the stacktrace is going to be more problematic here. Trimming the uri is problematic here as the hash fragment is important information (more so than the rest). Personally I would not want to trim this until it's clear what the error rate is - remember these errors are not on every page view like VirtualPageViews.

Change 455064 merged by jenkins-bot:
[mediawiki/skins/MinervaNeue@master] Client side error reporting for Minerva MobileFrontend

https://gerrit.wikimedia.org/r/455064

Skipping design and QA as the code should be inert.

We've merged a first version of the error handling. Next steps would be to enable it to a tiny % of users. No rush on that front, just yet. We should also explore putting this on the beta cluster which traffic is smaller. I'm not sure if that gets piped to Grafana however..

I've left feedback in the original patch which I missed since I was out. @Jdlrobson, please take a look when you can.

Change 458588 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/skins/MinervaNeue@master] Hygiene: Move constants and explicit dependencies to top of errorLogging

https://gerrit.wikimedia.org/r/458588

Change 458589 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/skins/MinervaNeue@master] Hygiene: Consistent error naming

https://gerrit.wikimedia.org/r/458589

Change 458590 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/skins/MinervaNeue@master] Hygiene: Load errorLogger earlier in stack

https://gerrit.wikimedia.org/r/458590

Change 458591 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/skins/MinervaNeue@master] Log a character limited stack trace in error reports

https://gerrit.wikimedia.org/r/458591

Change 458590 merged by jenkins-bot:
[mediawiki/skins/MinervaNeue@master] Hygiene: Load errorLogger earlier in stack

https://gerrit.wikimedia.org/r/458590

Change 458589 merged by jenkins-bot:
[mediawiki/skins/MinervaNeue@master] Hygiene: Consistent error naming

https://gerrit.wikimedia.org/r/458589

Change 458588 abandoned by Jdlrobson:
Hygiene: Move constants and explicit dependencies to top of errorLogging

Reason:
not working on this (toomanythings)

https://gerrit.wikimedia.org/r/458588

Change 458591 abandoned by Jdlrobson:
Log a character limited stack trace in error reports

Reason:
Let's continue this conversation in https://phabricator.wikimedia.org/T203812

https://gerrit.wikimedia.org/r/458591

I suggest we create tasks for any further follow ups, given we still need to check the viability of error reporting and whether it's useful to use EventLogging.

I've setup T203812 to track the issue with stack traces.
I'd like us to get a sense of the number of errors we are getting in production before throwing more energy into this.

Failures are to be reported in Grafana and usable for investigation via SSH. Set up a task to make that happen.

I think this AC is incorrect and that reports generated should immediately be visible in Grafana without a new task. However, I'm not seeing them! Please see notes below.

I've tested this on readers-web-master and copy and pasted the EventLogging URL generated into a curl command pointed at the beta cluster.

That is, I changed this URL:

https://readers-web-master.wmflabs.org/event.gif?{"event":{"userUrl":"https://readers-web-master.wmflabs.org/wiki/Pharmacovigilance","errorUrl":"","errorMessage":"Script error.","errorStackTrace":"","errorLineNumber":0,"errorColumnNumber":0,"sessionToken":"8bdfd3d953188fe33881","skin":"minerva","wgVersion":"1.32.0-alpha","mobileMode":"beta","isAnon":false,"revision":0},"revision":18340282,"schema":"WebClientError","webHost":"readers-web-master.wmflabs.org","wiki":"wiki"};

To the following command:

curl -A "WikipediaApp/22.0.22-alpha-2017-11-01 (Android 8.0.0; Phone) Alpha Channel" 'https://en.wikipedia.beta.wmflabs.org/beacon/event.gif?{"event":{"userUrl":"https://readers-web-master.wmflabs.org/wiki/Pharmacovigilance","errorUrl":"","errorMessage":"Script error.","errorStackTrace":"","errorLineNumber":0,"errorColumnNumber":0,"sessionToken":"8bdfd3d953188fe33881","skin":"minerva","wgVersion":"1.32.0-alpha","mobileMode":"beta","isAnon":false,"revision":0},"revision":18340282,"schema":"WebClientError","webHost":"readers-web-master.wmflabs.org","wiki":"wiki"};'

Which was output on deployment-eventlogging05.eqiad.wmflabs /srv/log/eventlogging/all-events.log:

{
  "dt": "2018-09-10T20:04:06Z",
  "event": {
    "errorColumnNumber": 0,
    "errorLineNumber": 0,
    "errorMessage": "Script error.",
    "errorStackTrace": "",
    "errorUrl": "",
    "isAnon": false,
    "mobileMode": "beta",
    "revision": 0,
    "sessionToken": "8bdfd3d953188fe33881",
    "skin": "minerva",
    "userUrl": "https://readers-web-master.wmflabs.org/wiki/Pharmacovigilance",
    "wgVersion": "1.32.0-alpha"
  },
  "ip": "73.181.4.128",
  "recvFrom": "deployment-cache-text04.deployment-prep.eqiad.wmflabs",
  "revision": 18340282,
  "schema": "WebClientError",
  "seqId": 45215402,
  "userAgent": {
    "browser_family": "Android",
    "browser_major": "8",
    "browser_minor": "0",
    "device_family": "Generic Smartphone",
    "is_bot": false,
    "is_mediawiki": false,
    "os_family": "Android",
    "os_major": "8",
    "os_minor": "0",
    "wmf_app_version": "22.0.22-alpha-2017-11-01"
  },
  "uuid": "3e074052bb99509f875f5b92a4c4cf0f",
  "webHost": "readers-web-master.wmflabs.org",
  "wiki": "wiki"
}

However, the event was not shown on Grafana:

grafana.wikimedia.org_dashboard_db_eventlogging-schema_orgId=1&var-schema=WebClientError&from=now-30d&to=now.png (864×597 px, 60 KB)


Chores will need to be updated to include checking the new Grafana reports.

I didn't see an updates in the chore wheel template and chore list page.

A task is created to evaluate the future of JavaScript error reporting at the end of the MobileFrontend investment project: do we keep our hacky solution, graduate to Sentry, delete the hack, or something else?

I'm sorry if I've missed this task but I don't believe one has been created yet

Lastly, infinite looping and MobileFrontend logging concerns identified in code review were not addressed and seem like blockers to me.

That’s production Grafana :-)

You’ll want to check want to check Beta Cluster’s own Graphite/Grafana.

You’ll want to check want to check Beta Cluster’s own Graphite/Grafana.

Thanks @Krinkle.

I tried creating a dashboard on https://grafana-labs.wikimedia.org/dashboard/db/niedzielski-eventlogging-schema?var-schema=VirtualPageView with help from @Ottomata but ran out of time. Getting alerts on beta before it hits prod would be a wonderful improvement.


I retested using prod EventLogging and dropped the .gif from the URL:

curl -A "WikipediaApp/22.0.22-alpha-2017-11-01 (Android 8.0.0; Phone) Alpha Channel" 'https://en.wikipedia.beta.wmflabs.org/beacon/event?{"event":{"userUrl":"https://readers-web-master.wmflabs.org/wiki/Pharmacovigilance","errorUrl":"","errorMessage":"Script error.","errorStackTrace":"","errorLineNumber":0,"errorColumnNumber":0,"sessionToken":"8bdfd3d953188fe33881","skin":"minerva","wgVersion":"1.32.0-alpha","mobileMode":"beta","isAnon":false,"revision":0},"revision":18340282,"schema":"WebClientError","webHost":"readers-web-master.wmflabs.org","wiki":"wiki"};'

This worked:

grafana.wikimedia.org_dashboard_db_eventlogging-schema_orgId=1&var-schema=WebClientError (1).png (463×577 px, 31 KB)

\o/


Lastly, infinite looping and MobileFrontend logging concerns identified in code review were not addressed and seem like blockers to me.

@Jdlrobson asked for more clarity on these points in standup:

  • I'm guessing that calling M.require() within the error handler in the case that the file it requires has an error does not create an infinite loop but I would like that to be verified.
  • I'm concerned that most of our JavaScript changes over the next year will occur in MobileFrontend but may not be reported by this mechanism due to load order. Correct me if I'm wrong but if there's an error that occurs when loading mobile.startup, we'll never hear about it through this new report mechanism.

I'm guessing that calling M.require() within the error handler in the case that the file it requires has an error does not create an infinite loop but I would like that to be verified.

Correct. No infinite loop. Throwing an error in the error handler doesn't invoke the hook.

I'm concerned that most of our JavaScript changes over the next year will occur in MobileFrontend but may not be reported by this mechanism due to load order. Correct me if I'm wrong but if there's an error that occurs when loading mobile.startup, we'll never hear about it through this new report mechanism.

It depend where it occurs. When loading yes. Issues inside the top level closure could be missed yes. mobile.init is also too late. The only way to fix this would be to insert the handler at the front of mobile.startup, which would also load error reporting for anything else that happens to use it e.g. gadgets. It would also mean the loading of mobile.startup has side effects.

I'd argue those kind of errors are also going to be very visible to us and reproducible as if something fails in mobile.startup load the entire mobile site JS fails to load. If this indeed happens this could be a problem as it would be spammy for all UA's impacted.

We'd still catch issues inside the code - for instance if new Overlay() throws an error we'd hear about it

We can revisit this later, but I'm fine missing those kind of issues from the start until we feel more confident with the sampling rate.

Correct. No infinite loop. Throwing an error in the error handler doesn't invoke the hook.

Thanks for verifying!

The only way to fix this would be to insert the handler at the front of mobile.startup, which would also load error reporting for anything else that happens to use it e.g. gadgets. It would also mean the loading of mobile.startup has side effects.

Broken gadgets can cause problems at any time and will always need to be considered when viewing trends. Covering our eyes while gadgets _and_ MobileFrontend are loading isn't a solution in my opinion.

I'd argue those kind of errors are also going to be very visible to us and reproducible as if something fails in mobile.startup load the entire mobile site JS fails to load. If this indeed happens this could be a problem as it would be spammy for all UA's impacted.

I agree that that's generally true for mobile.startup code but what about less frequented modules of which there is a significant portion of code? For example, I don't know when the last time I checked nearby was and loading errors there would only be reported if we move where this handler is executed. I would rather file tickets to fix gadgets or filter out entire misbehaving wikis than miss out on legit MobileFrontend regressions we introduce when we actually introduce them.

Change 464373 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[operations/mediawiki-config@master] Log client errors on beta cluster at 100%

https://gerrit.wikimedia.org/r/464373

Change 464373 merged by jenkins-bot:
[operations/mediawiki-config@master] Beta: Log client errors on beta cluster at 100%

https://gerrit.wikimedia.org/r/464373

I've verified that errors are reported in grafana.
You can trigger an error using the following code:

window.setTimeout(function () {
	mw.stub.call();
}, 3000 );
$.ajax({url: '/' }).then( function() {
	mw.stub.call();
})

but not

mw.loader.using( 'mobile.startup' ).then(function() {
	mw.stub.call();
})

@Krinkle is there any reason errors inside a mw.loader.using callback are not trigger global.error or resourceloader.exception ?

Some replies to earlier discussions:

Broken gadgets can cause problems at any time and will always need to be considered when viewing trends. Covering our eyes while gadgets _and_ MobileFrontend are loading isn't a solution in my opinion.

Sorry, I have been a little confusing here. Gadgets are no different from our code, so gadgets that run after our code (which is always the case from my understanding) and will lead to errors being logged in our system. However, what I was concerned about here was desktop gadgets using our code and triggering errors outside the context of our own ecosystem. The error logger right now is enabled only in Minerva, meaning it cannot be triggered inside Vector or any other skin. I think this is a good thing as it keeps the environment predictable.

don't know when the last time I checked nearby was and loading errors there would only be reported if we move where this handler is executed.

That's not true. Nearby Minerva errors would be logged. I'm not sure we want to track errors inside Vector Minerva do we (since we don't officially support it)?

. I would rather file tickets to fix gadgets or filter out entire misbehaving wikis than miss out on legit MobileFrontend regressions we introduce when we actually introduce them.

You can't file tickets against gadgets. If a gadget is throwing an error the only way is to hope that gadget is still being maintained and/or edit the gadget itself :/

I added a breakpoint to the search bar click listener. When it paused, I caused a TypeError by changing a variable to null. The exception was thrown from:

load.php?debug=false&lang=en&modules=ext.centralNotice.choiceData,display,geoIP,impressionEventsSampleRate,kvStore,kvStoreMaintenance,startUp|ext.centralauth.centralautologin.clearcookie|ext.eventLogging,navigationTiming,wikimediaEvents|ext.eventLogging.debug,subscriber|ext.quicksurveys.init,lib|ext.visualEditor.targetLoader,track,ve|ext.wikimediaEvents.loggedin|jquery,oojs,oojs-router|jquery.accessKeyLabel,client,cookie,textSelection,throttle-debounce|mediawiki.RegExp,String,Title,Uri,api,base,cldr,cookie,experiments,jqueryMsg,language,notify,router,storage,template,user,util,viewport|mediawiki.legacy.wikibits|mediawiki.libs.pluralruleparser|mediawiki.page.startup|mediawiki.template.hogan,regexp|mediawiki.ui.anchor,input|mobile.ajax.styles|mobile.editor.api|mobile.init,references,search,site,startup,toggle,watchstar|mobile.pagelist.scripts,styles|mobile.pagesummary.styles|mobile.references.gateway,images|mobile.search.api,images,util|mobile.site.styles|mobile.startup.images|mobile.startup.images.variants|schema.QuickSurveyInitiation|skins.minerva.editor,mainMenu,notifications,scripts,toggling|skins.minerva.icons.images.scripts,variants|skins.minerva.icons.images.scripts.misc|skins.minerva.icons.page.issues.default.color|skins.minerva.icons.page.issues.medium.color|skins.minerva.icons.page.issues.uncolored|skins.minerva.mainMenu.icons,styles|skins.minerva.notifications.badge|skins.minerva.scripts.top|user.defaults&skin=minerva&version=1hi88f9:formatted:9823

The new handler properly intercepts the error but is unable to notify anyone about it: "Url exceeds maximum length". We should instead the report the error with a truncated URL in this case so that we see a spike of errors in EventLogging when they're occurring whether or not we're able to report exactly where they occur.


I tried the again by adding a breakpoint to doPrintBeforeTimeout which is invoked by clicking the download button. When the breakpoint hit, I nulled out this.showSpinner and saw the browser 204 the event:

https://en.wikipedia.beta.wmflabs.org/beacon/event?{"event":{"userUrl":"https://en.m.wikipedia.beta.wmflabs.org/wiki/Barack_Obama","errorUrl":"","errorMessage":"Uncaught TypeError: this.showSpinner is not a function","errorStackTrace":"","errorLineNumber":189,"errorColumnNumber":22,"sessionToken":"0466bfe4645502de05dc","skin":"minerva","wgVersion":"1.32.0-alpha","mobileMode":"stable","isAnon":false,"revision":377511},"revision":18340282,"schema":"WebClientError","webHost":"en.m.wikipedia.beta.wmflabs.org","wiki":"enwiki"};

I see the error logged on the BC dashboard and my hacky dashboard.


I'd like to continue the discussion of when we should start listening for errors so I've opened T206248.

The new handler properly intercepts the error but is unable to notify anyone about it: "Url exceeds maximum length". We should instead the report the error with a truncated URL in this case so that we see a spike of errors in EventLogging when they're occurring whether or not we're able to report exactly where they occur.

We talked a lot about truncating things during code review and bike shedded a lot so we removed most of it. Which field is the issue here that is leading to the long URI ? Is it the errorMessage itself?

Good question. I was thinking maybe the URL itself? Even a simple truncation of the URL field seems worthwhile to me.

Change 464640 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/skins/MinervaNeue@master] Simplify WebClientError code

https://gerrit.wikimedia.org/r/464640

mw.loader.using( 'mobile.startup' ).then(function() {
	mw.stub.call();
})

So investigating this some more the problem here, is that the callback is outside of the scope of mw.loader and you'd need to capture it like so:

mw.loader.using( 'mobile.startup' ).then(function() {
	mw.stub.call();
}).catch( function() {
} );

A reminder that we'll need to add explicit catch handlers if we want to track errors in any code that uses this pattern.

Good question. I was thinking maybe the URL itself? Even a simple truncation of the URL field seems worthwhile to me.

Have setup T206257 to investigate (good news is we still capture these via the EventError schema or can access these via logs so I don't think they are completely lost).

https://gerrit.wikimedia.org/r/464640 provides a simplified version of this code but I've setup T206258 to see that progress.

I think we can safely close out this task now.

Jdlrobson claimed this task.