Data Platform/Systems/EventLogging
This documentation is outdated. See Event Platform documentation.
EventLogging (EL for short) is a platform for modelling, logging, and processing arbitrary analytic data. It consists of:
- a MediaWiki extension that provides JavaScript and PHP APIs for logging events
- a back-end written in Python which aggregates events, validates them, and streams them to analytics clients.
This documentation is about the specific EventLogging instance that collects data on Wikimedia sites.
For users
Schemas
Here's the list of existing schemas. Note that not all of them are active: some are still in development (not active yet), and others are obsolete and listed only for historical reference.
https://meta.wikimedia.org/wiki/Research:Schemas
The schema's discussion page is the place to comment on the schema design and related topics. It contains a template that specifies the schema maintainer(s), the team and project the schema belongs to, its status (active, inactive, in development), and its purging strategy.
Creating a schema
There's thorough documentation on designing and creating a new schema here:
https://www.mediawiki.org/wiki/Extension:EventLogging/Guide#Creating_a_schema
These are some special guidelines to create a schema that Druid can digest easily: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines
Send events
See Extension:EventLogging/Programming for how to instrument your MediaWiki code.
Client-side events
Client-side events are logged using a web beacon with the project's hostname (e.g. https://en.wikipedia.org), the path /beacon/event, and a query string containing all the event fields (with percent-encoded punctuation). For example:
https://en.wikipedia.org/beacon/event?{"event":{"version":1,"action":"abort"...
Decoding the punctuation, this looks like:
https://en.wikipedia.org/beacon/event? { "event": { "action": "abort", ... }, "schema": "Edit", "revision": 1234, "webHost": "en.wikipedia.org", "wiki": "enwiki" }
Because this data is sent through a URL, we can't use URLs that are longer than browsers can cope with. Therefore, EventLogging limits unencoded client-side events to 2000 characters.
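As a rough sketch of the limit described above, the following hypothetical helper builds a beacon URL and drops events whose unencoded JSON exceeds 2000 characters (the real client code differs in details):

```python
import json
from urllib.parse import quote

# Hypothetical helper for illustration; EventLogging's real client code differs.
MAX_UNENCODED_LENGTH = 2000

def build_beacon_url(hostname, capsule):
    """Serialize an event capsule and build its beacon URL, or return
    None if the unencoded JSON exceeds the 2000-character limit."""
    payload = json.dumps(capsule, separators=(",", ":"))
    if len(payload) > MAX_UNENCODED_LENGTH:
        return None  # the client would drop an event this large
    return "https://%s/beacon/event?%s" % (hostname, quote(payload))

url = build_beacon_url("en.wikipedia.org", {
    "event": {"action": "abort"},
    "schema": "Edit",
    "revision": 1234,
    "webHost": "en.wikipedia.org",
    "wiki": "enwiki",
})
```

Note that the length check applies to the unencoded JSON; the percent-encoded URL is somewhat longer still.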
Note that the beacon URL you choose does not actually affect the data logged; for simplicity, both the iOS app and the Android app log all their events to the meta.wikimedia.org beacon even when the events relate to other projects.
Note that anyone could send events to these endpoints, but in production only events whose webhost is a wikimedia one are processed. There are many clones of our sites running our code (like bad.wikipedia-withadds.com) that are, at this time, sending events to the existing beacon.
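The webhost filtering mentioned above can be sketched as a simple suffix check; the domain list here is illustrative, not the production filter's actual list:

```python
# Illustrative domain list; the production filter's exact list differs.
WIKIMEDIA_DOMAIN_SUFFIXES = (
    ".wikipedia.org",
    ".wikimedia.org",
    ".wiktionary.org",
    ".wikidata.org",
    ".mediawiki.org",
)

def is_wikimedia_webhost(webhost):
    """Return True if an event's webHost field looks like a Wikimedia site."""
    return bool(webhost) and webhost.endswith(WIKIMEDIA_DOMAIN_SUFFIXES)
```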
Accessing data
Privacy
Data stored by EventLogging for the various schemas has varying degrees of privacy, including personally identifiable and sensitive information, so access to it requires an NDA. Also, by default, EL data is only kept for 90 days unless otherwise specified; see Analytics/Systems/EventLogging/Data retention.
See Analytics/EventLogging/Data representations for an explanation on where the data lives and how to access it.
Access
See: Analytics/Data access#EventLogging data and Analytics/Data access#Production_access.
Hadoop & Hive
Raw JSON data is imported into HDFS from Kafka, and then further refined into Parquet-backed Hive tables. These tables live in two Hive databases, event and event_sanitized, and are stored in HDFS at hdfs:///wmf/data/event and hdfs:///wmf/data/event_sanitized. The event database stores the original data for 90 days (data older than 90 days is automatically deleted), while event_sanitized stores sanitized data indefinitely. The sanitization process uses a whitelist that indicates which tables and fields can be stored indefinitely; see Analytics/Systems/EventLogging/Data retention and auto-purging. You can access all this data through Hive, Spark, or other Hadoop methods.
Data from a given hourly period is only refined into Hive two hours after the end of the period, to allow for late arriving events.[1]
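The two-hour allowance means the most recent fully refined partition is always a few hours behind the current time. A small sketch (the exact scheduling semantics of the refine job are an assumption here):

```python
from datetime import datetime, timedelta

# Assumption: an hourly period is refined two hours after it ends,
# i.e. three hours after it starts.
REFINE_DELAY_HOURS = 2

def latest_refined_hour(now):
    """Return (year, month, day, hour) of the most recent hourly
    partition whose refinement should already be complete."""
    t = now.replace(minute=0, second=0, microsecond=0)
    t -= timedelta(hours=REFINE_DELAY_HOURS + 1)
    return (t.year, t.month, t.day, t.hour)
```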
Notes on data in Hive
A UDF has been provided in Hive to convert the dt field into a MediaWiki timestamp (phab:T186155). It can be used to join to MediaWiki-style timestamp strings as follows:
ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION GetMediawikiTimestamp AS 'org.wikimedia.analytics.refinery.hive.GetMediawikiTimestampUDF';
SELECT GetMediawikiTimestamp('2019-02-20T12:34:56Z') AS timestamp;
OK
timestamp
20190220123456
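When post-processing query results outside of Hive, the same conversion can be reproduced in plain Python. This is a sketch of the transformation the UDF performs; the UDF's edge-case handling may differ:

```python
import re

def to_mediawiki_timestamp(dt):
    """Convert an ISO 8601 string like '2019-02-20T12:34:56Z' into a
    14-digit MediaWiki timestamp like '20190220123456' by stripping
    all non-digit characters."""
    return re.sub(r"\D", "", dt)[:14]
```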
NOTE: Not all EventLogging analytics schemas are 'refinable'. Some schemas specify invalid field names, e.g. with dots '.' in them, or have field type changes between different records. If this happens, it will not be possible to store the data in a Hive table, and as such it won't appear in the list of refined tables. If your schema has this problem, you should fix it. (Dashes '-' in field names are automatically converted to underscores '_' during the refine process, before the data is ingested into Hive; cf. phab:T216096#4955417.)
NOTE: Hadoop and Hive (in the JVM) are strongly typed, whereas the source EventLogging JSON data is not. This can cause problems when importing into Hive, as the refinement step needs to decide what to do when it encounters type changes. TYPE CHANGES ARE NOT SUPPORTED. Please do not ever change the type of an EventLogging field. You may add new fields as you need and stop using old ones, but do not change types. Some type changes are only partially tolerated during the refinement stage: e.g. if the schema declares an integer but later data contains a decimal number, the refinement step will log a warning and still finish, but the record with the offending field will have all of its fields set to NULL (not just the offending one).
Hive
EventLogging analytics data is imported into the event and event_sanitized databases in Hive.
Note that the EventLogging schema fields are within the event column (a struct). You can access them using dot notation, e.g. event.userID.
Basic example:
SELECT
event.userID,
count(*) as cnt
FROM
event.MobileWikiAppEdit
WHERE
year = 2017 AND month = 11 AND day = 20 AND hour = 19
GROUP BY event.userID
ORDER BY cnt DESC
LIMIT 10;
...
event.userid cnt
NULL 1848
333333 87
222229 59
111113 29
111125 21
466534 17
433542 10
754324 7
121346 7
123452 6
Cross-schema example:
SELECT
nav.event.origincountry,
srv.event.description,
PERCENTILE(nav.event.responsestart, 0.50) AS responsestart_p50,
PERCENTILE(nav.event.responsestart, 0.75) AS responsestart_p75,
COUNT(*) AS count
FROM event.navigationtiming AS nav
JOIN event.servertiming AS srv ON nav.event.pageviewtoken = srv.event.pageviewtoken
WHERE
nav.year = 2020 AND
srv.year = 2020 AND
nav.month = 1 AND
srv.month = 1 AND
nav.day = 28 AND
srv.day = 28 AND
nav.event.isoversample = false
GROUP BY nav.event.origincountry,srv.event.description
HAVING count > 1000;
Errors for schemas
Errors are available in the eventerror table in the event database. Sample select:
select * from eventerror where event.schema like 'MobileWikiApp%' and year=2018 and month=11 and day=1 limit 10;
Spark
Spark can access data directly through HDFS, or as SQL tables in Hive. Refer to the Spark documentation for how to do so. Examples:
Spark 2 Scala SQL & Hive:
// spark2-shell
val query = """
SELECT
event.userID,
count(*) as cnt
FROM
event.MobileWikiAppEdit
WHERE
year = 2017 AND month = 11 AND day = 20 AND hour = 19
GROUP BY event.userID
ORDER BY cnt DESC
"""
val result = spark.sql(query)
result.limit(10).show()
...
-------- ----
| userID| cnt|
-------- ----
| null|1848|
| 333333| 87|
| 222229| 59|
| 111113| 29|
| 111125| 21|
| 466534| 17|
| 433542| 10|
| 754324| 7|
| 121346| 7|
| 123452| 6|
-------- ----
Spark 2 Python SQL & Hive:
# pyspark2
query = """
SELECT
event.userID,
count(*) as cnt
FROM
event.MobileWikiAppEdit
WHERE
year = 2017 AND month = 11 AND day = 20 AND hour = 19
GROUP BY event.userID
ORDER BY cnt DESC
"""
result = spark.sql(query)
result.limit(10).show()
...
-------- ----
| userID| cnt|
-------- ----
| null|1848|
| 333333| 87|
| 222229| 59|
| 111113| 29|
| 111125| 21|
| 466534| 17|
| 433542| 10|
| 754324| 7|
| 121346| 7|
| 123452| 6|
-------- ----
Spark 2 R SQL & Hive:
# spark2R
query <- "
SELECT
event.userID,
count(*) as cnt
FROM
event.MobileWikiAppEdit
WHERE
year = 2017 AND month = 11 AND day = 20 AND hour = 19
GROUP BY event.userID
ORDER BY cnt DESC
"
result <- collect(sql(query))
head(result,10)
...
userID cnt
1 NA 1848
2 333333 87
3 222229 59
4 111113 29
5 111125 21
6 466534 17
7 433542 10
8 754324 7
9 121346 7
10 123452 6
Hadoop. Archived Data
In 2017, some big EventLogging tables were archived from MariaDB to Hadoop. Tables were exported with Sqoop into Avro format files, and Hive tables were created according to the corresponding schemas. Thus far we have the following tables archived in Hadoop, in the archive database:
- mobilewebuiclicktracking_10742159_15423246
- Edit_13457736_15423246
- MobileWikiAppToCInteraction_10375484_15423246
- MediaViewer_10867062_15423246
- pagecontentsavecomplete_5588433_15423246
- PageContentSaveComplete_5588433
- PageCreation_7481635
- PageCreation_7481635_15423246
- PageDeletion_7481655
- PageDeletion_7481655_15423246
You can query these tables just like any other table in Hive. A tip for dealing with binary types:
select * from Some_tbl where (cast(uuid as string) )='ed663031e61452018531f45b4b5502cb';
Caveat: This process does not preserve the data types of e.g. bigint or boolean fields. The archived Hive table will contain them as strings instead, which will need to be converted back (e.g. CAST(field AS BIGINT)).
Hadoop Raw Data
Raw EventLogging JSON data is imported hourly into Hadoop by Gobblin. It is unlikely that you will ever need to access this raw data directly. Instead, use the refined event Hive tables as described above.
Raw data is written to directories named after each schema in hourly partitions in HDFS. /mnt/hdfs/wmf/data/raw/eventlogging/eventlogging_<schema>/hourly/<year>/<month>/<day>/<hour>. There are a myriad of ways to access this data, including Hive and Spark. Below are a few examples. There may be many (better!) ways to do this.
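The partition layout above is regular enough to generate paths programmatically, e.g. when listing files to load into Spark. A small helper (hypothetical, for illustration) that builds the hourly partition path for a schema:

```python
# Base of the raw EventLogging data in HDFS, per the layout described above.
RAW_BASE = "/wmf/data/raw/eventlogging"

def raw_partition_path(schema, year, month, day, hour):
    """Build the HDFS path of one hourly raw partition; month, day,
    and hour are zero-padded as in the directory layout."""
    return "%s/eventlogging_%s/hourly/%04d/%02d/%02d/%02d" % (
        RAW_BASE, schema, year, month, day, hour)
```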
For backup purposes, we keep 90 days of events coming from the eventlogging-client-side topic in /mnt/hdfs/wmf/data/raw/eventlogging_client_side/hourly/<year>/<month>/<day>/<hour>.
Note that all EventLogging data in Hadoop is automatically purged after 90 days; the whitelist of fields to retain is not used, but this feature could be added in the future if there is sufficient demand.
Hive
Hive has a couple of built in functions for parsing JSON. Since EventLogging records are stored as JSON strings, you can access this data by creating a Hive table with a single string column and then parsing that string in your queries:
ADD JAR file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;
-- Make sure you don't create tables in the default Hive database.
USE otto;
-- Create a table with a single string field
CREATE EXTERNAL TABLE `CentralNoticeBannerHistory` (
`json_string` string
)
PARTITIONED BY (
year int,
month int,
day int,
hour int
)
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory';
-- Add a partition
ALTER TABLE CentralNoticeBannerHistory
ADD PARTITION (year=2015, month=9, day=17, hour=16)
LOCATION '/wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2015/09/17/16';
-- Parse the single string field as JSON and select a nested key out of it
SELECT get_json_object(json_string, '$.event.l.b') as banner_name
FROM CentralNoticeBannerHistory
WHERE year=2015;
Spark
Spark Python (pyspark):
import json
data = sc.sequenceFile("/wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2015/09/17/07")
records = data.map(lambda x: json.loads(x[1]))
records.map(lambda x: (x['event']['l'][0]['b'], 1)).countByKey()
Out[33]: defaultdict(<class 'int'>, {'WMES_General_Assembly': 5})
MobileWikiAppFindInPage events with SparkSQL in Spark Python (pyspark 1):
# Load the JSON string values out of the compressed sequence file.
# Note that this uses * globs to expand to all data in 2016.
data = sc.sequenceFile(
"/wmf/data/raw/eventlogging/eventlogging_MobileWikiAppFindInPage/hourly/2016/*/*/*"
).map(lambda x: x[1])
# parse the JSON strings into a DataFrame
json_data = sqlCtx.jsonRDD(data) # replace with sqlCtx.read.json(data) for pyspark 2
# Register this DataFrame as a temp table so we can use SparkSQL.
json_data.registerTempTable("MobileWikiAppFindInPage")
top_k_page_ids = sqlCtx.sql(
"""SELECT event.pageID, count(*) AS cnt
FROM MobileWikiAppFindInPage
GROUP BY event.pageID
ORDER BY cnt DESC
LIMIT 10"""
)
for r in top_k_page_ids.collect():
print "%s: %s" % (r.pageID, r.cnt)
Edit events with SparkSQL in Spark scala (spark-shell):
// Load the JSON string values out of the compressed sequence file
// and parse them as a DataFrame.
val rawDataPath = "/wmf/data/raw/eventlogging/eventlogging_Edit/hourly/2015/10/21/16"
val edits = spark.read.json(
spark.createDataset[String](
spark.sparkContext.sequenceFile[Long, String](rawDataPath).map(_._2)
)
)
// Register this DataFrame as a temp view so we can use Spark SQL.
edits.createOrReplaceTempView("edits")
// SELECT top 10 edited wikis
val top_k_edits = spark.sql(
"""SELECT wiki, count(*) AS cnt
FROM edits
GROUP BY wiki
ORDER BY cnt DESC
LIMIT 10"""
)
// Print the results on the driver
top_k_edits.show()
Kafka
There are many Kafka tools with which you can read the EventLogging data streams. kafkacat is one that is installed on stat1007.
# Uses the kafkacat CLI to print window ($1)
# seconds of data from topic ($2)
function kafka_timed_subscribe {
    timeout "$1" kafkacat -C -b kafka-jumbo1001 -t "$2"
}
# Prints the top K ($1) most frequently
# occurring values from stdin.
function top_k {
    sort |
    uniq -c |
    sort -nr |
    head -n "$1"
}
while true; do
date; echo '------------------------------'
# Subscribe to eventlogging_Edit topic for 5 seconds
kafka_timed_subscribe 5 eventlogging_Edit |
# Filter for the "wiki" field
jq .wiki |
# Count the top 10 wikis that had the most edits
top_k 10
echo ''
done
Publishing data
See Analytics/EventLogging/Publishing for how to proceed if you want to publish reports based on EventLogging data, or datasets that contain EventLogging data.
Verify received events
Logstash has eventlogging EventError events. You can view all of these at https://logstash.wikimedia.org/goto/bda91f37481ae4970ee21e11810d49d3
Validation errors are visible in application logs located at /srv/log/eventlogging/systemd. In production they also end up in the Kafka topic eventlogging_EventError, and there is also a Hive table named event.eventerror.
The processor is the component that handles validation, so, for example, eventlogging_processor-client-side-<some>.log will have an error like the following if events are invalid:
Unable to validate: ?{
"event": {
"pagename": "Recentchanges",
"namespace": null,
"invert": false,
"associated": false,
"hideminor": false,
"hidebots": true,
"hideanons": false,
"hideliu": false,
"hidepatrolled": false,
"hidemyself": false,
"hidecategorization": true,
"tagfilter": null
},
"schema": "ChangesListFilters",
"revision": 15876023,
"clientValidated": false,
"wiki": "nowikimedia",
"webHost": "no.wikimedia.org",
"userAgent": "Apple-PubSub/65.28"
}; cp1066.eqiad.wmnet 42402900 2016-09-26T07:01:42 -
This happens if client code has a bug and is sending events that are not valid according to the schema. We normally try to identify the schema at fault and pass that info back to the developers so they can fix it. See this ticket for how we deal with such errors: https://phabricator.wikimedia.org/T146674
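The shape of the check the processor performs can be illustrated with a toy required-field validator. This is only a sketch: the real processor validates capsule["event"] against the revision of the JSON schema named in the capsule, not a hand-written field list.

```python
# Toy sketch of EventLogging-style validation; the real processor
# validates against the named JSON schema revision.
def validate_event(capsule, required_fields):
    """Return (is_valid, missing_fields) for an event capsule dict."""
    event = capsule.get("event", {})
    missing = [f for f in required_fields if f not in event]
    return (not missing, missing)
```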
As of T205437, validation error logs are also available in Logstash for up to 30 days, i.e. https://logstash.wikimedia.org/goto/4882115feb72bdcfa812ace67b02e5bb. A handy link to the associated Kibana search is available on a schema's talk page, provided that it's documented using the SchemaDoc template.
Note well that access to Logstash requires a Wikimedia developer account with membership in a user group indicating that the user has signed an NDA.
User agent sanitization
Main article: Analytics/Systems/EventLogging/User agent sanitization
The userAgent field is sanitized immediately upon storage; the content is replaced with a parsed version in JSON format.
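A toy sketch of that replacement is shown below. Production uses a full user-agent parsing library; this regex recognises only one illustrative "Product/version" pattern, and the output field names are assumptions:

```python
import json
import re

def sanitize_user_agent(ua):
    """Toy sketch: replace a raw userAgent string with a parsed JSON
    object. Only matches a simple 'Product/version' pattern."""
    m = re.match(r"(?P<family>[\w.-]+)/(?P<version>[\d.]+)", ua)
    if m:
        parsed = {
            "browser_family": m.group("family"),
            "browser_major": m.group("version").split(".")[0],
        }
    else:
        parsed = {"browser_family": "Other", "browser_major": None}
    return json.dumps(parsed)
```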
Data retention and purging
By default, all EventLogging data is deleted after 90 days to comply with our data retention guidelines.
However, individual properties within schemas can be whitelisted so that the data is retained indefinitely; generally, all columns can be whitelisted except the clientIp and userAgent fields. This whitelist is maintained in the analytics/refinery repo as static_data/eventlogging/whitelist.yaml.
Retiring a schema
When you no longer want to collect a particular data stream, there are a few cleanup steps you should take:
- Remove the instrumentation code
- Mark the schema inactive by editing the SchemaDoc template on its talk page.
- Remove its entries from the whitelist (so it's easy for others to review what's actively being retained).
- Request the deletion of any previously whitelisted data if it's no longer necessary
Operational support
Tier 2 support
Outages
Any outages that affect EventLogging will be tracked on Incident documentation (also listed below) and announced to the lists [email protected] and [email protected].
Alarms
Alarms at this time come to the Analytics team. We are working on being able to claim alarms in icinga.
Contact
You can contact the analytics team at: [email protected]
For developers
Codebase
The EventLogging python codebase can be found at https://gerrit.wikimedia.org/r/#/admin/projects/eventlogging
Architecture
See Analytics/EventLogging/Architecture for EventLogging architecture.
Performance
On this page you'll find information about EventLogging performance, such as load tests and benchmarks:
https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/Performance
Size limitation
There is a limit on the size of individual EventLogging events due to the underlying infrastructure (the limited size of URLs in Varnish's varnishncsa/varnishlog, as well as Wikimedia UDP packets). For the purpose of this limit, an "entry" is a /beacon request URL containing urlencoded JSON-stringified event data. Entries longer than 1014 bytes are truncated, and a truncated entry fails validation because the result is no longer valid JSON.
This should be taken into account when creating a schema. Avoid large schemas, as well as schema fields with long keys or values. Consider splitting up a very large schema, or replacing long fields with shorter ones.
To aid with testing the length of schemas, EventLogging's dev-server logs a warning into the console for each event that exceeds the size limit.
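The failure mode described above is easy to demonstrate: truncating serialized JSON mid-string leaves it unparseable. A sketch (the field names are hypothetical, and the real limit applies to the encoded URL bytes rather than the raw JSON string):

```python
import json

TRUNCATE_AT = 1014  # bytes; longer beacon entries are cut off

# An oversized event capsule (field names hypothetical).
entry = json.dumps({"event": {"comment": "x" * 2000}, "schema": "Edit"})
truncated = entry[:TRUNCATE_AT]

# The truncated entry is no longer well-formed JSON, so it fails to
# parse and therefore fails validation.
try:
    json.loads(truncated)
    parse_failed = False
except ValueError:
    parse_failed = True
```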
Monitoring
You can use various tools to monitor operational metrics, read more in this dedicated page:
https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/Monitoring
Testing
The EventLogging extension can be tested easily in Vagrant, as described on mediawiki.org at Extension:EventLogging. The server side of EventLogging (the consumer of events) does not have a Vagrant setup for testing, but can be tested in the Beta Cluster:
https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/TestingOnBetaCluster
How do I ...?
Visit the EventLogging how-to page. It contains dev-ops tips and tricks for EventLogging, such as deploying, troubleshooting, and restarting. Please add any step-by-step guides for EventLogging dev-ops tasks there.
https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/How_to
Administration. On call
Here's a list of routine tasks to do when oncall for EventLogging.
https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/Oncall
Data Quality Issues
Changes and Known Problems with Dataset
Date from | Date until | Task | Details
---|---|---|---
2020-06-18T20:00:00Z | 2020-06-19T22:00:00Z | Task T249261 | While attempting the first migration of legacy EventLogging streams to EventGate, Otto misconfigured the EventLogging extension's $wgEventLoggingServiceUri for non-group0 wikis, effectively causing SearchSatisfaction events to be disabled on all non-group0 wikis.
2019-09-23 | 2019-09-29 | Task T233718 | Many events emitted by MediaWiki are missing in Hive refined event database tables, including events from mediawiki_revision_create, mediawiki_page_create, etc. This was caused by a problem when importing data from Kafka via Camus, but at the time was only known to affect mediawiki_api_request and mediawiki_cirrussearch_request. Data for other mediawiki_* tables was not backfilled, and the raw data has since been deleted.
2017-11 | 2017-11 | Task T179625 | Canonical EventLogging data (parsed, validated, and stored in Kafka) did not match the EventCapsule schema. This was fixed, and data was transformed before insertion into MySQL for backwards compatibility. This helped standardize all event data so that it could be refined and made available in Hive.
2017-07-10 | 2017-07-12 | Task T170486 | Some data was not inserted into MySQL, but was backfilled for all schemas except page-create. During the backfill, bot events were also accidentally backfilled, resulting in extra data during this time.
2017-05-24 | onwards | Task T67508 | Do not accept data from bots on EventLogging unless the bot user agent matches "MediaWiki".
2017-03-29 | onwards | Task T153207 | Change userAgent field in event capsule.
2019-03-19 (14:00 to 22:00) | | Task T218831 | The EventLogging MySQL consumer was restarting for several hours, during which it was not able to insert any data into the database.
2019-04-01 | | Task T219842 | Kafka Jumbo outage from 22:00 to midnight. Data loss during those hours.
2019-09-12 | | https://phabricator.wikimedia.org/T228557 | Third-party domain data is not getting refined (so sites like w.upupming.site that run clones of our code do not send us their requests).
Incidents
Here's a list of all related incidents and their post-mortems. To add a new page to this generated list, use the "EventLogging/Incident_documentation" category.
For all the incidents (including ones not related to EventLogging) see: Incident documentation.
Limits of the eventlogging replication script
The log database is replicated to the EventLogging slave databases via a custom script called eventlogging_sync.sh (stored in operations/puppet, for the curious). While working on https://phabricator.wikimedia.org/T174815 we realized that the script was not able to replicate high-volume events in real time, showing a lot of replication lag (even days in the worst-case scenario). Please review the task for more info, or contact the Analytics team if you have more questions.
Ad blockers
Our client-side analytics instrumentation is subject to interference by any ad-blocking software the user has installed. See, for example, T240697/T251464, in which no-JS editor counts were skewed by unaccounted-for ad blockers. Ad blockers typically work by comparing outgoing requests to a list of disallowed URL domains, paths, or other patterns. For example, ad blockers using the popular EasyPrivacy block list block requests from page scripts to paths matching /beacon/event? (affecting legacy EventLogging) as well as to the domain intake-analytics.wikimedia.org (affecting requests to the new event platform intake service).
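The matching logic described above can be sketched with two illustrative rules in the spirit of those EasyPrivacy entries (not the actual block list):

```python
from urllib.parse import urlsplit

# Illustrative rules, not the actual EasyPrivacy list.
BLOCKED_DOMAINS = ("intake-analytics.wikimedia.org",)
BLOCKED_PATH_PREFIXES = ("/beacon/event",)

def is_blocked(url):
    """Would an ad blocker using the rules above block this request?"""
    parts = urlsplit(url)
    return (parts.hostname in BLOCKED_DOMAINS
            or any(parts.path.startswith(p) for p in BLOCKED_PATH_PREFIXES))
```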
The following ad blockers are known as of February 2021 to interfere with WMF analytics instrumentation when using default settings. (Note that most if not all ad blockers allow users to add block lists and custom rules that could result in WMF analytics requests being blocked.)
Name | Client platforms affected | Analytics intake systems affected | Notes
---|---|---|---
uBlock Origin | Web (desktop, mobile) | EventLogging, MEP | EasyPrivacy enabled by default
Brave (web browser) | Web (desktop, mobile) | MEP | Blocks requests to intake-analytics.wikimedia.org when using standard (default) privacy settings
See also
- Analytics/EventLogging/Outages
- Analytics/EventLogging/New pipeline
- Analytics/EventLogging/Sanitization vs Aggregation
- "EventLogging on Kafka". October 2015 lightning talk: slides, video