
Move analytics log from Varnish to HAProxy
Open, In Progress, Needs Triage · Public

Description

Now that we have moved all traffic to HAProxy, we should replace the current pipeline

  • Varnish logs -> Varnishkafka -> Kafka

to

  • HAProxy -> RSyslog -> Kafka

The format of the messages sent to Kafka should remain the same, as should their content, to avoid breaking existing pipelines and tools.

Roughly, the actions needed are:

  • Investigate whether the log format can be transformed to match the VarnishKafka one (check out the VarnishKafka configuration file for more information)
  • Initially, configure RSyslog to send the data to different Kafka topic(s)
  • Compare the content of the existing topic(s) with the new one(s) to spot any differences (see the comparison sketch after this list)
  • Finally, configure RSyslog to send data to the existing Kafka topic(s) and remove VarnishKafka
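
A minimal comparison sketch (plain Python, assuming both topics carry JSON-encoded webrequest records): diff one record from the existing topic against one from the new pipeline, ignoring fields that are expected to differ. The ignored field names are illustrative assumptions, not a final list.

import json

# Fields expected to legitimately differ between the two producers (assumption).
IGNORED_FIELDS = {"sequence", "dt"}

def diff_records(old_json: str, new_json: str) -> dict:
    """Return {field: (old_value, new_value)} for every mismatching field."""
    old, new = json.loads(old_json), json.loads(new_json)
    keys = (old.keys() | new.keys()) - IGNORED_FIELDS
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}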

To test the actual feasibility we could use the deployment-prep environment, as there's already a Kafka cluster there and varnishkafka is configured on the cache hosts (NB: investigate why the deployment-prep jumbo cluster isn't receiving any messages from varnishkafka).

Another viable option (thanks to @brouberol) could be to use the kafka-test cluster in production, configuring one production cp host to send from rsyslog to this cluster on a disposable topic, and comparing the result with the actual messages sent by VarnishKafka.

Event Timeline

(Older changes are hidden.)

Both approaches are feasible (even both at the same time, if we accept increasing the payload a little)...

Nice. From my side, increasing the payload size with meta should not be an issue (and AFAIK it should not significantly impact jumbo). It's a field we expect anyway. I'm implementation-agnostic, but very much curious to see how Benthos handles things.

If that's ok we can consider this settled...

Sounds good to me. Modulo maybe fine tuning naming and id formats.

"dt": "2024-02-07T12:04:41.512972244Z",

I think that this more precise timestamp would be parseable by our ingestion system just fine, but we should verify. If we can get it this precise, I suppose... why not? I see that the existing varnish dt has only second precision, which doesn't seem very precise, especially for webrequest. Perhaps we should take this opportunity to increase the precision a bit. If we can, we should strive for at least millisecond precision. Not a blocker for this task, though.
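
If we keep the extra precision at the producer but want at-least-millisecond granularity downstream, truncation is trivial; a small sketch, assuming RFC 3339 strings like the example above:

from datetime import datetime, timezone

def to_millis(dt: str) -> str:
    # "2024-02-07T12:04:41.512972244Z" -> "2024-02-07T12:04:41.512Z"
    ts = datetime.fromisoformat(dt.rstrip("Z")[:26]).replace(tzinfo=timezone.utc)
    return ts.strftime("%Y-%m-%dT%H:%M:%S.") + f"{ts.microsecond // 1000:03d}Z"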

Both request_id and id can be generated by Benthos

Nice! request_id is usually populated from the X-Request-Id header for tracing purposes. Can we set that from haproxy?

stream: "webrequest_text", # value set by Benthos

TBD on final stream name in T314956: [Event Platform] Declare webrequest as an Event Platform stream, but the currently suggested one is webrequest.frontend. @gmodena, the idea there is to group all webrequest topics into the same stream, by setting topics manually in stream config. Gobblin will ingest the topics configured in stream config.

We might want to add a new field to indicate the cache cluster / webrequest source the request is from. The webrequest refine job will pull the Hive webrequest_source partition from the topic name, but I think it might be best to have this info in the event data too. (I suppose the Refine job could do that, but it's probably more future-proof to let the haproxy producer set the value explicitly.)
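
For illustration only (not the actual Refine code), the two options differ roughly like this; webrequest_source is the field name already used for the Hive partition:

def source_from_topic(topic: str) -> str:
    # e.g. "webrequest_text" / "webrequest_upload" -> "text" / "upload"
    return topic.rsplit("_", 1)[-1]

def source_from_event(event: dict, topic: str) -> str:
    # Prefer the value set explicitly by the producer; fall back to the topic name.
    return event.get("webrequest_source") or source_from_topic(topic)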

TBD on final stream name in T314956: [Event Platform] Declare webrequest as an Event Platform stream, but the currently suggested one is webrequest.frontend

Open question: do we want webrequest.frontend (or whatever we settle on) to be a versioned stream? https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Configuration#Stream_versioning

That's a config setting that mostly concerns the EP side of tooling (kafka topics would not necessarily have to follow the same versioning scheme).

the currently suggested one is webrequest.frontend. @gmodena, the idea there is to group all webrequest topics into the same stream, by setting topics manually in stream config. Gobblin will ingest the topics configured in stream config.

We discussed this OTR in slack, and this should be feasible with the current gobblin config.

We might want to add a new field to indicate the cache cluster / webrequest source the request is from. The webrequest refine job will pull the Hive webrequest_source partition from the topic name, but I think it might be best to have this info in the event data too. (I suppose the Refine job could do that, but it's probably more future-proof to let the haproxy producer set the value explicitly.)

Good point @Ottomata
I'd also prefer this to be set upstream by the haproxy producer. It would be useful if/when accessing the kafka topics directly (instead of the refined data in HDFS).

cc / @Fabfur

Open question: do we want webrequest.frontend (or whatever we settle on) to be a versioned stream? https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Configuration#Stream_versioning

xpost for visibility: https://phabricator.wikimedia.org/T314956#9539679

I tagged the current WIP stream as webrequest.frontend.rc0. The expected JSON payload would look something like this:

{
  "meta": {
      "dt": "2023-11-23T16:04:17Z",
      "stream": "webrequest.frontend.rc0",  # set by Benthos
      "domain": "en.wikipedia.org",
      "request_id": "<request-uuid>",
      "id": "<event-uuid>"
   },
   ...
}

This versioning is useful for experimentation, and for (eventually) introducing breaking changes in GA event schemas. The drawback is that we need to keep track of version in multiple places:

  • mediawiki-config: where the stream config is declared
  • puppet: where the haproxy producer gets the stream value to set.

Since webrequest.frontend is not strictly an Event Platform (produced) stream, it might make more sense to drop the version suffix.

the currently suggested one is webrequest.frontend. @gmodena, the idea there is to group all webrequest topics into the same stream, by setting topics manually in stream config. Gobblin will ingest the topics configured in stream config.

We discussed this OTR in slack, and this should be feasible with the current gobblin config.

We might want to add a new field to indicate the cache cluster / webrequest source the request is from. The webrequest refine job will pull the Hive webrequest_source partition from the topic name, but I think it might be best to have this info in the event data too. (I suppose the Refine job could do that, but it's probably more future-proof to let the haproxy producer set the value explicitly.)

Good point @Ottomata
I'd also prefer this to be set upstream by the haproxy producer. It would be useful if/when accessing the kafka topics directly (instead of the refined data in HDFS).

cc / @Fabfur

Ok, I've added a webrequest_source field to the produced output that can take the values text|upload (set by puppet, the same way we set the kafka topic).

Update: Benthos is installed on cp4037 and, after some minor fixes, is finally ready to ingest, process and send HAProxy logs to the temporary Kafka topic (webrequest_text_test) on the jumbo cluster. Considering that a very large amount of data will be processed and sent once we switch on the new log destination in HAProxy and repool the host, I prefer to start by "manually" sending single log lines to the Benthos socket and checking that everything is correct on the Kafka side.
This means that the first messages on the topic will be "artificial" and can be safely discarded.

I'll let you know the results of this test and when we can actually start processing "real" logs (a topic cleanup may be necessary to avoid ingesting the "fake" logs into the analytics pipeline).

Update: yesterday we modified the HAProxy log destination to send logs to Benthos and repooled cp4037 for a very short time to capture "real" traffic. The Benthos configuration sent the logs to two separate Kafka topics: webrequest_text_test for "regular" logs and webrequest_text_test_error for errored (unparsable) ones.

NOTE: cp4037.ulsfo.wmnet, the server we're using for testing purposes, is currently depooled and downtimed, and puppet is disabled on that specific host.

While most of the logs have been parsed correctly, we noticed some minor issues that need to be addressed, mainly:

  • The metadata field in the benthos output must be renamed from debug_metadata to meta. This has already been fixed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009722 and is waiting for merge
  • Some message types cannot be parsed correctly, two examples:
    • <REDACTED IP>:52088 [07/Mar/2024:14:58:14.151] tls/2: SSL handshake failure
    • <REDACTED IP>:59665 - [07/Mar/2024:14:58:14.237] http http/<NOSRV> 0/-1/-1/-1/0 301 164 - - LR-- 2059/9/0/0/0 0/0 {www.wikipedia.org} {int-tls} GET / HTTP/1.1 cee4b720-1709-45f3-913a-ba0dc6604454

In the first case the reason is obvious: such errors cannot really match our parsing pattern.

The second case required more investigation, and after some trials I think I've nailed the reason: these requests (redirects) hit the http frontend in HAProxy, which does not have the same log format as the tls frontend, simply because the http frontend captures headers that differ considerably in number and type.

So IMHO we have two different ways to manage this:

  1. Ignore entries that hit the http frontend (used only to redirect to https), e.g. by specifying a different log target in HAProxy. Would we lose important information this way? Are these requests passed down and collected by Varnish?
  2. Edit the HAProxy configuration for the http frontend to capture the same headers (or use dummy placeholders where not applicable) and keep the log-format consistent across the two frontends. This would allow Benthos to parse logs from both frontends. What problems could arise from capturing and logging headers different from those on the Varnish side?

I suggest considering the second approach; a possible CR would be https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009724
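
To make the routing idea concrete, a rough sketch (the real pipeline uses a Benthos regex/grok processor; the expected field count below is a placeholder assumption, not the real value):

import re

TLS_CAPTURED_FIELDS = 10  # placeholder: number of headers captured on the tls frontend

def route(line: str,
          ok_topic: str = "webrequest_text_test",
          dlq_topic: str = "webrequest_text_test_error") -> str:
    # HAProxy logs captured request headers as one pipe-separated {...} block.
    m = re.search(r"\{([^}]*)\}", line)
    fields = m.group(1).split("|") if m else []
    return ok_topic if len(fields) == TLS_CAPTURED_FIELDS else dlq_topic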

Mentioned in SAL (#wikimedia-operations) [2024-03-08T15:32:09Z] <fabfur> repooling cp4037 for this weekend, all log-format changes are reverted (T351117)

Note: I've repooled cp4037 for the next few days, as I'll be busy with the SRE Summit and won't be able to work on this.

All modifications to HAProxy logging have been reverted; we will re-enable them by reverting https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009769 and keep working on the fixes in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009722 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009724

I've fixed some errors in the Benthos configuration (the metadata field name should now be correct). I've repooled cp4037 for about one minute (https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2024-03-18) to collect some logs.

ATM the logs sent to the dead-letter queue (webrequest_text_test_error) are all messages that can be safely ignored from the Analytics perspective (though we could use some metrics about those; let's see).

@gmodena you should have some more data to play with now, while I work on the performance optimization and on Benthos internal metrics...

Change 1012656 had a related patch set uploaded (by Gmodena; author: Gmodena):

[analytics/refinery@master] Add webrequest_frontent raw schema.

https://gerrit.wikimedia.org/r/1012656

Change #983905 merged by jenkins-bot:

[operations/mediawiki-config@master] Add webrequest.frontend.rc0 stream

https://gerrit.wikimedia.org/r/983905

Mentioned in SAL (#wikimedia-operations) [2024-03-27T08:16:47Z] <hashar@deploy1002> Started scap: Backport for [[gerrit:983905|Add webrequest.frontend.rc0 stream (T314956 T351117)]]

Mentioned in SAL (#wikimedia-operations) [2024-03-27T08:20:33Z] <hashar@deploy1002> otto and hashar: Backport for [[gerrit:983905|Add webrequest.frontend.rc0 stream (T314956 T351117)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-03-27T08:37:47Z] <hashar@deploy1002> Finished scap: Backport for [[gerrit:983905|Add webrequest.frontend.rc0 stream (T314956 T351117)]] (duration: 20m 59s)

Change #983898 merged by jenkins-bot:

[schemas/event/primary@master] development: add webrequest schema

https://gerrit.wikimedia.org/r/983898

Change #1015260 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/mediawiki-config@master] webrequest: disable canary events.

https://gerrit.wikimedia.org/r/1015260

Change #1015260 merged by jenkins-bot:

[operations/mediawiki-config@master] webrequest: disable canary events.

https://gerrit.wikimedia.org/r/1015260

Mentioned in SAL (#wikimedia-operations) [2024-04-02T07:12:44Z] <hashar@deploy1002> Started scap: Backport for [[gerrit:1015260|webrequest: disable canary events. (T314956 T351117)]]

Mentioned in SAL (#wikimedia-operations) [2024-04-02T07:28:20Z] <hashar@deploy1002> gmodena and hashar: Backport for [[gerrit:1015260|webrequest: disable canary events. (T314956 T351117)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-04-02T07:46:48Z] <hashar@deploy1002> Finished scap: Backport for [[gerrit:1015260|webrequest: disable canary events. (T314956 T351117)]] (duration: 34m 03s)

Change #1017041 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/puppet@production] analytics: refinery: add webrequest_frontend timer

https://gerrit.wikimedia.org/r/1017041

@gmodena you should have some more data to play with now, while I work on the performance optimization and on Benthos internal metrics...

f/up to some threads we had in irc/slack.

The composite webrequest.frontend.rc0 stream is now declared in mw stream config.
All events currently being produced by benthos to kafka (upload/text) validate against the jsonschema.

I set up an end-to-end pipeline (gobblin + airflow) to load data from kafka and generate the downstream (post-processed) webrequest dataset (and the error/data-loss tables). It implements the same logic that currently processes the varnish feed. The new dataset will be available in superset as gmodena.webrequest (you can already play with a small sample). To ease analysis, it is exposed with the same schema as the current (varnish-based) table.

I have some patches in flight that, once deployed, will automate ingestion and ETL of the kafka topics.

Next steps: now that we are starting to collect more logs, we can start comparing current / new webrequest records.

Change #1017913 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] haproxy: remove timestamp from unique-id-format

https://gerrit.wikimedia.org/r/1017913

Change #1017913 merged by Fabfur:

[operations/puppet@production] haproxy: remove timestamp from unique-id-format

https://gerrit.wikimedia.org/r/1017913

Change #1018188 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] haproxy: increase capture header length for UA

https://gerrit.wikimedia.org/r/1018188

Change #1018188 merged by Fabfur:

[operations/puppet@production] haproxy: increase capture header length for UA

https://gerrit.wikimedia.org/r/1018188

Change #983926 merged by Gmodena:

[analytics/refinery@master] Add gobblin job webrequest_frontend_rc0

https://gerrit.wikimedia.org/r/983926

Change #1017041 merged by Btullis:

[operations/puppet@production] analytics: refinery: add webrequest_frontend timer

https://gerrit.wikimedia.org/r/1017041

Change #1019867 had a related patch set uploaded (by Gmodena; author: Gmodena):

[analytics/refinery/source@master] refinery-job: add webrequest instrumentation.

https://gerrit.wikimedia.org/r/1019867

Next steps: now that we are starting to collect more logs, we can start comparing current / new webrequest records.

An update after extensive rounds of validation that @Fabfur and I did in the past two weeks. @Fabfur holler if I'm missing something.

The haproxy- and varnishkafka-generated logs seem to be in agreement both in terms of volume (requests by status code and hostname) and in terms of field (discrete value) overlap. All ulsfo traffic should now be served through haproxy.

A summary of our validation plan and findings can be found in this HAProxy log validation plan.

The only difference compared to the varnishkafka logs is the sequence number. With haproxy we observe the following behaviour:

  • upon haproxy (config) reloads, a new process is spawned that shares the request counter. As a result we log a bunch of requests with duplicated sequence numbers.
  • upon haproxy restarts, we reset (and log) the counter starting from 0

There's an internal Slack thread about how this could impact ops and how to address it (cc / @xcollazo @JAllemandou ).

Pipeline status
  • The new Kafka topics are ingested at 10-minute intervals by Gobblin. Raw data is available in Hive at wmf_raw.webrequest_frontend_rc0.
  • A refine pipeline is currently running on a dev environment and producing post-processed data in gmodena.webrequest. Data loss support tables are available in the gmodena namespace. Note that this pipeline is orchestrated from a dev Airflow instance, and data might lag.
  • Traffic volume metrics (requests by status code and host) are available in gmodena.haproxy_requests. Note that they are pre-computed in incremental batches and may lag a few hours behind live traffic.
  • We added instrumentation to webrequest (via the DQ framework) to try and spot any regressions post-migration.
Next steps

I believe the next steps would be agreeing on acceptance criteria (discussed with @Ahoelzl) and coming up with a rollout timeline.

From the data side, before switching off varnishkafka, we'll need to:

  • Plan the webrequest_raw sources switch in the production refine pipeline.
  • Bump the Event Platform stream name from webrequest_frontend_rc0 to webrequest_frontend, and promote its event schema to the primary registry (this should not impact the timeline).

I agree with @gmodena on all topics, more specifically:

  • About the sequence issue, that's the most plausible hypothesis. We could append (or prepend) other pieces of information to the sequence number (like the haproxy process id) to avoid duplicates, but we couldn't guarantee a monotonic increase (or any increase at all) in that case. I suggest keeping the current approach for the moment and possibly reworking it later.
  • The X-Analytics-TLS header keyx parameter should now be uppercase, like in Varnish, with change https://gerrit.wikimedia.org/r/c/operations/puppet/+/1021412

About next steps, we have a couple of options here:

  • We should start sending Benthos logs to the actual "production" kafka topic(s), same as VarnishKafka
  • We should turn off VarnishKafka on those cp hosts.

Ideally the two things could be done at the same time, especially if we don't have a way to differentiate downstream which logs come from Benthos and which come from VarnishKafka. If we can differentiate, we could first send the logs to the production topic and switch off VarnishKafka at a later time (during this interval we'd have "duplicate" logs on the topic).

We could append (or prepend) other pieces of information to the sequence number (like the haproxy process id) to avoid duplicates

Instead of prepending this to the number, what about just adding this as a new field? We are currently using hostname + sequence for uniqueness, why not hostname + pid + sequence?
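
A tiny sketch of that uniqueness key, assuming each record carries hostname, sequence and the new pid field (named haproxy_pid further down in this task):

def dedup_key(event: dict) -> tuple:
    # hostname + pid + sequence, as proposed above
    return (event["hostname"], event["haproxy_pid"], event["sequence"])

def dedupe(events):
    seen, out = set(), []
    for ev in events:
        k = dedup_key(ev)
        if k not in seen:
            seen.add(k)
            out.append(ev)
    return out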

About the sequence issue, that's the most plausible hypothesis. We could append (or prepend) other pieces of information to the sequence number (like the haproxy process id) to avoid duplicates, but we couldn't guarantee a monotonic increase (or any increase at all) in that case. I suggest keeping the current approach for the moment and possibly reworking it later.

Ack, and +1 to your proposal. IMHO it's easier to be resilient to reloads than to work around non-monotonicity. Do you maybe have a feel for how often haproxy reloads are expected to happen once in prod? Could we assume they are sporadic events?

The X-Analytics-TLS header keyx parameter should now be uppercase, like in Varnish, with change https://gerrit.wikimedia.org/r/c/operations/puppet/+/1021412

I'm seeing uppercase keyx in both topics.

We should start sending Benthos logs to the actual "production" kafka topic(s), same as VarnishKafka

Do you mean reusing the current varnishkafka webrequest topics, or creating new production topics for webrequest_text_test and webrequest_upload_test (changing the _test suffix) ?

I'm afraid mixing varnishkafka and benthos payloads would break ingestion pipelines, since old/new events have different schemas. We could reuse the current topics, but we'd have to drain them first.

About the sequence issue, that's the most plausible hypothesis. We could append (or prepend) other pieces of information to the sequence number (like the haproxy process id) to avoid duplicates, but we couldn't guarantee a monotonic increase (or any increase at all) in that case. I suggest keeping the current approach for the moment and possibly reworking it later.

Ack, and +1 to your proposal. IMHO it's easier to be resilient to reloads than to work around non-monotonicity. Do you maybe have a feel for how often haproxy reloads are expected to happen once in prod? Could we assume they are sporadic events?

HAProxy reloads could be frequent, considering that every time someone edits any configuration bit a reload is triggered. Also, @Ottomata's idea of adding another (separate) field to use as a discriminator for uniqueness is perfectly doable on our side.

The X-Analytics-TLS header keyx parameter should now be uppercase, like in Varnish, with change https://gerrit.wikimedia.org/r/c/operations/puppet/+/1021412

I'm seeing uppercase keyx in both topics.

We should start sending Benthos logs to the actual "production" kafka topic(s), same as VarnishKafka

Do you mean reusing the current varnishkafka webrequest topics, or creating new production topics for webrequest_text_test and webrequest_upload_test (changing the _test suffix) ?

I'm afraid mixing varnishkafka and benthos payloads would break ingestion pipelines, since old/new events have different schemas. We could reuse the current topics, but we'd have to drain them first.

We can do both; for us it's just a matter of changing a string in puppet. I think the decision is more on your side: choose the easiest/best option for you and we'll implement it!

I think @Ottomata 's idea is good: having another column makes it easy to keep the "monotonic" values, while still having a de-duplication key with the new field.

I think @Ottomata 's idea is good: having another column makes it easy to keep the "monotonic" values, while still having a de-duplication key with the new field.

+1

I'm afraid mixing varnishkafka and benthos payloads would break ingestion pipelines, since old/new events have different schemas. We could reuse the current topics, but we'd have to drain them first.

We can do both; for us it's just a matter of changing a string in puppet. I think the decision is more on your side: choose the easiest/best option for you and we'll implement it!

Terrific!
@brouberol would it be ok to 2x webrequest data in jumbo (all topics have 7 days retention)? This would be temporary (no ETA yet, though), until we fully migrate to the new haproxy + benthos producers.

The haproxy_id field has been added to messages.

(PS. I'll keep this open as umbrella task)

Change #1021902 had a related patch set uploaded (by Gmodena; author: Gmodena):

[schemas/event/primary@master] development: webrequest: add haproxy_pid field

https://gerrit.wikimedia.org/r/1021902

Change #1021902 merged by Gmodena:

[schemas/event/primary@master] development: webrequest: add haproxy_pid field

https://gerrit.wikimedia.org/r/1021902

The haproxy_id field has been added to messages.

This seems to have helped a lot with dedup, at least on the samples we collected so far.
@xcollazo @JAllemandou there is a sample in gmodena.webrequest_sequence_stats if you'd like to take a look.

As for next steps;

Schema finalization.

  • there are a couple of CRs pending (linked to this phab) and I'd like to have a second pass on the event schema naming conventions (cc / @Fabfur). We might want to drop webrequest_source since we don't currently use it in ETL (it's inferred from the HDFS path, not the schema).
  • we'll need to change the stream name from webrequest_frontend_rc0 to webrequest_frontend in EventStreamConfig. Possibly in puppet too.
  • we need names for the new Kafka topics. @Fabfur would you have any preference? Historically Event Platform topics are prefixed with a datacenter id.

I was thinking about the following rollout steps:

  1. Deploy the HAProxy Benthos producer across the entire fleet.
  2. Begin shipping logs to the prod Kafka topics once ready. Update ingestion and analytics pipelines once the topics are deemed ready for downstream users.
  3. Run the Benthos and VarnishKafka producers in parallel for 7 days.
  4. Refine HAProxy data for 7 days, allocating a maintenance window to update the webrequest feed (tables + dag). HAProxy logs become the source of truth for wmf.webrequest. Notify downstream consumers and await bug reports. Roll back to the varnishkafka feed if critical issues arise.
  5. If metrics stabilize after X days, disable varnishkafka.

@Fabfur thoughts on this high-level plan? How long can varnishkafka and benthos run together across the fleet?
@brouberol would it be ok to 2x webrequest data (varnishkafka + benthos producers) in kafka-jumbo (keeping 7 days retention)?

Additional data platform-specific implementation steps can be discussed separately.

  • there are a couple of CRs pending (linked to this phab) and I'd like to have a second pass on the event schema naming conventions (cc / @Fabfur). We might want to drop webrequest_source since we don't currently use it in ETL (it's inferred from the HDFS path, not the schema).

No problem here; for us it's just a matter of removing a line from the Benthos configuration. Let me know if I can proceed!

  • we need names for the new Kafka topics. @Fabfur would you have any preference? Historically Event Platform topics are prefixed with a datacenter id.

Currently IIUC we just use webrequest_text and webrequest_upload. Do you suggest using something like ulsfo_webrequest_text instead? This shouldn't be an issue for us, anyway.
I also suggest keeping the DLQ as <topic_name>_errors, as it has been an invaluable help for debugging.

I was thinking about the following rollout steps:

  1. Deploy the HAProxy Benthos producer across the entire fleet.
  2. Begin shipping logs to the prod Kafka topics once ready. Update ingestion and analytics pipelines once the topics are deemed ready for downstream users.
  3. Run the Benthos and VarnishKafka producers in parallel for 7 days.
  4. Refine HAProxy data for 7 days, allocating a maintenance window to update the webrequest feed (tables + dag). HAProxy logs become the source of truth for wmf.webrequest. Notify downstream consumers and await bug reports. Roll back to the varnishkafka feed if critical issues arise.
  5. If metrics stabilize after X days, disable varnishkafka.

@Fabfur thoughts on this high-level plan? How long can varnishkafka and benthos run together across the fleet?

I like the overall idea, but I'd prefer to proceed DC-by-DC, switching topics and shutting down VarnishKafka once we are sure about the correctness of the data. I'm afraid that having two pieces of software producing (and sending, and storing) the "same" data on 96 hosts (and soon also MAGRU) could be a bit expensive for us in terms of bandwidth...

An update: we noticed some messages being dropped at the operating-system level during high-message spikes. This means that HAProxy tried to send logs to Benthos, but the UDP buffer was full and the network stack discarded those messages before Benthos could read them. Unfortunately this isn't exactly easy to monitor, but here are some considerations:

  • The ulsfo DC, where we are testing this, is the lowest-traffic one. It is reasonable to expect that when Benthos is deployed in other datacenters this kind of drop could become even more frequent.
  • We are considering different solutions (thanks to @Vgutierrez, who is helping me with this) to mitigate this issue.
  • I don't think this blocks the overall activity, but it is something we need to address right now to avoid worse performance in the future (a rough monitoring sketch follows below).
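
A rough monitoring sketch, as an assumption about how one could watch for this (not the mitigation being evaluated): on Linux the kernel counts these drops in the UDP RcvbufErrors counter, which can be read from /proc/net/snmp:

def udp_rcvbuf_errors(path: str = "/proc/net/snmp") -> int:
    # The first "Udp:" line lists column names; the second holds the values.
    with open(path) as f:
        udp_lines = [line.split() for line in f if line.startswith("Udp:")]
    header, values = udp_lines[0], udp_lines[1]
    return int(values[header.index("RcvbufErrors")])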

Do you suggest using something like ulsfo_webrequest_text instead?

The current naming 'convention' is to use '.' as the concept separator in topic names. So: ulsfo.webrequest_text would be fine.

also suggest keeping the DLQ as <topic_name>_errors

Can we do <topic_name>.error, to follow our other error topic naming convention? E.g. ulsfo.webrequest_text.error ?

Actually, the convention we use isn't <topic_name>.error, but <producer_name>.error, because in this case the subject of the data in the topic isn't about webrequests, but about a problem with the producer. But, naming is hard, and e.g. ulsfo.webrequest_text.error is fine. :)

  • there are a couple of CRs pending (linked to this phab) and I'd like to have a second pass on the event schema naming conventions (cc / @Fabfur). We might want to drop webrequest_source since we don't currently use it in ETL (it's inferred from the HDFS path, not the schema).

No problem here; for us it's just a matter of removing a line from the Benthos configuration. Let me know if I can proceed!

Ok, I'll f/up here on phab with a wishlist of changes, so that they could be addressed in one go before moving to prod topics.

  • we need names for the new Kafka topics. @Fabfur would you have any preference? Historically Event Platform topics are prefixed with a datacenter id.

Currently IIUC we just use webrequest_text and webrequest_upload. Do you suggest using something like ulsfo_webrequest_text instead? This shouldn't be an issue for us, anyway.
I also suggest keeping the DLQ as <topic_name>_errors, as it has been an invaluable help for debugging.

+1 for error topics. See the comment from @Ottomata below for naming.

I like the overall idea, but I'd prefer to proceed DC-by-DC, switching topics and shutting down VarnishKafka once we are sure about the correctness of the data. I'm afraid that having two pieces of software producing (and sending, and storing) the "same" data on 96 hosts (and soon also MAGRU) could be a bit expensive for us in terms of bandwidth...

Makes sense. This would require some work on our end to generate webrequest data from two "raw" sources at once, but I think as long as we can filter on dc / hostnames, we should manage. Let me take a better look at how this ETL is setup.

In HDFS, data is partitioned at hourly level. Could we assume that all varnishkafka instances in a given DC will be turned off within the same hour?

An update: we noticed some messages being dropped at the operating-system level during high-message spikes. This means that HAProxy tried to send logs to Benthos, but the UDP buffer was full and the network stack discarded those messages before Benthos could read them. Unfortunately this isn't exactly easy to monitor, but here are some considerations:

Ack. Thanks for the heads up.

I like the overall idea, but I'd prefer to proceed DC-by-DC, switching topics and shutting down VarnishKafka once we are sure about the correctness of the data. I'm afraid that having two pieces of software producing (and sending, and storing) the "same" data on 96 hosts (and soon also MAGRU) could be a bit expensive for us in terms of bandwidth...

Makes sense. This would require some work on our end to generate webrequest data from two "raw" sources at once, but I think as long as we can filter on dc / hostnames, we should manage. Let me take a better look at how this ETL is setup.

One thought: one thing we would lose by turning off varnishkafka DC-by-DC is the capability of "easy" rollbacks, for example pointing the ETL at the old varnishkafka logs in HDFS and regenerating webrequest data.

Right now, we don't partition wmf.webrequest data by DC; it's only partitioned by hour and request source (upload, text). This means that varnishkafka and haproxy logs will end up in the same wmf.webrequest partition, so rollbacks would have to be a bit more surgical. If we turn off varnishkafka in one DC and later on detect issues in another, we won't be able to recover or analyze past data. I'm not sure about the likelihood of this being an issue, though.

I'm afraid that having two pieces of software producing (and sending, and storing) the "same" data on 96 hosts (and soon also MAGRU) could be a bit expensive for us in terms of bandwidth...

Storing twice the data (Kafka, HDFS) should be okay (within reason). I do appreciate that the bandwidth volumes (web -> Kafka -> HDFS) are not trivial. But how much of a blocker would it be to keep the Varnishkafka lights on until the whole migration is completed? Could we compress the timeline and keep the Benthos/HAProxy and Varnishkafka producers running in parallel?

@Fabfur f/up from our chat earlier; these would be the pending config bits that we'll need to finalize when moving to prod topics:

  1. adopt topic names that follow EP conventions: <dc>.<topic_name> and <dc>.<topic_name>.error. E.g. ulsfo.webrequest_text and ulsfo.webrequest_text.error.
  2. remove the webrequest_source field from the event schema / payload.
  3. rename haproxy_pid to server_pid
  4. Change stream name (ESC, benthos producer) from webrequest_frontend.rc0 to webrequest_frontend.v1 (suffix versions are an EP convention).

These changes will require puppet changes, schema repo and Hive DDL changes, and a new deployment of EventStreamConfig. I'll start prepping work-in-progress CRs to the EP and DE code bases.

Change #1026498 had a related patch set uploaded (by Gmodena; author: Gmodena):

[schemas/event/primary@master] primary: add webrequest schema

https://gerrit.wikimedia.org/r/1026498

Change #1026506 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/mediawiki-config@master] EventStreamConfig: Add webrequest.frontend.v1.

https://gerrit.wikimedia.org/r/1026506

adopt topic names that follow EP conventions: <dc>.<topic_name>

I'm sorry for not thinking about this earlier. There is a bit of a design flaw in the use of data center as a topic prefix, and really, for topics that are never mirrored to other Kafka clusters, there is no need for topic prefixes at all.

I just added documentation about this here:
https://wikitech.wikimedia.org/wiki/Kafka#Data_center_topic_prefixing_design_flaw

Given that, and the ever expanding list of data centers, and the fact that webrequest is the only stream we have that is produced to from non main data centers, I think we should not use topic prefixing for webrequest.

All producers should use the same topic name, independent of which data center they are in.

Discussion about this here in slack.

adopt topic names that follow EP conventions: <dc>.<topic_name>

I'm sorry for not thinking about this earlier. There is a bit of a design flaw in the use of data center as a topic prefix, and really, for topics that are never mirrored to other Kafka clusters, there is no need for topic prefixes at all.

I just added documentation about this here:
https://wikitech.wikimedia.org/wiki/Kafka#Data_center_topic_prefixing_design_flaw

Given that, and the ever expanding list of data centers, and the fact that webrequest is the only stream we have that is produced to from non main data centers, I think we should not use topic prefixing for webrequest.

All producers should use the same topic name, independent of which data center they are in.

Thanks for clarifying @Ottomata.

@Ottomata @Fabfur If we remove prefixing, there is a potential clash between varnishkafka and benthos topics.
How about we name the production Haproxy/benthos topics as follows?

  • webrequest_frontend_text
  • webrequest_frontend_text.error
  • webrequest_frontend_upload
  • webrequest_frontend_upload.error

adopt topic names that follow EP conventions: <dc>.<topic_name>

I'm sorry for not thinking about this earlier. There is a bit of a design flaw in the use of data center as a topic prefix, and really, for topics that are never mirrored to other Kafka clusters, there is no need for topic prefixes at all.

I just added documentation about this here:
https://wikitech.wikimedia.org/wiki/Kafka#Data_center_topic_prefixing_design_flaw

Given that, and the ever expanding list of data centers, and the fact that webrequest is the only stream we have that is produced to from non main data centers, I think we should not use topic prefixing for webrequest.

All producers should use the same topic name, independent of which data center they are in.

Thanks for clarifying @Ottomata.

@Ottomata @Fabfur If we remove prefixing, there is a potential clash between varnishkafka and benthos topics.
How about we name the production Haproxy/benthos topics as follows?

  • webrequest_frontent_text
  • webrequest_frontent_text.error
  • webrequest_frontent_upload
  • webrequest_frontent_upload.error

No problem for us to rename these topics, even with or without "variable" part...

adopt topic names that follow EP conventions: <dc>.<topic_name>

I'm sorry for not thinking about this earlier. There is a bit of a design flaw in the use of data center as a topic prefix, and really, for topics that are never mirrored to other Kafka clusters, there is no need for topic prefixes at all.

I just added documentation about this here:
https://wikitech.wikimedia.org/wiki/Kafka#Data_center_topic_prefixing_design_flaw

Given that, and the ever expanding list of data centers, and the fact that webrequest is the only stream we have that is produced to from non main data centers, I think we should not use topic prefixing for webrequest.

All producers should use the same topic name, independent of which data center they are in.

Thanks for clarifying @Ottomata.

@Ottomata @Fabfur If we remove prefixing, there is a potential clash between varnishkafka and benthos topics.
How about we name the production Haproxy/benthos topics as follows?

  • webrequest_frontent_text
  • webrequest_frontent_text.error
  • webrequest_frontent_upload
  • webrequest_frontent_upload.error

No problem for us to rename these topics, even with or without "variable" part...

ack - and thanks!
Oof s/frontent/frontend. Spelling that word is not my forte.

Change #1031762 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] cache:benthos: switch to production topic names

https://gerrit.wikimedia.org/r/1031762

Change #1036272 had a related patch set uploaded (by Gmodena; author: Gmodena):

[schemas/event/primary@master] webrequest: add error schema.

https://gerrit.wikimedia.org/r/1036272

Change #1036299 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/mediawiki-config@master] EventStreamConfig: Add webrequest.frontend.error

https://gerrit.wikimedia.org/r/1036299

Change #1037599 had a related patch set uploaded (by Gmodena; author: Gmodena):

[analytics/refinery@master] gobblin: add webrequest_error pull job.

https://gerrit.wikimedia.org/r/1037599

The haproxy / benthos feed is now available in raw form under wmf_staging.webrequest_frontend_rc0 and post-processed in wmf_staging.webrequest.

Ingestion and processing are coordinated by Gobblin and Airflow jobs deployed to prod systems. However, the datasets themselves are still considered experimental / WIP.

Change #1019867 merged by jenkins-bot:

[analytics/refinery/source@master] refinery-job: add webrequest instrumentation.

https://gerrit.wikimedia.org/r/1019867

Change #1031762 abandoned by Fabfur:

[operations/puppet@production] cache:benthos: switch to production topic names

Reason:

Superseded by T370668

https://gerrit.wikimedia.org/r/1031762

Change #1026506 abandoned by Gmodena:

[operations/mediawiki-config@master] EventStreamConfig: Add webrequest.frontend.v1.

Reason:

analytics feed for haproxy development has halted

https://gerrit.wikimedia.org/r/1026506

Change #1012656 abandoned by Gmodena:

[analytics/refinery@master] hql: webrequest: add webrequest_frontend.

Reason:

analytics feed for haproxy development has halted

https://gerrit.wikimedia.org/r/1012656

Change #1026498 abandoned by Gmodena:

[schemas/event/primary@master] primary: add webrequest schema

Reason:

analytics feed for haproxy development has halted

https://gerrit.wikimedia.org/r/1026498

Change #1036299 abandoned by Gmodena:

[operations/mediawiki-config@master] EventStreamConfig: Add webrequest.frontend.error

Reason:

analytics feed for haproxy development has halted

https://gerrit.wikimedia.org/r/1036299

Change #1036272 abandoned by Gmodena:

[schemas/event/primary@master] webrequest: add error schema.

Reason:

analytics feed for haproxy development has halted

https://gerrit.wikimedia.org/r/1036272