
High volume mediawiki analytics events camus import is lagging
Closed, Resolved (Public)

Description

Last weekend, there was a heap issue with the Camus mapper JVMs that caused the mediawiki_events Camus job to fail for a while. Luca fixed this, but it seems the cirrussearch-request topics never caught back up.

Andrew noticed that the mediawiki_events Camus job was poorly configured since the addition of more and higher-volume mediawiki event topics. He has created a new Camus job, 'mediawiki_analytics_events', for these topics: mediawiki.cirrussearch-request and mediawiki.api-request.

After running overnight, the new job is working, but it isn't catching up quickly enough.

Andrew will try to get the current job, as is, to catch up by running it manually and giving it more time to run.

We should also add some partitions to these topics to allow more camus mappers to run in parallel.

Event Timeline

Mentioned in SAL (#wikimedia-analytics) [2019-09-24T14:46:13Z] <ottomata> temporarily disabled camus-mediawiki_analytics_events systemd timer on an-coord1001 - T233718

Increased partitions to 12 for api-request and cirrussearch-request topics:

kafka topics --alter --topic codfw.mediawiki.cirrussearch-request --partitions 12
kafka topics --alter --topic eqiad.mediawiki.cirrussearch-request --partitions 12

kafka topics --alter --topic codfw.mediawiki.api-request --partitions 12
kafka topics --alter --topic eqiad.mediawiki.api-request --partitions 12

kafka topics --describe --topic '(eqiad|codfw).mediawiki.(cirrussearch|api)-request' | grep PartitionCount

Topic:codfw.mediawiki.api-request	PartitionCount:12	ReplicationFactor:3	Configs:
Topic:codfw.mediawiki.cirrussearch-request	PartitionCount:12	ReplicationFactor:3	Configs:
Topic:eqiad.mediawiki.api-request	PartitionCount:12	ReplicationFactor:3	Configs:
Topic:eqiad.mediawiki.cirrussearch-request	PartitionCount:12	ReplicationFactor:3	Configs:

Change 538905 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Disable speculative exec for camus jobs; increase map.tasks to 24 for analytics events

https://gerrit.wikimedia.org/r/538905

Change 538905 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Disable speculative exec for camus; bump map.tasks to 24 for analytics events

https://gerrit.wikimedia.org/r/538905

Change 538905 merged by Ottomata:
[operations/puppet@production] Disable speculative exec for camus; bump map.tasks to 24 for analytics events

https://gerrit.wikimedia.org/r/538905

Change 538966 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Temporarily disabling camus import of api-request

https://gerrit.wikimedia.org/r/538966

Change 538966 merged by Ottomata:
[operations/puppet@production] Temporarily disabling camus import of api-request

https://gerrit.wikimedia.org/r/538966

I've temporarily stopped importing api-request data to help cirrussearch-request catch up. AFAIK, no one uses api-request data in Hive.

Change 538968 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Temporarily disable monitoring of camus api-request import

https://gerrit.wikimedia.org/r/538968

Change 538968 merged by Ottomata:
[operations/puppet@production] Temporarily disable monitoring of camus api-request import

https://gerrit.wikimedia.org/r/538968

Doesn't look like this is catching up. New data is arriving again from the new partitions, but the previous data does not appear to be backfilling.

Ok, something is wrong with Camus and partition 0 of the eqiad.mediawiki.cirrussearch-request topic. Camus seems stuck on offset 23165504623. I'm pretty sure it was also stuck on this same offset when we were consuming the topic from the mediawiki_events Camus job (before I created the dedicated mediawiki_analytics_events job for it).

This is very strange! I don't see anything particularly weird about that message. To examine it, run:

kafkacat -b kafka-jumbo1002.eqiad.wmnet:9092 -C -t eqiad.mediawiki.cirrussearch-request -o 23165504624 -c 1 -p 0

I just found this in a Camus container log for this partition:

yarn-logs -u analytics application_1564562750409_220428

Container: container_e08_1564562750409_220428_01_000002 on an-worker1092.eqiad.wmnet_8041
===========================================================================================
LogType:stderr
Log Upload Time:Wed Sep 25 08:01:15 +0000 2019
LogLength:577
Log Contents:
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8

LogType:stdout
Log Upload Time:Wed Sep 25 08:01:15 +0000 2019
LogLength:92
Log Contents:
No more data left to process. Returning false
No more data left to process. Returning false

LogType:syslog
Log Upload Time:Wed Sep 25 08:01:15 +0000 2019
LogLength:40374
Log Contents:
2019-09-25 08:00:25,324 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2019-09-25 08:00:25,927 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2019-09-25 08:00:26,455 INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 1
2019-09-25 08:00:26,455 INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2019-09-25 08:00:26,651 INFO [main] com.linkedin.camus.etl.kafka.mapred.EtlRecordReader: classloader: class sun.misc.Launcher$AppClassLoader
2019-09-25 08:00:26,651 INFO [main] com.linkedin.camus.etl.kafka.mapred.EtlRecordReader: org.apache.avro.Schema: jar:file:/usr/lib/avro/avro-1.7.6-cdh5.16.1.jar!/org/apache/avro/Schema.class
2019-09-25 08:00:26,719 INFO [main] org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2019-09-25 08:00:26,724 INFO [main] com.linkedin.camus.etl.kafka.mapred.EtlMultiOutputRecordWriter: beginTimeStamp set to: 0
2019-09-25 08:00:26,728 INFO [main] com.linkedin.camus.etl.kafka.mapred.EtlRecordReader: classloader: class sun.misc.Launcher$AppClassLoader
2019-09-25 08:00:26,728 INFO [main] com.linkedin.camus.etl.kafka.mapred.EtlRecordReader: org.apache.avro.Schema: jar:file:/usr/lib/avro/avro-1.7.6-cdh5.16.1.jar!/org/apache/avro/Schema.class
2019-09-25 08:00:26,728 INFO [main] com.linkedin.camus.etl.kafka.mapred.EtlRecordReader:

topic:eqiad.mediawiki.cirrussearch-request partition:0 beginOffset:23165504623 estimatedLastOffset:23835486456
2019-09-25 08:00:26,732 INFO [main] com.linkedin.camus.etl.kafka.common.KafkaReader: bufferSize=1048576
2019-09-25 08:00:26,732 INFO [main] com.linkedin.camus.etl.kafka.common.KafkaReader: timeout=30000
2019-09-25 08:00:26,945 INFO [main] com.linkedin.camus.etl.kafka.common.KafkaReader: Connected to leader tcp://kafka-jumbo1004.eqiad.wmnet:9092 beginning reading at offset 23165504623 latest offset=23835486456
2019-09-25 08:00:27,189 INFO [main] com.linkedin.camus.etl.kafka.mapred.EtlRecordReader: Records read : 0
2019-09-25 08:00:27,422 INFO [main] com.linkedin.camus.etl.kafka.mapred.EtlMultiOutputFormat: work path: hdfs://analytics-hadoop/wmf/camus/mediawiki_analytics_events-00/2019-09-25-08-00-08/_temporary/1/_temporary/attempt_1564562750409_220428_m_000000_0
2019-09-25 08:00:27,423 INFO [main] com.linkedin.camus.etl.kafka.mapred.EtlMultiOutputFormat: Destination base path: hdfs://analytics-hadoop/wmf/data/raw/event
2019-09-25 08:00:27,432 INFO [main] com.linkedin.camus.etl.kafka.mapred.EtlMultiOutputFormat: work file: errors-m-00000

Note the LogType:stdout 'No more data left to process. Returning false'. That comes from https://github.com/linkedin/camus/blob/master/camus-etl-kafka/src/main/java/com/linkedin/camus/etl/kafka/common/KafkaReader.java#L255.

I had really thought that this problem was originally caused by some heap errors last Saturday, which Luca responded to. Now I'm not so sure.

I'm at the Product Infrastructure offsite today, so I won't be able to look at this any further right now, sorry! This will be my #1 priority task tomorrow though.

Ottomata raised the priority of this task from High to Unbreak Now!. Sep 26 2019, 4:32 PM

Change 539408 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Start mediawiki_analytics camus with a new history, include api-request

https://gerrit.wikimedia.org/r/539408

Change 539408 merged by Ottomata:
[operations/puppet@production] Start mediawiki_analytics camus with a new history, include api-request

https://gerrit.wikimedia.org/r/539408

I've spent all day on this so far and am not much closer to understanding what is happening. As far as I can tell, the iterator returned by the Kafka MessageSet returns false for all calls to hasNext(). I can't see why this is happening; as far as I can tell, the MessageSet returned has a non-zero and normal-looking size. I can't quite see what is happening beyond that, because any attempt to retrieve messages acts as if it is an empty iterator.

Specifically, this hasNext() call fails, as does this one, which causes a false to eventually be bubbled up to the EtlRecordReader class as a null for its getNext, which causes it to just stop doing anything with this partition at all. Thus, nothing is written to HDFS for this partition, and the offsets don't progress.
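
To make the failure mode concrete, here is a rough standalone sketch of the fetch that KafkaReader performs for this partition, written against the old kafka.javaapi SimpleConsumer that Camus is built on. This is not the actual Camus code; the leader host, buffer size and offset are taken from the container log above, and the client id is made up.

import kafka.api.FetchRequestBuilder
import kafka.javaapi.consumer.SimpleConsumer

// Hypothetical reproduction of the fetch Camus issues for the stuck partition.
// Values mirror the container log above; "camus-debug" is an arbitrary client id.
object FetchDebug {
  def main(args: Array[String]): Unit = {
    val topic      = "eqiad.mediawiki.cirrussearch-request"
    val partition  = 0
    val offset     = 23165504623L
    val bufferSize = 1048576  // KafkaReader logged bufferSize=1048576

    // kafka-jumbo1004 is the partition leader according to the log above.
    val consumer = new SimpleConsumer("kafka-jumbo1004.eqiad.wmnet", 9092, 30000, bufferSize, "camus-debug")
    try {
      val request = new FetchRequestBuilder()
        .clientId("camus-debug")
        .addFetch(topic, partition, offset, bufferSize)
        .build()
      val response   = consumer.fetch(request)
      val messageSet = response.messageSet(topic, partition)

      // The puzzle described above: sizeInBytes is non-zero and normal looking,
      // yet the iterator immediately reports that there are no messages to read.
      println(s"hasError:    ${response.hasError}")
      println(s"errorCode:   ${response.errorCode(topic, partition)}")
      println(s"sizeInBytes: ${messageSet.sizeInBytes}")
      println(s"hasNext:     ${messageSet.iterator.hasNext}")
    } finally {
      consumer.close()
    }
  }
}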


To keep things moving, I've started a new history of offsets files for both api-request and cirrussearch-request. The camus mediawiki_analytics_events job is importing things just fine starting now.

I'm still working on what to do with the data I can't consume. I may try to manually consume data from Kafka and put it into the refined event database tables somehow. Still TBD.

I'm replaying the contents of partition 0 starting at offset 23165504624 until 23858962742 (an offset stored for a recent successful Camus run), that's 693458118 messages.

kafkacat -C -b kafka-jumbo1002.eqiad.wmnet:9092 -t eqiad.mediawiki.cirrussearch-request -p 0 -o 23165504624 -c 693458118 | kafkacat -P -b kafka-jumbo1002.eqiad.wmnet:9092 -t eqiad.mediawiki.cirrussearch-request

This should allow Camus to re-consume the old data as new messages, properly time bucketing them. Once it is done, I will run a manual Refine task to refine the newly written (old hours) partitions.

@elukey FYI, I'm running this kafkacat process on stat1007. I'm seeing max replica lag increase for kafka-jumbo1003, every other broker is remaining low or at least steady. Let's keep an eye on this. Feel free to kill the kafkacat process if you think we should.

I hope the backfill will finish within a day.

Looks like all events were replayed successfully! Phew! Am re-refining stuff now.
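
For reference, a quick post-backfill check one could run from spark-shell (a sketch only; it assumes the refined event.mediawiki_cirrussearch_request table, and the day range simply brackets the affected window):

// Paste-able into spark-shell, which provides `spark` with Hive support.
// Sketch: count refined cirrussearch-request events per hour over the affected window
// to confirm the replayed hours are populated again; table name and window are assumptions.
spark.sql(
  """
    |SELECT year, month, day, hour, COUNT(*) AS num_events
    |FROM event.mediawiki_cirrussearch_request
    |WHERE year = 2019 AND month = 9 AND day BETWEEN 21 AND 27
    |GROUP BY year, month, day, hour
    |ORDER BY year, month, day, hour
  """.stripMargin
).show(200, truncate = false)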

Ottomata lowered the priority of this task from Unbreak Now! to High. Sep 27 2019, 3:16 PM

Everything looks back to normal! @EBernhardson how's it look to you?

Data looks to have backfilled appropriately, thanks!

Change 539945 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Remove api-request and cirrussearch-request from low volume mediawiki_events camus

https://gerrit.wikimedia.org/r/539945

Change 539945 merged by Ottomata:
[operations/puppet@production] Remove api-request and cirrussearch-request from mediawiki_events camus

https://gerrit.wikimedia.org/r/539945

@nettrom_WMF was wondering if the backfill of these events happened, as he found a gap. cc @Ottomata

My past self commented that it did! I have a vague memory of doing this too. More info please, @nettrom_WMF?

@Ottomata: Sure thing! Last week @jwang and I identified missing data in several EventBus schemas in the Data Lake. I'm not sure what all the affected schemas are, but I did check and found data missing during the same timeframe from mediawiki_page_create, mediawiki_revision_create, and mediawiki_revision_tags_change.

Here's the Presto query I ran to identify what dates are affected:

SELECT substr(meta.dt, 1, 10) AS log_date, count(*) AS num_events FROM event.mediawiki_page_create WHERE year = 2019 AND month = 9 AND day >= 20 GROUP BY substr(meta.dt, 1, 10);
  log_date  | num_events 
------------+------------
 2019-09-20 |     159659 
 2019-09-30 |     299637 
 2019-09-29 |     132596 
 2019-09-23 |     196981 
 2019-09-22 |     195660 
 2019-09-21 |     160752 
(6 rows)

2019-09-24 through 2019-09-28 are completely missing. I then ran the following two queries to identify when the outage started and ended:

SELECT substr(meta.dt, 1, 13) AS log_hour, count(*) AS num_events FROM event.mediawiki_page_create WHERE year = 2019 AND month = 9 AND day = 23 GROUP BY substr(meta.dt, 1, 13) ORDER BY substr(meta.dt, 1, 13); 
   log_hour    | num_events 
---------------+------------
 2019-09-23T00 |       7102 
 2019-09-23T01 |       8509 
 2019-09-23T02 |       8071 
 2019-09-23T03 |       7598 
 2019-09-23T04 |       7337 
 2019-09-23T05 |       8672 
 2019-09-23T06 |       8719 
 2019-09-23T07 |       8551 
 2019-09-23T08 |       9514 
 2019-09-23T09 |      10075 
 2019-09-23T10 |      10212 
 2019-09-23T11 |      11174 
 2019-09-23T12 |      10559 
 2019-09-23T13 |      11280 
 2019-09-23T14 |      10378 
 2019-09-23T15 |      10446 
 2019-09-23T16 |      10109 
 2019-09-23T17 |      10469 
 2019-09-23T18 |      11941 
 2019-09-23T19 |      11406 
 2019-09-23T20 |       4859 
(21 rows)
SELECT substr(meta.dt, 1, 13) AS log_hour, count(*) AS num_events FROM event.mediawiki_page_create WHERE year = 2019 AND month = 9 AND day = 29 GROUP BY substr(meta.dt, 1, 13) ORDER BY substr(meta.dt, 1, 13);
   log_hour    | num_events 
---------------+------------
 2019-09-29T18 |      16257 
 2019-09-29T19 |      16958 
 2019-09-29T20 |      17034 
 2019-09-29T21 |      16267 
 2019-09-29T22 |      28061 
 2019-09-29T23 |      38019 
(6 rows)

Given this, it looks like the outage started some time after 20:00 UTC on 2019-09-23, and that data started flowing in again some time after 18:00 UTC on 2019-09-29. I don't think I've confirmed that the hours are the same for all schemas, but I would suspect that they are. When I brought this up on IRC last week, Luca was kind enough to dig into the SAL logs and we found this phab task. From what was discussed earlier here and the overlap in dates, I thought these were all related.

👀 That is not good! I think I was not aware at the time that this issue affected anything except for mediawiki_api_request and mediawiki_cirrussearch_request.

This data is gone. We only keep 90 days of raw data for all events.

It'd be nice to have had some alerting about this missing data. It is hard to automate alerts like this, as there is no consistent threshold at which we would alert, but I guess if mediawiki_revision_create is ever missing any hours we should know about it.

I wonder, @mforns could we use the anomaly detection work you did for webrequest to get alerts for things like this?

could we use the anomaly detection work you did for webrequest to get alerts for things like this?

I think this is a much simpler problem, because the issue is just volume, not the shape of traffic. Any threshold-based alarm will catch this. Now, there is something missing here, because we do have those alerts set up for eventlogging: we alarm if the flow of data goes low. Let's reopen the ticket and dig into this a bit deeper.

IIRC we have some basic alarms on thresholds in Kafka topics, but not in refined event tables.

The anomaly stuff could be nice for automating alerts for all tables: if the volume today is different from before, alert.
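
As an illustration of the kind of threshold-based check being discussed, here is a minimal sketch against a refined event table (the table, the 5000-events-per-hour cutoff, and the exit-code convention are all assumptions, not an existing job):

import org.apache.spark.sql.SparkSession

// Sketch of a threshold-based volume check on a refined event table.
// Note: hours that are missing entirely produce no rows here at all, so a production
// check would also compare the set of present (day, hour) partitions against the
// expected hourly sequence.
object HourlyVolumeCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hourly-volume-check")
      .enableHiveSupport()
      .getOrCreate()

    val lowHours = spark.sql(
      """
        |SELECT year, month, day, hour, COUNT(*) AS num_events
        |FROM event.mediawiki_revision_create
        |WHERE year = 2019 AND month = 9
        |GROUP BY year, month, day, hour
        |HAVING COUNT(*) < 5000  -- assumed alerting threshold
      """.stripMargin
    )

    lowHours.show(100, truncate = false)
    // Fail the run if any hour is under threshold, so whatever schedules this can alarm.
    if (lowHours.count() > 0) sys.exit(1)
  }
}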

Yeah, I kinda figured that the data was lost at this point. For the edit-related tables, it's not a huge concern, because I think all of that is available in the MediaWiki History table as well, so it's not really lost. Is there a place where this should be documented, outside of this phab task?

IIRC we have some basic alarms on thresholds in Kafka topics, but not in refined event tables.

Did the partition checker not alarm when pulling partitions that are totally empty?

https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-camus/src/main/scala/org/wikimedia/analytics/refinery/camus/CamusPartitionChecker.scala#L120-L122

Yes, but I don't remember it alarming for e.g. eqiad.mediawiki.revision-create after I separated out the higher-volume topics into a different Camus job. I (vaguely) remember starting the main mediawiki events job without the high-volume topics, and it consumed data just fine.

Closing this parent task; the alarming work is being tracked in the child task.

@nettrom_WMF we are in the process of merging the eventlogging and eventbus docs; for now we will add this event to the eventlogging dataset page, which will eventually become an "event data" page.

To clarify: the Camus partition checker did alarm here. We missed that there was also some data loss in the mediawiki topics, because when this event started we were mostly concerned with cirrussearch-request.