
Edit schema needs purging, table is too big for queries to run (500G before conversion) {oryx}
Closed, ResolvedPublic8 Estimated Story Points

Description

Edit schema needs purging, table is too big
https://meta.wikimedia.org/wiki/Schema_talk:Edit

Talk to the schema owners: the table has grown too large for feasible queries in our current system (250G in a month or so). We need to discuss how to deal with this going forward.

Event Timeline

Nuria raised the priority of this task to Needs Triage. Jan 25 2016, 5:47 PM
Nuria added a project: Analytics-Kanban.
Nuria subscribed.
Milimetric renamed this task from Edit schema needs purging, table is too big for queries to run (500G before conversion) to Edit schema needs purging, table is too big for queries to run (500G before conversion) {oryx} [8 pts]. Jan 25 2016, 5:57 PM
Milimetric updated the task description.

I imagine purging refers to the deletion of old rows? And I imagine you have the tools to do automatic purging already, so it's just a matter of us deciding how long we want to retain the logs?

it's just a matter of us deciding how long we want to retain the logs

Exactly that. Or filtering more (sampling), etc.

@Neil_P._Quinn_WMF: Correct, for most schemas we purge after 90 days.

I talked with @madhuvishy today, and she suggested another possibility: storing all the data in Hive, and only a subset or none at all in MariaDB. This would require sacrificing the ability to easily join the Edit logs to MediaWiki tables, although we could likely work around that by importing subsets of data into MariaDB tables as needed.

It looks like we could reduce the logging rate to a quarter of what it is now, just by reducing the sampling rate for mobile edits from 100% to 6.25%, the same as it is for desktop edits. I think I implicitly classed mobile editing as a minor factor, so I wasn't expecting it to have that kind of impact.
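Working backwards from that estimate (my arithmetic, not something stated in the task): for 6.25% mobile sampling to cut the total volume to roughly a quarter, mobile would have to account for about 80% of logged events, since 0.20 + 0.80 / 16 ≈ 0.25.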

So that takes care of part of the issue. I'll keep thinking about purging policies and such.

@Nuria, given that we have a lot more mobile events than we want or need, what if (rather than purging the table in a time-based fashion) we go back through the mobile events and delete 15 out of every 16, in effect retroactively applying 6.25% sampling? I don't know how hard that would be to do, or how easily I could help. I could certainly make a list of rows to delete based on the primary key.
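As a rough sketch of how such a list could be built (the id and event_platform column names, and the 'phone' value, are assumptions on my part rather than anything confirmed in this task):

-- List the primary keys of mobile-event rows to delete, keeping roughly 1 in 16.
-- Column names are assumptions; adjust to the real schema before using.
SELECT id
FROM Edit_13457736
WHERE event_platform = 'phone'   -- hypothetical mobile filter
  AND (id % 16) != 0;            -- retain ~6.25% of the rows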

Also, I'm working on reworking the schema entirely so at some point in the next couple of weeks we should be deploying a new version of the schema. That'll freeze the table at its current size; I don't know how much that'll help.

@Neil_P._Quinn_WMF: if this were a flat file it would be really easy to remove 15 out of every 16; in the db I am not so sure.

cc-ing @jcrespo for good ideas on how to remove data from the table in a uniform fashion.

We can delete every nth row in batches, using pt-archiver. For example, using the "seconds" part of the timestamp column, which seems to be uniformly distributed. You only need to say what percentage of rows to delete. But I really recommend applying the sampling first, so it is not done on a live table.

mysql> SELECT count(*) FROM Edit_13457736 WHERE timestamp BETWEEN '20160101' and '20160102';
+----------+
| count(*) |
+----------+
|  2440650 |
+----------+
1 row in set (2.40 sec)

mysql> SELECT count(*) FROM Edit_13457736 WHERE timestamp BETWEEN '20160101' and '20160102' AND timestamp like '%0';
+----------+
| count(*) |
+----------+
|   244829 |
+----------+
1 row in set (3.12 sec)
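To make the batching idea concrete, this is a minimal sketch of the kind of chunked delete that pt-archiver would effectively run (the chunk size is an arbitrary example; the real tool adds throttling and safety checks on top of this):

-- Delete rows whose timestamp ends in '0' (~10% of rows) in small chunks,
-- repeating the statement until it affects zero rows.
DELETE FROM Edit_13457736
WHERE timestamp BETWEEN '20160101' AND '20160102'
  AND timestamp LIKE '%0'
LIMIT 10000;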

I think you'd want to delete a percentage of distinct editing sessions rather than rows.

@Neil_P._Quinn_WMF: can you chime in on whether @jcrespo's approach would work?

I think the core idea is sound, but as Krenair points out it would make more sense to use the condition event_editingSessionId NOT LIKE '0%', which is the same condition we would use to sample events on the client.
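For context: assuming the editing session IDs are uniformly distributed hexadecimal strings, keeping only the ones that start with '0' retains roughly 1/16 ≈ 6.25% of sessions. A quick sanity check might look like this sketch:

-- Fraction of rows whose session id starts with '0'; expect roughly 0.0625
-- if the session ids are uniformly distributed hex strings (assumption).
SELECT SUM(event_editingSessionId LIKE '0%') / COUNT(*) AS kept_fraction
FROM Edit_13457736
WHERE timestamp BETWEEN '20160101' AND '20160102';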

But I really recommend applying the sampling first, so it is not done on a live table.

@jcrespo, I'm working on implementing more sampling (T124845) right now. That won't cause a new table to be created but I'm also working on a more general revision of the schema (T118063) that will. Should this kind of purge just wait for the sampling, or should it wait for the general revision as well?

@Neil_P._Quinn_WMF: let's do the purge before you are done with your schema revision. The purge is a solution for data volume, not correctness, so both things can happen in parallel.

@Nuria, I'm fine with that. But it sounded like Jaime was advising against it. Let's wait until we've clarified that.

Also, unless this is more urgent than I realize, I think it makes sense to wait until I've applied 6.25% sampling across all the event sources, so we only have to run the purge once. I'm hoping that will be done in next week's deploy.

Also, please see: https://phabricator.wikimedia.org/T124383

When issues with size are fixed these jobs can be re-enabled.

Change 268568 had a related patch set uploaded (by Neil P. Quinn-WMF):
Sample MobileFrontend edit events at 6.25%

https://gerrit.wikimedia.org/r/268568

Change 268568 merged by jenkins-bot:
Sample MobileFrontend edit events at 6.25%

https://gerrit.wikimedia.org/r/268568

Change 269293 had a related patch set uploaded (by Jforrester):
Sample MobileFrontend edit events at 6.25%

https://gerrit.wikimedia.org/r/269293

Change 269293 merged by jenkins-bot:
Sample MobileFrontend edit events at 6.25%

https://gerrit.wikimedia.org/r/269293

Jdforrester-WMF subscribed.

This is now good to go, from 2016-02-08 backwards (inclusive)

Yes, @jcrespo, @Nuria, we're ready to purge using the following logic. Let us know what your timeline looks like.

DELETE FROM Edit_13457736
WHERE 
event_editingSessionId NOT LIKE '0%' AND
timestamp < "20160209014000";

The time restriction is because that's when the new sampling rates took effect on the clients; we don't want to sample the same data twice.
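A non-destructive dry run of the same predicate (just a sketch; it only counts what the purge would remove):

-- Count the rows in scope for the purge, without deleting anything.
SELECT COUNT(*)
FROM Edit_13457736
WHERE event_editingSessionId NOT LIKE '0%'
  AND timestamp < '20160209014000';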

Please do not run that, as is. I will take care of creating a batched job to do that.

@jcrespo, don't worry! I meant "we're ready [for you] to purge" using pt-archiver; just give me an idea of when you might get to it so I know when I can get to T126058.

I would like to promise this week, but if not, early next week.

This is running now (and during the weekend) on the analytics slave. I am doing it slowly so it doesn't affect the rest of the data being inserted.

I am doing it on the analytics slave, so if something goes wrong, we can revert it with data from the master. After it is complete, I will ask you to validate it for ~ a week, then I will irreversibly delete it (at least, from mysql) on all machines.

@Neil_P._Quinn_WMF This is done; it took 40 hours to purge those rows. Please check that it was done correctly against the analytics slave dbstore1002, and then I will perform it on the master and the other slaves, too.

@jcrespo, I don't totally understand. If the purged data is in dbstore1002 (= analytics-store), what other data should I compare that with? The only databases I've ever accessed are analytics-store and x1-analytics-slave.

With "before"? Do you see anything strange on your graphs? Too many missing points? Errors, etc. Anything "strange" before I delete everything forever.

@jcrespo, oh, I see. I thought you wanted me to compare it with the master database.

I delete everything forever.

"Before I apply the same change to the other hosts that you cannot see, but that would otherwise allow me to recover if this had been done incorrectly." :-)

@jcrespo, I just found one problem: the data in the Edit logs (on analytics-store) stops on 29 January. I've spot checked some other event log tables, and they don't have the same issue. The number of events on 29 January is also much higher than on previous days.

29 January was about a week before we made the sampling changes in mobile and VE.

SELECT 
	LEFT(timestamp, 8) as day,
	COUNT(*)
FROM Edit_13457736
WHERE
	timestamp > "20160125"
GROUP BY day;

day		COUNT(*)
20160125	537607
20160126	524772
20160127	563500
20160128	612954
20160129	1184511

There is a problem I hadn't realized: the automatic syncing process detected missing rows and has started refilling them like crazy, delaying proper replication of the data. As you can see here:

Master
$ mysql -BN -A log -e "SHOW TABLES like 'Edit\_%'" | while read table; do echo -n "$table: "; mysql -BN -A log -e "SELECT MAX(timestamp) FROM $table"; done
Edit_10604157: NULL
Edit_10676603: NULL
Edit_11319708: NULL
Edit_11448630: 20160127011701
Edit_13457736: 20160217081942
Edit_5563071: NULL
Edit_5570274: NULL
Slave
$ mysql -BN -A log -e "SHOW TABLES like 'Edit\_%'" | while read table; do echo -n "$table: "; mysql -BN -A log -e "SELECT MAX(timestamp) FROM $table"; done
Edit_10604157: 20150312123640
Edit_10676603: 20150430164114
Edit_11319708: 20150506153533
Edit_11448630: 20160127011701
Edit_13457736: 20160129150029
Edit_5563071: 20130618000918
Edit_5570274: 20150506071731

Data up to the latest day is on the master, only delayed by the automatic backfilling and by the purging. So, aside from this, is there anything else, or can I delete this (again, on both the slave and the master this time)? I can give you the result of that query on the master, which, except for the purging, the slave will soon match:

MariaDB EVENTLOGGING m4 localhost log > SELECT 
    -> LEFT(timestamp, 8) as day,
    -> COUNT(*)
    -> FROM Edit_13457736
    -> WHERE
    -> timestamp > "20160125"
    -> GROUP BY day;
+----------+----------+
| day      | COUNT(*) |
+----------+----------+
| 20160125 |  2798327 |
| 20160126 |  2712766 |
| 20160127 |  2796390 |
| 20160128 |  2768918 |
| 20160129 |  2560945 |
| 20160130 |  2662054 |
| 20160131 |  2917958 |
| 20160201 |  2785474 |
| 20160202 |  2716151 |
| 20160203 |  2711656 |
| 20160204 |  2627985 |
| 20160205 |  2493990 |
| 20160206 |  2548298 |
| 20160207 |  2819553 |
| 20160208 |  2747151 |
| 20160209 |   692748 |
| 20160210 |   538933 |
| 20160211 |   592494 |
| 20160212 |   539869 |
| 20160213 |   511462 |
| 20160214 |   550950 |
| 20160215 |   576825 |
| 20160216 |   582528 |
| 20160217 |   152598 |
+----------+----------+

The numbers seem to match between the purged rows on the slave and the filtered ones?

@jcrespo, ignoring the cut-off data and spike on 29 January, the data on the slave looks reasonable to me (I'm attaching the spreadsheet I used to check for posterity's sake). You can go ahead and delete the old data permanently.

jcrespo moved this task from Blocked external/Not db team to In progress on the DBA board.

Applying to all servers now.

This is still ongoing on the master (it ended on the slaves). Purged 121 million rows out of approximately 265 million.

Milimetric renamed this task from Edit schema needs purging, table is too big for queries to run (500G before conversion) {oryx} [8 pts] to Edit schema needs purging, table is too big for queries to run (500G before conversion) {oryx}. Feb 22 2016, 9:05 PM
Milimetric set the point value for this task to 8.
nshahquinn-wmf renamed this task from Edit schema needs purging, table is too big for queries to run (500G before conversion) {oryx} to Edit schema needs purging, table is too big for queries to run (500G before conversion) {oryx}. Feb 22 2016, 10:32 PM

Thank you, @jcrespo! One question: it looks like the data on the slave still cuts off on 30 January. It seems to be refilling, but only very slowly. Do you know how long it should take to catch up, or how I can check the replication lag for that table?

I will manually make sure to resync it; a side issue of purging is that it slows down replication. However, if I manually resync it now, it may copy events that I would have to re-purge :-)

Let it finish (it should be soon) and I promise it will go back to a good state, with the additional improvement of now going faster due to the lower number of events.

Regarding replication lag, I shared a script I use (which basically compares the latest timestamp of each table) with one of your operators. I think he is working (or will work) on setting up something public for that, or an alarm, on T124306.

There are a few tables affected by this:

$ bash check_eventlogging_lag.sh dbstore1002
[Warning] DeprecatedUsage_7906187 lag: 02:49:20
[CRITICAL] EchoInteraction_15180901 lag > 1 day (134661 seconds)
[CRITICAL] Edit_13457736 lag > 1 day (2073306 seconds)
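The script itself is not attached here; as a rough sketch of the per-table lag idea (using the latest event timestamp against the current time as a proxy, and assuming the 14-digit timestamp format shown above):

-- Approximate lag for one table: seconds between its newest event and now.
SELECT TIMESTAMPDIFF(SECOND,
                     STR_TO_DATE(MAX(timestamp), '%Y%m%d%H%i%s'),
                     NOW()) AS lag_seconds
FROM Edit_13457736;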

@jcrespo, ah, okay, I misunderstood your message yesterday and thought the purge had finished on the master. I'll be patient :)

It should have finished now, after deleting 275047166 rows. I am rerunning the script to make sure that is true, and tomorrow I will resync all tables to the master.

The purging has finished on the master; I am running it once more on all slaves now that deleted events cannot be sent to them anymore. I am also resyncing the tables that got behind so they are as up to date as possible. I will ping one last time when the lag is close to 0 on all schemas, and then I will re-enable the purging.

Lag should be ok now. I just need a last ok from you to resolve this:

mysql> SELECT 
    -> LEFT(timestamp, 8) as day,
    -> COUNT(*)
    -> FROM Edit_13457736
    -> WHERE
    -> timestamp > "20160125"
    -> GROUP BY day;
+----------+----------+
| day      | COUNT(*) |
+----------+----------+
| 20160125 |   537607 |
| 20160126 |   524772 |
| 20160127 |   595237 |
| 20160128 |   555843 |
| 20160129 |   513733 |
| 20160130 |   504448 |
| 20160131 |   556729 |
| 20160201 |   550647 |
| 20160202 |   535601 |
| 20160203 |   534936 |
| 20160204 |   521187 |
| 20160205 |   497161 |
| 20160206 |   471641 |
| 20160207 |   499208 |
| 20160208 |   534005 |
| 20160209 |   571428 |
| 20160210 |   538933 |
| 20160211 |   592494 |
| 20160212 |   539869 |
| 20160213 |   511462 |
| 20160214 |   550950 |
| 20160215 |   576825 |
| 20160216 |   582528 |
| 20160217 |   557550 |
| 20160218 |   532826 |
| 20160219 |   522700 |
| 20160220 |   516598 |
| 20160221 |   543912 |
| 20160222 |   553830 |
| 20160223 |   550432 |
| 20160224 |   554612 |
| 20160225 |   370185 |
+----------+----------+
32 rows in set (35.25 sec)

@jcrespo, yes, the data on the slave looks good. I think you can close this!

nshahquinn-wmf raised the priority of this task from Medium to Needs Triage. Mar 29 2018, 9:06 AM
nshahquinn-wmf moved this task from Blocked to Radar on the Contributors-Analysis board.