
[Story] Clean up unused records
Closed, ResolvedPublic

Description

When terms are updated or deleted in a terms store, the database records holding the old or deleted values must be checked for orphaning and cleaned up accordingly.


Old investigation stuff:

Best plan so far: create code in wikibase/term-store to clean up dead records and make it invokable from outside the library. Then create some maintenance script in Wikibase that invokes this code.

Because of the normalization we are ending up with unused records in the DB.

When the terms of a property are deleted, the records in property_terms are dropped. Referenced records in terms_in_lang are not dropped, etc.

How do we want to deal with these unused records?

1 Leave them be, no need to clean them up

  • Con: wasted space in db (do we care?)

2 Have a maintenance script that periodically cleans them up

  • Con: extra dev work
  • Con: need to run extra script

3 Clean them up immediately

  • Con: extra dev work
  • Con: Write performance penalty

4 Clean them up post request with a job

  • Con: most extra dev work
  • Con: Write performance penalty (perhaps less severe than 3)

5 ???

Event Timeline

As pointed out at some point, we probably want to remove the strings from the table fairly sharpish, as content may be removed or revdeled and should not continue to appear in public places such as labs DB replicas. So that rules out maintenance scripts run periodically.

Running the logic to clean up the tables post main request or in a job would probably make the most sense.

I was wondering about how much extra complexity the post-request approach (4) would bring. In particular, what info we would need to give to the job. Giving the property id is not sufficient. You could give the ids of the text records and then, in the job, check if they are still unused, and do the same for the higher-level records that point to those text records. Thing is, if you already need to find the unused records in the request, then you can just as well delete them right away. Either way you have a performance penalty. So I think the simpler approach (immediate cleanup (3)) makes more sense as a starting point. Does that make sense to you?

As pointed out at some point, we probably want to remove the strings from the table fairly sharpish, as content may be removed or revdeled and should not continue to appear in public places such as labs DB replicas.

Interesting. Treating our internal storage as public even when it can't be accessed via the software is kinda weird. People that have access to a replica can just log/copy everything anyway, so this is a bit of a half measure. From an architecture PoV I'd also prefer the persistence details to be private and to have well defined interfaces. That's for when we talk about the labs use case(s) though :)

JeroenDeDauw updated the task description.
JeroenDeDauw updated the task description.
JeroenDeDauw updated the task description.

So this is semi-blocked on figuring out what we do for labs, since that impacts the reasons for immediate cleanup.

We can already think about this question though: if labs is not a concern, can we get away with not spending effort on this? (no cleanup (1))

What I was thinking, in order to keep it simple, is this (a hybrid of solutions 2 and 4 suggested above):

  • there is a general cleanup script, cron scheduled
  • on writes/deletions, we trigger the same script post-request (we need not pass it anything; it will go check and clean up everything orphaned)

If the script is run frequently enough, and maybe written as an optimized SQL query, it shouldn't perform badly even on billion-row, indexed tables, I believe.

Thinking about it again.

If what @JeroenDeDauw said is true, and orphan data existing in the table for a little while isn't really a privacy issue, then solution 2 should be sufficient.

@Ladsgroup I believe solution 3 isn't a production solution, am I right? I mean deleting in batches ought to be more performant than deleting separately?

@Ladsgroup I believe solution 3 isn't a production solution, am I right? I mean deleting in batches ought to be more performant than deleting separately?

If the number of deletions was low, it would be okay, but given the sheer number of changes on Wikidata, it's better to be a job. It weakens the ACID guarantees for us, but for this use case that doesn't matter.

@alaa_wmde seems to have a different idea of how the maintenance script would work than I do.

My thoughts were that the script would need to go through each record and then do some query to find out if it is unused. That means the script would run the whole time and not get rid of things very quickly. Which is quite different from a script that quickly finds and removes unused records and only needs to be run once in a while. The latter is definitely much better. How would that script work? What would the pseudo-SQL be?

Are things like foreign keys and cascading deletes maybe relevant or useful to us here?
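
For illustration, a minimal sketch of what a cascading delete would look like on the assumed wbt_* schema (the constraint name is made up, MySQL/InnoDB syntax):

ALTER TABLE wbt_property_terms
	ADD CONSTRAINT fk_wbpt_term_in_lang
	FOREIGN KEY (wbpt_term_in_lang_id)
	REFERENCES wbt_term_in_lang (wbtl_id)
	ON DELETE CASCADE;
-- this propagates deletes from parent to child: dropping a wbt_term_in_lang row
-- would drop the wbt_property_terms rows pointing at it. It does not work in the
-- direction we need here, since deleting the last wbt_property_terms row that
-- references a term does not delete the now-orphaned wbt_term_in_lang row.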

The updating optimization ticket is relevant for this cleanup. We now have two main approaches:

delete and insert everything
Downside: performance penalty for re-inserting things that did not change
Downside: deleted terms leave unused records, so we need to deal with this ticket

smart update using diff
Downside: performance penalty for finding which of the terms are already there
Note: since we are finding those terms, we likely already have the info we need to remove unused records

JeroenDeDauw renamed this task from Decide on initial cleanup strategy to Clean up unused records. Apr 9 2019, 10:06 AM
JeroenDeDauw updated the task description.

We figured we'd go with delete and insert everything. Task description updated to reflect this.

I'm slightly confused with the description 2 comments up.

delete and insert everything
Downside: performance penalty for re-inserting things that did not change
Downside: deleted terms leave unused records, so we need to deal with this ticket

Deleting everything shouldn't leave unused records?
Unless everything is not being deleted?

"delete everything" means deleting all terms for an item/property in the item/property_terms table, rather than just those that actually need to be removed.

@JeroenDeDauw here's a quick snippet of a cleanup SQL script

-- drop term rows no longer referenced by any item or property
DELETE FROM wbt_term_in_lang
WHERE
	wbtl_id NOT IN (SELECT wbit_term_in_lang_id FROM wbt_item_terms)
	AND wbtl_id NOT IN (SELECT wbpt_term_in_lang_id FROM wbt_property_terms)
;

-- drop text_in_lang rows no longer referenced by any term_in_lang row
DELETE FROM wbt_text_in_lang
WHERE
	wbxl_id NOT IN (SELECT wbtl_text_in_lang_id FROM wbt_term_in_lang)
;

-- drop text rows no longer referenced by any text_in_lang row
DELETE FROM wbt_text
WHERE
	wbx_id NOT IN (SELECT wbxl_text_id FROM wbt_text_in_lang)
;

@Ladsgroup @Addshore any concerns re indexes or sub-queries here?

I have no idea how these queries would perform, but they are scary and I don't think we should be running them on the production master DB.
If cleanup is wanted from some sort of maintenance script, it should probably use batches of deletes and/or selects.
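
For example, a batched variant could look roughly like this (a sketch only, on the assumed wbt_* schema; batch size and waiting for replication between batches would be up to the surrounding script):

SELECT wbtl_id
FROM wbt_term_in_lang
WHERE wbtl_id NOT IN (SELECT wbit_term_in_lang_id FROM wbt_item_terms)
	AND wbtl_id NOT IN (SELECT wbpt_term_in_lang_id FROM wbt_property_terms)
LIMIT 1000;

DELETE FROM wbt_term_in_lang
WHERE wbtl_id IN (1, 2, 3); -- placeholder for the ids returned by the SELECT above
-- repeat until the SELECT returns no rows; this bounds how much each DELETE
-- writes, but the orphan-finding SELECT still has to scan.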

For point 4 in the task description, I don't see where the extra work really comes from? The logic for cleaning up the tables is the same and needs to be written no matter what, it will just be wrapped in some deferring code.

We decided at some point that we can't do 1, so the dev work needs to happen.

The maintenance script just sounds like another thing that we need to keep running and make sure is actually deleting things, and it will be slow.

Deleting them as part of the save could be fine, but it will make saving slower, deferring this cleanup is just to keep saves as lean as possible.

oh yeah, sure, that won't be running in production like that. I was just wondering if there are any extra optimizations here that I could've missed re using indexes or the sub-queries.

re running in batches: sure, it should limit deleting to an acceptable amount per run. And as long as we run it frequently enough, it shouldn't be an issue either for retaining deleted data too long or for performance.

thanks @Addshore for the feedback, I'll submit a production-ready batched script to get more feedback on a final solution there

My final thought here before heading to vacation is: when terms are removed from an entity, we already know which terms therefore might need to be removed from the other tables, so we should probably use that information rather than have some secondary process cycle through the millions of terms trying to find what to remove.

Explored this in detail with @Ladsgroup: a script that tries to identify all orphans will end up doing a full table scan. That's the case for the snippet approach in a previous comment.

As pointed out by @Addshore, we will have to go with a solution that uses the information about which terms have been updated/deleted as part of the write operation. The actual clean-up will happen in a triggered post-request job, but it should be passed the ids of the terms that have been touched.

This means we have to go with the "smart update using diff" approach, since otherwise we do not know which terms have been removed. It's not clear to me that it will make sense to do the cleanup post-request; we might end up only delaying a few % of the cost. I suggest first making it work on write and then seeing if we can gain a lot by moving stuff to a job.
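
Roughly, as a sketch on the assumed wbt_* schema (the ids are placeholders for whatever the diff says was unlinked; assuming the *_term_in_lang_id columns are indexed, the existence checks only touch those few ids):

DELETE FROM wbt_term_in_lang
WHERE wbtl_id IN (1, 2, 3) -- placeholder: term_in_lang ids unlinked by this edit
	AND wbtl_id NOT IN (SELECT wbit_term_in_lang_id FROM wbt_item_terms)
	AND wbtl_id NOT IN (SELECT wbpt_term_in_lang_id FROM wbt_property_terms);
-- the same pattern repeats one level down for wbt_text_in_lang and wbt_text,
-- using the text_in_lang / text ids of the rows that actually got deleted.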

This needs to be a job; it's expensive compared to the main parts and will cause deadlocks (all transactions in a web request are rolled into one transaction and get reverted all together).

I second the post-request solution. Do these post-request jobs exist by default in all MediaWiki instances? Or do other Wikibase instances need to set something up extra in that case? How are these jobs done in MW, actually?

Sweet, let's implement it as a post-request then. As planned, the actual clean-up logic resides in doctrine-term-store (and the upcoming mediawiki-term-store) and is then invoked in a job within Wikibase.

alaa_wmde renamed this task from Clean up unused records to [Story] Clean up unused records. Apr 23 2019, 8:30 PM
alaa_wmde updated the task description.

I made this one a story and broke it down

I think something along these lines makes sense, please comment here: https://github.com/wmde/doctrine-term-store/pull/8/files

The diff tells us which terms to remove, which is the minimum info we need to give to the cleanup code.

As far as I can tell that is not sufficient though. If you unlink the terms in the property_terms table, then knowing which terms got removed does not allow you to find which lower-level table records were linked, since the links are gone. So my understanding is that we need to determine the ids of all records that need to be cleaned up before we do the cleanup. It's not clear to me how to best find those ids query-wise. Since I am the least up to speed with SQL on the team, I'd rather not implement that straight away myself.

Maybe there is a way to do some performant queries that find what to clean up after the links in the property_terms table are gone? If there is, that is extra incentive to have some post-request job, as more stuff could be moved there.
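
For the "determine the ids before the cleanup" part, something along these lines might work (a rough sketch on the assumed wbt_* schema, not checked for performance; the property id is a placeholder):

SELECT wbtl_id, wbtl_text_in_lang_id, wbxl_text_id
FROM wbt_property_terms
INNER JOIN wbt_term_in_lang ON wbtl_id = wbpt_term_in_lang_id
INNER JOIN wbt_text_in_lang ON wbxl_id = wbtl_text_in_lang_id
WHERE wbpt_property_id = 42;
-- run this while the links still exist, before the wbt_property_terms rows are
-- dropped; the collected ids can then be handed to the post-request cleanup,
-- which checks each one for remaining references and deletes only real orphans.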

Apart from the cleanup question, I have come to think we need to do the diff regardless, after seeing how slow the import of big properties is. If a property has 1000 terms and nothing changed, the Doctrine implementation is currently doing 4000 write queries and the MW implementation is doing 4000 write and 4000 read queries. Not sure how we ever figured that is acceptable :D

I second the post-request solution. Do these post-request jobs exist by default in all MediaWiki instances? Or do other Wikibase instances need to set something up extra in that case? How are these jobs done in MW, actually?

It probably makes sense to use a deferred update instead of a regular job, so that terms are cleaned up as soon as possible (since they may contain data that should no longer be public).