
Reindex all wikis to enable apostrophe normalization, camelCase handling, acronym handling, word_break_helper, and icu_tokenizer/_repair
Closed, Resolved · Public · 5 Estimated Story Points

Details

Other Assignee
TJones

Event Timeline

I'll do a write-up of the before-and-after impact of the reindexing and post a link here, but anyone can do the reindexing and finish the ticket without that.

Gehel triaged this task as Medium priority. Jul 24 2023, 3:34 PM
Gehel moved this task from needs triage to Current work on the Discovery-Search board.
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.
Gehel set the point value for this task to 3.

Mentioned in SAL (#wikimedia-operations) [2023-08-15T20:36:47Z] <ebernhardson> T342444 start cirrussearch reindex of all wikis to enable new text analysis components from mwmaint1002

Reindex is still in progress, going a bit slower than I remember from past runs. Each cluster has processed ~200 out of 1000 wikis, but the absolute number is fairly meaningless. The important question would be the % of docs reindexed, which we don't have any easy-to-reference answer for. Taking into account that wikidata and commonswiki ran in parallel to this, we've probably reindexed about half the docs, but it will probably take the rest of the week or into next week for this to finish.
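
One rough way to approximate that percentage (a hypothetical sketch, not something we had handy at the time): compare the doc count of indices created since the reindex started against the total. The cutoff below is the reindex start time from the SAL entry above (2023-08-15T20:36:47Z) as epoch milliseconds, and the _cat/indices column names are assumed from Elasticsearch's cat API:

curl -s 'http://localhost:9200/_cat/indices?h=docs.count,creation.date&format=json' |
  jq --argjson cutoff 1692131807000 '
      # total docs across all indices vs. docs in indices created after the cutoff
      (map((.["docs.count"] // "0") | tonumber) | add) as $total
    | (map(select((.["creation.date"] | tonumber) > $cutoff)
           | (.["docs.count"] // "0") | tonumber) | add // 0) as $new
    | {total_docs: $total, reindexed_docs: $new, pct: (100 * $new / $total)}'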

Still in progress, only up to enwiki. This may indicate something is wrong/more expensive about the new mappings, or it could be an unrelated issue.

Sounds like my local reindexing is insufficient for detecting non-egregious slowdowns in indexing speed. (I know I have other overhead; I guess it's even more than I thought.) Should we pause the reindex and investigate more thoroughly on RelForge, with the possibility of reverting some changes after finding the slowest ones?

Decided we don't need to pause the reindex yet. Other than the indexing latency, nothing appears to be complaining too loudly. To get an idea of how the mappings affect things, I ran the following test, which imports 5k docs from mediawiki.org to measure indexing time. This took ~55s with the old mappings and ~2m40s with the updated mappings.

Settings and mappings collected from the MediaWiki API.
Manually adjust the settings.json files to prune settings Elasticsearch owns, like uuid and provided_name. Change the shard count to 1.
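
A rough sketch of those two steps (assuming the standard CirrusSearch dump API modules; the exact shape of their output, and therefore the jq paths below, may need adjusting, and the output file names are chosen to match what the test function below expects):

# Fetch the live settings and mappings via the CirrusSearch dump API modules.
api='https://www.mediawiki.org/w/api.php'
curl -s "${api}?action=cirrus-settings-dump&format=json" > settings.raw.json
curl -s "${api}?action=cirrus-mapping-dump&format=json" > mapping.raw.json

# Prune the fields Elasticsearch owns and force a single shard. This assumes the
# relevant settings object has already been extracted to the top level of
# settings.raw.json; the mapping dump needs a similar extraction into mapping.bad.json.
jq 'del(.index.uuid, .index.provided_name, .index.creation_date, .index.version)
    | .index.number_of_shards = "1"' settings.raw.json > settings.bad.json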

Capture a dump to import:

  • curl https://dumps.wikimedia.org/other/cirrussearch/20230821/mediawikiwiki-20230821-cirrussearch-content.json.gz | zcat | head -n 10000 > content.json

Elasticsearch instance used, restarted once for each test category:

  • docker run --rm -p 9200:9200 -e ES_JAVA_OPTS="-Xms4096m -Xmx4096m" -e discovery.type=single-node docker-registry.wikimedia.org/repos/search-platform/cirrussearch-elasticsearch-image:v7.10.2-5

Test function:

test_latency() {
    # Which index to (re)create and which settings/mapping variant to load;
    # expects settings.${index_kind}.json and mapping.${index_kind}.json on disk.
    index=enwiki_content
    index_url="http://localhost:9200/${index}"
    index_kind=bad
    # Drop any previous test index, then recreate it with the chosen settings
    # and mapping.
    curl -s -XDELETE "${index_url}" | jq -c .
    curl -s -XPUT "${index_url}" \
        -H 'Content-Type: application/json' \
        --data-binary "@settings.${index_kind}.json" | jq -c .
    curl -s -XPUT "${index_url}/_mapping" \
        -H 'Content-Type: application/json' \
        --data-binary "@mapping.${index_kind}.json" | jq -c .
    # Feed the dump through _bulk in batches of 10 lines (5 docs per batch,
    # since each doc is an action line plus a source line) and print the
    # elapsed wall-clock time reported by the external time command.
    cat content.json | time split -l 10 \
        --filter "curl -s -H 'Content-Type: application/json' --data-binary @- '${index_url}/_bulk' >/dev/null" 2>&1 \
        | awk '/elapsed/ {print substr($3, 1, length($3) - 7)}'
}
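
For back-to-back comparisons, a small hypothetical tweak: change the hard-coded index_kind=bad to index_kind="${1:-bad}" so the variant is selectable per run. The good/bad naming for the old vs. new mappings is inferred from the file naming above, and the Elasticsearch container should still be restarted between runs as noted:

# Hypothetical usage, assuming the one-line tweak described above.
for kind in good bad; do
    echo "== ${kind} settings/mappings =="
    test_latency "${kind}"
done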

The current reindexing has been stopped. We will re-run the reindexing once we have an update available to address the latency regression. The stop isn't due to a particular issue; production seems to be continuing on as normal, if a little slower at indexing. But since we plan on adjusting the mappings, continuing the current reindexing seems fairly pointless and restricts SRE from doing restarts.

TJones renamed this task from Reindex all wikis to enable apostrophe normalization, camelCase handling, acronym handling, and word_break_helper to Reindex all wikis to enable apostrophe normalization, camelCase handling, acronym handling, word_break_helper, and icu_tokenizer/_repair. Feb 5 2024, 3:55 PM
TJones updated the task description. (Show Details)
TJones updated the task description. (Show Details)
TJones changed the point value for this task from 3 to 5.

Adding Erik as "other assignee" (never done that before) and increasing the points because Erik is doing more than usual for reindexing: watching the cloudelastic reindex using the new update pipeline backfilling mechanism. And I've been doing more than usual gathering stats while fretting a bit over reindex speed.

TJones updated Other Assignee, added: TJones; removed: EBernhardson.

Swapping Assignee and Other Assignee with Erik, since he's still working on reindexing cloudelastic via the new update pipeline backfill mechanism.

All indexes on eqiad and codfw are reindexed.

Reindexing on eqiad and codfw took almost exactly two weeks, though I increased the parallelism along the way. I started off with two threads, one per cluster (codfw & eqiad). I ran commons and wikidata early to get them out of the way (though one of the commons reindexings failed). Starting around the letter f for the slower eqiad cluster, I divided wikis by subcluster (ports 9243, 9443, 9643), giving six reindexing threads. After the "small" subclusters finished, around the letter h for the remaining subcluster on eqiad, I divided the remaining wikis into "low shard" (max 1 or 2 shards for any index) and "multi-shard" (3 or more shards for at least one index) and ran those four threads in parallel.
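
A rough sketch of what that splitting amounts to in shell (illustrative only; the wrapper, the dblist names, and the reliance on the standard UpdateSearchIndexConfig.php reindex invocation are assumptions, not a copy of what was actually run):

# Illustrative only: reindex each group of wikis in its own thread.
reindex_wikis() {
    local dblist=$1
    while read -r wiki; do
        mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php \
            --wiki="${wiki}" --reindexAndRemoveOk --indexIdentifier now
    done < "${dblist}"
}

# e.g. one thread per subcluster, or split by shard count as described above.
reindex_wikis low-shard-wikis.dblist &
reindex_wikis multi-shard-wikis.dblist &
wait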

Running two reindexing threads in one cluster didn't seem to have any effect on the obvious metrics in Grafana, other than spikes in network traffic when the really big indexes got moved around (esp. commons and wikidata).

[Parallel reindexing stuff deleted. The new SUP/backfill process will require a different approach because some parts are parallelizable and some are not.]

All indices on cloudelastic look to be recreated now as well. It hasn't been running this whole time; it just took me a while to get around to verifying the operation and finishing the couple of wikis that failed the first two times through.