
Reindex all wikis to enable apostrophe normalization, camelCase handling, acronym handling, word_break_helper, and icu_tokenizer/_repair
Closed, Resolved · Public · 5 Estimated Story Points

Details

Other Assignee
TJones

Event Timeline

I'll do a write-up of the before-and-after impact of the reindexing and post a link here, but anyone can do the reindexing and finish the ticket without that.

Gehel triaged this task as Medium priority. Jul 24 2023, 3:34 PM
Gehel moved this task from needs triage to Current work on the Discovery-Search board.
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.
Gehel set the point value for this task to 3.

Mentioned in SAL (#wikimedia-operations) [2023-08-15T20:36:47Z] <ebernhardson> T342444 start cirrussearch reindex of all wikis to enable new text analysis components from mwmaint1002

Reindex is still in progress, going a bit slower than I remember from past runs. Each cluster has processed ~200 out of 1000 wikis, but the absolute number is fairly meaningless. The important question would be the % of docs reindexed, which we don't have any easy-to-reference answer for. Taking into account that wikidata and commonswiki ran in parallel to this, we've probably reindexed about half the docs, but it will probably take the rest of the week or into next week for this to finish.
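
One rough way to approximate that percentage (a hypothetical sketch, not something we had handy at the time): compare the doc count of indices created since the reindex started against the total. The cutoff below is the reindex start time from the SAL entry above (2023-08-15T20:36:47Z) as epoch milliseconds, and the _cat/indices column names are assumed from Elasticsearch's cat API:

curl -s 'http://localhost:9200/_cat/indices?h=docs.count,creation.date&format=json' |
  jq --argjson cutoff 1692131807000 '
      # total docs across all indices vs. docs in indices created after the cutoff
      (map((.["docs.count"] // "0") | tonumber) | add) as $total
    | (map(select((.["creation.date"] | tonumber) > $cutoff)
           | (.["docs.count"] // "0") | tonumber) | add // 0) as $new
    | {total_docs: $total, reindexed_docs: $new, pct: (100 * $new / $total)}'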

Still in progress, only up to enwiki. This may indicate something is wrong/more expensive about the new mappings, or it could be an unrelated issue.

Sounds like my local reindexing is insufficient for detecting non-egregious slowdowns in indexing speed. (I know I have other overhead; I guess it's even more than I thought.) Should we pause the reindex and investigate more thoroughly on RelForge, with the possibility of reverting some changes after finding the slowest ones?

Decided we don't need to pause the reindex yet. Other than the indexing latency, nothing appears to be complaining too loudly. To get an idea of how the mappings affect things, I ran the following test, which imports 5k docs from mediawiki.org to measure indexing time. This took ~55s with the old mappings and ~2m40s with the updated mappings.

Settings and mappings collected from the MediaWiki API.
Manually adjust the settings.json files to prune settings Elasticsearch owns, like uuid and provided_name. Change the shard count to 1.
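
A rough sketch of those two steps (assuming the standard CirrusSearch dump API modules; the exact shape of their output, and therefore the jq paths below, may need adjusting, and the output file names are chosen to match what the test function below expects):

# Fetch the live settings and mappings via the CirrusSearch dump API modules.
api='https://www.mediawiki.org/w/api.php'
curl -s "${api}?action=cirrus-settings-dump&format=json" > settings.raw.json
curl -s "${api}?action=cirrus-mapping-dump&format=json" > mapping.raw.json

# Prune the fields Elasticsearch owns and force a single shard. This assumes the
# relevant settings object has already been extracted to the top level of
# settings.raw.json; the mapping dump needs a similar extraction into mapping.bad.json.
jq 'del(.index.uuid, .index.provided_name, .index.creation_date, .index.version)
    | .index.number_of_shards = "1"' settings.raw.json > settings.bad.json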

Capture a dump to import:

  • curl https://dumps.wikimedia.org/other/cirrussearch/20230821/mediawikiwiki-20230821-cirrussearch-content.json.gz | zcat | head -n 10000 > content.json

Elasticsearch instance used, restarted once for each test category:

  • docker run --rm -p 9200:9200 -e ES_JAVA_OPTS="-Xms4096m -Xmx4096m" -e discovery.type=single-node docker-registry.wikimedia.org/repos/search-platform/cirrussearch-elasticsearch-image:v7.10.2-5

Test function:

test_latency() {
    # Which index to (re)create and which settings/mapping variant to load;
    # expects settings.${index_kind}.json and mapping.${index_kind}.json on disk.
    index=enwiki_content
    index_url="http://localhost:9200/${index}"
    index_kind=bad
    # Drop any previous test index, then recreate it with the chosen settings
    # and mapping.
    curl -s -XDELETE "${index_url}" | jq -c .
    curl -s -XPUT "${index_url}" \
        -H 'Content-Type: application/json' \
        --data-binary "@settings.${index_kind}.json" | jq -c .
    curl -s -XPUT "${index_url}/_mapping" \
        -H 'Content-Type: application/json' \
        --data-binary "@mapping.${index_kind}.json" | jq -c .
    # Feed the dump through _bulk in batches of 10 lines (5 docs per batch,
    # since each doc is an action line plus a source line) and print the
    # elapsed wall-clock time reported by the external time command.
    cat content.json | time split -l 10 \
        --filter "curl -s -H 'Content-Type: application/json' --data-binary @- '${index_url}/_bulk' >/dev/null" 2>&1 \
        | awk '/elapsed/ {print substr($3, 1, length($3) - 7)}'
}
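
For back-to-back comparisons, a small hypothetical tweak: change the hard-coded index_kind=bad to index_kind="${1:-bad}" so the variant is selectable per run. The good/bad naming for the old vs. new mappings is inferred from the file naming above, and the Elasticsearch container should still be restarted between runs as noted:

# Hypothetical usage, assuming the one-line tweak described above.
for kind in good bad; do
    echo "== ${kind} settings/mappings =="
    test_latency "${kind}"
done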

The current reindexing has been stopped. We will re-run the reindexing once we have an update available to address the latency regression. The stop isn't due to a particular issue; production seems to be continuing on as normal, if a little slower at indexing. But since we plan on adjusting the mappings, continuing the current reindexing seems fairly pointless and restricts SRE from doing restarts.

TJones renamed this task from Reindex all wikis to enable apostrophe normalization, camelCase handling, acronym handling, and word_break_helper to Reindex all wikis to enable apostrophe normalization, camelCase handling, acronym handling, word_break_helper, and icu_tokenizer/_repair. Feb 5 2024, 3:55 PM
TJones updated the task description. (Show Details)
TJones updated the task description. (Show Details)
TJones changed the point value for this task from 3 to 5.

Adding Erik as "other assignee" (never done that before) and increasing the points because Erik is doing more than usual for reindexing: watching the cloudelastic reindex using the new update pipeline backfilling mechanism. And I've been doing more than usual gathering stats while fretting a bit over reindex speed.

TJones updated Other Assignee, added: TJones; removed: EBernhardson.

Swapping Assignee and Other Assignee with Erik, since he's still working on reindexing cloudelastic via the new update pipeline backfill mechanism.

All indexes on eqiad and codfw are reindexed.

Reindexing on eqiad and codfw took almost exactly two weeks, though I increased the parallelism along the way. I started off with two threads, one per cluster (codfw & eqiad). I ran commons and wikidata early to get them out of the way (though one of the commons reindexings failed). Starting around the letter f for the slower eqiad cluster, I divided wikis by subcluster (ports 9243, 9443, 9643), giving six reindexing threads. After the "small" subclusters finished, around the letter h for the remaining subcluster on eqiad, I divided the remaining wikis into "low shard" (max 1 or 2 shards for any index) and "multi-shard" (3 or more shards for at least one index) and ran those four threads in parallel.
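
A rough sketch of what that splitting amounts to in shell (illustrative only; the wrapper, the dblist names, and the reliance on the standard UpdateSearchIndexConfig.php reindex invocation are assumptions, not a copy of what was actually run):

# Illustrative only: reindex each group of wikis in its own thread.
reindex_wikis() {
    local dblist=$1
    while read -r wiki; do
        mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php \
            --wiki="${wiki}" --reindexAndRemoveOk --indexIdentifier now
    done < "${dblist}"
}

# e.g. one thread per subcluster, or split by shard count as described above.
reindex_wikis low-shard-wikis.dblist &
reindex_wikis multi-shard-wikis.dblist &
wait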

Running two reindexing threads in one cluster didn't seem to have any effect on the obvious metrics in Grafana, other than spikes in network traffic when the really big indexes got moved around (esp. commons and wikidata).

[Parallel reindexing stuff deleted. The new SUP/backfill process will require a different approach because some parts are parallelizable and some are not.]

All indices on cloudelastic look to be recreated now as well. It hasn't been running this whole time; it just took me a while to get around to verifying the operation and finishing the couple of wikis that failed the first two times through.