Wikipedia talk:Salting is usually a bad idea

Latest comment: 1 month ago by Alfa-ketosav in topic Titles

Untitled

The rule of thumb I use is whether it seems more important to the repeat-creators to add their content at a specific title, or just anywhere they can. If someone's trying to get, say, their company's page on Wikipedia, salting and blacklisting usually work: they're not going to play l33tspeak games with the title like "Micr0søft Inc.", there's a finite number of title variants they'll try, and Special:Linksearch is usually pretty good at finding them at the stage between salting the first one or two titles and progressing to a blacklist regex. The other end is comparable to semi-protecting WP:AUTOBIO, which is just insane - every edit reverted from a page like that is one that was immediately seen, that never showed up in mainspace, and that usually didn't turn into a draft that someone had to review and decline and eventually delete. —Cryptic 22:41, 23 August 2022 (UTC)

Edit filters

Also, I think edit filters work best for LTAs: the filter stops them from creating the page, even though they may keep trying. Thingofme (talk) 15:03, 28 August 2022 (UTC)

Titles

Actually, titles cannot be longer than 255 bytes, not 256 characters. Given that the bytes 00 to 1F (dec. 0 to 31) and 7F (dec. 127) are not valid characters in titles, there are 223 possible byte values.   This means that while there are still very many possibilities, their number is less than the square root of the one given in this essay. Alfa-ketosav (talk) 17:16, 2 April 2024 (UTC)

Actually:

  • {, }, |, [, ], <, > and # can't be part of a title, meaning only 215 bytes are available.
  • A title's first byte can't be :, a space or in the 80–BF range (these are continuation bytes, which may only follow a lead byte that also defines how many of them are used), reducing the available first bytes to 149.
  • F8 to FF are unused in UTF-8 for compatibility with UTF-16, reducing the number of available bytes to 207 (141 for the first byte).
  • C0 and C1 are unused to prevent longer-than-necessary byte sequences, reducing the number of available bytes to 205 (139 for the first byte).
  • The final byte can't be higher than BF, so it can have 150 different values. The penultimate byte can't be higher than DF, so it can have 181 different values, and the third-to-last byte can't begin with the hex digit F, so it can have 197 values.
  • The space is treated as equivalent to _, reducing these values to 204 (138 for the first, 149 for the last, 180 for the penultimate byte and 196 for the one before that).
  • Finally, upper- and lowercase letters are treated equally at the start of the title, reducing the number of possible first-byte values to 112.

Thus, a better upper limit of the number of possible titles is   almost 20 billion times lower than the one above. Alfa-ketosav (talk) 18:24, 3 April 2024 (UTC)
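
The per-position byte counts in the list above can be re-derived mechanically. The following is a minimal sketch in Python, not anything taken from MediaWiki: the function name `title_byte_counts` and the step labels are made up for illustration. One step rests on an assumption: the stated rule "not higher than BF" alone leaves 151 candidates for the final byte, so the sketch additionally excludes a trailing space to arrive at the 150 given in the list.

```python
def title_byte_counts():
    """Re-derive the per-position byte counts from the list above."""
    counts = {}

    # All byte values except the control bytes 00-1F and 7F.
    any_pos = set(range(0x20, 0x100)) - {0x7F}
    counts["no control bytes"] = len(any_pos)

    # { } | [ ] < > # cannot appear in a title.
    any_pos -= set(b"{}|[]<>#")
    counts["no forbidden punctuation"] = len(any_pos)

    # First byte: no colon, no space, no continuation byte (80-BF).
    first = any_pos - set(b": ") - set(range(0x80, 0xC0))
    counts["first byte"] = len(first)

    # F8-FF and C0-C1 never occur in UTF-8.
    never_used = set(range(0xF8, 0x100)) | {0xC0, 0xC1}
    any_pos -= never_used
    first -= never_used
    counts["any / first, UTF-8 only"] = (len(any_pos), len(first))

    # Positional limits near the end of the title.
    # ASSUMPTION: a trailing space is excluded too (gives 150, not 151).
    last = {b for b in any_pos if b <= 0xBF} - {0x20}
    penultimate = {b for b in any_pos if b <= 0xDF}
    third_to_last = {b for b in any_pos if b < 0xF0}
    counts["last / penultimate / third-to-last"] = (
        len(last), len(penultimate), len(third_to_last))

    # Space and underscore are equivalent: drop the underscore everywhere.
    underscore = {0x5F}
    counts["after space = _"] = tuple(
        len(s - underscore)
        for s in (any_pos, first, last, penultimate, third_to_last))

    # The first letter is case-insensitive: drop ASCII a-z from the first byte.
    first_folded = first - underscore - set(range(0x61, 0x7B))
    counts["first byte, case-folded"] = len(first_folded)
    return counts


if __name__ == "__main__":
    for step, value in title_byte_counts().items():
        print(f"{step}: {value}")
```

The sketch only reproduces the per-byte counts; it does not attempt the final product, whose exact form is not recoverable from the text.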

@Alfa-ketosav: Corrected. Although, does this account for the fact that sequences of multiple spaces and underscores are treated as a single space, and cannot end a title? (For instance, X_______X (band) actually just has one space and Lozman v. City of Riviera Beach, 585 U.S. ___ actually ends at ".") Or would that not change things by more than an order of magnitude? -- Tamzin[cetacean needed] (they|xe) 00:30, 17 August 2024 (UTC)
Partly (not for multiple spaces, yes for _). Alfa-ketosav (talk) 03:34, 17 August 2024 (UTC)
I changed the above number due to errors I made when checking. Alfa-ketosav (talk) 12:42, 18 August 2024 (UTC)

F5–F7 are also unused (I checked this by encoding a code point whose first byte is F4), as the highest code point, U+10FFFF, corresponds to F4-8F-BF-BF. Thus, the upper bound is  , c. 2.24% of the above number. Alfa-ketosav (talk) 12:42, 18 August 2024 (UTC)
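
The F4-8F-BF-BF claim is easy to check with any strict UTF-8 codec. A minimal sketch using Python's built-in codec follows; the helper name `is_valid_utf8` is made up for illustration, and Python's decoder is only a stand-in for whatever validation MediaWiki itself performs.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes under Python's strict UTF-8 codec."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The highest Unicode code point and its UTF-8 encoding.
highest = chr(0x10FFFF).encode("utf-8")
print(highest.hex("-").upper())  # F4-8F-BF-BF

# One step past it (F4 90 80 80) is rejected, as is any sequence
# whose lead byte is F5, F6 or F7.
print(is_valid_utf8(bytes([0xF4, 0x90, 0x80, 0x80])))  # False
for lead in (0xF5, 0xF6, 0xF7):
    print(hex(lead), is_valid_utf8(bytes([lead, 0x80, 0x80, 0x80])))  # all False
```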

The first 2 bytes may not be b, c, d, f, m, n, q, s, v or w followed by a colon (ten 2-byte sequences). If the first byte is in C2–F4, the second byte must be in 80–BF (except for F4, where it must be in 80–8F, because code points end at U+10FFFF) (7035 sequences), and if the first byte is in 20–7E, the second byte cannot be in 80–BF (5568). "./" cannot occur at the beginning of a title (1), as it redirects to the title without "./". So the number of valid 2-byte starting sequences is no more than  . The theoretical maximum number of titles is thus  , less than half the above value. Alfa-ketosav (talk) 12:46, 19 August 2024 (UTC)

More things:

  • The small/capital rule also applies to Greek letters, of which there are 24 (the compatibility character µ is also recognized as a mu and capitalized, like the Greek letter μ; ς is a sigma variant used at the ends of words; and there is a variant each for phi and pi, so there are 28 small-letter variants). However, there are also at least 6 archaic variants (e.g. heta or sampi) and 9 variants with diacritics (7 with an acute accent, 2 with a diaeresis), so there are another 35 sequences fewer than above (=9258 seq.). However, this also applies to Cyrillic letters (73 Slavic + 152 non-Slavic + 202 archaic = 427 letters in total). Also, non-breaking spaces (NBSPs) are regarded as regular ones (9257). E0 may not be followed by 80–9F, as that would be "overlong", and F0 may not be followed by 80–8F, so only 9209 sequences remain.
  • Since NBSPs are treated as normal spaces, the final 2 bytes may not be C2 A0 either (26,819 sequences remaining there).
  • The final 3 bytes cannot be "/..", since such a title resolves to the string with both the "/.." and the path segment before it removed (5,256,523).
  • For 3-byte beginnings:
    • Starting with   sequences:
    • E0 A0–BF and E1–F3 80–BF or F4 80–8F has to be followed by 80–BF (170,640).
    • &X; (where X is a non-empty alphanumeric string) is not valid (62, case-insensitivity).
    • %XX (where XX is a hexadecimal string) is not valid (484, case-insensitive).
    • E2 80 80–8A, E2 80 8E–8F, E2 80 A8–AF, E2 81 9F, E3 80 80, ED A0 80–ED BF BF, EF BF BD–BF are invalid (2072).
    • ~~~ is also not valid (1).
    • "X :" (X=b, c, d, f, m, n, q, s, v, w) is equivalent to X: (10)
    • There are 109 multiple-diacritic variants of Greek letters between E0 A0 80 and EF BF BF, all with capital versions.
    • There are 177 officially used 2-letter Wikipedia language abbreviations, plus 3 (nb → no, cz → cs, jp → ja) that redirect to a language version with another abbreviation, plus 2 abbreviations for Wikipedia and Wikipedia talk (=182). "mw:" redirects to the MediaWiki wiki. (366)
  • "../" or "/./" cannot be the beginning of a title (2).

So, there are no more than   possible valid titles. Alfa-ketosav (talk) 16:06, 22 August 2024 (UTC)
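
Several of the byte-level facts used in the list above (the NBSP being C2 A0, the overlong ranges after E0 and F0, the surrogate range ED A0 80–ED BF BF, and the capitalization of µ and ς) can be spot-checked. The sketch below uses Python's strict UTF-8 codec and its Unicode case mapping as stand-ins; whether MediaWiki's own title normalization agrees in every case is an assumption, and the helper name `is_valid_utf8` is made up for illustration.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes under Python's strict UTF-8 codec."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The non-breaking space U+00A0 is encoded as C2 A0.
assert "\u00a0".encode("utf-8") == b"\xc2\xa0"

# E0 followed by 80-9F is overlong; E0 A0 80 is the first valid one (U+0800).
assert not is_valid_utf8(b"\xe0\x80\x80")
assert not is_valid_utf8(b"\xe0\x9f\xbf")
assert b"\xe0\xa0\x80".decode("utf-8") == "\u0800"

# F0 followed by 80-8F is overlong; F0 90 80 80 is the first valid one (U+10000).
assert not is_valid_utf8(b"\xf0\x8f\xbf\xbf")
assert b"\xf0\x90\x80\x80".decode("utf-8") == "\U00010000"

# ED A0 80 to ED BF BF would be UTF-16 surrogates and are rejected.
assert not is_valid_utf8(b"\xed\xa0\x80")
assert not is_valid_utf8(b"\xed\xbf\xbf")
assert b"\xed\x9f\xbf".decode("utf-8") == "\ud7ff"

# The micro sign and the final sigma both have Greek capital forms.
assert "\u00b5".upper() == "\u039c"  # µ -> Μ (capital mu)
assert "\u03c2".upper() == "\u03a3"  # ς -> Σ (capital sigma)

print("all checks passed")
```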

I said above that there are no more than 9209 valid 2-byte sequences. I checked all 187 Cyrillic letters in the English Wikipedia's "Cyrillic" character set and found 93 lowercase variants, each of which corresponds to an uppercase one. The palochka is only present as uppercase, but it has a lowercase version too (94 lowercase variants). So there are no more than 9115 valid 2-byte beginnings,   three-byte ones and   possible valid titles. The actual number is still much lower, as, for example, &X; is also not valid if X is a string containing alphanumeric or non-ASCII characters. Alfa-ketosav (talk) 12:46, 8 October 2024 (UTC)