User:Polbot/ideas/defaultsort

This bot request page is interesting. Apologies if this sort of request is the wrong sort of request, but I wondered if the following data gathering processes are best done with a bot or other ways, or if the processes are even possible?

  • (1) Scrape a transclusion list from "what links here" for {{WPBiography}} (warning: this is a large list of nearly 400,000 articles). This will be a list of talk pages of biographical articles, with the talk pages to be examined, along with the associated articles, in the next step. If possible, exclude from this list the pages that have "non-bio=yes" in the WPBiography template.
  • (2) Run down the list and examine the talk pages and associated articles to determine the following:
    • (a) Examine the talk page to see if the WPBiography template has a "listas=" parameter (excluding cases where 'listas' is there but blank). If it does, extract and list the listas parameter.
    • (b) Examine the talk page to see if it has a DEFAULTSORT magic word on the page. If it does, extract and list the sort value being used. If it doesn't, note this as well.
    • (c) Examine the article page to see if it has a DEFAULTSORT magic word on the page. If it does, extract and list the sort value being used. If it doesn't, note this as well.
    • (d) Examine the article page to see if category pipe-sorting is being used. (Should be anything after a "|" in the category tags). If so, extract and list the sort values being used (there may often be more than one sort being used). If no pipe-sorting is being used (sometimes the pipe-sort character is there with nothing after it), note this as well.

The aim of this is to try and get a handle on pipe-sorting, DEFAULTSORTing, listas, and sort them out, but for biographical articles only. These are the ones that most urgently need this to be sorted out, and the vast majority will simply be a switch from "NAMES SURNAME" to "SURNAME, NAMES". The next step would be getting humans to pore over this data and decide what to do next. The ultimate aim is to have categories involving people properly pipe-sorted so that they can be reliably browsed.

    • (e) As a bonus, could the bot-generated list also state whether the article contains Wikipedia:Persondata data, and extract that as well, as this often contains relevant information on how the article should be listed in a list or category?

Some relevant links: {{WPBiography}}, Wikipedia:Category#Category sorting, Wikipedia:Category#Setting a default sort key and Wikipedia:Persondata. Note also {{DEFAULTSORT}}, as many people use the template pipe character (|) instead of the magic word colon (:).
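A minimal sketch of the extraction asked for in (c) and (d), assuming the page wikitext is already available as a string. The function names are made up for illustration, matching case-insensitively is a lenient assumption, and blank or space-only category pipes are treated as "no pipe-sort":

```python
import re

# Matches both {{DEFAULTSORT:key}} (magic word) and {{DEFAULTSORT|key}} (template form).
# Ignoring case here is an assumption, made to be lenient.
DEFAULTSORT_RE = re.compile(
    r"\{\{\s*DEFAULTSORT\s*[:|]\s*([^{}]*?)\s*\}\}", re.IGNORECASE
)

# Matches [[Category:Name]] and [[Category:Name|key]].
CATEGORY_RE = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|([^\]]*))?\]\]", re.IGNORECASE
)


def find_defaultsort(wikitext):
    """Return the DEFAULTSORT key, or None if it is absent or blank."""
    match = DEFAULTSORT_RE.search(wikitext)
    if match is None:
        return None
    return match.group(1) or None


def find_category_pipes(wikitext):
    """Return {category name: sort key} for categories with a non-blank pipe."""
    pipes = {}
    for name, key in CATEGORY_RE.findall(wikitext):
        # An empty or whitespace-only key counts as "no pipe-sorting".
        if key.strip():
            pipes[name] = key.strip()
    return pipes
```

Regular expressions like these will miss keys that contain nested templates, so the output would still need the human review described above.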

So, is a bot the best way to get this data, or do I need to persuade someone to do lots of datamining on a database dump (I wouldn't really know how to handle or process a database dump)? Carcharoth 16:25, 13 June 2007 (UTC)

So you know, this is a page for requesting people write bots. You would probably get a better response on the talk page of WP:B. In response, the database dumps can be months old - if it doesn't need to be latest data they are the easiest way to go. If it does, a standard bot is the best way to go, and if it's not making any edits, but reading a lot of data, it may be questionable as to whether it is approved (unless the data it generates will be incredibly useful). Matt/TheFearow (Talk) (Contribs) (Bot) 23:44, 13 June 2007 (UTC)
I realise this is a page to request the writing of bots, though I wasn't aware that data-gathering bots were questionable. I saw Wikipedia:Bot requests#Data gathering bot on Unused Images above and assumed data gathering was OK. As for whether the data it generates will be useful or not, it would be used to start up a taskforce (combining humans and bots) that would add DEFAULTSORT keys to biographical articles so that they appear in alphabetical order in categories. Surely that is helpful? Carcharoth 01:37, 14 June 2007 (UTC)
This is the sort of thing that a bot could do. I've written a few bots that did somewhat similar things. The major problem I see is the sheer size of the thing. In several places, you say "note this as well". It seems to me that this would end up being a list of half a million articles, with info next to each name. This is too big to have as a wikipedia page, although it could be a series of pages. I guess I'd like to know what precisely you'd want the output to be. Would this be the sort of data you'd want?
[[James Manning Tyler]] | Tyler, James Manning | Tyler, James
(Imagine for this example that a category pipe gave the sort-order name as "Tyler, James", and a biobox listas gave the name as "Tyler, James Manning"). Is this the sort of output you would be looking for? Also, would you want to ignore those articles that already have {{DEFAULTSORT}}? Please provide as much detail as you can about the requested output. All the best, – Quadell (talk) (random) 10:27, 14 June 2007 (UTC)
I can see that the sheer size might be a problem! The ultimate aim is to ensure that all biographical articles have a DEFAULTSORT key and that it is consistent with the listas parameter, the Persondata name parameter, the category pipe-sorting parameters, and also the hCard format. As you can see, there are at least five different places where such name/sort key information is being deposited or left out altogether, and this duplicates efforts. Putting that aside for the moment, a more precise request might be:
  • Put articles with no DEFAULTSORT or listas on one list, articles with DEFAULTSORT only on another list, and articles with listas only on a third list (by all means keep the lists offline or break them up if they are too big). The aim would be then to run a bot to add the DEFAULTSORT as a listas parameter, and the listas as the DEFAULTSORT parameter. The list of articles with neither could then be inspected by bots for existing category pipe-sorting and checked by humans to see if that data is suitable for bot-adding of the category pipe-sorting as DEFAULTSORT and listas. If no category pipe-sorting existed, humans would inspect the list and confirm or correct an automatically generated suggested DEFAULTSORT key that a bot would then add as DEFAULTSORT and listas. Care would need to be taken to avoid listing blank listas (something like {{WPBiography|listas=|living=yes|class=...}}) and blank category pipes ([[Category:Foo|]]) as "existing", when in fact they are blank (I think that example doesn't pipe-sort the category). Care is also needed to search both for the template {{DEFAULTSORT|sortkey}} (technically incorrect, but allowed to work by the existing template) and the magic word {{DEFAULTSORT:sortkey}} (the correct way to use DEFAULTSORT). Is that any clearer? :-) Carcharoth 11:41, 14 June 2007 (UTC)
PS. Cases where category pipe-sorting and DEFAULTSORT and listas are similar but different are problematic. I'd like to assume that DEFAULTSORT will always be correct, but maybe, as in the James Manning Tyler example you give, the person filling in the listas might have put "Manning Tyler, James" (some names are like this - non-hyphenated surnames) and the person filling in the DEFAULTSORT might have put "Tyler, James Manning". If they differ, a human would be needed to check. So any list should also be comparing the values. But I can do that offline. It's getting the data that is the problem. The other problem is that, quite legitimately, different systems of pipe-sorting can exist in the same article, so there may be an unknown number of different "|sortkey" bits in the category tags. Hopefully not for biographical articles, but you never know. Carcharoth 12:03, 14 June 2007 (UTC)
PPS. Ideally, the output I would be looking for is: "article name", "DEFAULTSORT key", "listas parameter", "persondata name parameter", "category pipe-sorting parameters". With "none" where there is nothing there. All of these, except the category pipe-sorting parameters, should only return one result for each article. The aim then would be to standardise across these formats, and fill in the blanks, then get a bot to update the articles with the new information. Persondata name would only be corrected if it was already there. Redundant biographical category pipe-sorting keys could be removed, but that is probably unnecessary, though incorrect ones would be removed. If you can put magic words in templates, then possibly DEFAULTSORT could be placed inside Persondata, but that would need testing and discussion. Does that output list above help? Carcharoth 12:03, 14 June 2007 (UTC)
  • Regarding the size issue. The last time someone generated a list it was 376,274 articles and (for some reason) was 10Mb in size. They made it available as Image:Bio list.sxw, which is a 3.4Mb file. Maybe future lists could be handled that way? Carcharoth 12:09, 14 June 2007 (UTC)
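The three-list split requested in this comment (DEFAULTSORT only, listas only, neither) could be sketched roughly as follows. This is only an illustration: `find_listas` and `bucket` are hypothetical names, the regular expression assumes the {{WPBiography}} call contains no nested templates, and a fourth "both" bucket is added so the two values can be compared later.

```python
import re

# Assumption: the WPBiography call has no nested {{...}} inside it.
WPBIO_RE = re.compile(r"\{\{\s*WPBiography\b([^{}]*)\}\}", re.IGNORECASE)


def find_listas(talk_wikitext):
    """Return the listas value from {{WPBiography}}, or None if missing or blank."""
    match = WPBIO_RE.search(talk_wikitext)
    if match is None:
        return None
    for part in match.group(1).split("|"):
        name, sep, value = part.partition("=")
        if sep and name.strip() == "listas":
            # "listas=" with nothing after it is treated as not present.
            return value.strip() or None
    return None


def bucket(defaultsort, listas):
    """Classify an article by which of the two sort keys it already has."""
    if defaultsort and listas:
        return "both"
    if defaultsort:
        return "defaultsort_only"
    if listas:
        return "listas_only"
    return "neither"
```

With this, a talk page containing `{{WPBiography|listas=|living=yes|class=...}}` yields `None` for listas, so the blank parameter is not counted as "existing".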

Questions, by – Quadell (talk) (random) 12:38, 14 June 2007 (UTC)

  1. Are there any articles you would not want listed? (If so, this will keep the size down a little.) For instance, should the bot ignore those articles that already have a DEFAULTSORT? Or those where all fields agree on the same sortname?
  2. Do you want it to fix obvious cases while the bot has those articles open? For instance, if all fields agree on the sortname, but there is no DEFAULTSORT, would you want the bot to add a DEFAULTSORT?
Replies:
  1. Only ignore those that all agree on the same sortname. I'd still like to detect the articles that have DEFAULTSORT, but lack listas, if possible, unless of course you decide to fix this on the fly.
  2. If all the other fields agree (or the other fields are blank), then yes, add DEFAULTSORT where listas already exists (within WPBiography, not within other talk page templates). Could the bot also do the reverse in obvious cases? Use an existing DEFAULTSORT key to add any missing listas WPBiography parameters?
I suspect a trial run would be good first. Then a review of what the bot did in different cases. Then carrying out the operation slowly in batches. An informative edit summary would be good as well. For something touching so many articles, I'd also be happier with more input in the final stages once you are close to having something workable to run. I presume, if the right different edit summaries were used, that I could use the bot's contributions list to get an idea of how many corrections of each type were carried out? A number on how many it ignored would be good as well, just to give an idea of the scale of the problem. Finding out that 300,000 of 350,000 articles already have identical listas and DEFAULTSORT sort keys would be a nice surprise! I looked at "what links here" for {{Persondata}} and it is used in just over 10,000 articles, so the vast majority of the nearly 400,000 biographical articles won't have that. I can't gather data for listas and DEFAULTSORT though, as the former is in a protected template (someone could add an invisible redlink to the template and use what links here to detect how many transclusions of the template use listas), but it is detecting the articles using DEFAULTSORT that has me stumped. Also, don't forget to exclude the articles with the WPBiography parameter non-bio=yes. Carcharoth 13:15, 14 June 2007 (UTC)

Looking at some examples (which would help), I picked three articles from the what links here list of the Persondata template:

  • Michael Mietke - DEFAULTSORT and Persondata name are both "Mietke, Michael" - no WPBiography template on the talk page - no category pipe-sorting.
  • Johnny Ashcroft - no DEFAULTSORT, Persondata name is Ashcroft, Johnny - no WPBiography template on the talk page - category pipe-sorting is "Ashcroft" (not sufficient - a good example of why relying on pre-existing pipe-sorting of categories is not a good idea).
  • Abraham Lincoln - DEFAULTSORT and Persondata name are both "Lincoln, Abraham" - lacks a "listas" parameter in the WPBiography template - one of the categories is pipe-sorted with a " " (space).

Firstly, it seems that some articles are still missing WPBiography templates. Someone should run a bot over all the articles with Persondata templates and give them a WPBiography template if they don't have one. The first two examples I gave wouldn't have been included in a transclusion list of {{WPBiography}}, but if they had, the actions for the three examples would have been: for Michael Mietke, add "listas=Mietke, Michael" to WPBiography; for Johnny Ashcroft, add "{{DEFAULTSORT:Ashcroft, Johnny}}" to the article and "listas=Ashcroft, Johnny" to WPBiography, and remove the "Ashcroft" pipe-sorting; for Abraham Lincoln, add "listas=Lincoln, Abraham" to WPBiography. I think that is it. Of course, problems arise when you have articles like George Washington (inventor), which lacks any sort keys whatsoever. Also, Elizabeth I of England, where the Persondata name is "Elizabeth I", but the DEFAULTSORT key is Elizabeth I of England, and there is no listas parameter. Getting into Asian naming conventions and other issues is also tricky. It needs to be emphasised that this initial step is only (a) gathering data where there are inconsistencies and (b) standardising where one of two overlapping parameters is missing (DEFAULTSORT and listas). The Persondata name field can be used to suggest a DEFAULTSORT, but a DEFAULTSORT value mustn't be used to add a Persondata template - those need to be added by humans. Carcharoth 13:41, 14 June 2007 (UTC)

New discussion

  • One thing I forgot, is where to place a new DEFAULTSORT key. I've generally found people putting it immediately above the categories. Carcharoth 14:08, 14 June 2007 (UTC)
    • Yes, I think that would be best. – Quadell (talk) (random) 22:46, 14 June 2007 (UTC)
  • You mentioned looking for the DEFAULTSORT word on the talk page. Are you sure it can go there? I haven't seen anything on any of the help files to suggest it would work there. – Quadell (talk) (random) 20:01, 14 June 2007 (UTC)
    • Not sure. Which help files do you mean? You could ask the developer that implemented this. See Wikipedia:Wikipedia Signpost/2007-01-02/Technology report. Or you could just put it on a talk page with categories and see if it works... :-) Anyway, using listas for the talk page should be enough for now. If DEFAULTSORT catches on on talk pages, that can be changed later. Carcharoth 00:44, 15 June 2007 (UTC)
  • By the way, I've been working on just reading in the names of all the articles that transclude the WPBiography template. (The code module I usually use won't read more than 5000 "what links here" entries, which is enough for almost anything. Except this.) I finally got this to a textfile, and there are 387,267 biographies. More will follow. . . – Quadell (talk) (random) 20:43, 14 June 2007 (UTC)
    • Not up to 400,000 yet? :-) Carcharoth 00:44, 15 June 2007 (UTC)
  • Note that in some rare cases, we may actually want the DEFAULTSORT to be different than what's in the Persondata or WPBiography info. I'm thinking of Pope Leo IX, who should be sorted as "Leo 09" (since IX comes before V alphabetically), but should be listed as "Leo IX". – Quadell (talk) (random) 22:46, 14 June 2007 (UTC)
    • Ah, but listas really means sortas. :-) But yes, the Persondata name would use the Roman numeral, while the sort keys would use 09. Carcharoth 00:44, 15 June 2007 (UTC)
  • Now I'm confused about something. In the template description at Template:WPBiography, it says that the "listas" parameter is used for "bios whose title is in the first part of the article, i.e. Prince George of England, you can instead have listas=George of England, Prince -- so that it will show up in the G's..." Is the description wrong, or should we not be putting these in sort-order? – Quadell (talk) (random) 01:38, 15 June 2007 (UTC)
    • Using some real examples: Prince John of the United Kingdom is sorted under 'J'; Arthur Conan Doyle is sorted under 'D'; George I of England has no sort tags as it defaults to 'G' - not sure what to do in those cases, as there are lots of Georges, and they should be sorted in some sort of order... But that is the human-checking stage that comes later, so leave that for now. Carcharoth 09:39, 15 June 2007 (UTC)
      • Then I think the template's description should be changed to say that listas is a sortname and should be used whenever a lastname would be listed first for sort order. – Quadell (talk) (random) 13:50, 15 June 2007 (UTC)
  • For all WPBiography template questions, I suggest contacting User:Kingboyk. He and another user (User:Reedy Boy) have been running bots adding WPBiography to talk pages, and likely know best how to deal with this and the other issues you raise. Carcharoth 09:30, 15 June 2007 (UTC)
  • I think for the hcard data, you just need to look for the tags in the article. But I'm not sure how widespread the use of this is. Maybe best to leave this out for now, or ask at the WikiProject for microformats. I think the hcard tags are integrated into the infoboxes set-ups, but am not sure. Will have a look at Template:Infobox Biography. Carcharoth 09:39, 15 June 2007 (UTC)
    • Yes. Template:Infobox Biography#Microformat confirms this, but I can't work out which bits of the template are relevant. There are lots of people infoboxes as well, so I think it might be best to just leave hcard stuff for now. Carcharoth 09:43, 15 June 2007 (UTC)
      • I will gladly ignore hcard data for the time being. – Quadell (talk) (random) 13:50, 15 June 2007 (UTC)
  • Please look over the spec below. If anything looks problematic, let me know as soon as possible (so I don't code for things that will need to be taken out and changed.) And yes, I will definitely do a small trial run for investigation and comments before running this for the whole shebang. – Quadell (talk) (random) 13:50, 15 June 2007 (UTC)
    • All I can think of for the moment is to exclude cases where non-bio=yes. Many of these are not people (see the documentation at WPBiography for reasons why some non-people articles have WPBiography templates), so will mostly lack what you are looking for and trigger ni entries, so it may be OK to just let them accumulate in the log and be dealt with later. As for spotting problems before coding the bot, hopefully people have seen the bot request and are looking at this page. Is there a way to get more input before you code the bot? Carcharoth 14:32, 15 June 2007 (UTC)
      • Good point about non-bio=yes. I'll add that. I'm not sure how to get more input, other than asking around and waiting to code the bot. I can be patient. . . sort of. :-) – Quadell (talk) (random) 18:17, 15 June 2007 (UTC)
    • Also, the category stuff could be problematic. Many of the articles lacking birth and death year categories may actually have birth and death years in the articles, and so you may swamp the two categories you are proposing to add such articles to, depending on the scale of the lack of use of birth and death date categories. Carcharoth 15:05, 15 June 2007 (UTC)
      • It seems to me that would be a good thing. If an article describes a person who may-or-may-not be alive, whose text may-or-may-not indicate whether he's alive or not, but who doesn't have a death category, I think it would be better to have Category:Possibly living people so as to help draw attention to the lapse. Or if the person is (according to WPBio) dead, and the text may-or-may-not indicate when he died, but there are no death categories, I think it would be better to flood the Category:Year of death missing category with such cases to help get them fixed. But I would certainly feel better if I had more people's opinions on this. – Quadell (talk) (random) 18:17, 15 June 2007 (UTC)
    • May I ask whether you will be using different edit summaries for each type of operation? I ask because, even if you don't log the actions, the contributions list of the bot will log those if you use appropriate edit summaries. And I'd like to review such a contributions list in any case, and the edit summaries would really help. Carcharoth 15:05, 15 June 2007 (UTC)
      • Yes, I certainly will. I'll add that info in the spec. – Quadell (talk) (random) 18:17, 15 June 2007 (UTC)
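The regnal-number point raised earlier in this discussion (Pope Leo IX sorting as "Leo 09" so that IX does not land before V) can be illustrated with a small sketch. The helper names are hypothetical, and only numerals built from I, V and X are handled. A bot could not apply this blindly: a lone "I", "V" or "X" that is not a regnal number (an initial, for instance) would also be converted, so suggestions would need human checking.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10}

# Whole-word tokens made only of I, V and X (an assumption: small regnal numbers).
ROMAN_TOKEN_RE = re.compile(r"\b[IVX]+\b")


def roman_to_int(numeral):
    """Convert a Roman numeral made of I, V, X to an integer."""
    total = 0
    largest_seen = 0
    # Walk right to left; a smaller value before a larger one is subtracted.
    for char in reversed(numeral):
        value = ROMAN_VALUES[char]
        if value < largest_seen:
            total -= value
        else:
            total += value
            largest_seen = value
    return total


def regnal_sort_key(name, width=2):
    """Replace Roman-numeral tokens with zero-padded decimal numbers."""
    return ROMAN_TOKEN_RE.sub(
        lambda m: str(roman_to_int(m.group(0))).zfill(width), name
    )
```

For example, `regnal_sort_key("Leo IX")` gives `"Leo 09"` and `regnal_sort_key("Leo V")` gives `"Leo 05"`, which then sort in numeric order as plain strings.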

Detailed specification

Abbreviations

  • lv = the "living" parameter of {{WPBiography}}
  • dc = the death-related category (Just the year for a deathyear category, "dm" for a "date missing" category, "l" for living.)
  • la = the "listas" parameter of {{WPBiography}}
  • ds = the {{DEFAULTSORT}} magic word value
  • pn = the name parameter in {{Persondata}}
  • cp = a piped sortorder for a category. (There may be multiple cp entries.)
  • ni = no information (i.e. there is no sortorder information in la, ds, pn, or cp).

Decision/action tree

Make a list of all articles whose talk pages transclude {{WPBiography}}. For each article in the list, do the following.

  1. Read the article's talk page. Look for:
    1. the "non-bio" parameter of {{WPBiography}}. If it's yes, then skip this record. (No changes, no logs.)
    2. the "living" parameter of {{WPBiography}}. Store the value.
    3. the "listas" parameter of {{WPBiography}}. Store the value.
  2. Read the article. Look for:
    1. the {{DEFAULTSORT}} magic word (or template). Store the value (if any).
    2. the name parameter in {{Persondata}}. Again, store it (if it's there).
    3. any piped sortorders in categories. Store the category names and associated values.
    4. any death categories (Category:XXXX deaths, Category:Living people, Category:Year of death missing, Category:Year of death unknown, Category:Date of death missing, or Category:Date of death unknown). Store this.
  3. Compare these stored values.
    1. Compare death information.
      1. If there is neither an lv nor a dc, then add Category:Possibly living people. Don't log this article.
        • Edit summary: "Adding [[Category:Possibly living people]] - this person has no bot-readable death information. (Bot-generated edit)"
      2. If there's an lv and a dc, and they agree, then that's fine. Ignore this information, and don't log it.
      3. If there's an lv and a dc and they disagree, then log the conflicting information.
      4. If there's an lv but no dc, add Category:Living people or Category:Year of death missing. Don't log.
        • Edit summary: "Adding [[Category:Living people]], as indicated in {{WPBiography}} on talk page. (Bot-generated edit)". Or: "Adding [[Category:Year of death missing]], since {{WPBiography}} indicates subject is dead. (Bot-generated edit)"
      5. If there's a dc but no lv, add "living=no" or "living=yes" to the WPBio tag.
        • Edit summary: "Adding 'living=yes/no' to {{WPBiography}}, as indicated by categories. (Bot-generated edit)"
    2. Compare sortname information. Ignore cp values when they are a space (" ").
      1. If there is a ds. . .
        1. . . .and no other sortorder information contradicts it, then the info doesn't need to be logged.
          1. But if there's no la, fill it in.
            • Edit summary: "Adding 'listas=XXX' to match {{DEFAULTSORT}} magic word. (Bot-generated edit)"
        2. . . .and at least one other piece of sortorder information contradicts it, then log this and make no changes.
      2. If there is no ds, but there is an la. . .
        1. . . .and no other sortorder information contradicts it, then create a {{DEFAULTSORT}}. Don't log.
          • Edit summary: "Adding {{DEFAULTSORT:XXXX}}, as indicated in {{WPBiography}} on talk page. (Bot-generated edit)"
        2. . . .and something contradicts it, then log this and make no change.
      3. If there's no ds and no la, but there's a pn and/or cp, then log the info and make no changes.
      4. If there is no sortname information from any source, then the record should be logged as "ni".
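Step 3 of the tree above can be sketched as two small functions. This is an illustration under stated assumptions rather than the bot's actual code: the function names and the action labels ("add_listas", "log", and so on) are invented, lv is assumed to be "yes" or "no", dc uses the abbreviations defined earlier ("l", "dm", or a year), and "contradicts" is taken to mean an exact string mismatch.

```python
def compare_death(lv=None, dc=None):
    """Step 3.1: compare the living parameter (lv) with the death category (dc)."""
    if not lv and not dc:
        return ("add_category", "Possibly living people")
    if lv and dc:
        # lv=yes agrees only with the "Living people" category ("l").
        agrees = (lv == "yes") == (dc == "l")
        return ("none", None) if agrees else ("log", None)
    if lv:
        category = "Living people" if lv == "yes" else "Year of death missing"
        return ("add_category", category)
    return ("set_living", "yes" if dc == "l" else "no")


def compare_sortnames(ds=None, la=None, pn=None, cps=()):
    """Step 3.2: compare DEFAULTSORT (ds), listas (la), Persondata name (pn)
    and category pipes (cps)."""
    # Category pipes that are blank or a single space are ignored.
    cps = [cp for cp in cps if cp.strip()]

    if ds:
        others = [value for value in [la, pn, *cps] if value]
        if any(value != ds for value in others):
            return ("log", None)
        if not la:
            return ("add_listas", ds)
        return ("none", None)
    if la:
        others = [value for value in [pn, *cps] if value]
        if any(value != la for value in others):
            return ("log", None)
        return ("add_defaultsort", la)
    if pn or cps:
        return ("log", None)
    return ("log_ni", None)
```

Keeping the comparison separate from the editing means a trial run can print the chosen actions for review before any page is changed.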

File format for the logfile

The logfile will be structured as follows:

  • First, each record will begin with "* [[article name]]", followed by further "entries" which are separated by the pipe character (|).
  • Each entry consists of an abbreviation, an equals sign, and a value. The only exception is "ni", which does not have an equal sign or a value.

Examples:

* [[Person One]]|ni
* [[Person Two]]|cp=Person Two|cp=Two, Person
* [[Person III of Belgium]]|lv=yes|dc=1999|ds=Person 03 of Belgium|pn=Person III of Belgium
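A minimal sketch of writing and reading records in this format, assuming that neither article titles nor values contain the pipe character. The function names are hypothetical:

```python
def format_record(article, entries):
    """Build one log line. entries is a list of (abbreviation, value) pairs;
    an empty list produces the bare "ni" marker."""
    if not entries:
        return f"* [[{article}]]|ni"
    fields = "|".join(f"{abbr}={value}" for abbr, value in entries)
    return f"* [[{article}]]|{fields}"


def parse_record(line):
    """Split one log line back into (article, entries)."""
    head, *fields = line.strip().split("|")
    if not (head.startswith("* [[") and head.endswith("]]")):
        raise ValueError(f"not a log record: {line!r}")
    article = head[len("* [["):-len("]]")]
    entries = []
    for field in fields:
        if field == "ni":
            # "ni" carries no value, so it contributes no entry.
            continue
        abbr, _, value = field.partition("=")
        entries.append((abbr, value))
    return article, entries
```

Entries are kept as a list of pairs rather than a dictionary because a record may carry several cp entries.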

Further considerations

  • Consider trialing the process on the smaller list of articles with persondata first. This is also where the conflicts are most likely to arise.
  • Consider storing the sortable name on a subpage of the talk page and transcluding it into Persondata, DEFAULTSORT and listas. This subpage information could later be incorporated into metadata, if/when such ideas move forward.

Examples of processing

  • Ada Lovelace. Her WPBio info says living=no, but doesn't give a listas. She has DEFAULTSORT of 'Lovelace, Ada King, Countess of', and no piped sortorders in categories. Her death category is "1852 deaths".
    • The lv and dc match, so no change is needed and that information should not be logged. The ds is given and not contradicted anywhere, and the la is missing. The bot will fill in the "listas" parameter in WPBio, and this entry should not be logged at all, since there is no contradiction.
  • Martha Escutia. Her WPBio info says living=yes, and gives a listas of "Escutia, Martha". She has no defaultsort, but all her categories pipe as "Escutia, Martha". Her death category is "Living people".
    • The lv and dc match, so no change is needed and this information shouldn't be logged. The la and cp agree, so there is no discrepancy; however, ds is missing, so the bot would fill in the {{DEFAULTSORT}} keyword. This entry would not be logged.
  • Person One (contrived example). Imagine this article's death category is "Living people", but no other sortorder or death information is given.
    • The living parameter in WPBio would be filled in as "yes", and death info should not be logged. There are no sortorders in the article, so the bot has no information to work with: this record would be logged as:
* [[Person One]]|ni
  • Person Numero Dos (contrived example). Imagine this article has no death category, but their WPBio says "living=no". Also imagine that the only sortorder info is two category pipes, one saying "Numero Dos, Person" and the other saying "Dos, Person Numero".
* [[Person Numero Dos]]|cp=Numero Dos, Person|cp=Dos, Person Numero
  • Person III of Belgium (contrived example). Imagine this article's WPBio says "living=yes", and doesn't give a listas. Imagine that his article gives a death cat of "1999 deaths", has a DEFAULTSORT of "Person 03 of Belgium", and gives the name as "Person III of Belgium" in PersonData.
    • The WPBio and death category do not agree on whether the person is alive or not, so that should be logged. Also, the DEFAULTSORT and the name in PersonData do not agree, so this would be logged as well.
* [[Person III of Belgium]]|lv=yes|dc=1999|ds=Person 03 of Belgium|pn=Person III of Belgium

Further discussion

The plans here are very impressive and wide-ranging, and I see a great attention to detail, but I wonder if they cannot be broken down a little, rather than trying to do everything at once and get it all right the first time (I never do!). I did a little bit of data collection myself just to see the scale of the problem. As you point out, there are nearly 400K articles with WPBiography templates (I got 386964). This is a lot of work to process, even for information gathering, and is a lot of data to mess about with if the bot is more active.

On the other hand, there are far fewer articles with Persondata (I found 10179). Maybe we should try and work with these to start with: they are likely to be the more important articles anyway. One of the plans involves using a bot to add WPBiography templates to these articles where necessary. Given the vastly larger number of articles with templates, it is likely that most of these articles do already have templates. In fact there are only 1046 such articles without templates (and I can provide a list if you would like it). This is a small enough number to process by hand (e.g. using AWB); such processing has the advantage that these relatively important articles could be given priority and quality ratings, not just blank templates.

After this, it seems to me that it makes sense to focus on the 10179 articles with Persondata. These would now all have WPBiography templates, and so it is useful to know if the data is consistent, and maybe also if it is consistent with DEFAULTSORT or piping in categories.

If it isn't, then one has to decide how to make it consistent, and how to store the solution. I have suggested elsewhere that data such as a sortable name should be transcluded from a subpage (see Template talk:Persondata#Persondata on a subpage) and have demonstrated this concept at Alexander Grothendieck. It is not yet clear what is the best way forward, so any data migration should be flexible about the best solution.

If we can get this right for 10179 articles, and the solution is scalable, then it is time to implement this solution for the remaining 375000. Geometry guy 21:54, 15 June 2007 (UTC)

This sounds like a good idea. Quadell, if the bot is approved and after a few trial runs, would you be happy to limit the first full-scale run to the ~10,000 persondata articles? On a side point, it is a bit annoying that it is possible to detect the transclusions of the WPBiography (386964) and Persondata (10179) templates to come up with an idea of the scale of their use, but it seems that there is no way to easily measure the use of a magic word like DEFAULTSORT. Carcharoth 22:32, 15 June 2007 (UTC)
Want me to assess usage of DEFAULTSORT? It's actually a pretty simple routine, and not server-intensive, it'll just take a couple hours. — Madman bum and angel (talkdesk) 01:32, 25 June 2007 (UTC)
That would be great! What sort of information can you provide and how? Geometry guy 14:15, 25 June 2007 (UTC)
Seemed to me you just wanted to know the extent of its usage, i.e, how many talk pages are using that magic word. Yes? — Madman bum and angel (talkdesk) 14:27, 25 June 2007 (UTC)
This would certainly be a useful start, but I think the people here would like to know some intersections e.g., how many articles use both DEFAULTSORT and the Persondata template, and how many articles with DEFAULTSORT have the WPBiography template on the talk page? If either of these intersections is not too large, it would be nice to have a list of the articles in it. Ultimately the goal is to compare the DEFAULTSORT value with data stored elsewhere. However, any information which you can provide easily would be most welcome! Geometry guy 14:41, 25 June 2007 (UTC)
I'll assess usage of WPBiography, persondata, and DEFAULTSORT, then I can make any intersection you want. My bot will start this report shortly; I'm guessing it'll take about four-five hours. — Madman bum and angel (talkdesk) 14:49, 25 June 2007 (UTC)
Many thanks! Geometry guy 14:54, 25 June 2007 (UTC)
Haha. I thought it would. But it pulls the talk pages' contents, unlike my bot's other tasks, so it'll take a bit longer. I'm satisfied that my trial worked correctly, so I'm going to start relatively soon, and I estimate that the report should be available at June 27th, 4:45 AM (UTC). Sorry it couldn't be sooner, but my bot is very, very careful to make minimal impact on the server on long tasks like these. I don't think I'd be justified right now in changing its behaviour. — Madman bum and angel (talkdesk) 20:48, 25 June 2007 (UTC)
I'm surprised that you need the talk page contents, since "what links here" to Template:WPBiography should be enough for the basic info. Admittedly that would not give such fine detail as the listas parameter in the WPBiography template, but such information can be obtained later. Geometry guy 20:58, 25 June 2007 (UTC)
The only thing I really need to download it for is the magic word. If it was just templates, it'd be quick. — Madman bum and angel (talkdesk) 22:12, 25 June 2007 (UTC)
In that case it is not the talk page but the article that you need to download: we are interested in the use of DEFAULTSORT in the article, not the talk page, but I guess you probably realise this. Good luck, and thanks to both you and your bot! Geometry guy 22:56, 25 June 2007 (UTC)
... oop. ID-10-T error; I've restarted it and told it to gather a sample size of a quarter million articles; we can just extrapolate the results (there'll be a selection bias, but eh. Ask me if I care.)  ;) — Madman bum and angel (talkdesk) 02:55, 26 June 2007 (UTC)
Bet you didn't expect results so soon. My bot woke me up; the API started feeding it bad data and it choked and died. Gracefully, of course; as graceful as one could expect given a "can't happen" error. Bet it can!
Anyhow, it's too late for me to debug it; I'm going back to bed. So I'll just give you the results as-is. Looks like usage of the {{defaultsort:}} magic word in the sample was ~1.15% (n = 855, p = 74500, selection bias = heavy; alphabetical). If you wanted to try to extrapolate that using the current number of articles (it's bad math, but all we need is a ballpark figure), we get 21,255 articles using that magic word. So, about twice as many as use {{persondata}}. (This is all crap but it's the outlines of good crap.)
Night! — Madman bum and angel (talkdesk) 05:16, 26 June 2007 (UTC)
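The back-of-the-envelope extrapolation above can be written out as follows. This is only a sketch of the arithmetic: the sample figures (855 hits in 74,500 pages) come from the comment, but the total article count is not stated in the thread, so `ASSUMED_TOTAL_ARTICLES` is a placeholder chosen to be roughly what the quoted figures imply, and the function name is hypothetical. As noted above, the alphabetical selection bias means the result is a ballpark figure at best.

```python
def extrapolate(hits: int, sample_size: int, population: int) -> int:
    """Scale a sample count up to the whole population (naive, ignores sampling bias)."""
    return round(hits * population / sample_size)


SAMPLE_HITS = 855        # pages in the sample using DEFAULTSORT (from the thread)
SAMPLE_SIZE = 74_500     # pages examined before the bot stopped (from the thread)

# Placeholder: the thread does not give the total number of articles.
ASSUMED_TOTAL_ARTICLES = 1_850_000

usage_percent = 100 * SAMPLE_HITS / SAMPLE_SIZE
estimate = extrapolate(SAMPLE_HITS, SAMPLE_SIZE, ASSUMED_TOTAL_ARTICLES)

print(f"usage in sample: {usage_percent:.2f}%")
print(f"naive estimate of articles using DEFAULTSORT: {estimate}")
```

The exact estimate shifts with the assumed total and with whether the rate is rounded to 1.15% before multiplying, which is why it should only be read to two significant figures or so.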
Wow! That's excellent crap! :-) Even the ID-10-T error wasn't too silly, as usage of DEFAULTSORT on talk pages (probably extremely low) would be mildly interesting, if not strictly relevant (many categories on talk page templates use a pipe-sort involving {{PAGENAME}}, so that doesn't help). The ballpark figure of 21,255 articles using DEFAULTSORT sounds about right, as you'd expect more than the number using persondata, but (again, as expected) the numbers using nothing are astronomical. I sometimes fear many of these 350,000 biographical articles are spurious autobiographies that somehow escaped the eagle eyes of the deletionists. Anyway, that's getting off the subject. Thanks awfully for gathering that data, and sorry you were woken up by the bot choking - did you ever find out what it choked on? Special characters? Carcharoth 16:44, 26 June 2007 (UTC)
No; my bot uses UTF-8 natively. It just looks like there was maintenance on the API or a blip or something and it just sent a serialized object that was... nonsense. It wasn't an object at all. *shrugs* — Madman bum and angel (talkdesk) 19:27, 26 June 2007 (UTC)

Thanks for your comments! I know it looks like this bot would be doing 20 different things at once, but my reasoning for this was that it's a ton of reads, and it's much better to read 400K articles once and do a lot of complicated processing, than read 400K articles two or three times. (I want to be nice to the servers.) But that's also a good reason to make sure we have the spec right before going forward. So I'm very open to rethinking how we want to do this.

You have a good point about the 1046 persondata records. I'm not really willing to do these by hand, but perhaps others are. It would certainly be better to add priority and quality and work-group info at the same time, if people are willing to do this.

Your readpersondata technique is attractive. I will be very interested in seeing whether it can be made intuitive and flexible enough to attract widespread community consensus. I'll be making some comments and suggestions regarding this.

I'm going to put the bot authorization request on hold until we get the details straightened out. – Quadell (talk) (random) 22:52, 15 June 2007 (UTC)

Thanks. The idea is still very young. I think it is a general way of storing metadata, and should probably be called a /MetaData subpage. Further development is definitely needed.
I take your point that once the plan is clear, it is more efficient to go through the 400K articles and do everything at once. This is why I suggest testing the process on the 10K articles with Persondata first. As for the 1046 articles, this might be worth flagging as an initial target for the WPBiography summer assessment drive... Geometry guy 23:12, 15 June 2007 (UTC)
1046 is quite a small list. Could you post it somewhere? Although small, it is still a distressingly large number of biographical articles that have escaped the attention of WPBiography. On the other hand, the largest batch of articles to be tagged with WPBiography came from the category 20th century deaths, and is either in the process of being done, or about to be done, so maybe those 1046 ones are already on that other list? Don't suppose it matters much either way who ends up doing the tagging first, but just so you are aware of it in case another bot appears from nowhere and tags all those 1046! Carcharoth 23:24, 15 June 2007 (UTC)
Done, at Template talk:WPBiography/Missing. The articles are just an ASCII list, not wikilinked, but it would be easy to wikilink them with a search and replace, as they are separated by carriage return and line feed (CRLF). Geometry guy 00:21, 16 June 2007 (UTC)
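As an alternative to a manual search and replace, a CRLF-separated list of titles could be wikilinked with a few lines of script. This is a minimal sketch under assumptions: the function name and the choice of a bulleted list are hypothetical, and the sample titles are made up.

```python
def wikilink_list(raw: str) -> str:
    """Turn a newline-separated list of titles into a bulleted list of wikilinks."""
    # splitlines() handles CRLF, LF and bare CR, so the separator need not be known.
    titles = [line.strip() for line in raw.splitlines() if line.strip()]
    return "\n".join(f"* [[{title}]]" for title in titles)


# Made-up sample input using CRLF line endings.
raw_list = "Ada Lovelace\r\nAlan Turing\r\n\r\nGrace Hopper\r\n"
print(wikilink_list(raw_list))
```

Blank lines are dropped so that a trailing line ending in the pasted list does not produce an empty `[[]]` link.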
I've just been tidying up false positives. Wikipedia and User namespaces were still in there... Did you do a straight "what links here", and fail to limit to transclusions only? Carcharoth 00:39, 16 June 2007 (UTC)
Thanks: I'm normally a bit more careful about that. I should have filtered to mainspace talk only, but it took a while to generate the list, and I'm kind of tired now! That cuts it down to 962. I can paste this in instead if you like. Geometry guy 00:43, 16 June 2007 (UTC)
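The namespace filtering mentioned here could look something like the sketch below. It is an assumption-laden illustration, not the method actually used: the prefix list is a hypothetical subset of namespaces, the function name is invented, and the sample titles are made up. Matching against known prefixes rather than any colon avoids discarding articles whose titles happen to contain one.

```python
# Hypothetical, incomplete list of namespace prefixes to exclude.
NON_ARTICLE_PREFIXES = (
    "Wikipedia:", "Wikipedia talk:",
    "User:", "User talk:",
    "Template:", "Template talk:",
    "Category:", "Category talk:",
)


def keep_mainspace(titles):
    """Keep article titles and their 'Talk:' pages; drop other namespaces."""
    # str.startswith accepts a tuple, so one call checks every prefix.
    return [t for t in titles if not t.startswith(NON_ARTICLE_PREFIXES)]


# Made-up sample, including a title that contains a colon but is still an article.
sample = [
    "Ada Lovelace",
    "User:Example/Sandbox",
    "Wikipedia:Persondata",
    "Talk:Alan Turing",
    "2001: A Space Odyssey",
]
print(keep_mainspace(sample))
```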
Oh, I've filtered out a few more. Have a look at my list, but leave it til later in the weekend if you like. I'm away most of the weekend anyway. Just time to paste the ones where persondata needs removing. I'll do that over there. Carcharoth 00:51, 16 June 2007 (UTC)
I made a wikilinked version of the 962 at /Missed instead of /Missing. I'd like to remake the AWB version at some point, but there is no rush. Geometry guy 00:54, 16 June 2007 (UTC)
And I've put the extraneous ones at Template talk:Persondata/Removing data. That's enough for tonight! Carcharoth 01:12, 16 June 2007 (UTC)