Commons:Bots/Requests/Pi bot 1

Operator: Mike Peel (talk · contributions · Statistics · Recent activity · block log · User rights log · uploads · Global account information)

Bot's tasks for which permission is being sought: Deploy {{Wikidata Infobox}} to Commons categories that have commons sitelinks on Wikidata

Automatic or manually assisted: Automatic

Edit type (e.g. Continuous, daily, one time run): Primarily a one-time (but very large) run; then monthly (smaller) runs for new categories/links from Wikidata

Maximum edit rate (e.g. edits per minute): 60 edits per minute

Bot flag requested: (Y/N): Y

Programming language(s): Pywikibot. source code

This bot deploys {{Wikidata infobox}} to all Commons categories, with the provisos that they are linked to Wikidata through a Commons sitelink there, and that the Wikidata item is either not a Wikimedia category or has a category's main topic (P301) link. It does not edit categories that use an alternative template, as listed in the 'templatestoavoid' variable. It looks for subcategories of a specified category, which is currently a small category for testing (running in a manually-assisted way under my main account), but would be Category:Categories for the main run (unless anyone can suggest a better way to select all categories?). Edits look like [1], or in the case of a P301 link, then [2].
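
The selection rule described above can be sketched as a small standalone function. This is only an illustration under stated assumptions, not the bot's actual code: `is_eligible` and the simplified `claims` dictionary (property ID mapped to a list of item IDs) are hypothetical, and Q4167836 is the "Wikimedia category" item referred to later in this discussion.

```python
WIKIMEDIA_CATEGORY = "Q4167836"  # Wikidata item for "Wikimedia category"


def is_eligible(has_commons_sitelink, claims):
    """Return True if the infobox should be added to the category.

    `claims` is a simplified, hypothetical stand-in for the item's
    statements: a dict mapping property IDs to lists of item IDs.
    """
    # The category must be linked to Wikidata through a Commons sitelink.
    if not has_commons_sitelink:
        return False
    # A category item (P31 = Q4167836) qualifies only if it also has
    # a "category's main topic" (P301) statement.
    is_category_item = WIKIMEDIA_CATEGORY in claims.get("P31", [])
    has_main_topic = bool(claims.get("P301"))
    return (not is_category_item) or has_main_topic
```

Keeping the rule in one pure function makes it possible to check the P31/P301 combinations without touching the wiki.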

The bot deployment is currently being discussed at VP/P (but with near-unanimous support so far), and the run would not start until that is complete. Additionally, off-wiki conversations with @Lydia Pintscher (WMDE): about the performance impact of the template mean that it shouldn't be massively deployed until phab:T186714 and phab:T186716 are concluded (the first one is done, the second is still open).

This request is a bit early, but I would like feedback on the bot code and any potential issues, ahead of the closure of the VP/P discussion and the resolution of the phab tickets. Thanks. Mike Peel (talk) 23:45, 3 March 2018 (UTC)[reply]

Discussion

  • Most of the code is okay in logic, though the code styling could be improved (flake8 will complain), and some nitpicks:
    • line 28: templatestoremove is unused. You mean to use it at line 79?
    • line 35: Filter by namespace, not the namespace prefix in title. For more efficient filtering you can use the subcategories() method
    • line 45: continue? avoid nesting in the else:
    • line 49-52: if any(...): continue, avoids use of test flag
    • line 66-70: ditto
    • line 72: P301 is still not defined. You don't want to continue executing as if it is defined, do you?
    • line 98: is this meant to be outside the if-block above?
    • line 99: exception catching too generic. You'd be unable to interrupt it with Ctrl-C (SIGINT).
    • line 104: just break. exit() should not be used in programs
  • Also, I'm asking @Gabrielchihonglee: to work on the interwiki removal (forking my bot's code, should be done in a few weeks due to timezone differences) so that shouldn't be needed. --Zhuyifei1999 (talk) 04:22, 4 March 2018 (UTC)[reply]
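Several of the review points above (guard clauses with `continue`, `if any(...)` instead of a test flag, narrower exception handling, and `break` instead of `exit()`) can be shown together in one loop. The following is a minimal sketch, not the bot's code: `run`, `pages`, `edit` and `max_edits` are hypothetical names, and the template check is a plain substring test for brevity.

```python
def run(pages, templates_to_avoid, edit, max_edits):
    """Process (title, wikitext) pairs and return the titles that were edited.

    `edit` is a hypothetical callable that saves one page.
    """
    done = []
    for title, text in pages:
        if len(done) >= max_edits:
            break  # leave the loop instead of calling exit()
        lowered = text.lower()
        if any("{{" + name.lower() in lowered for name in templates_to_avoid):
            continue  # guard clause: no separate test flag needed
        try:
            edit(title, text)
        except Exception as error:
            # KeyboardInterrupt is not a subclass of Exception, so Ctrl-C
            # still stops the run, unlike with a bare "except:".
            print(f"Skipping {title}: {error}")
            continue
        done.append(title)
    return done
```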
@Zhuyifei1999: Thanks for the feedback! I've incorporated all but one of your points (updated on bitbucket); the one I haven't is about P301, since I think this was confusion with P31 that's also used. It's good to know about the interwiki removal, I've dropped that from my code (I'm still not sure why those function calls didn't seem to do anything though). Presumably you've seen Commons:Village_pump/Proposals#Proposal_to_migrate_interwiki_links_to_Wikidata_(wherever_possible)? Thanks. Mike Peel (talk) 12:52, 4 March 2018 (UTC)[reply]
Yes I am aware of that discussion, but I have not read through it. I personally am not very confident on a bot that adds links to wikidata; wikidata bot operators would be better than me to program a logic that determines whether a link is safe to add. Like my own bot task, I consider only removing the redundant links, where they exist as both explicit wikitext form and on wikidata. Unconditionally removing them / migrating them, will end up terribly.
Yeah, that P301 point may be my error. I don't understand that part of the logic too well. What I understand right now is: if P301 is not found, continue only if P31 is not found or P31 exists but is not Q4167836. Is my understanding correct and the intended behavior?
Also, line 35 (if 'Category:' in target.title():) is still redundant. All subcategories will be... categories. --Zhuyifei1999 (talk) 18:35, 4 March 2018 (UTC)[reply]
@Zhuyifei1999: Is there an easy way to find categories here with interwiki links? If so, I might have a look into adding commons sitelinks on Wikidata based on them (plus other logic such as title matches), and then they could be removed as redundant. On the properties: the idea is that if P31 is Q4167836 (a category item), then we only want to add the template if there is then a value for P301 (a main topic for that category), that's probably coded a bit backward at the moment. Line 35 will be gone in the next version. :-) Thanks. Mike Peel (talk) 22:01, 4 March 2018 (UTC)[reply]
My method is parsing xml dumps. Searching insource:/\[\[en:/ seems to also work, but the regex will definitely time out and you won't ever get a complete set of results. --Zhuyifei1999 (talk) 23:01, 4 March 2018 (UTC)[reply]
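For the dump-parsing approach, detecting explicit interwiki links in a page's wikitext could look roughly like the sketch below. It is an assumption-laden illustration: the pattern only recognises short lowercase language codes (two or three letters, optionally followed by hyphenated parts), so it misses codes such as "simple" and may match non-language prefixes; a real run would check against the actual list of language codes. `find_interwiki_links` is a hypothetical helper name.

```python
import re

# Simplified pattern for language-prefixed links such as [[en:Title]].
# Links with a leading colon ([[:en:Title]]) are inline links, not
# interwiki links, and are deliberately not matched.
INTERWIKI = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z]+)*):([^\]|]+)\]\]")


def find_interwiki_links(wikitext):
    """Return (language code, title) pairs found in the wikitext."""
    return [(m.group(1), m.group(2).strip()) for m in INTERWIKI.finditer(wikitext)]
```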
Where should the wikidata template be placed? Sometimes the bot's edits result in a blank line. --Schlurcher (talk) 06:02, 20 March 2018 (UTC)[reply]
@Schlurcher: I'm leaning towards always placing it after the last }} on the page, to avoid any conflicts with other templates that may add a lot of whitespace (due to issues in the other templates, not in this one). Can you point me to some examples of where the bot is adding blank lines? It shouldn't do that, so perhaps it's because those blank lines were there already? Thanks. Mike Peel (talk) 14:45, 26 March 2018 (UTC)[reply]
Here is an example: [6] --Schlurcher (talk) 16:09, 26 March 2018 (UTC)[reply]
@Schlurcher: Aah, thanks. Spotted a stray "\n", and I'm also now removing trailing "\n"s from templates that are removed at the same time. example edit, and the new code's on bitbucket. BTW, @Zhuyifei1999: I've also reworked the category-walking code to avoid infinitely looped categories. Thanks. Mike Peel (talk) 16:37, 26 March 2018 (UTC)[reply]
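One common way to avoid infinitely looped categories is to remember which categories have already been visited. A minimal sketch follows, assuming a hypothetical `subcategories_of` callable that returns the subcategory titles of a given category; the actual bot code may do this differently.

```python
from collections import deque


def walk_categories(root, subcategories_of):
    """Breadth-first walk of a category tree that tolerates cycles."""
    seen = {root}
    queue = deque([root])
    order = []
    while queue:
        current = queue.popleft()
        order.append(current)
        for child in subcategories_of(current):
            # Skip anything already queued or visited, so a category
            # that (indirectly) contains itself cannot cause a loop.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order
```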
  • Maybe Template talk:Wikidata Infobox is a better place for this, but I'll try here. For testing, I just placed {{Wikidata Infobox}} on the category I'm working on, Category:Raspberry Pi. It doesn't work: it tries to find an item named "Category:Raspberry Pi" on Wikidata. It works on the gallery page Raspberry Pi, but the result is somewhat underwhelming in this case. --Jasc PL (talk) 23:11, 25 March 2018 (UTC)[reply]
    @Jasc PL: The infobox relies on the interwiki link between Commons and Wikidata, if that's not present then it will ask you to create it, as was happening at Category:Raspberry Pi. Since there was also a gallery, which occupies the sitelink on the topic item, I created a new Wikidata item that provides the link, and the info should show up now in the category. Note that the bot will not add the infobox to pages without that interwiki link, so it shouldn't be an issue with the large-scale deployment. With the 'underwhelming' result, is there anything else you'd like to be added to the infobox? Note that it's obviously limited to the information that's available in the Wikidata entry. Thanks. Mike Peel (talk) 14:49, 26 March 2018 (UTC)[reply]
  • I feel that most of the concerns have been addressed and the requestor is generally very responsive to feedback. Still, due to the potentially very high number of edits, I would like to see an unsupervised test run of around 500 edits performed with the bot. I understand that this is more than we normally require, but some specifics may only be identified in rare cases. --Schlurcher (talk) 05:36, 30 March 2018 (UTC)[reply]
    • @Schlurcher: Sure, I'm happy to do that. Want to suggest a category, or should I start at Category:CommonsRoot? I can either do the run later today, or on Sunday. Note that I've also been testing the code in various categories with manual approval for each edit using my account, and it seems to be working quite well. Thanks. Mike Peel (talk) 15:43, 30 March 2018 (UTC)[reply]
Could you use Special:Random/Category, so we get a large coverage? Please let the bot run unsupervised for around 500 edits. Please update us once completed so we can have a final look. --Schlurcher (talk) 21:59, 30 March 2018 (UTC)[reply]
@Schlurcher: OK, I've coded up the random selection, updated code on the git repository, and it's currently running. I did 10 edits first to double-check, it's now doing the other 490. Thanks. Mike Peel (talk) 22:51, 30 March 2018 (UTC)[reply]
@Schlurcher: All done, see [7]. There seems to be a conflict with {{Building address}} that's due to a bug in that template, reported here, otherwise it seems to have worked well. I haven't made any manual edits with these 500 today, but I'll check through them in more detail on Sunday. Thanks. Mike Peel (talk) 00:01, 31 March 2018 (UTC)[reply]
One was reverted, discussion at User_talk:Josve05a#Wikidata_Infobox. Thanks. Mike Peel (talk) 09:53, 31 March 2018 (UTC)[reply]
It's looking like it is best to exclude Taxon-related categories from this deployment, and I'm also tempted to avoid pages using {{Building address}} until that bug is fixed. Both are simple to do by using the templatestoavoid array, and we can always change that in the monthly runs if needed. Thoughts welcome. Thanks. Mike Peel (talk) 23:40, 31 March 2018 (UTC)[reply]
Thanks for doing this. I have no further comments. I think these look good. --Schlurcher (talk) 22:17, 1 April 2018 (UTC)[reply]
If this gets approved (and I'm hoping it's close to that), then there are several extra things to consider. First is the server load, and I'm in contact with the Wikidata team about this, who will be talking to the database admins on 4th April, so the full bot deployment won't start until after that happens and they give a go-ahead. Then, there's the potential number of uses; there are statistics for the number of commons sitelinks at [8], which imply that there are ~1.1 million categories that this could be added to, with another ~0.6 million if [9] is approved, and that shouldn't happen all at once. So perhaps this should be capped at, say, 10k additions per day for a week, and then we can see how things are going - suggestions for that number per day are welcome. Thanks. Mike Peel (talk) 00:23, 1 April 2018 (UTC)[reply]
Pywikibot has a default throttle limit of 6 edits per min. That's 8640 per day, pretty close to your 10k, and with 197 days of running it should finish 1.7 million edits (and 10k per day would be 170 days). I don't think there's a great urgency on this task so 170-197 days looks fine to me, and leaves plenty of room for potential bugfixes. --Zhuyifei1999 (talk) 00:54, 1 April 2018 (UTC)[reply]
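The arithmetic behind these estimates can be reproduced with a few lines. `days_to_finish` is a hypothetical helper written only for this illustration; the 1.7 million figure is the combined estimate from the comment above.

```python
import math


def days_to_finish(total_edits, edits_per_minute):
    """Return (edits per day, whole days needed) at a constant edit rate."""
    per_day = edits_per_minute * 60 * 24
    return per_day, math.ceil(total_edits / per_day)


# 6 edits per minute: 8640 edits per day, 197 days for 1.7 million edits.
print(days_to_finish(1_700_000, 6))
# 60 edits per minute (the requested maximum): 86400 per day, 20 days.
print(days_to_finish(1_700_000, 60))
# A flat cap of 10k edits per day gives 170 days.
print(math.ceil(1_700_000 / 10_000))
```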
I currently have that throttle set to 60 per minute, but I can reset it. There are currently around a dozen editors manually adding the infobox to quite a few categories, and I would like to minimise the time they spend adding it to categories where the bot could have added it, so they can instead focus on adding new sitelinks. So I'd like to see an escalating cap that would ideally finish the initial deployment at the end of April, if possible. Thanks. Mike Peel (talk) 01:11, 1 April 2018 (UTC)[reply]
Pywikibot respects the API etiquette and thus automatically throttles in case of high server load. Thus, a higher edit rate should not be a problem server-wise. The edit rate limit is also there to allow the bot maintainer to clean up after the bot if needed. Even if the 60 edits per minute stated here are approved (to which I have no objections), my suggestion to the bot owner would be to start with something like 6 per minute for the first few days, continue to monitor, and then increase. Also, please consider reducing the rate for the subsequent monthly runs. --Schlurcher (talk) 22:17, 1 April 2018 (UTC)[reply]
Thanks. To clarify, the server load question is about the running of the template on so many categories, not so much the edit rate of the bot. Thanks. Mike Peel (talk) 22:47, 1 April 2018 (UTC)[reply]
I see. My comments were solely regarding the edit rate of the bot. --Schlurcher (talk) 23:06, 1 April 2018 (UTC)[reply]
OK, Amir just emailed me this after discussing the issue with the DBAs: "I just deployed a change in commonswiki that makes the logging table grow at one tenth of the speed it used to be, also I'm planning to delete 72% of that table so storage-wise it won't grow for quite some time. The DBAs told me two things in this case: It needs to be under control and slow as possible. If the table grow twice in one month, we'll have a problem. Secondly, any change might end up having effects in unexpected places. In this case, watchlist and recentchanges might get slow. Please inform the community on this change and come back to me as soon as you encounter any problems anywhere so I can investigate and check at least if it's related or not." So I think we can go ahead with this, but slowly, and we keep an eye out for problems.
How does that sound? Can this be approved, with slow roll-out to start with, and we can monitor things as they go? Thanks. Mike Peel (talk) 19:36, 5 April 2018 (UTC)[reply]
@Mike Peel, first of all, many thanks for all your hard work, kind feedback and help. I was unable to read everything and take part in the discussion at VP/P, but I will try to post some more factual comments at Template talk:Wikidata Infobox tomorrow. For now, just some technical remarks (only my point of view, of course):
  • Vertical space between image and logo: at the moment the two are often glued together, whenever both use all of their available space; ~10px of vertical space by default should fix it
  • The logo is often disproportionately big, especially when it is close to square; examples (some placed by hand): Raspberry Pi, Mozilla Firefox, Milanówek, FETA (festiwal)
About the Raspberry Pi example: I've placed this logo in its WD item. It is suitable for most applications, but not for infoboxes without additional scaling down. We have the complete set including this logo, but how can we use both of them at the same time, depending on the purpose?
  • In my opinion, the logo, emblem, symbol or flag should be placed at the top of the infobox
However, we still have serious and comprehensive discussions going on in several community venues, but what for, if in the meantime a user is running a bot/script that places {{Wikidata infobox}} and removes other category page content? I noticed that and asked this user on 19 March 2018; his arguments seem very peculiar to me. One topic below: the same problem. --Jasc PL (talk) 21:44, 12 April 2018 (UTC)[reply]
@Jasc PL: on the first two, try {{Wikidata Infobox/sandbox}} and see how that looks - the padding should be there, and the logo should be smaller. I'll push that into the main version soon. On the positioning, it's probably best to discuss that on the template talk page first. With the bot run, I'm only adding functionality, not removing any content - whether or not to remove that content is something that should probably be decided case-by-case. Thanks. Mike Peel (talk) 22:24, 12 April 2018 (UTC)[reply]
Thanks @Mike Peel! Now both graphic elements look OK. One problem I have discovered since is the interference between the infobox and the often-used {{Categorise}} template. --Jasc PL (talk) 01:12, 13 April 2018 (UTC)[reply]
@Jasc PL: The problem is with {{Categorise}}, which has "width:100%; clear: both;" in the table style. That means that it wants 100% of the width of the page, and that it should appear below any other box. So it insists on being below {{Wikidata Infobox}}, which creates a huge amount of whitespace. There are a number of other templates that do the same unfortunately. There's nothing I can do in the infobox to fix this - it requires fixing all of the other templates. The work-around is simply to place the infobox below the other templates, which this bot does by looking for the last }} and adding the infobox below that. Thanks. Mike Peel (talk) 09:07, 13 April 2018 (UTC)[reply]
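The placement rule described here (find the last }} and add the infobox below it, without introducing blank lines) can be sketched as follows. This is an illustration under assumptions rather than the bot's implementation: `add_infobox` is a hypothetical name, the "already present" check is a simple case-insensitive substring test, and the last }} is found textually, so it may belong to a magic word or a nested template.

```python
def add_infobox(wikitext, template="{{Wikidata Infobox}}"):
    """Insert the infobox on its own line after the last '}}' on the page."""
    if "{{wikidata infobox" in wikitext.lower():
        return wikitext  # already present, nothing to do
    end = wikitext.rfind("}}")
    if end == -1:
        # No templates on the page: put the infobox at the top.
        return template + "\n" + wikitext
    head, tail = wikitext[: end + 2], wikitext[end + 2 :]
    # Reuse the newline that already follows the last template, so that
    # no extra blank line is introduced.
    if tail and not tail.startswith("\n"):
        tail = "\n" + tail
    return head + "\n" + template + tail
```

For example, a page consisting of {{Categorise}} followed by a category link would get the infobox on a new line between the two.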
@Mike Peel you are extremely fast and effective :). Of course, that was not an infobox problem, but it's excellent that you have a good solution that avoids problems with such templates; would placing them in a separate DIV container with a fixed width also work? --Jasc PL (talk) 15:56, 13 April 2018 (UTC)[reply]
Let me ask one more question, @Mike Peel: is there any version/variant of {{Wikidata Infobox}} for computing topics (OSes, hardware, software, etc.)? --Jasc PL (talk) 19:57, 13 April 2018 (UTC)[reply]
@Jasc PL: Putting the categorise template into a div container would work, I guess. The infobox should work for all topics, let me know if there are any additional Wikidata properties that you want adding to it for computing topics. I’m travelling this weekend, normal service will resume on Monday. ;-) Thanks. Mike Peel (talk) 22:01, 13 April 2018 (UTC)[reply]
Dear Mike Peel, I just added a new topic Template_talk:Wikidata_Infobox#Computing_categories, but now it's work for us, not for you - have a great weekend, without Wikimedia and all professional problems! :) --Jasc PL (talk) 22:51, 13 April 2018 (UTC)[reply]
@Jasc PL: I've replied there, the new properties are now in the sandbox. My weekend offline was partly a photo-expedition, the results of which will be uploaded here soon, so not really a weekend without Wikimedia. ;-) Thanks. Mike Peel (talk) 23:53, 15 April 2018 (UTC)[reply]
Anything else, or can we move on to starting this running? Thanks. Mike Peel (talk) 22:38, 16 April 2018 (UTC)[reply]
Personally, I would say this can be approved in the current state. This is with the understanding that the bot owner is monitoring the downstream effects on Wikidata and will address issues with other templates that might come up in an extended run. Given the responsiveness here, I have no doubt that this will happen. --Schlurcher (talk) 16:12, 17 April 2018 (UTC)[reply]
In my opinion, it could be placed automatically by default, when the new category is manually created (excluding creation by bots). --Jasc PL (talk) 01:17, 18 April 2018 (UTC)[reply]
There was a discussion today at Commons:Village_pump#Wikidata_Infoboxes_disrupt_sorting that was mostly about edits that were made manually, but they've led to some improvements in the DEFAULTSORT code in the infobox, along with some extra tracking categories. I've also written a new script that looks through Category:Pages with DEFAULTSORT conflicts for uses of the infobox, and resolves them by disabling the defaultsort code in the infobox (adding it to Category:Uses of Wikidata Infobox with defaultsort suppressed for later manual/bot debugging). Ideally that'll run once a day. Can that new script also be included in this bot request, or should I start another one for it? Thanks. Mike Peel (talk) 22:42, 18 April 2018 (UTC)[reply]
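
As a rough idea of what resolving a DEFAULTSORT conflict could involve, the sketch below adds a suppression parameter to the first infobox call on a page. It is an assumption throughout: the parameter text `defaultsort=no` is taken to be how the template is told to skip its DEFAULTSORT code, `suppress_defaultsort` is a hypothetical name, and the pattern does not handle nested templates inside the infobox call.

```python
import re

# Matches a {{Wikidata Infobox}} call with optional simple parameters.
INFOBOX = re.compile(r"\{\{\s*wikidata[ _]infobox\s*(\|[^{}]*)?\}\}", re.IGNORECASE)


def suppress_defaultsort(wikitext, parameter="defaultsort=no"):
    """Append `parameter` to the first infobox call unless one is already set."""

    def add_parameter(match):
        params = match.group(1) or ""
        if "defaultsort" in params.lower():
            return match.group(0)  # already configured, leave untouched
        # Drop the closing braces, append the parameter, close again.
        return match.group(0)[:-2].rstrip() + "|" + parameter + "}}"

    return INFOBOX.sub(add_parameter, wikitext, count=1)
```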

I'm going to call this approved, please give it a slow start just in case issues arise. --Krd 12:37, 20 April 2018 (UTC)[reply]