Wikidata:Requests for permissions/Bot/William Avery Bot 9
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved--Ymblanter (talk) 18:57, 9 August 2022 (UTC)[reply]
William Avery Bot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: William Avery (talk • contribs • logs)
Task/s: Remove tracking parameters from reference URLs, as suggested at Wikidata:Bot requests § Tracking parameters in reference URLs. I would like to run this as a recurring task, after clearing the c. 2800 current instances.
Code: cleanseRefs.py - pywikibot script. Task logic is in the treat_page method.
Function details:
Candidate items can be harvested using a database query on the externallinks database table.
I am using a list of tracking parameters from here, but the only ones I have so far detected on Wikidata items are fbclid, igshid, gclid and Urchin Tracking Module ("utm_") parameters.
The URLs processed are the ones on the reference URL (P854) property of references.
Each URL is parsed using the standard Python library urllib, to extract the query parameters into a list. Any that are on the list of tracking parameters are removed. The URL is then reassembled using urllib.
After all the references on the item have been processed, the item is updated with the new URL values.
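The parse, filter and reassemble step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in cleanseRefs.py: the names TRACKING_NAMES, TRACKING_PREFIXES, is_tracking and strip_tracking are hypothetical, the parameter list is cut down to the four kinds mentioned in this request, and the pywikibot part (reading and saving the P854 values) is left out.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative subset only (assumption): the bot works from a longer list.
TRACKING_NAMES = {"fbclid", "igshid", "gclid"}
TRACKING_PREFIXES = ("utm_",)


def is_tracking(name: str) -> bool:
    """Return True if a query parameter name looks like a tracking parameter."""
    lowered = name.lower()
    return lowered in TRACKING_NAMES or lowered.startswith(TRACKING_PREFIXES)


def strip_tracking(url: str) -> str:
    """Return url with tracking parameters removed from its query string."""
    parts = urlsplit(url)
    # keep_blank_values=True so parameters without a value are not dropped.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(name, value) for name, value in pairs if not is_tracking(name)]
    if len(kept) == len(pairs):
        # Nothing to remove: return the URL exactly as it was given.
        return url
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    print(strip_tracking("https://example.org/page?id=7&utm_source=twitter&fbclid=abc"))
```

Returning the input untouched when nothing matches avoids re-encoding URLs that do not need an edit. Where something is removed, the remaining parameters are re-encoded by urlencode, so their escaping may differ slightly from the original.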
Testing: I have run the script on a few individual items, under my own account:
- UTM - https://www.wikidata.org/w/index.php?title=Q3049104&diff=prev&oldid=1692920918
- fbclid - https://www.wikidata.org/w/index.php?title=Q63461504&diff=prev&oldid=1692924249
- igshid - https://www.wikidata.org/w/index.php?title=Q113199232&diff=prev&oldid=1694422750
- UTM - https://www.wikidata.org/w/index.php?title=Q113199296&diff=prev&oldid=1695193286
- gclid - https://www.wikidata.org/w/index.php?title=Q113167042&diff=prev&oldid=1695198613
I intend to run a bulk test on 4 August, if there are no objections. William Avery (talk) 10:53, 3 August 2022 (UTC)[reply]
Diffs from the test run are here
Observations:
- During the test run I turned off 'strict' parsing of the URL parameters, because it was causing the module to throw an error.
- The processing didn't break any links. A few were already broken.
- URLs in archive URL (P1065) are untouched, and can contain tracking parameters. The archiving sites are unlikely to make use of these parameters. See here for an example.
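For readers unfamiliar with the 'strict' parsing mentioned in the first observation: the request does not name the exact option, so the following assumes it is the strict_parsing flag of urllib.parse.parse_qsl. The query string is invented for the example. With the flag on, a field that has no "=" raises ValueError. With it off, such a field is silently skipped unless keep_blank_values is also set.

```python
from urllib.parse import parse_qsl

# Hypothetical query string: "print" has no "=value" part.
query = "print&utm_source=newsletter"

try:
    parse_qsl(query, strict_parsing=True)
    strict_failed = False
except ValueError:
    # Strict mode rejects fields that are not of the form name=value.
    strict_failed = True

# Default (non-strict) parsing drops the bare "print" field.
lenient_default = parse_qsl(query)

# keep_blank_values=True retains it with an empty value.
lenient_keep = parse_qsl(query, keep_blank_values=True)

print(strict_failed, lenient_default, lenient_keep)
```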
Discussion:
I have placed a link to this request at Property talk:P854 § Removal of tracking parameters from reference URLs. William Avery (talk) 11:38, 4 August 2022 (UTC)[reply]
- Support Really useful task. Edits from the test run seem fine. — Envlh (talk) 20:52, 4 August 2022 (UTC)[reply]
- I will approve the bot in a couple of days provided no objections have been raised.--Ymblanter (talk) 19:45, 7 August 2022 (UTC)[reply]
- Support as requester. Thank you, William. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 15:04, 8 August 2022 (UTC)[reply]
- You might also consider removing the amp prefix (e.g. converting
https://amp.theguardian.com/uk-news/2022/aug/08/birmingham-expects-surge-of-tourism-following-success-of-commonwealth-games
to https://theguardian.com/uk-news/2022/aug/08/birmingham-expects-surge-of-tourism-following-success-of-commonwealth-games
). Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 13:20, 9 August 2022 (UTC)[reply]