Wikipedia:Bots/Requests for approval/KasparBot 3
The tool to help migrate Persondata is live at toollabs:kasparbot/persondata.
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.
Operator: T.seppelt (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 18:11, Wednesday, November 4, 2015 (UTC)
Automatic, Supervised, or Manual: automatic
Programming language(s): Java, own framework
Source code available: not yet
Function overview: removing {{Persondata}} in all articles, copying the information to a certain database which will be accessible on Tool Labs. See toollabs:kasparbot/persondata.
Links to relevant discussions (where appropriate): first RfC, second RfC, Bot request
Edit period(s): one time run
Estimated number of pages affected: 1.2 million
Exclusion compliant (Yes/No): no
Already has a bot flag (Yes/No): Yes
Function details:
- request all pages with {{Persondata}}
- fetch the parameters
- copy them into a database (Wikidata won't be affected)
- remove {{Persondata|...}}
-- T.seppelt (talk) 18:11, 4 November 2015 (UTC)[reply]
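The four steps listed above can be sketched for a single page as follows. This is only an illustration in Python, not the bot's actual Java code (which was not published at the time of filing). The regular expression, the function name `extract_persondata`, and the assumption that the template contains no nested `{{...}}` templates are all hypothetical.

```python
import re

# Assumption: the Persondata template contains no nested {{...}} templates.
PERSONDATA_RE = re.compile(
    r"\{\{\s*Persondata\b(.*?)\}\}\n?", re.IGNORECASE | re.DOTALL
)

# Split on pipes, except pipes that sit inside a [[target|label]] wikilink.
PARAM_SPLIT_RE = re.compile(r"\|(?![^\[\]]*\]\])")


def extract_persondata(wikitext):
    """Return (params, cleaned_wikitext); params is None if no template is found."""
    match = PERSONDATA_RE.search(wikitext)
    if match is None:
        return None, wikitext
    params = {}
    for part in PARAM_SPLIT_RE.split(match.group(1)):
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip().upper()] = value.strip()
    # Remove the template (and one trailing newline) from the page text.
    cleaned = wikitext[:match.start()] + wikitext[match.end():]
    return params, cleaned
```

The returned `params` dictionary is what would be copied into the database, and `cleaned` is the text that would be saved back to the article.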
Discussion
@Pigsonthewing, Magioladitis, Izno, GoingBatty, Hawkeye7, and Dirtlawyer1: -- T.seppelt (talk) 18:11, 4 November 2015 (UTC)[reply]
- @T.seppelt: What will be the purpose of the "database" to which Persondata information will be copied as the Persondata templates are being deleted from all Wikipedia articles? Without a plan to review it, parse it, and transfer usable information to Wikidata, I'm not sure that creating a massive database with approximately 1.2 million Persondata profiles serves much of a function. Simply transferring potentially usable information to a database where most English language editors do not have practical access to it, hoping that someone with proper skills, time and motivation will actually do something with it in the future is, I'm afraid . . . well, wishful thinking. Dirtlawyer1 (talk) 18:30, 4 November 2015 (UTC)[reply]
- The purpose of this database is to allow users to add the information to Wikidata. The information will be parsed and a simple interface will be provided which allows users to decide one-by-one if a certain piece of information is suitable for Wikidata. Imagine it in this way: You will have 3 buttons (Next/Skip, Yes, No) and the rest of the work is done by software. It will be easier than comparing articles and items manually and adding complex statements. -- T.seppelt (talk) 18:39, 4 November 2015 (UTC)[reply]
- T.seppelt: I endorse your approach whole-heartedly. I look forward to working with your newly created database and toolset. Thank you for investing your skills, time and effort to do this. Dirtlawyer1 (talk) 18:49, 4 November 2015 (UTC)[reply]
- (edit conflict) Yes, thank you for your time. I am not sure that I would work with the proposed tool, as follows. -P64
- Does Wikidata really want the two-part reference added as part of every statement whose source is English Wikipedia, "imported from: English Wikipedia; retrieved: 4 November 2015" (not to mention same with UTC timestamp)? Then some improved interface must be a must, if you know what I mean. For me it might be sufficient to have a feature at Wikidata that amounts to "repeat the last reference". For English Wikipedia and the few "stated in" sources such as LCNAF that I commonly use there, I would be willing to provide the two-part reference once and then use a convenient "repeat the last reference".
- I doubt that I would dedicate time to the transfer of leftover Persondata to Wikidata, only continue to do some of that where something else takes me to the Wikipedia article. So I doubt (but may be wrong) that the extra efficiency of a dedicated PD to WD interface would matter to me. I have no clue how many editors might focus on leftover PD, where such extra efficiency should be very welcome. Maybe I would help with that myself, merely I doubt it.
- I haven't read Wikidata instruction concerning alternative names, perhaps what I most frequently carefully added to our PD templates myself. I don't use any of the WD statements that pertain to such content, only the "Also known as" header field. I don't know how much good alternative name data there is in our PD templates nor whether the proposed interface would actually be efficient for adding WD statements such as birthname, pseudonym, etc. --P64 (talk) 19:40, 4 November 2015 (UTC)[reply]
- This task is, for the most part, probably going to be "sort out alternative names". I'm not aware of whether "alternative name" really fits into anything but the alias fields (what you refer to as "also known as"). --Izno (talk) 22:47, 4 November 2015 (UTC)[reply]
- Even though the ping didn't reach me, I--of course--advocated this idea (following Alakzi's comment in the second RFC).
What I'd like this bot (or another) to do also would be to simply remove data that is already present on Wikidata, thus never pulling that into the database. We might enlist @GoingBatty: (because the ping didn't hit me I assume it didn't hit him) to do a bot run first removing all the uninteresting data and then start a bot doing this work T.seppelt. --Izno (talk) 22:47, 4 November 2015 (UTC)[reply]
- @Izno: I'm happy to help clear a path for T.seppelt if that would be beneficial. The hard part has been getting consensus on the definition of "uninteresting data". GoingBatty (talk) 03:31, 5 November 2015 (UTC)[reply]
- I think the start that you and Dirtlawyer got on in one or the other RFCs was probably the right direction, taking into account the seeming issue of the calendars (which I'm still fairly certain isn't resolved, though I haven't been watching those tasks). @Jc3s5h: since he cares. --Izno (talk) 04:03, 5 November 2015 (UTC)[reply]
- Interestingly, I wasn't pinged or otherwise made aware of that second RFC. Nonetheless, there's nothing stopping the first portion of this from being enacted; you can use database dumps or the API to scrape the information into whatever form you desire. --slakr\ talk / 23:57, 4 November 2015 (UTC)[reply]
- Agreed—it would be good to get a sense of how the data will be presented, and whether the people who are interested in working with it find the interface useful. — Earwig talk 00:07, 5 November 2015 (UTC)[reply]
- Support. User-friendly interface certainly makes any plan to export Persondata to an external database more appealing. (And more appealing than having to rely on past revisions). Of course, I would like to see a demonstration of the interface functionality and migration accuracy etc before any mass data migration. If it can be successfully implemented then I feel this is a good compromise. Question: Previous discussions have highlighted the confusion between Gregorian and Julian dates (I still don't understand, and most people are probably completely unaware). How would this be dealt with?
(On a side note, I am also a little disturbed that the "second" RfC was not flagged on Persondata talk pages, and that I'm only hearing about it now. But so long as this plan goes ahead, there's probably no point contesting it.) —Msmarmalade (talk) 03:08, 5 November 2015 (UTC)[reply]
- I asked in the RFC whether T.seppelt might extend his authority control tool, which I got an affirmative for then. Basically, the tool in question helps us resolve authority control issues and mismatches; this application seemed similar enough to me given that we were fairly certain we would need to deal with human checking. --Izno (talk) 04:03, 5 November 2015 (UTC)[reply]
- I use the standard Wikidata date parsing service, which is accessible through the Wikidata API (wbparsevalue), for parsing dates in order to get the same results as a user would get. -- T.seppelt (talk) 11:50, 9 November 2015 (UTC)[reply]
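As a rough illustration of the approach described above, the following Python sketch builds a request URL for the `wbparsevalue` API module. It is not the bot's code. The helper name `build_parse_request` is hypothetical, and the `datatype=time` and pipe-separated `values` parameters reflect my understanding of the module and should be checked against the API documentation.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_parse_request(raw_dates):
    """Build a wbparsevalue URL asking Wikidata to parse free-text dates."""
    query = {
        "action": "wbparsevalue",
        "datatype": "time",              # assumed: parse the values as time values
        "values": "|".join(raw_dates),   # multiple values are pipe-separated
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(query)
```

Fetching the URL and reading the JSON response is left out here. Because the parsing happens on the Wikidata side, ambiguous inputs get the same interpretation an editor would get when typing the date into Wikidata by hand.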
- Support - As Hawkeye and others have pointed out, I strongly suggest that we get behind this plan and turn the bot loose. Every day, Persondata information is being manually deleted by users who have no clue regarding the present status of this discussion, the specifics of any of the recent RfCs, or the efforts to parse and transfer remaining usable information from Persondata to Wikidata. If we're going to do this, we need to do it before any more potentially usable Persondata information is lost to manual deletion without review or transfer to Wikidata. If this is the plan, let's do it. Dirtlawyer1 (talk) 02:22, 7 November 2015 (UTC)[reply]
- I started to fetch the data. The program is still running but the results are available live under [1]. Nothing gets lost from now -- T.seppelt (talk) 11:50, 9 November 2015 (UTC)[reply]
- Looks good so far. — Earwig talk 07:49, 11 November 2015 (UTC)[reply]
- Support - To be absolutely honest, when Persondata was deprecated a bot run should've been done much, much sooner .... I and others have removed a lot of Persondata from articles assuming it was already at Wikidata. One wonders what the actual point of Wikidata is, but that's for another day. A bot's needed ASAP, so I whole-heartedly support. –Davey2010Talk 13:52, 12 November 2015 (UTC)[reply]
- Question: At what rate will the bot be editing? Kharkiv07 (T) 17:06, 12 November 2015 (UTC)[reply]
- As preferred: I can make it do up to thirty edits per minute, or I can limit the edit rate. Maybe 10 edits per minute is a good amount? Regards, -- T.seppelt (talk) 19:43, 12 November 2015 (UTC)[reply]
- Some numbers here. Given 1.2 million transclusions, 10 edits/minute would take about 84 days to clear all persondata, while 30 edits/minute would take 28 days. Policy recommends not going faster than six per minute. In practice we often let bots go faster than that, but within reason; 30 seems a bit fast for me. I suggest 10 per minute; i.e., a six-second sleep between edits. Make sure your library respects maxlag. — Earwig talk 03:36, 13 November 2015 (UTC)[reply]
- I am implementing the things as you proposed. Regards, --T.seppelt (talk) 07:34, 13 November 2015 (UTC)[reply]
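The run-time estimates in this thread can be reproduced with a short calculation. The Python below is an illustrative sketch only. The function names are hypothetical, and the page count is the 1.2 million estimate from the request.

```python
def days_to_finish(pages, edits_per_minute):
    """Days of continuous editing needed to touch every page once."""
    return pages / edits_per_minute / 60 / 24


def sleep_seconds(edits_per_minute):
    """Pause between edits that yields the requested edit rate."""
    return 60 / edits_per_minute


PAGES = 1_200_000  # estimate from the request, not an exact count

slow = days_to_finish(PAGES, 10)   # roughly 83 days
fast = days_to_finish(PAGES, 30)   # roughly 28 days
pause = sleep_seconds(10)          # 6.0 seconds between edits
```

The exact figure for 10 edits per minute is a little over 83 days, which the estimate above rounds to about 84. The calculation ignores any extra waiting caused by maxlag back-off.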
- Support - at this point the template is just causing confusion and not serving any useful purpose. I think it's time to clean it up. Kaldari (talk) 06:55, 14 November 2015 (UTC)[reply]
- Oppose There is no need to combine two unrelated proposals and create controversy where none existed before. The two proposals here are -
- Copy this data into a database elsewhere (totally uncontroversial - let the bot do this)
- Delete the data from here before the other database is established (why even do this?)
- This proposal is premised on a database created in the future replacing the need of the current system. Why delete the current system before there is broad community approval of this other database, which does not even exist yet? Feel free to make the other database. After that database is appreciated, then make a second proposal to use it to replace the current system.
- I fail to recognize any reason why these steps ought to be combined into a single proposal. What am I missing? Blue Rasberry (talk) 18:02, 15 November 2015 (UTC)[reply]
- Oppose and agree with Bluerasberry's reasoning. Why would we make someone go to two places to look for that information? — Sctechlaw (talk) 01:00, 17 November 2015 (UTC)[reply]
- @Sctechlaw: What do you mean by "two places"? — Earwig talk 01:30, 17 November 2015 (UTC)[reply]
- The Earwig The two places being discussed are English Wikipedia and a project on Tool Labs. Blue Rasberry (talk) 14:38, 17 November 2015 (UTC)[reply]
- The final home of these data is Wikidata, regardless. Intermediary but pursuant to the two RFCs (one to remove Persondata and the second to remove it by bot) would be a "holding pen" of sorts (hosted on Labs) for people to more easily assign the data to Wikidata items (through a person-centric UI). --Izno (talk) 14:43, 17 November 2015 (UTC)[reply]
- Right, that was my understanding. I asked because I'm not clear how we are making people go to two places to look for information. As far as this task is concerned, if we keep {{persondata}} around for too long after the Labs database has been created, we're just encouraging them to go out of sync as people deal with both. I think it makes sense to start running the bot as soon as we are satisfied with the way the tool is structured. — Earwig talk 01:21, 18 November 2015 (UTC)[reply]
- I would like to emphasize that the only alternative to T.seppelt's creation of a Persondata database and interface for review and transfer to Wikidata is the outright deletion of all existing Persondata with no further transfer of usable Persondata information to Wikidata. And I would also like to take note that T.seppelt's analysis below demonstrates that the statements made during the two previous RfCs -- that there remained no usable Persondata information that could be practically transferred to Wikidata -- were uninformed at best and outright misrepresentations at worst. Again, I commend T.seppelt for undertaking this project, and I urge editors who are opposed to this proposal to familiarize themselves with the two previous RfCs related to the removal of Persondata (May 2015 and September-October 2015) and prior bot request (June 2015). This is now the only game in town to preserve and transfer usable Persondata. Dirtlawyer1 (talk) 01:40, 18 November 2015 (UTC)[reply]
Update I am considering different ways of making the data accessible for user assisted import after the removal at the moment. In order to assess the options I did an analysis of the persondata which is accessible at the moment on enwiki. These are the results:
Persondata field | Wikidata | New ready | New unparsable | Conflict | Conflict unparsable |
---|---|---|---|---|---|
DATE OF BIRTH | P569 | 51269 | 4695 | 88790 | 5093 |
PLACE OF BIRTH | P19 | 310575 | 32086 | 44907 | 27230 |
DATE OF DEATH | P570 | 26379 | 2335 | 67835 | 2724 |
PLACE OF DEATH | P20 | 90996 | 10654 | 14996 | 10737 |
ALTERNATIVE NAMES | alias | 101976 | n/a | ||
SHORT DESCRIPTION | description | 21417 | 135961 | ||
NAME | label (in future alias) | 54 | 244569 |
As you can see, so far we have 479,219 statements which could be directly imported. To me, the best option for this data is to give it to the Primary Sources Tool. For the conflicting statements, the unparsable data, the aliases, the descriptions and the labels I will provide a software solution. But please consider that we should start this removal process as soon as possible because of the long time it will take. There will be 1,295,269 user actions necessary to complete the import. The 479,219 statements can be made accessible in the next few days. Even checking these statements will take weeks; in the meantime the next parts of the dataset will become available through the proposed tool. Warm regards, -- T.seppelt (talk) 20:19, 15 November 2015 (UTC)[reply]
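The 479,219 figure is the sum of the "New ready" column for the four date and place fields in the table above. A small Python check, using only the numbers from the table, makes the arithmetic explicit:

```python
# "New ready" counts for the four date/place fields, copied from the table.
new_ready = {
    "DATE OF BIRTH": 51269,
    "PLACE OF BIRTH": 310575,
    "DATE OF DEATH": 26379,
    "PLACE OF DEATH": 90996,
}

directly_importable = sum(new_ready.values())
```

Places of birth alone account for roughly two thirds of the directly importable statements.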
Great news You can use the Primary sources tool now to add place of birth and place of death statements to Wikidata. Tpt just uploaded the dataset ([2]). I am working on the tool for descriptions now. It will be available in the next hours or tomorrow. Warm regards, -- T.seppelt (talk) 19:41, 20 November 2015 (UTC)[reply]
Tool is launched in beta I worked on the proposed tool and it is ready for public testing now. You can find it here. Please check your contributions to Wikidata in order to find bugs. Let me know, if something goes wrong. Please come up with ideas for improvement. Warm regards, -- T.seppelt (talk) 15:25, 21 November 2015 (UTC)[reply]
Convenience break no. 1
@T.seppelt: Hi, TS. I got a chance to test-drive your new tool today (December 2, 2015) for the first time, following our Thanksgiving holiday week here in the States. It's impressive, and appears to be exactly what you proposed above. I do have a few follow-up questions . . . .
- When I tried it out today, there were only a few hundred conflicting datapoints to be chosen -- are these the only conflicting datapoints remaining, or does this just represent the first batch of conflicting datapoints to be reviewed and selected?
- The tool contains at the moment only a fraction of the datapoints for testing. I am going to upload the rest of it soon. -- T.seppelt (talk) 22:04, 7 December 2015 (UTC)[reply]
- How many non-conflicting Persondata datapoints have been imported directly into Wikidata so far? If that process still has not begun, what is the timeline for it?
- Nothing is going to be directly imported due to the decision of the Wikidata community to not accept automatically imported Persondata data. The tool also contains datapoints which aren't conflicting. Those can be imported manually. -- T.seppelt (talk) 22:04, 7 December 2015 (UTC)[reply]
- How many conflicting Persondata/Wikidata datapoints have been manually reviewed thus far using your new tool?
- 478 datapoints have been reviewed so far. -- T.seppelt (talk) 22:04, 7 December 2015 (UTC)[reply]
- In using your tool to review 100 or so conflicting brief descriptions, I noted that close to half were not so much conflicting as complementary -- e.g., American politician, Founding Father, signer of the Declaration of Independence. Is there any way we may add a Persondata brief description without replacing the existing Wikidata description?
- I have a function in mind which allows the user to edit the suggested description and add it on the spot. I am going to implement it as soon as I find time. -- T.seppelt (talk) 22:04, 7 December 2015 (UTC)[reply]
- Is there any way we can choose to review the conflicting datapoints for particular Wikipedia categories (e.g., Olympic swimmers of Germany)?
- I am storing only the Wikidata entry numbers and the Wikipedia article names. Fetching the categories would be quite advanced, but I am thinking about it -- T.seppelt (talk) 22:04, 7 December 2015 (UTC)[reply]
- What's your plan going forward from here?
Once again, thank you for devoting your time and skills to this endeavor. Dirtlawyer1 (talk) 05:24, 3 December 2015 (UTC)[reply]
- The plan is to import all remaining datapoints to the tool and start with the removal. Everything is ready for it. -- T.seppelt (talk) 22:04, 7 December 2015 (UTC)[reply]
@Dirtlawyer1: Done I imported all remaining datapoints. They are now available in the tool. As you can see there is a long way to go. Anyways, we can start with the removal now. -- T.seppelt (talk) 10:06, 8 December 2015 (UTC)[reply]
- I have been playing around with this for a bit, and here are some thoughts:
- Agree with Dirtlawyer that being able to edit the descriptions is essential.
- Working I will implement this soon. -- T.seppelt (talk) 21:30, 10 December 2015 (UTC)[reply]
- There's no way to link to specific challenges, so I am using a screenshot instead. It's identifying a conflict where none exists. The persondata in Jo Marie Payton is "Albany, Georgia, U.S.", so I'm not sure why it is misidentifying that as Q137573.
- Whilst parsing, "," was interpreted as a separator if the value didn't contain [[ or ]]. I think this is the best option for most cases. In general articles about pages don't contain ",", so "," should be treated as a separator between subdivisions (New York City, New York, USA etc.). -- T.seppelt (talk) 21:30, 10 December 2015 (UTC)[reply]
- A direct reference to the guidelines for descriptions and aliases when I am using the tool would be helpful, to clarify the meaning of "You are kindly ask to decide which one is better." Better how? Is "Fooian sausage-maker" better than "sausage-maker from Foo"?
- Working I will implement this soon. -- T.seppelt (talk) 21:30, 10 December 2015 (UTC)[reply]
- When dealing with descriptions, "No, take this!" would be clearer as "Keep this" or something clarifying that Wikidata won't be edited and this is essentially the "default" action. (Would remove the exclamation marks too, but that's just my opinion.)
- Done I changed it. -- T.seppelt (talk) 21:30, 10 December 2015 (UTC)[reply]
- When one article has multiple conflicts, it would be good to deal with all of those at once. The checking necessary for two properties can be very similar (e.g. date and place of birth), so this would save time.
- Working I think about it. This could also cover permanent links to certain conflicts. -- T.seppelt (talk) 21:30, 10 December 2015 (UTC)[reply]
- There's no way we're gonna make a noticeable dent in this any time soon.
- When you do run the bot for removal, can it handle changes in persondata since this initial run?
- I don't plan to do it. I don't think that huge amounts of information were added to articles as templates after the last run. -- T.seppelt (talk) 21:30, 10 December 2015 (UTC)[reply]
- — Earwig talk 23:17, 8 December 2015 (UTC)[reply]
- I am going to apply the suggested improvements. Please keep on testing the tool. Warm regards, -- T.seppelt (talk) 21:30, 10 December 2015 (UTC)[reply]
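The comma rule T.seppelt describes in the reply about Jo Marie Payton can be sketched as follows. This Python is a guess at the described behaviour, not the tool's actual code, and the function name `candidate_place_names` is hypothetical.

```python
def candidate_place_names(value):
    """Split a place value on commas unless it contains wikilink brackets."""
    if "[[" in value or "]]" in value:
        # Linked values are kept whole; the link target identifies the place.
        return [value.strip()]
    return [part.strip() for part in value.split(",") if part.strip()]
```

With this rule, an unlinked value such as "Albany, Georgia, U.S." is split into three separate names. Looking up one of those parts on its own may match a different item than the full article title would, which is one plausible explanation for the mismatch reported above.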
- Another idea: giving a view of the Wikipedia article below the challenge so I don't have to keep opening it. Not sold on it, but worth considering, I think. — Earwig talk 07:45, 13 December 2015 (UTC)[reply]
- I found this a pain as well, though my preference would be for the Wikipedia article at the time of import of the data, though I'm not sure about that. --Izno (talk) 15:32, 13 December 2015 (UTC)[reply]
- When doing the removals, I'd suggest the edit summary link back to the tool page for the particular challenge(s) (using the toollabs:kasparbot/persondata/... interwiki prefix), which would necessitate persistent links to each challenge. — Earwig talk 08:08, 13 December 2015 (UTC)[reply]
- I have also made suggestions at your WD talk page, TS. Let me know what you think either here or there. For others here, quoted:
Some comments:
- Make the skip button bigger, red, and place it slightly more prominently. (Apparently you had this planned but I'll echo it; "I should make it maybe bigger" from November 24.)
- Make unavailable the varied data which indicate a date earlier than 1920, per the brief discussion at Wikidata talk:Primary sources tool#Migration of enwiki Persondata. Or move it into a different workflow, or something. Earlier dates are a mess right now and I don't think this tool should exacerbate that fact.
- You can probably improve the link to the article by using the Wikidata item's link rather than using Special:Search Article Name.
- It might be nice to have a link available to the version of the article at the date of import. This way I can take a look to see if anything in the article is particularly disagreeable to the persondata as well as how different (whether for example a date is a refinement of the date elsewhere in the article, see e.g. [2] or more likely to have vandalized). On this point, a link to the history of the article would also be appreciated.
[...]
Also, there are a number of challenges where the page on the Persondata side is a disambiguation item and the other is an actual item where the titles on Wikidata are the exact same. Maybe these can be prefiltered? --Izno (talk) 16:24, 11 December 2015 (UTC)
- --Izno (talk) 15:32, 13 December 2015 (UTC)[reply]
Update
@Izno and The Earwig: Thank you for your feedback so far. I managed to implement some of the things. Others are still under development. What's done so far:
- The skip button is larger and very red.
- You can now access certain challenges on the Browse challenges-page [3]. This page provides permanent linking as well as a possibility for finding challenges related to certain Wikipedia articles or Wikidata entities.
- Permalinks are now included in edit summaries (for both, removing and adding). When the template is going to be removed from the articles those links can be included too.
- The related policies for the different types of information are now displayed on the deciding pages.
- The deciding pages show now links to the history, talk page and provide an edit link.
I am at the moment on the following things:
- Checking the constraints for the places and exclude the inadequate data. (disambiguation pages etc.)
- Excluding dates before 1920.
- Form for editing descriptions.
Since I didn't store the version ids of the articles while filling the database, I would prefer not to establish these links to a specific version. It would be necessary to reassemble the whole database. I would also like to keep the links using ?title=... because I have some issues with urlencoding and special characters. Regards, --T.seppelt (talk) 11:56, 14 December 2015 (UTC)[reply]
?title is fine; I didn't realize you were using that construction (for some reason I went to Special:Search from one of those pages...?).
Would it be possible to get links to the unique ID of an article as of December 14, 2015 (or date of interest i.e. whenever you can get to it)? This would be Good Enough since no bots have started removing the Persondata (though there are varied editors--including myself--removing them by hand where we bump into them).
Thanks for the work! --Izno (talk) 13:45, 14 December 2015 (UTC)[reply]
I just want to let you know that I see a way to implement all the open improvements. Due to the upcoming holidays I won't be able to make any of the changes accessible until 2016. Warm regards and merry christmas,--T.seppelt (talk) 20:50, 18 December 2015 (UTC)[reply]
@Izno and The Earwig: I am almost done with the proposed changes. Descriptions can now be edited manually, more and more ?oldid=...-links are available and claims with constraint issues are excluded. I would like to do some test edits for the removal of the template. Are you okay with this? -- T.seppelt (talk) 08:30, 4 January 2016 (UTC)[reply]
- We might as well. Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. We can do a larger trial afterwards when we confirm that all is well. — Earwig talk 08:31, 4 January 2016 (UTC)[reply]
- Trial complete. I didn't notice problems with the replacement pattern. What do you think about the edit summaries? -- T.seppelt (talk) 10:27, 4 January 2016 (UTC)[reply]
- I haven't looked through all of the edits yet, but the summary is on the right track; maybe change "related challenges" to "challenges for this article"? — Earwig talk 09:00, 7 January 2016 (UTC)[reply]
- I changed the summary as you proposed. This is probably easier to understand. After this is approved we also have to update Wikipedia:Persondata... Regards, —T.seppelt (talk) 11:20, 8 January 2016 (UTC)[reply]
Observed problem
@T.seppelt: Having used your new tool to transfer over 1,000 items of Persondata to Wikidata, I have observed a recurring problem in the tool and/or database's recognition/reconciliation of place names. The tool sometimes selects/suggests a more generalized location than that actually provided in the Persondata template; for example, suggesting the State of New York or the United States, when the birth place or death place actually provided in the Persondata specifically states "New York City". I have also observed that the tool will also sometimes suggest Wikidata disambiguation pages when the Persondata accurately provided the specific item. For examples, please see the Persondata and challenges for Jim Price (baseball manager) and Chase Lyman.
By the way, among those Persondata items imported into your database for further review, I have found absolutely no difference in the reliability of Wikidata vs. Persondata, and I have replaced as many Wikidata items as I have rejected items of Persondata. There remains a great deal of perfectly accurate Persondata to be transferred. What we desperately need now are more editors to review the available items of Persondata, and transfer them as appropriate, using your tool. Dirtlawyer1 (talk) 23:10, 6 January 2016 (UTC)[reply]
- I am still investigating the problem with the too general descriptions. I have a script for excluding disambiguation claims, but since I can't keep stable connections to the database servers on toollabs at the moment it is going very slowly. I'm going to rewrite the script in PHP and hope that it is more reliable.
- I agree with you. We need more editors to work on this. I hope that dropping a hint in the edit summaries of about one million articles will increase the number of interested users. As stated above we should also update Wikipedia:Persondata. I was also thinking about using banners. Dewiki did this to inform the community about the web link checking activity of GiftBot. —T.seppelt (talk) 11:35, 8 January 2016 (UTC)[reply]
Updates
I just want to let you know what's new on the tool:
- The deciding pages for descriptions, aliases and places now provide matches for the proposed value in the Wikipedia article. They are highlighted in yellow. Key words (birth, born, died, death, place, also etc.) are highlighted in blue. In most cases you don't have to check the whole article manually anymore.
- Recent decisions can be accessed now. The page follows the style of Special:RecentChanges and allows you to inspect other editors' decisions.
- The exclusion of disambiguation pages as values for places is in progress. The script is stable. The decisions are marked as excluded by KasparBot. Have a look at them at the recent decisions. After all of them have been excluded, they are going to be available for user-assisted reparsing in order to make use of those approximately 30,000 claims. I am working on a page for this.
Thank you for testing the tool. Warm regards, — T.seppelt (talk) 10:25, 11 January 2016 (UTC)[reply]
- (Numbered for ease of reference). #1's change pushed the buttons down off my screen (operating at 1920x1200, which is probably one of the standard res's now). It might be desirable to move the selection buttons above that content, either by a) docking them via CSS to the bottom of the viewport or b) (preferentially) just having them above that content in the HTML. This might take the form of proposed description -> buttons -> wiki text. Good change otherwise! --Izno (talk) 12:46, 11 January 2016 (UTC)[reply]
- @Izno: I solved this problem by showing only a single match when the results are getting too long. More details are shown when clicking on Show more. -- T.seppelt (talk) 17:25, 12 January 2016 (UTC)[reply]
- A couple other comments:
- "You don't like this challenge at all? No problem. Skip it!" now seems extraneous to the Skip button and can be removed.
- "Your decision to accepted" -> should say "accept". I think there is a similar problem with "rejected"/"reject" also. "overwritten"/"overwrite" also.
- No notification is posted when the challenge is skipped. Should there be one? Probably.
- Perhaps, add titles to the buttons to explain the intent of the button. Less concerned about this--a help page would be an acceptable substitute about the meanings, or maybe at the bottom an unbulleted list.
- Otherwise, I see little issue in deleting the template now. All of these are certainly quibbles. The Earwig, any other concerns from the run, or have you not had a chance to have a look yet? --Izno (talk) 18:18, 12 January 2016 (UTC)[reply]
- @Izno: The problem with the buttons outside of the viewport is now ultimately solved. The buttons stick to the viewport if the document is higher than the window. I am not sure if it's working with all browsers. For Firefox 43 it's fine. Concerning your comments:
- This bar has been removed.
- Done
- Done
- Instructions for using the tool could be placed at Wikipedia:Persondata. I can add a link to this page and helpful titles.
- -- T.seppelt (talk) 21:11, 12 January 2016 (UTC)[reply]
- My inclination would be to just do it. I admit I'm still a bit uncertain about Blue Rasberry and Sctechlaw's comments from above; they are the only ones objecting and they seem to have gone MIA since their initial remarks. We're at the point where all of the data has been copied over to the database and it is sufficiently developed to be usable (i.e., the issues above are certainly not deal-breakers), but should we draw more attention to it from the community at large in order to get the "broad community approval" that Blue Rasberry mentions? Alternatively, waiting longer leads to increased likelihood of databases being out of sync and manual effort in the meantime going to waste. — Earwig talk 21:29, 12 January 2016 (UTC)[reply]
- The Earwig Hello. My original objection was about deleting information when I did not understand why it needed to be deleted. If this bot is not deleting information then I do not object to anything. If it is deleting something then I want more information, either on-wiki or by a phone or video chat if that makes things easier. Blue Rasberry (talk) 21:32, 12 January 2016 (UTC)[reply]
- I understand that persondata will be deleted. I am just not sure why this, as a preservation project, is proposed as the bot to execute that deletion. I support the preservation effort but fail to recognize the rationale for this bot to preserve then delete persondata after archiving it. Blue Rasberry (talk) 21:34, 12 January 2016 (UTC)[reply]
- @Bluerasberry: The ultimate goal is—of course—to get all persondata off of Wikipedia and as much of it onto Wikidata as possible. As it stands, the storage format of persondata makes this very difficult, while the bot's database lets the migrators work more quickly. I think that much is clear. The reason we want to remove persondata rather than just leave it around is because energy is expended dealing with a template that is currently useless; it takes up space in articles, people try to update it, remove it manually, etc, all of which requires effort. A mass-removal clearly marks persondata as historical and freezes its information in one state that can be worked through without concerns of this desynchronization. As long as the bot's database exists and is being reviewed, no information is lost by removing it.
- Maybe someone else can provide a more coherent argument, though. — Earwig talk 22:09, 12 January 2016 (UTC)[reply]
- The Earwig I know that persondata needs to be removed. I just want you to explain why you think this project should remove the persondata before there is confirmation that it is backed up in this database. Why not just delay the deletion until everyone agrees, "Yes, this bot did a backup." I still am not understanding the urgency to do the deletion.
- Can you not just collect the data as of a certain date, like 12 January 2015?
- I am imagining a limbo in which the persondata is deleted and the database is not created. The work flow as described is capture data, delete data locally, then establish the other database. What if someone has a major problem with persondata in your database? Why not get agreement that your database works before deleting the data locally? What information am I lacking that you have that makes you sure that at the time of local deletion, everyone will be happy that the persondata is gone? Blue Rasberry (talk) 22:53, 12 January 2016 (UTC)[reply]
- @Bluerasberry: You seem to be crucially mistaken; the database exists and has been functional for over a month now. (Note that I said in my above comment "We're at the point where all of the data has been copied over to the database and it is sufficiently developed to be usable".) Feel free to try it out. It's also linked from the edit summaries of each removal from the trial. — Earwig talk 05:41, 13 January 2016 (UTC)[reply]
- @T.seppelt: Could we get a look at the source code for the removal component? I want to see how it's doing that. — Earwig talk 07:43, 13 January 2016 (UTC)[reply]
- In Géza, Grand Prince of the Hungarians, why does the "date of death" challenge seem to appear twice? — Earwig talk 08:05, 13 January 2016 (UTC)[reply]
- @The Earwig: the source code is here. This is the calling part:
    public static void main(String[] args) throws Exception {
        GlobalMediaWikiConnection global = new GlobalMediaWikiConnection();
        global.setBot(true);
        MediaWikiConnection wikipedia = global.openConnection("en", "wikipedia.org");
        wikipedia.login(Config.LOGIN);
        wikipedia.setEditInterval(10000L);
        new TemplateRemovalTask(wikipedia, "Persondata",
                "migrating [[Wikipedia:Persondata|Persondata]] to Wikidata, [[toollabs:kasparbot/persondata/|please help]], see [[toollabs:kasparbot/persondata/challenge.php/article/%article|challenges for this article]]",
                0).run();
    }
- I am aware that there are some duplicate with small IDs (< ~ 1000). I am working on identifying them. There are about 100 of them because I didn't truncate the whole database after the testing period and imported some challenges twice. -- T.seppelt (talk) 14:07, 13 January 2016 (UTC)[reply]
- @T.seppelt: My concern here is that the regex replacement will not work correctly if some persondata item contains an embedded template. I am not sure in practice how common this is (it is a mistake and surely very rare, but I don't know if we have some procedure in place that actively removes them)—either way, I don't think we can rely on it never happening. — Earwig talk 22:34, 13 January 2016 (UTC)[reply]
- I know. I was not able to come up with a better pattern. I will check some guides on recursive patterns in Java. Do you have a solution in mind? --T.seppelt (talk) 05:13, 14 January 2016 (UTC)[reply]
- @The Earwig: I checked the guides about regular expressions in Java on the internet. It seems to be impossible to define subrules (like this group should only contain templates) in Java. I would suggest running the programme as it is currently on GitHub and seeing how many errors occur. Depending on the number, the script can be adjusted. --T.seppelt (talk) 14:49, 18 January 2016 (UTC)[reply]
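The nested-template concern raised in this thread can be illustrated without recursive regular expressions. The following is only a minimal sketch, not KasparBot's actual code: the class name `TemplateStripper` and both method names are hypothetical. It assumes that counting `{{` and `}}` pairs is enough to find the end of a template call, and it ignores triple-brace parameters, `<nowiki>` sections and comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateStripper {

    /** Removes every call of the named template, including calls that embed other templates. */
    static String removeTemplate(String wikitext, String name) {
        // Opening of the target template: "{{", optional whitespace, the name, then "|" or "}".
        Pattern opening = Pattern.compile(
                "\\{\\{\\s*" + Pattern.quote(name) + "\\s*[|}]",
                Pattern.CASE_INSENSITIVE);
        Matcher m = opening.matcher(wikitext);
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (m.find(pos)) {
            int end = findClosing(wikitext, m.start());
            if (end < 0) {
                break; // unbalanced braces: leave the rest untouched for manual review
            }
            out.append(wikitext, pos, m.start());
            pos = end;
        }
        out.append(wikitext, pos, wikitext.length());
        return out.toString();
    }

    /** Returns the index just past the "}}" that balances the "{{" at open, or -1 if none. */
    static int findClosing(String text, int open) {
        int depth = 0;
        int i = open;
        while (i + 1 < text.length()) {
            char c = text.charAt(i);
            char next = text.charAt(i + 1);
            if (c == '{' && next == '{') {
                depth++;
                i += 2;
            } else if (c == '}' && next == '}') {
                depth--;
                i += 2;
                if (depth == 0) {
                    return i;
                }
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String page = "Intro\n{{Persondata\n| NAME = Doe, John\n"
                + "| DATE OF BIRTH = {{birth date|1950|1|2}}\n}}\n[[Category:Example]]";
        System.out.println(removeTemplate(page, "Persondata"));
    }
}
```

The regular expression is used only to locate the opening of the template; the depth counter then decides where the call ends, so an embedded `{{birth date|...}}` no longer truncates the match. When the braces are unbalanced, the text is left unchanged.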
More quibbles
One more quibble TS: You seem to have done away with the verbs in item 2 above completely. I got the message "Your decision to the information was successfully processed". :) --Izno (talk) 12:16, 13 January 2016 (UTC)[reply]
- In those same descriptions, I might suggest only bolding the verb of interest, rather than the entire sentence in which the verb appears. --Izno (talk) 12:23, 13 January 2016 (UTC)[reply]
- On challenges with conflicts, we get the text "You are kindly ask to decide" -> "You are kindly asked to decide". --Izno (talk) 12:23, 13 January 2016 (UTC)[reply]
- Last quibble: Dates pre-1920 are still appearing in challenges (whether as additions or as conflicts). Have you finished excluding them yet? --Izno (talk) 12:23, 13 January 2016 (UTC)[reply]
- The verbs should be back and are the only words in bold. The text is now grammatically correct. I am working on the exclusion. It will be done by tomorrow. -- T.seppelt (talk) 14:24, 13 January 2016 (UTC)[reply]
- everything Done. They are not excluded but by using ORDER BY challenges with dates before 1920 will only appear when all other challenges are done. This is the fastest solution at the moment. -- T.seppelt (talk) 14:33, 13 January 2016 (UTC)[reply]
- That's an acceptable interim solution. --Izno (talk) 16:23, 13 January 2016 (UTC)[reply]
Conclusion
{{BAGAssistanceNeeded}} I think everything is ready so far. I would like to start with the removal. Are there any further objections? –T.seppelt (talk) 20:19, 23 January 2016 (UTC)[reply]
- For the potential nested template issue: the easiest thing for us would probably be to ignore pages containing \{\{\s*persondata[\{\}]*\{\{ and leave them for manual removal, right? There should be few enough to not cause problems. Can someone explain more about the issue with pre-1920 dates? I'm not aware of any other concerns, though I'd still like another BAG member to check over this due to the sheer number of edits. — Earwig talk 21:25, 23 January 2016 (UTC)[reply]
- @The Earwig: Re-pre-1920 dates, the Gregorian calendar did not become "a thing" for all countries until that year, so exact dates prior to that year would be imported with incorrect values, or could be, without doing some research. As it is, I don't think the tool TS has built currently handles being able to import non-Gregorian dates (whether by using the Wikidata API which does provide for it, I believe, or by providing for the translation within the tool; and a UI has to be built regardless). --Izno (talk) 01:56, 24 January 2016 (UTC)[reply]
- @The Earwig: Yes, I'd like to leave them for manual removal since none of us really knows how many there are. @Izno: As you know those challenges are currently deferred. There is no need to hurry; let's think about a good solution. I'd suggest the following: The pre-1920 dates are available on a separate site which provides more guidelines (and is only accessible once you have done a certain number of regular imports???). If the person has any relation to a country, this country is displayed. Besides deciding about import or rejection, the user can choose the calendar (Gregorian / Julian). What do you think about this? -- T.seppelt (talk) 08:50, 24 January 2016 (UTC)[reply]
- In response:
- Yes, a separate site.
- Also not sure about certain amount of regular imports. Probably not: There are going to be experts who just want to take care of 'their' page.
- Yes, definitely display the country if it is known. Or perhaps, the country of the location where the person was born and where the person died, if we can, since a person can move from one place to another in between.
- Two ways to do it:
- Instead of offering to "import", have both calendar types displayed.
- When the user selects any choice where the value is imported, ask them the calendar before moving to the next challenge.
- I can go either way on 4. 4.1 seems like it will clutter the UI (multiple import options -> multiple import options x 2) but 4.2 seems like it will add an extra step which could be avoided. --Izno (talk) 16:19, 24 January 2016 (UTC)[reply]
- @Izno: Thank you for the input. I don't really get what you mean by #2. Apart from this I would go for #4.1. I will work on a page. -- T.seppelt (talk) 16:36, 24 January 2016 (UTC)[reply]
- Requiring some number of imports for the pre-1920s dates seems like a bad idea, because there will be some article experts who will know the correct date and can just "handle" it. --Izno (talk) 18:20, 24 January 2016 (UTC)[reply]
- Okay, fine. Now I get it. Thanks, -- T.seppelt (talk) 18:24, 24 January 2016 (UTC)[reply]
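To make the calendar question in this thread concrete, here is a minimal sketch of how a date recorded in the Julian calendar maps to the Gregorian calendar. It is an illustration only and not part of the tool: the class name `JulianToGregorian` and its methods are hypothetical. It assumes years after -4800 and uses the standard Julian Day Number formula together with `java.time.LocalDate`, which is proleptic Gregorian.

```java
import java.time.LocalDate;

public class JulianToGregorian {

    // Julian Day Number of 1970-01-01 (Gregorian), i.e. of LocalDate epoch day 0.
    private static final long JDN_OF_EPOCH = 2440588L;

    /** Julian Day Number of a date given in the Julian calendar (month 1-12). */
    static long julianCalendarToJdn(int year, int month, int day) {
        int a = (14 - month) / 12;          // 1 for January and February, else 0
        long y = year + 4800L - a;          // years counted from March -4800
        int m = month + 12 * a - 3;         // March = 0, ..., February = 11
        return day + (153 * m + 2) / 5 + 365 * y + y / 4 - 32083;
    }

    /** Converts a Julian-calendar date to the same day in the Gregorian calendar. */
    static LocalDate julianToGregorian(int year, int month, int day) {
        return LocalDate.ofEpochDay(julianCalendarToJdn(year, month, day) - JDN_OF_EPOCH);
    }

    public static void main(String[] args) {
        // 25 October 1917 in the Julian calendar
        System.out.println(julianToGregorian(1917, 10, 25)); // prints 1917-11-07
    }
}
```

The 13-day gap in the example shows why a pre-1920 date imported without knowing its calendar could be stored with a wrong value.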
- Hmm, can the bare minimum be some warning when trying to import pre-1920 dates to "be careful"? As long as we have some system in place (which can be developed later) and never lose or hide the data, I think we're fine. I'm going to approve this on January 30, barring any objections, if no one else gets to it first. — Earwig talk 06:59, 27 January 2016 (UTC)[reply]
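The skip rule proposed at the start of this section (leave pages with a template nested inside Persondata for manual removal) could be checked roughly as follows. This is a sketch under assumptions, not the bot's code: the class name `NestedPersondataFilter` is hypothetical. The character class shown in the discussion, `[\{\}]*`, matches only braces as written, so the sketch assumes a negated class `[^\{\}]*` was intended.

```java
import java.util.regex.Pattern;

public class NestedPersondataFilter {

    // Assumption: the intended pattern uses a negated class, so that it matches
    // "{{persondata", then any text without braces, then a second "{{".
    private static final Pattern NESTED = Pattern.compile(
            "\\{\\{\\s*persondata[^\\{\\}]*\\{\\{", Pattern.CASE_INSENSITIVE);

    /** True if a template opens inside the Persondata call before it is closed. */
    static boolean hasNestedTemplate(String wikitext) {
        return NESTED.matcher(wikitext).find();
    }

    public static void main(String[] args) {
        String plain = "{{Persondata|NAME=Doe, John}}\n{{DEFAULTSORT:Doe, John}}";
        String nested = "{{Persondata|DATE OF BIRTH={{birth date|1950|1|2}}}}";
        System.out.println(hasNestedTemplate(plain));  // false: "}}" comes before the next "{{"
        System.out.println(hasNestedTemplate(nested)); // true: page would be left for manual removal
    }
}
```

A page for which the check returns true would simply be skipped by the automated run, which keeps the simple removal pattern safe for all remaining pages.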
Approved. — Earwig talk 04:23, 31 January 2016 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.