Do not take existing URL or identifier for granted
Open, Low, Public

Description

OAbot should propose new links even on citations which already have a URL or green OA identifiers (like PMC and arXiv), because the new URL/identifier can be an improvement. See an example of a good edit made after removing the if already_oa_param restriction.

Factors we can consider with small effort:

  • for direct links to a PDF URL and certain identifiers, whether they're still functioning (CiteSeerX records and others can be taken down);
  • for functioning links, whether they're identical to the resolved DOI;
  • for identifiers, whether the existing identifier has higher priority than the proposed one (e.g., CiteSeerX identifier is less useful when there's already a PMC/arXiv identifier).
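The third factor could be sketched as a simple rank comparison. This is only an illustration: the ranking table and the names `IDENTIFIER_RANK` and `should_propose_identifier` are hypothetical, and the only ordering taken from this task is that PMC/arXiv outrank CiteSeerX.

```python
# Hypothetical ranking: a lower number means a more useful identifier.
# Only "PMC/arXiv outrank CiteSeerX" comes from the task description.
IDENTIFIER_RANK = {"pmc": 0, "arxiv": 0, "citeseerx": 1}


def should_propose_identifier(existing_params, proposed_param):
    """Return False when a non-empty existing identifier outranks the proposed one."""
    proposed_rank = IDENTIFIER_RANK.get(proposed_param.lower())
    if proposed_rank is None:
        # Unknown identifier: this sketch has no basis for rejecting it.
        return True
    for name, value in existing_params.items():
        rank = IDENTIFIER_RANK.get(name.lower())
        if rank is None or not (value or "").strip():
            continue  # not a ranked identifier, or present but empty
        if rank < proposed_rank:
            return False
    return True
```

With this sketch, a CiteSeerX identifier would be skipped on a citation that already has a non-empty pmc or arxiv parameter, while a PMC identifier would still be proposed on a citation that only has citeseerx.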

New closed access URLs keep coming all the time, largely due to T352405 / T232771.

Event Timeline

CommunityTechBot raised the priority of this task from High to Needs Triage. Jul 5 2018, 6:57 PM

A first step could be to remove the publisher URLs (which already have a lower rank per T228829) when there's already doi-access=free, so that there's no change in linkage but users are then free to manually set the URL rather than keep what was forced into it by VisualEditor. There's no shortage of them.
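A minimal sketch of that first step, working on a plain dict of citation parameters rather than on a parsed template. The host list is a hypothetical sample (a real one would follow the ranking from T228829), `drop_redundant_publisher_url` is an invented name, and dropping url-access and access-date together with url is an assumption about which parameters depend on the URL.

```python
from urllib.parse import urlparse

# Hypothetical sample of publisher hosts, for illustration only.
PUBLISHER_HOSTS = {
    "www.sciencedirect.com",
    "onlinelibrary.wiley.com",
    "link.springer.com",
}

# Assumption: these parameters only make sense while |url= is present.
URL_DEPENDENT_PARAMS = ("url", "url-access", "access-date")


def drop_redundant_publisher_url(params):
    """Return a copy of params without the publisher URL when doi-access=free."""
    cleaned = dict(params)
    has_doi = bool((params.get("doi") or "").strip())
    doi_is_free = (params.get("doi-access") or "").strip() == "free"
    host = urlparse((params.get("url") or "").strip()).netloc.lower()
    if has_doi and doi_is_free and host in PUBLISHER_HOSTS:
        for name in URL_DEPENDENT_PARAMS:
            cleaned.pop(name, None)
    return cleaned
```

Since the title is already linked through the free DOI in that case, the linkage should not change, and the url parameter becomes available for a better link.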

Nemo_bis moved this task from Bugs to Functionality on the OABot board.

A first step could be to remove the publisher URLs [...]

This is a larger task than it might seem: see for instance over 10k redundant URLs at P14448. The main reason nowadays is that VisualEditor is very busy adding garbage to the url parameter (T232771), but we also have many years of maintenance debt caused by previous additions of non-permanent URLs which then break (case in point: 15k broken oxfordjournals.org URLs).

Can probably make the diff less noisy than https://en.wikipedia.org/w/index.php?title=LGBT_rights_in_Cameroon&curid=3998635&diff=1170361053&oldid=1170360928 , though it's not a big deal to add a few URL-related parameters.

A bug in the current version

Error:
'NoneType' object has no attribute 'strip'
Traceback (most recent call last):
  File "/data/project/oabot/www/python/venv/lib/python3.9/site-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
  File "/data/project/oabot/www/python/venv/lib/python3.9/site-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "app.py", line 131, in process
    context = get_proposed_edits(page_name, force)
  File "app.py", line 218, in get_proposed_edits
    filtered = list([e for e in all_templates if e.proposed_change])
  File "app.py", line 218, in
    filtered = list([e for e in all_templates if e.proposed_change])
  File "./oabot/main.py", line 385, in add_oa_links_in_references
    edit.propose_change(only_doi)
  File "./oabot/main.py", line 197, in propose_change
    url = get_value(self.template, 'url').strip()
AttributeError: 'NoneType' object has no attribute 'strip'
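The traceback suggests get_value returns None when the citation has no url parameter, so the .strip() call fails. One possible guard is sketched below. The `get_value` here is a hypothetical stand-in that reads from a dict (the real function in oabot/main.py takes a template object and is not shown in this task), and `get_stripped_value` is an invented helper name.

```python
def get_value(template, name):
    """Hypothetical stand-in: return the parameter value, or None if absent."""
    return template.get(name)


def get_stripped_value(template, name):
    """Return the stripped parameter value, or an empty string when it is missing."""
    value = get_value(template, name)
    return value.strip() if value is not None else ""


# A citation without |url= no longer raises AttributeError.
url = get_stripped_value({"title": "Example"}, "url")
```

Returning an empty string keeps later truthiness checks such as `if url:` working the same way for a missing and an empty parameter.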

The most popular domains to be replaced can be found with

$ find ~/www/python/src/cache -type f -exec jq '.proposed_edits | .[] | .orig_string' {} \; | grep url= | grep -Eo 'url=[^"|]+' | cut -d/ -f3 | sort | uniq -c | sort -nr | head -n 40
   1110 doi.org
    940 dx.doi.org
    893 www.sciencedirect.com
    724 www.jstor.org
    639 web.archive.org
    571 onlinelibrary.wiley.com
    469 www.researchgate.net
    451 www.tandfonline.com
    444 www.nature.com
    316 www.cambridge.org
    277 link.springer.com
    259 linkinghub.elsevier.com
    259 archive.org
    227 www.escholarship.org
    210 journals.sagepub.com
    197 academic.oup.com
    196 www.academia.edu
    192 pubmed.ncbi.nlm.nih.gov
    191 www.biodiversitylibrary.org
    184 books.google.com

There are some obvious regulars like archive.org but I'm not sure they're frequent enough to warrant an exception.
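For reference, roughly the same tally can be sketched in Python. This is not a drop-in replacement for the pipeline above: it assumes the orig_string values have already been loaded from the cache files, it only matches a url parameter that directly follows a pipe (so for instance archive-url is not counted), and `URL_PARAM` and `count_url_domains` are hypothetical names.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Matches "|url=http..." with optional spaces, up to the next pipe, brace or space.
URL_PARAM = re.compile(r"\|\s*url\s*=\s*(https?://[^|}\s]+)")


def count_url_domains(orig_strings):
    """Count the hosts of the url parameter across citation template strings."""
    counts = Counter()
    for text in orig_strings:
        for url in URL_PARAM.findall(text):
            counts[urlparse(url).netloc] += 1
    return counts


sample = [
    "{{cite journal |url=https://doi.org/10.1000/a |title=A}}",
    "{{cite web | url = http://www.jstor.org/stable/1}}",
    "{{cite journal |doi=10.1000/b}}",
    "{{cite journal |url=https://doi.org/10.1000/c}}",
]
domain_counts = count_url_domains(sample)
```

`domain_counts.most_common(40)` then gives the same kind of ranking as `sort | uniq -c | sort -nr | head -n 40`.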

And currently

$ find ~/www/python/src/cache -type f -exec jq '.proposed_edits | .[] | .orig_string' {} \; | grep url= | grep -Eo 'url=[^"|]+' | cut -d/ -f3 | sort | uniq -c | sort -nr | head -n 40
   1427 doi.org
   1229 dx.doi.org
   1180 www.sciencedirect.com
    940 www.jstor.org
    875 web.archive.org
    736 onlinelibrary.wiley.com
    606 www.researchgate.net
    591 www.nature.com
    586 www.tandfonline.com
    408 www.cambridge.org
    376 archive.org
    337 link.springer.com
    328 linkinghub.elsevier.com
    310 www.escholarship.org
    302 journals.sagepub.com
    283 www.academia.edu
    265 academic.oup.com
    261 pubmed.ncbi.nlm.nih.gov
    259 www.biodiversitylibrary.org
    244 books.google.com
    238 www.science.org
    224 babel.hathitrust.org
    220 zenodo.org
    212 nrs.harvard.edu
    184 ieeexplore.ieee.org
    177 digitalcommons.law.yale.edu
    176 www.journals.uchicago.edu
    166 urn.kb.se
    164 pubs.acs.org
    123 www.bioone.org
    118 nbn-resolving.de
    117 philarchive.org
    110 muse.jhu.edu
    110 link.aps.org
    105 www.research.manchester.ac.uk
    100 bioone.org
     87 www.aeaweb.org
     86 www.osti.gov
     79 pubs.rsc.org
     77 dspace.lboro.ac.uk

So a lot of cruft (and the occasional moved repository like dspace.lboro.ac.uk) to be replaced with far better links:

$ find ~/www/python/src/cache/ -maxdepth 1 -name "*json" -exec cat {} + | grep -Eo 'proposed_link": "https?://[^/]+' | sed 's,proposed_link": ",,g' | sort | uniq -c | sort -nr | head -n 40
   4286 http://citeseerx.ist.psu.edu
   3161 https://zenodo.org
   1198 https://www.biodiversitylibrary.org
    983 https://escholarship.org
    604 https://hal.archives-ouvertes.fr
    596 https://figshare.com
    466 https://www.biorxiv.org
    398 https://dash.harvard.edu
    375 http://doc.rero.ch
    361 https://authors.library.caltech.edu
    318 https://philpapers.org
    314 https://www.osti.gov
    254 https://pure.rug.nl
    251 https://pure.manchester.ac.uk
    250 https://openyls.law.yale.edu
    235 https://discovery.ucl.ac.uk
    229 http://pdfs.semanticscholar.org
    221 https://ora.ox.ac.uk
    213 https://academiccommons.columbia.edu
    183 https://digital.library.unt.edu
    167 https://kops.uni-konstanz.de
    155 https://eprints.whiterose.ac.uk
    152 http://www.nber.org
    152 https://ris.utwente.nl
    152 https://osf.io
    146 https://www.repository.cam.ac.uk
    146 http://eprints.lse.ac.uk
    143 https://aacr.figshare.com
    135 https://pure.uva.nl
    135 https://hal.science
    131 http://eprints.whiterose.ac.uk
    130 https://www.zora.uzh.ch
    128 https://comptes-rendus.academie-sciences.fr
    127 https://lirias.kuleuven.be
    127 https://eprints.gla.ac.uk
    119 https://hcommons.org
    117 https://digitalcommons.unl.edu
    116 https://dr.lib.iastate.edu
    115 https://eprints.soas.ac.uk
    109 https://lawcat.berkeley.edu:443

We should not replace an existing url-access with another for the same URL, as happened in https://en.wikipedia.org/w/index.php?title=Soft_skills&diff=prev&oldid=1188731807 (even though I'd argue the archive.org inlibrary items are more "limited" than "registration").
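The rule could be sketched as a small guard before proposing a url-access value. The citation is modeled as a plain dict here and `should_set_url_access` is a hypothetical name, not an existing OAbot function.

```python
def should_set_url_access(params, proposed_url):
    """Return False when the same URL already carries a url-access value."""
    existing_url = (params.get("url") or "").strip()
    existing_access = (params.get("url-access") or "").strip()
    if existing_access and existing_url == proposed_url.strip():
        # An editor already classified this URL; keep their choice.
        return False
    return True
```

A citation whose URL already has url-access=registration would then be left alone even if the bot would have picked limited or subscription for the same URL.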

Need to check how many url-access=limited we'd add to non-DOI citations like AdsAbs https://en.wikipedia.org/w/index.php?title=T_Scorpii&diff=prev&oldid=1188735108

Currently the most represented domains would be:

$ find -maxdepth 1 -type f -mtime -1 -print0 | xargs -0 -n1 jq '.proposed_edits|.[]| select(.proposed_change|contains("subscription")) | .orig_string' | grep -Eo '\| *url *= *http[^|}]+' | cut -d/ -f3 | sort | uniq -c | sort -nr | head -n 30
    916 dx.doi.org
    723 www.sciencedirect.com
    658 doi.org
    519 www.jstor.org
    312 onlinelibrary.wiley.com
    292 linkinghub.elsevier.com
    267 www.researchgate.net
    221 www.tandfonline.com
    218 www.cambridge.org
    204 link.springer.com
    182 pubmed.ncbi.nlm.nih.gov
    179 www.nature.com
    152 journals.sagepub.com
    131 pubs.acs.org
    102 www.science.org
     94 academic.oup.com
     93 semanticscholar.org
     87 archive.org
     79 www.academia.edu
     74 pubs.geoscienceworld.org
     55 doi.wiley.com
     54 www.journals.uchicago.edu
     52 pubs.rsc.org
     50 muse.jhu.edu
     49 www.semanticscholar.org
     47 ieeexplore.ieee.org
     43 iopscience.iop.org
     42 link.aps.org
     37 xlink.rsc.org
     35 aip.scitation.org

After a broader run

$ find -maxdepth 1 -type f -print0 | xargs -0 -P8 -n1 jq '.proposed_edits|.[]| select(.proposed_change|contains("subscription")) | .orig_string' | grep -Eo '\| *url *= *http[^|}]+' | cut -d/ -f3 | sort | uniq -c | sort -nr | head -n 30
   3020 dx.doi.org
   2666 www.jstor.org
   2569 doi.org
   2116 www.sciencedirect.com
   1217 www.researchgate.net
   1105 onlinelibrary.wiley.com
   1011 www.tandfonline.com
    822 www.cambridge.org
    789 pubmed.ncbi.nlm.nih.gov
    748 linkinghub.elsevier.com
    685 link.springer.com
    630 www.nature.com
    522 journals.sagepub.com
    453 muse.jhu.edu
    435 pubs.acs.org
    361 www.academia.edu
    351 semanticscholar.org
    341 academic.oup.com
    338 www.science.org
    301 archive.org
    244 www.persee.fr
    210 www.journals.uchicago.edu
    187 books.google.com
    180 ieeexplore.ieee.org
    157 pubs.geoscienceworld.org
    150 doi.wiley.com
    149 www.semanticscholar.org
    120 pubs.rsc.org
    119 brill.com
    108 link.aps.org

$ find -maxdepth 1 -type f -print0 | xargs -0 -P8 -n1 jq '.proposed_edits|.[]| select(.proposed_change|contains("URLACCESS")) | .orig_string' | grep -Eo '\| *url *= *http[^|}]+' | cut -d/ -f3 | sort | uniq -c | sort -nr | head -n 10
   9252 books.google.com
   4715 www.jstor.org
   3908 archive.org
   3127 www.biodiversitylibrary.org
   2999 www.researchgate.net
   2728 www.youtube.com
   1920 www.academia.edu
   1184 www.britannica.com
   1113 lpsn.dsmz.de
    879 www.uniprot.org

Currently with some 160k pages found:

$ find -maxdepth 1 -type f -print0 | xargs -0 -P8 -n1 jq '.proposed_edits|.[]| select(.proposed_change|contains("subscription")) | .orig_string' | grep -Eo '\| *url *= *http[^|}]+' | cut -d/ -f3 | sort | uniq -c | sort -nr | head -n 30
  15725 www.jstor.org
  14451 dx.doi.org
  12927 doi.org
   9520 www.sciencedirect.com
   6442 www.researchgate.net
   5630 www.tandfonline.com
   5491 onlinelibrary.wiley.com
   4498 www.cambridge.org
   3824 pubmed.ncbi.nlm.nih.gov
   3477 link.springer.com
   3182 muse.jhu.edu
   3024 linkinghub.elsevier.com
   2928 www.nature.com
   2770 journals.sagepub.com
   2065 www.academia.edu
   1934 pubs.acs.org
   1896 academic.oup.com
   1736 www.persee.fr
   1520 www.science.org
   1473 semanticscholar.org
   1247 www.journals.uchicago.edu
   1210 archive.org
   1128 books.google.com
    956 ieeexplore.ieee.org
    854 www.oxforddnb.com
    789 brill.com
    707 doi.wiley.com
    646 www.semanticscholar.org
    620 zenodo.org
    571 www.degruyter.com

Latest run

$ find ../bot_cache -maxdepth 1 -type f -print0 | xargs -0 -P8 -n1 jq '.proposed_edits|.[]| select(.proposed_change|contains("subscription")) | .orig_string' | grep -Eo '\| *url *= *http[^|}]+' | cut -d/ -f3 | sort | uniq -c | sort -nr | head -n 30
  23674 dx.doi.org
  23599 www.jstor.org
  20461 doi.org
  16283 www.sciencedirect.com
   9215 onlinelibrary.wiley.com
   9100 www.tandfonline.com
   6939 www.cambridge.org
   8635 link.springer.com
   5370 linkinghub.elsevier.com
   5140 www.nature.com
   4755 muse.jhu.edu
   4455 journals.sagepub.com
   3187 pubs.acs.org
   3054 academic.oup.com
   2644 www.science.org
   1930 www.journals.uchicago.edu
   1718 books.google.com
   1528 ieeexplore.ieee.org
   1245 www.oxforddnb.com
   1227 doi.wiley.com
   1214 brill.com
    965 www.degruyter.com
    910 pubs.geoscienceworld.org
    821 pubs.rsc.org
    787 www.cybertruffle.org.uk
    749 www.annualreviews.org
    743 bioone.org
    722 doi.apa.org
    697 www.bmj.com
    695 www.publish.csiro.au