Maniphest T161978

Epic: Generalized OCR for Wikisource
Closed, Resolved · Public

Assigned To

	Samwilson

Authored By

	Halfak
	Apr 2 2017, 12:42 PM

Description

Build an OCR tool for Wikisource so we don't need to rely on external services.

Related Objects

Status	Assigned	Task
Open	None	T161979 Optimize OCR model for Wikisource for each book based on initial proofreading
Resolved	Samwilson	T161978 Epic: Generalized OCR for Wikisource
Resolved	• aezell	T244100 Spike: New/Improved OCR tool [8 hours]
Resolved	aborrero	T247422 Update Tesseract on Toolforge to v4.1.0
Resolved	kaldari	T246944 Improve OCR: Test accuracy and features of various OCR engines

Event Timeline

Halfak created this task. · Apr 2 2017, 12:42 PM

Restricted Application added a subscriber: Aklapper. · Apr 2 2017, 12:42 PM

Halfak added a project: All-and-every-Wikisource. · Apr 2 2017, 12:42 PM

Halfak added a parent task: T161979: Optimize OCR model for Wikisource for each book based on initial proofreading. · Apr 2 2017, 12:44 PM

Capankajsmilyo subscribed. · Apr 18 2018, 2:22 PM

Pols12 awarded a token. · Oct 25 2019, 3:01 PM

Pols12 subscribed.

Xover subscribed. · Oct 31 2019, 7:31 AM

Ltrlg subscribed. · Jan 2 2020, 7:44 AM

ifried mentioned this in T244100: Spike: New/Improved OCR tool [8 hours]. · Feb 4 2020, 1:03 AM

kaldari renamed this task from Generalized OCR for Wikisource to Epic: Generalized OCR for Wikisource. · Oct 19 2020, 5:09 PM

kaldari added a project: Epic.

@ifried, @Samwilson, @aezell - I talked with Alexandros Kosiaris about how we could communicate with Google's OCR API from a production extension (similar to what Content Translation is already doing). He informed me that all you have to do is proxy the API requests through the HTTP proxy specified by $wgCopyUploadProxy. Thus it should be relatively easy to move Wikisource OCR into a MediaWiki extension, if we decide we want to do that.

That sounds like a great idea.

That would work for the Google Cloud Vision API, but is there a production/external API for Tesseract? Or is it okay to call Toolforge for that?

No, there is no production/external API for Tesseract, and we would not want to call Toolforge from production. If Tesseract is needed, we'll need to request that Platform Engineering build a service for that. Since Platform Engineering already has a large backlog, we should make that request as soon as we are sure that Tesseract would be needed in production, as it could take a long time (a year or more) to get such a service up and running in production.
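
For illustration, the proxying approach described in the first comment above could be sketched roughly as follows. This is only a sketch under stated assumptions: a production extension would be written in PHP and would read the proxy from $wgCopyUploadProxy, whereas Python is used here purely to show the shape of the request. The proxy URL, API key, and helper names (`build_ocr_payload`, `build_proxied_opener`, `build_ocr_request`) are hypothetical; the endpoint and request body follow the public Google Cloud Vision REST API.

```python
import json
import urllib.request

# Public REST endpoint for the Google Cloud Vision API.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_ocr_payload(image_url: str) -> bytes:
    """Build the JSON body asking Cloud Vision to OCR one image by URL."""
    body = {
        "requests": [
            {
                "image": {"source": {"imageUri": image_url}},
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
            }
        ]
    }
    return json.dumps(body).encode("utf-8")


def build_proxied_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that routes HTTP and HTTPS traffic through proxy_url.

    proxy_url stands in for whatever $wgCopyUploadProxy is set to (assumption).
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def build_ocr_request(image_url: str, api_key: str) -> urllib.request.Request:
    """Build the POST request; api_key is a placeholder credential."""
    return urllib.request.Request(
        f"{VISION_ENDPOINT}?key={api_key}",
        data=build_ocr_payload(image_url),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (not executed here, since it needs a real proxy and API key):
# opener = build_proxied_opener("http://proxy.example:8080")
# request = build_ocr_request("https://example.org/page.jpg", "API_KEY")
# with opener.open(request, timeout=30) as response:
#     result = json.load(response)
```

Keeping the proxy configuration in the opener rather than in each request means the calling code stays the same whether or not a proxy is in use.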

MJL subscribed. · Nov 16 2020, 5:59 PM

kaldari added a subtask: T244100: Spike: New/Improved OCR tool [8 hours]. · Nov 19 2020, 2:23 AM

Inductiveload subscribed. · Dec 10 2020, 1:39 AM

diegodlh subscribed. · Feb 28 2021, 7:28 PM

@Samwilson, this was the ticket listed on the community wishlist. What do we consider the status of this now?