Build an OCR tool for Wikisource so we don't need to rely on external services.
Description
Status | Assigned | Task
---|---|---
Open | None | T161979 Optimize OCR model for Wikisource for each book based on initial proofreading
Resolved | Samwilson | T161978 Epic: Generalized OCR for Wikisource
Resolved | aezell | T244100 Spike: New/Improved OCR tool [8 hours]
Resolved | aborrero | T247422 Update Tesseract on Toolforge to v4.1.0
Resolved | kaldari | T246944 Improve OCR: Test accuracy and features of various OCR engines
Event Timeline
@ifried, @Samwilson, @aezell - I talked with Alexandros Kosiaris about how we could communicate with Google's OCR API from a production extension (similar to what Content Translation is already doing). He informed me that all you have to do is proxy the API requests through the HTTP proxy specified by $wgCopyUploadProxy. Thus it should be relatively easy to move Wikisource OCR into a MediaWiki extension, if we decide we want to do that.
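To make the proxying idea concrete, here is a minimal sketch in Python rather than extension PHP. It assumes the Cloud Vision `images:annotate` REST endpoint with an API key, and it uses a hypothetical `proxy_url` argument to stand in for whatever `$wgCopyUploadProxy` is set to. The function names are illustrative only and are not taken from any existing extension.

```python
import json
import urllib.request

# Public REST endpoint for Cloud Vision image annotation.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_annotate_request(image_uri: str) -> dict:
    """Build the JSON body for a single-image text detection request."""
    return {
        "requests": [
            {
                "image": {"source": {"imageUri": image_uri}},
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
            }
        ]
    }


def build_proxied_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that routes HTTP and HTTPS traffic through proxy_url.

    proxy_url is a stand-in for the value of $wgCopyUploadProxy (assumption).
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def ocr_via_proxy(image_uri: str, api_key: str, proxy_url: str, timeout: int = 30) -> str:
    """Send one OCR request through the proxy and return the recognized text."""
    body = json.dumps(build_annotate_request(image_uri)).encode("utf-8")
    request = urllib.request.Request(
        f"{VISION_ENDPOINT}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with build_proxied_opener(proxy_url).open(request, timeout=timeout) as response:
        result = json.load(response)
    # An image with no detected text yields no fullTextAnnotation; return "" then.
    responses = result.get("responses") or [{}]
    return responses[0].get("fullTextAnnotation", {}).get("text", "")
```

Calling `ocr_via_proxy(image_uri, api_key, proxy_url)` would perform the actual network request. A real extension would presumably do the equivalent with MediaWiki's own HTTP request facilities and read the proxy from configuration instead of taking it as an argument.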
That sounds like a great idea.
That would work for the Google Cloud Vision API, but is there a production/external API for Tesseract? Or is it okay to call Toolforge for that?
No, there is no production/external API for Tesseract, and we would not want to call Toolforge from production. If Tesseract is needed, we'll need to request that Platform Engineering build a service for that. Since Platform Engineering already has a large backlog, we should make that request as soon as we are sure that Tesseract would be needed in production, as it could take a long time (a year or more) to get such a service up and running in production.
@Samwilson This was the ticket listed on the community wishlist. What do we consider its status now?
I think we can call this done!
We still have an external service (the Google Cloud Vision API), but we don't rely on it: the internal Tesseract engine is the default.