Pytesseract is a Python library that provides an interface to the Tesseract optical character recognition (OCR) engine. OCR is a technology used to recognize and extract text from images, scanned documents or other visual media.
Python Tesseract Explained
Tesseract is an open-source optical character recognition (OCR) engine that is used to extract text from images. In Python, pytesseract is a library that provides an interface to Tesseract’s OCR engine.
What Is Python Tesseract?
Tesseract is an open-source OCR engine that was originally developed at Hewlett-Packard and later sponsored by Google. It is widely considered one of the most accurate open-source OCR engines available.
Pytesseract is a thin Python wrapper: it passes your image to the Tesseract OCR engine and returns the result. Tesseract itself first pre-processes the input image to improve its quality. It then analyzes the page layout and orientation to find text blocks, paragraphs and characters. By matching patterns in the segmented areas, Tesseract recognizes individual characters through a combination of machine learning and conventional image processing. It uses language models to increase accuracy and to handle many languages. After recognition, post-processing operations such as spell checking and error correction refine the output.
To recognize text effectively, Tesseract, the OCR engine underlying pytesseract, is trained on language-specific data sets. It supports many languages, each with its own trained data.
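As a minimal sketch of how language selection looks from Python, assuming Tesseract, pytesseract and Pillow are installed and the trained data for each requested language is present. The helper names join_languages and ocr_image are hypothetical; image_to_string and its lang parameter come from pytesseract, and the "+" separator is Tesseract's own syntax for combining languages.

```python
def join_languages(codes):
    """Build Tesseract's language argument, e.g. ["eng", "deu"] -> "eng+deu"."""
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


def ocr_image(image_path, languages=("eng",)):
    """Return the text Tesseract recognizes in the image at image_path."""
    # Imported lazily so join_languages works even without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=join_languages(list(languages)))
```

Calling ocr_image("scan.png", languages=("eng", "deu")) would ask Tesseract to use both the English and German data; the file name here is only a placeholder.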
How to Install Tesseract in Python
Installing pytesseract takes two steps, which is where most of the confusion comes from: you need both the Tesseract OCR engine itself and the pytesseract Python package that wraps it.
First, install the Tesseract OCR engine, then install the pytesseract package with pip.
For Windows (after running the Tesseract installer):
pip install pytesseract
For Linux (Ubuntu/Debian):
sudo apt-get install tesseract-ocr
pip install pytesseract
These are the basic steps for installing pytesseract.
Still, there are a number of issues that you may come across during the installation phase. Below are some steps you can take to resolve them.
ISSUE 1: ModuleNotFoundError
ModuleNotFoundError: No module named ‘pytesseract’
This error means the pytesseract package is not installed in the Python environment you are running. Install it with pip, as you would any other library:
pip install pytesseract
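If the error persists after installing, the package may have gone into a different Python environment than the one running your script. The following is a small sketch, using only the standard library, that checks whether the current interpreter can see the module. The function name is_module_installed is hypothetical.

```python
import importlib.util
import sys


def is_module_installed(name):
    """Return True if the current interpreter can locate the top-level module."""
    return importlib.util.find_spec(name) is not None


if not is_module_installed("pytesseract"):
    # Installing through this interpreter targets the same environment.
    print(f"pytesseract is missing; try: {sys.executable} -m pip install pytesseract")
```

Running pip through the interpreter itself (python -m pip) is one way to make sure the package lands in the environment you are actually using.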
ISSUE 2: The Tesseract Executable Isn’t in Your PATH
TesseractNotFoundError: path_to_tesseract_executable is not installed or it's not in your PATH. See README file for more information
This can be a tricky error. The pytesseract module requires that the Tesseract OCR engine be installed and accessible on your system’s PATH. The error indicates that pytesseract could not find the Tesseract executable there, so you have to add its location to your system’s environment variables.
To fix this, follow these steps. First, install Tesseract OCR engine.
Download and install the Tesseract OCR engine from the official repository. Windows users will have to download the installer from a different source.
After installing Tesseract, you need to add its installation directory to your system’s PATH environment variable. This step varies depending on your operating system.
For Windows
During the installation of Tesseract, there might be an option to add it to the PATH. If you missed that option, you can manually add the Tesseract installation path to your PATH. Typically, it’s installed in C:\Program Files\Tesseract-OCR or C:\Program Files (x86)\Tesseract-OCR.
Follow these steps to add it to the PATH:
First, in the Windows search bar, search for “Environment Variables.” You will find “Edit the system environment variables.”
Next, in the “System Properties” window, click on the “Environment Variables” button.
Under “System variables,” find the “Path” variable, select it, and click the “Edit” button.
Click the “New” button and add the path to the Tesseract installation directory, e.g., C:\Program Files\Tesseract-OCR.
Then, click “OK” to save the changes.
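As an alternative to editing PATH, pytesseract lets you point it at the executable directly through its tesseract_cmd setting. The sketch below wraps that in a hypothetical helper, use_tesseract_at, which first checks that the file exists. The example path assumes the default Windows install directory mentioned above and should be adjusted to your system.

```python
import os


def use_tesseract_at(executable_path):
    """Tell pytesseract which Tesseract executable to run."""
    if not os.path.isfile(executable_path):
        raise FileNotFoundError(f"No Tesseract executable at {executable_path!r}")
    # Imported lazily so the path check above works without pytesseract installed.
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = executable_path
    return executable_path


# Example (assumed default location; adjust to your install):
# use_tesseract_at(r"C:\Program Files\Tesseract-OCR\tesseract.exe")
```

This setting only affects the current Python process, so it is handy for scripts but does not make the tesseract command available in your terminal.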
For macOS and Linux
For macOS and Linux, the installation path may vary. You can typically find Tesseract installed in /usr/bin/tesseract or /usr/local/bin/tesseract. To add it to your PATH:
- Open a terminal window.
- Edit your shell profile configuration file (e.g., ~/.bashrc, ~/.bash_profile or ~/.zshrc) using a text editor like nano or vim.
- Add the following line at the end of the file, replacing /usr/bin with the directory that contains the Tesseract executable if it differs:
export PATH="/usr/bin:$PATH"
- Save the file and close the text editor.
- Run the command source ~/.bashrc (or the respective profile file you edited) to apply the changes to your current terminal session.
To confirm the change, run tesseract --version in a new terminal. You should see the Tesseract version information if the installation was successful.
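The same check can be sketched from Python with the standard library alone. The helper names find_executable and tesseract_version are hypothetical, and the fallback locations are the typical paths mentioned above rather than guaranteed ones.

```python
import os
import shutil
import subprocess


def find_executable(name="tesseract", extra_candidates=()):
    """Look for an executable on PATH, then in explicit fallback locations."""
    found = shutil.which(name)
    if found:
        return found
    for candidate in extra_candidates:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def tesseract_version(executable):
    """Return the first line printed by `<executable> --version`."""
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=True
    )
    # Some builds print the version banner to stderr instead of stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    path = find_executable(
        extra_candidates=["/usr/bin/tesseract", "/usr/local/bin/tesseract"]
    )
    print(tesseract_version(path) if path else "tesseract not found on PATH")
```

If the script reports that Tesseract was not found even though it is installed, the PATH change has probably not reached the environment that Python runs in.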
Next, use the following commands to check that the pytesseract package itself is installed.
Check PIP Installations
Open a terminal or command prompt and enter the following command:
pip show pytesseract
For Conda (Anaconda/Miniconda) installations, open a terminal or Anaconda prompt and enter the following command:
conda list pytesseract
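If pytesseract is installed, pip show prints the package name, version and location; otherwise it warns that the package was not found. Likewise, if pytesseract is installed through Conda, conda list shows the package details, and if it’s not installed, no matching package is listed.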
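A comparable check can be run from inside Python, which confirms what the active interpreter sees rather than what pip or Conda reports in the shell. This is a sketch using the standard library's importlib.metadata; the function name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


print(installed_version("pytesseract") or "pytesseract is not installed in this environment")
```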
Advantages to Python Tesseract
Tesseract has a number of benefits. It is a popular tool with strong community support, and pytesseract is updated regularly to stay compatible with recent Python versions and with the libraries it depends on.
You can choose among several page segmentation modes (PSMs) to tell Tesseract how the text is laid out in a document or image, for example as a single block or a single line, so that it extracts the text you want.
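A page segmentation mode is passed to Tesseract as a command-line style flag, which pytesseract accepts through the config parameter of image_to_string. The sketch below assumes Tesseract, pytesseract and Pillow are installed. The helpers build_config and ocr_with_psm are hypothetical names, and the numeric ranges reflect the modes listed by tesseract --help-extra.

```python
def build_config(psm, oem=None):
    """Build a Tesseract config string such as "--oem 3 --psm 6"."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    parts = []
    if oem is not None:
        if not 0 <= oem <= 3:
            raise ValueError("oem must be between 0 and 3")
        parts.append(f"--oem {oem}")
    parts.append(f"--psm {psm}")
    return " ".join(parts)


def ocr_with_psm(image_path, psm=6):
    """Run OCR with a given page segmentation mode (6 = single uniform block)."""
    # Imported lazily so build_config works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, config=build_config(psm))
```

For instance, mode 7 treats the image as a single line of text, which tends to suit cropped fields such as a serial number.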
Tesseract can also detect the orientation of text, which helps when scans arrive rotated. Pytesseract exposes further configurable options, such as the language setting and engine configuration flags, that help you extract exactly the text you want from images.
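Orientation detection is available through pytesseract's image_to_osd function, which returns a short text report. The sketch below parses that report into a dictionary. It assumes the report consists of "Key: value" lines and includes a "Rotate" field, and that the orientation (osd) trained data is installed. The helper names parse_osd and detect_rotation are hypothetical.

```python
def parse_osd(report):
    """Turn an OSD report of "Key: value" lines into a dictionary of strings."""
    fields = {}
    for line in report.splitlines():
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip()] = value.strip()
    return fields


def detect_rotation(image_path):
    """Return the rotation in degrees that Tesseract suggests for the image."""
    # Imported lazily so parse_osd works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        report = pytesseract.image_to_osd(image)
    # Assumes the report contains a "Rotate" line; raises KeyError otherwise.
    return int(parse_osd(report)["Rotate"])
```

Orientation detection generally needs a reasonable amount of text on the page, so very small crops may cause Tesseract to report an error instead of an angle.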