This repository contains modules for basic OCR on images and PDFs, skew correction for scanned documents, automatic noise type detection and noise reduction, and watermark removal. Detailed usage examples for each module are given in separate Jupyter notebooks, and the PDF documentation explains how the code works.
- Tesseract
  - Mac -> brew install tesseract
  - Linux -> sudo apt-get install tesseract-ocr
- Poppler -> installation instructions for different operating systems are listed at https://pypi.org/project/pdf2image/
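Before moving on, it can help to confirm that both prerequisites are reachable from the shell. The following is a minimal sketch, not part of this repository: it assumes Tesseract exposes a `tesseract` executable and that Poppler exposes `pdftoppm` on the PATH, and the helper name `find_missing_tools` is made up for illustration.

```python
import shutil

# Assumption: these are the executable names the prerequisites put on PATH.
# "tesseract" is the Tesseract CLI; "pdftoppm" is one of the Poppler utilities.
REQUIRED_TOOLS = {"Tesseract": "tesseract", "Poppler": "pdftoppm"}


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools whose executable is not found on PATH."""
    return [name for name, exe in tools.items() if shutil.which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Tesseract and Poppler were found on PATH.")
```

If a tool is reported as missing right after installation, opening a new terminal so that PATH is reloaded is usually worth trying first.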
Create a conda environment called env_name (or any other name):
conda create --name env_name python=3.8
Activate this environment:
conda activate env_name
Clone this repo and cd into the repository directory, then run this command to install all packages:
conda install --file requirements.txt --channel defaults --channel anaconda --channel conda-forge
Once all packages are installed, run this command to add the conda environment to Jupyter Notebook as a kernel:
python -m ipykernel install --user --name=env_name
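To check that the kernel was registered, one option is to read the output of `jupyter kernelspec list --json`. The sketch below is illustrative only: the helper names `kernel_names` and `installed_kernels` are hypothetical, and it assumes the `jupyter` command is available in the active environment.

```python
import json
import shutil
import subprocess


def kernel_names(listing_json):
    """Extract sorted kernel names from `jupyter kernelspec list --json` output."""
    return sorted(json.loads(listing_json).get("kernelspecs", {}))


def installed_kernels():
    """Return the registered kernel names, or [] if jupyter is not on PATH."""
    if shutil.which("jupyter") is None:
        return []
    result = subprocess.run(
        ["jupyter", "kernelspec", "list", "--json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return kernel_names(result.stdout)
```

Calling `"env_name" in installed_kernels()` should then evaluate to `True` if the kernel was added under the name used above.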
Project root/
├── assets/
│   └── ...
├── other_scripts/
│   └── ...
├── modules/
│   ├── __init__.py
│   ├── noise_reduction_apply.py
│   ├── noise_type_detector.py
│   ├── ocr.py
│   ├── orientation_correction.py
│   └── watermark_removal.py
├── pdf_documentation/
│   ├── Automatic_Noise_Detection_and_Removal_Pipeline.pdf
│   ├── Gaussian_Noise_Removal.pdf
│   ├── Orientation_Correction.pdf
│   └── Watermark_removal.pdf
├── README.md
├── requirements.txt
├── watermark_stain_removal_example.ipynb
├── noise_detection_and_reduction_pipeline.ipynb
├── ocr_example.ipynb
└── orientation_correction_example.ipynb
- assets/ -> directory containing the PDF and image files used to demo the modules
- other_scripts/ -> directory containing extra scripts that were used to prototype the different modules
- modules/noise_reduction_apply.py -> class with methods to remove noise from images
- modules/noise_type_detector.py -> class with methods to identify whether the noise in an image is Gaussian or impulse
- modules/ocr.py -> class with methods to run OCR on a single image or an entire PDF
- modules/orientation_correction.py -> class with methods to identify the angle of skew and correct the orientation by rotating the image
- modules/watermark_removal.py -> class with a method to remove watermarks from scanned documents
- pdf_documentation/Automatic_Noise_Detection_and_Removal_Pipeline.pdf -> explains how the automatic noise detection and removal pipeline works
- pdf_documentation/Gaussian_Noise_Removal.pdf -> explains how I implemented a new Gaussian noise removal algorithm from the given paper
- pdf_documentation/Orientation_Correction.pdf -> explains how the orientation correction works and how well it performs at different angles of skew
- pdf_documentation/Watermark_removal.pdf -> explains how the watermark removal works
- requirements.txt -> package information for installation with conda
- watermark_stain_removal_example.ipynb -> Jupyter notebook explaining how to use the watermark_removal.py module
- noise_detection_and_reduction_pipeline.ipynb -> Jupyter notebook explaining how to use the noise_type_detector.py and noise_reduction_apply.py modules
- ocr_example.ipynb -> Jupyter notebook explaining how to use the ocr.py module
- orientation_correction_example.ipynb -> Jupyter notebook explaining how to use the orientation_correction.py module
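For orientation, here is a rough sketch of what OCR on a single image or an entire PDF can look like when built on the prerequisites listed above. It is not the code in modules/ocr.py, whose API is documented in ocr_example.ipynb. It assumes the `pytesseract`, `Pillow`, and `pdf2image` packages are installed, and the function names `is_pdf` and `ocr_file` are hypothetical.

```python
from pathlib import Path


def is_pdf(path):
    """Decide by file extension whether the input should be treated as a PDF."""
    return Path(path).suffix.lower() == ".pdf"


def ocr_file(path, dpi=300):
    """Return a list with the recognized text of each page (one entry for an image).

    Third-party imports are kept inside the function so the module can be
    imported even when the OCR dependencies are not installed.
    """
    import pytesseract  # assumption: Python wrapper around the Tesseract CLI
    from PIL import Image

    if is_pdf(path):
        # pdf2image relies on Poppler to render each PDF page as a PIL image.
        from pdf2image import convert_from_path

        pages = convert_from_path(str(path), dpi=dpi)
    else:
        pages = [Image.open(path)]
    return [pytesseract.image_to_string(page) for page in pages]
```

A call such as `ocr_file("scan.pdf")` (with a placeholder file name) would return one string per page. Rendering at a higher `dpi` generally gives Tesseract more pixels to work with at the cost of speed.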