commit bf615a1
Showing 17 changed files with 629 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
/.bundle/ | ||
/.yardoc | ||
/_yardoc/ | ||
/coverage/ | ||
/doc/ | ||
/pkg/ | ||
/spec/reports/ | ||
/tmp/ | ||
|
||
# rspec failure tracking | ||
.rspec_status | ||
|
||
/.idea | ||
*.iml |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
--format documentation | ||
--color | ||
--require spec_helper |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
--- | ||
sudo: false | ||
language: ruby | ||
cache: bundler | ||
rvm: | ||
- 2.4.4 | ||
before_install: gem install bundler -v 1.16.3 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
source "https://rubygems.org" | ||
|
||
git_source(:github) {|repo_name| "https://github.com/#{repo_name}" } | ||
|
||
# Specify your gem's dependencies in hocr_turtletext.gemspec | ||
gemspec |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
PATH | ||
remote: . | ||
specs: | ||
hocr_turtletext (0.1.0) | ||
nokogiri (~> 1.10.7) | ||
|
||
GEM | ||
remote: https://rubygems.org/ | ||
specs: | ||
diff-lcs (1.3) | ||
mini_portile2 (2.4.0) | ||
nokogiri (1.10.7) | ||
mini_portile2 (~> 2.4.0) | ||
rake (10.5.0) | ||
rspec (3.7.0) | ||
rspec-core (~> 3.7.0) | ||
rspec-expectations (~> 3.7.0) | ||
rspec-mocks (~> 3.7.0) | ||
rspec-core (3.7.1) | ||
rspec-support (~> 3.7.0) | ||
rspec-expectations (3.7.0) | ||
diff-lcs (>= 1.2.0, < 2.0) | ||
rspec-support (~> 3.7.0) | ||
rspec-mocks (3.7.0) | ||
diff-lcs (>= 1.2.0, < 2.0) | ||
rspec-support (~> 3.7.0) | ||
rspec-support (3.7.1) | ||
|
||
PLATFORMS | ||
ruby | ||
|
||
DEPENDENCIES | ||
bundler (~> 1.16) | ||
hocr_turtletext! | ||
rake (~> 10.0) | ||
rspec (~> 3.0) | ||
|
||
BUNDLED WITH | ||
1.16.3 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
The MIT License (MIT) | ||
|
||
Copyright (c) 2020 Sue Zheng Hao | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in | ||
all copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN | ||
THE SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,168 @@ | ||
# HocrTurtletext | ||
|
||
Heavily inspired by [PDF::Reader::Turtletext](https://github.com/tardate/pdf-reader-turtletext), HocrTurtletext provides convenient methods to extract content from a hOCR file. hOCR output is commonly produced by OCR software such as tesseract-ocr. | ||
|
||
## Installation | ||
|
||
Add this line to your application's Gemfile: | ||
|
||
```ruby | ||
gem 'hocr_turtletext' | ||
``` | ||
|
||
And then execute: | ||
|
||
$ bundle | ||
|
||
Or install it yourself as: | ||
|
||
$ gem install hocr_turtletext | ||
|
||
## Usage | ||
|
||
### Instantiate HocrTurtletext | ||
|
||
Typical usage: | ||
```ruby | ||
hocr_path = '/tmp/page1.hocr' | ||
options = { :y_precision => 7 } | ||
reader = HocrTurtletext::Reader.new(hocr_path, options) | ||
``` | ||
|
||
Options: | ||
`x_whitespace_threshold`: Words with an x distance of less than this threshold will be concatenated with a space. Try increasing this value if words/letters that are supposed to belong together are separated. | ||
`y_precision`: Rows of text whose y positions differ by less than `y_precision` will be merged into one row. Try increasing this value if words that are supposed to be on the same row are detected as separate rows. | ||
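To make the two options concrete, here is a minimal sketch of how such thresholds could behave. It is not the gem's actual implementation: the `Word` struct and the `group_rows` and `join_row` helpers are hypothetical names introduced only for illustration, and the sketch assumes each word carries its left edge, right edge, and y position in pixels.

```ruby
# Hypothetical word record (not part of the gem): x is the left edge,
# x_end the right edge, y the vertical position, all in pixels.
Word = Struct.new(:x, :x_end, :y, :text)

# Group words into rows. A word joins the current row when its y position
# differs from the row's first word by less than y_precision.
def group_rows(words, y_precision)
  rows = []
  words.sort_by { |w| [w.y, w.x] }.each do |word|
    row = rows.last
    if row && (word.y - row.first.y).abs < y_precision
      row << word
    else
      rows << [word]
    end
  end
  rows.map { |row| row.sort_by(&:x) }
end

# Join neighbouring words in a row. When the horizontal gap between two words
# is below x_whitespace_threshold they become one text element, separated by a space.
def join_row(row, x_whitespace_threshold)
  row.each_with_object([]) do |word, chunks|
    last = chunks.last
    if last && (word.x - last[:x_end]) < x_whitespace_threshold
      last[:text] = "#{last[:text]} #{word.text}"
      last[:x_end] = word.x_end
    else
      chunks << { text: word.text, x_end: word.x_end }
    end
  end.map { |chunk| chunk[:text] }
end

words = [
  Word.new(10, 50, 100, "Total"),
  Word.new(55, 80, 103, "($)"),
  Word.new(300, 340, 101, "42.00"),
  Word.new(10, 60, 140, "Next")
]
rows = group_rows(words, 7)
p rows.map { |row| join_row(row, 10) }
# => [["Total ($)", "42.00"], ["Next"]]
```

In this sketch a larger `y_precision` tolerates more vertical jitter between words on the same line, and a larger `x_whitespace_threshold` merges words that sit further apart.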
|
||
### Extract text within a region described in relation to other text | ||
|
||
This method works nearly identically to its counterpart from PDF::Reader::Turtletext. | ||
The main difference is that we are not dealing with multiple pages in our hOCR input, so | ||
there is no need to support page selection. | ||
|
||
Given that we know the text we want to find is relatively positioned (for example) | ||
below a certain bit of text, to the left of another, and above some other text, use | ||
the `bounding_box` method to describe the region and extract the matching text. | ||
```ruby | ||
textangle = reader.bounding_box do | ||
below /electricity/i | ||
above 10 | ||
right_of 240.0 | ||
left_of "Total ($)" | ||
end | ||
textangle.text | ||
=> [['string','string'],['string']] # array of rows, each row is an array of text elements in the row | ||
``` | ||
|
||
The methods that can be used within the `bounding_box` block are all optional, and include: | ||
- `inclusive` - whether region selection should be inclusive or exclusive of the specified positions | ||
(default is false). | ||
- `below` - a string, regex or number that describes the upper limit of the text box | ||
(default is top border of the page). | ||
- `above` - a string, regex or number that describes the lower limit of the text box | ||
(default is bottom border of the page). | ||
- `left_of` - a string, regex or number that describes the right limit of the text box | ||
(default is right border of the page). | ||
- `right_of` - a string, regex or number that describes the left limit of the text box | ||
(default is left border of the page). | ||
|
||
Note that `left_of` and `right_of` constraints do *not* need to be within the vertical | ||
range of the box being described. | ||
For example, you could use an element in the page header to describe the `left_of` limit | ||
for a table at the bottom of the page, if it has the correct alignment needed to describe your text region. | ||
|
||
Similarly, `above` and `below` constraints do *not* need to be within the horizontal | ||
range of the box being described. | ||
|
||
### Using a block parameter with the `bounding_box` method | ||
|
||
An explicit block parameter may be used with the `bounding_box` method: | ||
```ruby | ||
textangle = reader.bounding_box do |r| | ||
r.below /electricity/i | ||
r.left_of "Total ($)" | ||
end | ||
textangle.text | ||
=> [['string','string'],['string']] # array of rows, each row is an array of text elements in the row | ||
``` | ||
|
||
### How to describe an inclusive `bounding_box` region | ||
|
||
By default, the `bounding_box` method makes an exclusive selection (i.e. not including the | ||
region limits). | ||
|
||
To specify an inclusive region, use the `inclusive!` command: | ||
```ruby | ||
textangle = reader.bounding_box do | ||
inclusive! | ||
below /electricity/i | ||
left_of "Total ($)" | ||
end | ||
``` | ||
Alternatively, set `inclusive` to true: | ||
```ruby | ||
textangle = reader.bounding_box do | ||
inclusive true | ||
below /electricity/i | ||
left_of "Total ($)" | ||
end | ||
``` | ||
Or with a block parameter, you may also assign `inclusive` to true: | ||
```ruby | ||
textangle = reader.bounding_box do |r| | ||
r.inclusive = true | ||
r.below /electricity/i | ||
r.left_of "Total ($)" | ||
end | ||
``` | ||
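The difference between exclusive and inclusive selection comes down to whether the comparison against each limit is strict. The following is a minimal sketch of that idea, not the gem's code; `within?` and `in_region?` are hypothetical helper names.

```ruby
# Hypothetical helper: is value between min and max?
# Exclusive (the default) uses strict comparisons; inclusive accepts the limits.
def within?(value, min, max, inclusive = false)
  inclusive ? value.between?(min, max) : (value > min && value < max)
end

# A point is in the region when both co-ordinates pass the same test.
def in_region?(x, y, xmin, xmax, ymin, ymax, inclusive = false)
  within?(x, xmin, xmax, inclusive) && within?(y, ymin, ymax, inclusive)
end

p in_region?(10, 200, 10, 900, 200, 400)        # => false (on the boundary, exclusive)
p in_region?(10, 200, 10, 900, 200, 400, true)  # => true  (on the boundary, inclusive)
```

Under this reading, an inclusive region also returns the text elements that define the limits themselves, such as the "Total ($)" label in the examples above.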
### Extract text for a region with known positional co-ordinates | ||
|
||
If you know (or can calculate) the x,y positions of the required text region, you can extract the region's text using the `text_in_region` method. | ||
```ruby | ||
text = reader.text_in_region( | ||
10, # minimum x (left-most) | ||
900, # maximum x (right-most) | ||
200, # minimum y (top-most) | ||
400, # maximum y (bottom-most) | ||
false # inclusive of x/y position if true (default false) | ||
) | ||
=> [['string','string'],['string']] # array of rows, each row is an array of text elements in the row | ||
``` | ||
Note that the x,y origin is at the **top-left**. | ||
This differs from how it works in PDF::Reader::Turtletext, where the origin | ||
was bottom-left of the page. | ||
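If you are porting co-ordinates from code written against a bottom-left origin, they need to be flipped vertically. A minimal sketch of the arithmetic follows; `flip_y` and `flip_y_range` are hypothetical helpers, and `page_height` is assumed to be known from elsewhere (for example, the page bounding box in the hOCR file).

```ruby
# Convert a y co-ordinate measured from the bottom of the page
# into one measured from the top (or vice versa).
def flip_y(y, page_height)
  page_height - y
end

# Flipping a vertical range also swaps which end is the minimum.
def flip_y_range(ymin, ymax, page_height)
  [flip_y(ymax, page_height), flip_y(ymin, page_height)]
end

p flip_y(600.0, 842.0)            # => 242.0
p flip_y_range(100, 300, 1000)    # => [700, 900]
```

The x co-ordinates are unaffected, since both conventions measure x from the left edge.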
|
||
### How to find the x,y co-ordinate of a specific text element | ||
|
||
If you are doing low-level text extraction with `text_in_region` for example, | ||
it is usually necessary to locate specific text to provide a positional reference. | ||
|
||
Use the `text_position` method to locate text by exact or partial match. | ||
It returns a Hash of x/y co-ordinates for the bottom-left corner of the text. | ||
```ruby | ||
text_by_exact_match = reader.text_position("Transaction Table") | ||
=> { :x => 10.0, :y => 600.0 } | ||
text_by_regex_match = reader.text_position(/transaction summary/i) | ||
=> { :x => 10.0, :y => 300.0 } | ||
``` | ||
Note: in the case of multiple matches, only the first match is returned. | ||
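The first-match behaviour can be pictured with the sketch below. It is an illustration only, not the gem's implementation: `Element` and `first_position` are hypothetical names, elements are assumed to be searched in document order, and a String is assumed to require an exact match while a Regexp allows a partial one.

```ruby
# Hypothetical text element with a position (not part of the gem).
Element = Struct.new(:x, :y, :text)

# Return the position of the first element whose text matches the pattern,
# or nil when nothing matches. Strings must match exactly; a Regexp may
# match any part of the text.
def first_position(elements, pattern)
  match = elements.find do |el|
    pattern.is_a?(Regexp) ? el.text =~ pattern : el.text == pattern
  end
  match && { :x => match.x, :y => match.y }
end

elements = [
  Element.new(10.0, 300.0, "Transaction Summary"),
  Element.new(10.0, 600.0, "Transaction Table"),
  Element.new(10.0, 900.0, "Transaction Summary (continued)")
]
p first_position(elements, "Transaction Table")
p first_position(elements, /transaction summary/i)
```

Both summary elements match the regex in this data, but only the position of the first one is returned.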
|
||
## Development | ||
|
||
After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment. | ||
|
||
To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org). | ||
|
||
## Contributing | ||
|
||
- Check the issue tracker to see whether someone is already working on what you plan to work on | ||
- Fork project | ||
- Create new branch | ||
- Make changes in new branch | ||
- Submit pull request | ||
|
||
## License | ||
|
||
The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT). | ||
|
||
## Special Thanks | ||
- Paul Gallagher, creator of the [PDF::Reader::Turtletext](https://github.com/tardate/pdf-reader-turtletext) gem, from which large sections of this gem were copied or modified. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
require "bundler/gem_tasks" | ||
require "rspec/core/rake_task" | ||
|
||
RSpec::Core::RakeTask.new(:spec) | ||
|
||
task :default => :spec |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
#!/usr/bin/env ruby | ||
|
||
require "bundler/setup" | ||
require "hocr_turtletext" | ||
|
||
# You can add fixtures and/or initialization code here to make experimenting | ||
# with your gem easier. You can also use a different console, if you like. | ||
|
||
# (If you use this, don't forget to add pry to your Gemfile!) | ||
# require "pry" | ||
# Pry.start | ||
|
||
require "irb" | ||
IRB.start(__FILE__) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
#!/usr/bin/env bash | ||
set -euo pipefail | ||
IFS=$'\n\t' | ||
set -vx | ||
|
||
bundle install | ||
|
||
# Do any other automated setup that you need to do here |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
|
||
lib = File.expand_path('../lib', __FILE__) | ||
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib) | ||
require 'hocr_turtletext/version' | ||
|
||
Gem::Specification.new do |spec| | ||
spec.name = 'hocr_turtletext' | ||
spec.version = HocrTurtletext::VERSION | ||
spec.authors = ['Sue Zheng Hao'] | ||
|
||
spec.summary = 'Reads structured text from hOCR input.' | ||
spec.description = <<-DESC | ||
Parses hOCR input and provides methods to access text in a structured manner. Typical use | ||
cases include parsing formatted text from a hOCR file produced by running a document | ||
through OCR. | ||
DESC | ||
spec.homepage = 'https://github.com/emmeryn/hocr-turtletext' | ||
spec.license = 'MIT' | ||
|
||
# Prevent pushing this gem to RubyGems.org. To allow pushes either set the 'allowed_push_host' | ||
# to allow pushing to a single host or delete this section to allow pushing to any host. | ||
if spec.respond_to?(:metadata) | ||
spec.metadata['allowed_push_host'] = "TODO: Set to 'http://mygemserver.com'" | ||
else | ||
raise 'RubyGems 2.0 or newer is required to protect against ' \ | ||
'public gem pushes.' | ||
end | ||
|
||
# Specify which files should be added to the gem when it is released. | ||
# The `git ls-files -z` loads the files in the RubyGem that have been added into git. | ||
spec.files = Dir.chdir(File.expand_path('..', __FILE__)) do | ||
`git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) } | ||
end | ||
spec.bindir = 'exe' | ||
spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) } | ||
spec.require_paths = ['lib'] | ||
|
||
spec.add_development_dependency 'bundler', '~> 1.16' | ||
spec.add_development_dependency 'rake', '~> 10.0' | ||
spec.add_development_dependency 'rspec', '~> 3.0' | ||
|
||
spec.add_runtime_dependency 'nokogiri', '~> 1.10.7' | ||
end |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
require 'hocr_turtletext/version' | ||
require 'hocr_turtletext/reader' |