Obsidian Text Extractor

Text Extractor is a "companion" plugin. It's mainly useful in conjunction with other plugins (like Omnisearch), but you can also use it on its own to quickly extract text from images & PDFs.

Supported files:

Images (.png, .jpg, .jpeg)
PDFs (.pdf)

Limitations

The plugin currently uses Tesseract.js and pdf-extract to extract texts from images and PDFs. Those libraries are not perfect, and may not work on some files.
🟥 Text Extraction does not work on mobile 🟥. Read the following section for more details.

Cache & Sync

The plugin caches the extracted texts as local small .json files inside the plugin directory. Those files can be synced between your devices. Since text extraction does not work on mobile, the plugin will use the synced cached texts if available. If not, an empty string will be returned.
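Because a missing cache entry on mobile yields an empty string rather than an error, a consuming plugin may want to treat "no text" explicitly. The following is a minimal sketch, not part of Text Extractor itself: `normalizeExtractedText` is a hypothetical helper name, and it assumes your plugin prefers `null` over an empty or whitespace-only result. Note that an empty string can also mean the file genuinely contains no text, so the two cases cannot be told apart from the return value alone.

```typescript
// Hypothetical helper (not provided by Text Extractor): maps an empty,
// whitespace-only, or missing result to null so callers can skip it.
function normalizeExtractedText(text: string | undefined): string | null {
  const trimmed = (text ?? "").trim()
  return trimmed.length > 0 ? trimmed : null
}
```

A caller could then skip indexing whenever the helper returns `null`, for example on a mobile device where the cached text has not been synced yet.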

Installation

Text Extractor is available on the Obsidian community plugins repository. You can also install it manually by downloading the latest release from the releases page or by using the BRAT plugin manager.

Why?

Text extraction is a useful feature, but it is not easy to implement, and consumes a lot of resources.

With this plugin, I hope to provide a unified way to extract texts from images & PDFs, and make it available to other plugins. This way, other plugins can use it without having to worry about the implementation details, and without having to needlessly consume resources.

⚠️ Work in progress

I'm dogfooding this plugin with Omnisearch. The API functions likely won't change, but this is still a beta.

Using Text Extractor as a dependency for your plugin

The exposed API:

import type { TFile } from "obsidian"

// Add this type somewhere in your code
export type TextExtractorApi = {
  extractText: (file: TFile) => Promise<string>
  canFileBeExtracted: (filePath: string) => boolean
  isInCache: (file: TFile) => Promise<boolean>
}

// Then, you can just use this function to get the API.
// It returns undefined if Text Extractor is not installed or not enabled.
export function getTextExtractor(): TextExtractorApi | undefined {
  return (app as any).plugins?.plugins?.["text-extractor"]?.api
}

// And use it like this (`text` is undefined when the API is unavailable)
const text = await getTextExtractor()?.extractText(file)

Note that Text Extractor only extracts text on demand, when you call extractText() on a file, to avoid unnecessary resource consumption. Subsequent calls to extractText() on the same file will return the cached text.
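Since extraction only happens on demand, a caller can combine the three API functions to decide whether a call is worthwhile. The sketch below is one possible way to do that, under stated assumptions: `extractIfSupported` and its `cachedOnly` option are hypothetical names, and `FileLike` is a minimal stand-in for Obsidian's `TFile` so that the snippet is self-contained. The API object is passed in as a parameter rather than looked up globally.

```typescript
// Minimal stand-in for Obsidian's TFile; only `path` is needed here.
type FileLike = { path: string }

// Same shape as the exposed API, generic over the file type.
type TextExtractorApi<F extends FileLike> = {
  extractText: (file: F) => Promise<string>
  canFileBeExtracted: (filePath: string) => boolean
  isInCache: (file: F) => Promise<boolean>
}

// Hypothetical helper: returns the extracted text, or null when the API is
// missing, the file type is unsupported, or (with cachedOnly) the text has
// not been extracted yet and extraction would have to run.
async function extractIfSupported<F extends FileLike>(
  api: TextExtractorApi<F> | undefined,
  file: F,
  options: { cachedOnly?: boolean } = {}
): Promise<string | null> {
  if (!api) return null
  if (!api.canFileBeExtracted(file.path)) return null
  if (options.cachedOnly && !(await api.isInCache(file))) return null
  return api.extractText(file)
}
```

With the real plugin, this could be called as `extractIfSupported(getTextExtractor(), file, { cachedOnly: true })` in places where triggering a fresh extraction would be too expensive.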

Development

Although this plugin was first developed for Omnisearch, it's completely agnostic, and I'd like it to become a community effort. If you wish to submit a PR, please open an issue first so we can discuss the feature.

The plugin is split into two parts:

The text extraction library, which does the actual work
The plugin itself, which is a wrapper around the library and exposes some useful options to the user

Each project is in its own folder, and has its own package.json and node_modules. The library uses Rollup (easier to set up with Wasm and web workers), while the plugin uses esbuild.
