feat: Support Tika for document text extraction #3582
Merged
See discussion #3475 for background.
This PR targets the dev branch.
Changelog Entry
Description
Before any non-image files attached to chats can be used by LLMs, their content text must be extracted. This PR adds support to use Apache Tika for this rather than using the Langchain community document loaders.
Added
backend/apps/rag/main.py and supporting files to conditionally make calls out to an Apache Tika server to extract document text rather than using the community-supported loaders from Langchain.
Changed
.dockerignore to prevent local virtual environments in the venv directory from being copied into Docker images.
Deprecated
None
Removed
None
Fixed
None
Security
None
Breaking Changes
None
Additional Information
Apache Tika is a mature tool for extracting text and metadata from files. It supports hundreds of different document types, is under active development and is well-supported. Features include fully integrated OCR of embedded images in hierarchical documents and output of both plain and structured text.
The Tika server is available as a pre-packaged Docker image for multiple platforms. While this code supports communicating with a server at any network-accessible location, the suggested configuration is to deploy it as a sidecar container alongside the Open WebUI server. This can be done by adding the following to the docker-compose.yaml file and updating the configuration for the Open WebUI container with an additional environment variable:
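The referenced snippets were not captured in this copy of the PR description. As an illustration only (the service name, image tag, and environment-variable name are assumptions and may differ from what the PR actually uses), a sidecar setup might look like:

```yaml
services:
  tika:
    image: apache/tika:latest-full   # "full" image bundles OCR (Tesseract) support
    restart: unless-stopped
    ports:
      - "9998:9998"                  # Tika server's default port
```

with the Open WebUI container pointed at it via an environment variable such as:

```yaml
  open-webui:
    environment:
      - TIKA_SERVER_URL=http://tika:9998   # assumed variable name, for illustration
```

Inside the compose network the Tika service is reachable by its service name, so no host port mapping is strictly required unless external access is wanted.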