feat: Support Tika for document text extraction #3582

nickovs · 2024-07-01T18:56:50Z

See discussion #3475 for background.

Target branch: Please verify that the pull request targets the dev branch.
Description: Provide a concise description of the changes made in this pull request.
Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
Documentation: Have you updated the relevant documentation (Open WebUI Docs or other documentation sources)? Documentation will be added to the relevant tutorial via a separate PR once this PR is accepted.
Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
Testing: Have you written and run sufficient tests for validating the changes?
Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
- BREAKING CHANGE: Significant changes that may affect compatibility
- build: Changes that affect the build system or external dependencies
- ci: Changes to our continuous integration processes or workflows
- chore: Refactor, cleanup, or other non-functional code changes
- docs: Documentation update or addition
- feat: Introduces a new feature or enhancement to the codebase
- fix: Bug fix or error correction
- i18n: Internationalization or localization changes
- perf: Performance improvement
- refactor: Code restructuring for better maintainability, readability, or scalability
- style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc.)
- test: Adding missing tests or correcting existing tests
- WIP: Work in progress, a temporary label for incomplete or ongoing work

Changelog Entry

Description

Before any non-image files attached to chats can be used by LLMs, their text content must be extracted. This PR adds support for using Apache Tika for this extraction instead of the Langchain community document loaders.

Added

Code in backend/apps/rag/main.py and supporting files to conditionally make calls out to an Apache Tika server to extract document text rather than using the community-supported loaders from Langchain.
Additions to the stored configuration schema to support text extraction engine selection and configuration.
UI components for the Admin Panel->Settings->Documents page to allow engine selection and configuration.

Changed

Minor (incidental) tweak to .dockerignore to prevent local virtual environments in the venv directory from being copied into Docker images.

Deprecated

None

Removed

None

Fixed

None

Security

Note that sending user content to a new service opens up a new attack surface. However, Apache Tika is a mature product and will typically be deployed in a sidecar container, so the added risk is very small.

Breaking Changes

None

Additional Information

Apache Tika is a mature tool for extracting text and metadata from files. It supports hundreds of different document types, is under active development and is well-supported. Features include fully integrated OCR of embedded images in hierarchical documents and output of both plain and structured text.
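For readers unfamiliar with Tika, the following is a minimal sketch of plain-text extraction over the Tika server's REST interface (a PUT to the /tika endpoint with an Accept: text/plain header). It is not the code from this PR: the function names, the default URL http://tika:9998 (the sidecar's service name plus Tika's default port) and the timeout are assumptions chosen for illustration.

```python
import urllib.request
from typing import Optional


def build_tika_request(
    data: bytes,
    tika_url: str = "http://tika:9998",  # assumed sidecar address, not from the PR
    content_type: Optional[str] = None,
) -> urllib.request.Request:
    """Build a PUT request asking a Tika server for the plain text of `data`."""
    headers = {
        "Accept": "text/plain",
        # Send a generic type when no hint is given; otherwise urllib would
        # add a form-encoded Content-Type that could mislead type detection.
        "Content-Type": content_type or "application/octet-stream",
    }
    return urllib.request.Request(
        tika_url.rstrip("/") + "/tika",
        data=data,
        headers=headers,
        method="PUT",
    )


def extract_text(
    data: bytes,
    tika_url: str = "http://tika:9998",
    content_type: Optional[str] = None,
    timeout: float = 60.0,  # illustrative value; OCR on large files can be slow
) -> str:
    """Send the document to Tika and return the extracted text."""
    request = build_tika_request(data, tika_url, content_type)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A caller would read the uploaded file's bytes and pass them to extract_text(), optionally with the MIME type reported at upload time as a hint.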

The Tika server is available as a pre-packaged Docker image for multiple platforms. While this code supports communicating with a server in any network-accessible location, the suggested configuration is to deploy it as a sidecar container alongside the Open WebUI server. This can be done by adding the following to the docker-compose.yaml file:

  tika:
    image: apache/tika:latest-full

and updating the configuration for the Open WebUI container with an additional environment variable:

  open-webui:
    ...
    environment:
      ...
      - 'TEXT_EXTRACTION_ENGINE=tika'
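As a rough sketch of how a backend might interpret this setting, the snippet below reads TEXT_EXTRACTION_ENGINE from the environment and decides whether to route extraction to Tika. Only the TEXT_EXTRACTION_ENGINE name comes from this PR; the ExtractionConfig class, the helper functions and the TIKA_SERVER_URL variable are hypothetical, and the actual variable names may differ after merge.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ExtractionConfig:
    engine: str    # "" selects the default Langchain loaders
    tika_url: str  # only used when engine == "tika"


def load_extraction_config(env: Mapping[str, str] = os.environ) -> ExtractionConfig:
    """Read the extraction settings from an environment-like mapping."""
    engine = env.get("TEXT_EXTRACTION_ENGINE", "").strip().lower()
    # TIKA_SERVER_URL is a hypothetical name used for illustration only.
    # "tika" is the compose service name and 9998 is Tika's default port.
    tika_url = env.get("TIKA_SERVER_URL", "http://tika:9998")
    return ExtractionConfig(engine=engine, tika_url=tika_url)


def use_tika(config: ExtractionConfig) -> bool:
    """Return True when documents should be sent to the Tika server."""
    return config.engine == "tika"
```

Passing the environment as a mapping keeps the function easy to test without modifying the real process environment.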

tjbck · 2024-07-02T00:07:37Z

Excellent addition to our webui, I love it! I'll be updating the variable names to be more generalisable but the rest LGTM, Thanks!

nickovs added 2 commits June 30, 2024 15:49

Added support for using Apache Tika as a document loader.

9cf622d

Added persistent configuration options to configure use and location of Tika service. Updated backend.apps.rag.main:get_loader() to make use of Tika document loader.

Added HTML and TypeScript UI components to support configuration of te…

7aa35a3

…xt extraction engine. Updated RAG /config and /config/update endpoints to support UI updates. Fixed .dockerignore to prevent Python venv from being copied into Docker image.

nickovs changed the title ~~Tika document text~~ feat: Support Tika for document text extraction Jul 1, 2024

tjbck changed the base branch from main to dev July 2, 2024 00:00

tjbck merged commit 3c1ea24 into open-webui:dev Jul 2, 2024
