feat: Support Tika for document text extraction #3582

Merged: 2 commits, Jul 2, 2024


@nickovs nickovs commented Jul 1, 2024

See discussion #3475 for background.

  • Target branch: Please verify that the pull request targets the dev branch.
  • Description: Provide a concise description of the changes made in this pull request.
  • Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • Documentation: Have you updated the relevant documentation (Open WebUI Docs or other documentation sources)? Documentation will be added to the relevant tutorial via a separate PR once this PR is accepted.
  • Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • Testing: Have you written and run sufficient tests for validating the changes?
  • Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
    • BREAKING CHANGE: Significant changes that may affect compatibility
    • build: Changes that affect the build system or external dependencies
    • ci: Changes to our continuous integration processes or workflows
    • chore: Refactor, cleanup, or other non-functional code changes
    • docs: Documentation update or addition
    • feat: Introduces a new feature or enhancement to the codebase
    • fix: Bug fix or error correction
    • i18n: Internationalization or localization changes
    • perf: Performance improvement
    • refactor: Code restructuring for better maintainability, readability, or scalability
    • style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc.)
    • test: Adding missing tests or correcting existing tests
    • WIP: Work in progress, a temporary label for incomplete or ongoing work

Changelog Entry

Description

Before LLMs can use any non-image file attached to a chat, the file's text content must be extracted. This PR adds support for doing that extraction with Apache Tika instead of the LangChain community document loaders.

Added

  • Code in backend/apps/rag/main.py and supporting files to conditionally make calls out to an Apache Tika server to extract document text rather than using the community-supported loaders from Langchain.
  • Additions to the stored configuration schema to support text extraction engine selection and configuration.
  • UI components for the Admin Panel->Settings->Documents page to allow engine selection and configuration.
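
As a rough illustration of what a call out to a Tika server involves, the sketch below sends a file's bytes to the server's `/tika` endpoint with `Accept: text/plain` and reads back the extracted text. This is not the code from this PR: the helper names `build_tika_request` and `extract_text_with_tika` are hypothetical, and the address in the usage note assumes a sidecar named `tika` listening on Tika's default port 9998.

```python
import urllib.request
from typing import Optional


def build_tika_request(
    server_url: str, data: bytes, mime_type: Optional[str] = None
) -> urllib.request.Request:
    """Build a PUT request for the Tika server's /tika endpoint (hypothetical helper)."""
    headers = {"Accept": "text/plain"}  # ask for plain-text output
    if mime_type:
        # Optional hint; Tika can detect the document type on its own.
        headers["Content-Type"] = mime_type
    endpoint = server_url.rstrip("/") + "/tika"
    return urllib.request.Request(endpoint, data=data, headers=headers, method="PUT")


def extract_text_with_tika(
    server_url: str, data: bytes, mime_type: Optional[str] = None, timeout: float = 60.0
) -> str:
    """Send the document bytes to Tika and return the extracted text."""
    request = build_tika_request(server_url, data, mime_type)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a server running, something like `extract_text_with_tika("http://tika:9998", pdf_bytes, "application/pdf")` would return the document's text. Splitting request construction from the network call keeps the first half easy to check without a live server.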

Changed

  • Minor (incidental) tweak to .dockerignore to prevent local virtual environments in the venv directory from being copied into Docker images.

Deprecated

None

Removed

None

Fixed

None

Security

  • Note that the use of a new service to which user content will be sent opens up a new attack surface. Apache Tika is a mature product and will typically be deployed in a sidecar container, so the risk is very small.

Breaking Changes

None


Additional Information

Apache Tika is a mature tool for extracting text and metadata from files. It supports hundreds of different document types, is under active development and is well-supported. Features include fully integrated OCR of embedded images in hierarchical documents and output of both plain and structured text.

The Tika server is available as a pre-packaged Docker image for multiple platforms. While this code supports communicating with a server in any network-accessible location, the suggested configuration is to deploy it as a sidecar container alongside the Open WebUI server. This can be done by adding the following to the docker-compose.yaml file:

  tika:
    image: apache/tika:latest-full

and updating the configuration for the Open WebUI container with an additional environment variable:

  open-webui:
    ...
    environment:
      ...
      - 'TEXT_EXTRACTION_ENGINE=tika'
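
On the backend side, a setting like this boils down to a branch on the configured engine name. The following is only a sketch under assumptions, not the merged code: `TEXT_EXTRACTION_ENGINE` is the variable shown above (names in the merged code may differ), while `TIKA_SERVER_URL`, its default value, and `resolve_extraction_config` are hypothetical names used for illustration.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_extraction_config(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Return (engine, tika_url); tika_url is None unless Tika is selected."""
    env = os.environ if env is None else env
    engine = env.get("TEXT_EXTRACTION_ENGINE", "").strip().lower()
    if engine == "tika":
        # Hypothetical variable and default: a sidecar named "tika" on port 9998.
        return "tika", env.get("TIKA_SERVER_URL", "http://tika:9998")
    # Any other value (or none) falls back to the existing loaders.
    return "default", None
```

Passing the environment in as a mapping, rather than reading `os.environ` directly, makes the selection logic easy to exercise with a plain dictionary.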

Added persistent configuration options to configure use and location of Tika service.

Updated backend.apps.rag.main:get_loader() to make use of Tika document loader.
…xt extraction engine.

Updated RAG /config and /config/update endpoints to support UI updates.

Fixed .dockerignore to prevent Python venv from being copied into Docker image.
@nickovs nickovs changed the title Tika document text feat: Support Tika for document text extraction Jul 1, 2024
@tjbck tjbck changed the base branch from main to dev July 2, 2024 00:00

tjbck commented Jul 2, 2024

Excellent addition to our webui, I love it! I'll be updating the variable names to be more generalisable but the rest LGTM, Thanks!

@tjbck tjbck merged commit 3c1ea24 into open-webui:dev Jul 2, 2024