Support Apache Tika for RAG text extraction #3475

nickovs · 2024-06-27T20:23:03Z

Currently, document attachments for RAG are parsed by selecting from a grab-bag of document loaders in the LangChain community set. While this avoids the need for external services, the set of supported file types is small, the output order and spacing are not always high quality, and valuable features such as OCR are not supported.

It would be great if Open WebUI optionally allowed use of Apache Tika as an alternative way of parsing attachments.

Tika has mature support for parsing hundreds of different document formats, which would greatly expand the set of documents that could be passed in to Open WebUI. It also has integrated support for applying OCR to embedded images, so for instance text extraction from a PDF that is made up of scans of pages "just works".

Importantly, useful installations of Tika are available as completely self-contained Docker images with a REST interface, including versions with bundled Tesseract OCR, making deployment as part of a docker-compose.yml very easy.

Supporting Tika would involve providing a configuration option to let the admin set a Tika service URL and then fixing up backend/apps/rag/main.py to offload most of the text extraction work to Tika if this variable is set.

tjbck · 2024-06-27T20:27:27Z

Maintainer

I believe the new Filter function should enable this use case. I'd love to collaborate on this if you're interested!

4 replies

nickovs Jun 27, 2024
Author

Right now the current filter implementation doesn't allow this for several reasons:

1. Pipeline filters don't have access to the raw attachment file, since it's not accessible through the /documents/doc endpoint and your pipelines instance might be running in a different container that won't have direct access to the uploads directory.
2. Even if the previous issue were fixed, calls out to filters are currently made with requests.post(), which blocks. To get even the existing data from the /documents/doc endpoint you have to start another request to the (blocked) Open WebUI backend, so you end up in a deadlock.
3. It's not the right place to put it, unless the existing RAG file extraction is going to be moved to a whole new type of pipeline filter, since (a) this should really be a replacement for the existing text extraction function and (b) by the time even a priority 0 filter is called, the existing text extraction code has already run and the text has already been indexed.

There is also a separate issue in that the current chat body that is passed to the filter has separate message and docs lists, and you can't straightforwardly work out the order in which these should be interleaved in a multi-message chat.

My feeling is that:

1. should probably be addressed as a new feature request for a new /documents/... endpoint,
2. is a bug, and calls out to pipelines should be switched to aiohttp, because right now one stalled pipeline filter can hang the whole Open WebUI system,
3. is the real reason why this should be implemented in the core backend rather than as a filter.

That said, I'm happy to hear thoughts from others.
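
To illustrate point 2, here is a minimal sketch using only the Python standard library. It is not Open WebUI code: `time.sleep` stands in for the blocking `requests.post()` call, `asyncio.to_thread` stands in for a non-blocking client such as aiohttp, and the function names and timings are made up for the demonstration.

```python
import asyncio
import time


def blocking_filter_call(delay):
    """Stand-in for requests.post(): blocks the calling thread for `delay` seconds."""
    time.sleep(delay)
    return "filtered"


async def heartbeat(ticks, interval=0.05, count=3):
    """Stand-in for other requests the backend should keep serving."""
    for _ in range(count):
        await asyncio.sleep(interval)
        ticks.append(time.monotonic())


async def demo(offload):
    """Return how many heartbeats fired while the filter call was in flight."""
    ticks = []
    hb = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # give the heartbeat task a chance to start
    if offload:
        # The event loop stays free while the call runs in a worker thread.
        await asyncio.to_thread(blocking_filter_call, 0.3)
    else:
        # The event loop is stuck here until the call returns.
        blocking_filter_call(0.3)
    end = time.monotonic()
    await hb
    return sum(1 for t in ticks if t < end)


if __name__ == "__main__":
    print("heartbeats during blocking call:", asyncio.run(demo(offload=False)))
    print("heartbeats during offloaded call:", asyncio.run(demo(offload=True)))
```

With the blocking call, no heartbeat can run until the call returns. With the offloaded call, the heartbeats keep firing while the filter call is still in flight.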

tjbck Jun 27, 2024
Maintainer

I meant the filter “function”, not the filter “pipeline”. We just introduced the new functions feature in 0.3.6 alongside the Files API. Hope that clarifies!

nickovs Jun 28, 2024
Author

@tjbck Thanks for the clarification. Can you point me to some documentation (or at least specific source code) for this new functionality?

nickovs Jun 28, 2024
Author

Having run some tests, for the general task of just getting text and metadata out of any file you care to throw at it, here is all the code that you need to make an asynchronous call to Tika and return the contents:

import aiohttp

async def tika_text(data, mime_type=None):
    # Pass the MIME type along if known; otherwise let Tika detect it.
    headers = {"Content-Type": mime_type} if mime_type else {}
    async with aiohttp.ClientSession() as session:
        # TIKA_BASE_URL is expected to end with a trailing slash.
        async with session.put(TIKA_BASE_URL + "tika/text", data=data, headers=headers) as response:
            metadata = await response.json()
            text = metadata['X-TIKA:content']
    return text, metadata

nickovs · 2024-06-28T21:45:14Z

Author

I have a working implementation of Tika integration, although right now its server address is hard-wired because I have no skills in the JavaScript/Svelte space. Basically I created a simple Tika loader class in backend/apps/rag/main.py that looks like this:

class TikaLoader:
    def __init__(self, file_path, mime_type=None):
        self.file_path = file_path
        self.mime_type = mime_type

    def load(self) -> List[Document]:
        with open(self.file_path, "rb") as f:
            data = f.read()

            if self.mime_type is not None:
                headers = {"Content-Type": self.mime_type}
            else:
                headers = {}

            endpoint = TIKA_SERVER_URL + ("" if TIKA_SERVER_URL[-1] == "/" else "/") + "tika/text"

            r = requests.put(endpoint, data=data, headers=headers)

            if r.ok:
                raw_metadata = r.json()
                text = raw_metadata.get("X-TIKA:content", "<No text content found>")

                return [Document(page_content=text, metadata=headers)]
            else:
                raise Exception("Error calling Tika")
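
One small caveat with the endpoint expression in the class above: indexing `TIKA_SERVER_URL[-1]` raises an `IndexError` if the URL is an empty string. A possible way to harden this is sketched below. The helper name `tika_endpoint` is hypothetical and not part of the working implementation.

```python
def tika_endpoint(base_url: str, path: str = "tika/text") -> str:
    """Join a Tika base URL and a resource path with exactly one slash."""
    if not base_url:
        # Fail with a clear message instead of an IndexError on base_url[-1].
        raise ValueError("Tika server URL is not configured")
    return base_url.rstrip("/") + "/" + path.lstrip("/")
```

The same result comes back whether or not the configured URL has a trailing slash, so `tika_endpoint("http://tika:9998")` and `tika_endpoint("http://tika:9998/")` both give `http://tika:9998/tika/text`.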

I then updated get_loader(...) to preface the large if...elif... ladder with:

    if USE_TIKA:
        if file_ext in known_source_ext or (
                file_content_type and file_content_type.find("text/") >= 0
        ):
            loader = TextLoader(file_path, autodetect_encoding=True)
        else:
            loader = TikaLoader(file_path, file_content_type)
    else:
        ...

This is currently enabled with:

USE_TIKA = True
TIKA_SERVER_URL = "http://tika:9998"

but of course this needs to be replaced with appropriate user configuration.
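
As a rough sketch of what that configuration could look like on the backend side, following the original proposal that Tika is used only when a service URL is set: the function name `load_tika_config` is hypothetical, and reading the value from an environment variable named `TIKA_SERVER_URL` is an assumption, not how Open WebUI's configuration system actually works.

```python
import os


def load_tika_config(env=None):
    """Return (use_tika, server_url) from environment-style settings.

    `env` defaults to os.environ; any mapping can be passed in for testing.
    """
    env = os.environ if env is None else env
    # Assumed variable name, reused from the hard-wired constant above.
    url = env.get("TIKA_SERVER_URL", "").strip()
    # Tika is enabled only when a non-empty URL has been configured.
    return bool(url), url


USE_TIKA, TIKA_SERVER_URL = load_tika_config()
```

Deriving `USE_TIKA` from the URL avoids a second switch that could disagree with it. An admin-facing setting in the UI would still be needed on top of this.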

I'm currently running this with Tika running in an adjacent Docker container by sticking these two lines into my docker-compose.yaml:

  tika:
    image: apache/tika:latest-full

The document parsing works perfectly on pretty much anything you can throw at it, from scans of pages to calendar .ics files to old format PowerPoint.

0 replies
