Replies: 2 comments 4 replies
-
I believe the new Filter function should enable this use case. I'd love to collaborate on this if you're interested!
-
I have a working implementation of Tika integration, although right now its server address is hard-wired because I have no skills in the JavaScript/Svelte space. Basically I created a simple Tika loader class in

    from typing import List

    import requests
    from langchain_core.documents import Document


    class TikaLoader:
        def __init__(self, file_path, mime_type=None):
            self.file_path = file_path
            self.mime_type = mime_type

        def load(self) -> List[Document]:
            # Read the whole attachment and send it to Tika as the request body.
            with open(self.file_path, "rb") as f:
                data = f.read()
            if self.mime_type is not None:
                headers = {"Content-Type": self.mime_type}
            else:
                headers = {}
            # Add a "/" between the base URL and the path only if one is missing.
            endpoint = TIKA_SERVER_URL + ("" if TIKA_SERVER_URL[-1] == "/" else "/") + "tika/text"
            r = requests.put(endpoint, data=data, headers=headers)
            if r.ok:
                raw_metadata = r.json()
                text = raw_metadata.get("X-TIKA:content", "<No text content found>")
                return [Document(page_content=text, metadata=headers)]
            else:
                raise Exception("Error calling Tika")

I then updated

    if USE_TIKA:
        if file_ext in known_source_ext or (
            file_content_type and file_content_type.find("text/") >= 0
        ):
            loader = TextLoader(file_path, autodetect_encoding=True)
        else:
            loader = TikaLoader(file_path, file_content_type)
    else:
        ...

This is currently enabled with:

    USE_TIKA = True
    TIKA_SERVER_URL = "http://tika:9998"

but of course this needs to be replaced with appropriate user configuration. I'm currently running this with Tika running in an adjacent Docker container by sticking these two lines into my

    tika:
      image: apache/tika:latest-full

The document parsing works perfectly on pretty much anything you can throw at it, from scans of pages to calendar
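One way the hard-wired address could be replaced is by reading it from the environment. The following is only a minimal sketch under stated assumptions, not Open WebUI's actual configuration code: the environment variable name `TIKA_SERVER_URL` and the helper `tika_endpoint` are hypothetical names chosen for illustration. The helper also joins the base URL and the path without the manual trailing-slash check.

```python
import os


def tika_endpoint(base_url: str, path: str = "tika/text") -> str:
    """Join a Tika base URL and a resource path with exactly one slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


# Hypothetical setting: Tika is used only when the variable is set and non-empty.
TIKA_SERVER_URL = os.environ.get("TIKA_SERVER_URL", "").strip()
USE_TIKA = bool(TIKA_SERVER_URL)

if __name__ == "__main__":
    # Example with the address used in the comment above.
    print(tika_endpoint("http://tika:9998"))
```

With this shape, leaving the variable unset keeps the existing loaders in use, and a trailing slash in the configured URL makes no difference to the resulting endpoint.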
-
Currently, document attachments for RAG are parsed by selecting from a grab-bag of document loaders from the LangChain community set. While this avoids the need for external services, the supported file type set is small, the results are not always high quality in terms of output order and spacing, and it doesn't support valuable features such as OCR.
It would be great if Open WebUI optionally allowed use of Apache Tika as an alternative way of parsing attachments.
Tika has mature support for parsing hundreds of different document formats, which would greatly expand the set of documents that could be passed in to Open WebUI. It also has integrated support for applying OCR to embedded images, so for instance text extraction from a PDF that is made up of scans of pages "just works".
Importantly, useful installations of Tika are available as completely self-contained Docker images with a REST interface, including versions with bundled Tesseract OCR, making deployment as part of a docker-compose.yml very easy.
Supporting Tika would involve providing a configuration option to let the admin set a Tika service URL and then fixing up backend/apps/rag/main.py to offload most of the text extraction work to Tika if this variable is set.
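The "offload to Tika if this variable is set" decision could look roughly like the sketch below. It is illustrative only and is not taken from backend/apps/rag/main.py: the function `choose_loader`, the extension set `KNOWN_SOURCE_EXT`, and the returned labels are hypothetical. Plain-text files stay with a local text loader, everything else goes to Tika, and the existing behaviour is kept when no URL is configured.

```python
from typing import Optional

# Hypothetical subset of extensions that are treated as plain text or source code.
KNOWN_SOURCE_EXT = {"py", "js", "md", "txt", "json"}


def choose_loader(
    file_ext: str, content_type: Optional[str], tika_url: Optional[str]
) -> str:
    """Return a label for the loader that should handle a file."""
    if not tika_url:
        # No Tika service configured: fall back to the existing loaders.
        return "default"
    is_text = file_ext.lower() in KNOWN_SOURCE_EXT or (
        content_type is not None and content_type.startswith("text/")
    )
    # Text files need no external parsing; everything else is sent to Tika.
    return "text" if is_text else "tika"
```

For example, `choose_loader("pdf", "application/pdf", "http://tika:9998")` selects Tika, while the same call with `tika_url=None` falls back to the default loaders.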