feat: Citations event via __event_emitter__ #3615

Merged: 4 commits merged into open-webui:dev from citations-event on Jul 6, 2024

Conversation

@michaelpoluektov (Contributor) commented Jul 3, 2024

Motivation: I would like to create custom citations for my search tool. Example usage:

    async def custom_search_docs(self, query: str, __event_emitter__) -> str:
        await __event_emitter__(
            {"type": "status", "data": {"type": "status", "description": "Retrieving documentation", "done": False}}
        )
        # retrieve the relevant documents
        docs = retriever.invoke(query)
        doc = docs[0]
        await __event_emitter__(
            {
                "type": "citation",
                "data": {
                  "document": [doc["page_content"]],
                  "metadata": {"source": doc["metadata"]["url"]},
                  "source": {"name": doc["metadata"]["title"]},
                }
            }
        )
        await __event_emitter__(
            {"type": "status", "data": {"type": "status", "description": "Retrieving documentation", "done": True}}
        )
        # the rest of the tool

This currently doesn't work: I'm guessing messages is not rendered directly?

Pull Request Checklist

Note to first-time contributors: Please open a discussion post in Discussions and describe your changes before submitting a pull request.

Before submitting, make sure you've checked the following:

  • Target branch: Please verify that the pull request targets the dev branch.
  • Description: Provide a concise description of the changes made in this pull request.
  • Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • Documentation: Have you updated the relevant documentation (Open WebUI Docs) or other documentation sources?
  • Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • Testing: Have you written and run sufficient tests for validating the changes?
  • Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • Prefix: To clearly categorize this pull request, prefix the pull request title, using one of the following:
    • BREAKING CHANGE: Significant changes that may affect compatibility
    • build: Changes that affect the build system or external dependencies
    • ci: Changes to our continuous integration processes or workflows
    • chore: Refactor, cleanup, or other non-functional code changes
    • docs: Documentation update or addition
    • feat: Introduces a new feature or enhancement to the codebase
    • fix: Bug fix or error correction
    • i18n: Internationalization or localization changes
    • perf: Performance improvement
    • refactor: Code restructuring for better maintainability, readability, or scalability
    • style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc.)
    • test: Adding missing tests or correcting existing tests
    • WIP: Work in progress, a temporary label for incomplete or ongoing work

Changelog Entry

Description

  • Add the ability to create new citations via __event_emitter__

Added

  • Add the ability to create new citations via __event_emitter__

Breaking Changes

  • BREAKING CHANGE: __event_emitter__ now requires a type key to specify which kind of event to emit, with data containing the payload.
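
For illustration, a minimal sketch of the resulting envelope (placeholder values; the field names follow the examples in this PR rather than a formal schema):

    # Sketch of the post-change envelope: every event carries a "type"
    # ("status" or "citation") and a type-specific "data" payload.
    async def emit_demo(__event_emitter__) -> None:
        await __event_emitter__(
            {"type": "status", "data": {"description": "Working...", "done": False}}
        )
        await __event_emitter__(
            {
                "type": "citation",
                "data": {
                    "document": ["Some page text"],
                    "metadata": [{"source": "https://example.com"}],
                    "source": {"name": "Example page"},
                },
            }
        )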

src/lib/components/chat/Chat.svelte (review thread, resolved)
@michaelpoluektov (Contributor, Author) commented Jul 4, 2024

It would be good to validate the payload schema a bit more rigorously: I'm terrible at anything frontend-related though, so I don't know what the proper way to do this is. Is there a type definition for citations and status bubbles anywhere?

I left the API the same as the one used on the frontend; feel free to change it if you've got any opinions.
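
In the meantime, a rough sketch of the shapes as I read them from this thread, written as Python TypedDicts (hypothetical names; the examples here use both a dict and a list of dicts for citation metadata, so the union below is a guess, not an official definition):

    from typing import TypedDict, Union

    class StatusData(TypedDict, total=False):
        status: str        # e.g. "in_progress", "complete", "error"
        description: str   # text shown in the status bubble
        done: bool         # True marks the status as finished

    class CitationSource(TypedDict):
        name: str          # display name of the cited source

    class CitationData(TypedDict):
        document: list[str]                # cited document contents
        metadata: Union[dict, list[dict]]  # {"source": url} or [{"source": url}]
        source: CitationSource

    class Event(TypedDict):
        type: str                              # "status" or "citation"
        data: Union[StatusData, CitationData]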

@michaelpoluektov michaelpoluektov changed the title WIP: Citations event via __event_emitter__ feat: Citations event via __event_emitter__ Jul 4, 2024
@michaelpoluektov michaelpoluektov changed the title feat: Citations event via __event_emitter__ WIP: Citations event via __event_emitter__ Jul 4, 2024
@michaelpoluektov michaelpoluektov changed the title WIP: Citations event via __event_emitter__ feat: Citations event via __event_emitter__ Jul 5, 2024
@tjbck (Contributor) commented Jul 6, 2024

Thanks!

@tjbck tjbck merged commit 3928ac1 into open-webui:dev Jul 6, 2024
2 of 5 checks passed
@justinh-rahb (Collaborator) commented

Groovy 😎

Screen.Recording.2024-07-07.at.12.39.39.AM.mp4

@PlebeiusGaragicus commented Jul 9, 2024

Is it working now?!? Can someone share a full code snippet please? Will this also work inside pipelines?

I imagine I have to wait for the next release... this is an incredible feature!

Also wondering if we can have pipelines stream multiple replies. My use case is LangGraph. I want to show the execution as it computes - I may also have multiple LLMs streaming their replies which will build off of each other.

@justinh-rahb (Collaborator) commented Jul 9, 2024

@PlebeiusGaragicus

Screen_Recording_2024-07-08_at_1.53.27_PM.mp4
"""
title: Web Search using SearXNG and Scrape first N Pages
author: constLiakos with enhancements by justinh-rahb
funding_url: https://github.com/open-webui
version: 0.1.3
license: MIT
"""

import os
import requests
from datetime import datetime
import json
from requests import get
from bs4 import BeautifulSoup
import concurrent.futures
from html.parser import HTMLParser
from urllib.parse import urlparse
import re
import unicodedata
from pydantic import BaseModel, Field
import asyncio
from typing import Callable, Any


class Tools:
    class Valves(BaseModel):
        SEARXNG_ENGINE_API_BASE_URL: str = Field(
            default="https://example.com/search",
            description="The base URL for Search Engine",
        )
        SEARXNG_ENGINE_RESULT_NO: int = Field(
            default=3,
            description="The number of Search Engine Results",
        )
        CITATION_LINKS: bool = Field(
            default=False,
            description="If True, send custom citations with links",
        )

    def __init__(self):
        self.valves = self.Valves()

    async def search_web(
        self,
        query: str,
        __event_emitter__: Callable[[dict], Any] = None,
    ) -> str:
        """
        Search the web and get the content of the relevant pages. Search for unknown knowledge, news, info, public contact info, weather, etc.
        :param query: Web query used in the search engine.
        :return: The content of the pages in JSON format.
        """

        def get_base_url(url):
            parsed_url = urlparse(url)
            base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"
            return base_url

        def process_result(result):
            title_site = result["title"]
            url_site = result["url"]
            snippet = result.get("snippet", "")  # Get snippet if available

            try:
                response_site = requests.get(url_site, timeout=120)
                response_site.raise_for_status()
                html_content = response_site.text

                soup = BeautifulSoup(html_content, "html.parser")
                content_site = soup.get_text(separator=" ", strip=True)

                content_site = unicodedata.normalize("NFKC", content_site)
                content_site = re.sub(r"\s+", " ", content_site)
                content_site = content_site.strip()

                links = []
                if self.valves.CITATION_LINKS:
                    for a in soup.find_all("a", href=True):
                        links.append(
                            {
                                "title": a.text.strip(),
                                "link": get_base_url(url_site)   a["href"],
                            }
                        )

                return {
                    "title": title_site,
                    "url": url_site,
                    "content": content_site,
                    "snippet": snippet,
                    "links": links,
                }

            except requests.exceptions.RequestException as e:
                return {
                    "title": title_site,
                    "url": url_site,
                    "content": f"Failed to retrieve the page. Error: {str(e)}",
                    "snippet": snippet,
                }

        if __event_emitter__:
            await __event_emitter__(
                {
                    "type": "status",
                    "data": {
                        "status": "in_progress",
                        "description": f"Initiating web search for: {query}",
                        "done": False,
                    },
                }
            )

        number_of_results = self.valves.SEARXNG_ENGINE_RESULT_NO
        search_engine_url = self.valves.SEARXNG_ENGINE_API_BASE_URL
        params = {"q": query, "format": "json", "number_of_results": number_of_results}

        try:
            if __event_emitter__:
                await __event_emitter__(
                    {
                        "type": "status",
                        "data": {
                            "status": "in_progress",
                            "description": "Sending request to search engine",
                            "done": False,
                        },
                    }
                )

            headers = {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
            }
            resp = requests.get(
                search_engine_url, params=params, headers=headers, timeout=120
            )
            resp.raise_for_status()
            data = resp.json()

            if __event_emitter__:
                await __event_emitter__(
                    {
                        "type": "status",
                        "data": {
                            "status": "in_progress",
                            "description": f"Retrieved {len(data.get('results', []))} search results",
                            "done": False,
                        },
                    }
                )

        except requests.exceptions.RequestException as e:
            if __event_emitter__:
                await __event_emitter__(
                    {
                        "type": "status",
                        "data": {
                            "status": "error",
                            "description": f"Error during search: {str(e)}",
                            "done": True,
                        },
                    }
                )
            return json.dumps({"error": str(e)})

        results_json = []
        if "results" in data:
            results = data["results"][:number_of_results]

            if __event_emitter__:
                await __event_emitter__(
                    {
                        "type": "status",
                        "data": {
                            "status": "in_progress",
                            "description": f"Processing {len(results)} search results",
                            "done": False,
                        },
                    }
                )

            with concurrent.futures.ThreadPoolExecutor() as executor:
                futures = [
                    executor.submit(process_result, result) for result in results
                ]
                results_json = [
                    future.result()
                    for future in concurrent.futures.as_completed(futures)
                ]

            # Add custom citations only if CITATION_LINKS is True
            if self.valves.CITATION_LINKS and __event_emitter__:
                for result in results_json:
                    await __event_emitter__(
                        {
                            "type": "citation",
                            "data": {
                                "document": [result["snippet"] or result["title"]],
                                "metadata": [{"source": result["url"]}],
                                "source": {"name": result["title"]},
                            },
                        }
                    )

        if __event_emitter__:
            await __event_emitter__(
                {
                    "type": "status",
                    "data": {
                        "status": "complete",
                        "description": f"Web search completed. Retrieved content from {len(results_json)} pages",
                        "done": True,
                    },
                }
            )

        return json.dumps(results_json, ensure_ascii=False)

    async def get_website(
        self,
        url: str,
        __event_emitter__: Callable[[dict], Any] = None,
    ) -> str:
        """
        Get the content of the URL provided.
        :param url: The URL of the website.
        :return: The content of the page from the URL in JSON format.
        """

        def get_base_url(url):
            parsed_url = urlparse(url)
            base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"
            return base_url

        def generate_excerpt(content, max_length=200):
            return (
                content[:max_length] + "..." if len(content) > max_length else content
            )

        if __event_emitter__:
            await __event_emitter__(
                {
                    "type": "status",
                    "data": {
                        "status": "in_progress",
                        "description": f"Fetching content from URL: {url}",
                        "done": False,
                    },
                }
            )

        results_json = []

        try:
            headers = {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
            }
            response_site = requests.get(url, headers=headers, timeout=120)
            response_site.raise_for_status()
            html_content = response_site.text

            if __event_emitter__:
                await __event_emitter__(
                    {
                        "type": "status",
                        "data": {
                            "status": "in_progress",
                            "description": "Parsing website content",
                            "done": False,
                        },
                    }
                )

            soup = BeautifulSoup(html_content, "html.parser")

            page_title = soup.title.string if soup.title else "No title found"
            page_title = unicodedata.normalize("NFKC", page_title.strip())

            title_site = page_title
            url_site = url

            content_site = soup.get_text(separator=" ", strip=True)
            content_site = unicodedata.normalize("NFKC", content_site)
            content_site = re.sub(r"\s+", " ", content_site)
            content_site = content_site.strip()

            links = []
            if self.valves.CITATION_LINKS:
                for a in soup.find_all("a", href=True):
                    links.append(
                        {
                            "title": a.text.strip(),
                            "link": get_base_url(url_site)   a["href"],
                        }
                    )

            result_site = {
                "title": title_site,
                "url": url_site,
                "content": content_site,
                "excerpt": generate_excerpt(content_site),
                "links": links,
            }

            results_json.append(result_site)

            if self.valves.CITATION_LINKS and __event_emitter__:
                await __event_emitter__(
                    {
                        "type": "citation",
                        "data": {
                            "document": [result_site["excerpt"]],
                            "metadata": {"source": url_site},
                            "source": {"name": title_site},
                        },
                    }
                )

            if __event_emitter__:
                await __event_emitter__(
                    {
                        "type": "status",
                        "data": {
                            "status": "complete",
                            "description": "Website content retrieved and processed successfully",
                            "done": True,
                        },
                    }
                )

        except requests.exceptions.RequestException as e:
            results_json.append(
                {
                    "url": url,
                    "content": f"Failed to retrieve the page. Error: {str(e)}",
                }
            )

            if __event_emitter__:
                await __event_emitter__(
                    {
                        "type": "status",
                        "data": {
                            "status": "error",
                            "description": f"Error fetching website content: {str(e)}",
                            "done": True,
                        },
                    }
                )

        return json.dumps(results_json, ensure_ascii=False)

@michaelpoluektov michaelpoluektov deleted the citations-event branch July 9, 2024 10:09
@PlebeiusGaragicus commented

THANKS. Bless you sir.

Does this also work for pipelines? That's honestly where I'd be using it.

I need to emit the state of my graph computation. My pipeline will call a LangServe endpoint and being able to show the progress in the UI would be essential.

@michaelpoluektov (Contributor, Author) commented

> Does this also work for pipelines? That's honestly where I'd be using it.

Unfortunately not. As of now it works with tools and filter functions (but not regular functions/manifolds). I was planning to make it work for regular functions, but 0.3.8 got released first, and there are a lot of things I would have to refactor to do that.
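
For reference, a minimal sketch of what the tool-style usage above would look like in a filter function (an assumption on my part: that a filter's inlet receives __event_emitter__ with the same signature as tools; check the docs for your version):

    from typing import Any, Callable

    class Filter:
        async def inlet(
            self, body: dict, __event_emitter__: Callable[[dict], Any] = None
        ) -> dict:
            # Assumed sketch: emit a status event from a filter using the same
            # {"type": ..., "data": ...} envelope introduced in this PR.
            if __event_emitter__:
                await __event_emitter__(
                    {
                        "type": "status",
                        "data": {"description": "Filter ran", "done": True},
                    }
                )
            return body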
