Uploading documents connects to external web services such as an AWS ELB? #687

Closed
prologic opened this issue Feb 9, 2024 · 25 comments

Comments

@prologic

prologic commented Feb 9, 2024

Bug Report

Description

Bug Summary:

I tried to upload a document to my locally hosted instance of Ollama Web UI and, to my horror, discovered that the Docker container (running Ollama Web UI) wanted to connect to an AWS ELB?! Naturally I blocked this connection (thanks to Little Snitch). Then it wanted to connect to another external service, something to do with packages (I didn't capture it).

Steps to Reproduce:

  • Install a filtering/logging firewall like Little Snitch
  • Upload a document
  • Observe external connections made

Expected Behavior:

I don't know wtf this is trying to do, but I really DO NOT expect a locally hosted instance of anything to be connecting externally to some 3rd-party services (within reason of course). This is absurd.

At the very least, could someone please explain why this is happening and what this is even used for? Maybe it's legit and required for some part of the "Upload Document" user journey to work?

Actual Behavior:

I expect locally hosted software to NOT connect to external services. The whole point of using Ollama in the first place is to run local LLM models 😅

Environment

Not really relevant. But Docker container on a Mac.

PS: Your issue template is too long. Please simplify it; I don't generally have the time and patience to fill out everything asked, especially as a vision-impaired person. It also takes some of the "human"(ity) out of helping to contribute to "better" open source software.

@prologic
Author

prologic commented Feb 9, 2024

FWIW blocking the two connections didn't appear to affect the functionality of Uploading a document. I was later able to select it and use it in context with #, so I'm really confused as to why those connections are even necessary at all 🤔

@tjbck
Contributor

tjbck commented Feb 9, 2024

Hi, thanks for reporting this issue. Could you verify that the AWS ELB connection is 100% occurring from the webui side? Our backend code does not contain any code that explicitly connects to an AWS ELB, so my guess is that the request is made by one of our dependency libraries. If you could narrow down which part of the code is making the connection, that would be tremendously helpful. Thanks!
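One way to narrow this down from inside the backend process, as a minimal sketch: temporarily wrap `socket.getaddrinfo` so that every hostname lookup is recorded together with the call stack that triggered it. The helper name `log_dns_lookups` is hypothetical and not part of this project. It also assumes the dependency resolves names through `socket.getaddrinfo` at call time; code that connects to a raw IP address or resolves names some other way would not show up.

```python
import socket
import traceback
from contextlib import contextmanager


@contextmanager
def log_dns_lookups(records):
    """Record (hostname, stack) for every socket.getaddrinfo call in the block."""
    original = socket.getaddrinfo

    def wrapper(host, *args, **kwargs):
        # Keep a short stack so the calling library can be identified later.
        records.append((host, traceback.format_stack(limit=8)))
        return original(host, *args, **kwargs)

    socket.getaddrinfo = wrapper
    try:
        yield records
    finally:
        # Always restore the real function, even if the block raised.
        socket.getaddrinfo = original


# Usage: wrap the code path under suspicion (here just a local lookup).
records = []
with log_dns_lookups(records):
    try:
        socket.getaddrinfo("localhost", 80)
    except OSError:
        pass  # resolution may fail in restricted environments; the call is still recorded

for host, stack in records:
    print(host)
    print("".join(stack))
```

Wrapping the document-loading call instead of the `localhost` lookup would show which module's frames appear in the stack when an external hostname is resolved.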

@prologic
Author

Yup makes sense!

I'll try to narrow this down 👌 As you said, if you're not doing this explicitly in this codebase, then I'd consider it a sneaky supply-chain type of thing 🤣

@prologic
Author

So here we go:

Text version(s):

Docker wants to connect to a046be49099ce4659abbcfa853797f20-5fd7cc9498e4883e.elb.ap-southeast-1.amazonaws.com on TCP port 443 (https)

Docker wants to connect to packages.unstructured.io on TCP port 443 (https)

Screenshots:
[Screenshot 2024-02-10 at 11:12:24]
[Screenshot 2024-02-10 at 11:12:55]

@prologic
Author

Note that this is the container itself trying to do this, so something to do with the backend.

@prologic
Author

Doing a search for the 2nd connection yields this:

https://github.com/Unstructured-IO/unstructured/blob/d11c70cf83fdb8a08fed2cf01c6c0bd114d817df/unstructured/utils.py#L287-L319

Are we using this in the backend anywhere? 🤔

@tjbck
Contributor

tjbck commented Feb 10, 2024

Here's a list of our suspects:

langchain
langchain-community
chromadb
sentence_transformers
pypdf
docx2txt
unstructured
markdown
pypandoc
pandas
openpyxl
pyxlsb
xlrd

@prologic
Author

We are:

https://github.com/ollama-webui/ollama-webui/blob/cb5520c519dde81bfe08ce358753ab7f11417f97/backend/requirements.txt#L25

Why does it need to connect to an external service? 🤔

@prologic
Author

I can't figure out this random ELB though, might need some help figuring that one out. But at least we have some culprits now.... The question is, what do we do about it? Blocking both doesn't adversely affect Ollama Web UI in any way that I can tell hmmm

@prologic
Author

prologic commented Feb 10, 2024

Oh wow!

def scarf_analytics():
...

If this library is sending analytics, that's disgusting 😱

@tjbck
Contributor

tjbck commented Feb 10, 2024

UnstructuredMarkdownLoader seems to be the culprit, investigating more.

@prologic
Author

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

www.unstructured.io/

I have half a mind to go yell at this company and ask them to please explain themselves 🤣 Shame on them!

@tjbck
Contributor

tjbck commented Feb 10, 2024

Just reviewed the code, I reckon setting DO_NOT_TRACK env var to True will stop the telemetry, could you try testing it?

@prologic
Author

Love it! Let's do it, happy to test the fix 👌

@prologic
Author

And thank you for responding to this so quickly! When you're self-hosting and insisting on doing things locally, you really don't expect your software to reach out to the internet without you knowing about it 😅

@prologic
Author

Some kudos I posted for you 😅

@justinh-rahb
Collaborator

Good find guys, yeah, that's definitely not nice of them to do. Is there any disclosure from the library anywhere?

@prologic
Author

> Good find guys, yeah, that's definitely not nice of them to do. Is there any disclosure from the library anywhere?

Are you suggesting we file a bug upstream too? It was a bit of a rude surprise to be honest 😅

@tjbck
Contributor

tjbck commented Feb 10, 2024

@justinh-rahb none I can find from their readme :/

EDIT: they do mention at the very bottom of their readme to set the environment variable SCARF_NO_ANALYTICS=true.

@tjbck
Contributor

tjbck commented Feb 10, 2024

Added

ENV SCARF_NO_ANALYTICS true
ENV DO_NOT_TRACK true

with #694, it should disable the telemetry. Please try it out and let me know!
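For setups that do not go through the Docker image, the same opt-out can be sketched in Python. This is only an illustration: the two variable names come from this thread, the helper `apply_telemetry_opt_outs` is hypothetical, and it is an assumption that the library compares against the lowercase string `true` and reads the variables when it is imported or first called.

```python
import os

# Variable names are taken from the thread. The lowercase "true" value mirrors
# the Dockerfile lines above; that the library expects exactly this string is
# an assumption.
TELEMETRY_OPT_OUTS = {
    "SCARF_NO_ANALYTICS": "true",
    "DO_NOT_TRACK": "true",
}


def apply_telemetry_opt_outs(env=None):
    """Set the opt-out variables in `env` (default: os.environ) and return them."""
    if env is None:
        env = os.environ
    for name, value in TELEMETRY_OPT_OUTS.items():
        env[name] = value
    return {name: env[name] for name in TELEMETRY_OPT_OUTS}


# Run this before importing any document loader, so the variables are already
# set by the time a dependency reads them.
apply_telemetry_opt_outs()
```

Setting the variables in the process environment (Dockerfile, shell, or service definition) is the more robust choice, since it does not depend on import order.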

@justinh-rahb
Collaborator

justinh-rahb commented Feb 10, 2024

With RAG being as hot as it is right now, I guess we shouldn't be surprised that some library authors are cashing in on the user data flowing through their code.

Perhaps it'll be prudent to think about dependency audits in the future. With Ollama now supporting a broad range of CPU-only configurations, it can be integrated into GitHub Actions, along with Ollama-WebUI for thorough end-to-end testing. I'm going to give this a think over the weekend, I seem to recall there being a thread in discussions about using the webUI API directly that may come in handy here, time to do some research...
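As a rough starting point for such an audit, here is a minimal sketch that scans Python sources under a directory for telemetry-related keywords. The function name `find_telemetry_hints` and the keyword list are illustrative assumptions, not an established tool, and a keyword hit only flags a line for manual review; it does not prove that anything is sent anywhere.

```python
import tempfile
from pathlib import Path

# Keywords must be lowercase because lines are lowercased before matching.
KEYWORDS = ("scarf", "telemetry", "analytics", "do_not_track")


def find_telemetry_hints(root, keywords=KEYWORDS):
    """Return (file name, line number, keyword) for each line mentioning a keyword."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip it rather than abort the scan
        for lineno, line in enumerate(text.splitlines(), start=1):
            lowered = line.lower()
            # Report only the first matching keyword per line.
            match = next((kw for kw in keywords if kw in lowered), None)
            if match is not None:
                hits.append((path.name, lineno, match))
    return hits


# Demo on a throwaway directory; for a real audit, point `root` at the
# environment's site-packages folder instead.
with tempfile.TemporaryDirectory() as root:
    Path(root, "utils.py").write_text(
        "import os\n\ndef scarf_analytics():\n    pass\n", encoding="utf-8"
    )
    hits = find_telemetry_hints(root)

print(hits)
```

A static scan like this complements, rather than replaces, watching actual network traffic during an end-to-end test run.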

@tjbck
Contributor

tjbck commented Feb 13, 2024

@prologic has the issue been resolved with the latest release?

@prologic
Author

I pulled the latest Docker image and restarted my local instance and so far so good 😊

@tjbck
Contributor

tjbck commented Feb 14, 2024

I'll close this issue for now. Feel free to open new issues if you encounter any spyware from the dependency supply chain, thanks!

@aswani-ms

Do you have example code showing how to upload a document programmatically through an API? Is it possible?
