
Adding support for Vertex AI Matching Engine #289

Closed

Conversation

pascalconfluent
Contributor

No description provided.

Owner

@langchain4j langchain4j left a comment

Hi @pascalconfluent thanks a lot for your contribution!
I have left some comments, please check them, thanks!

langchain4j-vertex-ai/pom.xml (resolved review comments)

@Override
protected double getCosineSimilarity(Embedding embedding, Embedding referenceEmbedding) {
return CosineSimilarity.between(embedding, referenceEmbedding);
Owner

It would be better to keep the behavior of all embedding stores the same. Score should be a number from 0 to 1.

Contributor Author

In Vertex AI it depends on the algorithm you choose when creating the index. The default one used in the unit tests is not recommended by Google. But I am happy to revert and put a note in the unit tests.

Owner

Which algorithms are supported, and which is the default one? I can't find it in the code...

Contributor Author

It is defined during index creation. Here are the different algorithms that can be configured:

- SQUARED_L2_DISTANCE: Euclidean (L2) distance.
- L1_DISTANCE: Manhattan (L1) distance.
- DOT_PRODUCT_DISTANCE: default value. Defined as the negative of the dot product.
- COSINE_DISTANCE: cosine distance. Google strongly suggests using DOT_PRODUCT_DISTANCE + UNIT_L2_NORM instead of the COSINE distance. Their algorithms have been more optimized for the DOT_PRODUCT distance, and when combined with UNIT_L2_NORM it offers the same ranking and mathematical equivalence as the COSINE distance.
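For context, here is a small standalone sketch (not part of the PR) of why DOT_PRODUCT_DISTANCE combined with UNIT_L2_NORM ranks the same as COSINE_DISTANCE: once both vectors are L2-normalized, their dot product is exactly the cosine similarity of the original vectors.

```java
// Illustrative only: the dot product of L2-normalized vectors equals
// the cosine similarity of the original vectors.
public class DotProductVsCosine {

    static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    static double[] l2Normalize(double[] v) {
        double norm = Math.sqrt(dot(v, v));
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 3.0};
        double[] b = {4.0, 5.0, 6.0};

        double cosine = dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
        double dotOfNormalized = dot(l2Normalize(a), l2Normalize(b));

        // Both print ~0.9746, so the two measures produce the same ranking.
        System.out.println(cosine);
        System.out.println(dotOfNormalized);
    }
}
```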

Contributor Author

I am using the default one for testing!

Owner

@langchain4j langchain4j Dec 4, 2023

Hmm... is the index created automatically by your implementation, or does the user create it manually? I see that you create a VertexAiEmbeddingIndex each time addAll() is called, but I can't find the code responsible for defining the algorithm in your implementation...

But my initial point was that instead of using CosineSimilarity.between(embedding, referenceEmbedding), please use RelevanceScore.fromCosineSimilarity(CosineSimilarity.between(embedding, referenceEmbedding)). This way it will be in line with all other integrations. And it should work the same for both cosine similarity and dot product with L2 normalization.
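A minimal sketch of the requested change (assuming langchain4j's CosineSimilarity and RelevanceScore utilities mentioned above; the wrapping class and method are just for illustration):

```java
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.store.embedding.CosineSimilarity;
import dev.langchain4j.store.embedding.RelevanceScore;

class RelevanceScoreSketch {

    // Map cosine similarity in [-1, 1] to a relevance score in [0, 1],
    // so this store behaves like the other embedding store integrations.
    static double relevanceScore(Embedding embedding, Embedding referenceEmbedding) {
        double cosineSimilarity = CosineSimilarity.between(embedding, referenceEmbedding);
        return RelevanceScore.fromCosineSimilarity(cosineSimilarity);
    }
}
```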

Contributor Author

Each list has an index file attached to it. This index file is used by the Matching Engine to index it. If you create a batch-mode index, the Matching Engine scans all the files to add the embeddings to the index. In short, an index is created in two phases: first you create an index in Matching Engine (that is where you define the algorithm, etc.), then you create files that represent a list of embeddings. I implemented the second phase; I will probably do the first one as well, but it can also be done manually using the UI.

For the relevance score, yes, I can do that and assume that the index was created with the right algorithm.

By the way, I looked at the implementation in Python's LangChain and they are doing the same.
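For reference, a rough sketch of what one line of such an embeddings file could look like. The JSON field names ("id", "embedding") follow my reading of the Vertex AI batch index input format and are an assumption, not taken from this PR:

```java
import java.util.List;
import java.util.stream.Collectors;

class IndexFileLineSketch {

    // Serialize one datapoint to a JSON-lines entry that a batch Matching Engine
    // index could read from GCS. Field names are an assumption (see note above).
    static String toJsonLine(String id, List<Float> vector) {
        String values = vector.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(", "));
        return String.format("{\"id\": \"%s\", \"embedding\": [%s]}", id, values);
    }

    public static void main(String[] args) {
        // prints: {"id": "42", "embedding": [0.1, 0.2, 0.3]}
        System.out.println(toJsonLine("42", List.of(0.1f, 0.2f, 0.3f)));
    }
}
```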

final IndexDatapoint resultDatapoint = neighbor.getDatapoint();
return new EmbeddingMatch<>(neighbor.getDistance(),
id,
Embedding.from(resultDatapoint.getFeatureVectorList()),
Owner

Since you made returnFullDatapoint configurable, a check is needed here that the embedding was actually returned.

Contributor Author

It returns an empty array. I think we should also return an empty embedding.

Owner

An empty Embedding makes no sense; if we don't have an embedding, it is better to return null.

Contributor Author

That's debatable. The same goes for strings and other empty/null entities.
Anyway, I am happy to return null instead of empty when there is no result.
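A minimal sketch of what the agreed behavior could look like, based on the snippet quoted above (the wrapper method and the null for the embedded TextSegment are just for illustration, not the PR's final code):

```java
import com.google.cloud.aiplatform.v1.IndexDatapoint;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.store.embedding.EmbeddingMatch;

class NeighborMappingSketch {

    // When returnFullDatapoint is false, Vertex AI returns an empty feature vector,
    // so the match carries null instead of an empty Embedding.
    static EmbeddingMatch<TextSegment> toMatch(double distance, String id, IndexDatapoint datapoint) {
        Embedding embedding = datapoint.getFeatureVectorCount() == 0
                ? null
                : Embedding.from(datapoint.getFeatureVectorList());
        // the last argument (the embedded TextSegment) is left null here for brevity
        return new EmbeddingMatch<>(distance, id, embedding, null);
    }
}
```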

@langchain4j
Owner

BTW, is this the recommended way to store and search embeddings using Vertex AI? It somehow feels overly complicated and in some places inefficient.

log.info("Uploading {} index to GCS.", filename);

// Upload the index to GCS
getGcpBlobService().upload(filename, index.toString());
Owner

The LangChain implementation seems to upload embeddings to a GCS bucket first and then update the index, providing a reference to the bucket (where to read the embeddings from). But in your implementation you provide the embeddings directly in the index, so it feels that uploading them to GCS is unnecessary. Could you please remove this line and test whether it still works? Thanks!

Owner

@langchain4j langchain4j left a comment

Hi @pascalconfluent, could you please check my comments? Thanks!

@geoand
Contributor

geoand commented Jun 5, 2024

I assume this has gone pretty stale at this point :)

@langchain4j
Owner

AFAIK @glaforge wanted to work on adding Vertex AI Search soon, so maybe this PR can be revived?

@glaforge
Collaborator

It would be interesting to know if @pascalconfluent wants to update the PR.
We'll have to add support for metadata filtering as well, I guess.
But it would be great to see Vertex AI Search support (it used to be called Matching Engine).

@pascalconfluent
Contributor Author

pascalconfluent commented Jun 17, 2024 via email

@langchain4j langchain4j added the P3 Medium priority label Sep 10, 2024
@langchain4j
Owner

@pascalconfluent closing this PR due to inactivity. Please feel free to reopen if/when needed.

Labels: P3 Medium priority
Participants: 4