Ospp/new llm embedding #19727

charleschile · 2024-10-31T15:01:08Z

User description

What type of PR is this?

Which issue(s) this PR fixes:

What this PR does / why we need it:

As part of our document LLM support, we are introducing the LLM_EMBEDDING function. This function can a string or embed the content from a specified txt file by using LLM platforms like Ollama and LLM models like llama3.

Three global variables are introduced for users to customize their own LLM platforms, proxy and models.

Three global variables:

llm_embedding_platform: default is ollama
llm_server_proxy: default is http://localhost:11434/api/embed
llm_model: default is llama3

Usage: llm_embedding(<input txt datalink>); or llm_embedding(<input string>);

Return Value: a vector of 4096 32-bit floating point.

Note:

we only support ollama for the platform right now.
if you want to use the Llama 3 model, ensure that the model is running by executing ollama run llama3 in the shell before using the embedding function.

Example SQL:

-- datalink
SELECT llm_embedding(cast('file:///Users/charles/Desktop/codes/testData/embeddingTest.txt' as datalink));

-- string
SELECT llm_embedding("This is text... bla ");

Example return:

mysql> SELECT embedding("This is text... bla ");

embedding(This is text... bla )                                                                                                                                                                                                                                |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
 [-0.021955958, -0.024112409, 0.010994452, -0.011615187, 0.03107838, -0.024938945, -0.029811664, -0.018000955, -0.01751671, -0.0007840055, -0.021467123, -0.014996716, -0.0045815744, -0.026652282, -0.016832603, 0.009745739, -0.016642446, 0.00514828, -0.015859831, -0.010222016, -0.010362088, 0.0056358385, -0.015048031, -0.0031982209, -0.010094931, 0.016682236, 0.008610655, -0.010177104, 0.0035889482, -0.011772896, 0.0020790452, 0.0064958483, 0.007662446, -0.011871371, -0.0060495464, -0.00031009244 ｜

1 row(s) fetched.

PR Type

Enhancement, Tests

Description

Implemented new LLM functions: LLM_CHUNK, LLM_EXTRACT_TEXT, and LLM_EMBEDDING.
Added unit and SQL tests for LLM functions, covering chunking, text extraction, and embedding.
Integrated Ollama service for embedding operations and defined related system variables.
Updated dependencies to include a PDF processing library.

Changes walkthrough 📝

Relevant files

Enhancement

6 files

func_llm.go `Implement LLM functions for chunking, text extraction, and embedding` pkg/sql/plan/function/func_llm.go Implemented LLM_CHUNK, LLM_EXTRACT_TEXT, and LLM_EMBEDDING functions. Added chunking strategies: fixed width, sentence, paragraph, and document. Integrated PDF text extraction and embedding operations. Managed file reading and error handling for LLM operations.	471/-0
list_builtIn.go `Register new LLM functions and define overloads` pkg/sql/plan/function/list_builtIn.go Registered new LLM functions: LLM_EXTRACT_TEXT, LLM_CHUNK, and LLM_EMBEDDING. Defined overloads and argument types for each function.	76/-2
ollama_service.go `Implement Ollama service interaction for embeddings` pkg/sql/plan/function/ollama_service.go Implemented Ollama service interaction for embeddings. Defined request and response structures for Ollama API.	96/-0
embedding_service.go `Define EmbeddingService interface and implement Ollama service` pkg/sql/plan/function/embedding_service.go Defined EmbeddingService interface and OllamaEmbeddingService. Implemented method to retrieve embeddings using Ollama API.	70/-0
variables.go `Add system variables for LLM embedding configuration` pkg/frontend/variables.go Added system variables for LLM embedding configuration. Defined defaults for platform, proxy, and model.	24/-0
function_id.go `Add function IDs for new LLM functions` pkg/sql/plan/function/function_id.go Added function IDs for new LLM functions. Updated function ID registry with LLM functions.	8/-0

Tests

7 files

func_llm_test.go `Add unit tests for LLM chunking and extraction functions` pkg/sql/plan/function/func_llm_test.go Added unit tests for LLM chunking and text extraction functions. Tested various chunking modes and error scenarios.	237/-0
func_llm_chunk.result `Add expected results for LLM chunking function tests` test/distributed/cases/function/func_llm_chunk.result Added expected results for LLM chunking function tests. Verified chunking outputs for various modes.	30/-0
func_llm_chunk.sql `Add SQL test cases for LLM chunking function` test/distributed/cases/function/func_llm_chunk.sql Added SQL test cases for LLM chunking function. Tested different chunking strategies and data inputs.	24/-0
func_llm_extract_file.result `Add expected results for LLM text extraction function tests` test/distributed/cases/function/func_llm_extract_file.result Added expected results for LLM text extraction function tests. Verified text extraction from PDF files.	6/-0
func_llm_extract_file.sql `Add SQL test cases for LLM text extraction function` test/distributed/cases/function/func_llm_extract_file.sql Added SQL test cases for LLM text extraction function. Tested extraction from PDF files to text files.	2/-0
4.txt `Add text file for LLM chunking test resources` test/distributed/resources/llm_test/chunk/4.txt Added text file for LLM chunking test resources.	3/-0
func_llm_embedding.sql `Add SQL test case for LLM embedding function` test/distributed/cases/function/func_llm_embedding.sql Added SQL test case for LLM embedding function. Included a skip directive for known issue.	2/-0

Dependencies

2 files

go.sum `Update dependencies for PDF processing` go.sum Updated dependencies to include PDF processing library.	2/-0
go.mod `Add PDF processing library to module dependencies` go.mod Added PDF processing library to module dependencies.	1/-0

Additional files (token-limit)

1 files

func_llm_embedding.result `...` test/distributed/cases/function/func_llm_embedding.result ...	7/-1

💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

As part of our document LLM support, we are introducing the `LLM_EXTRACT_TEXT` function. This function extracts text from PDF files and writes the extracted text to a specified text file, extractor type can be specified by the third argument.

mergify · 2024-10-31T15:01:49Z

⚠️ The sha of the head commit of this PR conflicts with #18470. Mergify cannot evaluate rules on this PR. ⚠️

qodo-merge-pro · 2024-10-31T15:01:59Z

PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 Security concerns Sensitive information exposure: The code includes hardcoded default values for LLM server proxy and model in the `pkg/frontend/variables.go` file. While these are just defaults, it's important to ensure that production deployments don't accidentally expose internal LLM services to the public internet.
⚡ Recommended focus areas for review Resource Management The `defer r.Close()` inside the for loop may not close the file descriptor until the function ends. This could lead to resource leaks if processing a large number of rows. Error Handling Some error cases return errors directly without proper wrapping or context. Consider using `moerr.NewInvalidInputf` consistently for better error handling and debugging. Error Handling Error messages in `getLLMGlobalVariable` function use different error creation methods. Standardize error handling for consistency.

qodo-merge-pro · 2024-10-31T15:02:43Z

PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Score
Best practice	Implement proper error handling for deferred Close() calls to prevent resource leaks and capture important errors Consider implementing proper error handling and logging for the defer r.Close() statement. Ignoring potential errors from Close() could lead to resource leaks or mask important issues. pkg/sql/plan/function/func_llm.go [162-166] r, err := ReadFromFile(moUrl, fs) if err != nil { return err } -defer r.Close() defer func() { if cerr := r.Close(); cerr != nil { // Log the error or update the return error if it's nil if err == nil { err = cerr } else { // Log cerr or combine with existing error } } }() Apply this suggestion Suggestion importance[1-10]: 9 Why: Proper error handling for deferred Close() calls is crucial to prevent resource leaks and capture errors that might otherwise be ignored, making this a highly impactful suggestion.	9
Performance	Use a more efficient string concatenation method for improved performance when handling large texts or many chunks Consider using a more efficient string concatenation method, such as strings.Builder, for building the result string in the ChunkString function, especially when dealing with large texts or many chunks. pkg/sql/plan/function/func_llm.go [128-133] -var chunkStrings []string -for _, chunk := range chunks { - chunkStrings = append(chunkStrings, fmt.Sprintf("[%d, %d, %q]", chunk.Start, chunk.Length, chunk.Text)) var builder strings.Builder builder.WriteString("[") for i, chunk := range chunks { if i > 0 { builder.WriteString(", ") } fmt.Fprintf(&builder, "[%d, %d, %q]", chunk.Start, chunk.Length, chunk.Text) } builder.WriteString("]") return builder.String(), nil -return "[" strings.Join(chunkStrings, ", ") "]", nil - Apply this suggestion Suggestion importance[1-10]: 8 Why: The suggestion to use strings.Builder for string concatenation is a well-known optimization that can significantly improve performance when dealing with large texts or many chunks, making it a valuable enhancement.	8
Enhancement	Implement context-aware file reading to improve resource management and responsiveness Consider using a context-aware file reader that respects cancellation signals. This can help prevent resource leaks and improve responsiveness when handling large files or slow I/O operations. pkg/sql/plan/function/func_llm.go [162-172] -r, err := ReadFromFile(moUrl, fs) ctx, cancel := context.WithTimeout(context.Background(), 30time.Second) defer cancel() r, err := ReadFromFileWithContext(ctx, moUrl, fs) if err != nil { return err } defer r.Close() -ctx, err := io.ReadAll(r) content, err := io.ReadAll(r) if err != nil { return err } Apply this suggestion* Suggestion importance[1-10]: 7 Why: The suggestion to use context-aware file reading can enhance resource management and responsiveness, especially for large files or slow I/O operations. However, it requires significant changes to the existing code, including introducing a timeout and modifying the ReadFromFile function to accept a context.	7
	Expand test cases for the embedding function to cover more scenarios Consider adding more diverse test cases with different input types and lengths to thoroughly test the `llm_embedding` function. test/distributed/cases/function/func_llm_embedding.sql [1-2] -- @skip:issue#18664 SELECT llm_embedding("This is text... bla "); SELECT llm_embedding("A longer piece of text to test embedding generation with more context."); SELECT llm_embedding(""); -- Test empty string SELECT llm_embedding(NULL); -- Test NULL input Apply this suggestion Suggestion importance[1-10]: 7 Why: Adding more diverse test cases for the `llm_embedding` function is a good suggestion as it would help ensure the function handles various input scenarios, including edge cases like empty strings and NULL values, thereby improving test coverage and reliability.	7
	Implement a more comprehensive error handling strategy to collect and report multiple errors instead of failing on the first error Consider implementing a more robust error handling strategy for the LLMExtractText function. Instead of returning early on the first error, collect all errors and return them together, possibly using a custom error type or a slice of errors. pkg/sql/plan/function/func_llm.go [202-253] var errors []error for i := uint64(0); i < rowCount; i { inputBytes, nullInput := input.GetStrValue(i) if nullInput { if err := rs.AppendMustNullForBytesResult(); err != nil { - return err errors = append(errors, fmt.Errorf("row %d: %w", i, err)) } continue } // ... (similar error checking for other inputs) success, err := extractTextFromPdfAndWriteToFile(moUrl, outputPathUrl, proc) if err != nil { - return err errors = append(errors, fmt.Errorf("row %d: %w", i, err)) continue } // ... (result appending) } if len(errors) > 0 { return fmt.Errorf("multiple errors occurred: %v", errors) } Apply this suggestion Suggestion importance[1-10]: 6 Why: Collecting and reporting multiple errors can improve the robustness of the LLMExtractText function by providing more comprehensive error feedback. However, it adds complexity to error handling and may not be necessary for all use cases.	6
	Add error handling to the text extraction queries to improve robustness Consider adding error handling or validation checks to ensure the PDF files exist and are accessible before attempting to extract text. test/distributed/cases/function/func_llm_extract_file.sql [1-2] -select llm_extract_text(cast('file://$resources/llm_test/extract_text/MODocs1.pdf?offset=0&size=4' as datalink), cast('file://$resources/llm_test/extract_text/MODocs1.txt' as datalink), "pdf"); -select llm_extract_text(cast('file://$resources/llm_test/extract_text/example.pdf?offset=0&size=4' as datalink), cast('file://$resources/llm_test/extract_text/example.txt' as datalink), "pdf"); SELECT CASE WHEN llm_extract_text(cast('file://$resources/llm_test/extract_text/MODocs1.pdf?offset=0&size=4' as datalink), cast('file://$resources/llm_test/extract_text/MODocs1.txt' as datalink), "pdf") THEN 'Success' ELSE 'Failure' END AS extraction_result; SELECT CASE WHEN llm_extract_text(cast('file://$resources/llm_test/extract_text/example.pdf?offset=0&size=4' as datalink), cast('file://$resources/llm_test/extract_text/example.txt' as datalink), "pdf") THEN 'Success' ELSE 'Failure' END AS extraction_result; Apply this suggestion Suggestion importance[1-10]: 4 Why: The suggestion to add error handling for file existence and accessibility is valid and could improve robustness. However, the proposed SQL code does not actually check for file existence or accessibility, it merely wraps the function call in a CASE statement, which doesn't address the underlying issue.	4

💡 Need additional feedback ? start a PR chat

charleschile and others added 16 commits September 1, 2024 16:34

LLM_EMBEDDING implementation

9dd0b19

EmbeddingService interface implementation

06e34e5

Merge branch 'main' into ospp/LLM_EMBEDDING

b847296

Merge branch 'matrixorigin:main' into ospp/LLM_EMBEDDING

ddf8008

fix hardcode about Ollama

311707b

Merge branch 'matrixorigin:main' into ospp/LLM_EMBEDDING

df68435

Merge branch 'matrixorigin:main' into ospp/LLM_EMBEDDING

ebc6454

Merge branch 'main' into ospp/LLM_EMBEDDING

eeb4cc9

add BVT tests

6da3eb3

fix bug

b5b07d4

Merge branch 'matrixorigin:main' into ospp/LLM_EMBEDDING

7b39c0b

Merge branch 'main' into ospp/LLM_EMBEDDING

b5edef3

Merge branch 'main' into ospp/LLM_EMBEDDING

ed8190c

LLM_CHUNK implementation (matrixorigin#18471)

77e7294

Merge branch 'LLMIndex' into ospp/LLM_EMBEDDING

15a0c9c

charleschile requested review from m-schen, daviszhen, qingxinhome, aressu1985, heni02 and zhangxu19830126 as code owners October 31, 2024 15:01

charleschile had a problem deploying to ci October 31, 2024 15:01 — with GitHub Actions Failure

charleschile temporarily deployed to ci October 31, 2024 15:01 — with GitHub Actions Inactive

charleschile had a problem deploying to ci October 31, 2024 15:01 — with GitHub Actions Failure

charleschile temporarily deployed to ci October 31, 2024 15:01 — with GitHub Actions Inactive

charleschile had a problem deploying to ci October 31, 2024 15:01 — with GitHub Actions Failure

charleschile temporarily deployed to ci October 31, 2024 15:01 — with GitHub Actions Inactive

matrix-meow added the size/XL Denotes a PR that changes [1000, 1999] lines label Oct 31, 2024

qodo-merge-pro bot added the Review effort [1-5]: 4 label Oct 31, 2024

mergify bot added the kind/feature label Oct 31, 2024

daviszhen approved these changes Nov 1, 2024

View reviewed changes

aressu1985 approved these changes Nov 6, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ospp/new llm embedding #19727

Ospp/new llm embedding #19727

charleschile commented Oct 31, 2024 •

edited by qodo-merge-pro bot

Loading

mergify bot commented Oct 31, 2024

qodo-merge-pro bot commented Oct 31, 2024

qodo-merge-pro bot commented Oct 31, 2024

Ospp/new llm embedding #19727

Are you sure you want to change the base?

Ospp/new llm embedding #19727

Conversation

charleschile commented Oct 31, 2024 • edited by qodo-merge-pro bot Loading

User description

What type of PR is this?

Which issue(s) this PR fixes:

What this PR does / why we need it:

PR Type

Description

Changes walkthrough 📝

mergify bot commented Oct 31, 2024

qodo-merge-pro bot commented Oct 31, 2024

PR Reviewer Guide 🔍

qodo-merge-pro bot commented Oct 31, 2024

PR Code Suggestions ✨

charleschile commented Oct 31, 2024 •

edited by qodo-merge-pro bot

Loading