Kalyan KS’ Post

9mo

JINA EMBEDDINGS 2 - Open Source Text Embeddings for Long Documents
1️⃣ Text embedding models are powerful tools for representing text as fixed-sized vectors.
2️⃣ Most existing open-source models, especially those built on architectures like BERT, struggle to represent lengthy documents.
3️⃣ Jina Embeddings v2, an open-source text embedding model, addresses this issue.
4️⃣ Jina Embeddings v2 is capable of encoding long documents of up to 8192 tokens.
5️⃣ Jina Embeddings v2 achieves state-of-the-art performance on the MTEB benchmark.
6️⃣ Jina Embeddings v2 also matches the performance of OpenAI’s proprietary text-embedding-ada-002 model.
➡️ Jina Embeddings v2 (base model) link: https://lnkd.in/gYv_Xnhq
➡️ Jina Embeddings v2 (small model) link: https://lnkd.in/gSektkWa
✔️ For complete details, refer to the paper (paper link in the comments)
#nlproc #nlp #deeplearning #datascience #ai #generativeai #embeddings
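A rough way to see what the 8192-token window means in practice is to count how many pieces a long document must be split into before it can be embedded. The sketch below is only arithmetic: the 20,000-token document is hypothetical, the 512-token figure is an assumption standing in for the limit typical of BERT-style models, and `chunks_needed` is an illustrative helper, not part of any Jina API.

```python
import math


def chunks_needed(doc_tokens: int, max_tokens: int) -> int:
    """Number of non-overlapping chunks needed to cover a document."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return math.ceil(doc_tokens / max_tokens)


DOC_TOKENS = 20_000  # hypothetical document length, in tokens

# Assumed 512-token limit (typical of BERT-style models).
short_ctx = chunks_needed(DOC_TOKENS, 512)
# 8192-token limit quoted in the post for Jina Embeddings v2.
long_ctx = chunks_needed(DOC_TOKENS, 8192)

print(short_ctx, long_ctx)
```

Fewer chunks means fewer vectors to store and less context lost at chunk boundaries.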

4 Comments

Kalyan KS

9mo

Paper link: https://arxiv.org/pdf/2310.19923.pdf

1 Reaction

Shriman Narayan

Generative Ai engineer, NLP engineer, LLMs, ChatBots, LangChain engineer

9mo

Kalyan KS, is there any guide to implementing this with LangChain?

Meenakshi A.

Technologist & Believer in Systems for People and People for Systems

9mo

Thanks for the good 😊

1 Reaction


More Relevant Posts

Sachin Gupta

Doctoral Generative AI Researcher |AI leader |Hands on Guy |AI Product Mentor| Generative AI Specialist | AI Architect| Growth Strategy expert|M.Tech in DataScience from BITS
9mo
Open-source embeddings for long documents. Strong research in sentence embeddings: Jina Embeddings, with an 8k sequence length. Earlier we used BERT sentence embeddings, but the embedding paradigm is changing. The good part is that its performance matches OpenAI's text-embedding-ada-002.
Patrick Vientos

Data/AI Scientist | Software Engineer/Developer | eDiscovery Expert
9mo
NLP (Natural Language Processing) is a critical piece of technology for eDiscovery. As data grows exponentially (especially unstructured types), being able to process large documents without truncating them at a token limit is something to consider when building data-processing applications and their deep learning inference components (embedding documents for and with neural networks). Good tech to read. #ediscovery #legaltech #datascience
Muhammad Ehsan

Founder @Indollar | Data Scientist | AI | GenAI | Machine Learning | Deep Learning | NLP | LLMs | AGI | Quantum AI | 15M Views
3mo
If the last time you thought about your LLM selection and use cases was six months ago, it is time to take another look.
Back in October, the only option for GPT-4 level responses was to pay $30 per million input tokens and $60 per million output tokens. Now, six months later, we have access to a much cheaper GPT-4o model ($5/$15) as well as a 50% discount for a batch option.
Don't underestimate the batch option. It returns responses within 24 hours (in practice, normally within a few hours), and there are many use cases where that is perfectly fine, e.g., analysing large amounts of unstructured data. Have a look at it here: https://lnkd.in/ebSEpGiW
Needless to say, this must unlock quite a few use cases that were not quite viable before.
------
Like this post? Follow Muhammad Ehsan, press "like", and hit the 🔔 on my profile and/or share with your network.
#llms #ai #gpt4o #languagemodels #machinelearning #naturallanguageprocessing #nlp #artificialintelligence #usecases #datascience #deeplearning #innovation #technology #batchprocessing #datamining #textanalysis #automation #bigdata #analytics #efficiency
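To make the price gap concrete, here is a minimal sketch of the arithmetic using the per-million-token prices quoted in the post. The workload (10M input and 2M output tokens) and the `cost_usd` helper are hypothetical, and the sketch assumes the 50% batch discount applies equally to input and output tokens.

```python
def cost_usd(input_tokens, output_tokens, in_price, out_price, discount=0.0):
    """Total cost in USD, given prices per million tokens and a fractional discount."""
    full = input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price
    return full * (1.0 - discount)


# Hypothetical workload: 10M input tokens, 2M output tokens.
IN_TOK, OUT_TOK = 10_000_000, 2_000_000

gpt4 = cost_usd(IN_TOK, OUT_TOK, 30, 60)                      # $30 / $60
gpt4o = cost_usd(IN_TOK, OUT_TOK, 5, 15)                      # $5 / $15
gpt4o_batch = cost_usd(IN_TOK, OUT_TOK, 5, 15, discount=0.5)  # assumed 50% off both

print(gpt4, gpt4o, gpt4o_batch)
```

For this workload the bill drops from $420 to $80, and to $40 with the batch option.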
Alexander Golubev

Machine Learning Engineer @ Constructor.io
2mo Edited
🔎 During the final stages of model fine-tuning, human feedback is often used in various forms. Conventionally, in algorithms like DPO, we need pairs (y1, y2) where one response is explicitly stated as better than the other. To build such a dataset, we can use crowdsourcing platforms or other models that act as critics (e.g., Reward Models, LLM as a judge) to evaluate the proposed pairs. Both approaches have their pitfalls, such as position bias, where the model tends to prefer the first answer. However, the model-critic solution allows for the collection of large amounts of data.
In early May, an interesting paper, Prometheus-2 (model dataset, https://lnkd.in/e9V2YCE8) with an Apache 2.0 license, was released.
✅ The authors trained two independent models for pointwise (evaluating a single answer from 1 to 10) and pairwise (choosing a winner from a pair) tasks, not just by giving a number but by generating feedback as an explanation. This technique improves the overall quality, as the model generates a chain of reasoning that it can refer to before giving the final answer, essentially implementing the Chain of Thought methodology.
✅ The two models are then combined by merging their weights: w = a * w_pointwise + (1−a) * w_pairwise. The authors tried other methods, but the simplest one gave the best results. You can read more about model merging in the post on Hugging Face (https://lnkd.in/eJzpkahp).
The result has a high correlation with human estimates and the GPT-4/Claude3 models. I tried the model for pair estimation in a new Kaggle competition on Reward Model training for the LMSYS chatbot arena, and the initial results look pretty nice. However, in its original format, this model cannot be used due to the time constraints for inference.
#llm #transformers #reward_model #finetuning #deeplearning #nlp
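The linear merge described in the post, w = a * w_pointwise + (1−a) * w_pairwise, can be sketched in a few lines. This is an illustration only, not the authors' code: weights are modelled as plain dicts of float lists, and `merge_weights` and the parameter name `layer.weight` are hypothetical. With real checkpoints the same formula would be applied tensor by tensor.

```python
def merge_weights(w_pointwise, w_pairwise, a):
    """Linear interpolation of two weight sets: a * w_pointwise + (1 - a) * w_pairwise."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must be in [0, 1]")
    if w_pointwise.keys() != w_pairwise.keys():
        raise ValueError("both models must have the same parameter names")
    merged = {}
    for name in w_pointwise:
        p, q = w_pointwise[name], w_pairwise[name]
        if len(p) != len(q):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [a * x + (1.0 - a) * y for x, y in zip(p, q)]
    return merged


# Toy example with a single hypothetical parameter.
pointwise = {"layer.weight": [1.0, 2.0]}
pairwise = {"layer.weight": [3.0, 4.0]}
merged = merge_weights(pointwise, pairwise, a=0.5)
print(merged)
```

Setting a = 1 recovers the pointwise model and a = 0 the pairwise one, so a controls how much each task contributes to the merged evaluator.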
SuperAnnotate

19,001 followers
3mo
We released a new open-source model to detect AI-written text with performance rivaling popular closed-source alternatives. It is available on HuggingFace and GitHub for you to try out now! 💡 Why We Built It: Our work with numerous companies crafting high-performing #LLM fine-tuning datasets highlighted the need for a robust AI text detection tool, as lower-quality AI-generated text can degrade fine-tuning dataset quality. Existing solutions fell short of our requirements, so we built our own. 🔍 What is it? We’ve fine-tuned a RoBERTa Large model on a dataset comprising 20,000 LLM-generated and human-written text samples. We focused on achieving a high-quality calibration of the model to get reasonable confidence estimates and good overall performance. We are happy to say that our model achieves a similar level of accuracy as closed-source alternatives while being open to all! Model Access: Access the model and model weights on our Hugging Face page: https://lnkd.in/ebSnpubx Model Serving: Our GitHub repository contains the code to run inference and deploy your HTTP LLM content detection service. It also includes a step-by-step tutorial on integrating this tool with the SuperAnnotate platform to prepare high-quality data for your LLM training: https://lnkd.in/euNXsf5R #SuperAnnotate #GeneratedTextDetection #AI #NLP #FineTuning
David Talby

Putting artificial intelligence to work
7mo
The #NoCode #NLP #Lab 5.7 is out! Now with support for auto-training Relation Extraction models. This includes features to bootstrap models with pre-annotations from models (transfer learning), rules, or prompts (train a small model from a #GPT model), plus the ability to share, publish, search, and reuse such models with your team in the included Private Models Hub. Full release notes: https://lnkd.in/gAwcMipc Get started: https://lnkd.in/geJ-Tj-W #ai #datascience #nlproc #textmining #deeplearning #transferlearning #llm #llms #dataannotation #modelhub
Nirmal Patel

AI | Education | Smart Paper | Playpower Labs | Tech for Good
3mo
"Let's think dot by dot" - A new paper shows that when LLMs are trained to produce dot tokens (e.g., ".....") as a way of doing chain-of-thought (CoT) 'thinking', they perform on par with or better than CoT methods for certain tasks. This is fascinating, and the authors note that the dots may prompt some internal 'thinking' in the LLM that we cannot see, but it produces more correct results than the baseline. Paper and twitter thread link in the comment. #ai #research #paper #news #cot #chainofthought #llm #llms #nlp #ml #machinelearning #modeling #models #data #bigdata #analytics #datascience
Deepak Chawla

Building Production-Ready Gen AI Solutions
11mo
Prompt Engineering: Retrieval Augmented Generation (RAG)
1. Read from the PDF (Clarett user manual PDF) and split it into chunks with a chunk_size of 1000 tokens.
2. Create a vector embedding of each chunk. We will be using the OpenAIEmbeddings library to create the vector embeddings.
3. Store the vector embeddings locally. We will be using ChromaDB as our simple VectorDB. We could use Pinecone or any other more highly available, production-grade VectorDB instead.
4. The user issues a prompt with the query/question.
5. This triggers a search and retrieval against the VectorDB to get more contextual data.
6. This contextual data will now be used along with the prompt.
7. The prompt is augmented by the context. This is typically referred to as context enrichment.
8. The prompt, along with the query/question and this enhanced context, is now passed to the LLM.
9. The LLM now responds based on this context.
#PromptEngineering #RAG #RetrievalAugmentedGeneration #NLP #AI #LanguageModels #Innovation #TechInsights
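The retrieval steps above can be sketched end to end without any external service. Everything here is a stand-in chosen for illustration: a word-count vector replaces OpenAIEmbeddings, an in-memory list replaces ChromaDB, chunks are measured in words rather than tokens, the manual text is invented, and the final LLM call is left out. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_words(text, chunk_size):
    """Step 1: split text into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text):
    """Step 2 (toy stand-in for an embedding model): bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def build_store(text, chunk_size):
    """Step 3 (toy stand-in for a vector DB): list of (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunk_words(text, chunk_size)]


def retrieve(store, query, k=1):
    """Step 5: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def augment_prompt(query, context_chunks):
    """Steps 6-7: enrich the user's question with the retrieved context."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Invented two-sentence "manual" for the demo.
manual = ("To reset the device hold the power button for ten seconds. "
          "The warranty covers hardware defects for two years.")
store = build_store(manual, chunk_size=11)

question = "How long is the warranty?"
prompt = augment_prompt(question, retrieve(store, question))
print(prompt)  # Step 8 would send this prompt to the LLM.
```

Swapping in a real embedding model and vector store changes `embed` and `build_store`, while the retrieve-then-augment flow stays the same.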
Mohamed Farag

Machine Learning Engineer
8mo
I'm happy to share our latest breakthrough: a Machine Learning application in production powered by FastAPI and Docker 🐳, featuring the Vision-and-Language Transformer (ViLT) model fine-tuned on VQAv2! We worked through the intricacies of deployment, and FastAPI and Docker played crucial roles in ensuring a smooth, scalable, and reliable transition of our ViLT-powered application into the production environment.
Key Features:
- FastAPI Integration: Seamlessly integrated into a FastAPI framework for robust and efficient API deployment.
- Dockerization: Deployed and scaled effortlessly using Docker, ensuring consistent performance across different environments.
- VQAv2 Fine-tuning: Fine-tuned on the VQAv2 dataset for enhanced accuracy in understanding visual content and responding to questions.
Link to the repository: https://lnkd.in/dUH3QAJD
#AI #MachineLearning #FastAPI #Docker #ViLTModel #ComputerVision #NLP #Innovation #deployment #production

