SuperAnnotate’s Post

17,932 followers

2mo

We released a new open-source model to detect AI-written text with performance rivaling popular closed-source alternatives. It is available on HuggingFace and GitHub for you to try out now!

💡 Why We Built It: Our work with numerous companies crafting high-performing #LLM fine-tuning datasets highlighted the need for a robust AI text detection tool, as lower-quality AI-generated text can degrade fine-tuning dataset quality. Existing solutions fell short of our requirements, so we built our own.

🔍 What is it? We’ve fine-tuned a RoBERTa Large model on a dataset comprising 20,000 LLM-generated and human-written text samples. We focused on achieving a high-quality calibration of the model to get reasonable confidence estimates and good overall performance. We are happy to say that our model achieves a similar level of accuracy as closed-source alternatives while being open to all!

Model Access: Access the model and model weights on our Hugging Face page: https://lnkd.in/ebSnpubx

Model Serving: Our GitHub repository contains the code to run inference and deploy your HTTP LLM content detection service. It also includes a step-by-step tutorial on integrating this tool with the SuperAnnotate platform to prepare high-quality data for your LLM training: https://lnkd.in/euNXsf5R

#SuperAnnotate #GeneratedTextDetection #AI #NLP #FineTuning

2 Comments

Peter Ngure

Hello, do you have any job opportunities remote?

Patrick Umolu

Skilled Researcher

1mo

Very promising!

More Relevant Posts

Patrick Vientos

Data/AI Scientist | Software Engineer/Developer | eDiscovery Expert
8mo
NLP (Natural Language Processing) is a critical piece of technology for eDiscovery. As data grows exponentially (especially unstructured data), being able to process large documents without truncating them at token limits will be something to consider when building data-processing applications and their deep learning inference components (embedding documents with neural networks). Good tech to read. #ediscovery #legaltech #datascience
Kalyan KS
8mo

JINA EMBEDDINGS 2 - Open Source Text Embeddings for Long Documents 1️⃣ Text embedding models are powerful tools for representing text as fixed-sized vectors. 2️⃣ Most existing open-source models, especially those built on architectures like BERT, struggle to represent lengthy documents. 3️⃣ Jina Embeddings v2, an open-source text embedding model addresses this issue. 4️⃣ Jina Embeddings v2 is capable of encoding long documents of up to 8192 tokens. 5️⃣ Jina Embeddings v2 not only achieves state-of-the-art performance on MTEB benchmark. 6️⃣ Jina Embeddings v2 matches the performance of OpenAI’s proprietary text-embedding-ada-002 model. ➡️ Jina Embeddings v2 (base model) link: https://lnkd.in/gYv_Xnhq ➡️ Jina Embeddings v2 (small model) link: https://lnkd.in/gSektkWa ✔️ For complete details, refer the paper (paper link in the comments) #nlproc #nlp #deeplearning #datascience #ai #generativeai #embeddings
Alexander Golubev

Machine Learning Engineer @ Constructor.io
1mo Edited
🔎 During the final stages of model fine-tuning, human feedback is often used in various forms. Conventionally, in algorithms like DPO, we need pairs (y1, y2) where one response is explicitly stated as better than the other. To build such a dataset, we can use crowdsourcing platforms or other models that act as critics (e.g., Reward Models, LLM as a judge) to evaluate the proposed pairs. Both approaches have their pitfalls, such as position bias, where the model tends to prefer the first answer. However, the model-critic solution allows for the collection of large amounts of data. In early May, an interesting paper, Prometheus-2 (model dataset, https://lnkd.in/e9V2YCE8) with an Apache 2.0 license, was released. ✅ The authors trained two independent models for pointwise (evaluating a single answer from 1 to 10) and pairwise (choosing a winner from a pair) tasks, not just by giving a number but by generating feedback as an explanation. This technique improves the overall quality, as the model generates a chain of reasoning that it can refer to before giving the final answer, essentially implementing the Chain of Thought methodology. ✅ The two models are then combined by merging their weights: w = a * w_pointwise + (1 − a) * w_pairwise. The authors tried other methods, but the simplest one gave the best results. You can read more about model merging in the post on Hugging Face (https://lnkd.in/eJzpkahp). The result has a high correlation with human estimates and the GPT-4/Claude3 models. I tried the model for pair estimation in a new Kaggle competition on Reward Model training for the LMSYS chatbot arena, and the initial results look pretty nice. However, in its original format, this model cannot be used due to the time constraints for inference. #llm #transformers #reward_model #finetuning #deeplearning #nlp
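The weight merging described in this post, read as the weighted average w = a * w_pointwise + (1 − a) * w_pairwise, can be sketched roughly as follows. This is a minimal illustration in plain Python, not the paper's code: the parameter names, the toy values, and the `merge_weights` helper are all hypothetical, and real checkpoints would hold tensors rather than lists.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def merge_weights(w_pointwise: Weights, w_pairwise: Weights, a: float) -> Weights:
    """Linear merge: w = a * w_pointwise + (1 - a) * w_pairwise, per parameter."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must be in [0, 1]")
    if w_pointwise.keys() != w_pairwise.keys():
        raise ValueError("both models must share the same parameter names")
    merged: Weights = {}
    for name, p in w_pointwise.items():
        q = w_pairwise[name]
        if len(p) != len(q):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [a * x + (1.0 - a) * y for x, y in zip(p, q)]
    return merged


# Toy "checkpoints" (hypothetical names and values).
pointwise = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
pairwise = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

merged = merge_weights(pointwise, pairwise, a=0.5)
print(merged)
```

With a = 1 the merge returns the pointwise weights unchanged, and with a = 0 the pairwise ones, so a controls how much each evaluator contributes.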
Zakari Salifu

Machine Learning || Deep Learning || AI || Accounting || Kaggle Grandmaster.
1mo Edited
🚀 𝐌𝐮𝐥𝐭𝐢-𝐋𝐚𝐛𝐞𝐥 𝐁𝐨𝐨𝐤 𝐂𝐥𝐚𝐬𝐬𝐢𝐟𝐢𝐜𝐚𝐭𝐢𝐨𝐧 𝐌𝐨𝐝𝐞𝐥! I'm excited to share my latest project where I've built a robust multi-label classification model to categorize books based on their details. Using a comprehensive dataset from https://www.wonderbk.com/ , this project leverages advanced techniques including BERT embeddings and Convolutional Networks for feature extraction. 𝐊𝐞𝐲 𝐇𝐢𝐠𝐡𝐥𝐢𝐠𝐡𝐭𝐬: - The dataset contains details of 103,063 books with rich attributes. - 𝘉𝘌𝘙𝘛 𝘦𝘮𝘣𝘦𝘥𝘥𝘪𝘯𝘨𝘴 were used for the embedding layer, ensuring high-quality text representation. - Convolutional Networks were employed for feature extraction, enhancing the model's ability to understand complex patterns. 📊 𝐏𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞 𝐈𝐧𝐬𝐢𝐠𝐡𝐭𝐬: The attached image showcases the distribution of the Jaccard Similarity scores between the model's predictions and the actual categories of the books, highlighting the model's efficacy in multi-label classification tasks. 🔗 Check out the full project and code on Kaggle: https://lnkd.in/ex5uFY9f I look forward to any feedback or discussions on how this model can be further improved or applied in real-world scenarios! #MachineLearning #DataScience #DeepLearning #NLP #BERT #ConvolutionalNetworks #BookClassification #AI #Kaggle #DataAnalysis
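For readers unfamiliar with the metric mentioned above, Jaccard similarity for a multi-label prediction is the size of the intersection of predicted and actual label sets divided by the size of their union. A minimal sketch, assuming labels are plain strings; the category names are made up for illustration, and treating two empty label sets as a perfect match is an assumption, not something stated in the post.

```python
from typing import Iterable, List, Sequence


def jaccard(predicted: Iterable[str], actual: Iterable[str]) -> float:
    """Jaccard similarity |P ∩ A| / |P ∪ A| between two label sets."""
    p, t = set(predicted), set(actual)
    union = p | t
    if not union:
        return 1.0  # assumption: two empty label sets count as a perfect match
    return len(p & t) / len(union)


def jaccard_scores(
    predictions: Sequence[Iterable[str]], targets: Sequence[Iterable[str]]
) -> List[float]:
    """Per-book scores, i.e. the values whose distribution one would plot."""
    return [jaccard(p, t) for p, t in zip(predictions, targets)]


# Hypothetical categories for two books.
preds = [["Fiction", "Romance"], ["Science"]]
truth = [["Fiction", "History"], ["Science"]]

scores = jaccard_scores(preds, truth)
print(scores)
```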
Sachin Gupta

Doctoral Researcher (Emerging Technologies, Gen AI) |AI Product Mentor| Generative AI Specialist | AI Architect| Growth Strategy expert|M.Tech in DataScience from BITS
8mo
Open-source embeddings for long documents. Strong research in sentence embeddings: Jina Embeddings, with an 8k sequence length. Earlier we used BERT sentence embeddings, but the embedding paradigm is changing. The good part is that performance is on par with OpenAI's text-embedding-ada-002.
Utshav Paudel

Data science | Machine learning and AI practitioner
5mo
🚀Day 180 of #300daysofdata 🧠Today I dove deep into implementing every step, from loading the dataset, tokenizer, and model and preparing the training arguments, to fine-tuning a BERT model for named entity recognition (NER), and uploaded the fine-tuned model to Hugging Face. 💡I also implemented dynamic padding using a collator function and designed custom evaluation functions for NER that do not take padded values into account. 💡I spent some time researching different approaches we can implement to improve our RAG systems, and learned about creating parent and child splits for better retrieval that avoids missing data, and about using Cohere reranking on our retrieved results. 💡Below is the code implementation of NER using a BERT model. 🍃github : https://lnkd.in/gm387VJ5 🌐Ner Recognition : https://lnkd.in/dXbuG9JK #nlp #deeplearnign #ai #llm #rag #pipeline #huggingface #data #dailylearning
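As an illustration only (this is not the author's linked code), dynamic padding and padding-aware evaluation can be sketched in plain Python as follows. The token ids and the `collate` and `token_accuracy` helpers are hypothetical; the label value -100 follows the common convention of PyTorch's default `ignore_index` for cross-entropy loss.

```python
from typing import Dict, List

PAD_TOKEN_ID = 0
IGNORE_LABEL = -100  # matches PyTorch's default ignore_index for cross-entropy


def collate(batch: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
    """Pad every example only up to the longest sequence in this batch."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    out: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in batch:
        n = len(ex["input_ids"])
        pad = max_len - n
        out["input_ids"].append(ex["input_ids"] + [PAD_TOKEN_ID] * pad)
        out["attention_mask"].append([1] * n + [0] * pad)
        out["labels"].append(ex["labels"] + [IGNORE_LABEL] * pad)
    return out


def token_accuracy(predictions: List[List[int]], labels: List[List[int]]) -> float:
    """Accuracy over real tokens only; padded positions are skipped."""
    correct = total = 0
    for pred_row, label_row in zip(predictions, labels):
        for pred, label in zip(pred_row, label_row):
            if label == IGNORE_LABEL:
                continue
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0


# Toy batch with made-up token ids and tag ids.
batch = collate([
    {"input_ids": [5, 7, 8, 6], "labels": [0, 1, 2, 0]},
    {"input_ids": [5, 9, 6], "labels": [0, 3, 0]},
])
predictions = [[0, 1, 2, 0], [0, 3, 1, 5]]  # last value sits on a padded slot
accuracy = token_accuracy(predictions, batch["labels"])
print(batch["attention_mask"], accuracy)
```

Padding per batch rather than to a global maximum keeps short batches small, and skipping ignored labels keeps padded positions from inflating or deflating the score.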
Hajar Mousannif

AI Evangelist and Strategist | Senior Lead at Katanemo | Full Professor at UCA | 36K on LinkedIn
1mo
At the heart of Retrieval-Augmented Generation (RAG) lies the concept of function calling. But what exactly is it, why is it crucial, and how can developers implement it effectively? Let’s dive in!

What is Function Calling? Function calling is a mechanism by which a function or method is invoked to perform a specific task within a program. In the context of RAG, it involves leveraging functions to retrieve relevant information from a large corpus and then using that information to generate coherent and contextually accurate responses.

Why is Function Calling Important in RAG? Function calling allows for precise retrieval of relevant data, which is then used to generate more accurate and contextually appropriate responses.

How to Implement Function Calling in RAG?
1. Start by defining functions that handle specific tasks, such as data retrieval, preprocessing, and response generation. Ensure these functions have clear input and output parameters.
2. Implement retrieval mechanisms using techniques like dense vector retrieval or traditional keyword-based search. Tools like Elasticsearch or FAISS can be invaluable here.
3. Integrate the retrieved information with generative models like GPT or BERT to produce final responses. This often involves fine-tuning the generative model on a dataset augmented with retrieved data.
4. Continuously monitor and optimize the performance of your function calls to ensure they operate efficiently, especially as your data scales.

Some resources for developers:
- A comprehensive guide to function calling: https://lnkd.in/ePz-Y573
- A walkthrough on how to fine-tune gpt-3.5-turbo with function calls with LlamaIndex: https://lnkd.in/eBChVatv
- OpenAI function calling with Elasticsearch: https://lnkd.in/es379g4f
- Function calling with local models & Langchain - Ollama, llama3 & Phi-3: https://lnkd.in/eQb4jSX6

Katanemo’s intelligent gateway enables function calling, governance & guardrails, routing, as well as evaluation and monitoring. Get in touch with us for a demo! #AI #MachineLearning #RAG #FunctionCalling #Developers #TechInnovation #NLP #ArtificialIntelligence
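The steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not any vendor's API: the corpus, the `search_documents` function, the function registry, and the JSON call format are all hypothetical, the keyword-overlap scoring stands in for Elasticsearch or FAISS, and the final generation step is left as a prompt string rather than a real model call.

```python
import json
from typing import Callable, Dict, List

# Toy corpus standing in for a real document store.
CORPUS = [
    "FAISS is a library for dense vector similarity search.",
    "Elasticsearch supports keyword-based full-text search.",
    "BERT is an encoder model trained with masked language modeling.",
]


def search_documents(query: str, top_k: int = 2) -> List[str]:
    """Step 1-2: a retrieval function with clear inputs and outputs."""
    terms = set(query.lower().split())
    scored = [
        (len(terms & set(doc.lower().replace(".", "").split())), doc)
        for doc in CORPUS
    ]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:top_k] if score > 0]


# Registry of callable functions the model is allowed to request.
FUNCTIONS: Dict[str, Callable[..., List[str]]] = {"search_documents": search_documents}


def dispatch(call_json: str) -> List[str]:
    """Execute a function call expressed as JSON: {"name": ..., "arguments": {...}}."""
    call = json.loads(call_json)
    return FUNCTIONS[call["name"]](**call["arguments"])


def build_prompt(question: str, passages: List[str]) -> str:
    """Step 3: hand the retrieved passages to a generative model as context."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


question = "keyword search with Elasticsearch"
# In a real system the model would emit this call; here it is written by hand.
call = json.dumps({"name": "search_documents", "arguments": {"query": question, "top_k": 1}})
passages = dispatch(call)
prompt = build_prompt(question, passages)
print(prompt)
```

Keeping the functions in an explicit registry means only named, reviewed functions can be invoked, which also gives a natural place to add the monitoring mentioned in step 4.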
4 Comments
Kalyan KS
6mo
🚀 How Code Empowers LLMs to Serve as Intelligent Agents (Survey) ✅ The survey paper discusses the integration of code into large language models (LLMs) and its impact. ☑️ It highlights that modern LLMs are not only larger but also trained on a mix of natural language and code. ✅ This integration of code brings several advantages: - Enhancing LLMs in code generation. - Unlocking LLMs' reasoning ability for more complex natural language tasks. - Encouraging structured and precise intermediate steps. - Utilizing code compilation and execution environments for model improvement. ✅ The paper also discusses how these capabilities have turned LLMs into intelligent agents in scenarios where understanding instructions, goal decomposition, planning, execution, and feedback refinement are essential. ☑️ Finally, it outlines key challenges and future directions for empowering LLMs with code. 📢 Survey Paper: https://lnkd.in/gGpNG7cf -------------------- For the latest Generative AI updates, join the Generative AI and LLM LinkedIn group: https://lnkd.in/gAXxagu3 #llms #generativeai #datascience #nlproc #deeplearning #nlp #ai
1 Comment
AI topics

769 followers
8mo
ai.plainenglish.io: BERT, or Bidirectional Encoder Representations from Transformers, requires specific considerations when preparing datasets for effective functioning. This includes tokenizing the text data using BERT's tokenizer, using special tokens such as [SEP] and [CLS], and adding attention masks to differentiate content from padding. The dataset can be prepared by creating conversation pairs for Next Sentence Prediction (NSP) and by tokenizing the data using the BertWordPieceTokenizer from the Hugging Face tokenizers library. Preprocessing involves masking words and creating a custom PyTorch Dataset class called BERTDataset. The data is then ready for pre-training a BERT model in PyTorch. - Artificial Intelligence topics! #ai #artificialintelligence #intelligenzaartificiale
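To make the preparation steps concrete, here is a simplified sketch in plain Python; it is not the article's code. It assumes whitespace tokenization instead of WordPiece, and the `build_pair` and `mask_tokens` helpers are hypothetical. The masking always substitutes [MASK], whereas the original BERT recipe also sometimes keeps the token or swaps in a random one.

```python
import random
from typing import Dict, List, Optional, Tuple

CLS, SEP, PAD, MASK = "[CLS]", "[SEP]", "[PAD]", "[MASK]"
SPECIAL = {CLS, SEP, PAD}


def build_pair(sentence_a: str, sentence_b: str, max_len: int) -> Dict[str, list]:
    """Build one sentence pair: [CLS] A [SEP] B [SEP], padded to max_len."""
    a, b = sentence_a.split(), sentence_b.split()  # toy whitespace tokenizer
    tokens = [CLS] + a + [SEP] + b + [SEP]
    if len(tokens) > max_len:
        raise ValueError("pair is longer than max_len")
    n_real = len(tokens)
    pad = max_len - n_real
    return {
        "tokens": tokens + [PAD] * pad,
        # Segment 0 covers [CLS] A [SEP]; segment 1 covers B [SEP].
        "segment_ids": [0] * (len(a) + 2) + [1] * (len(b) + 1) + [0] * pad,
        # 1 marks real content, 0 marks padding.
        "attention_mask": [1] * n_real + [0] * pad,
    }


def mask_tokens(
    tokens: List[str], rng: random.Random, mask_prob: float = 0.15
) -> Tuple[List[str], List[Optional[str]]]:
    """Replace some non-special tokens with [MASK]; labels keep the originals."""
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if tok not in SPECIAL and rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


pair = build_pair("how are you", "i am fine", max_len=12)
masked, labels = mask_tokens(pair["tokens"], random.Random(0))
print(pair["tokens"])
print(masked)
```

In a PyTorch setup, a Dataset class would typically call functions like these in `__getitem__` and convert the lists to tensors.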

A Step-by-Step Guide to Preparing Datasets for BERT implementation with PyTorch (Part 1)

ai.plainenglish.io
HARSSHAD PAWAR

Banking & Fintech | Project Delivery | Powering Innovation with AI
5mo Edited
🚀 Text Preprocessing: It's Not Just About Removing 😮 Emojis 🤔

Building powerful LLMs isn't about brute force data. The secret sauce? Text preprocessing! 🪄 It cleans & refines raw text, boosting model performance, training speed, and unlocking incredible applications like sentiment analysis & machine translation.

Here's a sneak peek at 10 essential steps: (GitHub Link : https://shorturl.at/quEFS)
1. ✂️ Tokenization: Break text into manageable chunks.
2. Lowercase: One language, please!
3. Bye emojis, you distract!
4. Remove punctuation for consistency.
5. Code & URLs? Not today!
6. Out with filler words ("the," "a").
7. Standardize slang & abbreviations.
8. 🪄 Group similar words for better understanding.
9. Catch those typos! Accuracy matters.
10. Trim unnecessary spaces for efficiency.

Remember, the specific approach varies! ✨ Experiment and find your perfect recipe for LLM success. Thanks to Sunny Savita and iNeuron.ai team for making it easier to get into the field of Generative AI. #NLP #MachineLearning #TextProcessing #LLMs #DataScience #AI #genrativeai
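A few of the steps listed above (lowercasing, dropping URLs, emojis, punctuation, filler words, and extra whitespace) can be sketched with the Python standard library alone. This is an illustrative subset, not the linked repository's code: the stopword list is a tiny made-up sample, the emoji ranges cover only common blocks, and steps such as spelling correction or slang normalization would need dedicated tools.

```python
import re
import string
from typing import List

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "is", "and", "of", "to"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Covers common emoji and symbol blocks only, not every emoji.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text: str) -> List[str]:
    """Lowercase, strip URLs/emojis/punctuation, tokenize, drop stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)      # remove URLs before punctuation is stripped
    text = EMOJI_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = text.split()             # split() also collapses extra whitespace
    return [tok for tok in tokens if tok not in STOPWORDS]


sample = "The model is GREAT 🚀 Check https://example.com now!!"
tokens = preprocess(sample)
print(tokens)
```

The order matters: URLs are removed before punctuation so that a stripped link does not leave fragments such as "httpsexamplecom" behind.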