The DVC team is excited to release DataChain today! DataChain is an open-source Python library for processing and curating unstructured data at scale.

🤖 AI-Driven Data Curation: Use local ML models and LLM API calls to enrich your data.
🚀 GenAI Dataset Scale: Handle tens of millions of files or file snippets.
🐍 Python-friendly: Use Python objects instead of JSON to represent annotations.

DataChain enables the parallel processing of multiple data files or samples. It can chain different operations such as filtering, aggregating, and merging datasets. The resulting datasets can be saved, versioned, and extracted as files or converted to a PyTorch data loader.

DataChain can serialize Python objects (via Pydantic) to an embedded SQLite database. It efficiently deserializes Python objects or runs vectorized analytical queries in the DB without deserialization.

The typical use cases are:
◆ LLMs judging LLM dialogues (see code in image).
◆ Auto-deserializing LLM responses to Pydantic.
◆ Vectorized analytics over Python objects.
◆ Annotating cloud images with a local model.
◆ Dataset curation using AI annotations.

DataChain excels at optimizing batch operations, such as parallelizing synchronous API calls or running heavy batch processing tasks.

We believe that DataChain will serve as a solid foundation for new and upcoming unstructured data-wrangling libraries, as well as custom AI-driven curation solutions.

⭐️ Give DataChain a try for your Generative AI data management, give it a star, and as always, your feedback is welcome! Link to repo to get started in the comments!

#Generativeai #AI #computervision #LLM
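
For readers curious about the storage idea mentioned above (Python objects saved to an embedded SQLite database, with analytical queries that run in the DB without deserialization), here is a rough sketch using only the Python standard library. This is not DataChain's API: the `Rating` class and the `save`, `mean_score_by_label`, and `load` helpers are hypothetical names made up for illustration, and a plain dataclass stands in for a Pydantic model to keep the example dependency-free.

```python
import sqlite3
from dataclasses import astuple, dataclass, fields


@dataclass
class Rating:
    """Hypothetical annotation object (stand-in for a Pydantic model)."""
    file: str
    score: float
    label: str


def save(conn, rows):
    """Store each object as one row, one column per field."""
    cols = ", ".join(f.name for f in fields(Rating))
    conn.execute(f"CREATE TABLE IF NOT EXISTS ratings ({cols})")
    conn.executemany(
        "INSERT INTO ratings VALUES (?, ?, ?)",
        [astuple(r) for r in rows],
    )


def mean_score_by_label(conn):
    """Aggregate inside SQLite; no Rating objects are rebuilt."""
    query = "SELECT label, AVG(score) FROM ratings GROUP BY label"
    return dict(conn.execute(query))


def load(conn, min_score):
    """Rebuild Python objects only for the rows that pass the filter."""
    query = (
        "SELECT file, score, label FROM ratings "
        "WHERE score >= ? ORDER BY file"
    )
    return [Rating(*row) for row in conn.execute(query, (min_score,))]


conn = sqlite3.connect(":memory:")
save(conn, [
    Rating("a.txt", 1.0, "good"),
    Rating("b.txt", 0.5, "good"),
    Rating("c.txt", 0.25, "bad"),
])

print(mean_score_by_label(conn))
print(load(conn, 0.5))
```

The aggregate query never touches Python objects, while `load` deserializes only the filtered rows. That is the split the post describes, though how DataChain implements it internally is not shown here.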