The new NIM Serverless API by HF and Nvidia is a great option if you want a reliable API for open-weight LLMs like Llama-3.1-405B that are too expensive to run on your own hardware.
- It's pay-as-you-go, so it doesn't have rate limits like the standard HF Serverless API and you don't need to commit to hardware like for a dedicated endpoint. - It works out-of-the box with the new v0.25 release of our huggingface_hub.InferenceClient - It's specifically tailored to a small collection of popular open-weight models. For a broader selection of open models, we recommend using the standard HF Serverless API. - Note that you need a token from an Enterprise Hub organization to use it.
Why would you fine-tune a model if you can just prompt an LLM? The new paper "What is the Role of Small Models in the LLM Era: A Survey" provides a nice pro/con overview. My go-to approach combines both:
1. Start testing an idea by prompting an LLM/VLM behind an API. It's fast and easy and I avoid wasting time on tuning a model on a task that might not make it into production anyways.
2. The LLM/VLM then needs to be manually validated. Anyone seriously considering putting AI into production has to do at least some manual validation. Setting up a good validation pipeline with a tool like Argilla is crucial and it can be reused for any future experiments. Note: you can use LLM-as-a-judge to automate some evals, but you always also need to validate the judge!
3. Based on this validation I can then (a) either just continue using the prompted LLM if it is accurate enough and it makes sense financially given my load; or (b) if the LLM is not accurate enough or too expensive to run in the long-run, I reuse the existing validation pipeline to annotate some additional data for fine-tuning a smaller model. This can be sped up by reusing & correcting synthetic data from the LLM (or just pure distillation).