Edit TensorRT-LLM Docs #2426

Merged 14 commits on Mar 19, 2024
Binary file removed docs/docs/guides/providers/image.png
192 changes: 157 additions & 35 deletions docs/docs/guides/providers/tensorrt-llm.md
@@ -15,72 +15,197 @@ slug: /guides/providers/tensorrt-llm
<meta name="twitter:description" content="Learn how to install Jan's official TensorRT-LLM Extension, which offers 20-40% faster token speeds on Nvidia GPUs. Understand the requirements, installation steps, and troubleshooting tips."/>
</head>

Users with Nvidia GPUs can get **20-40% faster token speeds** on their laptops or desktops by using [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) instead of the default LlamaCPP engine. Since TensorRT-LLM models run in FP16, they can also be more accurate than quantized models.

:::info

TensorRT-LLM support was launched in v0.4.9 and should be regarded as an experimental feature.

- Only Windows is supported for now.
- Please report bugs in our Discord's [#tensorrt-llm](https://discord.com/channels/1107178041848909847/1201832734704795688) channel.

:::

Jan supports [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) as an alternate inference engine for users with Nvidia GPUs. TensorRT-LLM allows for blazing fast inference, but requires GPUs with [larger VRAM](https://nvidia.github.io/TensorRT-LLM/memory.html).

## What is TensorRT-LLM?

[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) is a hardware-optimized LLM inference engine that compiles models to run extremely fast on Nvidia GPUs.
- Mainly used on Nvidia's datacenter-grade GPUs like the H100s [to produce 10,000 tok/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html).
- Can also be used on Nvidia workstation GPUs (e.g. [A6000](https://www.nvidia.com/en-us/design-visualization/rtx-6000/)) and consumer-grade GPUs (e.g. [RTX 4090](https://www.nvidia.com/en-us/geforce/graphics-cards/40-series/rtx-4090/))

This guide walks you through how to install Jan's official [TensorRT-LLM Extension](https://github.com/janhq/nitro-tensorrt-llm). This extension uses [Nitro-TensorRT-LLM](https://github.com/janhq/nitro-tensorrt-llm) as the AI engine, instead of the default [Nitro-Llama-CPP](https://github.com/janhq/nitro). It includes an efficient C++ server to natively execute the [TRT-LLM C++ runtime](https://nvidia.github.io/TensorRT-LLM/gpt_runtime.html), and comes with additional features and performance improvements like OpenAI compatibility, tokenizer improvements, and queues.

:::tip[Benefits]

- Our performance testing shows 20-40% faster token/s speeds on consumer-grade GPUs
- On datacenter-grade GPUs, TensorRT-LLM can go up to 10,000 tokens/s
- TensorRT-LLM is a relatively new library that was [released in Sept 2023](https://github.com/NVIDIA/TensorRT-LLM/graphs/contributors). We anticipate performance and resource utilization improvements in the future.

:::

:::warning[Caveats]

- This feature is currently only available for Windows users. Linux support is coming soon.
- TensorRT-LLM requires models to be compiled into GPU- and OS-specific "Model Engines" (vs. GGUF's "convert once, run anywhere" approach)
- TensorRT-LLM Model Engines tend to use larger amounts of VRAM and RAM in exchange for performance
- This usually means only people with top-of-the-line Nvidia GPUs can use TensorRT-LLM
- We have only prebuilt a few demo models. You can always build your desired models directly on your machine. [Read here](#build-your-own-tensorrt-models).

:::


## Requirements

### Hardware

- Windows PC
- Nvidia GPU(s): Ada or Ampere series (i.e. RTX 4000s & 3000s). More will be supported soon.
- 3GB of disk space to download TRT-LLM artifacts and a Nitro binary

**Compatible GPUs**

| Architecture | Supported? | Consumer-grade | Workstation-grade |
| ------------ | --- | -------------- | ----------------- |
| Ada | ✅ | 4050 and above | RTX A2000 Ada |
| Ampere | ✅ | 3050 and above | A100 |
| Turing | ❌ | Not Supported | Not Supported |

:::info

Please ping us in Discord's [#tensorrt-llm](https://discord.com/channels/1107178041848909847/1201832734704795688) channel if you would like Turing support.

:::

### Software

- Jan v0.4.9 or Jan v0.4.8-321 (nightly)
- Nvidia Driver v535 ([installation guide](https://jan.ai/guides/common-error/not-using-gpu/#1-ensure-gpu-mode-requirements))
- CUDA Toolkit v12.2 ([installation guide](https://jan.ai/guides/common-error/not-using-gpu/#1-ensure-gpu-mode-requirements))
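If you're unsure which driver or CUDA Toolkit version is installed, you can check from a terminal. A minimal sketch (output formats vary between driver versions):

```sh
# Shows the installed Nvidia driver version and the highest CUDA version it supports
nvidia-smi

# Shows the installed CUDA Toolkit compiler version (requires the CUDA Toolkit)
nvcc --version
```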

## Getting Started

### Install TensorRT-Extension

1. Go to Settings > Extensions
2. Click Install next to the TensorRT-LLM Extension

:::info
You can check if files have been correctly downloaded:

```sh
ls ~\jan\extensions\@janhq\tensorrt-llm-extension\dist\bin
# Your Extension Folder should now include `nitro.exe`, among other `.dll` files needed to run TRT-LLM
```
:::

### Download a TensorRT-LLM Model

TensorRT-LLM can only run models in `TensorRT` format. These models, aka "TensorRT Engines", are prebuilt specifically for each target OS and GPU architecture.

After installing the extension, restart the application and go to the Hub. Jan's Hub has a few pre-compiled TensorRT-LLM models, marked with a `TensorRT-LLM` label, that you can download. This step might take some time. 🙏

- We automatically download the TensorRT-LLM Model Engine for your GPU architecture
- We have made a few 1.1B models available that can run even on laptop GPUs with 8GB of VRAM


| Model | OS | Ada (40XX) | Ampere (30XX) | Description |
| ------------------- | ------- | ---------- | ------------- | --------------------------------------------------- |
| Llamacorn 1.1b | Windows | ✅ | ✅ | TinyLlama-1.1b, fine-tuned for usability |
| TinyJensen 1.1b | Windows | ✅ | ✅ | TinyLlama-1.1b, fine-tuned on Jensen Huang speeches |
| Mistral Instruct 7b | Windows | ✅ | ✅ | Mistral |

### Importing Pre-built Models

You can import a pre-built model by creating a new folder in Jan's `/models` directory that includes the following (see the sketch below):

- TensorRT-LLM Engine files (e.g. the `.engine` file, tokenizer files, etc.)
- A `model.json` that registers these files and specifies `nitro-tensorrt-llm` as the `engine`
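For example, after importing the sample model below, the folder might look like this. The `~\jan\models` path and the file names are illustrative: they assume the default Jan data folder and the sample `model.json` that follows.

```sh
# List an imported model folder (assumes the default ~\jan data folder and the sample model below)
ls ~\jan\models\tinyjensen-1.1b-chat-fp16
# Expected contents:
#   model.json                        <- registers the files below; "engine" must be "nitro-tensorrt-llm"
#   config.json
#   mistral_float16_tp1_rank0.engine  <- the compiled TensorRT-LLM engine
#   tokenizer.model
#   tokenizer.json
#   tokenizer_config.json
#   special_tokens_map.json
#   model.cache
```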

:::note[Sample model.json]

Note that `engine` is set to `nitro-tensorrt-llm`: the model won't load without it!

```js
{
  "sources": [
    {
      "filename": "config.json",
      "url": "https://delta.jan.ai/dist/models/<gpuarch>/<os>/tensorrt-llm-v0.7.1/TinyJensen-1.1B-Chat-fp16/config.json"
    },
    {
      "filename": "mistral_float16_tp1_rank0.engine",
      "url": "https://delta.jan.ai/dist/models/<gpuarch>/<os>/tensorrt-llm-v0.7.1/TinyJensen-1.1B-Chat-fp16/mistral_float16_tp1_rank0.engine"
    },
    {
      "filename": "tokenizer.model",
      "url": "https://delta.jan.ai/dist/models/<gpuarch>/<os>/tensorrt-llm-v0.7.1/TinyJensen-1.1B-Chat-fp16/tokenizer.model"
    },
    {
      "filename": "special_tokens_map.json",
      "url": "https://delta.jan.ai/dist/models/<gpuarch>/<os>/tensorrt-llm-v0.7.1/TinyJensen-1.1B-Chat-fp16/special_tokens_map.json"
    },
    {
      "filename": "tokenizer.json",
      "url": "https://delta.jan.ai/dist/models/<gpuarch>/<os>/tensorrt-llm-v0.7.1/TinyJensen-1.1B-Chat-fp16/tokenizer.json"
    },
    {
      "filename": "tokenizer_config.json",
      "url": "https://delta.jan.ai/dist/models/<gpuarch>/<os>/tensorrt-llm-v0.7.1/TinyJensen-1.1B-Chat-fp16/tokenizer_config.json"
    },
    {
      "filename": "model.cache",
      "url": "https://delta.jan.ai/dist/models/<gpuarch>/<os>/tensorrt-llm-v0.7.1/TinyJensen-1.1B-Chat-fp16/model.cache"
    }
  ],
  "id": "tinyjensen-1.1b-chat-fp16",
  "object": "model",
  "name": "TinyJensen 1.1B Chat FP16",
  "version": "1.0",
  "description": "Do you want to chat with Jensen Huang? Here you are",
  "format": "TensorRT-LLM",
  "settings": {
    "ctx_len": 2048,
    "text_model": false
  },
  "parameters": {
    "max_tokens": 4096
  },
  "metadata": {
    "author": "LLama",
    "tags": [
      "TensorRT-LLM",
      "1B",
      "Finetuned"
    ],
    "size": 2151000000
  },
  "engine": "nitro-tensorrt-llm"
}
```

:::

### Using a TensorRT-LLM Model

Select a downloaded TensorRT-LLM model from Jan's Thread interface, click Use, and start chatting.

- Jan will automatically start the TensorRT-LLM model engine in the background
- You may encounter a pop-up from Windows Security asking whether to allow Nitro access to public and private networks
:::info[Why does Nitro need network access?]

- This is because Jan runs TensorRT-LLM using the [Nitro Server](https://github.com/janhq/nitro-tensorrt-llm/)
- Jan makes network calls to the Nitro server running on your computer on a separate port

:::
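If you want to confirm the engine is running, you can query the local server directly. The port and route below are assumptions (Nitro usually listens on a local port such as 3928 and exposes an OpenAI-compatible API); check Jan's logs for the actual address.

```sh
# Hypothetical request to the local Nitro server; port 3928 and the /v1/chat/completions
# route are assumptions - verify them in Jan's app logs before relying on this.
curl http://127.0.0.1:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tinyjensen-1.1b-chat-fp16",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```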

:::warning
If you are on our nightly builds, you may have to reinstall the TensorRT-LLM extension each time you update the app. We're working on better extension lifecycles - stay tuned.
:::

### Configure Settings

You can customize the default parameters for how Jan runs TensorRT-LLM.

:::note
coming soon
:::

## Extension Details

Jan's TensorRT-LLM Extension is built on top of the open source [Nitro TensorRT-LLM Server](https://github.com/janhq/nitro-tensorrt-llm), a C++ inference server on top of TensorRT-LLM that provides an OpenAI-compatible API. For now, the model versions are pinned to the extension versions.

### Manual Build

To manually build the artifacts needed to run the server and TensorRT-LLM, you can reference the source code. [Read here](https://github.com/janhq/nitro-tensorrt-llm?tab=readme-ov-file#quickstart).
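As a rough starting point, the steps might look like the following; the `--recursive` flag is an assumption (the repository pulls in TensorRT-LLM sources as submodules), so follow the linked quickstart for the authoritative instructions.

```sh
# Clone the Nitro TensorRT-LLM server sources; --recursive is an assumption for pulling submodules
git clone --recursive https://github.com/janhq/nitro-tensorrt-llm
cd nitro-tensorrt-llm
# From here, follow the build steps in the repository's quickstart
```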

### Uninstall Extension

@@ -89,11 +214,8 @@ For now, the model versions are pinned to the extension versions.
3. Delete the entire Extensions folder.
4. Reopen the app; only the default extensions should be restored.
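On a default installation, the Extensions folder lives under Jan's data directory. A sketch, assuming the default `~\jan` location referenced earlier (close Jan before deleting):

```sh
# Remove Jan's extensions folder (assumes the default ~\jan data folder)
rm -r ~\jan\extensions
```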


## Build your own TensorRT models

:::info
coming soon
:::
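Until that guide lands, the upstream [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) repository documents how engines are compiled. A very rough sketch, assuming the v0.7.x `examples/llama/build.py` workflow referenced by the model URLs above; script names and flags may differ between releases.

```sh
# Hypothetical engine build using the upstream TensorRT-LLM v0.7.x examples.
# Paths, script location, and flags are assumptions - check the TensorRT-LLM repo for your version.
git clone -b v0.7.1 https://github.com/NVIDIA/TensorRT-LLM
cd TensorRT-LLM/examples/llama
python build.py \
  --model_dir /path/to/TinyLlama-1.1B-Chat \
  --dtype float16 \
  --output_dir ./tinyllama_engine
```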
8 changes: 0 additions & 8 deletions docs/docs/integrations/tensorrt.md

This file was deleted.

4 changes: 4 additions & 0 deletions docs/docusaurus.config.js
@@ -117,6 +117,10 @@ const config = {
from: '/guides/using-extensions/',
to: '/guides/extensions/',
},
{
from: '/integrations/tensorrt',
to: '/guides/providers/tensorrt-llm'
},
],
},
],