Flexible Batch of Immediate and Deferred Requests

RA work @ S-Lab at NTU

Motivation

BatchAPI: OpenAI's Batch API allows you to send asynchronous groups of requests at a 50% lower cost, with access to a dedicated pool offering significantly higher rate limits and guaranteed response times within 24 hours. https://platform.openai.com/docs/guides/batch

Request workloads today are diverse. Immediate (real-time) requests are common in chatbots, conversation, text generation, and code generation. Deferred requests (e.g. BatchAPI requests) process multiple requests at once, typically in a delayed or asynchronous manner. This suits use cases that can tolerate slower response times or need high-throughput processing, such as text-to-image, image-to-text, and video-to-text.

Multi-tenant LM inference lacks efficient methods to balance immediate and deferred requests. We therefore aim to redesign the request batching strategy, write a new CUDA kernel, and design scheduling policies that accelerate inference on mixed requests across different LLMs and VLMs.
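To make the balancing problem concrete, here is a minimal sketch of one possible baseline policy: fill each batch with immediate requests first, then use leftover slots for deferred ones. This is an illustration only, not the strategy implemented in this repository; `Request`, `build_batch`, and `max_batch_size` are hypothetical names introduced for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: int
    deferred: bool   # True for BatchAPI-style requests, False for immediate ones
    arrival: float   # arrival time in seconds


def build_batch(immediate, deferred, max_batch_size):
    """Hypothetical baseline: immediate requests first, deferred fill the rest."""
    batch = []
    while immediate and len(batch) < max_batch_size:
        batch.append(immediate.popleft())
    while deferred and len(batch) < max_batch_size:
        batch.append(deferred.popleft())
    return batch


# Three immediate requests and five deferred ones waiting in separate queues.
immediate = deque(Request(i, False, 0.1 * i) for i in range(3))
deferred = deque(Request(100 + i, True, 0.0) for i in range(5))

first = build_batch(immediate, deferred, max_batch_size=4)
print([r.req_id for r in first])  # [0, 1, 2, 100]
```

A strict-priority policy like this keeps immediate latency low but can starve deferred requests under sustained load, which is the kind of trade-off a dedicated scheduling policy has to address.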

Experiments

The following metrics are measured separately under different request-rate distributions. The Uniform distribution influences system performance the most. At much higher request rates, however, the choice of distribution matters less than before.

Requests with different distributions are generated following the paper Vexless: A Serverless Vector Data Management System Using Cloud Functions (https://dl.acm.org/doi/10.1145/3654990).

Request distributions: mixes of outer and inner distributions, drawn from the Uniform, Inverse Gaussian, Poisson, Gaussian, and Zipfian distributions.
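As a rough illustration of how arrival traces with different distributions can be produced, the sketch below draws inter-arrival gaps with a mean of about `1 / rate` seconds and accumulates them into timestamps. It is not the generator used in this repository or in the Vexless paper; the function names, the spread of each distribution, and the Zipfian rank range are all assumptions made for the example. The inverse Gaussian case is left out because the Python standard library has no sampler for it.

```python
import random
from itertools import accumulate


def interarrival_gaps(dist, rate, n, seed=0):
    """Return n non-negative gaps (seconds) whose mean is roughly 1 / rate."""
    rng = random.Random(seed)
    mean = 1.0 / rate
    if dist == "poisson":
        # Poisson arrivals have exponentially distributed gaps.
        return [rng.expovariate(rate) for _ in range(n)]
    if dist == "uniform":
        return [rng.uniform(0.0, 2.0 * mean) for _ in range(n)]
    if dist == "gaussian":
        # Assumed spread of mean / 4; clamp at zero so time never runs backwards.
        return [max(0.0, rng.gauss(mean, mean / 4.0)) for _ in range(n)]
    if dist == "zipfian":
        # Gap = rank * unit with P(rank) proportional to 1 / rank:
        # mostly short gaps, occasionally long ones.
        ranks = range(1, 101)
        weights = [1.0 / k for k in ranks]
        expected_rank = sum(k * w for k, w in zip(ranks, weights)) / sum(weights)
        unit = mean / expected_rank
        return [k * unit for k in rng.choices(ranks, weights=weights, k=n)]
    raise ValueError(f"unknown distribution: {dist}")


def arrival_times(dist, rate, n, seed=0):
    """Cumulative arrival timestamps for n requests."""
    return list(accumulate(interarrival_gaps(dist, rate, n, seed)))


for name in ("uniform", "poisson", "gaussian", "zipfian"):
    times = arrival_times(name, rate=10.0, n=2000)
    print(name, round(times[-1] / len(times), 3))  # empirical mean gap
```

Because every distribution is scaled to the same mean gap, traces generated this way differ in burstiness rather than in average load.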

Low Bursty

Medium Bursty

High Bursty

Installation

Build from source

cd FlexiBatch
git submodule sync
git submodule update --init

scao0208/FlexiRequest