
unillm

A modular LLM inference runtime written in Rust. Supports 47 model architectures through three composable abstractions: TensorCore, ModelCore, and WeightLoaderCore.

Key Features

- 🏗️ 47 Model Architectures: LLaMA, Qwen, Gemma, Phi, DeepSeek, Mistral, GPT-2, Whisper, BERT, and more.
- 📐 Three-Layer Design: TensorCore, ModelCore, WeightLoaderCore — composable abstractions for clean extensibility.
- 📦 Multi-Format Weights: load SafeTensors, GGUF, and PyTorch files through a unified interface.
- 🧠 Hybrid KV Cache: RadixAttention + PagedAttention for efficient memory management during inference.
- ⚡ Continuous Batching: continuous-batching request scheduler for production throughput.
- 🔒 Type-Safe Rust: full compile-time guarantees with the model_config! macro system.

Quick Start

git clone https://github.com/cognisoc/unillm.git
cd unillm

# Generate text (downloads TinyLlama on first run)
cargo run --bin unillm -p unillm-runtime -- generate --prompt "Explain gravity"

# Use a different model
cargo run --bin unillm -p unillm-runtime -- generate --model llama2:7b --prompt "Hello"

# List cached models
cargo run --bin unillm -p unillm-runtime -- models

Architecture

UniLLM is organized into three composable layers:

  1. TensorCore — Device-agnostic tensor operations (CPU, CUDA, Metal). All ops go through ops_fn::operation().
  2. ModelCore — Universal Model trait with forward() and generate(). Configuration via model_config! macro.
  3. WeightLoaderCore — Format-agnostic weight loading for SafeTensors, GGUF, and PyTorch files.
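The split can be sketched with three minimal traits. This is a hypothetical, simplified stand-in for the real unillm API (the trait names, signatures, and the toy "matmul" below are illustrative only; the actual crates handle devices, formats, and caching):

```rust
// Layer 1: device-agnostic tensor ops (stand-in for TensorCore).
trait TensorOps {
    fn matmul_stub(&self, a: &[f32], b: &[f32]) -> Vec<f32>;
}

struct CpuOps;
impl TensorOps for CpuOps {
    fn matmul_stub(&self, a: &[f32], b: &[f32]) -> Vec<f32> {
        // Placeholder: an elementwise product stands in for a real matmul kernel.
        a.iter().zip(b).map(|(x, y)| x * y).collect()
    }
}

// Layer 2: the universal model interface (stand-in for ModelCore's Model trait).
trait Model {
    fn forward(&self, ops: &dyn TensorOps, input: &[f32]) -> Vec<f32>;
}

// Layer 3: format-agnostic weight loading (stand-in for WeightLoaderCore).
trait WeightLoader {
    fn load(&self) -> Vec<f32>;
}

struct ToyModel { weights: Vec<f32> }
impl Model for ToyModel {
    fn forward(&self, ops: &dyn TensorOps, input: &[f32]) -> Vec<f32> {
        // The model only talks to the ops layer, never to a device directly.
        ops.matmul_stub(&self.weights, input)
    }
}

struct InMemoryLoader;
impl WeightLoader for InMemoryLoader {
    fn load(&self) -> Vec<f32> { vec![2.0, 3.0] }
}

fn main() {
    // Compose the layers: load weights, build a model, run it on some ops backend.
    let model = ToyModel { weights: InMemoryLoader.load() };
    let out = model.forward(&CpuOps, &[10.0, 10.0]);
    assert_eq!(out, vec![20.0, 30.0]);
    println!("{:?}", out);
}
```

Because each layer is a trait, a new device backend, model family, or weight format plugs in without touching the other two layers.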

Adding a Model

model_config!(MyModelConfig {
    vocab_size: usize = 32000,
    hidden_size: usize = 4096,
    num_hidden_layers: usize = 32,
});

impl Model for MyModel {
    type Config = MyModelConfig;

    fn forward(&self, inputs: &ModelInputs) -> Result<ModelOutputs> {
        // model-specific forward pass
    }
}
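To see why the macro gives compile-time guarantees, here is a hypothetical sketch of the kind of macro_rules! expansion a macro like model_config! could be built on; the real unillm macro may generate more (validation, serde support), so treat this only as an illustration of the pattern:

```rust
// Hypothetical sketch: generate a config struct with typed fields and defaults.
// Not the actual unillm macro.
macro_rules! model_config {
    ($name:ident { $($field:ident : $ty:ty = $default:expr),* $(,)? }) => {
        #[derive(Debug, Clone, PartialEq)]
        pub struct $name {
            $(pub $field: $ty),*
        }

        impl Default for $name {
            fn default() -> Self {
                Self { $($field: $default),* }
            }
        }
    };
}

model_config!(MyModelConfig {
    vocab_size: usize = 32000,
    hidden_size: usize = 4096,
    num_hidden_layers: usize = 32,
});

fn main() {
    let cfg = MyModelConfig::default();
    // Fields are plain typed struct members, so a misspelled key or a
    // wrong-typed value fails at compile time, not at model-load time.
    assert_eq!(cfg.vocab_size, 32000);
    println!("{} layers", cfg.num_hidden_layers);
}
```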

Supported Models

Category          Models
Core LLMs         LLaMA, Qwen, Gemma, Phi, DeepSeek, Mistral, Mixtral
GPT Family        GPT-2, GPT-J, GPT-NeoX, OPT, BLOOM, MPT
Code              StarCoder, CodeLlama
MoE               DeepSeek-MoE, DBRX, Grok, Arctic, Jamba
RWKV              RWKV-4, RWKV-6, RecurrentGemma
Vision-Language   Qwen2-VL, Phi-3-Vision, InternVL, CogVLM, LLaVA, CLIP
Audio / Speech    Wav2Vec2, HuBERT, MusicGen, Encodec, Whisper
Encoder           BERT, T5
Specialized       Mamba, MiniCPM, OLMo, Granite

Project Structure

crates/
  runtime/       Core inference runtime (tensor ops, model trait, weight loading)
  inference/     High-level inference engine and batching
  kv/            Hybrid KV cache (RadixAttention + PagedAttention)
  scheduler/     Request scheduling with continuous batching