No HTTP required

Embed LLMs in Any Language

Most LLM tools force you to run a separate server and talk to it over HTTP. With cognisoc, you can embed the model directly in your application — same process, zero network overhead. Or run a server when you need one. Your choice.

Embedded vs Server Mode

Recommended for most use cases

Embedded (In-Process)

The model loads inside your application. You call it like any library function — model.generate("Hello"). No HTTP, no sockets, no serialization, no separate process to manage.

  • + No per-request network overhead
  • + No separate process to manage
  • + Works offline / no network stack
  • + Simpler deployment (one binary)
  • - Model tied to one process
  • - Model memory counts against your app's process

Provided by: mullama bindings, llamafu, unillm, zigllm

For shared / multi-tenant setups

Server (HTTP API)

Run mullama serve and call it from any language via OpenAI-compatible endpoints. The model runs in a separate process and serves multiple clients concurrently.

  • + Share one model across services
  • + Works with any HTTP client
  • + Swap models without redeploying
  • + OpenAI SDK compatible
  • - ~1-5ms per-request overhead
  • - Extra process to manage

Provided by: mullama serve (OpenAI + Anthropic API compatible)

Rule of thumb: If your application is the only consumer of the model, embed it. If multiple applications or users need the same model, run a server. If you're on mobile or embedded hardware, embed — there's no server to run.
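The rule of thumb can be made quantitative with a back-of-envelope sketch in Python. The ~3ms per-request figure is the midpoint of the server-mode overhead listed above; the ~4s one-time model load is an illustrative assumption for a small GGUF model, not a measurement:

```python
# Back-of-envelope overhead comparison: embedded pays a one-time
# model load, server mode pays HTTP overhead on every request.
# Both numbers are illustrative, not benchmarks.

MODEL_LOAD_S = 4.0       # one-time in-process load (assumed ~3-5s)
HTTP_OVERHEAD_S = 0.003  # per-request server overhead (~1-5ms above)

def total_overhead_s(mode: str, n_requests: int) -> float:
    """Cumulative non-inference overhead for one long-lived process."""
    if mode == "embedded":
        return MODEL_LOAD_S              # paid once, then free calls
    return n_requests * HTTP_OVERHEAD_S  # paid on every request

# Embedding wins once a process serves enough requests to amortize
# the load: roughly 4s / 3ms per request.
break_even = int(MODEL_LOAD_S / HTTP_OVERHEAD_S)
```

Under these assumptions a process breaks even after on the order of a thousand requests, which is why short-lived workers favor server mode and long-lived services favor embedding.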

Language Guide


Python

Data science, RAG pipelines, backend services, Jupyter notebooks, batch processing

mullama
pip install mullama
Embedded (In-Process)
from mullama import Model, Context

model = Model.load('llama3.2-1b.gguf', n_gpu_layers=32)
ctx = Context(model, n_ctx=4096)

# Direct function call — no HTTP, no server
response = ctx.generate('Explain embeddings in one paragraph')
print(response)

# Streaming
for token in ctx.stream('Write a haiku about Rust:'):
    print(token, end='', flush=True)
Server (HTTP API)
from openai import OpenAI

# Talk to a running mullama server
client = OpenAI(base_url='http://localhost:8080/v1', api_key='unused')
response = client.chat.completions.create(
    model='llama3.2:1b',
    messages=[{'role': 'user', 'content': 'Hello'}],
)

Most Python ML workflows should use embedded mode. Use server mode only when sharing a model across multiple services.
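Because both modes reduce to text-in/text-out, you can keep the choice reversible behind one small interface. A minimal sketch — the class names are hypothetical; the embedded side assumes the Context API shown above, the server side the OpenAI client:

```python
from typing import Protocol

class TextGenerator(Protocol):
    def generate(self, prompt: str) -> str: ...

class EmbeddedLLM:
    """In-process backend: wraps a mullama Context (API as above)."""
    def __init__(self, ctx):
        self._ctx = ctx

    def generate(self, prompt: str) -> str:
        return self._ctx.generate(prompt)

class ServerLLM:
    """HTTP backend: wraps an OpenAI-compatible client pointed
    at a running mullama server."""
    def __init__(self, client, model: str):
        self._client = client
        self._model = model

    def generate(self, prompt: str) -> str:
        resp = self._client.chat.completions.create(
            model=self._model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
```

Application code depends only on TextGenerator, so moving between embedded and server mode becomes a one-line change at startup.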


Rust

Systems programming, high-throughput servers, CLI tools, inference infrastructure

mullama / unillm
cargo add mullama
Embedded (In-Process)
use mullama::{Model, Context, ContextParams};

let model = Model::load("llama3.2-1b.gguf")?;
let mut ctx = Context::new(&model, ContextParams {
    n_ctx: 4096,
    n_gpu_layers: 32,
    ..Default::default()
})?;

let response = ctx.generate("What is GGUF?", 256)?;
println!("{}", response);
Lower-Level (unillm)
// For full runtime control, use unillm directly
use unillm::{Model, ModelInputs};

// unillm powers mullama under the hood — use it
// when you need custom scheduling, KV cache tuning,
// or access to 47 architecture implementations
let model = Model::load("llama3.2-1b.gguf")?;
let output = model.generate("Hello", Default::default())?;

Use mullama for application-level embedding. Use unillm when building inference infrastructure or when you need runtime control.


Dart / Flutter

Mobile apps (iOS/Android), offline-first experiences, on-device privacy, edge AI

llamafu
flutter pub add llamafu
Embedded (In-Process)
import 'package:llamafu/llamafu.dart';

final llm = await Llamafu.init(
  modelPath: '/data/models/llama3.2-1b-q4_k_m.gguf',
  threads: 4,         // Match device core count
  contextSize: 2048,  // Keep small on mobile (RAM)
);

// On-device inference — works offline, no server
final result = await llm.complete(
  prompt: 'Summarize this document:',
  maxTokens: 256,
  temperature: 0.7,
);

// Vision / multimodal
final vision = await llm.multimodalComplete(
  prompt: 'What is in this photo?',
  mediaInputs: [MediaInput(type: MediaType.image, data: imgPath)],
);

llm.close(); // Free device memory

On mobile there is no server — the model runs on the device or it doesn't run. Use Q4_K_M quantization for best quality/speed tradeoff.
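The contextSize advice is easy to sanity-check with arithmetic. A rough sketch of the on-device memory budget in Python — every number here (parameter count, layer and head counts, ~4.5 bits per weight for Q4_K_M) is an illustrative assumption, not a llamafu internal:

```python
# Rough mobile memory budget: quantized weights + fp16 KV cache.
# All architecture numbers below are illustrative assumptions
# for a Llama-3.2-1B-class model.

PARAMS = 1.24e9        # ~1.24B parameters
BITS_PER_WEIGHT = 4.5  # Q4_K_M averages roughly 4.5 bits/weight
N_LAYERS = 16
N_KV_HEADS = 8
HEAD_DIM = 64
KV_BYTES = 2           # fp16 K and V entries

def model_mb() -> float:
    """Quantized weight file size, in MiB."""
    return PARAMS * BITS_PER_WEIGHT / 8 / 2**20

def kv_cache_mb(n_ctx: int) -> float:
    """KV cache: K and V tensors per layer, one vector per token
    per KV head, growing linearly with context length."""
    return 2 * N_LAYERS * n_ctx * N_KV_HEADS * HEAD_DIM * KV_BYTES / 2**20
```

Under these assumptions the weights alone are several hundred MiB and the KV cache doubles every time you double contextSize, which is why small contexts are recommended on phones.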


PHP

WordPress plugins, Laravel apps, web backends, content generation

mullama
composer require skelf-research/mullama
Embedded (In-Process)
use Mullama\Model;
use Mullama\Context;

$model = Model::load('llama3.2-1b.gguf', ['gpu_layers' => 32]);
$ctx = new Context($model, ['n_ctx' => 4096]);

// Direct inference — no HTTP, no external process
$response = $ctx->generate('Write a SQL query to find duplicates');
echo $response;
Server (HTTP API)
// For short-lived FPM workers, use server mode instead
// (model loading is ~3-5s, too slow per-request)

$client = OpenAI::factory()
    ->withBaseUri('http://localhost:8080/v1')
    ->make();

$response = $client->chat()->create([
    'model' => 'llama3.2:1b',
    'messages' => [['role' => 'user', 'content' => 'Hello']],
]);

PHP is one of the most underserved languages for LLM tooling. If your process lives long enough to amortize model loading (~3-5s), embed. For short-lived FPM requests, use server mode.


Go

Microservices, CLI tools, API gateways, DevOps tooling

mullama
go get github.com/skelf-research/mullama-go
Embedded (In-Process)
import (
    "fmt"
    "log"

    mullama "github.com/skelf-research/mullama-go"
)

func main() {
    model, err := mullama.LoadModel("llama3.2-1b.gguf",
        mullama.WithGPULayers(32),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer model.Close()

    ctx, _ := model.NewContext(mullama.ContextConfig{
        ContextSize: 4096,
    })

    result, _ := ctx.Generate("What is a goroutine?")
    fmt.Println(result)
}
Server (HTTP API)
import (
    "context"

    openai "github.com/sashabaranov/go-openai"
)

config := openai.DefaultConfig("unused")
config.BaseURL = "http://localhost:8080/v1"
client := openai.NewClientWithConfig(config)
resp, _ := client.CreateChatCompletion(context.Background(),
    openai.ChatCompletionRequest{
        Model: "llama3.2:1b",
        Messages: []openai.ChatCompletionMessage{
            {Role: openai.ChatMessageRoleUser, Content: "Hello"},
        },
    },
)

Go's fast startup makes it excellent for CLI tools with embedded models. For long-running services, embed if single-tenant, server if multi-tenant.


Node.js

Full-stack apps, Electron desktop apps, serverless edge functions, real-time chat

mullama
npm install mullama
Embedded (In-Process)
// ESM — top-level await requires a module, not CommonJS
import { Model, Context } from 'mullama';

// Load once at startup
const model = await Model.load('llama3.2-1b.gguf', {
  gpuLayers: 32,
});
const ctx = new Context(model, { contextSize: 4096 });

// Async, non-blocking
const response = await ctx.generate('Explain WebSockets');
console.log(response);

// Streaming
const stream = ctx.stream('Write a poem about JavaScript:');
for await (const token of stream) {
  process.stdout.write(token);
}
Server (HTTP API)
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8080/v1',
  apiKey: 'unused',
});

const response = await client.chat.completions.create({
  model: 'llama3.2:1b',
  messages: [{ role: 'user', content: 'Hello' }],
});

Embedded mode is ideal for Electron apps — ship the model with your app, no server needed, works offline.


C / C++

Embedded systems, game engines, native apps, IoT devices, bare-metal appliances

mullama / cllm
# Link against libmullama
Embedded (In-Process)
#include <stdio.h>
#include <mullama.h>

mullama_model *model = mullama_load_model("llama3.2-1b.gguf", NULL);
mullama_context *ctx = mullama_new_context(model, NULL);

char output[4096];
mullama_generate(ctx, "Hello, embedded world!", output, sizeof(output));
printf("%s\n", output);

mullama_free_context(ctx);
mullama_free_model(model);

// For the most extreme case — running on bare metal
// with no OS at all — see cllm

For maximum control with zero OS overhead, cllm boots directly into an LLM inference server on bare metal.


Zig

Learning ML internals, SIMD research, systems programming, custom inference engines

zigllm
git clone https://github.com/cognisoc/zigllm.git
Embedded (In-Process)
const std = @import("std");
const zigllm = @import("zigllm");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();

    const model = try zigllm.Model.load(allocator, "model.gguf");
    defer model.deinit();

    const output = try model.generate("Hello from Zig!", .{
        .max_tokens = 256,
        .temperature = 0.7,
    });
    std.debug.print("{s}\n", .{output});
}

zigllm is educational — it teaches you how every layer of inference works, from tensors to text generation. 285+ tests serve as executable documentation.

Decision Matrix

Not sure which mode or tool to use? Find your scenario.

Scenario                          Mode      Tool
Python data pipeline              Embedded  mullama
FastAPI serving multiple models   Server    mullama serve
Flutter mobile app                Embedded  llamafu
PHP WordPress plugin              Either    mullama
Rust CLI tool                     Embedded  mullama
Rust inference server             Embedded  unillm
Go microservice                   Embedded  mullama
Electron desktop app              Embedded  mullama (Node)
Shared team GPU server            Server    mullama serve
IoT / embedded system             Embedded  mullama (C)
Bare-metal appliance              Embedded  cllm
Learning ML internals             Embedded  zigllm
Any language, quick prototype     Server    mullama serve

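Whichever server-mode row you land on, the wire format is the same. A quick-prototype sketch using only the Python standard library — it assumes a mullama server already listening on localhost:8080, per the examples above:

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "llama3.2:1b") -> dict:
    """Minimal OpenAI-style chat-completion body, as understood
    by the mullama server's /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str, base: str = "http://localhost:8080/v1") -> str:
    """POST the payload and return the assistant's reply."""
    req = urllib.request.Request(
        base + "/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

No SDK required: any language that can send JSON over HTTP can talk to the server, which is what makes the quick-prototype row language-agnostic.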
Coming Soon

Open Hardware for LLM Inference

We're not stopping at software. Cognisoc is exploring open hardware reference designs purpose-built for local LLM inference — open schematics, open firmware, designed to run cognisoc software from boot.

📟

Inference Accelerator Boards

Single-board designs with NPUs and RISC-V cores. Run cllm directly on bare metal — no OS, no overhead. Designed for edge deployment where every watt and millisecond counts.

🧩

FPGA Accelerator Capes

Reconfigurable hardware for custom quantization formats, novel attention mechanisms, and research workloads. Flash new inference kernels without respinning silicon.

🗄️

GPU Cluster Blueprints

Rack-mount configurations with optimized networking for distributed inference using unillm's RPC backend. Open BOM, open thermal design, open orchestration.

Why Open Hardware?

The software is ready. We have the runtime (unillm), the server (mullama), the mobile stack (llamafu), and the unikernel (cllm). The missing piece is hardware designed to run this stack natively — not general-purpose servers with inference bolted on.

Vertical integration matters. When you control both the software and the hardware reference design, you can optimize in ways that generic platforms can't: custom memory layouts for KV caches, tuned PCIe topologies for multi-GPU inference, firmware-level model loading.

Open means auditable. For enterprise and government deployments, proprietary hardware is a black box. Open schematics and open firmware mean you can verify what's running — down to the gate level.

Accessible by design. Reference designs lower the barrier for hardware manufacturers worldwide. Any fabricator can produce inference boards that work with the cognisoc stack out of the box — no licensing, no vendor lock-in.

If you're building hardware for AI inference, working on RISC-V or FPGA platforms, or interested in co-developing open reference designs — let's talk.

Ready to embed?

Pick your language, choose embedded or server mode, and start running LLMs locally in minutes.