
As a user, I want to run LLM inference locally without external servers, so that I can use elelem without API keys, Ollama, or network connectivity.

SYNOPSIS

Implement a local inference provider that loads and runs models directly in-process.

DESCRIPTION

Create a new provider in lib/elelem/net/ that:

  1. Loads models locally:

    • Use the approach determined by Story 001 (llama.cpp bindings or CLI)
    • Load GGUF model files from ~/.cache/elelem/models/
    • Support GPU acceleration (CUDA, ROCm) when available
    • Fall back to CPU inference when no GPU present

  2. Implements the provider interface:

    • Match the interface of existing providers (ollama.rb, openai.rb, claude.rb)
    • Support streaming responses
    • Handle the conversation history format

  3. Addresses performance concerns:

    • Model loading may take a few seconds, so show loading feedback to the user
    • Keep model loaded in memory for subsequent prompts (don’t reload per-request)
    • Handle memory limits gracefully

  4. Supports configuration:

    • Configurable via .elelem.yml, consistent with the other providers
    • Support specifying custom model path
    • Support model selection override
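The configuration in item 4 might look like the sketch below. The key names (`local`, `model`, `model_path`, `gpu`) are illustrative assumptions modeled on typical provider config, not the project's actual schema:

```yaml
# Hypothetical .elelem.yml entries for the local provider; key names
# are assumptions, not the actual elelem schema.
provider: local
local:
  model: llama-3-8b-instruct        # model selection override
  model_path: ~/models/custom.gguf  # optional custom path; takes precedence over model
  gpu: auto                         # auto | cuda | rocm | cpu
```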
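Items 1–3 could be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the class name, keyword arguments, and GPU-detection heuristics are all assumptions, and the real backend selection depends on Story 001's choice of llama.cpp bindings vs. CLI. The points it demonstrates are resolving a GGUF path under ~/.cache/elelem/models/, choosing an acceleration backend once with CPU fallback, and loading the model lazily so it stays in memory across requests.

```ruby
# Hypothetical local provider sketch; names and probes are illustrative.
class LocalProvider
  DEFAULT_MODEL_DIR = File.expand_path("~/.cache/elelem/models")

  def initialize(model: nil, model_path: nil, model_dir: DEFAULT_MODEL_DIR)
    # An explicit path wins; otherwise resolve the model name under the cache dir.
    @model_path = model_path || File.join(model_dir, "#{model}.gguf")
    @engine = nil # loaded lazily on first request, then kept in memory
  end

  # Pick an acceleration backend once and memoize the decision.
  def backend
    @backend ||=
      if cuda_available?
        :cuda
      elsif rocm_available?
        :rocm
      else
        :cpu # graceful fallback when no GPU is present
      end
  end

  private

  # Crude Linux device-node probes (assumptions); real detection depends
  # on what the llama.cpp integration from Story 001 exposes.
  def cuda_available?
    File.exist?("/dev/nvidia0")
  end

  def rocm_available?
    File.exist?("/dev/kfd")
  end
end
```

Keeping `@engine` as instance state (rather than reloading per request) is what satisfies the "don't reload per-request" point in item 3; the provider object itself should live for the whole conversation.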

SEE ALSO

  • Story 001 (determines implementation approach)
  • Story 003 (provides downloaded models)
  • lib/elelem/net/ollama.rb (provider interface reference)
  • lib/elelem/net/openai.rb (provider interface reference)
  • lib/elelem/net/claude.rb (provider interface reference)

Tasks

  • TBD (filled in design mode)

Acceptance Criteria

  • Provider loads model from local disk
  • Provider generates streaming responses
  • Provider works with GPU acceleration on CUDA
  • Provider works with GPU acceleration on ROCm
  • Provider falls back to CPU when no GPU available
  • Provider integrates with existing elelem conversation flow
  • Tool calling works with local models (if the model supports it)
  • Works fully offline once model is downloaded