rubyllm-red-candle

name: rubyllm/red_candle version: 0.2.0 description: | Local LLM execution with quantized GGUF models for RubyLLM. Use this skill when running models locally for zero latency, no API costs, complete privacy, and offline capability. Supports Metal (macOS), CUDA (NVIDIA), and CPU.

RubyLLM::RedCandle v{{ page.version }}

Local LLM Execution with Quantized Models

Run LLMs locally using quantized GGUF models through the Red Candle gem. Zero latency, no API costs, complete privacy, offline capable.

Gem Version: 0.2.0 GitHub: https://github.com/scientist-labs/ruby_llm-red_candle

Note: Requires Rust toolchain for native extensions.

Installation

gem 'ruby_llm-red_candle'

Rust Toolchain

curl --proto '=https' --sh https://sh.rustup.rs -sSf | sh
rustc --version

Configuration

# config/initializers/ruby_llm.rb
RubyLLM.configure do |config|
  config.red_candle_model_path = '/path/to/model.gguf'
  config.red_candle_n_threads = 8
  config.red_candle_n_gpu_layers = 0  # Set > 0 for GPU acceleration
end

Basic Usage

require 'ruby_llm/red_candle'

chat = RubyLLM.chat(model: 'local', provider: 'red_candle')
response = chat.ask "Hello!"
puts response.content

# Streaming
chat.ask "Write a story" do |chunk|
  print chunk.content
end

Supported Models

Models are automatically downloaded from HuggingFace on first use.

TinyLlama

RubyLLM.configure do |config|
  config.red_candle_model_path = 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'
end

Qwen2.5

RubyLLM.configure do |config|
  config.red_candle_model_path = 'Qwen/Qwen2.5-3B-Instruct-GGUF'
end

Gemma-3

RubyLLM.configure do |config|
  config.red_candle_model_path = 'google/gemma-3-4b-it-GGUF'
end

Phi-3

RubyLLM.configure do |config|
  config.red_candle_model_path = 'microsoft/Phi-3-mini-4k-instruct-GGUF'
end

Mistral-7B

RubyLLM.configure do |config|
  config.red_candle_model_path = 'mistralai/Mistral-7B-Instruct-v0.3-GGUF'
end

Llama-3

RubyLLM.configure do |config|
  config.red_candle_model_path = 'meta-llama/Meta-Llama-3-8B-Instruct-GGUF'
end

Hardware Acceleration

Metal (macOS) — M-series chips

RubyLLM.configure do |config|
  config.red_candle_use_metal = true
  config.red_candle_n_gpu_layers = 35  # Offload layers to GPU
end

CUDA (NVIDIA)

RubyLLM.configure do |config|
  config.red_candle_use_cuda = true
  config.red_candle_n_gpu_layers = 35
end

CPU (Default)

RubyLLM.configure do |config|
  config.red_candle_n_threads = 8
end

Model Download

Models are automatically downloaded from HuggingFace and cached in ~/.cache/red_candle/.

Manual Download

RubyLLM::RedCandle.download_model(
  repo: 'TinyLlama/TinyLlama-1.1B-Chat-v1.0',
  file: 'model.gguf'
)

Performance Tuning

RubyLLM.configure do |config|
  config.red_candle_n_ctx = 2048    # Context window size
  config.red_candle_n_batch = 512   # Prompt batch size
  config.red_candle_use_mmap = true # Memory-map model file
end

Use Cases

Development & Testing Without API Costs

if Rails.env.development?
  RubyLLM.configure do |config|
    config.red_candle_model_path = 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'
    config.red_candle_use_metal = true
    config.red_candle_n_gpu_layers = 35
  end
else
  RubyLLM.configure do |config|
    config.anthropic_api_key = ENV['ANTHROPIC_API_KEY']
  end
end

Privacy-Sensitive Applications

class MedicalAssistant < RubyLLM::Agent
  model 'local', provider: 'red_candle'
  instructions "Handle patient data confidentially."
end

Offline Applications

chat = RubyLLM.chat(model: 'local', provider: 'red_candle')
chat.ask "Help me write"  # No API calls

Limitations

Local models may be less capable than cloud models
Slower than cloud APIs without GPU acceleration
Large models require significant RAM (1GB–10GB+)

Troubleshooting

Rust Compilation Errors

rustup update
gem uninstall ruby_llm-red_candle
gem install ruby_llm-red_candle --no-cache

Out of Memory

RubyLLM.configure do |config|
  config.red_candle_n_ctx = 1024
  config.red_candle_use_mmap = true
end

Slow Performance

RubyLLM.configure do |config|
  config.red_candle_n_gpu_layers = 35
  config.red_candle_n_threads = 4
end