hub-llm - SKILL.md Agent Skill

name: hub-llm
description: "Use for language model nodes in dora. Triggers on: dora-qwen, dora-qwen2-5-vl, dora-internvl, Qwen, VLM, LLM, vision language model, large language model, text generation, image understanding, InternVL, Qwen2.5, multimodal, 大语言模型, 视觉语言模型, 文本生成"
globs: ["/dataflow.yml", "/dataflow.yaml"]
source: "https://github.com/dora-rs/dora-hub"

Language Model Nodes

LLMs and Vision-Language Models for text generation and image understanding

Available LLM Nodes

| Node | Install | Description | Type |
|------|---------|-------------|------|
| dora-qwen | `pip install dora-qwen` | Qwen2.5 text LLM | LLM |
| dora-qwen2-5-vl | `pip install dora-qwen2-5-vl` | Qwen2.5-VL multimodal | VLM |
| dora-internvl | `pip install dora-internvl` | InternVL multimodal | VLM |

dora-qwen

Qwen2.5 large language model for text generation.

YAML Configuration

- id: llm
  build: pip install dora-qwen
  path: dora-qwen
  inputs:
    text: input/text
  outputs:
    - text

Input/Output

Input: StringArray with prompt text
Output: StringArray with generated response

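import pyarrow as pa

# Sending a prompt (node is a dora Node instance)
node.send_output("text", pa.array(["What is the capital of France?"]))

# Receiving the response (event is an INPUT event from the node's event loop)
response = event["value"][0].as_py()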
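
The two fragments above normally live inside a node's event loop. The following is a minimal sketch, assuming the standard dora Python API (`Node`, iterating over events, `send_output`) and a hypothetical custom node that declares a `text` output and receives `llm/text` as an input. The names `extract_text` and `run_prompt_node` are illustrative and not part of dora-hub.

```python
def extract_text(event):
    """Return the first string of an INPUT event, or None for anything else."""
    if event["type"] != "INPUT":
        return None
    value = event["value"]
    if len(value) == 0:
        return None
    return value[0].as_py()


def run_prompt_node(prompt="What is the capital of France?"):
    """Send one prompt, then print each response. Must run inside a dora dataflow."""
    # Imported here so extract_text can be used without dora or pyarrow installed.
    import pyarrow as pa
    from dora import Node

    node = Node()
    node.send_output("text", pa.array([prompt]))
    for event in node:
        if event["type"] == "STOP":
            break
        response = extract_text(event)
        if response is not None:
            print(f"LLM: {response}")
```

`run_prompt_node()` is meant to be the entry point of the custom node's script. `extract_text` is kept separate so the parsing logic can be checked without a running dataflow.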

dora-qwen2-5-vl

Qwen2.5-VL vision-language model for image understanding.

YAML Configuration

- id: vlm
  build: pip install dora-qwen2-5-vl
  path: dora-qwen2-5-vl
  inputs:
    image:
      source: camera/image
      queue_size: 1        # Process latest image only
    text: whisper/text     # Question/prompt
  outputs:
    - text                 # Response
  env:
    DEFAULT_QUESTION: "Describe the image in a very short sentence."

Input Format

image: UInt8Array with metadata

metadata = {"width": 640, "height": 480, "encoding": "bgr8"}

text: StringArray with question (optional)

If no text input is received, the node uses the DEFAULT_QUESTION value from env.
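
The input contract can be illustrated in plain Python. This is a sketch under two assumptions that are not stated in the source: the frame is sent as raw, uncompressed pixels with three bytes per pixel for `bgr8` (and `rgb8`), and the fallback is a simple "use the text if present, else the env value" rule. The helper names are hypothetical and the node's real implementation may differ.

```python
import os

# Bytes per pixel for the raw encodings covered by this sketch (assumption:
# only uncompressed 8-bit, 3-channel frames are considered).
BYTES_PER_PIXEL = {"bgr8": 3, "rgb8": 3}


def expected_image_size(metadata):
    """Number of UInt8 values a raw frame should contain for this metadata."""
    channels = BYTES_PER_PIXEL[metadata["encoding"]]
    return metadata["width"] * metadata["height"] * channels


def resolve_question(text, env=os.environ):
    """Use the incoming text if present, otherwise fall back to DEFAULT_QUESTION."""
    if text is not None and text.strip():
        return text.strip()
    return env.get("DEFAULT_QUESTION", "Describe the image in a very short sentence.")


metadata = {"width": 640, "height": 480, "encoding": "bgr8"}
print(expected_image_size(metadata))  # 921600
```

A sender can compare `expected_image_size(metadata)` with the length of the array it is about to send to catch mismatched width, height, or encoding early.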

Output Format

# text: StringArray
response = event["value"][0].as_py()
metadata = {"primitive": "text"}  # for dora-rerun

dora-internvl

InternVL vision-language model.

YAML Configuration

- id: internvl
  build: pip install dora-internvl
  path: dora-internvl
  inputs:
    image: camera/image
    text: input/text
  outputs:
    - text

VLM + Voice Assistant Pipeline

nodes:
  # Camera
  - id: camera
    build: pip install opencv-video-capture
    path: opencv-video-capture
    inputs:
      tick: dora/timer/millis/100
    outputs:
      - image
    env:
      IMAGE_WIDTH: 640
      IMAGE_HEIGHT: 480

  # Microphone
  - id: microphone
    build: pip install dora-microphone
    path: dora-microphone
    inputs:
      tick: dora/timer/millis/100
    outputs:
      - audio

  # Voice activity detection
  - id: vad
    build: pip install dora-vad
    path: dora-vad
    inputs:
      audio: microphone/audio
    outputs:
      - audio

  # Speech to text
  - id: whisper
    build: pip install dora-distil-whisper
    path: dora-distil-whisper
    inputs:
      input: vad/audio
    outputs:
      - text
    env:
      TARGET_LANGUAGE: english

  # Vision Language Model
  - id: vlm
    build: pip install dora-qwen2-5-vl
    path: dora-qwen2-5-vl
    inputs:
      image:
        source: camera/image
        queue_size: 1
      text: whisper/text
    outputs:
      - text
    env:
      DEFAULT_QUESTION: "What do you see in this image?"

  # Text to speech
  - id: tts
    build: pip install dora-kokoro-tts
    path: dora-kokoro-tts
    inputs:
      text: vlm/text
    outputs:
      - audio

  # Speaker
  - id: speaker
    build: pip install dora-pyaudio
    path: dora-pyaudio
    inputs:
      audio: tts/audio

  # Visualization
  - id: rerun
    build: pip install dora-rerun
    path: dora-rerun
    inputs:
      image: camera/image
      vlm_response:
        source: vlm/text
        metadata:
          primitive: "text"

Text-Only LLM Pipeline

nodes:
  # Terminal input
  - id: terminal
    build: pip install terminal-input
    path: terminal-input
    outputs:
      - text

  # LLM
  - id: llm
    build: pip install dora-qwen
    path: dora-qwen
    inputs:
      text: terminal/text
    outputs:
      - text

  # Visualization
  - id: rerun
    build: pip install dora-rerun
    path: dora-rerun
    inputs:
      user_input:
        source: terminal/text
        metadata:
          primitive: "text"
      llm_response:
        source: llm/text
        metadata:
          primitive: "text"

VLM + Object Detection Pipeline

nodes:
  - id: camera
    build: pip install opencv-video-capture
    path: opencv-video-capture
    inputs:
      tick: dora/timer/millis/100
    outputs:
      - image

  - id: yolo
    build: pip install dora-yolo
    path: dora-yolo
    inputs:
      image: camera/image
    outputs:
      - bbox

  - id: vlm
    build: pip install dora-qwen2-5-vl
    path: dora-qwen2-5-vl
    inputs:
      image:
        source: camera/image
        queue_size: 1
    outputs:
      - text
    env:
      DEFAULT_QUESTION: "Describe what you see and any notable objects."

  - id: rerun
    build: pip install dora-rerun
    path: dora-rerun
    inputs:
      image: camera/image
      detections: yolo/bbox
      description: vlm/text
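
In the pipelines above, every input names its source as `node_id/output`, either as a plain string or under a `source` key. A small check like the one below can catch typos in that wiring before a dataflow is started. It is only a sketch: `find_unresolved_inputs` is a hypothetical helper, the node list is written as the structure a YAML loader would be expected to produce, and sources starting with `dora/timer/` are assumed to be built in.

```python
def find_unresolved_inputs(nodes):
    """Return (node_id, input_name, source) for each input with no matching output."""
    declared = {
        f"{node['id']}/{output}"
        for node in nodes
        for output in node.get("outputs", [])
    }
    problems = []
    for node in nodes:
        for name, spec in node.get("inputs", {}).items():
            # An input is either a "node/output" string or a mapping with "source".
            source = spec["source"] if isinstance(spec, dict) else spec
            if source.startswith("dora/timer/"):
                continue  # built-in timer, not produced by a node
            if source not in declared:
                problems.append((node["id"], name, source))
    return problems


# The VLM + object detection pipeline, reduced to ids, inputs, and outputs.
nodes = [
    {"id": "camera", "inputs": {"tick": "dora/timer/millis/100"}, "outputs": ["image"]},
    {"id": "yolo", "inputs": {"image": "camera/image"}, "outputs": ["bbox"]},
    {"id": "vlm", "inputs": {"image": {"source": "camera/image", "queue_size": 1}},
     "outputs": ["text"]},
    {"id": "rerun", "inputs": {"image": "camera/image", "detections": "yolo/bbox",
                               "description": "vlm/text"}},
]
print(find_unresolved_inputs(nodes))  # []
```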

Text Data Format

Sending Text

import pyarrow as pa

# Send text prompt
text = "What is in this image?"
node.send_output("text", pa.array([text]), {"primitive": "text"})

Receiving Text

if event["type"] == "INPUT":
    text = event["value"][0].as_py()
    print(f"Received: {text}")
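
When a node has more than one text input, `event["id"]` identifies which input produced the event. The sketch below uses the input names `user_input` and `llm_response` from the rerun node in the Text-Only LLM Pipeline. `format_chat_line` is a hypothetical helper, not part of dora-hub.

```python
def format_chat_line(event):
    """Turn a text INPUT event into a transcript line, keyed on the input id."""
    if event["type"] != "INPUT":
        return None
    speakers = {"user_input": "User", "llm_response": "LLM"}
    # Unknown input ids fall back to the raw id so nothing is silently dropped.
    speaker = speakers.get(event["id"], event["id"])
    text = event["value"][0].as_py()
    return f"{speaker}: {text}"
```

Inside a node's event loop, this would be called once per event, and any non-`None` result printed or logged.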

Queue Size Configuration

For VLMs processing images, use queue_size: 1 to process only the latest frame:

inputs:
  image:
    source: camera/image
    queue_size: 1  # Drop old frames, process latest only

This prevents a backlog of stale frames from building up when inference is slower than the camera's frame rate.
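
The effect can be modelled with a bounded queue. The simulation below is only an illustration, assuming the input queue discards its oldest entry once it is full. It uses `collections.deque` as a stand-in and does not reproduce dora's actual scheduler. `frames_processed` is a hypothetical helper.

```python
from collections import deque


def frames_processed(num_frames, frames_per_inference, queue_size):
    """Simulate a slow consumer that finishes one inference every few frames."""
    queue = deque(maxlen=queue_size)  # a full deque drops its oldest entry on append
    processed = []
    for frame in range(num_frames):
        queue.append(frame)
        if (frame + 1) % frames_per_inference == 0:
            processed.append(queue.popleft())
    return processed


print(frames_processed(9, 3, queue_size=1))   # [2, 5, 8]
print(frames_processed(9, 3, queue_size=10))  # [0, 1, 2]
```

With a queue of one, each inference runs on the newest frame. With a larger queue, the model keeps working through old frames and falls further behind the camera.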

Related Skills

hub-audio - Speech-to-text for voice interaction
hub-detection - Object detection with YOLO
hub-visualization - Text and image visualization