Distributed Llama Docker Setup for Raspberry Pi

This directory contains Docker configurations for running Distributed Llama on Raspberry Pi devices. There are two image variants (a sketch of how they fit together follows below):

  1. Controller (Dockerfile.controller) - Downloads models and runs the API server
  2. Worker (Dockerfile.worker) - Runs worker nodes that connect to the controller
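
The docker-compose.yml in this directory wires these variants together. The sketch below is only an illustration of that layout: the service names (controller, worker1-worker3) and worker count match what this README uses elsewhere, but the network details and exact command arguments are assumptions, so treat the shipped compose file as authoritative.

# Rough sketch of a compose layout for 1 controller + 3 workers (illustrative, not the shipped file)
services:
  controller:
    build:
      context: .
      dockerfile: Dockerfile.controller
    ports:
      - "9999:9999"              # API exposed on the host
    volumes:
      - ./models:/app/models     # downloaded models persist on the host
    command: --model ${MODEL_NAME:-llama3_2_3b_instruct_q40} --workers worker1:9999 worker2:9999 worker3:9999
  worker1:
    build:
      context: .
      dockerfile: Dockerfile.worker
  worker2:
    build:
      context: .
      dockerfile: Dockerfile.worker
  worker3:
    build:
      context: .
      dockerfile: Dockerfile.worker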

Quick Start with Docker Compose

1. Download a Model

First, download a model using the controller container:

# Create a models directory
mkdir -p models

# Download a model (this will take some time)
docker-compose run --rm controller --download llama3_2_3b_instruct_q40

2. Start the Distributed Setup

# Start all services (1 controller + 3 workers)
docker-compose up

The API will be available at http://localhost:9999.
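
If you prefer to keep the stack running in the background, the usual Compose workflow applies (service names follow those used elsewhere in this README):

# Start detached, follow the controller logs, and shut down when done
docker-compose up -d
docker-compose logs -f controller
docker-compose down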

3. Test the API

# List available models
curl http://localhost:9999/v1/models

# Send a chat completion request
curl -X POST http://localhost:9999/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 100
  }'
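
To pull just the reply text out of the JSON response, you can pipe through jq, assuming jq is installed and the response follows the usual OpenAI-style chat completion schema:

# Extract only the assistant's reply (response layout assumed to be OpenAI-compatible)
curl -s -X POST http://localhost:9999/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}' \
  | jq -r '.choices[0].message.content'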

Manual Docker Usage

Building the Images

# Build controller image
docker build -f Dockerfile.controller -t distributed-llama-controller .

# Build worker image  
docker build -f Dockerfile.worker -t distributed-llama-worker .

Running the Controller

# Download a model first
docker run -v ./models:/app/models distributed-llama-controller --download llama3_2_3b_instruct_q40

# Run API server (standalone mode, no workers)
docker run -p 9999:9999 -v ./models:/app/models distributed-llama-controller \
  --model llama3_2_3b_instruct_q40

# Run API server with workers
docker run -p 9999:9999 -v ./models:/app/models distributed-llama-controller \
  --model llama3_2_3b_instruct_q40 \
  --workers 10.0.0.2:9999 10.0.0.3:9999 10.0.0.4:9999

Running Workers

# Run a worker on default port 9999
docker run -p 9999:9999 distributed-llama-worker

# Run a worker with custom settings
docker run -p 9998:9998 distributed-llama-worker --port 9998 --nthreads 2

Available Models

You can download any of these models:

  • llama3_1_8b_instruct_q40
  • llama3_1_405b_instruct_q40 (very large, 56 parts)
  • llama3_2_1b_instruct_q40
  • llama3_2_3b_instruct_q40
  • llama3_3_70b_instruct_q40
  • deepseek_r1_distill_llama_8b_q40
  • qwen3_0.6b_q40
  • qwen3_1.7b_q40
  • qwen3_8b_q40
  • qwen3_14b_q40
  • qwen3_30b_a3b_q40

Configuration Options

Controller Options

  • --model <name>: Model name to use (required)
  • --port <port>: API server port (default: 9999)
  • --nthreads <n>: Number of threads (default: 4)
  • --max-seq-len <n>: Maximum sequence length (default: 4096)
  • --buffer-float-type <type>: Buffer float type (default: q80)
  • --workers <addresses>: Space-separated worker addresses
  • --download <model>: Download a model and exit

Worker Options

  • --port <port>: Worker port (default: 9999)
  • --nthreads <n>: Number of threads (default: 4)
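
As a purely illustrative combination of these options (the values below are examples, not tuned recommendations):

# Controller with an explicit context length and buffer type, pointing at one remote worker
docker run -p 9999:9999 -v ./models:/app/models distributed-llama-controller \
  --model llama3_2_3b_instruct_q40 \
  --nthreads 4 \
  --max-seq-len 8192 \
  --buffer-float-type q80 \
  --workers 10.0.0.2:9998

# Matching worker listening on the non-default port
docker run -p 9998:9998 distributed-llama-worker --port 9998 --nthreads 4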

Environment Variables (Docker Compose)

You can customize the setup using environment variables:

# Set model and thread counts
MODEL_NAME=llama3_2_1b_instruct_q40 \
CONTROLLER_NTHREADS=2 \
WORKER_NTHREADS=2 \
docker-compose up

Available variables:

  • MODEL_NAME: Model to use (default: llama3_2_3b_instruct_q40)
  • CONTROLLER_NTHREADS: Controller threads (default: 4)
  • WORKER_NTHREADS: Worker threads (default: 4)
  • MAX_SEQ_LEN: Maximum sequence length (default: 4096)
  • BUFFER_FLOAT_TYPE: Buffer float type (default: q80)
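
These variables presumably reach the containers through standard Compose variable substitution with defaults. A sketch of what that wiring likely looks like (check the actual docker-compose.yml for the authoritative version):

# Sketch only: values fall back to the documented defaults when the variables are unset
services:
  controller:
    command: >
      --model ${MODEL_NAME:-llama3_2_3b_instruct_q40}
      --nthreads ${CONTROLLER_NTHREADS:-4}
      --max-seq-len ${MAX_SEQ_LEN:-4096}
      --buffer-float-type ${BUFFER_FLOAT_TYPE:-q80}
  worker1:
    command: --nthreads ${WORKER_NTHREADS:-4}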

Multi-Device Setup

To run across multiple Raspberry Pi devices:

Device 1 (Controller)

# Run controller
docker run -p 9999:9999 -v ./models:/app/models distributed-llama-controller \
  --model llama3_2_3b_instruct_q40 \
  --workers 10.0.0.2:9999 10.0.0.3:9999 10.0.0.4:9999

Devices 2-4 (Workers)

# Run worker on each device
docker run -p 9999:9999 distributed-llama-worker --nthreads 4
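
On dedicated devices you may want the worker to come back automatically after a reboot or crash; standard Docker flags cover this (a convenience sketch, not part of the shipped scripts):

# Run the worker detached and restart it automatically unless it was stopped manually
docker run -d --restart unless-stopped -p 9999:9999 distributed-llama-worker --nthreads 4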

Performance Tips

  1. Thread Count: Set --nthreads to the number of CPU cores on each device (see the sketch after this list)
  2. Memory: Larger models require more RAM. Monitor usage with docker stats
  3. Network: Use wired Ethernet connections for better performance between devices
  4. Storage: Use fast SD cards (Class 10 or better) or USB 3.0 storage for model files
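
For example, you can let the device report its own core count instead of hard-coding a number (a small convenience sketch):

# nproc prints the number of available CPU cores on the device
docker run -p 9999:9999 distributed-llama-worker --nthreads $(nproc)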

Troubleshooting

Model Download Issues

# Check if model files exist
ls -la models/llama3_2_3b_instruct_q40/

# Re-download if corrupted
docker-compose run --rm controller --download llama3_2_3b_instruct_q40

Worker Connection Issues

# Check worker logs
docker-compose logs worker1

# Test network connectivity
docker exec -it <controller_container> ping 172.20.0.11
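
When the workers run on separate devices rather than inside the Compose network, a quick reachability check from the controller host can help (assumes netcat is installed; replace the address with your worker's IP):

# Check that the worker's port is reachable from the controller device
nc -zv 10.0.0.2 9999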

Resource Issues

# Monitor resource usage
docker stats

# Reduce thread count if CPU usage is too high
CONTROLLER_NTHREADS=2 WORKER_NTHREADS=2 docker-compose up

Web Interface

You can use the web chat interface at llama-ui.js.org:

  1. Open the website
  2. Go to settings
  3. Set base URL to: http://your-pi-ip:9999
  4. Save and start chatting

License

This Docker setup follows the same license as the main Distributed Llama project.