init

2025-10-24 11:42:14 +02:00
commit 42172cbb6f
85 changed files with 40316 additions and 0 deletions


@@ -0,0 +1,32 @@
# How to Convert 🤗 Hugging Face Model
Currently, Distributed Llama supports these Hugging Face model architectures: `llama`, `mistral`, `qwen3` and `qwen3_moe`. You can try to convert any compatible Hugging Face model and run it with Distributed Llama.
> [!IMPORTANT]
> All converters are in the early stages of development. After conversion, the model may not work correctly.
1. Download a model, for example: [Mistral-7B-v0.3](https://huggingface.co/mistralai/Mistral-7B-v0.3/tree/main).
2. The downloaded model should contain `config.json`, `tokenizer.json`, `tokenizer_config.json`, `tokenizer.model` and the safetensors files.
3. Run the model converter:
```sh
cd converter
python convert-hf.py path/to/hf/model q40 mistral-7b-0.3
```
4. Run the tokenizer converter:
```sh
python convert-tokenizer-hf.py path/to/hf/model mistral-7b-0.3
```
5. That's it! Now you can run Distributed Llama (an interactive chat sketch follows this list).
```sh
./dllama inference \
--prompt "Hello world" \
--steps 64 \
--model dllama_model_mistral-7b-0.3_q40.m \
--tokenizer dllama_tokenizer_mistral-7b-0.3.t \
--buffer-float-type q80
```
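Once both converters have finished, you can also talk to the converted model interactively. A minimal sketch, reusing the `chat` mode with the same model, tokenizer and buffer flags as the inference example in step 5 (the file names depend on the output name you passed to the converters):
```sh
./dllama chat \
    --model dllama_model_mistral-7b-0.3_q40.m \
    --tokenizer dllama_tokenizer_mistral-7b-0.3.t \
    --buffer-float-type q80
```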

docs/HOW_TO_RUN_GPU.md Normal file

@@ -0,0 +1,34 @@
# How to Run Distributed Llama on 🧠 GPU
Distributed Llama can run on GPU devices using the Vulkan API. This article describes how to build and run the project on a GPU.
Before you start, please check how to build and run Distributed Llama on the CPU:
* [🍓 How to Run on Raspberry Pi](./HOW_TO_RUN_RASPBERRYPI.md)
* [💻 How to Run on Linux, MacOS or Windows](./HOW_TO_RUN_LINUX_MACOS_WIN.md)
To run on GPU, please follow these steps:
1. Install the Vulkan SDK for your platform (a quick verification sketch follows this list).
* Linux: please check [this article](https://vulkan.lunarg.com/doc/view/latest/linux/getting_started_ubuntu.html).
* MacOS: download SDK [here](https://vulkan.lunarg.com/sdk/home#mac).
2. Build Distributed Llama with GPU support:
```bash
DLLAMA_VULKAN=1 make dllama
DLLAMA_VULKAN=1 make dllama-api
```
3. Now the `dllama` and `dllama-api` binaries support arguments related to GPU usage.
```
--gpu-index <index> Use GPU device with given index (use `0` for first device)
```
4. You can run the root node or a worker node on the GPU by specifying the `--gpu-index` argument. The Vulkan backend requires a single thread, so you should also set `--nthreads 1`.
```bash
./dllama inference ... --nthreads 1 --gpu-index 0
./dllama chat ... --nthreads 1 --gpu-index 0
./dllama worker ... --nthreads 1 --gpu-index 0
./dllama-api ... --nthreads 1 --gpu-index 0
```
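After installing the SDK in step 1, it's worth checking that Vulkan can actually see your GPU before you build. A minimal sketch, assuming the `vulkaninfo` tool shipped with the Vulkan SDK is on your `PATH` (older versions may not support `--summary`; run it without arguments in that case):
```bash
# Prints a short summary of the Vulkan devices detected on this machine.
# If your GPU is not listed here, dllama will not be able to use it.
vulkaninfo --summary
```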


@@ -0,0 +1,89 @@
# How to Run Distributed Llama on 💻 Linux, MacOS or Windows
This article describes how to run Distributed Llama on 4 devices, but you can also run it on 1, 2, 4, 8 or more devices. Please adjust the commands and topology according to your configuration.
````
[🔀 SWITCH OR ROUTER]
| | | |
| | | |_______ 🔸 device1 (ROOT) 10.0.0.1
| | |_________ 🔹 device2 (WORKER 1) 10.0.0.2:9999
| |___________ 🔹 device3 (WORKER 2) 10.0.0.3:9999
|_____________ 🔹 device4 (WORKER 3) 10.0.0.4:9999
````
1. Install Git and C++ compiler on **🔸🔹 ALL** devices:
* Linux:
```
sudo apt install git build-essential
```
* MacOS:
```
brew install git
```
* Windows:
Install Git and Mingw (via [Chocolatey](https://chocolatey.org/install)):
```powershell
choco install git mingw
```
2. Connect **🔸🔹 ALL** devices to your **🔀 SWITCH OR ROUTER** via Ethernet cable. If you're using only two devices, it's better to connect them directly without a switch.
3. Clone this repository and compile Distributed Llama on **🔸🔹 ALL** devices:
```sh
git clone https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make dllama
make dllama-api
```
4. Download the model to the **🔸 ROOT** device using the `launch.py` script. You don't need to download the model on worker devices.
```sh
python3 launch.py # Prints a list of available models
python3 launch.py llama3_2_3b_instruct_q40 # Downloads the model to the root device
```
5. Start workers on all **🔹 WORKER** devices:
```sh
./dllama worker --port 9999 --nthreads 4
```
6. On the **🔸 ROOT** device, run the inference to test that everything works:
```sh
./dllama inference \
--prompt "Hello world" \
--steps 32 \
--model models/llama3_2_3b_instruct_q40/dllama_model_llama3_2_3b_instruct_q40.m \
--tokenizer models/llama3_2_3b_instruct_q40/dllama_tokenizer_llama3_2_3b_instruct_q40.t \
--buffer-float-type q80 \
--nthreads 4 \
--max-seq-len 4096 \
--workers 10.0.0.2:9999 10.0.0.3:9999 10.0.0.4:9999
```
7. To run the API server, start it on the **🔸 ROOT** device:
```sh
./dllama-api \
--port 9999 \
--model models/llama3_2_3b_instruct_q40/dllama_model_llama3_2_3b_instruct_q40.m \
--tokenizer models/llama3_2_3b_instruct_q40/dllama_tokenizer_llama3_2_3b_instruct_q40.t \
--buffer-float-type q80 \
--nthreads 4 \
--max-seq-len 4096 \
--workers 10.0.0.2:9999 10.0.0.3:9999 10.0.0.4:9999
```
Now you can connect to the API server (a `curl` example follows this list):
```
http://10.0.0.1:9999/v1/models
```
8. When the API server is running, you can use the web chat in your browser: open [llama-ui.js.org](https://llama-ui.js.org/), go to the settings and set the base URL to `http://10.0.0.1:9999`. Press the "save" button and start chatting!
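You can also test the API server from the command line with `curl`. A minimal sketch, assuming `dllama-api` exposes an OpenAI-compatible `/v1/chat/completions` endpoint alongside the `/v1/models` endpoint shown above; adjust the route and request fields if your build differs:
```sh
curl http://10.0.0.1:9999/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [{"role": "user", "content": "Hello world"}],
        "max_tokens": 64
    }'
```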


@@ -0,0 +1,96 @@
# How to Run Distributed Llama on 🍓 Raspberry Pi
This article describes how to run Distributed Llama on 4 Raspberry Pi devices, but you can also run it on 1, 2, 4, 8 or more devices. Please adjust the commands and topology according to your configuration.
````
[🔀 SWITCH OR ROUTER]
| | | |
| | | |_______ 🔸 raspberrypi1 (ROOT) 10.0.0.1
| | |_________ 🔹 raspberrypi2 (WORKER 1) 10.0.0.2:9999
| |___________ 🔹 raspberrypi3 (WORKER 2) 10.0.0.3:9999
|_____________ 🔹 raspberrypi4 (WORKER 3) 10.0.0.4:9999
````
1. Install `Raspberry Pi OS Lite (64 bit)` on **🔸🔹 ALL** Raspberry Pi devices. This OS doesn't have a desktop environment, but you can easily connect via SSH to manage it.
2. Connect **🔸🔹 ALL** devices to your **🔀 SWITCH OR ROUTER** via Ethernet cable. If you're using only two devices, it's better to connect them directly without a switch.
3. Connect to all devices via SSH from your computer.
```
ssh user@raspberrypi1.local
ssh user@raspberrypi2.local
ssh user@raspberrypi3.local
ssh user@raspberrypi4.local
```
4. Install Git on **🔸🔹 ALL** devices:
```sh
sudo apt install git
```
5. Clone this repository and compile Distributed Llama on **🔸🔹 ALL** devices:
```sh
git clone https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make dllama
make dllama-api
```
6. Download the model to the **🔸 ROOT** device using the `launch.py` script. You don't need to download the model on worker devices.
```sh
python3 launch.py # Prints a list of available models
python3 launch.py llama3_2_3b_instruct_q40 # Downloads the model to the root device
```
7. Assign static IP addresses on **🔸🔹 ALL** devices. Each device must have a unique IP address in the same subnet (see the note on persistence after this list).
```sh
sudo ip addr add 10.0.0.1/24 dev eth0 # 🔸 ROOT
sudo ip addr add 10.0.0.2/24 dev eth0 # 🔹 WORKER 1
sudo ip addr add 10.0.0.3/24 dev eth0 # 🔹 WORKER 2
sudo ip addr add 10.0.0.4/24 dev eth0 # 🔹 WORKER 3
```
8. Start workers on all **🔹 WORKER** devices:
```sh
sudo nice -n -20 ./dllama worker --port 9999 --nthreads 4
```
9. On the **🔸 ROOT** device, run the inference to test that everything works:
```sh
sudo nice -n -20 ./dllama inference \
--prompt "Hello world" \
--steps 32 \
--model models/llama3_2_3b_instruct_q40/dllama_model_llama3_2_3b_instruct_q40.m \
--tokenizer models/llama3_2_3b_instruct_q40/dllama_tokenizer_llama3_2_3b_instruct_q40.t \
--buffer-float-type q80 \
--nthreads 4 \
--max-seq-len 4096 \
--workers 10.0.0.2:9999 10.0.0.3:9999 10.0.0.4:9999
```
10. To run the API server, start it on the **🔸 ROOT** device:
```sh
sudo nice -n -20 ./dllama-api \
--port 9999 \
--model models/llama3_2_3b_instruct_q40/dllama_model_llama3_2_3b_instruct_q40.m \
--tokenizer models/llama3_2_3b_instruct_q40/dllama_tokenizer_llama3_2_3b_instruct_q40.t \
--buffer-float-type q80 \
--nthreads 4 \
--max-seq-len 4096 \
--workers 10.0.0.2:9999 10.0.0.3:9999 10.0.0.4:9999
```
Now you can connect to the API server from your computer:
```
http://raspberrypi1.local:9999/v1/models
```
11. When the API server is running, you can use the web chat in your browser: open [llama-ui.js.org](https://llama-ui.js.org/), go to the settings and set the base URL to `http://raspberrypi1.local:9999`. Press the "save" button and start chatting!
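A note on step 7: addresses added with `ip addr add` are not persistent and disappear after a reboot. If your Raspberry Pi OS release uses NetworkManager (the default on recent images), one way to make the address permanent is `nmcli`. A sketch, assuming the wired profile is named `Wired connection 1` (check with `nmcli connection show`) and using the root device's address; adjust the profile name and IP per device:
```sh
# Configure a static LAN address on the wired interface.
# No gateway or DNS is set here; the cluster runs on an isolated subnet.
sudo nmcli connection modify "Wired connection 1" \
    ipv4.method manual \
    ipv4.addresses 10.0.0.1/24
sudo nmcli connection up "Wired connection 1"
```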