docs/HOW_TO_CONVERT_HF_MODEL.md

# How to Convert 🤗 Hugging Face Model

Currently, Distributed Llama supports these Hugging Face model architectures: `llama`, `mistral`, `qwen3` and `qwen3_moe`. You can try to convert any compatible Hugging Face model and run it with Distributed Llama.

> [!IMPORTANT]
> All converters are in the early stages of development. After conversion, the model may not work correctly.

1. Download a model, for example: [Mistral-7B-v0.3](https://huggingface.co/mistralai/Mistral-7B-v0.3/tree/main).
2. The downloaded model should contain `config.json`, `tokenizer.json`, `tokenizer_config.json`, `tokenizer.model` and the safetensors files.
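
   Before converting, you can optionally check that these files are actually present. A minimal sketch, assuming the model was downloaded to `path/to/hf/model`:

   ```sh
   # List the required files; ls reports any file that is missing.
   ls path/to/hf/model/config.json \
      path/to/hf/model/tokenizer.json \
      path/to/hf/model/tokenizer_config.json \
      path/to/hf/model/tokenizer.model \
      path/to/hf/model/*.safetensors
   ```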

3. Run the model converter. The arguments are the path to the Hugging Face model, the target weights format and the output name:

   ```sh
   cd converter
   python convert-hf.py path/to/hf/model q40 mistral-7b-0.3
   ```

4. Run the tokenizer converter:

   ```sh
   python convert-tokenizer-hf.py path/to/hf/model mistral-7b-0.3
   ```

5. That's it! Now you can run Distributed Llama.

   ```sh
   ./dllama inference \
     --prompt "Hello world" \
     --steps 64 \
     --model dllama_model_mistral-7b-0.3_q40.m \
     --tokenizer dllama_tokenizer_mistral-7b-0.3.t \
     --buffer-float-type q80
   ```
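
   You can also talk to the converted model interactively. A minimal sketch, assuming the `chat` command (shown in the GPU guide) accepts the same model and tokenizer arguments as `inference`:

   ```sh
   ./dllama chat \
     --model dllama_model_mistral-7b-0.3_q40.m \
     --tokenizer dllama_tokenizer_mistral-7b-0.3.t \
     --buffer-float-type q80
   ```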

docs/HOW_TO_RUN_GPU.md

# How to Run Distributed Llama on 🧠 GPU

Distributed Llama can run on GPU devices using the Vulkan API. This article describes how to build and run the project on a GPU.

Before you start here, please check how to build and run Distributed Llama on CPU:

* [🍓 How to Run on Raspberry Pi](./HOW_TO_RUN_RASPBERRYPI.md)
* [💻 How to Run on Linux, MacOS or Windows](./HOW_TO_RUN_LINUX_MACOS_WIN.md)

To run on GPU, please follow these steps:

1. Install the Vulkan SDK for your platform.
   * Linux: please check [this article](https://vulkan.lunarg.com/doc/view/latest/linux/getting_started_ubuntu.html).
   * MacOS: download the SDK [here](https://vulkan.lunarg.com/sdk/home#mac).
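
   Optionally, you can confirm that the Vulkan loader can see your GPU before building. This assumes the `vulkaninfo` tool that ships with the SDK is on your `PATH`:

   ```sh
   vulkaninfo --summary   # lists the Vulkan-capable devices the loader can see
   ```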
2. Build Distributed Llama with GPU support:

   ```bash
   DLLAMA_VULKAN=1 make dllama
   DLLAMA_VULKAN=1 make dllama-api
   ```

3. Now the `dllama` and `dllama-api` binaries support an argument related to GPU usage.

   ```
   --gpu-index <index>    Use the GPU device with the given index (use `0` for the first device)
   ```

4. You can run the root node or a worker node on the GPU by specifying the `--gpu-index` argument. The Vulkan backend requires a single thread, so you should also set `--nthreads 1`.

   ```bash
   ./dllama inference ... --nthreads 1 --gpu-index 0
   ./dllama chat ... --nthreads 1 --gpu-index 0
   ./dllama worker ... --nthreads 1 --gpu-index 0
   ./dllama-api ... --nthreads 1 --gpu-index 0
   ```
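
   For example, a complete root-node invocation on a single GPU could look like the sketch below. The model and tokenizer paths assume the `llama3_2_3b_instruct_q40` model downloaded with `launch.py`, as described in the CPU guides:

   ```sh
   ./dllama inference \
     --prompt "Hello world" \
     --steps 32 \
     --model models/llama3_2_3b_instruct_q40/dllama_model_llama3_2_3b_instruct_q40.m \
     --tokenizer models/llama3_2_3b_instruct_q40/dllama_tokenizer_llama3_2_3b_instruct_q40.t \
     --buffer-float-type q80 \
     --max-seq-len 4096 \
     --nthreads 1 \
     --gpu-index 0
   ```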

docs/HOW_TO_RUN_LINUX_MACOS_WIN.md

# How to Run Distributed Llama on 💻 Linux, MacOS or Windows

This article describes how to run Distributed Llama on 4 devices, but you can also run it on 1, 2, 4, 8... devices. Please adjust the commands and topology according to your configuration.

````
[🔀 SWITCH OR ROUTER]
| | | |
| | | |_______ 🔸 device1 (ROOT) 10.0.0.1
| | |_________ 🔹 device2 (WORKER 1) 10.0.0.2:9999
| |___________ 🔹 device3 (WORKER 2) 10.0.0.3:9999
|_____________ 🔹 device4 (WORKER 3) 10.0.0.4:9999
````

1. Install Git and a C++ compiler on **🔸🔹 ALL** devices:

   * Linux:
     ```sh
     sudo apt install git build-essential
     ```
   * MacOS:
     ```sh
     brew install git
     ```
   * Windows:

     Install Git and Mingw (via [Chocolatey](https://chocolatey.org/install)):
     ```powershell
     choco install git mingw
     ```
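
   Optionally, verify that the tools are available before moving on. This assumes `g++` is the compiler on your `PATH` (Mingw provides it on Windows; on MacOS the Xcode command line tools provide an equivalent):

   ```sh
   git --version
   g++ --version
   ```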

2. Connect **🔸🔹 ALL** devices to your **🔀 SWITCH OR ROUTER** via Ethernet cables. If you're using only two devices, it's better to connect them directly without a switch.

3. Clone this repository and compile Distributed Llama on **🔸🔹 ALL** devices:

   ```sh
   git clone https://github.com/b4rtaz/distributed-llama.git
   cd distributed-llama
   make dllama
   make dllama-api
   ```

4. Download the model to the **🔸 ROOT** device using the `launch.py` script. You don't need to download the model on worker devices.

   ```sh
   python3 launch.py                          # Prints a list of available models

   python3 launch.py llama3_2_3b_instruct_q40 # Downloads the model to the root device
   ```

5. Start workers on all **🔹 WORKER** devices:

   ```sh
   ./dllama worker --port 9999 --nthreads 4
   ```

6. Run an inference on the **🔸 ROOT** device to check that everything works:

   ```sh
   ./dllama inference \
     --prompt "Hello world" \
     --steps 32 \
     --model models/llama3_2_3b_instruct_q40/dllama_model_llama3_2_3b_instruct_q40.m \
     --tokenizer models/llama3_2_3b_instruct_q40/dllama_tokenizer_llama3_2_3b_instruct_q40.t \
     --buffer-float-type q80 \
     --nthreads 4 \
     --max-seq-len 4096 \
     --workers 10.0.0.2:9999 10.0.0.3:9999 10.0.0.4:9999
   ```

7. To run the API server, start it on the **🔸 ROOT** device:

   ```sh
   ./dllama-api \
     --port 9999 \
     --model models/llama3_2_3b_instruct_q40/dllama_model_llama3_2_3b_instruct_q40.m \
     --tokenizer models/llama3_2_3b_instruct_q40/dllama_tokenizer_llama3_2_3b_instruct_q40.t \
     --buffer-float-type q80 \
     --nthreads 4 \
     --max-seq-len 4096 \
     --workers 10.0.0.2:9999 10.0.0.3:9999 10.0.0.4:9999
   ```

   Now you can connect to the API server:

   ```
   http://10.0.0.1:9999/v1/models
   ```
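
   If you prefer the command line, you can also send a test request with `curl`. This is a sketch that assumes the server exposes an OpenAI-compatible `/v1/chat/completions` endpoint and accepts the model name as shown:

   ```sh
   curl http://10.0.0.1:9999/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "llama3_2_3b_instruct_q40",
       "messages": [{"role": "user", "content": "Hello world"}],
       "max_tokens": 64
     }'
   ```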

8. When the API server is running, you can open the web chat in your browser: open [llama-ui.js.org](https://llama-ui.js.org/), go to the settings and set the base URL to `http://10.0.0.1:9999`. Press the "save" button and start chatting!

docs/HOW_TO_RUN_RASPBERRYPI.md

# How to Run Distributed Llama on 🍓 Raspberry Pi

This article describes how to run Distributed Llama on 4 Raspberry Pi devices, but you can also run it on 1, 2, 4, 8... devices. Please adjust the commands and topology according to your configuration.

````
[🔀 SWITCH OR ROUTER]
| | | |
| | | |_______ 🔸 raspberrypi1 (ROOT) 10.0.0.1
| | |_________ 🔹 raspberrypi2 (WORKER 1) 10.0.0.2:9999
| |___________ 🔹 raspberrypi3 (WORKER 2) 10.0.0.3:9999
|_____________ 🔹 raspberrypi4 (WORKER 3) 10.0.0.4:9999
````

1. Install `Raspberry Pi OS Lite (64 bit)` on **🔸🔹 ALL** your Raspberry Pi devices. This OS doesn't have a desktop environment, but you can easily connect to it via SSH to manage it.
2. Connect **🔸🔹 ALL** devices to your **🔀 SWITCH OR ROUTER** via Ethernet cables. If you're using only two devices, it's better to connect them directly without a switch.
3. Connect to all devices via SSH from your computer:

   ```
   ssh user@raspberrypi1.local
   ssh user@raspberrypi2.local
   ssh user@raspberrypi3.local
   ssh user@raspberrypi4.local
   ```

4. Install Git on **🔸🔹 ALL** devices:

   ```sh
   sudo apt install git
   ```

5. Clone this repository and compile Distributed Llama on **🔸🔹 ALL** devices:

   ```sh
   git clone https://github.com/b4rtaz/distributed-llama.git
   cd distributed-llama
   make dllama
   make dllama-api
   ```

6. Download the model to the **🔸 ROOT** device using the `launch.py` script. You don't need to download the model on worker devices.

   ```sh
   python3 launch.py                          # Prints a list of available models

   python3 launch.py llama3_2_3b_instruct_q40 # Downloads the model to the root device
   ```

7. Assign static IP addresses on **🔸🔹 ALL** devices. Each device must have a unique IP address in the same subnet.

   ```sh
   sudo ip addr add 10.0.0.1/24 dev eth0 # 🔸 ROOT
   sudo ip addr add 10.0.0.2/24 dev eth0 # 🔹 WORKER 1
   sudo ip addr add 10.0.0.3/24 dev eth0 # 🔹 WORKER 2
   sudo ip addr add 10.0.0.4/24 dev eth0 # 🔹 WORKER 3
   ```
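
   Optionally, confirm that the devices can reach each other before starting any workers. A quick check from the 🔸 ROOT device, using the addresses from the diagram above:

   ```sh
   ping -c 1 10.0.0.2   # 🔹 WORKER 1 should reply
   ping -c 1 10.0.0.3   # 🔹 WORKER 2 should reply
   ping -c 1 10.0.0.4   # 🔹 WORKER 3 should reply
   ```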

8. Start workers on all **🔹 WORKER** devices. The `nice -n -20` prefix runs the process at the highest scheduling priority:

   ```sh
   sudo nice -n -20 ./dllama worker --port 9999 --nthreads 4
   ```

9. Run an inference on the **🔸 ROOT** device to check that everything works:

   ```sh
   sudo nice -n -20 ./dllama inference \
     --prompt "Hello world" \
     --steps 32 \
     --model models/llama3_2_3b_instruct_q40/dllama_model_llama3_2_3b_instruct_q40.m \
     --tokenizer models/llama3_2_3b_instruct_q40/dllama_tokenizer_llama3_2_3b_instruct_q40.t \
     --buffer-float-type q80 \
     --nthreads 4 \
     --max-seq-len 4096 \
     --workers 10.0.0.2:9999 10.0.0.3:9999 10.0.0.4:9999
   ```

10. To run the API server, start it on the **🔸 ROOT** device:

    ```sh
    sudo nice -n -20 ./dllama-api \
      --port 9999 \
      --model models/llama3_2_3b_instruct_q40/dllama_model_llama3_2_3b_instruct_q40.m \
      --tokenizer models/llama3_2_3b_instruct_q40/dllama_tokenizer_llama3_2_3b_instruct_q40.t \
      --buffer-float-type q80 \
      --nthreads 4 \
      --max-seq-len 4096 \
      --workers 10.0.0.2:9999 10.0.0.3:9999 10.0.0.4:9999
    ```

    Now you can connect to the API server from your computer:

    ```
    http://raspberrypi1.local:9999/v1/models
    ```

11. When the API server is running, you can open the web chat in your browser: open [llama-ui.js.org](https://llama-ui.js.org/), go to the settings and set the base URL to `http://raspberrypi1.local:9999`. Press the "save" button and start chatting!