# How to Run Distributed Llama on 💻 Linux, MacOS or Windows
This article describes how to run Distributed Llama on 4 devices, but you can also run it on 1, 2, 4, 8... devices. Please adjust the commands and topology according to your configuration.
```
[🔀 SWITCH OR ROUTER]
      | | | |
      | | | |_______ 🔸 device1 (ROOT)     10.0.0.1
      | | |_________ 🔹 device2 (WORKER 1) 10.0.0.2:9999
      | |___________ 🔹 device3 (WORKER 2) 10.0.0.3:9999
      |_____________ 🔹 device4 (WORKER 3) 10.0.0.4:9999
```
- Install Git and a C++ compiler on 🔸🔹 ALL devices:

  Linux:
  ```sh
  sudo apt install git build-essential
  ```
  MacOS:
  ```sh
  brew install git
  ```
  Windows (install Git and MinGW via Chocolatey):
  ```sh
  choco install git mingw
  ```
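  To verify the toolchain, you can check that both tools respond on every device (the compiler command here is an assumption: `g++` from build-essential or MinGW, or the Clang-backed `g++` shim on MacOS):
  ```sh
  # Each command should print a version string on every device.
  git --version
  g++ --version
  ```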
- Connect 🔸🔹 ALL devices to your 🔀 SWITCH OR ROUTER via Ethernet cable. If you're using only two devices, it's better to connect them directly without a switch.
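  Once cabled, you can verify that the 🔸 ROOT device can reach every 🔹 WORKER (the IP addresses below match the example topology; adjust them to your network):
  ```sh
  # Run on the root device; each worker should reply.
  # On Windows use `ping -n 1` instead of `-c 1`.
  ping -c 1 10.0.0.2
  ping -c 1 10.0.0.3
  ping -c 1 10.0.0.4
  ```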
- Clone this repository and compile Distributed Llama on 🔸🔹 ALL devices:
  ```sh
  git clone https://github.com/b4rtaz/distributed-llama.git
  cd distributed-llama
  make dllama
  make dllama-api
  ```
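  After a successful build, the `dllama` and `dllama-api` binaries used in the next steps should appear in the repository root:
  ```sh
  ls -l dllama dllama-api
  ```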
- Download the model to the 🔸 ROOT device using the `launch.py` script. You don't need to download the model on the worker devices.
  ```sh
  python3 launch.py                          # Prints a list of available models
  python3 launch.py llama3_2_3b_instruct_q40 # Downloads the model to the root device
  ```
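  The download should produce the model and tokenizer files referenced in the commands below; you can confirm they exist:
  ```sh
  ls models/llama3_2_3b_instruct_q40/
  # Expected (among others):
  #   dllama_model_llama3_2_3b_instruct_q40.m
  #   dllama_tokenizer_llama3_2_3b_instruct_q40.t
  ```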
- Start workers on all 🔹 WORKER devices:
  ```sh
  ./dllama worker --port 9999 --nthreads 4
  ```
- Run the inference on the 🔸 ROOT device to test if everything works fine:
  ```sh
  ./dllama inference \
    --prompt "Hello world" \
    --steps 32 \
    --model models/llama3_2_3b_instruct_q40/dllama_model_llama3_2_3b_instruct_q40.m \
    --tokenizer models/llama3_2_3b_instruct_q40/dllama_tokenizer_llama3_2_3b_instruct_q40.t \
    --buffer-float-type q80 \
    --nthreads 4 \
    --max-seq-len 4096 \
    --workers 10.0.0.2:9999 10.0.0.3:9999 10.0.0.4:9999
  ```
- To run the API server, start it on the 🔸 ROOT device:
  ```sh
  ./dllama-api \
    --port 9999 \
    --model models/llama3_2_3b_instruct_q40/dllama_model_llama3_2_3b_instruct_q40.m \
    --tokenizer models/llama3_2_3b_instruct_q40/dllama_tokenizer_llama3_2_3b_instruct_q40.t \
    --buffer-float-type q80 \
    --nthreads 4 \
    --max-seq-len 4096 \
    --workers 10.0.0.2:9999 10.0.0.3:9999 10.0.0.4:9999
  ```
  Now you can connect to the API server:
  ```
  http://10.0.0.1:9999/v1/models
  ```
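  A quick way to check the server from any machine on the network is to query that endpoint with `curl`. The chat request below is a sketch that assumes the server follows the OpenAI-style `/v1/chat/completions` schema, so verify the exact fields against your build:
  ```sh
  # List the loaded model.
  curl http://10.0.0.1:9999/v1/models

  # Minimal chat request (assumed OpenAI-compatible schema).
  curl http://10.0.0.1:9999/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llama3_2_3b_instruct_q40", "messages": [{"role": "user", "content": "Hello world"}], "max_tokens": 32}'
  ```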
- When the API server is running, you can open the web chat in your browser: open llama-ui.js.org, go to the settings and set the base URL to `http://10.0.0.1:9999`. Press the "save" button and start chatting!