Run Large Language Models Locally with Ollama and Hugging Face
Running large language models locally has never been easier, thanks to tools like Ollama. This post provides a step-by-step guide to install, configure, and use Ollama to run models directly from Hugging Face repositories on your local machine, even with commodity hardware.
Why Use Ollama?
Ollama simplifies running large language models by using the GGUF format, a file format designed for quantized model weights. Instead of requiring high-precision formats like FP32 or FP16, Ollama can run models quantized to 4-bit or 8-bit precision, making them usable on modest hardware. For a rough sense of the savings, a 7B-parameter model that needs about 14 GB of memory in FP16 fits in roughly 4 GB at 4-bit precision.
Recently, Ollama introduced a significant feature: direct integration with Hugging Face repositories, eliminating the need to first publish models to Ollama’s own registry. This reduces dependency on a hosted registry and opens up access to more than 45,000 GGUF models hosted on Hugging Face.
Prerequisites
- A machine running Windows, macOS, or Linux.
- Basic familiarity with the terminal.
- For GPU acceleration, compatible hardware (e.g., an NVIDIA RTX A6000 with 48 GB of VRAM, as used in this guide); a quick driver check follows this list.
- Ollama installed (installation steps are provided below).
- Internet access to download models from Hugging Face.
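If you plan to use an NVIDIA GPU, a quick way to confirm that the card and driver are visible (assuming the NVIDIA driver is already installed) is:
nvidia-smi
The output lists the detected GPUs and their available VRAM.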
Installation and Setup
Step 1: Install or Upgrade Ollama
If you don’t already have Ollama installed, download it from their official website. Use the appropriate installation method for your OS:
- Windows: Download the executable.
- macOS: Use the provided installer.
- Linux: Run the official install script:
curl -fsSL https://ollama.com/install.sh | sh
For users with Ollama already installed, the same command will upgrade you to the latest version.
Step 2: Configure Ollama
The install script normally takes care of setup for you. If you installed Ollama manually from a standalone binary instead, you need to:
- Move the binary to a directory on your PATH (Linux example):
sudo mv ollama /usr/local/bin
- Reload the systemd daemon:
sudo systemctl daemon-reload
- Restart the Ollama service:
sudo systemctl restart ollama
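To confirm the service came back up (assuming the systemd unit is named ollama, as above), check its status:
sudo systemctl status ollama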
To verify the installation, check the version:
ollama --version
Ensure it reflects the latest version.
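If everything is in place, the command prints a single line of the form below (the exact version number will vary):
ollama version is x.y.z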
Running Models from Hugging Face
Ollama now allows you to run models directly from Hugging Face repositories. Here’s how:
Step 1: Run a Model
Use the following command to run a model from Hugging Face:
ollama run hf.co/<hugging_face_repo_path>:latest
Replace <hugging_face_repo_path> with the username/repository path of the model you want (the repository must provide GGUF files). For example:
ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:latest
This command:
- Downloads the model.
- Automatically verifies it and sets it up.
If you only want to download the model without starting an interactive session, replace run with pull:
ollama pull hf.co/<hugging_face_repo_path>:latest
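You can also pass a prompt directly on the command line for a one-off response instead of an interactive chat. A minimal sketch, reusing the example repository from above:
ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:latest "Explain quantization in one sentence."
The model answers once and the command exits.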
Step 2: Choose Specific Quantization (e.g., Q8, Q4)
Quantized versions of a model offer different trade-offs between output quality, speed, and memory usage. To request a particular quantization, append its tag to the model path:
ollama run hf.co/<hugging_face_repo_path>:q8_0
For example:
ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:q8_0
Behind the Scenes
Ollama downloads the GGUF model file from Hugging Face and stores it as a blob named by its SHA-256 digest in the local model store. The file on disk is therefore not directly reusable by name, but this content-addressed layout lets Ollama deduplicate and verify downloads. You can list available models with:
ollama list
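The output is a small table; the values below are purely illustrative:
NAME                                                ID              SIZE      MODIFIED
hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:latest   0123456789ab    1.3 GB    2 minutes ago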
Exploring Model Options
With access to Hugging Face’s vast library, you can experiment with the many models published in GGUF format. Visit Hugging Face and:
- Browse models that provide GGUF files.
- Check the available quantizations in the repository’s Files and versions tab.
- Copy the repository path (username/repository) for use with Ollama; an example of mapping a file name to a tag follows this list.
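As an illustration (file names differ between repositories, and the quantization you request must actually be published there), a file listed as Llama-3.2-1B-Instruct-Q4_K_M.gguf corresponds to the :Q4_K_M tag:
ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q4_K_M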
Example Workflow
Here’s an example of running the Llama 3.2 1B model with 8-bit quantization:
Download and prepare the model:
ollama pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:q8_0
Run the model:
ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:q8_0
Interact with the model:
Hello, how can I assist you today?
Verify the model:
ollama list
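When you are done experimenting, you can remove the downloaded weights to reclaim disk space (use the exact name shown by ollama list):
ollama rm hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:q8_0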
Tips and Tricks
- Use lower-bit quantizations (e.g., Q4 rather than Q8) to reduce memory consumption, at some cost in output quality.
- Regularly update Ollama to ensure compatibility with new features.
- Explore Hugging Face for innovative models and applications.
Conclusion
Ollama’s integration with Hugging Face marks a significant leap in accessibility for running AI models locally. Whether you’re experimenting with lightweight models or deploying sophisticated systems, Ollama’s simplicity and flexibility make it a go-to solution for developers. Try it out today and unlock the potential of local AI model deployment!