
Ollama and Hugging Face


Run Large Language Models Locally with Ollama and Hugging Face

Running large language models locally has never been easier, thanks to tools like Ollama. This post provides a step-by-step guide to install, configure, and use Ollama to run models directly from Hugging Face repositories on your local machine, even with commodity hardware.

Why Use Ollama?

Ollama simplifies running large language models by using the GGUF format (the llama.cpp file format for storing quantized model weights). Instead of requiring high-precision weights in FP32 or FP16, Ollama can run models quantized to 4-bit or 8-bit precision, making them usable on modest hardware setups.
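As a rough, back-of-the-envelope illustration of why this matters: a 1-billion-parameter model stored in FP16 takes about 2 bytes per weight, or roughly 2 GB, while the same model quantized to 4 bits needs around 0.5 GB, plus some extra memory for the context and runtime overhead.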

Recently, Ollama introduced a significant feature: direct integration with Hugging Face repositories, eliminating the need to upload models to Ollama’s registry first. This reduces dependency on hosted registries and opens up access to over 45,000 GGUF models on the Hugging Face Hub.


Prerequisites

  1. A machine with:

    • Windows, macOS, or Linux.
    • Basic terminal skills (you will run a few shell commands).
    • For GPU acceleration, compatible hardware (e.g., an NVIDIA RTX A6000 with 48 GB of VRAM, as used in this guide).
  2. Ollama installed (installation steps are provided below).

  3. Internet access to download models from Hugging Face.


Installation and Setup

Step 1: Install or Upgrade Ollama

If you don’t already have Ollama installed, download it from their official website. Use the appropriate installation method for your OS:

  • Windows: Download the executable.
  • macOS: Use the provided installer.
  • Linux: Run the official install script:
curl -fsSL https://ollama.com/install.sh | sh

If Ollama is already installed, running the same installer again upgrades it to the latest version.

Step 2: Configure Ollama

After installation, you may need a few extra steps (these apply to manual Linux installs; the install script above handles them automatically):

  1. Move the binary to a system path (Linux example):

     sudo mv ollama /usr/local/bin
  2. Reload the systemd daemon:

     sudo systemctl daemon-reload
  3. Restart the Ollama service:

     sudo systemctl restart ollama
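
On systemd-based Linux systems, a quick status check confirms the service came back up (standard systemctl usage; not needed on Windows or macOS, where Ollama runs as a desktop app):

systemctl status ollama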

To verify the installation, check the version:

ollama --version

Ensure it reflects the latest version.


Running Models from Hugging Face

Ollama now allows you to run models directly from Hugging Face repositories. Here’s how:

Step 1: Run a Model

Use the following command to run a GGUF model straight from Hugging Face:

ollama run hf.co/<hugging_face_repo_path>:latest

Replace <hugging_face_repo_path> with the <username>/<repository> part of the model’s Hugging Face URL. For example, using a community GGUF build of Llama 3.2 1B:

ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:latest

This command downloads the model, verifies it, and sets it up automatically.

If you want to simply download the model without running it, replace run with pull:

ollama pull hf.co/<hugging_face_repo_path>:latest

Step 2: Choose a Specific Quantization (e.g., Q8, Q4)

Quantized versions of a model offer different trade-offs between output quality and memory usage. To pick a particular quantization, append its tag to the model name; the tag corresponds to the quantization suffix of the .gguf file in the repository (such as Q8_0 or Q4_K_M):

ollama run hf.co/<hugging_face_repo_path>:Q8_0

For example:

ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0

Behind the Scenes

Ollama downloads the GGUF model file from Hugging Face and stores it as a blob named by its SHA-256 hash for internal use. While this makes the file harder to reuse directly with other tools, it keeps local model management simple. You can list available models with:

ollama list
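
If you are curious where those hashed files live, Ollama’s model store is commonly found under ~/.ollama/models (a Linux system-service install may keep it under the ollama service user’s home directory instead), so you can peek at the blobs directly:

ls ~/.ollama/models/blobs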

Exploring Model Options

With access to Hugging Face’s vast library, you can experiment with numerous models in GGUF format. Visit Hugging Face and:

  1. Browse models (filtering by the GGUF library tag narrows the list).
  2. Check the available quantizations under the Files and versions tab (or query the Hub API, as shown below).
  3. Copy the repository path for use with Ollama.
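
If you prefer to stay in the terminal, the Hub’s public REST API can list a repository’s files, which is a quick way to see which quantizations exist before pulling (the repository below is just the example used earlier):

curl -s https://huggingface.co/api/models/bartowski/Llama-3.2-1B-Instruct-GGUF/tree/main | grep -o '"path":"[^"]*"'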

Example Workflow

Here’s an example of running the Llama 3.2 1B model with Q8_0 quantization:

  1. Download and prepare the model:

     ollama pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0
  2. Run the model:

     ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0
  3. Interact with the model at the prompt; it responds with something like:

     Hello, how can I assist you today?
  4. Verify the model appears in your local list:

     ollama list
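
Beyond the interactive prompt, the same model can be driven from scripts: ollama run accepts a prompt as a command-line argument, and the Ollama server exposes a local REST API on port 11434. A minimal sketch, using the model pulled above:

ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0 "Explain quantization in one sentence."
curl http://localhost:11434/api/generate -d '{"model": "hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0", "prompt": "Explain quantization in one sentence.", "stream": false}'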

Tips and Tricks

  • Use lightweight quantizations (Q4, Q8) for reduced memory consumption.
  • Regularly update Ollama to ensure compatibility with new features.
  • Explore Hugging Face for innovative models and applications.

Conclusion

Ollama’s integration with Hugging Face marks a significant leap in accessibility for running AI models locally. Whether you’re experimenting with lightweight models or deploying sophisticated systems, Ollama’s simplicity and flexibility make it a go-to solution for developers. Try it out today and unlock the potential of local AI model deployment!
