This guide explains the steps to convert models published on HuggingFace to GGUF format.
The GGUF format makes it possible to run models efficiently on our own computers, using tools like Llama.cpp or Ollama.
Download Llama.cpp
To convert to GGUF format locally on our machine, we first need to download llama.cpp to be able to use its conversion tools.
# Clone the llama.cpp repository
git clone https://github.com/ggml-org/llama.cpp.git
Prepare the environment
We will use uv to prepare the environment and install the necessary dependencies. If you don't have uv installed yet, install it first.
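One option is the official standalone installer (see the uv documentation for alternatives such as pip or Homebrew):
# Install uv using the official installer script
curl -LsSf https://astral.sh/uv/install.sh | sh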
# Create a virtual environment
cd llama.cpp
uv venv
# If you encounter issues, try specifying the Python version
# uv venv --python 3.11
# Activate the environment
source .venv/bin/activate
Install dependencies
In this case, we don't need to build the full llama.cpp to convert models; we only need the Python dependencies for the conversion script:
uv pip install -r requirements/requirements-convert_hf_to_gguf.txt
Convert to GGUF format
If we pass the identifier of the model on HuggingFace (<user_id>/<repo_id>) and specify the desired output type, the conversion script will create our gguf file. In this case, we will convert the HiTZ/Latxa-Qwen3-VL-4B-Instruct model (the 4B version), with the f16 and q8_0 output types:
# F16
uv run convert_hf_to_gguf.py HiTZ/Latxa-Qwen3-VL-4B-Instruct --remote --outtype f16
# Q8_0
uv run convert_hf_to_gguf.py HiTZ/Latxa-Qwen3-VL-4B-Instruct --remote --outtype q8_0
These commands will create the gguf files (e.g. HiTZ-Latxa-Qwen3-VL-4B-Instruct-q8_0.gguf).
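We can quickly verify the result and check the file sizes:
# List the generated GGUF files
ls -lh *.gguf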
mmproj image processor for multimodal models (VL)
To be able to pass images to VL type multimodal models (e.g. Latxa VL models), we will also need the image encoder.
These are usually files named mmproj, which are often shared alongside the model weights. We can download and use the original one (in this case, the one from the Qwen model).
However, we can also create this file ourselves, using the same conversion script and passing the --mmproj parameter.
# Create mmproj file by adding `--mmproj`
uv run convert_hf_to_gguf.py HiTZ/Latxa-Qwen3-VL-2B-Instruct --remote --outfile mmproj-HiTZ-Latxa-Qwen3-VL-2B-Instruct-q8_0.gguf --outtype q8_0 --mmproj
Obtaining other quantizations
If you want to obtain other quantizations, you will need to follow the Llama.cpp installation steps and use the llama-quantize command.
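A minimal CPU-only build with CMake, following the standard llama.cpp build instructions (the resulting binaries, including llama-quantize, end up under build/bin/, so adjust the path in the command below if needed):
# Build llama.cpp so that the llama-quantize binary is available
cmake -B build
cmake --build build --config Release -j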
# Quantize as Q4_K_M
./llama-quantize ggml-model-f16.gguf ggml-model-Q4_K_M.gguf Q4_K_M
For more information, read the llama-quantize guide.
Usage with llama.cpp
Once our gguf files are created, we can run the models locally on our computer using llama.cpp.
For this, unlike the conversion step, we will indeed need to install llama.cpp.
# Run llama-cli with our model
llama-cli -m Orai-Kimu-9B-GGUF-Q4_0.gguf
This command will allow us to do quick tests right in the terminal.
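We can also run a single, non-interactive prompt (a quick sketch; the model path and prompt are just placeholders):
# One-shot generation: -p sets the prompt, -n caps the number of generated tokens
llama-cli -m HiTZ-Latxa-Qwen3-VL-4B-Instruct-q8_0.gguf -p "Kaixo! Zer moduz?" -n 128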
I prefer, however, to run a server compatible with the OpenAI API, complete with a web interface!
llama-server \
--host localhost \
--port 8000 \
--model Gemma-2b-GGUF-Q4_0.gguf
If everything works well, go to http://localhost:8000 and you will see a neat chat web interface up and running.
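Since llama-server exposes an OpenAI-compatible API, we can also query it from the terminal with curl (a minimal sketch, assuming the server above is running on port 8000):
# Send a chat request to the OpenAI-compatible endpoint
# (no model field is needed: the server answers with the model it was started with)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Kaixo! Zer moduz?"}]}'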
If the GGUF files are already published on HuggingFace, we can run them directly from there, without downloading them manually, by using the -hf parameter:
llama-server \
--host localhost \
--port 8000 \
-hf unsloth/Qwen3-VL-8B-Instruct-GGUF:UD-Q4_K_XL
Multimodal inference
As mentioned before, the distinguishing feature of multimodal (VL) models is that they can process images. For this, specifying the mmproj file is essential:
llama-server --host localhost --port 8000 \
--model models/Qwen3VL/Qwen3-VL-8B-Instruct-UD-Q4_K_XL.gguf \
--mmproj models/Qwen3VL/mmproj-F16.gguf \
--ctx-size 4096 \
--temp 0.7 \
--flash-attn on \
--jinja \
--n-gpu-layers 99 \
--top-k 20 \
--top-p 0.8 \
--min-p 0.0 \
--presence-penalty 1.5
To learn what the rest of the parameters mean, check the help for the llama-server command (llama-server --help).
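For a quick multimodal test without the web UI, we can send an image through the OpenAI-compatible endpoint by embedding it as a base64 data URI (a sketch, assuming the server above is running; photo.jpg is a placeholder for any local image):
# Encode a local image as base64 (without line breaks)
IMG_B64=$(base64 < photo.jpg | tr -d '\n')

# Ask the model to describe the image via the chat completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,${IMG_B64}"}}
      ]
    }
  ]
}
EOF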
Usage with Ollama
If we want to use our GGUF files with Ollama, create a Modelfile with the following line:
FROM /path/to/file.gguf
And then, just:
ollama create mymodel
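Once created, we can chat with it like any other Ollama model:
# Chat with the newly created model
ollama run mymodel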
For more information, read the documentation on the Ollama website.
Upload GGUF files to our HF repository
Finally, if we want to make these files available, we can upload them to our HuggingFace repository using the hf command.
hf upload itzune/Latxa-Qwen3-VL-4B-GGUF HiTZ-Latxa-Qwen3-VL-4B-Instruct-q8_0.gguf
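If the command complains about authentication, we first need to log in with a HuggingFace access token:
# Log in to HuggingFace before uploading (only needed once)
hf auth login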
