LLM Quantization Demystified: GGUF vs GPTQ vs AWQ
April 28, 2026
Your 7B model holds 7 billion numbers in roughly 14 GB of memory. Here’s exactly how to shrink them, and what you lose in the process.
The fundamental challenge for anyone running LLMs locally is that most consumer GPUs have 8 GB of VRAM or less, while a 7B-parameter model in FP16 needs approximately 14 GB. Quantization solves this by reducing the model’s memory footprint at the expense of some precision. A 14 GB model compressed with INT4 quantization drops to about 4 GB — small enough to run on a consumer GPU.
However, quantization is not one-size-fits-all. There is a confusing ecosystem of formats and algorithms (GGUF, GPTQ, AWQ), quantization levels (Q4_K_M, IQ3_S, Q8_0), and tools (llama.cpp, vLLM, LM Studio, Ollama), each claiming different trade-offs between accuracy and memory. This guide explains what actually happens to model weights during quantization, which method to use for your hardware and use case, and what accuracy you give up in return.
First Principles: What Happens to the Numbers
The Precision Ladder
Every parameter in a neural network is a floating-point value stored in binary. The precision of that number determines how much memory it consumes and how accurately it represents the learned weight:
| Precision | Bits per Weight | 7B Model Size | Relative Quality |
|---|---|---|---|
| FP32 | 32 | ~28 GB | 100% (baseline) |
| FP16 | 16 | ~14 GB | ~99.9% |
| INT8 | 8 | ~7 GB | ~99% |
| INT4 | 4 | ~3.5 GB | ~92–95% |
| INT3 | 3 | ~2.6 GB | ~85–90% |
| INT2 | 2 | ~1.75 GB | ~75–80% |
FP16/BF16 (16 bits) is the standard training and distribution precision — this is what you get when you download “the full model” from HuggingFace. INT8 delivers roughly 2x compression with minimal quality loss. INT4 is the sweet spot most practitioners target: 4x compression, and for most tasks the quality difference is barely noticeable. Below INT4 you start making real trade-offs.
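The sizes in the table are just arithmetic: parameter count times bits per weight, divided by eight to get bytes. A quick sanity check in Python, assuming exactly 7 billion parameters and ignoring small overheads (per-block scales, tensors kept at higher precision):

params = 7_000_000_000
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {params * bits / 8 / 1e9:.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB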
The Core Math
Quantization maps a range of floating-point values to a smaller set of integers:
Quantize: q = round((w − zero_point) / scale)
Dequantize: w ≈ q × scale + zero_point
Think of it as rounding. The number 3.14159 rounded to two decimal places is 3.14 — you lose the trailing digits, but the value is still close enough. Quantization is simply controlled rounding of billions of parameters.
Block quantization divides weights into small blocks (typically 32 to 128 weights) and computes a separate scale and zero_point for each block. This adapts to local weight distributions and significantly improves quality over global quantization.
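As a concrete illustration, here is a minimal NumPy sketch of 4-bit block quantization using the formulas above. The block size, rounding scheme, and names are illustrative, not any particular library’s implementation:

import numpy as np

def quantize_block(w, bits=4):
    """Asymmetric quantization of one block: q = round((w - zero_point) / scale)."""
    levels = 2**bits - 1                      # 16 representable levels at 4-bit
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / levels
    if scale == 0.0:                          # constant block: any scale works
        scale = 1.0
    zero_point = w_min
    q = np.round((w - zero_point) / scale)
    return q, scale, zero_point

def dequantize_block(q, scale, zero_point):
    """Reverse mapping: w ≈ q * scale + zero_point."""
    return q * scale + zero_point

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, size=4096).astype(np.float32)
blocks = weights.reshape(-1, 32)              # block size 32, as in several GGUF quant types
recon = np.concatenate([dequantize_block(*quantize_block(b)) for b in blocks])
print("mean absolute error:", np.abs(weights - recon).mean())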

Why Neural Networks Tolerate Quantization
Neural network weights follow an approximate normal distribution — most weights lie close to zero, and there are only a few outliers with large values. Small weights contribute little to the layer’s computation, so quantization errors there have minimal impact. The problem lies in the outliers: a handful of very large weights are crucially important. Bad quantization of these outlier weights can cascade through the network. This insight separates good quantization methods from naive ones — AWQ, for example, identifies and explicitly protects these critical weights.
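To see why outliers matter, compare quantizing a block that contains one large weight against the same block without it. This is a toy demonstration of the effect, not any specific method:

import numpy as np

rng = np.random.default_rng(42)
small = rng.normal(0, 0.02, size=32)          # a typical block of near-zero weights
with_outlier = small.copy()
with_outlier[0] = 1.0                         # one salient weight, far larger than the rest

def rtn_error(w, bits=4):
    """Mean absolute error of naive round-to-nearest quantization over one block."""
    levels = 2**bits - 1
    scale = (w.max() - w.min()) / levels
    q = np.round((w - w.min()) / scale)
    return np.abs(w - (q * scale + w.min())).mean()

print("error without outlier:", rtn_error(small))
print("error with outlier:   ", rtn_error(with_outlier))
# The outlier stretches the scale, so every small weight in the block is rounded
# much more coarsely -- exactly the error that outlier-aware methods try to avoid.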
A Critical Clarification
GGUF is a file format, not a quantization method. It is a container that stores quantized weights along with metadata, model architecture, and the tokenizer — all in a single file. The actual quantization algorithms inside GGUF are K-quants, I-quants, and legacy quants.
GPTQ and AWQ are quantization algorithms — they describe how to reduce weight precision, not just where to store the result.
Takeaway: When someone says “GGUF Q4_K_M,” they mean a file in GGUF format that uses the K-quant algorithm at approximately 4-bit precision. When someone says “GPTQ 4-bit,” they mean a model quantized using the GPTQ algorithm at 4-bit precision, stored in GPTQ’s own format.
The Big Three: GGUF, GPTQ, and AWQ
GGUF — The Standard for Local Deployment
GGUF was developed by the llama.cpp project and is the successor to the earlier GGML format. It is a self-contained binary format combining quantized weights, architecture information, tokenizer data, and metadata into a single file. Download one file, point your inference engine at it, and run.
GGUF’s greatest strength is CPU + GPU hybrid inference. Unlike GPTQ and AWQ, which require GPU-only processing, GGUF models can split layers between GPU and CPU. Use your GPU to process 30 layers while running the remainder on your CPU — enabling you to run models that don’t fully fit in VRAM.
GGUF supports three generations of quantization algorithms:
Legacy Quants (Q4_0, Q5_0, Q8_0): Basic uniform quantization per block. They remain for backward compatibility; at 4 and 5 bits the K-quants beat them, though Q8_0 is still the standard near-lossless option.
K-Quants (Q3_K_S through Q6_K): Mixed precision — critical layers (early attention, output projection) receive higher precision; less important layers get compressed more aggressively. The letter suffix denotes size: S(mall), M(edium), L(arge). Q4_K_M is the community sweet spot: approximately 92% quality retention at ~4.1 GB for a 7B model.
I-Quants (IQ1_S through IQ4_NL): Importance-based non-uniform quantization — allocate precision to weights based on their relative importance using non-uniform grids. IQ4_XS offers slightly better quality than Q4_K_M at a similar or smaller file size.
Unsloth Dynamic 2.0 develops model-specific quantization strategies — analyzing each model individually to determine which layers need more precision. Unsloth reports that its 3-bit DeepSeek V3.1 GGUF beats many full-precision models on coding benchmarks.
Best for: Local deployment, CPU inference, Apple Silicon, hybrid CPU+GPU configurations, LM Studio, Ollama, llama.cpp.
GPTQ — The GPU Inference Powerhouse
GPTQ (Frantar et al., 2023), named for post-training quantization of GPT-style models, uses second-order information — the Hessian of each layer’s reconstruction error — to determine which weights can be compressed most aggressively. It quantizes the model layer by layer using a calibration dataset of ~128 examples. For each layer, the Hessian indicates which weights the output is most sensitive to; those are quantized carefully, insensitive weights are compressed aggressively, and each rounding error is compensated by adjusting the weights not yet quantized.
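To make the "layer-by-layer with calibration data" part concrete, the sketch below measures the quantity GPTQ optimizes — a layer’s output error on calibration activations — for a naive round-to-nearest baseline. It illustrates the setup only; the real algorithm uses the inverse Hessian to compensate each rounding step, which this toy does not do:

import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_calib = 512, 512, 128
W = rng.normal(0, 0.02, size=(d_out, d_in))   # stand-in for one linear layer's weights
X = rng.normal(0, 1.0, size=(n_calib, d_in))  # calibration activations

def round_to_nearest(w, bits=4):
    """Symmetric per-row round-to-nearest quantization (the naive baseline)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

W_q = round_to_nearest(W)
# GPTQ's per-layer objective: minimize || X @ W.T - X @ W_q.T ||^2 over calibration data
rel_err = np.linalg.norm(X @ (W - W_q).T) / np.linalg.norm(X @ W.T)
print(f"relative output error, naive 4-bit rounding: {rel_err:.3%}")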
Quality-wise, GPTQ achieves approximately 90% quality retention at 4-bit — respectable, though slightly lower than AWQ. Speed-wise: using optimized Marlin kernels, GPTQ reaches ~712 tokens/second vs. 461 tokens/second for FP16. Without Marlin, GPTQ runs at ~276 tokens/second. The kernel matters greatly.
The tradeoff: GPTQ requires calibration data and the conversion process is time-consuming. And GPTQ is GPU-only — no CPU inference path.
Best for: High-speed GPU server environments, production API endpoints, vLLM and TGI deployments where throughput matters.
AWQ — The Quality Champion
AWQ (Activation-Aware Weight Quantization) by Lin et al. (2024) identifies which weights have the greatest impact on the model’s output — not by weight magnitude alone, but by the activation patterns that flow through them. Roughly 1% of weights turn out to be “salient.” Rather than storing them at higher precision (mixed-precision storage is hardware-unfriendly), AWQ scales those channels up before quantization so they lose less accuracy to rounding, while everything else is quantized aggressively in the same low-bit format.
W4A16: 4-bit weights, 16-bit activations. AWQ leaves activations untouched, so all of the quantization error comes from the weights. At 4-bit precision, AWQ retains roughly 95% of original quality — the highest of any 4-bit method covered here. With Marlin kernels: ~741 tokens/second. Without Marlin: ~67 tokens/second. Like GPTQ, AWQ requires calibration data and a CUDA GPU.
Best for: Maximum quality at 4-bit precision, production serving where quality is paramount, GPU-only deployments.
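If you want to try an AWQ checkpoint, vLLM can load one directly. A minimal offline-inference sketch; the model ID is only an example, substitute any AWQ-quantized repo, and newer vLLM versions can usually detect the quantization from the model config without the explicit flag:

from vllm import LLM, SamplingParams  # pip install vllm (CUDA GPU required)

# Example model ID -- substitute the AWQ-quantized checkpoint you actually want.
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain 4-bit quantization in two sentences."], params)
print(outputs[0].outputs[0].text)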
Head-to-Head Comparison
| Criterion | GGUF (Q4_K_M) | GPTQ (4-bit) | AWQ (4-bit) |
|---|---|---|---|
| Quality retention | ~92% | ~90% | ~95% |
| Speed (best GPU kernel) | ~93 tok/s | ~712 tok/s (Marlin) | ~741 tok/s (Marlin) |
| CPU support | Yes | No | No |
| Calibration needed | No | Yes | Yes |
| Ease of use | Highest | Medium | Medium |
| Best served by | llama.cpp, LM Studio, Ollama | vLLM, TGI | vLLM, TGI |
Takeaway: GGUF wins on flexibility. AWQ wins on quality and speed (with proper kernels). GPTQ is the established middle ground for GPU serving. Running locally on consumer hardware: GGUF. Deploying a production API on dedicated GPUs: AWQ with Marlin.
Decoding the GGUF Alphabet
Naming Convention
- Q[bits]_[variant] — Legacy: Q4_0, Q5_0, Q8_0
- Q[bits]_K_[size] — K-quants: Q4_K_M, Q5_K_S, Q6_K
- IQ[bits]_[variant] — Importance-based: IQ4_XS, IQ3_S, IQ2_M
Reference Table (7B/8B models)
| Quant | Avg Bits | 7B Size | Quality | Recommendation |
|---|---|---|---|---|
| Q2_K | 2.6 | ~2.5 GB | Poor | Testing only |
| Q3_K_M | 3.9 | ~3.5 GB | Good− | Tight VRAM |
| Q4_K_M | 4.8 | ~4.1 GB | Good+ | RECOMMENDED DEFAULT |
| IQ4_XS | 4.3 | ~3.9 GB | Good+ | Better Q4 alternative |
| Q5_K_M | 5.7 | ~4.9 GB | Very Good+ | Quality priority |
| Q6_K | 6.6 | ~5.5 GB | Excellent | Near-lossless |
| Q8_0 | 8.5 | ~7.2 GB | Near-perfect | Maximum quality |

The Four Quants Most People Should Consider
Q4_K_M — “The Default.” Start here unless you have a compelling reason not to.
Q5_K_M — “When You Have Space.” Approximately 20% larger than Q4_K_M with a measurable quality improvement, especially on complex reasoning and fact retrieval.
IQ4_XS — “The New Default?” Same or smaller file size than Q4_K_M with better quality preservation due to non-uniform quantization grids. Use it if your inference engine supports it.
Q8_0 — “Nearly Lossless.” When quality is absolute and VRAM allows. Perplexity differences from FP16 are typically under 0.5%.
The KV Cache Trap
VRAM required = Model Size + KV Cache + Overhead
| Context Length | KV Cache |
|---|---|
| 4,096 tokens | ~500 MB – 1 GB |
| 8,192 tokens | ~1 – 2 GB |
| 32,768 tokens | ~4 – 8 GB |
On an 8 GB GPU, a 4.1 GB Q4_K_M model leaves ~3.9 GB for KV cache — enough for 4k context, tight for 8k, impossible for 32k without CPU offloading.
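The table above is easy to reproduce: per token, the cache stores one key and one value vector per layer. The sketch below assumes a grouped-query-attention 7B model (32 layers, 8 KV heads, head dimension 128, FP16 cache); these shapes are assumptions, so adjust them for your model.

# Assumed shapes for a GQA 7B model: 32 layers, 8 KV heads, head_dim 128,
# FP16 cache (2 bytes per value). Models without GQA store many more KV heads.
n_layers, n_kv_heads, head_dim, bytes_per_val = 32, 8, 128, 2
per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val  # K and V
for ctx in (4096, 8192, 32768):
    print(f"{ctx:>6} tokens: {ctx * per_token / 2**30:.2f} GB")
# 4096 tokens: 0.50 GB, 8192 tokens: 1.00 GB, 32768 tokens: 4.00 GB

Those figures match the lower ends of the ranges above; models with more KV heads or a full-precision cache land higher.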
Which Quantization Do You Need?
API serving (high throughput, multiple users): AWQ + Marlin kernel — highest throughput and quality. GPTQ + Marlin is a reasonable alternative.
Interactive/local, 8 GB VRAM: GGUF Q4_K_M for 7B models (prefer IQ4_XS if supported).
12 GB VRAM: GGUF Q5_K_M for 7B, Q4_K_M for 13B.
16 GB VRAM: Q5_K_M for 13B, Q4_K_M for MoE models like Mixtral 8x7B with partial CPU offload (the Q4_K_M file alone is larger than 16 GB).
24 GB VRAM: Q6_K or Q8_0 for 7–13B, Q4_K_M for 30B+.
Apple Silicon or CPU only: GGUF is your only viable option. llama.cpp is heavily optimized for Apple’s unified memory architecture.
Code generation: Be conservative — Q5_K_M or higher. Code is more sensitive to quantization than natural language.
Fact Q&A: Prefer Q5_K_M or higher. Fact recall drops faster than fluency.
Chat/Conversation, Summarization, Creative Writing: Q4_K_M is acceptable.
For those who don’t care: Download the GGUF Q4_K_M version of your model and use LM Studio or Ollama. 90% of users will be fine.
Get Running in 5 Minutes
LM Studio (2 minutes)
Lowest friction. Download LM Studio, find your model in the built-in browser, select GGUF Q4_K_M, click Download. No command line, no configuration, no dependencies.
Ollama (3 minutes)
ollama run qwen2.5-coder:7b-instruct-q4_K_M
Ollama handles the download, configuration, and serving automatically.
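Beyond the CLI, Ollama also serves a local HTTP API (port 11434 by default), so the same model can be called from code. A minimal sketch against its /api/chat route; check the Ollama docs if your version’s payload differs:

import requests

# Assumes `ollama run ...` (or `ollama serve`) is already running locally.
r = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5-coder:7b-instruct-q4_K_M",
        "messages": [{"role": "user", "content": "Write a one-line docstring for a merge sort."}],
        "stream": False,
    },
)
print(r.json()["message"]["content"])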
llama.cpp (5 minutes, maximum control)
./llama-server -m model-Q4_K_M.gguf -c 4096 -ngl 35
The -ngl 35 flag offloads 35 layers to the GPU; -c 4096 sets the context window size. For Python, using the llama-cpp-python bindings:
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="./qwen2.5-coder-7b-instruct-Q4_K_M.gguf",
    n_ctx=4096,          # context window size
    n_gpu_layers=35,     # offload 35 layers to the GPU (0 = CPU only, -1 = all layers)
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
    max_tokens=512,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
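The same llama-server also speaks an OpenAI-compatible chat-completions protocol on its HTTP port (8080 unless you pass --port), so existing OpenAI-style clients can point at it. A quick sketch with requests:

import requests

# Assumes ./llama-server is running as shown above (default port 8080).
r = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What does Q4_K_M mean?"}],
        "max_tokens": 256,
    },
)
print(r.json()["choices"][0]["message"]["content"])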
No API keys, no cloud costs, no data leaving your machine.
Quick-Reference Cheat Sheet
DEFAULT: GGUF Q4_K_M via LM Studio or Ollama
QUALITY: AWQ 4-bit via vLLM (requires GPU)
SPEED: AWQ or GPTQ with Marlin kernel via vLLM
BUDGET: GGUF Q3_K_M or IQ3_S
NEAR-LOSSLESS: GGUF Q8_0
References
- Frantar et al. (2023). “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.” arXiv:2210.17323
- Lin et al. (2024). “AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration.” arXiv:2306.00978
- Dettmers et al. (2023). “QLoRA: Efficient Finetuning of Quantized LLMs.” arXiv:2305.14314
- llama.cpp project. github.com/ggerganov/llama.cpp
- Unsloth Dynamic 2.0 GGUFs. unsloth.ai