export.py L76: Saving quantized model to auto_awq format crash with AssertionError #388

fbaldassarri · 2024-12-16T22:52:03Z

While packing the quantized safetensors in AWQ format, the process crashes with an AssertionError.

Python Script for AWQ format quantization:

from auto_round import AutoRound
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit asymmetric quantization with group size 128, tuned on CPU without AMP
bits, group_size, sym, device, amp = 4, 128, False, 'cpu', False
autoround = AutoRound(
    model, tokenizer, nsamples=128, iters=200, seqlen=512, batch_size=4,
    bits=bits, group_size=group_size, sym=sym, device=device, amp=amp,
)

autoround.quantize()
output_dir = "./AutoRound/HuggingFaceTB_SmolLM2-135M-Instruct-auto_awq-int4-gs128-asym"
# The crash happens here, while packing layers into the auto_awq format
autoround.save_quantized(output_dir, format='auto_awq', inplace=True)

Output:

config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 861/861 [00:00<00:00, 4.04MB/s]
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 269M/269M [00:06<00:00, 42.5MB/s]
generation_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 132/132 [00:00<00:00, 646kB/s]
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.76k/3.76k [00:00<00:00, 18.0MB/s]
vocab.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 801k/801k [00:00<00:00, 2.66MB/s]
merges.txt: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 466k/466k [00:00<00:00, 2.31MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.10M/2.10M [00:00<00:00, 5.11MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 655/655 [00:00<00:00, 3.17MB/s]
2024-12-16 21:07:57 INFO autoround.py L230: using torch.float32 for quantization tuning                                                                                                                                                                 
2024-12-16 21:07:57 INFO autoround.py L300: start to cache block inputs                                                                                                                                                                                 
2024-12-16 21:07:57,475 INFO config.py L54: PyTorch version 2.5.1 cpu available.                                                                                                                                                                        
README.md: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 373/373 [00:00<00:00, 6.14MB/s]
dataset_infos.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 921/921 [00:00<00:00, 4.59MB/s]
2024-12-16 21:08:00 INFO autoround.py L305: caching done                                                                                                                                                                                                
Quantizing model.layers.0:   0%|                                                                                                                                                                                                 | 0/30 [00:00<?, ?it/s]
2024-12-16 21:08:44 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 0.042908 -> iter 165: 0.016126                                                                                                                             
Quantizing model.layers.1:   3%|██████▏                                                                                                                                                                                  | 1/30 [00:46<22:26, 46.44s/it]
2024-12-16 21:09:32 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 0.057766 -> iter 180: 0.020574                                                                                                                             
Quantizing model.layers.2:   7%|████████████▎                                                                                                                                                                            | 2/30 [01:34<22:05, 47.34s/it]
2024-12-16 21:10:20 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 0.117392 -> iter 179: 0.046410                                                                                                                             
Quantizing model.layers.3:  10%|██████████████████▌                                                                                                                                                                      | 3/30 [02:22<21:22, 47.49s/it]
2024-12-16 21:11:07 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 0.112092 -> iter 82: 0.055568                                                                                                                              
Quantizing model.layers.4:  13%|████████████████████████▋                                                                                                                                                                | 4/30 [03:09<20:29, 47.28s/it]
2024-12-16 21:11:55 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 0.133687 -> iter 163: 0.079176                                                                                                                             
Quantizing model.layers.5:  17%|██████████████████████████████▊                                                                                                                                                          | 5/30 [03:56<19:47, 47.50s/it]
2024-12-16 21:12:42 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 0.169473 -> iter 90: 0.094575                                                                                                                              
Quantizing model.layers.6:  20%|█████████████████████████████████████                                                                                                                                                    | 6/30 [04:43<18:55, 47.31s/it]
2024-12-16 21:13:30 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 0.227091 -> iter 111: 0.109237                                                                                                                             
Quantizing model.layers.7:  23%|███████████████████████████████████████████▏                                                                                                                                             | 7/30 [05:31<18:11, 47.45s/it]
2024-12-16 21:14:17 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 0.241385 -> iter 176: 0.128480                                                                                                                             
Quantizing model.layers.8:  27%|█████████████████████████████████████████████████▎                                                                                                                                       | 8/30 [06:19<17:24, 47.47s/it]
2024-12-16 21:15:04 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 0.309878 -> iter 180: 0.155084                                                                                                                             
Quantizing model.layers.9:  30%|███████████████████████████████████████████████████████▌                                                                                                                                 | 9/30 [07:06<16:32, 47.28s/it]
2024-12-16 21:15:51 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 0.352337 -> iter 79: 0.194922                                                                                                                              
Quantizing model.layers.10:  33%|█████████████████████████████████████████████████████████████                                                                                                                          | 10/30 [07:53<15:44, 47.22s/it]
2024-12-16 21:16:38 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 0.396558 -> iter 137: 0.217531                                                                                                                             
Quantizing model.layers.11:  37%|███████████████████████████████████████████████████████████████████                                                                                                                    | 11/30 [08:40<14:55, 47.15s/it]
2024-12-16 21:17:25 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 2.827387 -> iter 10: 1.987019                                                                                                                              
Quantizing model.layers.12:  40%|█████████████████████████████████████████████████████████████████████████▏                                                                                                             | 12/30 [09:27<14:08, 47.14s/it]
2024-12-16 21:18:12 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 2.524235 -> iter 153: 1.815139                                                                                                                             
Quantizing model.layers.13:  43%|███████████████████████████████████████████████████████████████████████████████▎                                                                                                       | 13/30 [10:14<13:21, 47.17s/it]
2024-12-16 21:18:59 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 12.081470 -> iter 167: 1.833520                                                                                                                            
Quantizing model.layers.14:  47%|█████████████████████████████████████████████████████████████████████████████████████▍                                                                                                 | 14/30 [11:01<12:33, 47.10s/it]
2024-12-16 21:19:47 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 2.778519 -> iter 194: 1.829415                                                                                                                             
Quantizing model.layers.15:  50%|███████████████████████████████████████████████████████████████████████████████████████████▌                                                                                           | 15/30 [11:49<11:50, 47.38s/it]
2024-12-16 21:20:35 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 2.937676 -> iter 191: 1.847591                                                                                                                             
Quantizing model.layers.16:  53%|█████████████████████████████████████████████████████████████████████████████████████████████████▌                                                                                     | 16/30 [12:36<11:01, 47.27s/it]
2024-12-16 21:21:22 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 2.679263 -> iter 189: 1.794979                                                                                                                             
Quantizing model.layers.17:  57%|███████████████████████████████████████████████████████████████████████████████████████████████████████▋                                                                               | 17/30 [13:24<10:16, 47.43s/it]
2024-12-16 21:22:09 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 2.616241 -> iter 105: 1.922610                                                                                                                             
Quantizing model.layers.18:  60%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                                                                         | 18/30 [14:11<09:27, 47.29s/it]
2024-12-16 21:22:57 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 3.128466 -> iter 159: 1.956689                                                                                                                             
Quantizing model.layers.19:  63%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                                                   | 19/30 [14:58<08:41, 47.42s/it]
2024-12-16 21:23:44 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 3.316611 -> iter 173: 1.978719                                                                                                                             
Quantizing model.layers.20:  67%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                             | 20/30 [15:45<07:52, 47.29s/it]
2024-12-16 21:24:31 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 3.341628 -> iter 145: 2.362982                                                                                                                             
Quantizing model.layers.21:  70%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                       | 21/30 [16:32<07:04, 47.19s/it]
2024-12-16 21:25:18 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 3.499481 -> iter 193: 2.600739
Quantizing model.layers.22:  73%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                                | 22/30 [17:19<06:17, 47.14s/it]
2024-12-16 21:26:05 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 4.670743 -> iter 196: 2.837814
Quantizing model.layers.23:  77%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                                          | 23/30 [18:06<05:29, 47.10s/it]
2024-12-16 21:26:52 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 6.239635 -> iter 177: 3.741080
Quantizing model.layers.24:  80%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                    | 24/30 [18:53<04:42, 47.08s/it]
2024-12-16 21:27:39 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 6.106061 -> iter 193: 4.236810
Quantizing model.layers.25:  83%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                              | 25/30 [19:40<03:55, 47.04s/it]
2024-12-16 21:28:26 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 6.275647 -> iter 60: 5.022381
Quantizing model.layers.26:  87%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                        | 26/30 [20:27<03:08, 47.04s/it]
2024-12-16 21:29:13 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 8.735580 -> iter 131: 5.655555
Quantizing model.layers.27:  90%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                  | 27/30 [21:14<02:21, 47.04s/it]
2024-12-16 21:30:00 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 11.789701 -> iter 113: 7.237806
Quantizing model.layers.28:  93%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊            | 28/30 [22:02<01:34, 47.06s/it]
2024-12-16 21:30:47 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 14.128303 -> iter 41: 12.078333
Quantizing model.layers.29:  97%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉      | 29/30 [22:49<00:47, 47.07s/it]
2024-12-16 21:31:34 INFO autoround.py L1139: quantized 7/7 layers in the block, loss iter 0: 44.861595 -> iter 122: 19.207760
Quantizing model.layers.29: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [23:36<00:00, 47.04s/it]
2024-12-16 21:31:36 INFO autoround.py L340: quantization tuning time 1419.5446853637695
2024-12-16 21:31:36 INFO autoround.py L356: Summary: quantized 210/211 in the model,  ['lm_head'] have not been quantized
Quantizing model.layers.29: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [23:36<00:00, 47.21s/it]
2024-12-16 21:31:36 INFO export.py L76: Saving quantized model to auto_awq format
packing model.layers.0.self_attn.q_proj:   0%|                                                                                                                                                                                        | 0/211 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/fbald_000/GenAI/quant_scripts/HuggingFaceTB_SmolLM2-135M-Instruct-auto_awq-int4-gs128-asym.py", line 21, in <module>
    autoround.save_quantized(output_dir, format='auto_awq', inplace=True)
  File "/home/fbald_000/GenAI/auto-round-0.4.3/auto_round/autoround.py", line 1320, in save_quantized
    compressed_model = save_quantized_as_format(  ##TODO refine the code
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fbald_000/GenAI/auto-round-0.4.3/auto_round/export/__init__.py", line 50, in _save_quantized_as_autoawq
    return save_quantized_as_autoawq(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fbald_000/GenAI/auto-round-0.4.3/auto_round/export/export_to_awq/export.py", line 95, in save_quantized_as_autoawq
    for _ in executor.map(wrapper, names):
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fbald_000/anaconda3/envs/auto-round-0.4.3-cpu/lib/python3.12/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fbald_000/anaconda3/envs/auto-round-0.4.3-cpu/lib/python3.12/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/fbald_000/anaconda3/envs/auto-round-0.4.3-cpu/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/fbald_000/anaconda3/envs/auto-round-0.4.3-cpu/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/fbald_000/anaconda3/envs/auto-round-0.4.3-cpu/lib/python3.12/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fbald_000/GenAI/auto-round-0.4.3/auto_round/export/export_to_awq/export.py", line 93, in wrapper
    pack_layer(name, compressed_model, layer_config, backend, pbar)
  File "/home/fbald_000/GenAI/auto-round-0.4.3/auto_round/export/export_to_awq/export.py", line 53, in pack_layer
    q_linear = WQLinear_GEMM.from_linear(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fbald_000/GenAI/auto-round-0.4.3/auto_round/export/export_to_awq/utils.py", line 203, in from_linear
    awq_linear = cls(
                 ^^^^
  File "/home/fbald_000/GenAI/auto-round-0.4.3/auto_round/export/export_to_awq/utils.py", line 160, in __init__
    assert self.in_features % self.group_size == 0
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

As a result, the final safetensors file(s) are not saved; only the tokenizer files end up in the output directory:

$ ls -las ./Autoround/HuggingFaceTB_SmolLM2-135M-Instruct-auto_awq-int4-gs128-asym
total 4700
   4 drwxrwxr-x 2 fbald_000 fbald_000    4096 Dec 16 21:31 .
   4 drwxrwxr-x 3 fbald_000 fbald_000    4096 Dec 16 21:31 ..
 456 -rw-rw-r-- 1 fbald_000 fbald_000  466391 Dec 16 21:31 merges.txt
   4 -rw-rw-r-- 1 fbald_000 fbald_000     655 Dec 16 21:31 special_tokens_map.json
   4 -rw-rw-r-- 1 fbald_000 fbald_000    3794 Dec 16 21:31 tokenizer_config.json
3444 -rw-rw-r-- 1 fbald_000 fbald_000 3522656 Dec 16 21:31 tokenizer.json
 784 -rw-rw-r-- 1 fbald_000 fbald_000  800662 Dec 16 21:31 vocab.json

From what I can see, you already plan to revise/optimize the export.py code. I'm just reporting the issue as it occurs in v0.4.3.


wenhuach21 · 2024-12-17T01:51:09Z

The primary issue is that AWQ's GEMM kernel does not support this group size for some layers in this model: their in_features is not divisible by the group size (128 here), so the following assertion fails:

assert self.in_features % self.group_size == 0

We encountered a similar problem with GPTQ for Falcon. You might try changing the group size to 64 or 32 as a workaround.
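To see why 64 or 32 might work where 128 does not, here is a minimal sketch that filters candidate group sizes by the divisibility rule from the assertion. The helper name `compatible_group_sizes` is hypothetical and not part of auto-round, and the input widths (576 and 1536) are an assumption taken from the model's published config; check them against your own `config.json`.

```python
def compatible_group_sizes(in_features_list, candidates=(128, 64, 32)):
    """Return the candidate group sizes that evenly divide every in_features value.

    This mirrors the check `in_features % group_size == 0` from the traceback.
    """
    return [g for g in candidates if all(n % g == 0 for n in in_features_list)]


# Assumed input widths of the model's linear layers (hidden size 576,
# MLP intermediate size 1536). These are assumptions, not values from the log.
in_features_list = [576, 1536]

print(compatible_group_sizes(in_features_list))  # [64, 32]
```

Under these assumptions, 576 % 128 leaves a remainder of 64, which is why group size 128 trips the assertion while 64 and 32 pass.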

On our end, we could implement a pre-quantization check to log a warning and exclude this layer from quantization if the group size is unsupported.
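A rough sketch of what such a pre-quantization check could look like is shown below. It is not auto-round's implementation: the function `split_by_group_size` and the `FakeLinear` stand-in are hypothetical names used for illustration, and the layer widths in the demo are assumed. With a real model, the `(name, module)` pairs could come from `model.named_modules()`.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("group_size_check")


@dataclass
class FakeLinear:
    """Hypothetical stand-in for torch.nn.Linear; only in_features is needed."""
    in_features: int


def split_by_group_size(named_layers, group_size):
    """Partition layer names into (supported, unsupported) for a group size.

    A layer is supported when in_features is divisible by group_size,
    the same condition the AWQ packing code asserts.
    """
    supported, unsupported = [], []
    for name, layer in named_layers:
        in_features = getattr(layer, "in_features", None)
        if in_features is None:
            continue  # not a linear-like layer, nothing to check
        if in_features % group_size == 0:
            supported.append(name)
        else:
            logger.warning(
                "skipping %s: in_features=%d is not divisible by group_size=%d",
                name, in_features, group_size,
            )
            unsupported.append(name)
    return supported, unsupported


# Demo with assumed widths; replace with model.named_modules() in practice.
layers = [
    ("model.layers.0.self_attn.q_proj", FakeLinear(576)),
    ("model.layers.0.mlp.down_proj", FakeLinear(1536)),
]
supported, unsupported = split_by_group_size(layers, group_size=128)
print(supported)    # ['model.layers.0.mlp.down_proj']
print(unsupported)  # ['model.layers.0.self_attn.q_proj']
```

Running the check before tuning would surface the problem in seconds rather than after the full quantization pass (about 23 minutes in the log above).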
