Llama Chinese LLM: Model Quantization
The parameters of the Chinese fine-tuned model have been quantized so that it can run with fewer compute resources. A 4-bit compressed version, FlagAlpha/Llama2-Chinese-13b-Chat-4bit, of the 13B Chinese fine-tuned chat model FlagAlpha/Llama2-Chinese-13b-Chat is now available on Hugging Face. It can be used as follows.
Environment setup:
pip install git+https://github.com/PanQiWei/AutoGPTQ.git
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Load the 4-bit GPTQ-quantized model onto the first GPU
model = AutoGPTQForCausalLM.from_quantized(
    'FlagAlpha/Llama2-Chinese-13b-Chat-4bit', device="cuda:0"
)
tokenizer = AutoTokenizer.from_pretrained(
    'FlagAlpha/Llama2-Chinese-13b-Chat-4bit', use_fast=False
)

# Prompt follows the repo's "Human: ...\nAssistant: " chat template
input_ids = tokenizer(
    ['Human: 怎么登上火星\nAssistant: '],
    return_tensors="pt",
    add_special_tokens=False,
).input_ids.to('cuda')

generate_input = {
    "input_ids": input_ids,
    "max_new_tokens": 512,
    "do_sample": True,
    "top_k": 50,
    "top_p": 0.95,
    "temperature": 0.3,
    "repetition_penalty": 1.3,
    "eos_token_id": tokenizer.eos_token_id,
    "bos_token_id": tokenizer.bos_token_id,
    "pad_token_id": tokenizer.pad_token_id,
}
generate_ids = model.generate(**generate_input)
text = tokenizer.decode(generate_ids[0])
print(text)
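The prompt above is hand-built in the "Human: …\nAssistant: " format that this model was fine-tuned on. For multi-turn use it is convenient to wrap that formatting in a small helper; the sketch below is illustrative only (the function name `build_prompt` and the `history` convention are assumptions, not part of the FlagAlpha repo):

```python
def build_prompt(user_message: str, history=None) -> str:
    """Build a prompt in the 'Human: ...\nAssistant: ' chat format.

    `history` is an optional list of (human, assistant) turn pairs from
    earlier in the conversation. NOTE: this helper is a hypothetical
    convenience wrapper, not an API from the FlagAlpha repository.
    """
    parts = []
    for human, assistant in (history or []):
        parts.append(f"Human: {human}\nAssistant: {assistant}\n")
    # The final turn ends with 'Assistant: ' so the model continues from there
    parts.append(f"Human: {user_message}\nAssistant: ")
    return "".join(parts)

# Single-turn prompt, identical to the one used in the example above
print(build_prompt("怎么登上火星"))
```

The resulting string can be passed straight to `tokenizer(...)` in place of the hard-coded prompt.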
Copyright notice: unless otherwise stated, articles are original to 主机测评; when reprinting or copying, please credit the source with a hyperlink.