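Model selection

Questions to consider

Selection has to weigh pros and cons along with benchmark metrics: accuracy, model size and hardware requirements (can it be deployed on a personal PC?), inference speed, and whether the model is open source or made in China (to avoid being blocked and to reduce maintenance risk).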

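Requirements

Preferably open source and able to run inference on a 4070 12G GPU, or otherwise callable through an API

Relatively high accuracy and fast inference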

Benchmark Leaderboard

open-llm-leaderboard (Open LLM Leaderboard) (huggingface.co)

C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models (cevalbenchmark.com)

Alpaca Eval Leaderboard (tatsu-lab.github.io)

In Detail

AutoGen

AutoGen helps developers build complex applications on top of large language models. It depends on a paid OpenAI API key, so it was dropped.

fastchat

lm-sys/FastChat: An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena. (github.com)

Chinese LLaMA & Alpaca

ymcui/Chinese-LLaMA-Alpaca: 中文LLaMA&Alpaca大语言模型+本地CPU/GPU训练部署 (Chinese LLaMA & Alpaca LLMs) (github.com)

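Open question

Do we need to do our own prompt engineering?

Practice

Using the Hugging Face implementation of Xwin-LM (abandoned)

It claims to beat GPT! Install Hugging Face, pytorch-gpu, and transformers, then run a test; see hugging-face-startup.md for details.

Testing with the example

Xwin-LM also has an implementation based on the vLLM library. I used the HF implementation and watched it download about 30G of model files, which was a headache: I had not changed the cache location, so the C drive filled up almost immediately.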

# env: env4hf
from transformers import AutoTokenizer, AutoModelForCausalLM

# The first call downloads the full checkpoint into the Hugging Face cache.
model = AutoModelForCausalLM.from_pretrained("Xwin-LM/Xwin-LM-7B-V0.1")
tokenizer = AutoTokenizer.from_pretrained("Xwin-LM/Xwin-LM-7B-V0.1")
# System prompt plus one user turn; adjacent string literals are concatenated.
prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "USER: Hello, can you help me? "
    "ASSISTANT:"
)
inputs = tokenizer(prompt, return_tensors="pt")
# temperature only takes effect when sampling is enabled.
samples = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7)
# Strip the prompt tokens so that only the newly generated reply is decoded.
output = tokenizer.decode(samples[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(output)
# Example output: Of course! I'm here to help. Please feel free to ask your question or describe the issue you're having, and I'll do my best to assist you.

pytorch_model-00001-of-00003.bin: 100%|██████████| 9.88G/9.88G [17:20<00:00, 9.50MB/s]
pytorch_model-00002-of-00003.bin: 100%|██████████| 9.89G/9.89G [17:21<00:00, 9.50MB/s]
pytorch_model-00003-of-00003.bin: 100%|██████████| 7.18G/7.18G [12:34<00:00, 9.52MB/s]
Downloading shards: 100%|947.04s/it

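There are two problems: 1. it hangs for no obvious reason; 2. the C drive is almost full, so the cache has to be dealt with.

Cache first. See the cache setup section of https://huggingface.co/docs/transformers/v4.34.0/en/installation#installation. The default location is user/XX/.cache/huggingface/hub,
and the cache location is resolved from three environment variables, in priority order: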

1 (default) HUGGINGFACE_HUB_CACHE or TRANSFORMERS_CACHE.
2 HF_HOME.
3 XDG_CACHE_HOME + /huggingface.

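Changing the environment variable is enough, and then we move the existing cache over. To print an environment variable, Windows uses %VAR% while Linux uses $VAR. None of the variables above were set. It is strange that a 7B model is as large as 25G; can it even fit in VRAM?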
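The lookup order listed above can be sketched in a few lines. This is only an illustration of the priority described in these notes, not the library's actual code: the helper name resolve_hf_cache is made up, and the assumption that models end up in a hub subfolder of HF_HOME is mine. Reading os.environ from Python also sidesteps the %VAR% versus $VAR difference between shells.

```python
import os
from pathlib import Path


def resolve_hf_cache(env=None):
    """Return the cache directory implied by the priority list above.

    Hypothetical helper for illustration; it mirrors the documented
    order and is not taken from the transformers source.
    """
    env = os.environ if env is None else env
    # 1. An explicit cache variable wins.
    for name in ("HUGGINGFACE_HUB_CACHE", "TRANSFORMERS_CACHE"):
        if env.get(name):
            return Path(env[name])
    # 2. HF_HOME (assumption: models live in its "hub" subfolder).
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]) / "hub"
    # 3. XDG_CACHE_HOME + /huggingface, else the ~/.cache default.
    if env.get("XDG_CACHE_HOME"):
        base = Path(env["XDG_CACHE_HOME"])
    else:
        base = Path.home() / ".cache"
    return base / "huggingface" / "hub"


if __name__ == "__main__":
    # Show which of the variables are set in the current process.
    for name in ("HUGGINGFACE_HUB_CACHE", "TRANSFORMERS_CACHE", "HF_HOME", "XDG_CACHE_HOME"):
        print(f"{name} = {os.environ.get(name, '<not set>')}")
    print("models would be cached in:", resolve_hf_cache())
```

To get the cache off the C drive, one of these variables (for example HF_HOME pointing at a folder on another drive) would need to be set before Python starts, or at least before transformers is imported.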

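LLM's answer to: how much space does a 7B LLM take up?

The VRAM footprint of a 7B LLM depends on the parameter data type.

Inference:

7B in float (32-bit): 28GB of VRAM
7B in BF16: 14GB of VRAM
7B in int8: 7GB of VRAM

Training can need up to about 4 times the VRAM of inference; for 7B in float that is roughly 112GB. This is only a theoretical estimate, and actual usage can be higher depending on factors such as batch size and sequence length. So model training needs at least 3-4 times the VRAM of inference. As for disk space, because of the large parameter count, loading a full-precision model may require 78G to 104G of disk space.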

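WTF??? Looking at my 12G of VRAM, it seems I can only run 7B-int8. But why is there no 11B-int8 model? Further training is out of the question too, so that would need a cloud server. Were all my earlier attempts wrong? 7B-int8 probably does not perform well, and is likely worse than just calling an API~~~~😢 Am I the clown here??
Local use is only possible with int8 inference, and that still cannot cover training or the other requirements.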
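The inference figures in the quoted answer follow from multiplying the parameter count by the bytes each parameter takes. Below is a rough sketch of that arithmetic, counting weights only (no activations or context cache). The function names are made up, and the flat 4x training multiplier is simply the estimate quoted above. The float32 figure is also close to the roughly 27G of shards downloaded earlier, which would be consistent with a checkpoint stored in 32-bit.

```python
# Bytes needed to store one parameter in each data type.
BYTES_PER_PARAM = {"float32": 4, "bf16": 2, "int8": 1}


def weight_memory_gb(n_params, dtype):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def training_memory_gb(n_params, dtype, multiplier=4):
    """Rough training estimate: inference memory times a fixed multiplier."""
    return weight_memory_gb(n_params, dtype) * multiplier


if __name__ == "__main__":
    vram_gb = 12  # the 4070 from the requirements
    for dtype in BYTES_PER_PARAM:
        need = weight_memory_gb(7e9, dtype)
        verdict = "fits" if need <= vram_gb else "does not fit"
        print(f"7B {dtype}: {need:.0f} GB for inference, {verdict} in {vram_gb} GB")
    print(f"7B float32 training: about {training_memory_gb(7e9, 'float32'):.0f} GB")
```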

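Usage

Why did I give up on LangChain?

In Chinese-LLaMA-Alpaca, the transformers interface needs more than 20G of VRAM, so only the C interface is usable, and only with 13B-4bit or 7B-8bit quantization. Quantized deployment in turn relies on the C++ route: llamacpp_zh Chinese-LLaMA-Alpaca-2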
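Why only 13B-4bit or 7B-8bit? A quick weights-only check against a 12G card gives a hint. This is a back-of-the-envelope sketch under my own assumptions: it ignores the context cache and runtime overhead, and the helper names are made up.

```python
def quantized_weights_gb(n_params, bits):
    """Size of the weights at a given bit width, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9


def fits(n_params, bits, vram_gb=12):
    """True when the weights alone are smaller than the available VRAM."""
    return quantized_weights_gb(n_params, bits) < vram_gb


if __name__ == "__main__":
    candidates = [
        ("7B-8bit", 7e9, 8),
        ("13B-4bit", 13e9, 4),
        ("13B-8bit", 13e9, 8),
        ("13B-16bit", 13e9, 16),
    ]
    for label, n_params, bits in candidates:
        size = quantized_weights_gb(n_params, bits)
        print(f"{label}: {size:.1f} GB, fits in 12 GB: {fits(n_params, bits)}")
```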

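Spark (星火) model

Individuals get 2 million free tokens, with queries per second (QPS) capped at 2. Downside: the quota expires after one year.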
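A QPS cap of 2 means consecutive requests should be at least 0.5 seconds apart. One way to stay under it on the client side is a small throttle like the sketch below. The class and its parameters are hypothetical and not part of any Spark SDK; the clock and sleep functions are injectable only so the logic can be checked without actually waiting.

```python
import time


class Throttle:
    """Keeps consecutive calls at least 1/max_qps seconds apart."""

    def __init__(self, max_qps, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_qps
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous call was released

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


if __name__ == "__main__":
    throttle = Throttle(max_qps=2)
    for i in range(3):
        slept = throttle.wait()
        # A real client would send its request here.
        print(f"request {i} released after sleeping {slept:.2f}s")
```

Calling throttle.wait() right before each request is enough; the first call never sleeps.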

Python example

test.py depends only on your three API credentials and the websocket-client library

# 1. Replace appid, api_secret, and api_key with your own
# 2. Create the environment
conda create -n env4spark python=3.10 websocket-client

And with that you have your own little Q&A program.

To understand test.py, uncomment text (or inspect it in the variable view)

我:a  
星火:I'm sorry, I am an AI language model and I do not understand what you are trying to ask. Can you please provide more context or clarify your question?[{'role': 'user', 'content': 'a'}, {'role': 'assistant', 'content': "I'm sorry, I am an AI language model and I do not understand what you are trying to ask. Can you please provide more context or clarify your question?"}]

我:can you speak chinese
星火:是的，我可以说中文。有什么我可以帮你的吗？[{'role': 'user', 'content': 'a'}, {'role': 'assistant', 'content': "I'm sorry, I am an AI language model and I do not understand what you are trying to ask. Can you please provide more context or clarify your question?"}, {'role': 'user', 'content': 'can you speak chinese'}, {'role': 'assistant', 'content': '是的，我可以说中文。有什么我可以帮你的吗？'}]
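The list printed after each reply grows by two entries per turn, one for the user and one for the assistant. A minimal sketch of that bookkeeping follows. add_message and demo are hypothetical helpers, not functions copied from test.py, and the reply is hard-coded rather than fetched from the API.

```python
def add_message(history, role, content):
    """Append one message to the conversation history and return it."""
    history.append({"role": role, "content": content})
    return history


def demo():
    """Replay one turn from the log above without calling any API."""
    history = []
    add_message(history, "user", "can you speak chinese")
    add_message(history, "assistant", "是的，我可以说中文。有什么我可以帮你的吗？")
    return history


if __name__ == "__main__":
    for message in demo():
        print(message)
```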

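The text and question variables are identical: both hold the full conversation history printed above. Only SparkApi.answer contains just the reply of the current turn. The output itself is printed inside SparkApi.main, more precisely by print(content, end="") on line 97 of SparkApi.py, which runs as part of the on_message callback and prints one partial message per call. Change it to print(content, end="\n") and something magical happens: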

星火:Hello
! How
 can I assist
 you today?

The pauses between lines are long, which shows that "Hello", "! How", " can I assist", and so on are returned one chunk at a time.
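Since the reply arrives in pieces, a callback has to both show each piece right away and stitch the pieces together for the final answer. Here is a small sketch of that pattern, using the chunks from the output above. ChunkCollector and replay are made-up names, and the real on_message callback works on WebSocket messages, which this sketch replaces with plain strings.

```python
class ChunkCollector:
    """Prints each chunk as it arrives and keeps it for the full answer."""

    def __init__(self):
        self.parts = []

    def on_chunk(self, content):
        # end="" keeps the chunks on one line, as in the original print call.
        print(content, end="", flush=True)
        self.parts.append(content)

    @property
    def answer(self):
        """The complete reply assembled from all chunks seen so far."""
        return "".join(self.parts)


def replay(chunks):
    """Feed a list of chunks through a collector and return the full text."""
    collector = ChunkCollector()
    for chunk in chunks:
        collector.on_chunk(chunk)
    print()  # finish the line once the reply is complete
    return collector.answer


if __name__ == "__main__":
    replay(["Hello", "! How", " can I assist", " you today?"])
```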