Simple generative language models (GLM), built to understand the GLM training and inference process.
- tinyshakespeare.txt
wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O datas/tinyshakespeare.txt
- 红楼梦.txt (Dream of the Red Chamber)
wget https://raw.githubusercontent.com/shjwudp/shu/master/books/红楼梦.txt -O datas/红楼梦.txt
Here English text is tokenized character by character: the vocabulary is the sorted set of characters that appear in the dataset, and each character's token id is its index in that sorted list.
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
tips: the character set of the dataset is used directly as the vocabulary of a simple tokenizer; this is a minimal implementation, and no tokenizer library such as sentencepiece or tiktoken is used to train a tokenizer and build a vocabulary.
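A quick round-trip check of the character tokenizer above (a minimal sketch; it assumes every character of the sample string occurs in the training text):

```python
sample = "hello world"
ids = encode(sample)            # one integer token id per character
assert decode(ids) == sample    # encode/decode is a lossless round-trip
print(vocab_size, ids[:5])      # vocab size and the first few token ids
```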
- Bigram LM: an embedding weight layer module (nn.Embedding) keyed by the vocab_size tokens, trained to learn bigram-style next-token probabilities $P(t_i \mid t_{i-1})$ through its weight parameters $W_e$; a minimal sketch follows.
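A minimal sketch of such a bigram model (in the spirit of karpathy/ng-video-lecture; the class name and details are illustrative, not necessarily identical to simpleLM's bigramLM):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLM(nn.Module):
    """Each token id directly indexes a row of logits over the next token."""
    def __init__(self, vocab_size):
        super().__init__()
        # W_e: a (vocab_size, vocab_size) table of next-token logits
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)  # P(t_i | t_{i-1})
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
```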
- MLP (Multilayer Perceptron) LM: every layer is a fully connected linear layer; a minimal sketch follows.
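One common way to build such a model (in the spirit of "A Neural Probabilistic Language Model" and karpathy/makemore; the layer sizes and names below are illustrative assumptions, not the repo's exact configuration): concatenate the embeddings of a fixed-length context window and push them through stacked fully connected layers.

```python
import torch
import torch.nn as nn

class MLPLM(nn.Module):
    def __init__(self, vocab_size, block_size=8, n_embd=64, n_hidden=256):
        super().__init__()
        self.block_size = block_size
        self.wte = nn.Embedding(vocab_size, n_embd)    # token embeddings
        self.mlp = nn.Sequential(
            nn.Linear(block_size * n_embd, n_hidden),  # fully connected layer
            nn.Tanh(),
            nn.Linear(n_hidden, vocab_size),           # next-token logits
        )

    def forward(self, idx):
        # idx: (B, block_size) token ids of a fixed-length context
        x = self.wte(idx)              # (B, block_size, n_embd)
        x = x.view(idx.size(0), -1)    # flatten the context window
        return self.mlp(x)             # (B, vocab_size)
```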
- GPT (Generative Pre-trained Transformer) LM: a GPT-2-like model that adds positional embeddings and transformer blocks (attention mechanism, an FFN (MLP) feed-forward layer, and residual connections), and initializes the model's weight parameters. If the weights were initialized to zero, the weight updates during backpropagation would become meaningless; to avoid "uniform weights" (strictly speaking, to break the symmetric structure of the weights), the initial values must be generated randomly. A normal (Gaussian) distribution with a chosen standard deviation is commonly used; here std=0.02 (see the init sketch below).
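A sketch of that initialization, GPT-2 style (the helper name is illustrative; the special scaling of residual projections from the GPT-2 paper is omitted here):

```python
import torch.nn as nn

def init_weights(module):
    # All-zero init would make every unit compute and update the same thing;
    # small random Gaussian values (std=0.02) break that symmetry.
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

# inside the model's __init__: self.apply(init_weights)
```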
- layer/block wise scaling GPT LM: a GPT-2-like model that applies layer/block wise scaling (scaling the attention qkv heads and the FFN intermediate sizes per block); a config sketch follows. For details see:
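A sketch of what layer/block wise scaling can look like: instead of giving every block the same width, the number of attention heads and the FFN intermediate size are interpolated across depth. The multipliers and helper names below are illustrative assumptions, not the repo's exact schedule.

```python
def make_divisible(v, divisor=8):
    # round to a multiple of `divisor` so widths stay shape friendly
    return max(divisor, int(v + divisor / 2) // divisor * divisor)

def layer_wise_config(n_layer, n_embd, head_dim=64,
                      qkv_mult=(0.5, 1.0), ffn_mult=(2.0, 4.0)):
    """Per-block attention head counts and FFN intermediate sizes,
    linearly interpolated from the first block to the last."""
    heads, ffn_dims = [], []
    for i in range(n_layer):
        t = i / max(n_layer - 1, 1)                        # 0.0 -> 1.0 over depth
        q = qkv_mult[0] + t * (qkv_mult[1] - qkv_mult[0])
        f = ffn_mult[0] + t * (ffn_mult[1] - ffn_mult[0])
        heads.append(max(1, round(q * n_embd / head_dim)))
        ffn_dims.append(make_divisible(f * n_embd))
    return heads, ffn_dims

# e.g. layer_wise_config(6, 384) -> heads [3, 4, 4, 5, 5, 6], ffn widths 768..1536
```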
- MoE (mixture of experts) LM: a sparse mixture-of-experts (SMoE) language model.
  - Uses a sparse mixture of experts instead of a single isolated feed-forward network.
  - Routing is implemented with top-k gating and noisy top-k gating (sketched after this list).
  - Model weight initialization: Kaiming He initialization is used here.
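A sketch of noisy top-k gating, following the makeMoE notebook and the sparsely-gated MoE paper (class and attribute names are illustrative):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class NoisyTopkRouter(nn.Module):
    """Pick top_k experts per token; learned noise encourages balanced routing."""
    def __init__(self, n_embd, num_experts, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.route_linear = nn.Linear(n_embd, num_experts)
        self.noise_linear = nn.Linear(n_embd, num_experts)

    def forward(self, x):                               # x: (B, T, n_embd)
        logits = self.route_linear(x)                   # clean routing logits
        noise_std = F.softplus(self.noise_linear(x))    # per-token noise scale
        noisy_logits = logits + torch.randn_like(logits) * noise_std
        top_k_logits, indices = noisy_logits.topk(self.top_k, dim=-1)
        # keep only the top-k logits and softmax over them -> sparse weights
        sparse = torch.full_like(noisy_logits, float("-inf"))
        sparse.scatter_(-1, indices, top_k_logits)
        router_output = F.softmax(sparse, dim=-1)       # zeros outside the top-k
        return router_output, indices
```

Plain top-k gating is the same routing without the noise term; each token's output is then the router-weighted sum of its selected experts' FFN outputs.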
- MoA (SMoE MultiHeadAttention)-MoE (mixture of experts) LM: the modular structure comes from the sparse mixture-of-experts language model ModuleFormer.
  - Introduces expert capacity (an expert capacity factor); see the sketch after this list.
  - Adds a load-balancing loss.
  - For details see: Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.
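A sketch of the Switch-Transformer-style expert capacity and auxiliary load-balancing loss (illustrative helper functions, not the repo's exact code):

```python
import torch
import torch.nn.functional as F

def expert_capacity(num_tokens, num_experts, capacity_factor=1.25):
    # max tokens each expert may process; tokens routed beyond this are
    # typically dropped or passed through via the residual connection
    return int(capacity_factor * num_tokens / num_experts)

def load_balancing_loss(router_probs, expert_indices, num_experts):
    """Auxiliary loss = num_experts * sum_e f_e * P_e, where f_e is the fraction
    of tokens dispatched to expert e and P_e is the mean router probability
    assigned to expert e (minimized when routing is uniform)."""
    # router_probs: (num_tokens, num_experts); expert_indices: (num_tokens,)
    one_hot = F.one_hot(expert_indices, num_experts).float()
    tokens_per_expert = one_hot.mean(dim=0)        # f_e
    mean_router_probs = router_probs.mean(dim=0)   # P_e
    return num_experts * torch.sum(tokens_per_expert * mean_router_probs)
```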
git clone https://github.com/weedge/baby-llm.git
cd baby-llm && mkdir -p datas models
# datas/tinyshakespeare.txt
# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O datas/tinyshakespeare.txt
prepare: download datasets -> tokenizer --encode--> token ids (train.bin, val.bin); a sketch of this step appears after the commands below
# shakespeare_char
python3 simpleLM/datasets/shakespeare_char/prepare.py
# train (hyperparameters are adjustable)
python3 simpleLM/train.py --model_name=bigramLM
python3 simpleLM/train.py --model_name=mlpLM
python3 simpleLM/train.py --model_name=gptLM
python3 simpleLM/train.py --model_name=block_wise_scaling_gptLM
python3 simpleLM/train.py --model_name=moeLM
python3 simpleLM/train.py --model_name=moa_moeLM
# plot train/validation loss
ls loss_*.log | python3 simpleLM/plot.py
# tips: wandb is not used to log the loss; the curves are simply drawn from the loss logs with plot.py
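For reference, the prepare step above typically follows the nanoGPT shakespeare_char pattern: read the raw text, encode it with the character tokenizer, and dump uint16 token ids to train.bin / val.bin. A rough sketch under those assumptions (the split ratio and output paths are illustrative, not necessarily the repo's prepare.py):

```python
import numpy as np

with open("datas/tinyshakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

n = len(text)
# assume a 90/10 train/validation split and reuse encode() from the tokenizer above
train_ids = np.array(encode(text[: int(n * 0.9)]), dtype=np.uint16)
val_ids = np.array(encode(text[int(n * 0.9):]), dtype=np.uint16)
train_ids.tofile("train.bin")  # output paths are illustrative
val_ids.tofile("val.bin")
```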
References:
- https://lena-voita.github.io/nlp_course/language_modeling.html
- karpathy/min-char-rnn.py
- https://en.wikipedia.org/wiki/Activation_function
- https://karpathy.ai/zero-to-hero.html
- https://github.com/karpathy/ng-video-lecture
- https://github.com/antirez/simple-language-model
- https://github.com/karpathy/makemore
- https://github.com/AviSoori1x/makeMoE/blob/main/makeMoE_from_Scratch.ipynb
- https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/moe.py#L384
- https://github.com/myshell-ai/JetMoE
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- Understanding the backward pass through Batch Normalization Layer
- Character-Level Language Modeling with Deeper Self-Attention
- A Neural Probabilistic Language Model
- GPT-1: Improving Language Understanding by Generative Pre-Training
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- Mixtral of Experts
- ModuleFormer: Modularity Emerges from Mixture-of-Experts
- JetMoE: Reaching Llama2 Performance with 0.1M Dollars