MiniMind Learning Walkthrough

CamelliaV's Blog

Date: Dec 11, 2025
Summary: minimind walkthrough
Tags: Development
Category: Tech Sharing

Training does not need much GPU memory, but it does need a lot of time.

Tokenizer

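What it is

Just as an image uses pixels to represent a natural scene (RGB values → color patches),

a tokenizer uses numbers to represent natural language (IDs → chunks of text).

How it works

BPE

Count every pair of adjacent symbols, merge the most frequent pair into a new symbol, and repeat until the vocabulary reaches the target size (the final vocabulary consists of every base character plus every symbol created by a merge).
As far as the vocabulary size allows, common words become single symbols, while rare words survive only as pieces; for example, unfamiliar may contribute un as a symbol (if un is frequent enough).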
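The merge loop described above can be sketched in a few lines. This is a toy version for illustration only, not MiniMind's tokenizer training code: the function name train_bpe and the word-list input are assumptions, and real tokenizers work on bytes and handle pre-tokenization and special tokens.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Toy BPE: learn up to num_merges merges from a list of words."""
    # Every word starts as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the best pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


merges, corpus = train_bpe(["unfamiliar", "unknown", "undo", "until"], 1)
```

With these four words, the pair (u, n) occurs in every word, so it is the first merge and un becomes a single symbol.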

Model

Overview

Compared with the original Transformer, the main structural changes are:

Norm: LayerNorm → RMSNorm
Attention: MHA → GQA
Positional Encoding: Sinusoidal Positional Encoding → RoPE

RMSNorm

Set μ = 0 in LayerNorm.

This means no shifting (re-centering), only scaling, which lowers the compute cost while still keeping training stable.
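Written out, the substitution looks roughly as follows. The symbols d (feature dimension), γ and β (learned scale and shift) and ε (a small constant for numerical stability) are notation assumed here, since the post's own formulas were images. As usually defined, RMSNorm also drops the shift β.

```latex
\begin{aligned}
\text{LayerNorm:}\quad & y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \gamma + \beta,
\qquad \mu = \frac{1}{d}\sum_{i=1}^{d} x_i,
\quad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2 \\
\mu = 0 \;\Rightarrow\quad & \sigma^2 = \frac{1}{d}\sum_{i=1}^{d} x_i^2 = \mathrm{RMS}(x)^2 \\
\text{RMSNorm:}\quad & y = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot \gamma
\end{aligned}
```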

LayerNorm

[Figure: LayerNorm]

RMSNorm

[Figure: RMSNorm]
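A minimal sketch of both norms on a single vector, in plain Python rather than PyTorch so it runs anywhere. The function names and the eps default are assumptions, not MiniMind's code. On a zero-mean input with zero bias the two give the same result, which is the μ = 0 argument in numeric form.

```python
import math


def layer_norm(x, weight, bias, eps=1e-6):
    """LayerNorm: subtract the mean, divide by the std, then scale and shift."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mu) * inv_std * w + b for v, w, b in zip(x, weight, bias)]


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide by the root mean square, then scale. No mean, no shift."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


x = [1.0, -1.0, 2.0, -2.0]  # already zero-mean
ones, zeros = [1.0] * 4, [0.0] * 4
same = all(
    abs(a - b) < 1e-9
    for a, b in zip(layer_norm(x, ones, zeros), rms_norm(x, ones))
)
```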

GQA

Split the query heads into x groups; the heads within each group share one K/V pair.

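Some additions relative to the original (Transformer) attention:

The grouped, shared K/V shows up mainly as smaller output dimensions on the K and V linear layers.
The KV cache stores keys and values during decoding. Prefill maps the user prompt to the first token; decode is autoregressive and maps everything so far to the next token. During prefill the whole user prompt is processed in parallel in one pass, so seq_len is the prompt's length after tokenization. During decode seq_len is always 1: once K/V are computed and RoPE is applied, they are concatenated onto the KV cache, whose sequence length becomes the previous length + 1.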
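The K/V sharing can be illustrated with a small causal attention over nested lists. This is a sketch under assumptions (the name gqa_attention, plain Python instead of tensors, inputs already projected and position-encoded); the point is only the line that maps a query head to its K/V head.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gqa_attention(q_heads, k_heads, v_heads):
    """Causal grouped-query attention.

    q_heads: [n_heads][seq][d]; k_heads and v_heads: [n_kv_heads][seq][d].
    Returns [n_heads][seq][d].
    """
    n_heads, n_kv_heads = len(q_heads), len(k_heads)
    assert n_heads % n_kv_heads == 0
    group_size = n_heads // n_kv_heads
    d = len(q_heads[0][0])
    out = []
    for h, q in enumerate(q_heads):
        kv = h // group_size  # query heads in one group read the same K/V head
        k, v = k_heads[kv], v_heads[kv]
        head_out = []
        for t, q_t in enumerate(q):
            # Causal mask: position t only attends to positions 0..t.
            scores = [
                sum(a * b for a, b in zip(q_t, k[s])) / math.sqrt(d)
                for s in range(t + 1)
            ]
            probs = softmax(scores)
            head_out.append(
                [sum(p * v[s][j] for s, p in enumerate(probs)) for j in range(d)]
            )
        out.append(head_out)
    return out
```

With 4 query heads and 2 K/V heads, heads 0 and 1 read K/V head 0 and heads 2 and 3 read K/V head 1, so only half as many K/V tensors need to be projected and cached.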

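RoPE

Attention scores depend on relative rather than absolute positions → good length extrapolation (the model is trained on short text such as 512 tokens, yet still performs reasonably on longer text such as 768 or 1024).

A positional encoding expresses position information. For the vector (embedding) at position m and the vector at position n, the relative position m-n can be modeled as a rotation in space, as follows:

[Figure: RoPE derivation]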
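The rotation idea can be checked numerically. The sketch below rotates adjacent pairs of dimensions by an angle proportional to the position; the name rope, the base of 10000 and the adjacent-pair layout are assumptions (many implementations instead pair dimension i with i + d/2, which is equivalent up to a permutation of dimensions).

```python
import math


def rope(x, pos, base=10000.0):
    """Rotate pairs (x[0], x[1]), (x[2], x[3]), ... by the angle pos * theta_i."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -1.0, 2.0]
# Both pairs of positions are 2 apart, so the scores agree.
score_a = dot(rope(q, 5), rope(k, 3))
score_b = dot(rope(q, 12), rope(k, 10))
```

Because rotating q by m and k by n leaves a dot product that depends only on m-n, score_a and score_b agree up to floating-point error.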

SwiGLU FFN

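There is an extra gate: when expanding to the intermediate dimension, two Linear layers are used instead of one. One is the original projection; the output of the other goes through a SiLU activation. The two results are multiplied elementwise, which acts as a gating filter.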
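A minimal sketch of the gated FFN on a single vector. The weight names w_gate, w_up and w_down are assumptions (bias-free, stored as lists of rows), and the final down projection back to the hidden size is the usual last step of such a block.

```python
import math


def silu(x):
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def swiglu_ffn(x, w_gate, w_up, w_down):
    gate = [silu(g) for g in matvec(w_gate, x)]  # branch that goes through SiLU
    up = matvec(w_up, x)                         # plain expansion branch
    hidden = [g * u for g, u in zip(gate, up)]   # elementwise gating
    return matvec(w_down, hidden)                # project back to hidden size


y = swiglu_ffn([2.0], [[1.0]], [[3.0]], [[1.0]])  # silu(2) * (3 * 2)
```

A gate pre-activation of 0 gives silu(0) = 0 and blocks the channel completely, while large positive values pass the up branch through almost unchanged.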

Transformer Block

MiniMind

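If there is no KV cache, the step is prefill: the start position is 0, the input is the whole prompt, and the K/V for every token in the user prompt are computed in one pass and stored in the cache. After this pass a KV cache exists. Its length is the tokenized length of the user prompt, and that length is the index of the next start position. The input is now a single token (the first token produced by prefill, and afterwards the token generated in the previous step). The model computes that token's K/V and appends them to the cache, then attends with that token's Q against the cached K/V. In other words, feeding in only this one token is enough, because the cache supplies all K/V from the start of the user prompt up to the current token.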
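The bookkeeping in that paragraph (start position, cache growth) can be sketched without any real attention math. KVCache, forward_step and project_kv are hypothetical names; project_kv stands in for the K/V projections plus RoPE.

```python
class KVCache:
    """Holds the keys and values of every token processed so far."""

    def __init__(self):
        self.keys, self.values = [], []

    def __len__(self):
        return len(self.keys)

    def append(self, new_keys, new_values):
        self.keys.extend(new_keys)
        self.values.extend(new_values)


def forward_step(token_ids, cache, project_kv):
    """Process token_ids, append their K/V to the cache, return their positions."""
    start_pos = len(cache)  # 0 on prefill, previous cache length on decode
    positions = list(range(start_pos, start_pos + len(token_ids)))
    new_k, new_v = [], []
    for tok, pos in zip(token_ids, positions):
        k, v = project_kv(tok, pos)  # hypothetical: K/V projection plus RoPE
        new_k.append(k)
        new_v.append(v)
    cache.append(new_k, new_v)
    return positions


def fake_project_kv(tok, pos):
    # Placeholder projection so the sketch runs without a model.
    return ("k", tok, pos), ("v", tok, pos)


cache = KVCache()
prefill_positions = forward_step([5, 9, 2], cache, fake_project_kv)  # whole prompt
decode_positions = forward_step([7], cache, fake_project_kv)         # one new token
```

Prefill covers positions 0 to 2 in one call, and the decode step sees only one token but is assigned position 3, with the cache growing from 3 to 4 entries.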

Embedding

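Purpose: ID → Vector

It turns each ID output by the tokenizer into a vector of size hidden_size. The values are initialized randomly and adjusted during training.

The input and output transformations share the same weights, which saves space.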
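Weight sharing can be pictured as one matrix used in two directions: a row lookup on the way in and a dot product with every row on the way out. The class below is an illustrative sketch with assumed names and an assumed initialization scale, not MiniMind's module.

```python
import random


class TiedEmbedding:
    """One [vocab_size x hidden_size] matrix used for both input and output."""

    def __init__(self, vocab_size, hidden_size, seed=0):
        rng = random.Random(seed)
        # Random initial values; training would adjust them.
        self.weight = [
            [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
            for _ in range(vocab_size)
        ]

    def embed(self, token_ids):
        """Input side: ID → vector, a plain row lookup."""
        return [self.weight[t] for t in token_ids]

    def logits(self, hidden):
        """Output side: hidden vector → one score per vocabulary entry."""
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.weight]


emb = TiedEmbedding(vocab_size=5, hidden_size=4)
vectors = emb.embed([2, 0])
scores = emb.logits(vectors[0])
```

Only one vocab_size × hidden_size matrix is stored instead of two, which matters for a small model where the vocabulary table is a large share of the parameters.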

Pretrain

What it is

The task is to predict the next token from the preceding tokens (causal language modeling).

The model learns general language understanding.

Learning Objective	Example
Grammar	"The cat sits" not "The cat sit"
Facts	"Paris is the capital of France"
Patterns	Code syntax, mathematical notations
Multilingual	Chinese characters, English words
Context	Long-range dependencies

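How it is done

Data

The dataset is relatively large; the head command gives a quick look at the data pattern.

[Figure: output of head on the data file]

The data is all plain text. Each JSON object in the jsonl file has only a text field, whose content is a series of sentences, each starting and ending with a special token.

[Figure: special token configuration]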

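Related code

Read the text, truncate samples that are too long, pad samples that are too short, and encode with the tokenizer. A loss_mask is defined so that padded positions are excluded from the loss. Because the task is next-token prediction, the first token is never a label and the last token is never an input, so the first token does not contribute to the loss either.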
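The truncate, pad and shift steps can be sketched on a list of already-encoded token IDs. The function name and pad_id value are assumptions. The mask here is built from the real length rather than by comparing against the pad ID, which avoids masking a real token that happens to equal pad_id.

```python
def build_pretrain_sample(token_ids, max_length, pad_id=0):
    """Return (inputs, labels, loss_mask) for next-token prediction."""
    ids = token_ids[:max_length]                    # truncate long samples
    n_real = len(ids)
    ids = ids + [pad_id] * (max_length - n_real)    # pad short samples
    mask = [1] * n_real + [0] * (max_length - n_real)
    inputs = ids[:-1]      # the last token is never an input
    labels = ids[1:]       # the first token is never a label
    loss_mask = mask[1:]   # aligned with labels; padding gets 0
    return inputs, labels, loss_mask


inputs, labels, loss_mask = build_pretrain_sample([1, 5, 6, 2], max_length=6)
```

For the sample above, inputs is [1, 5, 6, 2, 0], labels is [5, 6, 2, 0, 0] and loss_mask is [1, 1, 1, 0, 0], so only the three real next-token targets count.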

Training

loss_mask masks out the padded part.

Cross-entropy (CE) loss is used.
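Combining the two points, a masked cross-entropy averages the per-token loss over unmasked positions only. This is a plain-Python sketch with an assumed function name; a PyTorch training loop would typically compute the unreduced per-token loss and apply the same weighting.

```python
import math


def masked_cross_entropy(logits, targets, loss_mask):
    """Mean cross-entropy over positions where loss_mask is 1.

    logits: [seq][vocab], targets: [seq], loss_mask: [seq] of 0/1.
    """
    total, count = 0.0, 0
    for row, target, m in zip(logits, targets, loss_mask):
        if not m:
            continue  # padded position: contributes nothing
        mx = max(row)
        log_z = mx + math.log(sum(math.exp(v - mx) for v in row))  # stable logsumexp
        total += log_z - row[target]  # equals -log softmax(row)[target]
        count += 1
    return total / max(count, 1)


loss = masked_cross_entropy(
    logits=[[0.0, 0.0], [100.0, 0.0], [5.0, -5.0]],
    targets=[0, 0, 1],
    loss_mask=[1, 1, 0],
)
```

The third position would have a large loss, but its mask is 0, so the result is the mean of log 2 (uniform prediction) and roughly 0 (confident and correct).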

Results

Single 4090 GPU, using the MiniMind2 104M-parameter configuration

~4 h to finish one epoch

[Figures: pretraining results]

The results match the reference plot in the repository.

SFT/IFT

What it is

Supervised Fine-Tuning (SFT) / Instruction Fine-Tuning

It adapts the pretrained model to instruction following and dialogue, instead of plain text continuation.

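How it is done

Data

Each JSON object has a conversations field containing a list of turns, each made up of content and its role. Later, loss is computed only on turns whose role is assistant.

The assistant part is delimited by a start ID sequence and an end ID sequence.

The code searches for the start and end IDs defined above for the assistant part and sets the mask to 1 for the content in between, so that loss is computed only on that part.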
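The search-and-mark step might look like the sketch below. The function name and the example token IDs are hypothetical. Including the end marker itself in the mask, so the model also learns when to stop, is an assumption of this sketch.

```python
def build_sft_loss_mask(input_ids, start_ids, end_ids):
    """Mark assistant content (plus its end marker) with 1, everything else 0."""
    n = len(input_ids)
    mask = [0] * n
    i = 0
    while i < n:
        if input_ids[i:i + len(start_ids)] == start_ids:
            start = i + len(start_ids)  # first token after the assistant header
            end = start
            while end < n and input_ids[end:end + len(end_ids)] != end_ids:
                end += 1
            stop = min(end + len(end_ids), n)  # assumption: include the end marker
            for j in range(start, stop):
                mask[j] = 1
            i = stop
        else:
            i += 1
    return mask


# Hypothetical IDs: 1 = turn start, 12 = "user", 10 = "assistant", 2 = turn end.
ids = [1, 12, 30, 31, 2, 1, 10, 40, 41, 2]
mask = build_sft_loss_mask(ids, start_ids=[1, 10], end_ids=[2])
```

Here the user turn and the assistant header are masked out, and only 40, 41 and the closing 2 of the assistant turn contribute to the loss.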

Training

The data format and the resulting loss mask are the only differences between pretraining and SFT.

Comparing train_pretrain.py with train_full_sft.py, the training logic itself is identical.

Results

Training was not fully continuous, but it did take a long time, probably no less than two days.

[Figures: SFT training results]

Hard to evaluate: everything appears to oscillate.

RL

In the method names, P → Policy (the LLM being trained plays the corresponding role from RL) and O → Optimization.

Preference | Reasoning Fine-Tuning

DPO | RLHF (Human Feedback) & RLVR (Verifiable Rewards) (PPO → GRPO → SPO)

DPO

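Contrastive learning: the dataset must provide Chosen/Rejected pairs.

The loss is computed from the difference against the outputs of a frozen baseline (reference) model.

It maximizes the probability gap between chosen and rejected, within limits.

[Figure: DPO loss formula]

Preference is modeled as a probability.

log → turns a product of values into a sum

- → turns maximization into minimization

beta constrains how far the policy deviates from the ref model.

The code rearranges the order of the terms.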
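For a single preference pair, the loss can be sketched as below. The inputs are assumed to be summed sequence log-probabilities from the policy and from the frozen reference model; the function name and the beta default are assumptions. The terms are grouped as (policy chosen - policy rejected) - (ref chosen - ref rejected), which is algebraically the same as grouping each response with its own reference.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * ((pi_c - pi_r) - (ref_c - ref_r))) for one pair."""
    pi_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    z = beta * (pi_logratio - ref_logratio)
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable way.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


same_as_ref = dpo_loss(-1.0, -2.0, -1.0, -2.0)  # policy equals reference
improved = dpo_loss(-0.5, -3.0, -1.0, -2.0)     # policy prefers chosen more
```

When the policy matches the reference, the loss is log 2; it drops as the policy widens the chosen-vs-rejected gap beyond the reference's gap, and beta controls how strongly that gap is rewarded.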

GRPO

Increase the advantage over the other answers in the same group, and reduce the drift from the original model.
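Both ingredients can be sketched in isolation: a group-normalized advantage and a per-token penalty for drifting from the reference model. The function names, the eps value and the use of the population standard deviation are assumptions. The penalty shown is the low-variance KL estimator from John Schulman's blog post, which is referenced in the KL video listed at the end of this post.

```python
import math


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the mean and std of its own group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def kl_penalty(policy_logp, ref_logp):
    """Per-token KL estimate: exp(d) - d - 1 with d = ref_logp - policy_logp."""
    d = ref_logp - policy_logp
    return math.exp(d) - d - 1.0


advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
no_drift = kl_penalty(-1.0, -1.0)
some_drift = kl_penalty(-1.0, -2.0)
```

Answers that score above their group's mean get a positive advantage and the rest a negative one, so no separate value model is needed; the penalty is 0 when policy and reference agree and positive otherwise.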

SPO

PPO

Reason

LoRA

Distillation

Angle θ

References and Further Reading

Prefill and Decode for Concurrent Requests - Optimizing LLM Performance (a blog post by TNG Technology Consulting GmbH on Hugging Face)
https://huggingface.co/blog/tngtech/llm-performance-prefill-decode-concurrent-requests

zhuanlan.zhihu.com
https://zhuanlan.zhihu.com/p/647109286

【研2基本功面试篇 (3)】大模型常用组件与位置编码~ (bilibili video by happy魇)
https://www.bilibili.com/video/BV1hBJzzbEgg/?spm_id_from=333.1387.upload.video_card.click&vd_source=93059b03205e8ca15ad74b3d998769b9

Fantastic KL Divergence and How to (Actually) Compute It (YouTube video)
https://www.youtube.com/watch?v=tXE23653JrU

DeepSeek's GRPO (Group Relative Policy Optimization) | Reinforcement Learning for LLMs (YouTube video)
https://www.youtube.com/watch?v=xT4jxQUl0X8&list=PLWIL_KWO5qAmzXv6HjtylXKKaEyabvStz&index=15

© Camelliav 2025