LLM construction stages, from pretraining to LoRA
A language model moves through stages: pretraining, supervised tuning, preference tuning, evaluation, serving, retrieval, and adapter training. LoRA enters as a compact adaptation layer after the expensive base model exists.
Most confusion about LLMs starts with one overloaded verb: train.
A model gets “trained” during trillion-token pretraining. A chat model gets “trained” again during supervised instruction tuning. A preference stage may steer it through RL, short for Reinforcement Learning, or DPO, short for Direct Preference Optimization. A company may “train” an adapter with LoRA, short for Low-Rank Adaptation. A product team may say “train” when it really means adding documents to RAG, short for Retrieval-Augmented Generation.
Those are different operations with different cost, risk, and control.
The simple allegory: a generalist becoming a specialist
Picture a generalist moving into a specialist role.
First comes broad reading. The person reads books, articles, code, worked examples, arguments, jokes, and documentation. This stage builds general language sense.
Then comes instruction practice. The person learns the expected response format: answer the question, show work, summarize clearly, refuse unsafe requests, and follow directions.
Then comes review. Judges compare answers and mark better ones. The person starts preferring clear, useful, safer answers over messy ones.
Then comes local specialization. The general capability already exists, but a small notebook of local rules helps with one setting: house style, ticket labels, response format, or domain phrasing.
LoRA is like the local notebook. It does not create the generalist from scratch. It adds a compact specialty layer after broad capability already exists.
The pipeline
The useful picture is a staged pipeline.
raw data
-> filtering and deduplication
-> tokenization
-> pretraining
-> SFT: Supervised Fine-Tuning
-> preference tuning: RLHF, DPO, or related methods
-> safety evaluation
-> serving and monitoring
-> RAG, LoRA, tools, and agents
LoRA belongs near the end. It is not how most foundation models are born. It is a way to adapt an already-trained model without updating every weight inside it.
The attainable version: downloading data and training something
This whole process can sound sealed inside frontier labs. The practical truth is more useful: public training data exists, public training code exists, and small models can train on ordinary developer hardware or rented cloud GPUs.
The first successful local run should not aim at a frontier model. It should aim at contact with the machinery.
download a small text dataset
train a tokenizer
train a tiny language model
watch loss go down
sample text
fine-tune a small model
train a LoRA adapter
compare outputs
The same pipeline can run at small scale.
Public corpora make the first handle concrete. C4, short for Colossal Clean Crawled Corpus, comes from cleaned Common Crawl web text. Dolma is an open three-trillion-token corpus released for OLMo-style pretraining research. Common Pile focuses on openly licensed and public-domain text. Hugging Face hosts many smaller datasets suitable for experiments.
Scale still matters. Training a tiny model teaches the mechanics. Training a useful domain model requires careful data, evaluation, and compute. Training a frontier model requires industrial infrastructure.
The practical budget ladder looks roughly like this:
- Laptop or small cloud box: tokenizer practice, tiny models, small fine-tunes, data cleaning drills.
- Single rented GPU or small GPU box: serious LoRA or QLoRA work on open models, task adapters, classification behavior, format control, domain tone.
- $5k-class local AI workstation: DGX Spark, high-memory Mac Studio, or similar machines can make local adapter work feel operational instead of academic.
- Several GPUs over days or weeks: small-model pretraining, domain-specific continued pretraining, stronger SFT runs, more credible evaluation.
- Tens of thousands of dollars: plausible company pilot for a small or mid-sized domain model experiment, especially when the goal is not frontier capability. This budget can buy data curation, GPU time, repeated runs, evaluation, and deployment hardening.
- Millions and up: frontier-scale pretraining, broad assistant capability, large safety programs, heavy infrastructure, and repeated failed runs.
The attainable goal is not “build GPT in a weekend.” The attainable goal is: run the same class of process at a small scale, then understand why pretraining, SFT, RAG, LoRA, evaluation, and serving remain separate levers.
The workstation LoRA path as of May 2026
As of May 2026, a machine such as NVIDIA DGX Spark or a high-memory Mac Studio does not turn local hardware into a frontier lab, but it does make serious adapter work reachable.
NVIDIA positions DGX Spark as a compact Grace Blackwell machine with 128GB unified memory. NVIDIA says the box can fine-tune models up to the 70B class locally. LoRA and QLoRA make this kind of claim practical: freeze the base model, train a small adapter, evaluate, repeat.
Apple’s M3 Ultra Mac Studio class has a different shape. It offers very large unified-memory configurations, strong local developer ergonomics, and the MLX software path for Apple silicon. It is less CUDA-native than the NVIDIA path, but the memory pool makes local model loading and adapter experiments realistic.
For this tier, the examples stop being toy models:
- Qwen3 dense models in the 4B, 8B, 14B, and 32B range.
- Qwen3.6-style MoE models such as a 35B-A3B class model, where only a smaller active slice participates per token.
- NVIDIA Nemotron Nano and Nemotron 3 Nano class models, including small dense models and 30B-A3B style MoE models.
- Embedding and reranker models for RAG systems, where local fine-tuning can matter as much as chat fine-tuning.
This is a credible company pilot shape:
choose an open base model
collect 500 to 10,000 high-quality examples
train a LoRA or QLoRA adapter locally
evaluate against held-out cases
compare against prompting and RAG
keep the adapter only if it beats the simpler path
The practical output is not a new foundation model. It is a local, inspectable adapter proving one bounded behavior: ticket routing, structured extraction, house-style rewriting, policy classification, code review labeling, or domain-specific response format.
This matters because a $5k-class box can sit inside a small team. Experiments stop waiting on procurement, cluster slots, or vendor tickets. The learning loop tightens: data issue, adapter run, evaluation failure, data fix, adapter run again.
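The keep-or-discard decision in that loop can be written down as a small gate. The following is a minimal sketch, assuming hypothetical score names (`prompting`, `rag`, `adapter`) and a hypothetical `min_gain` margin; the stand-in predictor in the usage lines is not a real model.

```python
def accuracy(predict, cases):
    """Fraction of held-out (text, label) cases the predictor gets right."""
    correct = sum(1 for text, label in cases if predict(text) == label)
    return correct / len(cases)

def choose_path(scores: dict[str, float], min_gain: float = 0.02) -> str:
    """Keep the adapter only if it beats the best simpler path by min_gain.

    `scores` maps a path name to its held-out score. The name "adapter"
    and the default margin are assumptions made for this sketch.
    """
    baseline = max((name for name in scores if name != "adapter"), key=scores.get)
    if scores["adapter"] >= scores[baseline] + min_gain:
        return "adapter"
    return baseline

# Hypothetical held-out cases and a stand-in predictor for ticket routing.
held_out = [("reset password", "account"), ("refund please", "billing")]
keyword_router = lambda text: "billing" if "refund" in text else "account"

print(accuracy(keyword_router, held_out))
print(choose_path({"prompting": 0.80, "rag": 0.82, "adapter": 0.83}))
print(choose_path({"prompting": 0.80, "rag": 0.82, "adapter": 0.90}))
```

The margin exists so a marginal adapter win does not justify the extra training, evaluation, and versioning work.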
Stage 1: collect and clean the data
The earliest stage looks less glamorous than the demo. A model starts with a large text and code corpus: web pages, books, articles, documentation, forums, repositories, math data, synthetic data, licensed collections, and internal datasets where applicable.
Training systems cannot simply pour raw data into a model. Duplicates distort training. Low-quality pages teach low-quality patterns. Private data creates legal and security risk. Toxic or spammy content changes model behavior. Code with secrets creates a different failure class.
So the first stage handles filtering, deduplication, classification, and mixture design. The model is not learning yet; the training team assembles the diet.
The important tradeoff: data quality becomes model behavior later. A model can only learn patterns present in its training mixture, and it will inherit some unwanted patterns unless filtering, evaluation, and post-training catch them.
- collect: web, code, docs, books
- dedupe: remove repeated text
- classify: quality, license, risk
- mix: balance domains
- train: corpus enters run
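The dedupe and quality steps above can be sketched in a few lines. This is a toy version, assuming exact-match deduplication after normalization and a made-up repetition heuristic; production pipelines typically add near-duplicate detection and learned quality classifiers.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.strip().lower())

def looks_low_quality(text: str, min_words: int = 5) -> bool:
    """Hypothetical heuristic: too short, or dominated by one repeated word."""
    words = normalize(text).split()
    if len(words) < min_words:
        return True
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) > 0.5

def clean_corpus(docs: list[str]) -> list[str]:
    """Keep the first copy of each normalized document that passes the filter."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen or looks_low_quality(doc):
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "buy buy buy buy buy now",                         # spammy repetition
    "Short text.",                                     # too short
    "Gradient descent adjusts weights to reduce prediction error.",
]
print(clean_corpus(docs))
```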
Stage 2: turn text into tokens
Models do not read words exactly the way people do. Text gets split into tokens: chunks of characters, words, or word pieces. A common word may become one token. A rare word may become several. Code, punctuation, and whitespace also become tokens.
Tokenization matters because the model predicts tokens, not ideas directly. The training task is mechanically simple:
given previous tokens -> predict the next token
This simple objective scales. With enough data, model size, and compute, the next-token task forces the model to learn grammar, facts, style, code structure, reasoning patterns, and many statistical regularities of language.
The earlier loose database metaphor works for this reason: the model does not store rows in a table. It stores pattern weights for likely continuations.
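A small sketch makes the prediction targets concrete. It assumes a toy word-and-punctuation tokenizer, which is simpler than the learned subword tokenizers real models use; the function names are illustrative.

```python
import re

def tokenize(text: str) -> list[str]:
    """Toy tokenizer: words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Assign each distinct token an integer id in order of first appearance."""
    vocab: dict[str, int] = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def next_token_pairs(ids: list[int]) -> list[tuple[list[int], int]]:
    """Each prefix of the sequence is a context; the following id is the target."""
    return [(ids[:i], ids[i]) for i in range(1, len(ids))]

tokens = tokenize("The model predicts the next token.")
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
for context, target in next_token_pairs(ids):
    print(context, "->", target)
```

One sentence yields several training examples, one per position, which is part of why raw text is such a dense training signal.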
Stage 3: pretrain the base model
Pretraining is the expensive stage.
A transformer model starts with billions of randomly initialized parameters. During training, it reads token sequences and predicts masked or next tokens depending on the objective. When the prediction is wrong, the training system adjusts the model weights slightly. This loop repeats at enormous scale.
The transformer architecture matters because attention lets tokens relate to other tokens across a context window. A model can connect a variable name to its later use, a pronoun to an earlier noun, or a requirement to a later implementation detail. The original Transformer paper made attention the central mechanism.
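As a rough illustration of how one token relates to earlier tokens, here is single-head scaled dot-product attention with a causal mask in plain Python. It is a sketch under simplifying assumptions: no learned projections, no multiple heads, and the same vectors reused as queries, keys, and values in the usage lines.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of per-token vectors. Token i attends to tokens 0..i only.
    """
    d = len(q[0])
    outputs = []
    for i, qi in enumerate(q):
        # Similarity between this token's query and every earlier key.
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in range(i + 1)]
        weights = softmax(scores)
        # Output is the weighted average of the visible value vectors.
        out = [sum(w * v[j][c] for j, w in enumerate(weights)) for c in range(len(v[0]))]
        outputs.append(out)
    return outputs

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in causal_attention(x, x, x):
    print(row)
```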
The output of pretraining is a base model. It knows many patterns. It can complete text. It may know facts. It can imitate many registers. But it is not necessarily a good assistant.
Base models complete prompts. Assistant models follow instructions.
- sample tokens
- predict next token
- measure loss
- adjust weights
- repeat
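The loop above can be run at toy scale. The sketch below assumes a bigram model, a table of logits indexed by the previous token, rather than a transformer, and uses plain per-example gradient descent; it exists only to show loss going down.

```python
import math
import random

def train_bigram(ids, vocab_size, steps=200, lr=0.5, seed=0):
    """Train logits[prev][next] so each row scores the likely next token."""
    rng = random.Random(seed)
    # Small random starting weights.
    logits = [[rng.uniform(-0.1, 0.1) for _ in range(vocab_size)] for _ in range(vocab_size)]
    pairs = list(zip(ids[:-1], ids[1:]))
    losses = []
    for _ in range(steps):
        total = 0.0
        for prev, nxt in pairs:
            row = logits[prev]
            m = max(row)
            exps = [math.exp(x - m) for x in row]
            z = sum(exps)
            probs = [e / z for e in exps]           # predict next token
            total += -math.log(probs[nxt])          # measure loss
            # Gradient of cross-entropy w.r.t. logits is probs - one_hot(target).
            for j in range(vocab_size):
                row[j] -= lr * (probs[j] - (1.0 if j == nxt else 0.0))
        losses.append(total / len(pairs))
    return logits, losses

logits, losses = train_bigram([0, 1, 2] * 3, vocab_size=3)
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

Real pretraining swaps the table for a transformer and the nine-token sequence for trillions of tokens, but the sample, predict, measure, adjust cycle has the same shape.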
- base model: continues text (Question: ... Answer:)
- assistant model: follows task shape (Instruction -> answer)
Stage 4: SFT, or Supervised Fine-Tuning
SFT means Supervised Fine-Tuning. It teaches the model the shape of helpful interaction.
Instead of only predicting arbitrary next text, the model trains on task-shaped examples:
instruction -> good answer
question -> grounded explanation
bug report -> useful diagnosis
policy -> compliant refusal or allowed response
This stage often uses human-written examples, curated data, synthetic examples, or mixtures of all three. The goal is not to teach every fact again. The goal is to shift the pretrained model toward following instructions in a recognizable useful format.
SFT changes the interface contract. A base model might continue a prompt in character. An instruction-tuned model should answer the task.
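One way to picture an SFT training row: the instruction and answer are joined with a template, and the loss is often applied only to the answer tokens. The sketch below assumes a hypothetical template and whitespace tokens; real chat models define their own templates and tokenizers, and not every recipe masks the prompt.

```python
def format_example(instruction: str, response: str) -> tuple[str, str]:
    """Hypothetical chat template, invented for this sketch."""
    prompt = f"### Instruction:\n{instruction}\n### Response:\n"
    return prompt, response

def build_training_row(instruction: str, response: str):
    """Return tokens plus a mask: 1 where loss applies (response), 0 elsewhere."""
    prompt, answer = format_example(instruction, response)
    prompt_tokens = prompt.split()
    answer_tokens = answer.split()
    tokens = prompt_tokens + answer_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(answer_tokens)
    return tokens, loss_mask

tokens, loss_mask = build_training_row("Summarize the ticket.", "Login fails after reset.")
for tok, m in zip(tokens, loss_mask):
    print(m, tok)
```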
Stage 5: preference tuning, RLHF, and DPO
Instruction tuning still leaves a problem: multiple answers can be plausible, but judges prefer only some of them.
Preference tuning adds comparison data. Humans or other judging systems rank outputs: answer A beats answer B. Several common acronyms appear here.
RL means Reinforcement Learning. In RL, a system learns behavior from rewards instead of only copying labeled examples.
RLHF means Reinforcement Learning from Human Feedback. In the common LLM pattern, humans compare candidate answers, a reward model learns those preferences, then RL nudges the language model toward higher-scoring behavior.
RM means Reward Model. It scores model outputs according to preference data.
PPO means Proximal Policy Optimization. It is one reinforcement-learning algorithm used in some RLHF pipelines.
DPO means Direct Preference Optimization. It uses preference pairs more directly and can skip a separate reinforcement-learning loop.
This stage shapes behavior: helpfulness, harmlessness, refusal style, concision, formatting, honesty about uncertainty, and avoidance of certain unsafe instructions.
Preference tuning does not make the model omniscient. It changes what the model tends to produce when several continuations are possible.
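The DPO idea can be made concrete for a single preference pair. This sketch assumes the sequence log-probabilities have already been computed for the policy and for a frozen reference model; `beta` and the example numbers are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of answers.

    Each argument is the log-probability of a full answer under the policy
    or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as log(1 + exp(-x)).
    return math.log1p(math.exp(-x))

neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # policy matches reference
better = dpo_loss(-8.0, -12.0, -10.0, -10.0)     # policy favors the chosen answer
worse = dpo_loss(-12.0, -8.0, -10.0, -10.0)      # policy favors the rejected answer
print(neutral, better, worse)
```

The loss falls when the policy raises the chosen answer relative to the reference more than it raises the rejected one, with no separate reward model in the loop.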
- SFT: copy good examples
- RM: score preferences
- RLHF / DPO: prefer better answers
- safety evals: probe failure modes
- serving policy: ship inside controls
Stage 6: evaluate, red-team, and ship
Before serving, a model needs evaluation. This includes benchmark tests, safety tests, jailbreak attempts, hallucination checks, coding tests, latency checks, regression tests, and product-specific acceptance tests.
Evaluation does not prove the model is safe. It gives evidence about known behaviors under known probes. This distinction matters. A model can pass a benchmark and still fail in a new deployment context.
The serving layer adds more machinery:
- model hosting
- batching and caching
- content filters
- system prompts
- tool calling
- retrieval
- rate limits
- observability
- incident response
At this point, the model has become part of a system. The system behavior is not just “the weights.” It is weights plus runtime policy, prompts, tools, retrieval, monitoring, and human approval paths.
The acronym map
The industry vocabulary gets easier once each acronym maps to one job.
| Acronym | Spelled out | Job |
|---|---|---|
| LLM | Large Language Model | The model family trained to predict and generate language tokens. |
| GPU | Graphics Processing Unit | Common accelerator for training and inference. |
| TPU | Tensor Processing Unit | Google accelerator for large matrix workloads. |
| SFT | Supervised Fine-Tuning | Teaches task-following from labeled examples. |
| RL | Reinforcement Learning | Learns behavior from rewards. |
| RLHF | Reinforcement Learning from Human Feedback | Uses human preference judgments to steer model behavior. |
| RM | Reward Model | Scores outputs during preference tuning. |
| PPO | Proximal Policy Optimization | Reinforcement-learning algorithm often associated with RLHF. |
| DPO | Direct Preference Optimization | Optimizes from preference pairs without a separate RL loop. |
| RAG | Retrieval-Augmented Generation | Pulls external documents into context before generation. |
| PEFT | Parameter-Efficient Fine-Tuning | Adapts a model by training only a small parameter subset. |
| LoRA | Low-Rank Adaptation | PEFT method using small trainable low-rank matrices. |
| QLoRA | Quantized Low-Rank Adaptation | LoRA plus quantization to reduce memory during tuning. |
| QA-LoRA | Quantization-Aware Low-Rank Adaptation | Quantization-aware LoRA path for efficient tuning and deployment. |
| LongLoRA | Long-context Low-Rank Adaptation | LoRA-style method for extending context length efficiently. |
| S-LoRA | Serving-focused LoRA system | Runtime system for serving many LoRA adapters concurrently. |
| X-LoRA | Mixture of LoRA experts | Routes through multiple LoRA adapter experts. |
| AdaLoRA | Adaptive Low-Rank Adaptation | Allocates rank budget across layers based on importance. |
| DoRA | Weight-Decomposed Low-Rank Adaptation | Splits magnitude and direction updates for stronger adaptation. |
| MoE | Mixture of Experts | Model architecture routing tokens through selected expert subnetworks. |
The key split:
Pretraining creates broad capability.
SFT creates task-following behavior.
RLHF or DPO creates preference-shaped behavior.
RAG supplies external facts at runtime.
LoRA supplies compact behavioral adaptation.
Tools supply action.
Agents supply loops around tools.
Where LoRA enters
Full fine-tuning updates many or all model weights. For a large model, this path costs memory, compute, storage, and operational complexity.
LoRA takes a different approach.
LoRA stands for Low-Rank Adaptation. The core idea: many fine-tuning changes fit inside much smaller matrices inserted alongside parts of the original model. The base model weights stay frozen. Training updates only the small adapter weights.
Instead of making a new full copy of the model for each adaptation, LoRA creates a compact patch.
base model weights: frozen
LoRA adapter: small trainable addition
runtime behavior: base model + adapter
The practical effect: adapting a model becomes cheaper and more portable. A team can train an adapter for a style, domain, classification pattern, or task behavior without paying the cost of full retraining.
LoRA can also compress stable instruction burden. If the same rubric, schema, label set, refusal boundary, or house rule appears in every prompt, an adapter can learn the pattern once instead of spending context tokens on it every run.
This is not free truth storage. The trade is clear:
prompt/RAG tokens -> adapter weights
runtime cost -> training, eval, and versioning cost
explicit context -> compiled behavior
The graph or retrieval layer should still hold cited truth, current facts, and provenance. LoRA should carry durable judgment patterns and stable instruction shape. Evals have to prove compression preserved the rule instead of distorting it.
LoRA sits inside the broader PEFT family. PEFT means Parameter-Efficient Fine-Tuning. The goal: adapt a large model while training far fewer parameters than full fine-tuning.
QLoRA means Quantized Low-Rank Adaptation. Quantization stores model numbers in lower precision, reducing memory pressure. QLoRA uses quantization plus LoRA so smaller hardware can fine-tune larger models.
- full fine-tuning: many weights change; highest cost
- LoRA: base frozen, adapter trained; portable patch
- QLoRA: quantized base, adapter trained; lower memory
LoRA variants: same idea, different pressure points
LoRA became a family of methods because teams hit different bottlenecks. Some need cheaper training. Some need longer context. Some need many adapters live at once. Some need better accuracy from the same adapter budget.
The practical map:
| Variant | Expanded name or plain-English meaning | Main problem |
|---|---|---|
| LoRA | Low-Rank Adaptation | Cheap task adaptation with frozen base weights. |
| QLoRA | Quantized Low-Rank Adaptation | Fit larger fine-tuning runs into less memory. |
| QA-LoRA | Quantization-Aware Low-Rank Adaptation | Fine-tune with quantization in mind from the start. |
| LongLoRA | Long-context LoRA | Extend context length without full expensive long-context tuning. |
| LongQLoRA | Long-context Quantized Low-Rank Adaptation | Combine long-context extension with QLoRA-style memory savings. |
| S-LoRA | Serving many LoRA adapters | Keep many adapters available at runtime with lower overhead. |
| X-LoRA | Mixture of LoRA adapter experts | Combine several adapter experts through routing. |
| AdaLoRA | Adaptive Low-Rank Adaptation | Spend more adapter rank where the model needs it most. |
| LoRA+ | LoRA with adjusted optimization rates | Improve learning dynamics for large-width models. |
| DoRA | Weight-Decomposed Low-Rank Adaptation | Adapt weight direction and magnitude more explicitly. |
QLoRA: memory pressure
QLoRA keeps the base model frozen and quantized, often around 4-bit precision, then trains LoRA adapters through it. The core result: fine-tuning a larger model becomes possible on less hardware because the frozen model consumes less memory.
Use QLoRA when the blocker is memory, not model choice. It does not magically improve the training data. It makes the adaptation run cheaper.
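The memory argument is mostly arithmetic. The sketch below counts weight storage only; it deliberately ignores quantization constants, activations, adapter weights, and optimizer state, so real runs need headroom beyond these numbers.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for model weights only, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):
    print(f"70B weights at {bits}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
# 70B weights at 16-bit: 140 GB
# 70B weights at 8-bit: 70 GB
# 70B weights at 4-bit: 35 GB
```

At 16-bit precision the frozen weights of a 70B model alone exceed a 128GB memory pool; at 4-bit they take roughly a quarter of that, which is the gap QLoRA exploits.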
QA-LoRA: quantization-aware adaptation
QA-LoRA means Quantization-Aware Low-Rank Adaptation. It treats quantization as part of the adaptation design rather than a final compression step. The goal is practical deployment: tune efficiently and land in a quantized model shape with less accuracy loss.
Use QA-LoRA when the final serving target is low-bit deployment and post-training compression risk matters.
LongLoRA and LongQLoRA: context length pressure
LongLoRA targets long-context fine-tuning. The problem is not “teach a new style.” The problem is adapting a model to handle longer sequences without paying the full cost of dense long-context training.
LongQLoRA combines long-context extension with quantized LoRA-style savings. The design pressure is clear: long context increases memory and compute, so quantization plus adapter training can keep the run practical.
Use these when the model needs longer documents, longer code files, or longer conversation state. Do not use them as a substitute for retrieval when the real problem is fresh external knowledge.
S-LoRA: serving pressure
S-LoRA is about runtime, not just training. A platform may have one base model and thousands of customer or task adapters. Loading and unloading adapters naively can create latency, memory fragmentation, and throughput problems.
S-LoRA focuses on serving many LoRA adapters concurrently. It matters for multi-tenant systems: one base model, many specialized adapters, many users.
Use S-LoRA patterns when adapter count and serving throughput become the problem.
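To make the serving pressure concrete, here is the naive baseline: a least-recently-used cache that keeps a few adapters loaded next to one base model. This is plain LRU caching, not S-LoRA's actual memory-management or batching design, and `load_fn` is a hypothetical loader.

```python
from collections import OrderedDict

class AdapterCache:
    """Keep at most `capacity` adapters loaded; evict the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn        # hypothetical loader: adapter_id -> weights
        self.loaded = OrderedDict()
        self.loads = 0                # counts expensive loads, for observability

    def get(self, adapter_id):
        if adapter_id in self.loaded:
            self.loaded.move_to_end(adapter_id)   # mark as recently used
            return self.loaded[adapter_id]
        if len(self.loaded) >= self.capacity:
            self.loaded.popitem(last=False)       # evict the oldest entry
        self.loads += 1
        self.loaded[adapter_id] = self.load_fn(adapter_id)
        return self.loaded[adapter_id]

cache = AdapterCache(capacity=2, load_fn=lambda adapter_id: f"weights:{adapter_id}")
for request in ["a", "b", "a", "c"]:
    cache.get(request)
print(list(cache.loaded), cache.loads)
```

Every miss in a cache like this pays a load, which is the latency and fragmentation cost that serving-focused systems are built to reduce.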
X-LoRA: routing pressure
X-LoRA treats adapters like experts. Instead of choosing one adapter for a whole task, the system can route through multiple low-rank adapter experts. This resembles a mixture-of-experts idea at the adapter layer.
Use X-LoRA when one model needs several specialized behaviors and a router can choose among them more effectively than one merged adapter.
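The routing idea can be sketched as a weighted mix of adapter contributions. This is a simplified illustration, assuming the gate scores arrive from some router and each adapter contributes a delta to one layer's output; it is not the X-LoRA paper's exact mechanism.

```python
import math

def softmax(xs):
    """Numerically stable softmax over gate scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mix_adapters(base_out, adapter_deltas, gate_scores):
    """Add each adapter's delta to the base output, weighted by the gate."""
    weights = softmax(gate_scores)
    mixed = list(base_out)
    for w, delta in zip(weights, adapter_deltas):
        for i, d in enumerate(delta):
            mixed[i] += w * d
    return mixed, weights

base_out = [1.0, 1.0]
adapter_deltas = [[2.0, 0.0], [0.0, 2.0]]   # two hypothetical adapter experts
mixed, weights = mix_adapters(base_out, adapter_deltas, gate_scores=[0.0, 0.0])
print(mixed, weights)
```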
AdaLoRA, LoRA+, and DoRA: adapter quality pressure
AdaLoRA means Adaptive Low-Rank Adaptation. Instead of giving every target layer the same rank budget, it reallocates rank based on importance. The goal is better use of a limited parameter budget.
LoRA+ changes optimization dynamics. The method uses different learning rates for the two LoRA matrices, targeting faster or better adaptation in wide models.
DoRA means Weight-Decomposed Low-Rank Adaptation. It separates weight magnitude and direction, then applies low-rank adaptation in a way closer to full fine-tuning behavior.
Use these when plain LoRA works operationally but leaves accuracy or convergence on the table.
The simple rule: LoRA variants are not a ladder from bad to good. They are answers to different bottlenecks.
memory bottleneck -> QLoRA, QA-LoRA
context bottleneck -> LongLoRA, LongQLoRA
serving bottleneck -> S-LoRA
multi-skill routing bottleneck -> X-LoRA
adapter quality bottleneck -> AdaLoRA, LoRA+, DoRA
What LoRA is good for
LoRA is useful when the target behavior is narrow enough to teach with examples.
Good fits:
- output format discipline
- domain-specific phrasing
- classification labels
- a recurring transformation
- product-specific tone
- narrow code or config patterns
- task behavior repeated across many examples
Weak fits:
- fresh facts changing daily
- large private knowledge bases
- questions requiring source citation
- broad new reasoning ability
- actions requiring live system state
- policy changing faster than adapter review
This boundary matters. LoRA changes model behavior. Retrieval changes visible context. Tools change system capability. These are separate levers.
LoRA versus RAG
LoRA and RAG often get confused because both can make a model feel more specialized.
RAG means retrieval-augmented generation. A system searches documents, pulls relevant chunks into context, and asks the model to answer using those chunks. The facts stay outside the model.
LoRA changes model weights through adapter training. The learned behavior moves into the adapter.
Use retrieval when the problem is knowledge access. Use LoRA when the problem is behavior shape.
Examples:
- “Answer from this current policy manual” -> retrieval.
- “Always produce a strict triage JSON object” -> LoRA may help.
- “Use this week’s product catalog” -> retrieval.
- “Classify support tickets into stable routing labels” -> LoRA may help.
- “Cite exact source passages” -> retrieval.
- “Adopt a recurring house style” -> LoRA may help.
The deeper version appears in LoRA plus RAG composition: the strongest systems often combine both, but the levers should stay mentally separate.
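The retrieval side can be sketched without any model at all. The example below assumes word-overlap ranking as a stand-in for real search (embeddings, rerankers) and uses made-up documents; the point is that facts enter through the prompt, not the weights.

```python
import re

def words(text):
    """Lowercased word set used for crude overlap scoring."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query; a stand-in for real search."""
    ranked = sorted(documents, key=lambda d: len(words(query) & words(d["text"])), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=2):
    """Place the retrieved chunks, with ids for citation, ahead of the question."""
    chunks = retrieve(query, documents, k)
    context = "\n".join(f"[{d['id']}] {d['text']}" for d in chunks)
    return f"Answer using only the sources below and cite their ids.\n{context}\nQuestion: {query}"

# Hypothetical documents for illustration.
documents = [
    {"id": "policy-1", "text": "Refunds are issued within 14 days of purchase."},
    {"id": "policy-2", "text": "Password resets require email verification."},
    {"id": "catalog-7", "text": "The standard plan includes five seats."},
]
query = "How many days do refunds take?"
print(build_prompt(query, documents, k=1))
```

Updating a policy here means editing a document, with no training run, which is why fast-changing facts belong on this side of the split.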
- RAG: retrieve documents into context; facts stay outside weights
- LoRA: train a compact adapter; behavior moves into adapter
What changes mathematically
Pretraining changes the full model parameter set. The model starts with random weights and gradient descent adjusts those weights across a very large dataset. Each update nudges the model toward lower prediction error.
SFT also updates weights, but the dataset looks like task examples instead of raw web-scale continuation. It moves the model from “continue this text” toward “respond to this instruction.”
RLHF adds an optimization target based on preference. A reward model approximates human preference, then the policy model moves toward higher reward while staying near the SFT model. PPO is one way to control this movement. DPO simplifies the setup by directly optimizing preference pairs.
LoRA assumes fine-tuning updates often have low intrinsic rank. Instead of changing a large weight matrix directly, LoRA adds two small trainable matrices whose product approximates the needed update. In simplified form:
original weight: W
full fine-tune: W changes directly
LoRA fine-tune: W stays frozen, plus small update A x B
The matrices A and B contain far fewer trainable values than W. This makes adapters smaller, cheaper to train, easier to swap, and easier to version. It also creates a clean operational boundary: one base model can carry several task adapters.
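A small numeric sketch shows both the update and the savings. It follows the A x B naming used above, with A of shape d_out x rank and B of shape rank x d_in; the dimensions, rank, and zero initialization of B are assumptions chosen so the adapter starts as a no-op.

```python
import random

def matmul(x, y):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def add(x, y):
    """Elementwise sum of two matrices with the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def effective_weight(W, A, B):
    """Weight used at runtime: frozen W plus the low-rank update A x B."""
    return add(W, matmul(A, B))

def lora_param_counts(d_out, d_in, rank):
    """Trainable values for full fine-tuning vs. a rank-`rank` adapter."""
    return d_out * d_in, d_out * rank + rank * d_in

rng = random.Random(0)
d_out, d_in, rank = 6, 8, 2
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen
A = [[rng.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_out)]  # trainable
B = [[0.0] * d_in for _ in range(rank)]                                 # trainable, starts at zero

print(effective_weight(W, A, B) == W)      # True: the adapter is a no-op before training
print(lora_param_counts(4096, 4096, 8))    # (16777216, 65536)
```

For a 4096 x 4096 matrix at rank 8, the adapter trains 65,536 values instead of about 16.8 million, a 256-fold reduction for that one matrix.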
MoE, short for Mixture of Experts, solves a different scaling problem. Instead of activating the whole model for every token, an MoE model routes tokens through selected expert subnetworks. It changes compute routing inside the model, which is a different concern from LoRA or RAG.
dataset -> trained_on -> adapter
adapter -> scored_by -> eval set
run -> produces -> score
score -> supports -> decision
decision -> gates -> deployment
Bonus: frontier terms as runtime adapters
Closed frontier APIs do not expose model weights. LoRA-style adaptation still has a useful cousin at runtime: named operating terms.
A term like red team carries a compact procedure. In security, it means adversarial testing to expose weaknesses before an attacker uses them. In model work, the same term usually triggers a nearby procedure: challenge assumptions, search for failure modes, stress boundaries, and propose fixes.
The term works because it compresses a pattern:
term -> procedure -> failure mode -> evidence expected
This makes a term catalog act like a soft adapter for API models:
frontier model
+ graph-backed term catalog
+ task-specific prompt pack
+ eval receipt
-> more consistent agent behavior
Useful terms have operational shape:
| Term | Procedure carried by the term |
|---|---|
| red team | Challenge assumptions and name exploit paths. |
| invariant | State the rule a system must preserve. |
| rubric | Score output against explicit criteria. |
| holdout | Test against examples outside the training or tuning set. |
| ablation | Remove one factor and measure the change. |
| rollback | Preserve a known-good return path. |
| provenance | Keep source, version, and decision lineage. |
| blast radius | Bound damage from a bad action. |
| sentinel | Watch for silent failure. |
| canary | Expose a small slice before broad release. |
The runtime split stays clean:
LoRA = compiled behavior in adapter weights
term pack = explicit behavior in runtime context
graph = canonical meaning and relationships
MCP = governed retrieval surface
evals = proof the term became procedure
Term packs are weaker than weights because they still spend context. They are stronger than vibes because agents can retrieve, cite, execute, and score them. A proven term pack can later become SFT or LoRA data for a local model.
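A term pack can be as simple as a lookup that expands named terms into explicit procedures before a request goes out. The sketch below reuses three entries from the table above; the function name and output layout are hypothetical.

```python
TERM_CATALOG = {
    "red team": "Challenge assumptions and name exploit paths.",
    "invariant": "State the rule a system must preserve.",
    "rollback": "Preserve a known-good return path.",
}

def build_term_pack(terms, catalog):
    """Expand selected terms into prompt text; fail loudly on unknown terms."""
    missing = [t for t in terms if t not in catalog]
    if missing:
        raise KeyError(f"terms not in catalog: {missing}")
    lines = [f"- {t}: {catalog[t]}" for t in terms]
    return "Operating terms:\n" + "\n".join(lines)

pack = build_term_pack(["red team", "rollback"], TERM_CATALOG)
print(pack)
```

Failing on an unknown term keeps the catalog canonical: a term with no recorded procedure never reaches the model as if it had one.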
The one-page mental model
The failure mode is treating all of these stages and levers as one blob called “training.”
This makes architecture decisions worse. Fresh facts get fine-tuned into adapters when retrieval would be safer. Stable behavior gets shoved into prompts when an adapter would be cleaner. Tool authority hides inside a model discussion when it belongs in system design.
Model building has stages. Each stage changes a different part of the system. LoRA is one useful stage-adjacent lever, not a miniature version of building GPT from scratch.
The companion Learn piece, “Prompt, context, fine-tune, gate,” maps those stages back onto the Determinism Ladder.
Sources
- Vaswani et al., Attention Is All You Need
- Ouyang et al., Training language models to follow instructions with human feedback
- Hu et al., LoRA: Low-Rank Adaptation of Large Language Models
- Dettmers et al., QLoRA: Efficient Finetuning of Quantized LLMs
- Xu et al., QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
- Chen et al., LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
- Sheng et al., S-LoRA: Serving Thousands of Concurrent LoRA Adapters
- Buehler and Buehler, X-LoRA: Mixture of Low-Rank Adapter Experts
- Zhang et al., AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
- Hayou et al., LoRA+: Efficient Low Rank Adaptation of Large Models
- Liu et al., DoRA: Weight-Decomposed Low-Rank Adaptation
- AllenAI, Dolma dataset
- AllenAI, Dolma corpus paper
- AllenAI, C4 dataset
- Common Pile team, Common Pile v0.1 dataset collection
- NVIDIA, DGX Spark product page
- NVIDIA, DGX Spark hardware overview
- Apple, Mac Studio technical specifications
- Qwen, Qwen3.6-35B-A3B model card
- NVIDIA, Nemotron models
- NVIDIA, Nemotron 3 research page
Axioms touched
Lighter touch than the Learn series: primer pieces don't usually lean heavily on the axiom catalog, but where they do it's noted.
- #2 Push work down toward determinism (held): Separates base-model training, adaptation, retrieval, and serving into distinct boundaries.
- #11 Cite or be silent (held): Grounds transformer, instruction tuning, preference tuning, and LoRA in primary papers.
- #13 Ship with the failure mode named (held): Names the failure mode: treating all model improvement methods as the same kind of training.
