2026-05-17 · 13 min read

LLM construction stages, from pretraining to LoRA

A language model moves through stages: pretraining, supervised tuning, preference tuning, evaluation, serving, retrieval, and adapter training. LoRA enters as a compact adaptation layer after the expensive base model exists.

Tags: demystify, llm, training, lora, fine-tuning, rlhf, rag, transformers, primer

Most confusion about LLMs starts with one overloaded verb: train.

A model gets “trained” during trillion-token pretraining. A chat model gets “trained” again during supervised instruction tuning. A preference stage may steer it through RL, short for Reinforcement Learning, or DPO, short for Direct Preference Optimization. A company may “train” an adapter with LoRA, short for Low-Rank Adaptation. A product team may say “train” when it really means adding documents to RAG, short for Retrieval-Augmented Generation.

Those are different operations with different cost, risk, and control.

The simple allegory: a generalist becoming a specialist

Picture a generalist moving into a specialist role.

First comes broad reading. The person reads books, articles, code, worked examples, arguments, jokes, and documentation. This stage builds general language sense.

Then comes instruction practice. The person learns the expected response format: answer the question, show work, summarize clearly, refuse unsafe requests, and follow directions.

Then comes review. Judges compare answers and mark better ones. The person starts preferring clear, useful, safer answers over messy ones.

Then comes local specialization. The general capability already exists, but a small notebook of local rules helps with one setting: house style, ticket labels, response format, or domain phrasing.

LoRA is like the local notebook. It does not create the generalist from scratch. It adds a compact specialty layer after broad capability already exists.

The pipeline

The useful picture is a staged pipeline.

raw data
  -> filtering and deduplication
  -> tokenization
  -> pretraining
  -> SFT: Supervised Fine-Tuning
  -> preference tuning: RLHF, DPO, or related methods
  -> safety evaluation
  -> serving and monitoring
  -> RAG, LoRA, tools, and agents

LoRA belongs near the end. It is not how most foundation models are born. It is a way to adapt an already-trained model without updating every weight inside it.

LLM build pipeline

raw data -> filter -> tokens -> pretrain -> SFT -> RLHF / DPO -> evals -> serve -> RAG / LoRA / tools

The attainable version: downloading data and training something

This whole process can sound sealed inside frontier labs. The practical truth is more useful: public training data exists, public training code exists, and small models can train on ordinary developer hardware or rented cloud GPUs.

The first successful local run should not aim at a frontier model. It should aim at contact with the machinery.

download a small text dataset
train a tokenizer
train a tiny language model
watch loss go down
sample text
fine-tune a small model
train a LoRA adapter
compare outputs

The same pipeline can run at small scale.

Public corpora make the first handle concrete. C4, short for Colossal Clean Crawled Corpus, comes from cleaned Common Crawl web text. Dolma is an open three-trillion-token corpus released for OLMo-style pretraining research. Common Pile focuses on openly licensed and public-domain text. Hugging Face hosts many smaller datasets suitable for experiments.

Scale still matters. Training a tiny model teaches the mechanics. Training a useful domain model requires careful data, evaluation, and compute. Training a frontier model requires industrial infrastructure.

The practical budget ladder looks roughly like this:

Laptop or small cloud box: tokenizer practice, tiny models, small fine-tunes, data cleaning drills.
Single rented GPU or small GPU box: serious LoRA or QLoRA work on open models, task adapters, classification behavior, format control, domain tone.
$5k-class local AI workstation: DGX Spark, high-memory Mac Studio, or similar machines can make local adapter work feel operational instead of academic.
Several GPUs over days or weeks: small-model pretraining, domain-specific continued pretraining, stronger SFT runs, more credible evaluation.
Tens of thousands of dollars: plausible company pilot for a small or mid-sized domain model experiment, especially when the goal is not frontier capability. This budget can buy data curation, GPU time, repeated runs, evaluation, and deployment hardening.
Millions and up: frontier-scale pretraining, broad assistant capability, large safety programs, heavy infrastructure, and repeated failed runs.

The attainable goal is not “build GPT in a weekend.” The attainable goal is: run the same class of process at a small scale, then understand why pretraining, SFT, RAG, LoRA, evaluation, and serving remain separate levers.

Compute ladder

laptop -> rented GPU -> DGX Spark / Mac Studio -> small cluster -> frontier lab

The workstation LoRA path as of May 2026

As of May 2026, a machine such as NVIDIA DGX Spark or a high-memory Mac Studio does not turn local hardware into a frontier lab, but it does make serious adapter work reachable.

NVIDIA positions DGX Spark as a compact Grace Blackwell machine with 128GB unified memory. NVIDIA says the box can fine-tune models up to the 70B class locally. LoRA and QLoRA make this kind of claim practical: freeze the base model, train a small adapter, evaluate, repeat.
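A back-of-the-envelope check shows why quantization plus adapters makes the 70B-class claim plausible on a 128GB machine. The sketch below counts weight memory only. The function name is illustrative, and activations, adapter weights, optimizer state, and quantization metadata all add overhead that this arithmetic ignores.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# Compare 16-bit storage against 4-bit storage for a 70B-parameter model.
for bits in (16, 4):
    print(f"70B at {bits}-bit: ~{weight_memory_gb(70, bits):.0f} GB")
```

At 16 bits the weights alone come to roughly 140 GB, which exceeds 128GB. At 4 bits they come to roughly 35 GB, which leaves room for the rest of the training state.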

Apple’s M3 Ultra Mac Studio class has a different shape. It offers very large unified-memory configurations, strong local developer ergonomics, and the MLX software path for Apple silicon. It is less CUDA-native than the NVIDIA path, but the memory pool makes local model loading and adapter experiments realistic.

For this tier, the examples stop being toy models:

Qwen3 dense models in the 4B, 8B, 14B, and 32B range.
Qwen3.6-style MoE models such as a 35B-A3B class model, where only a smaller active slice participates per token.
NVIDIA Nemotron Nano and Nemotron 3 Nano class models, including small dense models and 30B-A3B style MoE models.
Embedding and reranker models for RAG systems, where local fine-tuning can matter as much as chat fine-tuning.

This is a credible company pilot shape:

choose an open base model
collect 500 to 10,000 high-quality examples
train a LoRA or QLoRA adapter locally
evaluate against held-out cases
compare against prompting and RAG
keep the adapter only if it beats the simpler path
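The last two steps of the pilot can be sketched as a small decision gate. This is a minimal illustration, assuming exact-match accuracy is a good enough metric for the task. The `margin` threshold, the function names, and the lambda stand-ins for real systems are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input text, expected label)

def accuracy(predict: Callable[[str], str], held_out: List[Example]) -> float:
    """Fraction of held-out examples where the prediction matches exactly."""
    hits = sum(1 for text, expected in held_out if predict(text) == expected)
    return hits / len(held_out)

def choose_path(candidates: Dict[str, Callable[[str], str]],
                held_out: List[Example],
                adapter_name: str = "adapter",
                margin: float = 0.02) -> str:
    """Keep the adapter only if it beats every simpler path by `margin`."""
    scores = {name: accuracy(fn, held_out) for name, fn in candidates.items()}
    best_simple = max(s for name, s in scores.items() if name != adapter_name)
    if scores[adapter_name] >= best_simple + margin:
        return adapter_name
    return max((n for n in scores if n != adapter_name), key=scores.get)

# Hypothetical stand-ins for a prompted base model and an adapter-tuned model.
held_out = [("refund please", "billing"), ("app crashes", "bug"),
            ("card declined", "billing"), ("login loop", "bug")]
prompt_only = lambda text: "billing"
adapter = lambda text: "billing" if ("refund" in text or "card" in text) else "bug"

decision = choose_path({"prompt": prompt_only, "adapter": adapter}, held_out)
```

The margin encodes the point of the last step: an adapter that merely ties the simpler path adds training, evaluation, and versioning cost without earning it.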

Workstation LoRA pilot loop

examples -> adapter train -> held-out eval -> compare -> keep or discard -> data fix, then repeat

The practical output is not a new foundation model. It is a local, inspectable adapter proving one bounded behavior: ticket routing, structured extraction, house-style rewriting, policy classification, code review labeling, or domain-specific response format.

This matters because a $5k-class box can sit inside a small team. Experiments stop waiting on procurement, cluster slots, or vendor tickets. The learning loop tightens: data issue, adapter run, evaluation failure, data fix, adapter run again.

Stage 1: collect and clean the data

The earliest stage looks less glamorous than the demo. A model starts with a large text and code corpus: web pages, books, articles, documentation, forums, repositories, math data, synthetic data, licensed collections, and internal datasets where applicable.

Training systems cannot simply pour raw data into a model. Duplicates distort training. Low-quality pages teach low-quality patterns. Private data creates legal and security risk. Toxic or spammy content changes model behavior. Code with secrets creates a different failure class.

So the first stage handles filtering, deduplication, classification, and mixture design. The model is not learning yet; the training team assembles the diet.

The important tradeoff: data quality becomes model behavior later. A model can only learn patterns present in its training mixture, and it will inherit some unwanted patterns unless filtering, evaluation, and post-training catch them.
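A toy version of the filtering and deduplication step might look like the sketch below. It assumes exact-match deduplication after whitespace and case normalization, plus a crude repetition heuristic. Both the thresholds and the function names are illustrative. Production pipelines typically add near-duplicate detection, license checks, and learned quality classifiers.

```python
import hashlib
import re
from typing import Iterable, List

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def looks_low_quality(text: str, min_words: int = 5) -> bool:
    """Toy quality rule: too short, or dominated by repeated words."""
    words = normalize(text).split()
    if len(words) < min_words:
        return True
    return len(set(words)) / len(words) < 0.3

def clean_corpus(docs: Iterable[str]) -> List[str]:
    """Drop exact duplicates (after normalization) and low-quality documents."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen or looks_low_quality(doc):
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

raw = [
    "The model predicts the next token in a sequence.",
    "the model   predicts the next token in a sequence.",  # duplicate after normalization
    "buy buy buy buy buy buy buy buy",                      # spammy repetition
    "ok",                                                   # too short
]
cleaned = clean_corpus(raw)
```

Only the first document survives: the second hashes to the same digest, and the last two fail the quality rule.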

Training data funnel

collect: web, code, docs, books
dedupe: remove repeated text
classify: quality, license, risk
mix: balance domains
train: corpus enters run

Stage 2: turn text into tokens

Models do not read words exactly the way people do. Text gets split into tokens: chunks of characters, words, or word pieces. A common word may become one token. A rare word may become several. Code, punctuation, and whitespace also become tokens.

Tokenization matters because the model predicts tokens, not ideas directly. The training task is mechanically simple:

given previous tokens -> predict the next token

This simple objective scales. With enough data, model size, and compute, the next-token task forces the model to learn grammar, facts, style, code structure, reasoning patterns, and many statistical regularities of language.
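The objective can be made concrete with a toy tokenizer. The sketch below splits on words and punctuation, which is a simplification: real tokenizers usually learn subword pieces from data. The function names are illustrative.

```python
import re
from typing import Dict, List, Tuple

def tokenize(text: str) -> List[str]:
    """Toy tokenizer: words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(tokens: List[str]) -> Dict[str, int]:
    """Assign each distinct token an integer id in order of first appearance."""
    vocab: Dict[str, int] = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def next_token_pairs(ids: List[int]) -> List[Tuple[List[int], int]]:
    """Each prefix of the sequence becomes a context; the next id is the target."""
    return [(ids[:i], ids[i]) for i in range(1, len(ids))]

tokens = tokenize("The model predicts the next token.")
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
pairs = next_token_pairs(ids)
```

Seven tokens yield six training targets. Every position after the first supplies one "given previous tokens, predict the next token" example.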

This is why a database metaphor for LLMs holds only loosely: the model does not store rows in a table. It stores pattern weights for likely continuations.

Tokenization sketch

The model predicts the next token.

[The] [model] [predicts] [the] [next] [token] [.]

text becomes chunks; chunks become prediction targets

Stage 3: pretrain the base model

Pretraining is the expensive stage.

A transformer model starts with billions of mostly random parameters. During training, it reads token sequences and predicts masked or next tokens depending on the objective. When the prediction is wrong, the training system adjusts the model weights slightly. Repeat this at enormous scale.
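The predict, measure, adjust cycle can be shown at toy scale. The sketch below trains a bigram model (each word predicts the next from a lookup table of scores), not a transformer, and it starts from all-zero scores rather than random ones so the run is reproducible. The corpus, learning rate, and step count are arbitrary choices for illustration.

```python
import math
from typing import List

def softmax(row: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def mean_loss(logits: List[List[float]], ids: List[int]) -> float:
    """Average cross-entropy of predicting each token from the one before it."""
    pairs = list(zip(ids[:-1], ids[1:]))
    return -sum(math.log(softmax(logits[p])[n]) for p, n in pairs) / len(pairs)

def train_step(logits: List[List[float]], ids: List[int], lr: float = 0.5) -> None:
    """One pass of stochastic gradient descent over all (previous, next) pairs."""
    for prev, nxt in zip(ids[:-1], ids[1:]):
        probs = softmax(logits[prev])
        for j in range(len(probs)):
            grad = probs[j] - (1.0 if j == nxt else 0.0)  # d(loss)/d(logit)
            logits[prev][j] -= lr * grad

text = "the model predicts the next token and the model learns"
words = text.split()
vocab = {w: i for i, w in enumerate(sorted(set(words)))}
ids = [vocab[w] for w in words]

logits = [[0.0] * len(vocab) for _ in vocab]  # untrained: uniform predictions
loss_before = mean_loss(logits, ids)
for _ in range(50):
    train_step(logits, ids)
loss_after = mean_loss(logits, ids)
```

The loss starts at the uniform-guess value, the log of the vocabulary size, and falls as the table absorbs which words tend to follow which. Pretraining is this loop with a far larger model, dataset, and hardware budget.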

The transformer architecture matters because attention lets tokens relate to other tokens across a context window. A model can connect a variable name to its later use, a pronoun to an earlier noun, or a requirement to a later implementation detail. The original Transformer paper made attention the central mechanism.
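A minimal sketch of that mechanism, assuming a single query, a single head, and no causal mask: the query is scored against every key, the scores become weights through a softmax, and the output is the weighted mix of values. The vectors are made-up numbers chosen only to show the weighting.

```python
import math
from typing import List, Tuple

Vector = List[float]

def softmax(scores: Vector) -> Vector:
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Tuple[Vector, Vector]:
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Three earlier positions; the query aligns with the first and third keys.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attend([4.0, 0.0], keys, values)
```

The second position receives little weight because its key points away from the query. That is the mechanical version of a token "relating to" some earlier tokens more than others.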

The output of pretraining is a base model. It knows many patterns. It can complete text. It may know facts. It can imitate many registers. But it is not necessarily a good assistant.

Base models complete prompts. Assistant models follow instructions.

Pretraining loop

sample tokens -> predict next token -> measure loss -> adjust weights -> repeat

Base model versus assistant model

base model: continues text (Question: ... Answer:)
assistant model: follows task shape (Instruction -> answer)

Stage 4: SFT, or Supervised Fine-Tuning

SFT means Supervised Fine-Tuning. It teaches the model the shape of helpful interaction.

Instead of only predicting arbitrary next text, the model trains on task-shaped examples:

instruction -> good answer
question -> grounded explanation
bug report -> useful diagnosis
policy -> compliant refusal or allowed response

This stage often uses human-written examples, curated data, synthetic examples, or mixtures of all three. The goal is not to teach every fact again. The goal is to shift the pretrained model toward following instructions in a recognizable useful format.

SFT changes the interface contract. A base model might continue a prompt in character. An instruction-tuned model should answer the task.
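One way to picture an SFT training example is as a token sequence plus a mask that says where the loss applies. The sketch below is an assumption-laden illustration: the `<instruction>`, `<answer>`, and `<end>` markers are hypothetical, whitespace splitting stands in for a real tokenizer, and masking the instruction out of the loss is a common convention rather than something this article specifies.

```python
from typing import List, Tuple

def build_sft_example(instruction: str, answer: str) -> Tuple[List[str], List[int]]:
    """Return (tokens, loss_mask) where the mask is 1 only on answer tokens."""
    prompt_tokens = ["<instruction>"] + instruction.split() + ["<answer>"]
    answer_tokens = answer.split() + ["<end>"]
    tokens = prompt_tokens + answer_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(answer_tokens)
    return tokens, loss_mask

tokens, loss_mask = build_sft_example(
    "Summarize the bug report in one sentence.",
    "Login fails after password reset.",
)
```

The objective is still next-token prediction. What changes is the data shape and which positions count toward the loss.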

Stage 5: preference tuning, RLHF, and DPO

Instruction tuning still leaves a problem: multiple answers can be plausible, but judges prefer only some of them.

Preference tuning adds comparison data. Humans or other judging systems rank outputs: answer A beats answer B. Several common acronyms appear here.

RL means Reinforcement Learning. In RL, a system learns behavior from rewards instead of only copying labeled examples.

RLHF means Reinforcement Learning from Human Feedback. In the common LLM pattern, humans compare candidate answers, a reward model learns those preferences, then RL nudges the language model toward higher-scoring behavior.

RM means Reward Model. It scores model outputs according to preference data.

PPO means Proximal Policy Optimization. It is one reinforcement-learning algorithm used in some RLHF pipelines.

DPO means Direct Preference Optimization. It uses preference pairs more directly and can skip a separate reinforcement-learning loop.

This stage shapes behavior: helpfulness, harmlessness, refusal style, concision, formatting, honesty about uncertainty, and avoidance of certain unsafe instructions.

Preference tuning does not make the model omniscient. It changes what the model tends to produce when several continuations are possible.
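The reward model's training signal can be sketched with a pairwise loss. The form below, negative log-sigmoid of the score gap, is a common formulation for learning from "A beats B" comparisons. The scores are made-up numbers, and a real reward model would compute them with a neural network.

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log sigmoid(score_chosen - score_rejected)."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))

agree = pairwise_preference_loss(2.0, 0.0)      # reward model agrees with the judge
undecided = pairwise_preference_loss(1.0, 1.0)  # no preference expressed
disagree = pairwise_preference_loss(0.0, 2.0)   # reward model prefers the wrong answer
```

The loss is smallest when the reward model scores the judged-better answer higher, so minimizing it pushes the scores toward the human ranking.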

Post-training stack

SFT: copy good examples
RM: score preferences
RLHF / DPO: prefer better answers
safety evals: probe failure modes
serving policy: ship inside controls

Stage 6: evaluate, red-team, and ship

Before serving, a model needs evaluation. This includes benchmark tests, safety tests, jailbreak attempts, hallucination checks, coding tests, latency checks, regression tests, and product-specific acceptance tests.

Evaluation does not prove the model is safe. It gives evidence about known behaviors under known probes. This distinction matters. A model can pass a benchmark and still fail in a new deployment context.

The serving layer adds more machinery:

model hosting
batching and caching
content filters
system prompts
tool calling
retrieval
rate limits
observability
incident response

At this point, the model has become part of a system. The system behavior is not just “the weights.” It is weights plus runtime policy, prompts, tools, retrieval, monitoring, and human approval paths.

The acronym map

The industry vocabulary gets easier once each acronym maps to one job.

Acronym	Spelled out	Job
LLM	Large Language Model	The model family trained to predict and generate language tokens.
GPU	Graphics Processing Unit	Common accelerator for training and inference.
TPU	Tensor Processing Unit	Google accelerator for large matrix workloads.
SFT	Supervised Fine-Tuning	Teaches task-following from labeled examples.
RL	Reinforcement Learning	Learns behavior from rewards.
RLHF	Reinforcement Learning from Human Feedback	Uses human preference judgments to steer model behavior.
RM	Reward Model	Scores outputs during preference tuning.
PPO	Proximal Policy Optimization	Reinforcement-learning algorithm often associated with RLHF.
DPO	Direct Preference Optimization	Optimizes from preference pairs without a separate RL loop.
RAG	Retrieval-Augmented Generation	Pulls external documents into context before generation.
PEFT	Parameter-Efficient Fine-Tuning	Adapts a model by training only a small parameter subset.
LoRA	Low-Rank Adaptation	PEFT method using small trainable low-rank matrices.
QLoRA	Quantized Low-Rank Adaptation	LoRA plus quantization to reduce memory during tuning.
QA-LoRA	Quantization-Aware Low-Rank Adaptation	Quantization-aware LoRA path for efficient tuning and deployment.
LongLoRA	Long-context Low-Rank Adaptation	LoRA-style method for extending context length efficiently.
S-LoRA	Serving-focused LoRA system	Runtime system for serving many LoRA adapters concurrently.
X-LoRA	Mixture of LoRA experts	Routes through multiple LoRA adapter experts.
AdaLoRA	Adaptive Low-Rank Adaptation	Allocates rank budget across layers based on importance.
DoRA	Weight-Decomposed Low-Rank Adaptation	Splits magnitude and direction updates for stronger adaptation.
MoE	Mixture of Experts	Model architecture routing tokens through selected expert subnetworks.

The key split:

Pretraining creates broad capability.
SFT creates task-following behavior.
RLHF or DPO creates preference-shaped behavior.
RAG supplies external facts at runtime.
LoRA supplies compact behavioral adaptation.
Tools supply action.
Agents supply loops around tools.

Where LoRA enters

Full fine-tuning updates many or all model weights. For a large model, this path costs memory, compute, storage, and operational complexity.

LoRA takes a different approach.

LoRA stands for Low-Rank Adaptation. The core idea: many fine-tuning changes fit inside much smaller matrices inserted alongside parts of the original model. The base model weights stay frozen. Training updates only the small adapter weights.

Instead of making a new full copy of the model for each adaptation, LoRA creates a compact patch.

base model weights: frozen
LoRA adapter: small trainable addition
runtime behavior: base model + adapter

The practical effect: adapting a model becomes cheaper and more portable. A team can train an adapter for a style, domain, classification pattern, or task behavior without paying the cost of full retraining.
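The frozen-base-plus-adapter arrangement can be sketched in a few lines. This is an illustration in plain Python, not a library implementation. It uses this article's ordering, where the update is the product A x B, with A random and B zero at the start so the adapter initially changes nothing. The dimensions, seed, and `scale` parameter are arbitrary assumptions.

```python
import random
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x: Matrix, W: Matrix, A: Matrix, B: Matrix, scale: float = 1.0) -> Matrix:
    """y = x W + scale * (x A) B, with W frozen and only A, B trainable."""
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    return [[b + scale * u for b, u in zip(brow, urow)] for brow, urow in zip(base, update)]

random.seed(0)
d_in, d_out, rank = 6, 4, 2
W = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]    # frozen base
A = [[random.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]  # trainable
B = [[0.0] * d_out for _ in range(rank)]                                  # trainable, zero at start
x = [[1.0] * d_in]

y_start = lora_forward(x, W, A, B)
B[0][0] = 0.5  # pretend training moved one adapter value
y_after = lora_forward(x, W, A, B)
```

Before any training the output equals the base model's output exactly. Afterwards only the adapter values differ, which is why an adapter can ship as a small file next to an unchanged base.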

LoRA can also compress stable instruction burden. If the same rubric, schema, label set, refusal boundary, or house rule appears in every prompt, an adapter can learn the pattern once instead of spending context tokens on it every run.

This is not free truth storage. The trade is clear:

prompt/RAG tokens -> adapter weights
runtime cost -> training, eval, and versioning cost
explicit context -> compiled behavior

The graph or retrieval layer should still hold cited truth, current facts, and provenance. LoRA should carry durable judgment patterns and stable instruction shape. Evals have to prove compression preserved the rule instead of distorting it.

LoRA sits inside the broader PEFT family. PEFT means Parameter-Efficient Fine-Tuning. The goal: adapt a large model while training far fewer parameters than full fine-tuning.

QLoRA means Quantized Low-Rank Adaptation. Quantization stores model numbers in lower precision, reducing memory pressure. QLoRA uses quantization plus LoRA so smaller hardware can fine-tune larger models.

LoRA as a sidecar adapter

frozen base model W + trainable adapter A x B -> base + adapter behavior

Fine-tuning options

full fine-tune: many weights change; highest cost
LoRA: base frozen, adapter trained; portable patch
QLoRA: quantized base, adapter trained; lower memory

LoRA variants: same idea, different pressure points

LoRA became a family of methods because teams hit different bottlenecks. Some need cheaper training. Some need longer context. Some need many adapters live at once. Some need better accuracy from the same adapter budget.

The practical map:

LoRA variant map

LoRA
QLoRA: memory
QA-LoRA: quantized deploy
LongLoRA: long context
S-LoRA: many adapters
X-LoRA: adapter routing
AdaLoRA: rank budget
LoRA+: optimizer pressure
DoRA: quality

Variant	Expanded name or plain-English meaning	Main problem
LoRA	Low-Rank Adaptation	Cheap task adaptation with frozen base weights.
QLoRA	Quantized Low-Rank Adaptation	Fit larger fine-tuning runs into less memory.
QA-LoRA	Quantization-Aware Low-Rank Adaptation	Fine-tune with quantization in mind from the start.
LongLoRA	Long-context LoRA	Extend context length without full expensive long-context tuning.
LongQLoRA	Long-context Quantized Low-Rank Adaptation	Combine long-context extension with QLoRA-style memory savings.
S-LoRA	Serving many LoRA adapters	Keep many adapters available at runtime with lower overhead.
X-LoRA	Mixture of LoRA adapter experts	Combine several adapter experts through routing.
AdaLoRA	Adaptive Low-Rank Adaptation	Spend more adapter rank where the model needs it most.
LoRA+	LoRA with adjusted optimization rates	Improve learning dynamics for large-width models.
DoRA	Weight-Decomposed Low-Rank Adaptation	Adapt weight direction and magnitude more explicitly.

QLoRA: memory pressure

QLoRA keeps the base model frozen and quantized, often around 4-bit precision, then trains LoRA adapters through it. The core result: fine-tuning a larger model becomes possible on less hardware because the frozen model consumes less memory.

Use QLoRA when the blocker is memory, not model choice. It does not magically improve the training data. It makes the adaptation run cheaper.
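The memory saving comes from storing each frozen weight as a small integer code plus a shared scale. The sketch below uses simple symmetric absmax rounding to 4-bit codes, which is a simplification: the QLoRA paper uses a specialized 4-bit data type (NF4) and further tricks not shown here. The function names and sample weights are illustrative.

```python
from typing import List, Tuple

def quantize_4bit(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-7, 7] using one absmax scale per block."""
    absmax = max(abs(w) for w in weights) or 1.0
    scale = absmax / 7
    return [round(w / scale) for w in weights], scale

def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floats for use in the forward pass."""
    return [c * scale for c in codes]

block = [0.70, -0.35, 0.10, 0.02, -0.66, 0.41]
codes, scale = quantize_4bit(block)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))
```

The restored weights are close but not identical to the originals. The adapter trains in higher precision on top of that approximation, which is why evaluation after a quantized run still matters.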

QA-LoRA: quantization-aware adaptation

QA-LoRA means Quantization-Aware Low-Rank Adaptation. It treats quantization as part of the adaptation design rather than a final compression step. The goal is practical deployment: tune efficiently and land in a quantized model shape with less accuracy loss.

Use QA-LoRA when the final serving target is low-bit deployment and post-training compression risk matters.

LongLoRA and LongQLoRA: context length pressure

LongLoRA targets long-context fine-tuning. The problem is not “teach a new style.” The problem is adapting a model to handle longer sequences without paying the full cost of dense long-context training.

LongQLoRA combines long-context extension with quantized LoRA-style savings. The design pressure is clear: long context increases memory and compute, so quantization plus adapter training can keep the run practical.

Use these when the model needs longer documents, longer code files, or longer conversation state. Do not use them as a substitute for retrieval when the real problem is fresh external knowledge.

S-LoRA: serving pressure

S-LoRA is about runtime, not just training. A platform may have one base model and thousands of customer or task adapters. Loading and unloading adapters naively can create latency, memory fragmentation, and throughput problems.

S-LoRA focuses on serving many LoRA adapters concurrently. It matters for multi-tenant systems: one base model, many specialized adapters, many users.

Use S-LoRA patterns when adapter count and serving throughput become the problem.
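The serving shape, one shared base plus a per-request adapter choice, can be illustrated with a toy registry. This is not how S-LoRA is implemented: its contribution lies in memory management and batching across adapters, which this sketch does not attempt. The class, the tenant names, and the rank-1 adapters are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]

def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))

class AdapterServer:
    """One shared base weight vector, many rank-1 adapters selected per request."""

    def __init__(self, base: Vector):
        self.base = base
        self.adapters: Dict[str, Tuple[Vector, float]] = {}

    def register(self, name: str, direction: Vector, gain: float) -> None:
        self.adapters[name] = (direction, gain)

    def score(self, x: Vector, adapter: str = "") -> float:
        """Base output plus the selected adapter's low-rank correction, if any."""
        out = dot(x, self.base)
        if adapter:
            direction, gain = self.adapters[adapter]
            out += dot(x, direction) * gain
        return out

server = AdapterServer(base=[1.0, 0.0, -1.0])
server.register("tenant_a", direction=[0.0, 1.0, 0.0], gain=2.0)
server.register("tenant_b", direction=[1.0, 1.0, 1.0], gain=-0.5)

x = [1.0, 2.0, 3.0]
results = {name: server.score(x, name) for name in ("", "tenant_a", "tenant_b")}
```

The base weights are loaded once and never copied. Each tenant adds only its own small adapter, so the hard part at scale is keeping many of them resident and batched efficiently.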

X-LoRA: routing pressure

X-LoRA treats adapters like experts. Instead of choosing one adapter for a whole task, the system can route through multiple low-rank adapter experts. This resembles a mixture-of-experts idea at the adapter layer.

Use X-LoRA when one model needs several specialized behaviors and a router can choose among them more effectively than one merged adapter.

AdaLoRA, LoRA+, and DoRA: adapter quality pressure

AdaLoRA means Adaptive Low-Rank Adaptation. Instead of giving every target layer the same rank budget, it reallocates rank based on importance. The goal is better use of a limited parameter budget.

LoRA+ changes optimization dynamics. The method uses different learning rates for the two LoRA matrices, targeting faster or better adaptation in wide models.

DoRA means Weight-Decomposed Low-Rank Adaptation. It separates weight magnitude and direction, then applies low-rank adaptation in a way closer to full fine-tuning behavior.

Use these when plain LoRA works operationally but leaves accuracy or convergence on the table.

The simple rule: LoRA variants are not a ladder from bad to good. They are answers to different bottlenecks.

memory bottleneck -> QLoRA, QA-LoRA
context bottleneck -> LongLoRA, LongQLoRA
serving bottleneck -> S-LoRA
multi-skill routing bottleneck -> X-LoRA
adapter quality bottleneck -> AdaLoRA, LoRA+, DoRA

What LoRA is good for

LoRA is useful when the target behavior is narrow enough to teach with examples.

Good fits:

output format discipline
domain-specific phrasing
classification labels
a recurring transformation
product-specific tone
narrow code or config patterns
task behavior repeated across many examples

Weak fits:

fresh facts changing daily
large private knowledge bases
questions requiring source citation
broad new reasoning ability
actions requiring live system state
policy changing faster than adapter review

This boundary matters. LoRA changes model behavior. Retrieval changes visible context. Tools change system capability. These are separate levers.

LoRA versus RAG

LoRA and RAG often get confused because both can make a model feel more specialized.

RAG means retrieval-augmented generation. A system searches documents, pulls relevant chunks into context, and asks the model to answer using those chunks. The facts stay outside the model.

LoRA changes model weights through adapter training. The learned behavior moves into the adapter.

Use retrieval when the problem is knowledge access. Use LoRA when the problem is behavior shape.

Examples:

“Answer from this current policy manual” -> retrieval.
“Always produce a strict triage JSON object” -> LoRA may help.
“Use this week’s product catalog” -> retrieval.
“Classify support tickets into stable routing labels” -> LoRA may help.
“Cite exact source passages” -> retrieval.
“Adopt a recurring house style” -> LoRA may help.
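The retrieval side of that split can be sketched without any model at all. The example below ranks documents by shared words, which is a deliberately crude stand-in for the embedding and reranker models a real system would use. The documents, the prompt wording, and the function names are made up for illustration.

```python
import re
from typing import List

def words(text: str) -> set:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, documents: List[str], k: int = 2) -> List[str]:
    """Rank documents by how many question words they share; return the top k."""
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, documents: List[str]) -> str:
    """Place retrieved passages in context, numbered so the answer can cite them."""
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(retrieve(question, documents)))
    return f"Answer using only the sources below.\n{context}\n\nQuestion: {question}"

docs = [
    "Refunds are available within 30 days of purchase.",
    "Support hours are 9am to 5pm on weekdays.",
    "Refunds for annual plans are prorated after 30 days.",
]
prompt = build_prompt("How many days do refunds take?", docs)
```

Updating the policy means editing a document, not retraining anything. That is the practical meaning of facts staying outside the weights.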

The deeper version appears in LoRA plus RAG composition: the strongest systems often combine both, but the levers should stay mentally separate.

RAG changes context; LoRA changes behavior

RAG: retrieve documents into context; facts stay outside weights
LoRA: train a compact adapter; behavior moves into adapter

What changes mathematically

Pretraining changes the full model parameter set. The model starts with random weights and gradient descent adjusts those weights across a very large dataset. Each update nudges the model toward lower prediction error.

SFT also updates weights, but the dataset looks like task examples instead of raw web-scale continuation. It moves the model from “continue this text” toward “respond to this instruction.”

RLHF adds an optimization target based on preference. A reward model approximates human preference, then the policy model moves toward higher reward while staying near the SFT model. PPO is one way to control this movement. DPO simplifies the setup by directly optimizing preference pairs.
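The DPO objective for a single preference pair is short enough to write out. The sketch below assumes the inputs are summed log-probabilities of each answer under the model being tuned (the policy) and under a frozen reference model. The numbers and the `beta` value are made up for illustration.

```python
import math

def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_shift = policy_chosen - ref_chosen        # how far the policy raised the preferred answer
    rejected_shift = policy_rejected - ref_rejected  # how far it raised the rejected answer
    margin = beta * (chosen_shift - rejected_shift)
    return math.log1p(math.exp(-margin))             # equals -log(sigmoid(margin))

at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0)  # policy identical to reference
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)      # preferred answer up, rejected answer down
```

The reference model plays the "stay near the SFT model" role inside the loss itself: only movement relative to the reference is rewarded, and no separate reward model or RL loop appears.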

LoRA assumes fine-tuning updates often have low intrinsic rank. Instead of changing a large weight matrix directly, LoRA adds two small trainable matrices whose product approximates the needed update. In simplified form:

original weight: W
full fine-tune: W changes directly
LoRA fine-tune: W stays frozen, plus small update A x B

The matrices A and B contain far fewer trainable values than W. This makes adapters smaller, cheaper to train, easier to swap, and easier to version. It also creates a clean operational boundary: one base model can carry several task adapters.
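The size difference is easy to check with arithmetic. The sketch below assumes a single square weight matrix of width 4096 and an adapter rank of 8. Both are illustrative values, not figures from any particular model.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> dict:
    """Compare trainable values in a full weight matrix versus a rank-r adapter."""
    full = d_in * d_out
    adapter = d_in * rank + rank * d_out  # A is d_in x r, B is r x d_out
    return {"full": full, "adapter": adapter, "fraction": adapter / full}

counts = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(counts)
# {'full': 16777216, 'adapter': 65536, 'fraction': 0.00390625}
```

Under these assumptions the adapter trains about 0.4% as many values as the matrix it modifies, and that ratio is what makes adapters cheap to store, swap, and version.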

MoE, short for Mixture of Experts, solves a different scaling problem. Instead of activating the whole model for every token, an MoE model routes tokens through selected expert subnetworks. It changes compute routing inside the model, which is a separate concern from LoRA or RAG.

Training evidence graph

dataset -> trained_on -> adapter
adapter -> scored_by -> eval set
run -> produces -> score
score -> supports -> decision
decision -> gates -> deployment

Bonus: frontier terms as runtime adapters

Closed frontier APIs do not expose model weights. LoRA-style adaptation still has a useful cousin at runtime: named operating terms.

A term like red team carries a compact procedure. In security, it means adversarial testing to expose weaknesses before an attacker uses them. In model work, the same term usually triggers a nearby procedure: challenge assumptions, search for failure modes, stress boundaries, and propose fixes.

The term works because it compresses a pattern:

term -> procedure -> failure mode -> evidence expected

This makes a term catalog act like a soft adapter for API models:

frontier model
  + graph-backed term catalog
  + task-specific prompt pack
  + eval receipt
  -> more consistent agent behavior

Useful terms have operational shape:

Term	Procedure carried by the term
red team	Challenge assumptions and name exploit paths.
invariant	State the rule a system must preserve.
rubric	Score output against explicit criteria.
holdout	Test against examples outside the training or tuning set.
ablation	Remove one factor and measure the change.
rollback	Preserve a known-good return path.
provenance	Keep source, version, and decision lineage.
blast radius	Bound damage from a bad action.
sentinel	Watch for silent failure.
canary	Expose a small slice before broad release.

The runtime split stays clean:

LoRA = compiled behavior in adapter weights
term pack = explicit behavior in runtime context
graph = canonical meaning and relationships
MCP = governed retrieval surface
evals = proof the term became procedure

Term packs are weaker than weights because they still spend context. They are stronger than vibes because agents can retrieve, cite, execute, and score them. A proven term pack can later become SFT or LoRA data for a local model.

The one-page mental model

The failure mode is treating all of these stages and levers as one blob called “training.”

This makes architecture decisions worse. Fresh facts get fine-tuned into adapters when retrieval would be safer. Stable behavior gets shoved into prompts when an adapter would be cleaner. Tool authority hides inside a model discussion when it belongs in system design.

Model building has stages. Each stage changes a different part of the system. LoRA is one useful stage-adjacent lever, not a miniature version of building GPT from scratch.

The companion Learn piece, “Prompt, context, fine-tune, gate,” maps those stages back onto the Determinism Ladder.

Sources

Vaswani et al., Attention Is All You Need
Ouyang et al., Training language models to follow instructions with human feedback
Hu et al., LoRA: Low-Rank Adaptation of Large Language Models
Dettmers et al., QLoRA: Efficient Finetuning of Quantized LLMs
Xu et al., QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
Chen et al., LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
Sheng et al., S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Buehler and Buehler, X-LoRA: Mixture of Low-Rank Adapter Experts
Zhang et al., AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
Hayou et al., LoRA+: Efficient Low Rank Adaptation of Large Models
Liu et al., DoRA: Weight-Decomposed Low-Rank Adaptation
AllenAI, Dolma dataset
AllenAI, Dolma corpus paper
AllenAI, C4 dataset
Common Pile team, Common Pile v0.1 dataset collection
NVIDIA, DGX Spark product page
NVIDIA, DGX Spark hardware overview
Apple, Mac Studio technical specifications
Qwen, Qwen3.6-35B-A3B model card
NVIDIA, Nemotron models
NVIDIA, Nemotron 3 research page

Axioms touched

Lighter touch than the Learn series — primer pieces don't usually lean heavily on the axiom catalog, but where they do it's noted.

#2 Push work down toward determinism (held)
Separates base-model training, adaptation, retrieval, and serving into distinct boundaries.
#11 Cite or be silent (held)
Grounds transformer, instruction tuning, preference tuning, and LoRA in primary papers.
#13 Ship with the failure mode named (held)
Names the failure mode: treating all model improvement methods as the same kind of training.