LLM Architecture · Prompt Engineering · LoRA/QLoRA · Production Apps
Temperature divides the logits before softmax. T→0: greedy (deterministic, argmax), T=1: unscaled distribution, T>1: flatter/more random.
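The effect of temperature can be sketched in a few lines (pure Python, illustrative logits):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/T before softmax; low T sharpens, high T flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.1)  # nearly all mass on the argmax
hot = softmax_with_temperature(logits, 2.0)   # flatter, more random sampling
```

At T=0.1 the top token gets >99% of the probability mass; at T=2.0 the same logits yield a much flatter distribution.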
Define persona, constraints, output format, tone. "You are an expert Python engineer. Always provide runnable code with type hints."
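A system prompt covering all four elements might look like this (OpenAI-style chat message layout; the user turn is illustrative):

```python
# System prompt combining persona, constraints, output format, and tone.
messages = [
    {
        "role": "system",
        "content": (
            "You are an expert Python engineer. "                            # persona
            "Always provide runnable code with type hints. "                 # constraint
            "Answer with a short explanation followed by one code block. "   # output format
            "Keep the tone concise and practical."                           # tone
        ),
    },
    {"role": "user", "content": "Write a function that deduplicates a list."},
]
```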
Request JSON with schema definition. Use function calling / tool use for reliable structured output. Pydantic + instructor library for automatic parsing.
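Pydantic + instructor automate schema validation against the API response; a dependency-free sketch of the same idea, using stdlib dataclasses and a hypothetical model reply:

```python
import json
from dataclasses import dataclass, fields

@dataclass
class UserInfo:
    name: str
    age: int

def parse_structured(raw: str, cls):
    """Validate a model's JSON reply against a dataclass 'schema'
    (a simplified stand-in for what Pydantic + instructor do)."""
    data = json.loads(raw)
    for f in fields(cls):
        if f.name not in data:
            raise ValueError(f"missing field: {f.name}")
        if not isinstance(data[f.name], f.type):
            raise TypeError(f"{f.name} should be {f.type.__name__}")
    return cls(**{f.name: data[f.name] for f in fields(cls)})

reply = '{"name": "Ada", "age": 36}'  # hypothetical model output
user = parse_structured(reply, UserInfo)
```

On a missing or mistyped field this raises instead of silently passing bad data downstream, which is the core value of structured output.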
Update all parameters. Expensive (hundreds of GPU-hours), with a risk of catastrophic forgetting. Use only when you have large, high-quality domain data.
Add trainable low-rank matrices to attention weights. 0.1-1% of parameters. No additional inference latency — merge into base model when done.
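The LoRA update is W_eff = W + (α/r)·BA, where B (d×r) and A (r×d) are the only trained matrices. A minimal pure-Python sketch with toy dimensions:

```python
import random

def matmul(A, B):
    """Naive matrix multiply, sufficient for this sketch."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

d, r, alpha = 4, 1, 8  # toy sizes; real layers have d in the thousands
random.seed(0)

# Frozen base weight W (d x d); only B and A are trained: 2*d*r = 8 values vs d*d = 16.
W = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(d)]
B = [[0.0] for _ in range(d)]                    # B starts at zero, so the adapter is a no-op
A = [[random.gauss(0, 0.02) for _ in range(d)]]  # r x d

delta = matmul(B, A)  # rank-r update B @ A
scale = alpha / r
# Merging for inference: fold the update into W, so there is no extra latency.
W_eff = [[w + scale * dw for w, dw in zip(wr, dr)] for wr, dr in zip(W, delta)]
```

Because B is zero-initialized, W_eff equals W before any training, which is why adding an adapter never degrades the base model at step zero.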
4-bit quantized base model + LoRA. Enables 70B fine-tuning on 1-2 GPUs. Uses NF4 quantization + double quantization for memory efficiency.
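The memory saving comes from storing weights as 4-bit codes plus a per-block scale. A simplified sketch using uniform absmax quantization (real NF4 spaces its 16 levels for normally distributed weights, and double quantization additionally quantizes the absmax scales themselves):

```python
import random

def quantize_4bit(weights):
    """Absmax 4-bit quantization: map each weight to one of 15 integer levels."""
    absmax = max(abs(w) for w in weights)
    codes = [round(w / absmax * 7) for w in weights]  # integers in [-7, 7], fit in 4 bits
    return codes, absmax

def dequantize_4bit(codes, absmax):
    """Recover approximate float weights from codes plus the stored scale."""
    return [c / 7 * absmax for c in codes]

random.seed(1)
weights = [random.gauss(0, 0.02) for _ in range(64)]  # one quantization block
codes, absmax = quantize_4bit(weights)
restored = dequantize_4bit(codes, absmax)
max_err = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored))
```

Each code needs 4 bits instead of 16 or 32 for a float, and the rounding error per weight is bounded by half a quantization step (absmax/14 here).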