DVC Pipeline Architecture Reference¶
Overview¶
This reference document details the DVC (Data Version Control) pipeline architecture used in JuDDGES for reproducible machine learning workflows. DVC orchestrates complex multi-stage pipelines with automatic dependency resolution and caching.
Pipeline DAG (Directed Acyclic Graph)¶
graph TD
subgraph "DVC Pipeline Stages"
Start([Start])
%% Data Preparation
DataPrep["📊 Data Preparation<br/>Convert raw data to Parquet"]
%% Embedding Stage
Embed["🔤 embed<br/>Generate text embeddings<br/>mmlw-roberta-large"]
%% Training Stages
SFT["🎓 sft<br/>Supervised Fine-Tuning<br/>PEFT/LoRA"]
%% Prediction Stage
Predict["🔮 predict<br/>Generate predictions<br/>Batch inference"]
%% Evaluation Stages
Evaluate["📏 evaluate<br/>N-gram metrics<br/>BLEU/ROUGE"]
EvaluateLLM["🤖 evaluate_llm_as_judge<br/>LLM-based evaluation<br/>Quality assessment"]
%% Matrix Expansion
Matrix{{"Matrix Expansion<br/>Models × Datasets"}}
End([End])
end
%% Dependencies
Start --> DataPrep
DataPrep --> Embed
DataPrep --> SFT
Embed --> Predict
SFT --> Predict
Predict --> Evaluate
Predict --> EvaluateLLM
Evaluate --> End
EvaluateLLM --> End
%% Matrix connections
SFT -.->|foreach| Matrix
Predict -.->|foreach| Matrix
Matrix -.-> SFT
Matrix -.-> Predict
style Start fill:#e8f5e9
style End fill:#ffebee
style Matrix fill:#fff3e0
style Embed fill:#e3f2fd
style SFT fill:#f3e5f5
style Predict fill:#ffe0b2
style Evaluate fill:#fce4ec
style EvaluateLLM fill:#fce4ec
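The solid arrows in the diagram fix the order in which stages can run. As a minimal sketch (not DVC's own scheduler), the same dependency table can be sorted with Python's standard `graphlib`. The key `data_prep` is a hypothetical name standing in for the Data Preparation node, which the diagram does not name as a stage.

```python
from graphlib import TopologicalSorter

# Stage -> set of upstream stages, transcribed from the solid arrows above.
# "data_prep" is a hypothetical label for the unnamed Data Preparation node.
DEPENDENCIES = {
    "data_prep": set(),
    "embed": {"data_prep"},
    "sft": {"data_prep"},
    "predict": {"embed", "sft"},
    "evaluate": {"predict"},
    "evaluate_llm_as_judge": {"predict"},
}


def execution_order(deps):
    """Return one valid order in which every stage follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def upstream(stage, deps):
    """Return every stage that must be up to date before `stage` can run."""
    seen = set()
    stack = [stage]
    while stack:
        for parent in deps[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


order = execution_order(DEPENDENCIES)
print(order)
print(sorted(upstream("predict", DEPENDENCIES)))  # ['data_prep', 'embed', 'sft']
```

The relative order of sibling stages such as `embed` and `sft` is not constrained by the graph, so only the before/after relationships should be relied on.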
Stage Specifications¶
1. Embedding Generation (embed)¶
flowchart LR
subgraph "embed Stage"
Input[("Input<br/>Parquet files")]
Process["Process<br/>• Load documents<br/>• Tokenize text<br/>• Generate embeddings<br/>• Aggregate vectors"]
Output[("Output<br/>Embedding vectors")]
Config["Config<br/>• Model: mmlw-roberta-large<br/>• Batch size: 32<br/>• Max length: 512"]
end
Input --> Process
Config --> Process
Process --> Output
style Input fill:#e3f2fd
style Output fill:#e8f5e9
style Config fill:#fff3e0
Command: dvc repro embed
Dependencies:
- Input: data/datasets/{pl,en}/raw/*.parquet
- Model: sdadas/mmlw-roberta-large
- Output: Embedding files in data/embeddings/
Parameters:
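The stage diagram lists a batch size of 32 and a maximum length of 512 tokens, and its process box ends with an aggregation step. One plausible reading, sketched here under assumptions, is that long documents are split into 512-token windows, encoded in batches, and averaged into one vector. Mean pooling and the `toy_encoder` stand-in are assumptions: the source does not say how vectors are aggregated, and the real stage would call `sdadas/mmlw-roberta-large` instead.

```python
MAX_LENGTH = 512  # "Max length" from the stage diagram
BATCH_SIZE = 32   # "Batch size" from the stage diagram


def chunk_tokens(token_ids, max_length=MAX_LENGTH):
    """Split a token sequence into consecutive windows of at most max_length."""
    return [list(token_ids[i:i + max_length])
            for i in range(0, len(token_ids), max_length)]


def batched(items, batch_size=BATCH_SIZE):
    """Yield successive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def mean_pool(vectors):
    """Average equally sized vectors element-wise (assumed aggregation)."""
    if not vectors:
        raise ValueError("cannot aggregate an empty list of vectors")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def embed_document(token_ids, encode_batch):
    """Encode every window of a document and average the window vectors."""
    chunk_vectors = []
    for batch in batched(chunk_tokens(token_ids)):
        chunk_vectors.extend(encode_batch(batch))
    return mean_pool(chunk_vectors)


def toy_encoder(batch):
    """Hypothetical encoder: maps each window to a 2-dimensional vector."""
    return [[float(len(chunk)), float(sum(chunk))] for chunk in batch]


print(embed_document([1] * 768, toy_encoder))  # [384.0, 384.0]
```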
2. Supervised Fine-Tuning (sft)¶
flowchart TB
subgraph "sft Stage Matrix"
subgraph "Models"
M1["Llama-3.2-3B"]
M2["Mistral-7B-v0.3"]
M3["Bielik-7B-v0.1"]
M4["Phi-4"]
end
subgraph "Datasets"
D1["pl-court-instruct"]
D2["pl-court-frankowe"]
D3["en-legal-instruct"]
end
subgraph "Training Config"
TC["• PEFT/LoRA<br/>• Gradient Accumulation: 4<br/>• Learning Rate: 2e-4<br/>• Epochs: 3<br/>• Warmup: 0.1"]
end
Process["Fine-Tuning Process<br/>Unsloth Optimization"]
Output[("Checkpoints<br/>models/{model}/{dataset}/")]
end
M1 --> Process
M2 --> Process
M3 --> Process
M4 --> Process
D1 --> Process
D2 --> Process
D3 --> Process
TC --> Process
Process --> Output
style Output fill:#ffe0b2
Command: CUDA_VISIBLE_DEVICES=0 NUM_PROC=10 dvc repro sft
Matrix Expansion:
stages:
  sft:
    foreach:
      - model: Llama-3.2-3B-Instruct
        dataset: pl-court-instruct-sft
      - model: Mistral-7B-Instruct-v0.3
        dataset: pl-court-instruct-sft
      - model: Bielik-7B-Instruct-v0.1
        dataset: pl-court-frankowe-instruct
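The `foreach` list above yields one concrete stage per entry. The following sketch mirrors that expansion in plain Python and derives the checkpoint directory from the `models/{model}/{dataset}/` pattern in the diagram. The `sft@<index>` names follow DVC's convention for list items that are dictionaries, as far as this sketch assumes; confirm the actual names with `dvc stage list`. `ExpandedStage` and `expand_foreach` are hypothetical helpers, not part of DVC.

```python
from dataclasses import dataclass

# The three entries from the foreach list above.
FOREACH_ITEMS = [
    {"model": "Llama-3.2-3B-Instruct", "dataset": "pl-court-instruct-sft"},
    {"model": "Mistral-7B-Instruct-v0.3", "dataset": "pl-court-instruct-sft"},
    {"model": "Bielik-7B-Instruct-v0.1", "dataset": "pl-court-frankowe-instruct"},
]


@dataclass(frozen=True)
class ExpandedStage:
    name: str
    model: str
    dataset: str
    output_dir: str


def expand_foreach(stage, items):
    """Create one concrete stage per foreach item, named by list index."""
    expanded = []
    for index, item in enumerate(items):
        expanded.append(ExpandedStage(
            name=f"{stage}@{index}",  # assumed naming, see lead-in
            model=item["model"],
            dataset=item["dataset"],
            output_dir=f"models/{item['model']}/{item['dataset']}/",
        ))
    return expanded


for expanded_stage in expand_foreach("sft", FOREACH_ITEMS):
    print(expanded_stage.name, "->", expanded_stage.output_dir)
```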
3. Prediction (predict)¶
flowchart LR
subgraph "predict Stage"
Input1[("Fine-tuned Model")]
Input2[("Test Dataset")]
Input3[("Weaviate Context")]
Process["Inference Pipeline<br/>• Load model<br/>• Retrieve context<br/>• Generate predictions<br/>• Post-process"]
Config["Config<br/>• Batch size: 8<br/>• Max new tokens: 512<br/>• Temperature: 0.7<br/>• Top-p: 0.95"]
Output[("Predictions<br/>JSON/Parquet")]
end
Input1 --> Process
Input2 --> Process
Input3 --> Process
Config --> Process
Process --> Output
style Input1 fill:#ffe0b2
style Input2 fill:#e3f2fd
style Input3 fill:#e8f5e9
style Output fill:#ffebee
Command: CUDA_VISIBLE_DEVICES=0 dvc repro predict
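The configuration box lists a temperature of 0.7 and a top-p of 0.95. As an illustration of what those two settings do to a next-token distribution, the sketch below applies temperature scaling and then nucleus (top-p) filtering in pure Python. It is not the stage's implementation, which presumably delegates sampling to its generation library; the function names are hypothetical.

```python
import math

TEMPERATURE = 0.7  # from the predict config
TOP_P = 0.95       # from the predict config


def softmax_with_temperature(logits, temperature=TEMPERATURE):
    """Convert logits to probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, top_p=TOP_P):
    """Return indices of the smallest set of tokens, most probable first,
    whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


probs = softmax_with_temperature([2.0, 1.0, 0.0])
print(probs)
print(nucleus(probs))
```

Sampling then draws only from the kept indices after renormalizing their probabilities, which removes the unlikely tail while keeping some diversity.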
4. Evaluation (evaluate & evaluate_llm_as_judge)¶
flowchart TB
subgraph "Evaluation Pipeline"
Predictions[("Model Predictions")]
subgraph "Automatic Metrics"
BLEU["BLEU Score"]
ROUGE["ROUGE Score"]
METEOR["METEOR Score"]
BertScore["BERTScore"]
end
subgraph "LLM as Judge"
Judge["GPT-4/Claude<br/>Quality Assessment"]
Criteria["• Accuracy<br/>• Completeness<br/>• Relevance<br/>• Coherence"]
end
Reports[("Evaluation Reports<br/>• Metrics JSON<br/>• Analysis HTML")]
end
Predictions --> BLEU
Predictions --> ROUGE
Predictions --> METEOR
Predictions --> BertScore
Predictions --> Judge
Criteria --> Judge
BLEU --> Reports
ROUGE --> Reports
METEOR --> Reports
BertScore --> Reports
Judge --> Reports
style Predictions fill:#ffebee
style Reports fill:#e8f5e9
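The n-gram metrics in the diagram all start from counting n-grams shared between a prediction and its reference. The sketch below shows that core step: clipped n-gram precision (the building block of BLEU) and ROUGE-N recall. It is a simplified illustration only; full BLEU adds a brevity penalty and combines several n-gram orders, and the `evaluate` stage most likely relies on an established metrics library rather than code like this.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(candidate, reference, n):
    """Return (clipped matches, candidate n-gram total, reference n-gram total)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    matched = sum((cand & ref).values())  # & keeps the minimum count per n-gram
    return matched, sum(cand.values()), sum(ref.values())


def clipped_precision(candidate, reference, n=1):
    """Share of candidate n-grams that also occur in the reference."""
    matched, cand_total, _ = overlap(candidate, reference, n)
    return matched / cand_total if cand_total else 0.0


def rouge_n_recall(candidate, reference, n=1):
    """Share of reference n-grams that are recovered by the candidate."""
    matched, _, ref_total = overlap(candidate, reference, n)
    return matched / ref_total if ref_total else 0.0


prediction = "the court dismissed the appeal".split()
reference = "the court dismissed the claim".split()
print(clipped_precision(prediction, reference, n=1))  # 0.8
print(rouge_n_recall(prediction, reference, n=2))     # 0.75
```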
DVC Configuration Structure¶
graph TD
subgraph "DVC Configuration Files"
Root["dvc.yaml<br/>Pipeline definition"]
Params["params.yaml<br/>Global parameters"]
subgraph "Stage Configs"
SC1["configs/sft_config.yaml"]
SC2["configs/predict_config.yaml"]
SC3["configs/evaluate_config.yaml"]
end
subgraph "Model Configs"
MC1["configs/model/Llama-*.yaml"]
MC2["configs/model/Mistral-*.yaml"]
MC3["configs/model/Bielik-*.yaml"]
MC4["configs/model/Phi-*.yaml"]
end
subgraph "Dataset Configs"
DC1["configs/dataset/pl-court-*.yaml"]
DC2["configs/dataset/en-legal-*.yaml"]
end
end
Root --> Params
Root --> SC1
Root --> SC2
Root --> SC3
SC1 --> MC1
SC1 --> MC2
SC1 --> MC3
SC1 --> MC4
SC1 --> DC1
SC1 --> DC2
style Root fill:#fff3e0
style Params fill:#e3f2fd
Pipeline Execution Flow¶
sequenceDiagram
participant User
participant DVC
participant Git
participant Cache
participant GPU
User->>DVC: dvc repro predict
DVC->>Git: Check pipeline definition
DVC->>Cache: Check cached outputs
alt Output exists in cache
Cache-->>DVC: Return cached result
DVC-->>User: Pipeline up to date
else Output not cached
DVC->>GPU: Allocate resources
DVC->>DVC: Execute pipeline stages
loop For each stage
DVC->>DVC: Check dependencies
DVC->>GPU: Run computation
GPU-->>DVC: Return output
DVC->>Cache: Store output
end
DVC-->>User: Pipeline complete
end
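The "For each stage" loop in the diagram amounts to deciding which stages are stale. A minimal sketch of that decision, assuming the stage graph from the pipeline DAG: a stage reruns when its own inputs changed or when any upstream stage reruns. This is a conservative simplification, since DVC compares content hashes and can skip a downstream stage whose inputs turn out identical; `stages_to_run` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Stage -> upstream stages, using the stage names from the pipeline DAG.
DEPENDENCIES = {
    "embed": set(),
    "sft": set(),
    "predict": {"embed", "sft"},
    "evaluate": {"predict"},
    "evaluate_llm_as_judge": {"predict"},
}


def stages_to_run(deps, changed):
    """Return the stale stages in execution order.

    `changed` holds stages whose own inputs, parameters, or code changed;
    staleness then propagates to everything downstream.
    """
    order = list(TopologicalSorter(deps).static_order())
    stale = set()
    for stage in order:
        if stage in changed or deps[stage] & stale:
            stale.add(stage)
    return [stage for stage in order if stage in stale]


print(stages_to_run(DEPENDENCIES, {"sft"}))
print(stages_to_run(DEPENDENCIES, set()))  # [] -> pipeline up to date
```

With only `sft` changed, `embed` is served from the cache while `predict` and both evaluation stages are scheduled again.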
Matrix Execution Strategy¶
graph LR
subgraph "Matrix Definition"
Config["dvc.yaml<br/>foreach definition"]
end
subgraph "Expansion"
Expand["Cartesian Product<br/>Models × Datasets"]
end
subgraph "Parallel Execution"
PE1["Llama + Dataset1"]
PE2["Llama + Dataset2"]
PE3["Mistral + Dataset1"]
PE4["Mistral + Dataset2"]
PE5["...more combinations"]
end
subgraph "Results"
Results["Aggregated Results<br/>Per combination"]
end
Config --> Expand
Expand --> PE1
Expand --> PE2
Expand --> PE3
Expand --> PE4
Expand --> PE5
PE1 --> Results
PE2 --> Results
PE3 --> Results
PE4 --> Results
PE5 --> Results
style Config fill:#fff3e0
style Results fill:#e8f5e9
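The expansion step in the diagram can be sketched with `itertools.product`. The model and dataset names are taken from the `sft` stage diagram; `expand_matrix` and `restrict` are hypothetical helpers. Note that the `foreach` list shown for the `sft` stage enumerates specific pairs, which is a subset of the full product computed here.

```python
from itertools import product

# Names as they appear in the sft stage diagram.
MODELS = ["Llama-3.2-3B", "Mistral-7B-v0.3", "Bielik-7B-v0.1", "Phi-4"]
DATASETS = ["pl-court-instruct", "pl-court-frankowe", "en-legal-instruct"]


def expand_matrix(models, datasets):
    """Return every (model, dataset) combination as a dictionary."""
    return [{"model": m, "dataset": d} for m, d in product(models, datasets)]


def restrict(combinations, allowed_pairs):
    """Keep only explicitly listed pairs, as an explicit foreach list would."""
    allowed = set(allowed_pairs)
    return [c for c in combinations if (c["model"], c["dataset"]) in allowed]


combinations = expand_matrix(MODELS, DATASETS)
print(len(combinations))  # 12
print(restrict(combinations, [("Phi-4", "en-legal-instruct")]))
```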
Command Reference¶
| Command | Description | Environment Variables |
|---|---|---|
| dvc repro | Run entire pipeline | - |
| dvc repro embed | Generate embeddings only | NUM_PROC |
| dvc repro sft | Run fine-tuning | CUDA_VISIBLE_DEVICES, NUM_PROC |
| dvc repro predict | Generate predictions | CUDA_VISIBLE_DEVICES, NUM_PROC |
| dvc repro evaluate | Run n-gram evaluation | - |
| dvc repro evaluate_llm_as_judge | Run LLM evaluation | - |
| dvc dag | Visualize pipeline DAG | - |
| dvc status | Check pipeline status | - |
| dvc stage list | List all stages | - |
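When stages are launched from a script rather than a shell, the environment variables in the table have to be passed explicitly. The sketch below builds the command and the overrides separately so they can be inspected before anything runs; `build_repro`, `as_shell_line`, and `run` are hypothetical helper names, and `run` assumes `dvc` is on the `PATH`.

```python
import os
import shlex
import subprocess


def build_repro(stage=None, gpu=None, num_proc=None):
    """Return the dvc command and the environment overrides for one stage."""
    command = ["dvc", "repro"]
    if stage is not None:
        command.append(stage)
    overrides = {}
    if gpu is not None:
        overrides["CUDA_VISIBLE_DEVICES"] = str(gpu)
    if num_proc is not None:
        overrides["NUM_PROC"] = str(num_proc)
    return command, overrides


def as_shell_line(command, overrides):
    """Render the invocation the way it would be typed in a shell."""
    prefix = [f"{name}={shlex.quote(value)}" for name, value in overrides.items()]
    return " ".join(prefix + [shlex.quote(part) for part in command])


def run(command, overrides):
    """Execute the command with the overrides layered on the current environment."""
    env = {**os.environ, **overrides}
    return subprocess.run(command, env=env, check=True)


command, overrides = build_repro("sft", gpu=0, num_proc=10)
print(as_shell_line(command, overrides))
# CUDA_VISIBLE_DEVICES=0 NUM_PROC=10 dvc repro sft
```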
Cache Management¶
flowchart TB
subgraph "DVC Cache Structure"
LocalCache[".dvc/cache/<br/>Local cache"]
RemoteCache["Remote Storage<br/>S3/GCS/Azure"]
subgraph "Cache Keys"
InputHash["Input file hash"]
ParamHash["Parameter hash"]
CodeHash["Code hash"]
Combined["Combined hash<br/>→ Cache key"]
end
end
InputHash --> Combined
ParamHash --> Combined
CodeHash --> Combined
Combined --> LocalCache
LocalCache <--> RemoteCache
style LocalCache fill:#e3f2fd
style RemoteCache fill:#e8f5e9
style Combined fill:#fff3e0
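The "Combined hash" node can be illustrated with a few lines of `hashlib`. This is a conceptual sketch of the diagram, not DVC's actual hashing scheme, which differs in detail; SHA-256, the canonical JSON encoding of parameters, and the function names are all assumptions made for the example.

```python
import hashlib
import json


def file_digest(data: bytes) -> str:
    """Hash the raw bytes of an input file or source file."""
    return hashlib.sha256(data).hexdigest()


def params_digest(params: dict) -> str:
    """Hash parameters in a canonical form so key order does not matter."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cache_key(input_hash: str, param_hash: str, code_hash: str) -> str:
    """Combine the three component hashes into one lookup key."""
    combined = hashlib.sha256()
    for part in (input_hash, param_hash, code_hash):
        combined.update(part.encode("utf-8"))
        combined.update(b"\0")  # separator keeps the component boundaries unambiguous
    return combined.hexdigest()


key = cache_key(
    file_digest(b"example input"),
    params_digest({"batch_size": 8, "max_new_tokens": 512}),
    file_digest(b"print('predict')"),
)
print(key)
```

Changing any single component, for example the batch size, produces a different key, which is what forces the stage to run again instead of reusing a cached output.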
Performance Optimization¶
- Stage-level Caching: Reuse outputs when inputs haven't changed
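- Parallel Execution: Run independent stages concurrently by launching separate dvc repro <stage> processes; a single dvc repro call runs its stages one at a time
- Resource Allocation: Select the GPU with CUDA_VISIBLE_DEVICES and the number of worker processes with NUM_PROC for each stage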
- Batch Processing: Optimize batch sizes for memory/speed
- Incremental Updates: Only rerun affected stages
Troubleshooting¶
| Issue | Solution |
|---|---|
| Out of memory | Reduce batch size in config |
| Stage fails | Check dvc.log for details, or rerun the stage with dvc repro --verbose for full output |
| Cache miss | Run dvc status to check changes |
| Slow execution | Enable parallel processing with NUM_PROC |
| GPU not used | Set CUDA_VISIBLE_DEVICES |