Component Relationships and Dependencies

Overview

This document maps how JuDDGES components relate to one another: what each depends on and how they interact at runtime. These relationships matter for development, debugging, and system maintenance.

High-Level Component Architecture

graph TB
    subgraph "Core Components"
        subgraph "juddges/"
            Data["📁 data/<br/>Loaders & Database"]
            Embeddings["📁 embeddings/<br/>Vector Generation"]
            Models["📁 models/<br/>LLM Management"]
            Preprocessing["📁 preprocessing/<br/>Text Processing"]
            Evaluation["📁 evaluation/<br/>Metrics & Scoring"]
            Utils["📁 utils/<br/>Helpers & Tools"]
        end

        subgraph "scripts/"
            Dataset["📁 dataset/<br/>Dataset Building"]
            Embed["📁 embed/<br/>Embedding Scripts"]
            SFT["📁 sft/<br/>Fine-tuning Scripts"]
            Predict["📁 predict/<br/>Inference Scripts"]
            Eval["📁 evaluation/<br/>Eval Scripts"]
        end

        subgraph "configs/"
            ModelConf["📁 model/<br/>Model Configs"]
            DatasetConf["📁 dataset/<br/>Dataset Configs"]
            EmbedConf["📁 embedding_model/<br/>Embed Configs"]
            PipelineConf["📁 *.yaml<br/>Pipeline Configs"]
        end

        subgraph "External"
            Weaviate[("Weaviate DB")]
            HuggingFace[("🤗 Hugging Face")]
            DVC[("DVC Pipeline")]
        end
    end

    %% Dependencies
    Data --> Weaviate
    Data --> Preprocessing
    Preprocessing --> Embeddings
    Embeddings --> HuggingFace
    Models --> HuggingFace

    Dataset --> Data
    Dataset --> Preprocessing
    Embed --> Embeddings
    SFT --> Models
    Predict --> Models
    Predict --> Weaviate
    Eval --> Evaluation

    ModelConf --> Models
    DatasetConf --> Data
    EmbedConf --> Embeddings
    PipelineConf --> DVC

    DVC --> Dataset
    DVC --> Embed
    DVC --> SFT
    DVC --> Predict
    DVC --> Eval

    style Data fill:#e3f2fd
    style Models fill:#e8f5e9
    style Weaviate fill:#f3e5f5
    style DVC fill:#fff3e0
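
When debugging, it can help to ask which components sit downstream of a given node in the diagram above. The following is a minimal sketch, not part of JuDDGES: it copies the diagram's edges into a plain dictionary (node names are the diagram labels, not verified import paths) and walks them breadth-first. The `reachable` helper is a hypothetical name.

```python
from collections import deque

# Edges transcribed from the diagram above: key --> each listed target.
EDGES = {
    "Data": ["Weaviate", "Preprocessing"],
    "Preprocessing": ["Embeddings"],
    "Embeddings": ["HuggingFace"],
    "Models": ["HuggingFace"],
    "Dataset": ["Data", "Preprocessing"],
    "Embed": ["Embeddings"],
    "SFT": ["Models"],
    "Predict": ["Models", "Weaviate"],
    "Eval": ["Evaluation"],
    "ModelConf": ["Models"],
    "DatasetConf": ["Data"],
    "EmbedConf": ["Embeddings"],
    "PipelineConf": ["DVC"],
    "DVC": ["Dataset", "Embed", "SFT", "Predict", "Eval"],
}


def reachable(start: str, edges: dict[str, list[str]] = EDGES) -> set[str]:
    """Return every node that can be reached from `start` by following arrows."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in edges.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


if __name__ == "__main__":
    # The fine-tuning scripts touch the model layer and, through it, Hugging Face.
    print(sorted(reachable("SFT")))  # ['HuggingFace', 'Models']
```

Because the arrows in the diagram mix "uses" and "feeds" relationships, the result is best read as "what this node can influence or touch", not as a strict import graph.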

Detailed Module Dependencies

graph LR
    subgraph "juddges.data"
        DataInit["__init__.py"]
        Loaders["loaders/<br/>• document_loader<br/>• parquet_loader"]
        Database["database/<br/>• weaviate_client<br/>• schema_manager"]
        DataUtils["utils/<br/>• uuid_generator<br/>• validators"]
    end

    subgraph "juddges.preprocessing"
        PrepInit["__init__.py"]
        Chunking["chunking/<br/>• smart_chunker<br/>• size_limiter"]
        Parsing["parsing/<br/>• text_parser<br/>• format_detector"]
        Cleaning["cleaning/<br/>• normalizer<br/>• encoder_fix"]
    end

    subgraph "juddges.embeddings"
        EmbedInit["__init__.py"]
        Generator["generator/<br/>• embedding_model<br/>• batch_processor"]
        Aggregator["aggregator/<br/>• mean_pooling<br/>• weighted_avg"]
        Storage["storage/<br/>• vector_store<br/>• cache_manager"]
    end

    subgraph "juddges.models"
        ModelInit["__init__.py"]
        Factory["factory/<br/>• model_factory<br/>• config_loader"]
        Inference["inference/<br/>• predictor<br/>• streamer"]
        Training["training/<br/>• trainer<br/>• lora_adapter"]
    end

    subgraph "juddges.evaluation"
        EvalInit["__init__.py"]
        Metrics["metrics/<br/>• ngram_metrics<br/>• semantic_metrics"]
        Judge["llm_judge/<br/>• gpt4_judge<br/>• claude_judge"]
        Analysis["analysis/<br/>• error_analysis<br/>• report_generator"]
    end

    %% Internal Dependencies
    Loaders --> Parsing
    Parsing --> Cleaning
    Cleaning --> Chunking
    Chunking --> Generator
    Generator --> Aggregator
    Aggregator --> Storage
    Storage --> Database

    Factory --> Training
    Factory --> Inference
    Inference --> Generator
    Inference --> Database

    Metrics --> Analysis
    Judge --> Analysis

    style DataInit fill:#e3f2fd
    style PrepInit fill:#e3f2fd
    style EmbedInit fill:#e3f2fd
    style ModelInit fill:#e3f2fd
    style EvalInit fill:#e3f2fd
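
The chain Parsing → Cleaning → Chunking in the diagram above can be illustrated with two small functions. This is a sketch under stated assumptions: `normalize`, `chunk_by_size`, and `prepare` are hypothetical stand-ins for the `normalizer` and chunking modules, the chunker works on characters (a token-based variant would count tokens instead), and the default sizes are arbitrary.

```python
import unicodedata


def normalize(text: str) -> str:
    """Stand-in for a normalizer: apply Unicode NFKC and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def chunk_by_size(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of at most `size` characters.

    Consecutive windows share `overlap` characters so that context is not
    cut off abruptly at a chunk boundary.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop early enough that the final window is never just a repeat of
    # the overlap region of the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


def prepare(raw: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Assume parsing already produced `raw`; run cleaning, then chunking."""
    return chunk_by_size(normalize(raw), size, overlap)


if __name__ == "__main__":
    print(chunk_by_size("abcdefghij", size=4, overlap=2))
    # ['abcd', 'cdef', 'efgh', 'ghij']
```

Keeping each stage a pure function of its input mirrors the one-directional arrows in the diagram: cleaning never needs to know how chunks are embedded later.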

Class Relationships

classDiagram
    %% Data Classes
    class DocumentLoader {
        +load_pdf()
        +load_docx()
        +load_txt()
        +validate_format()
    }

    class WeaviateClient {
        +connect()
        +create_collection()
        +insert_batch()
        +search()
        +update()
    }

    %% Preprocessing Classes
    class TextChunker {
        +chunk_by_size()
        +chunk_by_tokens()
        +preserve_context()
    }

    class TextParser {
        +extract_text()
        +detect_language()
        +parse_metadata()
    }

    %% Embedding Classes
    class EmbeddingModel {
        +load_model()
        +encode()
        +batch_encode()
        +get_dim()
    }

    class VectorStore {
        +store_vectors()
        +retrieve_similar()
        +update_vectors()
    }

    %% Model Classes
    class ModelFactory {
        +create_model()
        +load_checkpoint()
        +get_config()
    }

    class Trainer {
        +train()
        +evaluate()
        +save_checkpoint()
    }

    class Predictor {
        +predict()
        +stream_predict()
        +batch_predict()
    }

    %% Evaluation Classes
    class MetricsCalculator {
        +calculate_bleu()
        +calculate_rouge()
        +calculate_bertscore()
    }

    class LLMJudge {
        +evaluate_quality()
        +score_response()
        +generate_feedback()
    }

    %% Relationships
    DocumentLoader --> TextParser : uses
    TextParser --> TextChunker : feeds
    TextChunker --> EmbeddingModel : provides text
    EmbeddingModel --> VectorStore : generates vectors
    VectorStore --> WeaviateClient : stores in

    ModelFactory --> Trainer : creates model for
    ModelFactory --> Predictor : creates model for
    Predictor --> WeaviateClient : retrieves context
    Predictor --> MetricsCalculator : sends predictions
    MetricsCalculator --> LLMJudge : validates with
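
The "generates vectors" and "stores in" relationships above can be expressed as an interface plus an injected dependency. The sketch below is illustrative only: the method names `encode`, `get_dim`, `store_vectors`, and `retrieve_similar` come from the diagram, but their signatures, the `Embedder` protocol, and both toy classes are assumptions. The embedder counts characters instead of calling a real model, and the store keeps vectors in memory instead of Weaviate.

```python
import math
from typing import Protocol


class Embedder(Protocol):
    """Assumed interface: the minimum a vector store needs from an embedding model."""

    def encode(self, text: str) -> list[float]: ...
    def get_dim(self) -> int: ...


class CharCountEmbedder:
    """Toy embedder: one dimension per character of a fixed alphabet."""

    def __init__(self, alphabet: str = "abc") -> None:
        self.alphabet = alphabet

    def encode(self, text: str) -> list[float]:
        return [float(text.count(ch)) for ch in self.alphabet]

    def get_dim(self) -> int:
        return len(self.alphabet)


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Stand-in for a database-backed store; depends only on the Embedder interface."""

    def __init__(self, embedder: Embedder) -> None:
        self.embedder = embedder
        self._items: list[tuple[str, list[float]]] = []

    def store_vectors(self, texts: list[str]) -> None:
        for text in texts:
            self._items.append((text, self.embedder.encode(text)))

    def retrieve_similar(self, query: str, k: int = 1) -> list[str]:
        q = self.embedder.encode(query)
        ranked = sorted(self._items, key=lambda item: _cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    store = InMemoryVectorStore(CharCountEmbedder())
    store.store_vectors(["aaa", "bbb", "abc"])
    print(store.retrieve_similar("aa", k=2))  # ['aaa', 'abc']
```

Because the store only relies on the two protocol methods, a real embedding model can replace the toy one without changes to the store.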

Data Flow Dependencies

flowchart TD
    subgraph "Input Layer"
        RawDocs["Raw Documents"]
        Config["Configuration Files"]
        PretrainedModels["Pretrained Models"]
    end

    subgraph "Processing Layer"
        subgraph "Data Pipeline"
            DocLoad["Document Loading"]
            TextProc["Text Processing"]
            ChunkGen["Chunk Generation"]
        end

        subgraph "Embedding Pipeline"
            EmbedGen["Embedding Generation"]
            VecStore["Vector Storage"]
        end

        subgraph "Training Pipeline"
            DataPrep["Data Preparation"]
            ModelTrain["Model Training"]
            CheckSave["Checkpoint Saving"]
        end
    end

    subgraph "Service Layer"
        WeaviateDB["Weaviate Service"]
        ModelServe["Model Service"]
        EvalServe["Evaluation Service"]
    end

    subgraph "Output Layer"
        Predictions["Predictions"]
        Reports["Evaluation Reports"]
        Artifacts["Model Artifacts"]
    end

    %% Flow
    RawDocs --> DocLoad
    Config --> DocLoad
    DocLoad --> TextProc
    TextProc --> ChunkGen

    ChunkGen --> EmbedGen
    PretrainedModels --> EmbedGen
    EmbedGen --> VecStore
    VecStore --> WeaviateDB

    ChunkGen --> DataPrep
    Config --> DataPrep
    DataPrep --> ModelTrain
    PretrainedModels --> ModelTrain
    ModelTrain --> CheckSave
    CheckSave --> ModelServe

    WeaviateDB --> ModelServe
    ModelServe --> Predictions
    Predictions --> EvalServe
    EvalServe --> Reports
    CheckSave --> Artifacts

    style RawDocs fill:#e3f2fd
    style Predictions fill:#e8f5e9
    style WeaviateDB fill:#f3e5f5

Configuration Dependencies

graph TB
    subgraph "Configuration Hierarchy"
        Main["main.yaml<br/>Entry point"]

        subgraph "Hydra Composition"
            Defaults["defaults:<br/>- model: llama<br/>- dataset: pl-court<br/>- embedding_model: mmlw"]
        end

        subgraph "Model Configs"
            Llama["llama.yaml"]
            Mistral["mistral.yaml"]
            Bielik["bielik.yaml"]
        end

        subgraph "Dataset Configs"
            PLCourt["pl-court.yaml"]
            PLFrank["pl-frankowe.yaml"]
            ENLegal["en-legal.yaml"]
        end

        subgraph "Pipeline Configs"
            SFTConf["sft_config.yaml"]
            PredictConf["predict_config.yaml"]
            EvalConf["evaluate_config.yaml"]
        end

        Runtime["Runtime Configuration<br/>Merged settings"]
    end

    Main --> Defaults
    Defaults --> Llama
    Defaults --> PLCourt
    Defaults --> SFTConf

    Llama --> Runtime
    Mistral --> Runtime
    Bielik --> Runtime
    PLCourt --> Runtime
    PLFrank --> Runtime
    ENLegal --> Runtime
    SFTConf --> Runtime
    PredictConf --> Runtime
    EvalConf --> Runtime

    style Main fill:#fff3e0
    style Runtime fill:#e8f5e9
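
The "Merged settings" node above is the result of layering several config files, with later layers overriding earlier ones. Hydra performs this composition itself; the plain-dictionary sketch below only illustrates the precedence idea. The `deep_merge` helper and every key and value in the example dictionaries are hypothetical and are not taken from the actual YAML files.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where `override` wins, merging nested dicts recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents standing in for llama.yaml, pl-court.yaml, sft_config.yaml.
model_cfg = {"model": {"name": "llama", "max_seq_length": 4096}}
dataset_cfg = {"dataset": {"name": "pl-court"}}
sft_cfg = {"model": {"max_seq_length": 2048}, "training": {"epochs": 1}}

runtime: dict = {}
for layer in (model_cfg, dataset_cfg, sft_cfg):
    runtime = deep_merge(runtime, layer)

if __name__ == "__main__":
    # The pipeline config overrides max_seq_length but keeps the model name.
    print(runtime["model"])  # {'name': 'llama', 'max_seq_length': 2048}
```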

Error Propagation Paths

graph TD
    subgraph "Error Sources"
        DataError["Data Error<br/>• Invalid format<br/>• Missing fields"]
        ModelError["Model Error<br/>• OOM<br/>• Loading failure"]
        DBError["Database Error<br/>• Connection<br/>• Schema mismatch"]
        ConfigError["Config Error<br/>• Invalid params<br/>• Missing files"]
    end

    subgraph "Error Handlers"
        DataHandler["Data Handler<br/>• Validation<br/>• Fallback"]
        ModelHandler["Model Handler<br/>• Retry logic<br/>• Graceful degradation"]
        DBHandler["DB Handler<br/>• Reconnection<br/>• Cache fallback"]
        ConfigHandler["Config Handler<br/>• Defaults<br/>• Validation"]
    end

    subgraph "Recovery Actions"
        Skip["Skip Record"]
        Retry["Retry Operation"]
        Fallback["Use Fallback"]
        Alert["Send Alert"]
        Terminate["Terminate Pipeline"]
    end

    DataError --> DataHandler
    ModelError --> ModelHandler
    DBError --> DBHandler
    ConfigError --> ConfigHandler

    DataHandler --> Skip
    DataHandler --> Retry
    ModelHandler --> Fallback
    ModelHandler --> Retry
    DBHandler --> Retry
    DBHandler --> Alert
    ConfigHandler --> Fallback
    ConfigHandler --> Terminate

    style DataError fill:#ffebee
    style ModelError fill:#ffebee
    style DBError fill:#ffebee
    style ConfigError fill:#ffebee
    style Alert fill:#fff3e0
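
Two of the recovery actions above, "Retry Operation" and "Skip Record", can be sketched as follows. This is an assumption-laden illustration, not the project's handlers: `with_retry` and `split_valid` are hypothetical names, the exponential backoff schedule and the required field names are arbitrary, and `ConnectionError` stands in for whatever exception the database client actually raises.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: tuple[type[BaseException], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying with exponential backoff on the listed errors.

    The last failure is re-raised so the caller can alert or terminate.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


def split_valid(
    records: list[dict], required: tuple[str, ...] = ("id", "text")
) -> tuple[list[dict], list[dict]]:
    """Separate records that have all required fields from those to skip."""
    kept: list[dict] = []
    skipped: list[dict] = []
    for record in records:
        target = kept if all(field in record for field in required) else skipped
        target.append(record)
    return kept, skipped
```

Passing `sleep` as a parameter keeps the retry helper testable without real waiting, and returning the skipped records lets the caller log them instead of dropping them silently.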

Testing Dependencies

graph LR
    subgraph "Test Structure"
        subgraph "Unit Tests"
            TestData["test_data/<br/>• Loaders<br/>• Validators"]
            TestEmbed["test_embeddings/<br/>• Generation<br/>• Storage"]
            TestModels["test_models/<br/>• Factory<br/>• Inference"]
            TestEval["test_evaluation/<br/>• Metrics<br/>• Judge"]
        end

        subgraph "Integration Tests"
            TestWeaviate["test_weaviate/<br/>• Connection<br/>• CRUD ops"]
            TestPipeline["test_pipeline/<br/>• End-to-end<br/>• DVC stages"]
        end

        subgraph "Test Fixtures"
            SampleData["sample_data/<br/>• Mock documents<br/>• Test configs"]
            MockModels["mock_models/<br/>• Dummy weights<br/>• Stubs"]
        end
    end

    TestData --> SampleData
    TestEmbed --> SampleData
    TestModels --> MockModels
    TestEval --> SampleData

    TestWeaviate --> SampleData
    TestPipeline --> SampleData
    TestPipeline --> MockModels

    style TestData fill:#e3f2fd
    style TestWeaviate fill:#e8f5e9
    style SampleData fill:#fff3e0
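
The "Stubs" entry under `mock_models/` suggests that unit tests replace heavy models with lightweight doubles. A minimal sketch of that idea with the standard library's `unittest.mock` is shown below. The function `answer_with_context` and the `search` and `predict` call shapes are assumptions made for illustration (the method names echo the class diagram, but their real signatures are not documented here).

```python
from unittest.mock import Mock


def answer_with_context(question: str, retriever, model) -> str:
    """Hypothetical unit under test: fetch context, then ask the model."""
    context = retriever.search(question)
    return model.predict(f"{context}\n\n{question}")


def test_answer_uses_retrieved_context() -> None:
    # Stubs replace the database client and the model, so no services
    # or weights are needed to run this test.
    retriever = Mock()
    retriever.search.return_value = "ctx"
    model = Mock()
    model.predict.return_value = "answer"

    assert answer_with_context("q?", retriever, model) == "answer"
    retriever.search.assert_called_once_with("q?")
    model.predict.assert_called_once_with("ctx\n\nq?")
```

A test runner such as pytest would collect the `test_` function automatically; it can also be called directly.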

Package Dependencies

graph TB
    subgraph "Core Dependencies"
        Python["Python 3.10+"]
        PyTorch["PyTorch 2.0+"]
        Transformers["Transformers 4.40+"]
    end

    subgraph "ML Dependencies"
        Unsloth["Unsloth<br/>Training optimization"]
        PEFT["PEFT<br/>LoRA/QLoRA"]
        BitsBytes["BitsAndBytes<br/>Quantization"]
    end

    subgraph "Data Dependencies"
        Pandas["Pandas<br/>Data manipulation"]
        Pyarrow["PyArrow<br/>Parquet support"]
        Weaviate["Weaviate Client<br/>Vector DB"]
    end

    subgraph "Infrastructure"
        DVC_["DVC<br/>Pipeline management"]
        Hydra["Hydra<br/>Configuration"]
        Docker["Docker<br/>Containerization"]
    end

    subgraph "Utilities"
        Rich["Rich<br/>Console output"]
        Loguru["Loguru<br/>Logging"]
        Typer["Typer<br/>CLI interface"]
    end

    Python --> PyTorch
    Python --> Transformers
    PyTorch --> Unsloth
    PyTorch --> PEFT
    PyTorch --> BitsBytes

    Python --> Pandas
    Python --> Pyarrow
    Python --> Weaviate

    Python --> DVC_
    Python --> Hydra
    Python --> Docker

    Python --> Rich
    Python --> Loguru
    Python --> Typer

    style Python fill:#fff3e0
    style PyTorch fill:#e3f2fd
    style DVC_ fill:#e8f5e9
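
To check which of the packages above are present in an environment, the standard library's `importlib.metadata` can be queried. The sketch below is a convenience helper, not part of JuDDGES: `installed_versions` is a hypothetical name, and the distribution names in `PACKAGES` are the usual PyPI names for the libraries in the diagram, which may differ from what the project pins.

```python
import sys
from importlib import metadata
from typing import Optional

# PyPI distribution names assumed for the libraries shown in the diagram.
PACKAGES = [
    "torch", "transformers", "unsloth", "peft", "bitsandbytes",
    "pandas", "pyarrow", "weaviate-client",
    "dvc", "hydra-core", "rich", "loguru", "typer",
]


def installed_versions(packages: list[str]) -> dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    versions: dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def python_is_supported(minimum: tuple[int, int] = (3, 10)) -> bool:
    """Compare the running interpreter against the minimum in the diagram."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```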

Communication Patterns

sequenceDiagram
    participant User
    participant CLI
    participant DVC
    participant Pipeline
    participant Weaviate
    participant Model
    participant Evaluator

    User->>CLI: Execute command
    CLI->>DVC: Trigger pipeline
    DVC->>Pipeline: Run stages

    Pipeline->>Weaviate: Store embeddings
    Pipeline->>Model: Load weights
    Pipeline->>Model: Fine-tune

    Model->>Weaviate: Query context
    Weaviate-->>Model: Return results
    Model->>Model: Generate prediction

    Model->>Evaluator: Send predictions
    Evaluator->>Evaluator: Calculate metrics
    Evaluator-->>User: Return report

Best Practices for Component Design

Loose Coupling: Components communicate through interfaces
High Cohesion: Related functionality grouped together
Single Responsibility: Each component has one clear purpose
Dependency Injection: Configuration and collaborating components are passed in at runtime rather than hard-coded
Error Isolation: Failures don't cascade unnecessarily
Testability: Components can be tested in isolation
Documentation: Clear interfaces and usage examples