API Reference

Welcome to the JuDDGES API reference. This section documents the public APIs, classes, functions, and modules of the JuDDGES project.

Documentation Structure

The API documentation is organized by module functionality and follows the Reference category of the Diátaxis framework: it provides technical specifications and detailed information about the codebase.

Core Modules

Core Configuration

Configuration management using Hydra and OmegaConf for flexible, hierarchical configuration.

Key Components:

  • LLMConfig - Large language model configuration
  • EmbeddingConfig - Embedding model and dataset configuration
  • Configuration loaders and validators
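
To illustrate what "hierarchical" means here, the following is a minimal standard-library sketch of layered overrides, the kind of merge behaviour that OmegaConf provides. It deliberately uses neither Hydra nor OmegaConf, and the keys shown (`llm`, `name`, `max_seq_length`, `embedding`, `batch_size`) are hypothetical, not the actual fields of LLMConfig or EmbeddingConfig.

```python
from copy import deepcopy
from typing import Any


def merge_config(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Recursively merge `override` into a copy of `base`.

    Nested dictionaries are merged key by key; any other value in
    `override` replaces the value in `base`.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults and an experiment-specific override.
defaults = {
    "llm": {"name": "example-model", "max_seq_length": 4096},
    "embedding": {"batch_size": 32},
}
experiment = {"llm": {"max_seq_length": 8192}}

config = merge_config(defaults, experiment)
```

Only the overridden leaf changes; sibling keys and the original `defaults` dictionary are left untouched.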

Settings

Application-wide settings using Pydantic for validation.

Key Components:

  • Environment variable management
  • API keys and credentials
  • Database connection settings

Data Models

Core data structures used throughout the application.

Key Components:

  • Document models
  • Judgment models
  • Extraction schemas

Schema

Weaviate schema definitions and validation.

Data Management

Data Loaders

Dataset loading utilities for Weaviate ingestion.

Key Functions:

  • DatasetLoader.load_chunk_dataset() - Load chunk embeddings
  • DatasetLoader.load_document_dataset() - Load document embeddings with column remapping
  • Dataset column mapping configurations
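
As a rough illustration of column remapping, the sketch below renames the keys of one row according to a mapping. The mapping, the column names, and the helper `remap_columns` are assumptions made for this example; they are not the signature or behaviour of `DatasetLoader.load_document_dataset()`.

```python
from typing import Any

# Hypothetical mapping from source dataset columns to target schema fields.
COLUMN_MAP = {"judgment_id": "document_id", "text": "full_text"}


def remap_columns(row: dict[str, Any], column_map: dict[str, str]) -> dict[str, Any]:
    """Rename keys listed in `column_map`; keep all other keys unchanged."""
    return {column_map.get(key, key): value for key, value in row.items()}


row = {"judgment_id": "abc-1", "text": "Example ruling text", "embedding": [0.1, 0.2]}
remapped = remap_columns(row, COLUMN_MAP)
```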

Weaviate Base Database

Base class for Weaviate database operations.

Key Features:

  • Connection management
  • Collection creation and management
  • Batch operations
  • Error handling

Judgments Database

Weaviate database operations for court judgments.

Key Components:

  • WeaviateJudgmentsDatabase - Main database class
  • Judgment and chunk collection management
  • Schema definitions with 50+ fields
  • UMAP coordinate support

Documents Database

Weaviate database operations for generic documents.

Dataset Factory

Factory for creating and managing datasets.

Dataset Mapper

Utilities for mapping between different dataset schemas.

Stream Ingester

Production-grade streaming ingestion pipeline.

Key Features:

  • Batch processing
  • Error handling and retry logic
  • Progress tracking
  • Memory-efficient streaming
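
The features listed above can be sketched with the standard library alone. This is an illustrative outline under stated assumptions, not the Stream Ingester's implementation: `batched`, `ingest_stream`, `write_batch`, and the retry parameters are hypothetical names, and the writer is a stand-in for a real database call.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield lists of at most `batch_size` items without materialising the input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def ingest_stream(
    items: Iterable[T],
    write_batch: Callable[[list[T]], None],
    batch_size: int = 100,
    max_retries: int = 3,
    backoff_seconds: float = 0.0,
) -> int:
    """Send items to `write_batch` in batches, retrying each failed batch."""
    ingested = 0
    for batch in batched(items, batch_size):
        for attempt in range(1, max_retries + 1):
            try:
                write_batch(batch)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # Give up after the last allowed attempt.
                time.sleep(backoff_seconds * attempt)  # Linear backoff.
        ingested += len(batch)  # Simple progress counter.
    return ingested


# Hypothetical writer that fails once, then succeeds.
written: list[list[int]] = []
failures = {"remaining": 1}


def flaky_writer(batch: list[int]) -> None:
    if failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise ConnectionError("simulated transient failure")
    written.append(batch)


total = ingest_stream(range(5), flaky_writer, batch_size=2)
```

Because `batched` pulls from an iterator, only one batch is held in memory at a time, which is the point of streaming ingestion.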

LLM Operations

LLM Factory

Factory for creating and configuring language models.

Supported Models:

  • Llama 3.1/3.2
  • Mistral/Nemo
  • Phi-4
  • Bielik (Polish)

Key Functions:

  • get_llm() - Create model from configuration
  • get_llama_3() - Llama-specific setup
  • get_mistral() - Mistral-specific setup
  • Model quantization (4-bit, 8-bit)
  • PEFT/LoRA adapter loading

Prediction

LLM prediction utilities.

Key Functions:

  • predict_with_llm() - Batch prediction with progress tracking
  • DataLoader integration
  • Performance metrics

Information Extraction

Gemini Chain

LangChain extraction chain using Gemini 2.5 Pro/Flash.

Key Components:

  • GeminiExtractionChain - Main extraction class
  • ExtractionSchema - Schema definition
  • DocumentType - Document type enum

Key Features:

  • Structured output parsing
  • SQLite caching
  • Langfuse observability integration
  • Batch extraction support
  • Automatic text truncation
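
To show the idea behind SQLite caching of model responses, here is a small sketch using only `sqlite3` and `hashlib`. It is not the caching layer used by GeminiExtractionChain or by LangChain; the class `SQLiteResponseCache`, its table layout, and the stand-in `fake_model` function are assumptions made for illustration.

```python
import hashlib
import sqlite3
from typing import Callable


class SQLiteResponseCache:
    """Cache string responses in SQLite, keyed by a hash of the prompt."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, response TEXT NOT NULL)"
        )

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get_or_compute(self, prompt: str, compute: Callable[[str], str]) -> str:
        """Return the cached response, calling `compute` only on a cache miss."""
        key = self._key(prompt)
        row = self._conn.execute(
            "SELECT response FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]
        response = compute(prompt)
        with self._conn:  # Commits the insert on success.
            self._conn.execute(
                "INSERT INTO cache (key, response) VALUES (?, ?)", (key, response)
            )
        return response


# Stand-in for an expensive model call; records how often it is invoked.
calls: list[str] = []


def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return prompt.upper()


cache = SQLiteResponseCache()
first = cache.get_or_compute("extract the court name", fake_model)
second = cache.get_or_compute("extract the court name", fake_model)
```

The second lookup is served from the database, so the stand-in model is called only once.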

Preprocessing

Text Chunker

Text chunking utilities for document segmentation.

Key Components:

  • TextChunker - Main chunking class
  • Recursive character splitting
  • Token-based chunking
  • Configurable overlap
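
The effect of a configurable overlap can be seen in a simple fixed-size sliding window over characters. Note that this is a simpler technique than the recursive character splitting or token-based chunking listed above, and `chunk_text` with its parameters is a hypothetical helper, not the TextChunker interface.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split `text` into windows of `chunk_size` characters.

    Consecutive chunks share `overlap` characters so that content near a
    boundary appears in both neighbouring chunks.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reaches the end of the text.
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
# chunks == ["abcd", "defg", "ghij"]
```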

Text Encoder

Text encoding and tokenization utilities.

Context Truncator

Context window management for LLMs.

Formatters

Text formatting utilities for legal documents.

Parser Base

Base class for document parsers.

PL Court Parser

Parser for Polish court documents.

Evaluation

Metrics

Evaluation metrics for information extraction.

Key Functions:

  • evaluate_date() - Date field evaluation with parsing
  • evaluate_number() - Numeric field evaluation with tolerance
  • evaluate_string_rouge() - ROUGE scores for text fields
  • evaluate_enum() - Enum classification with hallucination detection
  • evaluate_list_greedy() - List matching with precision/recall/F1
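
As an illustration of greedy list matching with precision, recall, and F1, the sketch below pairs each predicted item with its most similar unmatched gold item. It is one plausible reading of the idea, not the implementation of `evaluate_list_greedy()`: the function name `greedy_list_scores`, the `difflib` similarity measure, and the 0.8 threshold are all assumptions.

```python
from difflib import SequenceMatcher


def greedy_list_scores(
    predicted: list[str], gold: list[str], threshold: float = 0.8
) -> dict[str, float]:
    """Greedily match predicted items to gold items and score the result."""
    unmatched = list(gold)
    matches = 0
    for item in predicted:
        best_index, best_score = -1, 0.0
        for index, candidate in enumerate(unmatched):
            score = SequenceMatcher(None, item.lower(), candidate.lower()).ratio()
            if score > best_score:
                best_index, best_score = index, score
        if best_index >= 0 and best_score >= threshold:
            unmatched.pop(best_index)  # Each gold item can be matched once.
            matches += 1

    precision = matches / len(predicted) if predicted else 0.0
    recall = matches / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = greedy_list_scores(["theft", "fraud", "arson"], ["Theft", "fraud"])
```

Here two of three predictions match, so precision is 2/3, recall is 1.0, and F1 is 0.8. Greedy matching is order-dependent and may not find the optimal assignment; that is the usual trade-off for its simplicity.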

Extraction Evaluation

End-to-end extraction evaluation pipeline.

LLM as Judge

Base

Base classes for LLM-as-judge evaluation.

Judge

Single-document LLM judge implementation.

Batched Judge

Batch processing LLM judge.

Data Model

Data models for LLM judge evaluation.

Result Loading

Utilities for loading and processing judge results.

Retrieval

Hybrid search combining semantic and keyword search.

Traditional keyword-based search.
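
One common way to combine semantic and keyword results is a weighted sum of normalised scores. The sketch below shows that idea in plain Python; it is not the project's retrieval code or Weaviate's fusion algorithm, and `hybrid_rank`, the min-max normalisation, and the example scores are assumptions. The `alpha` convention (1.0 for purely semantic, 0.0 for purely keyword) mirrors the one Weaviate uses for hybrid queries.

```python
def _normalise(scores: dict[str, float]) -> dict[str, float]:
    """Min-max scale scores to [0, 1]; equal scores all map to 1.0."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (value - low) / (high - low) for doc_id, value in scores.items()}


def hybrid_rank(
    semantic: dict[str, float], keyword: dict[str, float], alpha: float = 0.5
) -> list[tuple[str, float]]:
    """Rank documents by alpha * semantic + (1 - alpha) * keyword score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    sem, kw = _normalise(semantic), _normalise(keyword)
    combined = {
        doc_id: alpha * sem.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in sem.keys() | kw.keys()
    }
    # Highest score first; ties broken by document id for determinism.
    return sorted(combined.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical raw scores from a vector search and a keyword (BM25-style) search.
semantic_scores = {"a": 0.9, "b": 0.1}
keyword_scores = {"b": 12.0, "c": 3.0}

balanced = hybrid_rank(semantic_scores, keyword_scores, alpha=0.5)
keyword_only = hybrid_rank(semantic_scores, keyword_scores, alpha=0.0)
```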

Utilities

Config Utils

Configuration utilities and helpers.

Logging

Logging configuration using loguru.

Pipeline

Pipeline utilities for DVC workflows.

HuggingFace Utils

HuggingFace Hub utilities for dataset and model management.

Date Utils

Date parsing and formatting utilities.

Misc

Miscellaneous utilities.

Quick Navigation

By Use Case

Data Ingestion:

  1. Data Loaders - Load datasets
  2. Stream Ingester - Ingest to Weaviate
  3. Judgments Database - Database operations

Information Extraction:

  1. Gemini Chain - Extract with Gemini
  2. Metrics - Evaluate extractions
  3. LLM as Judge - LLM-based evaluation

Model Training & Inference:

  1. LLM Factory - Create models
  2. Prediction - Generate predictions
  3. Preprocessing - Prepare data

By Module Type

Configuration:

Data Access:

Models:

Evaluation:

Documentation Conventions

Docstring Style

All modules use Google-style docstrings:

def function_name(arg1: str, arg2: int) -> bool:
    """Brief description of function.

    Longer description with more details about what the function does,
    its purpose, and how it should be used.

    Args:
        arg1: Description of first argument.
        arg2: Description of second argument.

    Returns:
        Description of return value.

    Raises:
        ValueError: When input is invalid.

    Example:
        >>> result = function_name("test", 42)
        >>> print(result)
        True
    """
    # Placeholder body so the template runs and the doctest above passes.
    if not arg1:
        raise ValueError("arg1 must be a non-empty string")
    return arg2 > 0

Type Annotations

All public APIs include comprehensive type annotations following PEP 484.

Code Examples

Most functions include usage examples in docstrings.

Contributing to API Documentation

Adding Documentation

  1. Update Docstrings: Add or improve docstrings in source code
  2. Regenerate Docs: Run ./scripts/docs/generate_api_docs.sh
  3. Review: Check generated documentation in docs/reference/api/
  4. Commit: Include both the source changes and the regenerated docs in the same commit

Documentation Standards

Follow the Style Guide for:

  • Docstring formatting
  • Type annotation conventions
  • Example code standards
  • Cross-referencing guidelines

Automation

API documentation is automatically generated from source code docstrings using:

  • MkDocs: Static site generator
  • mkdocstrings: Python documentation plugin
  • Material for MkDocs: Modern theme

Last Updated: 2025-10-11
Coverage: ~60% of public APIs documented
Target: 100% coverage by 2025-11-01