Data Loaders

Dataset loading utilities for Weaviate ingestion with column remapping support.

Overview

The juddges.data.loaders module provides the DatasetLoader class for loading and preparing datasets for Weaviate ingestion. It handles:

  • Loading chunk embeddings from parquet files
  • Loading document embeddings with dataset metadata
  • Column mapping between dataset schemas and Weaviate schemas
  • Dataset validation and error handling

Key Features

  • Automatic Column Remapping: Renames dataset columns to the matching Weaviate schema fields
  • Multi-Dataset Support: Predefined mappings for juddges/pl-court-raw and en-court-raw
  • Validation: Checks that the embeddings exist and that loaded datasets are not empty
  • Efficient Loading: Uses Polars lazy scans to join embeddings with dataset metadata
  • Error Handling: Logs loading failures and re-raises the original exception

Usage Example

from juddges.config import EmbeddingConfig
from juddges.data.loaders import DatasetLoader

# Create configuration
config = EmbeddingConfig(
    dataset_name="juddges/pl-court-raw",
    agg_embeddings_dir="data/embeddings/agg",
    chunk_embeddings_dir="data/embeddings/chunks"
)

# Initialize loader
loader = DatasetLoader(config)

# Load document dataset with remapped columns
doc_dataset = loader.load_document_dataset()
# Columns are automatically remapped from dataset schema to Weaviate schema

# Load chunk dataset
chunk_dataset = loader.load_chunk_dataset()

Dataset Column Mappings

The module ships predefined column mappings that rename each dataset's columns to the field names used in the Weaviate schema:

Polish Court Dataset (juddges/pl-court-raw)

Maps raw dataset columns to Weaviate fields (a selection; the full mapping is listed under DATASET_COLUMN_MAPPINGS):

Dataset Column | Weaviate Field
judgment_id    | document_id
docket_number  | document_number
judgment_date  | date_issued
source         | source_url
full_text      | full_text

English Court Dataset (en-court-raw)

Maps English court data to Weaviate fields (a selection; the full mapping is listed under DATASET_COLUMN_MAPPINGS):

Dataset Column | Weaviate Field
judgment_id    | document_id
issued_on      | date_issued
case_number    | document_number
content        | full_text

API Reference

DatasetLoader

DatasetLoader(config: EmbeddingConfig)

Utility class for loading datasets with embedding data.

PARAMETER | DESCRIPTION
config    | Embedding configuration with dataset paths. TYPE: EmbeddingConfig

Source code in juddges/data/loaders.py
def __init__(self, config: EmbeddingConfig):
    """
    Initialize dataset loader with embedding configuration.

    Args:
        config: Embedding configuration with dataset paths
    """
    self.config = config
    self._check_embeddings_exist()
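
The constructor calls _check_embeddings_exist(), whose body is not shown on this page. The following is only a sketch of what such a check might look like, assuming it asserts that both configured directories exist (which would be consistent with the AssertionError handled in the Error Handling section). The standalone function name, its parameters, and the message text are hypothetical.

```python
from pathlib import Path


def check_embeddings_exist(agg_embeddings_dir: str, chunk_embeddings_dir: str) -> None:
    """Hypothetical stand-in for DatasetLoader._check_embeddings_exist.

    Assumption: the real method asserts that both directories exist.
    """
    for label, directory in (
        ("aggregated", agg_embeddings_dir),
        ("chunk", chunk_embeddings_dir),
    ):
        # Fail early, before any expensive loading starts.
        assert Path(directory).is_dir(), f"Missing {label} embeddings directory: {directory}"
```

If the real check does use assert, note that Python skips assert statements when run with the -O flag, so the check would not fire in that mode.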

Functions

load_chunk_dataset

load_chunk_dataset() -> Dataset

Load the chunk embeddings dataset.

RETURNS | DESCRIPTION
Dataset | Hugging Face dataset containing chunk embeddings

RAISES     | DESCRIPTION
ValueError | If the dataset is empty or not loaded correctly
Exception  | If dataset loading fails

Source code in juddges/data/loaders.py
def load_chunk_dataset(self) -> Dataset:
    """
    Load the chunk embeddings dataset.

    Returns:
        Hugging Face dataset containing chunk embeddings

    Raises:
        ValueError: If the dataset is empty or not loaded correctly
        Exception: If dataset loading fails
    """
    logger.info(f"Loading chunk embeddings from {self.config.chunk_embeddings_dir}...")
    try:
        chunk_ds = load_dataset(
            "parquet",
            data_dir=self.config.chunk_embeddings_dir,
            num_proc=DEFAULT_PROCESSING_PROC,
        )
        if not chunk_ds or "train" not in chunk_ds or chunk_ds["train"].num_rows == 0:
            logger.error("Chunk embeddings dataset is empty or not loaded correctly.")
            raise ValueError("Empty chunk embeddings dataset")
        return chunk_ds["train"]
    except Exception as e:
        logger.error(f"Failed to load chunk embeddings dataset: {e}")
        raise

load_document_dataset

load_document_dataset() -> Dataset

Load the document embeddings dataset.

RETURNS | DESCRIPTION
Dataset | Hugging Face dataset containing document embeddings

RAISES     | DESCRIPTION
ValueError | If the dataset is empty or not loaded correctly
Exception  | If dataset loading fails

Source code in juddges/data/loaders.py
def load_document_dataset(self) -> Dataset:
    """
    Load the document embeddings dataset.

    Returns:
        Hugging Face dataset containing document embeddings

    Raises:
        ValueError: If the dataset is empty or not loaded correctly
        Exception: If dataset loading fails
    """
    logger.info(
        f"Loading document dataset from {self.config.dataset_name} and {self.config.agg_embeddings_dir}..."
    )
    try:
        ds_polars = pl.scan_parquet(f"hf://datasets/{self.config.dataset_name}/data/*.parquet")
        agg_ds_polars = pl.scan_parquet(f"{self.config.agg_embeddings_dir}/*.parquet")

        logger.info("Preparing aggregated dataset (it may take a few minutes)...")
        with tempfile.NamedTemporaryFile(suffix=".parquet") as temp_file:
            agg_ds_polars.join(ds_polars, on="judgment_id", how="left").sink_parquet(
                temp_file.name
            )
            agg_ds = load_dataset(
                "parquet",
                data_files=temp_file.name,
                num_proc=DEFAULT_PROCESSING_PROC,
            )["train"]

        if not agg_ds or agg_ds.num_rows == 0:
            logger.error("Aggregated embeddings dataset is empty or not loaded correctly.")
            raise ValueError("Empty aggregated embeddings dataset")

        # Remap columns to Weaviate schema
        mapping = DATASET_COLUMN_MAPPINGS.get(self.config.dataset_name)
        if mapping:
            agg_ds = agg_ds.map(lambda row: remap_row(row, mapping))
        else:
            logger.warning(
                f"No column mapping found for dataset: {self.config.dataset_name}, using original columns."
            )

        return agg_ds
    except Exception as e:
        logger.error(f"Failed to load aggregated embeddings dataset: {e}")
        raise
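
The listing above performs a left join with the aggregated embeddings on the left side, so every embedded document is kept and documents without matching metadata end up with null metadata fields. The pure-Python sketch below illustrates that behavior only; it is not the Polars implementation, the left_join helper and sample values are made up for illustration, and it assumes unique judgment_id values in the metadata and no overlapping column names other than the key.

```python
def left_join(embedding_rows, metadata_rows, key="judgment_id"):
    """Illustrative left join: keep every embedding row, attach metadata if present."""
    metadata_by_key = {row[key]: row for row in metadata_rows}
    metadata_columns = {col for row in metadata_rows for col in row if col != key}

    joined = []
    for emb in embedding_rows:
        meta = metadata_by_key.get(emb[key], {})
        row = dict(emb)
        for col in metadata_columns:
            # None stands in for the null a left join produces when nothing matches.
            row[col] = meta.get(col)
        joined.append(row)
    return joined


# Illustrative data: "b2" has an embedding but no metadata row.
embeddings = [
    {"judgment_id": "a1", "embedding": [0.1, 0.2]},
    {"judgment_id": "b2", "embedding": [0.3, 0.4]},
]
metadata = [{"judgment_id": "a1", "docket_number": "I C 1/20"}]

joined = left_join(embeddings, metadata)
```

Both embedding rows survive the join; the row for "b2" simply carries docket_number set to None.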

remap_row

remap_row(row: dict, mapping: dict) -> dict

Remap a row's keys according to the mapping dict.

Source code in juddges/data/loaders.py
def remap_row(row: dict, mapping: dict) -> dict:
    """Remap a row's keys according to the mapping dict."""
    return {
        weaviate_key: row.get(dataset_key)
        for dataset_key, weaviate_key in mapping.items()
        if dataset_key in row
    }
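
A short example may help show what remap_row does with columns that are missing from the row or absent from the mapping. The function body is copied from the listing above so the snippet runs on its own; the row values and the embedding column name are illustrative assumptions.

```python
def remap_row(row: dict, mapping: dict) -> dict:
    """Copied from the listing above so the example is self-contained."""
    return {
        weaviate_key: row.get(dataset_key)
        for dataset_key, weaviate_key in mapping.items()
        if dataset_key in row
    }


# Keys are dataset columns, values are Weaviate fields.
mapping = {
    "judgment_id": "document_id",
    "docket_number": "document_number",
    "judgment_date": "date_issued",
}

row = {
    "judgment_id": "a1",          # illustrative value
    "docket_number": "I C 1/20",  # illustrative value
    "embedding": [0.1, 0.2],      # hypothetical column, not in the mapping
}

remapped = remap_row(row, mapping)
print(remapped)
# {'document_id': 'a1', 'document_number': 'I C 1/20'}
```

On its own, remap_row returns only the mapped keys: judgment_date is skipped because the row does not contain it, and embedding is left out because the mapping does not mention it.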

DATASET_COLUMN_MAPPINGS module-attribute

DATASET_COLUMN_MAPPINGS = {
    "juddges/pl-court-raw": {
        "source": "source_url",
        "judgment_id": "document_id",
        "docket_number": "document_number",
        "judgment_date": "date_issued",
        "publication_date": "publication_date",
        "last_update": "last_updated",
        "court_id": "source_id",
        "department_id": "issuing_body",
        "judgment_type": "judgment_type",
        "excerpt": "summary",
        "xml_content": "raw_content",
        "presiding_judge": "presiding_judge",
        "decision": "outcome",
        "judges": "judges",
        "legal_bases": "legal_bases",
        "publisher": "publisher",
        "recorder": "recorder",
        "reviser": "reviser",
        "keywords": "keywords",
        "num_pages": "num_pages",
        "full_text": "full_text",
        "volume_number": "volume_number",
        "volume_type": "volume_type",
        "court_name": "court_name",
        "department_name": "department_name",
        "extracted_legal_bases": "extracted_legal_bases",
        "references": "references",
        "thesis": "thesis",
        "country": "country",
        "court_type": "court_type",
    },
    "en-court-raw": {
        "judgment_id": "document_id",
        "category": "document_type",
        "content": "full_text",
        "issued_on": "date_issued",
        "case_number": "document_number",
        "lang": "language",
        "country": "country",
        "abstract": "summary",
        "main_point": "thesis",
        "tags": "keywords",
    },
}

Common Patterns

Adding New Dataset Mapping

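To add a new dataset mapping, add an entry to DATASET_COLUMN_MAPPINGS. Keys are dataset column names and values are Weaviate field names, which is how remap_row reads the mapping:

DATASET_COLUMN_MAPPINGS = {
    # Keep the existing entries and add yours alongside them.
    "your-org/your-dataset": {
        # "dataset_column": "weaviate_field"
        "doc_id": "document_id",
        "content": "full_text",
        # Add more mappings
    },
}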
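
Because remap_row treats mapping keys as dataset columns and builds its result keyed by the Weaviate field, two dataset columns that point at the same Weaviate field would silently overwrite each other, with the later entry winning. A small check along the following lines could catch that before a new mapping is added; find_duplicate_targets is a hypothetical helper and not part of juddges.data.loaders.

```python
from collections import Counter


def find_duplicate_targets(mapping: dict) -> list:
    """Return Weaviate field names that more than one dataset column maps to."""
    counts = Counter(mapping.values())
    return sorted(field for field, n in counts.items() if n > 1)


# Illustrative mapping with a deliberate mistake: two columns target full_text.
new_mapping = {
    "doc_id": "document_id",
    "content": "full_text",
    "body": "full_text",
}

duplicates = find_duplicate_targets(new_mapping)
print(duplicates)
# ['full_text']
```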

Custom Column Remapping

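For runtime column remapping, pass remap_row any dict that maps source column names to target field names:

from juddges.data.loaders import remap_row

# Define custom mapping: source column -> target field
custom_mapping = {
    "custom_id_field": "document_id",
    "custom_text_field": "full_text",
}

# Example input row (illustrative values)
row_dict = {"custom_id_field": "doc-1", "custom_text_field": "Judgment text"}

# Remap a row
remapped = remap_row(row_dict, custom_mapping)
# {"document_id": "doc-1", "full_text": "Judgment text"}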

Error Handling

The loader validates datasets and provides clear error messages:

try:
    loader = DatasetLoader(config)
    dataset = loader.load_document_dataset()
except AssertionError as e:
    # Embeddings directory doesn't exist
    print(f"Configuration error: {e}")
except ValueError as e:
    # Dataset is empty or not loaded correctly
    print(f"Dataset error: {e}")
except Exception as e:
    # Other loading errors
    print(f"Loading failed: {e}")