Data Loaders¶
Dataset loading utilities for Weaviate ingestion with column remapping support.
Overview¶
The juddges.data.loaders module provides the DatasetLoader class for loading and preparing datasets for Weaviate ingestion. It handles:
- Loading chunk embeddings from parquet files
- Loading document embeddings with dataset metadata
- Column mapping between dataset schemas and Weaviate schemas
- Dataset validation and error handling
Key Features¶
- Automatic Column Remapping: Maps dataset columns to Weaviate schema fields
- Multi-Dataset Support: Predefined mappings for juddges/pl-court-raw and en-court-raw
- Validation: Checks that the embeddings directories exist and that loaded datasets are not empty
- Efficient Loading: Uses Polars for fast data processing
- Error Handling: Raises AssertionError for a missing embeddings directory and ValueError for an empty or incorrectly loaded dataset
Usage Example¶
from juddges.config import EmbeddingConfig
from juddges.data.loaders import DatasetLoader
# Create configuration
config = EmbeddingConfig(
    dataset_name="juddges/pl-court-raw",
    agg_embeddings_dir="data/embeddings/agg",
    chunk_embeddings_dir="data/embeddings/chunks",
)
# Initialize loader
loader = DatasetLoader(config)
# Load document dataset with remapped columns
doc_dataset = loader.load_document_dataset()
# Columns are automatically remapped from dataset schema to Weaviate schema
# Load chunk dataset
chunk_dataset = loader.load_chunk_dataset()
Dataset Column Mappings¶
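The module includes predefined column mappings so that each supported dataset is compatible with the Weaviate schema. The tables below list a subset of each mapping; the full dictionaries appear under DATASET_COLUMN_MAPPINGS in the API Reference.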
Polish Court Dataset (juddges/pl-court-raw)¶
Maps raw dataset columns to Weaviate fields:
| Dataset Column | Weaviate Field |
|---|---|
| judgment_id | document_id |
| docket_number | document_number |
| judgment_date | date_issued |
| source | source_url |
| full_text | full_text |
English Court Dataset (en-court-raw)¶
Maps English court data to Weaviate schema:
| Dataset Column | Weaviate Field |
|---|---|
| judgment_id | document_id |
| issued_on | date_issued |
| case_number | document_number |
| content | full_text |
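Both mappings appear to normalise their datasets onto a common set of field names. The sketch below checks this for the columns listed in the tables. It reads each dictionary's keys as dataset columns and its values as target fields, which is an assumption based on the same values recurring in both entries of DATASET_COLUMN_MAPPINGS. The names PL_EXCERPT and EN_EXCERPT are illustrative and not part of the module.

```python
# Excerpts copied from DATASET_COLUMN_MAPPINGS, limited to the columns shown
# in the tables. Assumption: keys are dataset columns, values are target fields.
PL_EXCERPT = {
    "judgment_id": "document_id",
    "docket_number": "document_number",
    "judgment_date": "date_issued",
    "source": "source_url",
    "full_text": "full_text",
}
EN_EXCERPT = {
    "judgment_id": "document_id",
    "issued_on": "date_issued",
    "case_number": "document_number",
    "content": "full_text",
}

pl_targets = set(PL_EXCERPT.values())
en_targets = set(EN_EXCERPT.values())

# Target fields that both datasets populate, and those only the Polish one does.
shared = sorted(pl_targets & en_targets)
pl_only = sorted(pl_targets - en_targets)

print(shared)   # ['date_issued', 'document_id', 'document_number', 'full_text']
print(pl_only)  # ['source_url']
```

A comparison like this is a quick way to see which fields may be missing for documents from one dataset before querying across both.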
API Reference¶
DatasetLoader¶
Utility class for loading datasets with embedding data.
| PARAMETER | DESCRIPTION |
|---|---|
| config | Embedding configuration with dataset paths |
Source code in juddges/data/loaders.py
Functions¶
load_chunk_dataset¶
Load the chunk embeddings dataset.
| RETURNS | DESCRIPTION |
|---|---|
| Dataset | Hugging Face dataset containing chunk embeddings |
| RAISES | DESCRIPTION |
|---|---|
| ValueError | If the dataset is empty or not loaded correctly |
| Exception | If dataset loading fails |
load_document_dataset¶
Load the document embeddings dataset.
| RETURNS | DESCRIPTION |
|---|---|
| Dataset | Hugging Face dataset containing document embeddings |
| RAISES | DESCRIPTION |
|---|---|
| ValueError | If the dataset is empty or not loaded correctly |
| Exception | If dataset loading fails |
remap_row¶
Remap a row's keys according to the mapping dict.
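The following is a minimal sketch of what such a remapping might look like, not the library's actual implementation. The function name remap_row_sketch is hypothetical, and two behaviours are assumptions: mapping keys are treated as the existing column names, and columns absent from the mapping are passed through unchanged. The row values are made up for illustration.

```python
from typing import Any


def remap_row_sketch(row: dict[str, Any], mapping: dict[str, str]) -> dict[str, Any]:
    """Return a copy of ``row`` with its keys renamed according to ``mapping``.

    Assumption: keys that are missing from ``mapping`` keep their original name.
    """
    return {mapping.get(key, key): value for key, value in row.items()}


# Illustrative row and mapping; "extra" has no entry in the mapping.
row = {"judgment_id": "abc-123", "content": "Example text", "extra": 1}
mapping = {"judgment_id": "document_id", "content": "full_text"}

remapped = remap_row_sketch(row, mapping)
print(remapped)
# {'document_id': 'abc-123', 'full_text': 'Example text', 'extra': 1}
```

Building a new dictionary leaves the input row untouched, which keeps the remapping safe to apply inside a dataset map or a streaming loop.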
DATASET_COLUMN_MAPPINGS module-attribute¶
DATASET_COLUMN_MAPPINGS = {
    'juddges/pl-court-raw': {
        'source': 'source_url',
        'judgment_id': 'document_id',
        'docket_number': 'document_number',
        'judgment_date': 'date_issued',
        'publication_date': 'publication_date',
        'last_update': 'last_updated',
        'court_id': 'source_id',
        'department_id': 'issuing_body',
        'judgment_type': 'judgment_type',
        'excerpt': 'summary',
        'xml_content': 'raw_content',
        'presiding_judge': 'presiding_judge',
        'decision': 'outcome',
        'judges': 'judges',
        'legal_bases': 'legal_bases',
        'publisher': 'publisher',
        'recorder': 'recorder',
        'reviser': 'reviser',
        'keywords': 'keywords',
        'num_pages': 'num_pages',
        'full_text': 'full_text',
        'volume_number': 'volume_number',
        'volume_type': 'volume_type',
        'court_name': 'court_name',
        'department_name': 'department_name',
        'extracted_legal_bases': 'extracted_legal_bases',
        'references': 'references',
        'thesis': 'thesis',
        'country': 'country',
        'court_type': 'court_type',
    },
    'en-court-raw': {
        'judgment_id': 'document_id',
        'category': 'document_type',
        'content': 'full_text',
        'issued_on': 'date_issued',
        'case_number': 'document_number',
        'lang': 'language',
        'country': 'country',
        'abstract': 'summary',
        'main_point': 'thesis',
        'tags': 'keywords',
    },
}
Related¶
- Judgments Database - Database operations for judgments
- Stream Ingester - Production ingestion pipeline
- Dataset Mapper - Additional mapping utilities
- How-To: Embed and Ingest - End-to-end workflow
Common Patterns¶
Adding New Dataset Mapping¶
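To add a new dataset mapping, add an entry to DATASET_COLUMN_MAPPINGS in juddges/data/loaders.py, keyed by the dataset name. As in the predefined entries, each inner key is a dataset column and each value is the Weaviate field it maps to:
DATASET_COLUMN_MAPPINGS = {
    "your-org/your-dataset": {
        "dataset_column": "weaviate_field",
        "doc_id": "document_id",
        "content": "full_text",
        # Add more mappings
    }
}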
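Before registering a new mapping, it can help to confirm that no two dataset columns are renamed to the same target field, since one value would then silently overwrite the other. The helper below is a hypothetical sketch and not part of juddges; it assumes the inner dictionary maps dataset columns (keys) to target fields (values).

```python
from collections import defaultdict


def find_target_collisions(mapping: dict[str, str]) -> dict[str, list[str]]:
    """Return every target field that more than one source column maps to."""
    sources_by_target: dict[str, list[str]] = defaultdict(list)
    for source, target in mapping.items():
        sources_by_target[target].append(source)
    return {
        target: sources
        for target, sources in sources_by_target.items()
        if len(sources) > 1
    }


# Illustrative mapping in which two columns compete for "document_id".
candidate = {"doc_id": "document_id", "id": "document_id", "content": "full_text"}

collisions = find_target_collisions(candidate)
print(collisions)  # {'document_id': ['doc_id', 'id']}
```

An empty result means every target field has exactly one source column.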
Custom Column Remapping¶
For runtime column remapping:
from juddges.data.loaders import remap_row
# Define custom mapping (dataset column -> Weaviate field)
custom_mapping = {
    "custom_id_field": "document_id",
    "custom_text_field": "full_text",
}
# Remap a row; row_dict is a single dataset row represented as a dict
remapped = remap_row(row_dict, custom_mapping)
Error Handling¶
The loader validates datasets and provides clear error messages:
try:
    loader = DatasetLoader(config)
    dataset = loader.load_document_dataset()
except AssertionError as e:
    # Embeddings directory doesn't exist
    print(f"Configuration error: {e}")
except ValueError as e:
    # Dataset is empty or not loaded correctly
    print(f"Dataset error: {e}")
except Exception as e:
    # Other loading errors
    print(f"Loading failed: {e}")
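The AssertionError case above corresponds to a missing embeddings directory. One option is to check the configured paths before constructing the loader so that every missing directory is reported at once. The helper missing_dirs below is a hypothetical sketch that uses only the standard library; the paths are the ones from the usage example.

```python
from pathlib import Path


def missing_dirs(*dirs: str) -> list[str]:
    """Return the given paths that do not exist as directories."""
    return [d for d in dirs if not Path(d).is_dir()]


# Paths taken from the usage example; adjust them to your own layout.
missing = missing_dirs("data/embeddings/agg", "data/embeddings/chunks")
if missing:
    print(f"Missing embeddings directories: {missing}")
else:
    print("All embeddings directories are present")
```

This does not replace the loader's own validation; it only surfaces path problems earlier and with a single message.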