Dataset to Weaviate Mapping¶
This document describes how to map HuggingFace dataset records to Weaviate documents and manage raw text versions of legal documents.
Overview¶
The JuDDGES project now supports:
- Bidirectional mapping between HuggingFace datasets and Weaviate documents
- Raw text storage in Weaviate (raw_content field) for unprocessed versions of judgments/tax interpretations
- Fast lookups using document_id or document_number as keys
- Bulk updates to populate Weaviate documents with raw text from datasets
Architecture¶
Key Components¶
- DatasetToWeaviateMapper (dataset_mapper.py) - Core mapping utility
- raw_content field - New Weaviate property for storing raw unprocessed text
- Column mappings - Defined in loaders.py for each dataset
Data Flow¶
HuggingFace Dataset → DatasetToWeaviateMapper → Weaviate Document
   (judgment_id)         (index lookup)          (document_id)
   (text field)                                  (raw_content field)
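The index lookup step in the middle of this flow can be pictured as two dictionaries, one keyed by the primary identifier and one by the secondary identifier. The following is a minimal sketch under that assumption, not the actual DatasetToWeaviateMapper implementation; `build_lookup_index`, `lookup`, and the sample records are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple

Record = Dict[str, Any]


def build_lookup_index(
    records: List[Record],
    id_field: str,
    secondary_id_field: Optional[str] = None,
) -> Tuple[Dict[str, Record], Dict[str, Record]]:
    """Index records by primary ID and, optionally, by a secondary ID."""
    by_id: Dict[str, Record] = {}
    by_number: Dict[str, Record] = {}
    for record in records:
        primary = record.get(id_field)
        if primary is not None:
            by_id[str(primary)] = record
        if secondary_id_field is not None:
            secondary = record.get(secondary_id_field)
            if secondary is not None:
                # Keep the first record if a secondary ID repeats.
                by_number.setdefault(str(secondary), record)
    return by_id, by_number


def lookup(
    by_id: Dict[str, Record],
    by_number: Dict[str, Record],
    document_id: Optional[str] = None,
    document_number: Optional[str] = None,
) -> Optional[Record]:
    """Try the primary key first, then fall back to the secondary key."""
    if document_id is not None and document_id in by_id:
        return by_id[document_id]
    if document_number is not None:
        return by_number.get(document_number)
    return None


# Made-up records shaped like juddges/pl-court-raw rows.
sample_records = [
    {"judgment_id": "A-1", "docket_number": "I C 1/20", "text": "raw text of A-1"},
    {"judgment_id": "B-2", "docket_number": "II K 7/21", "text": "raw text of B-2"},
]
by_id, by_number = build_lookup_index(sample_records, "judgment_id", "docket_number")
found = lookup(by_id, by_number, document_id="B-2")
print(found["text"] if found else "not found")
```

Both dictionaries hold references to the same record objects, so the secondary index adds key overhead only, not a second copy of the text.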
Schema Changes¶
Weaviate LegalDocuments Collection¶
New property added:
wvcc.Property(
name="raw_content",
data_type=wvcc.DataType.TEXT,
description="Raw unprocessed text version of judgment or tax interpretation",
skip_vectorization=True, # Not vectorized to save resources
)
LegalDocument Schema¶
Updated in schemas.py:
class LegalDocument(BaseModel):
    # ...
    full_text: Optional[str]    # Processed/cleaned text
    raw_content: Optional[str]  # Raw unprocessed text (NEW)
    # ...
Column Mappings¶
Polish Court Judgments (juddges/pl-court-raw)¶
| Dataset Field | Weaviate Property | Description |
|---|---|---|
| judgment_id | document_id | Primary identifier |
| docket_number | document_number | Secondary identifier |
| full_text | full_text | Processed text |
| text | raw_content | Raw unprocessed text |
| excerpt | summary | Abstract/summary |
Tax Interpretations (AI-TAX/pl-eureka-raw)¶
| Dataset Field | Weaviate Property | Description |
|---|---|---|
| id | document_id | Primary identifier |
| docker_number | document_number | Secondary identifier |
| html_content | full_text | Processed HTML content |
| text | raw_content | Raw unprocessed text |
| introduction | summary | Introduction section |
| question | thesis | Main question |
See loaders.py for complete mappings.
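To make the effect of a column mapping concrete, here is a small sketch that renames dataset fields to Weaviate property names. It assumes the mapping is a plain source-to-target dictionary; the real structure in loaders.py may differ, and `PL_COURT_COLUMN_MAPPING` and `to_weaviate_properties` are hypothetical names used only for this example.

```python
from typing import Any, Dict

# Copied from the Polish court judgments table; loaders.py holds the
# authoritative mapping and may define additional fields.
PL_COURT_COLUMN_MAPPING: Dict[str, str] = {
    "judgment_id": "document_id",
    "docket_number": "document_number",
    "full_text": "full_text",
    "text": "raw_content",
    "excerpt": "summary",
}


def to_weaviate_properties(
    record: Dict[str, Any],
    mapping: Dict[str, str],
) -> Dict[str, Any]:
    """Rename mapped dataset fields; skip unmapped fields and None values."""
    properties: Dict[str, Any] = {}
    for source_field, target_property in mapping.items():
        value = record.get(source_field)
        if value is not None:
            properties[target_property] = value
    return properties


record = {
    "judgment_id": "A-1",
    "docket_number": "I C 1/20",
    "text": "raw judgment text",
    "excerpt": None,          # missing value, skipped
    "court_name": "ignored",  # not in the mapping, dropped
}
properties = to_weaviate_properties(record, PL_COURT_COLUMN_MAPPING)
print(properties)
```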
Usage¶
1. Basic Mapping Example¶
from juddges.data import DatasetToWeaviateMapper, WeaviateLegalDocumentsDatabase

# Connect to Weaviate
db = WeaviateLegalDocumentsDatabase(
    host="localhost",
    port=8222,
    grpc_port=8085,
)

# Initialize mapper
mapper = DatasetToWeaviateMapper(
    db=db,
    dataset_name="juddges/pl-court-raw",
)

# Build index for fast lookups
mapper.build_index(
    id_field="judgment_id",
    secondary_id_field="docket_number",
)

# Get dataset record by document_id
dataset_record = mapper.get_dataset_record(
    document_id="150000000000503_I_C_001234_2020_Uz_2021-05-15_001"
)

# Get Weaviate document by document_id
weaviate_doc = mapper.get_weaviate_document(
    document_id="150000000000503_I_C_001234_2020_Uz_2021-05-15_001"
)
2. Update Raw Text in Weaviate¶
Using the Script¶
# Dry run to see what would be updated
python scripts/embed/update_raw_content.py \
--dataset-name juddges/pl-court-raw \
--raw-text-field text \
--dry-run
# Actually update documents
python scripts/embed/update_raw_content.py \
--dataset-name juddges/pl-court-raw \
--raw-text-field text \
--batch-size 100
Using the API¶
# Update all documents with raw_content from dataset
updated_count = mapper.update_raw_content_from_dataset(
    raw_content_field="text",  # Field in dataset
    batch_size=100,
    document_id_field="judgment_id",
)
print(f"Updated {updated_count} documents")
3. Find Missing Raw Text¶
# Find documents missing raw_content
missing = mapper.get_missing_raw_content_documents()
print(f"Found {len(missing)} documents without raw_content")
for doc in missing[:5]:
    print(f" - {doc['document_id']}: {doc.get('title', 'No title')}")
4. Join Dataset with Weaviate¶
# Get Weaviate documents
collection = db.legal_documents_collection
response = collection.query.fetch_objects(limit=10)
weaviate_docs = [obj.properties for obj in response.objects]
# Enrich with dataset fields
enriched = mapper.join_dataset_to_weaviate(
    weaviate_documents=weaviate_docs,
    dataset_fields=["text", "excerpt", "judgment_date"],
)
# Access enriched data
for doc in enriched:
    print(f"Document: {doc['document_id']}")
    print(f"Raw text: {doc.get('dataset_text', 'N/A')[:100]}...")
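The example reads the joined value from a `dataset_text` key, which suggests that joined fields are added under a `dataset_` prefix. The sketch below illustrates a join under that assumption using plain dictionaries; `join_records` is a hypothetical stand-in, not the library's join_dataset_to_weaviate, and the sample data is made up.

```python
from typing import Any, Dict, List


def join_records(
    weaviate_documents: List[Dict[str, Any]],
    dataset_by_id: Dict[str, Dict[str, Any]],
    dataset_fields: List[str],
    prefix: str = "dataset_",
) -> List[Dict[str, Any]]:
    """Return copies of the documents enriched with prefixed dataset fields."""
    enriched: List[Dict[str, Any]] = []
    for doc in weaviate_documents:
        merged = dict(doc)  # shallow copy, the input is left untouched
        record = dataset_by_id.get(doc.get("document_id"))
        if record is not None:
            for field in dataset_fields:
                merged[f"{prefix}{field}"] = record.get(field)
        enriched.append(merged)
    return enriched


weaviate_docs = [
    {"document_id": "A-1", "title": "Judgment A"},
    {"document_id": "Z-9", "title": "No dataset row"},
]
dataset_by_id = {
    "A-1": {"text": "raw text", "excerpt": "short", "judgment_date": "2021-05-15"},
}
enriched = join_records(weaviate_docs, dataset_by_id, ["text", "excerpt"])
for doc in enriched:
    print(doc["document_id"], doc.get("dataset_text", "N/A"))
```

Documents without a matching dataset record pass through unchanged, which is why the consuming code uses `.get()` with a default.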
Docker Usage¶
Run Update Script in Container¶
docker compose run --rm juddges \
python scripts/embed/update_raw_content.py \
--dataset-name juddges/pl-court-raw \
--raw-text-field text \
--weaviate-host weaviate \
--weaviate-port 8080
API Reference¶
DatasetToWeaviateMapper¶
Constructor¶
DatasetToWeaviateMapper(
    db: WeaviateLegalDocumentsDatabase,
    dataset_name: Optional[str] = None,
    dataset: Optional[Dataset] = None,
)
Methods¶
| Method | Description |
|---|---|
| build_index(id_field, secondary_id_field) | Build lookup index |
| get_dataset_record(document_id, document_number) | Get dataset record |
| get_weaviate_document(document_id, document_number) | Get Weaviate doc |
| join_dataset_to_weaviate(weaviate_documents, dataset_fields) | Enrich Weaviate docs |
| update_raw_content_from_dataset(raw_content_field, batch_size, document_id_field) | Bulk update raw_content |
| get_missing_raw_content_documents() | Find docs missing raw_content |
WeaviateLegalDocumentsDatabase (Extended)¶
New Filtering & Analysis Methods¶
| Method | Description |
|---|---|
| filter_by_raw_content_presence(has_raw_content, limit) | Filter documents by raw_content presence |
| get_raw_content_statistics() | Get coverage statistics (total, with/without raw_content, %) |
| filter_by_document_type_and_raw_content(document_type, has_raw_content, limit) | Combined filter by type and raw_content |
| compare_text_fields(document_id) | Compare full_text vs raw_content for a document |
| get_weaviate_document(document_id) | Get document properties by ID |
Example Usage¶
# Get statistics
stats = db.get_raw_content_statistics()
# Returns: {
# "total_documents": 1000,
# "with_raw_content": 850,
# "without_raw_content": 150,
# "coverage_percentage": 85.0
# }
# Filter by raw_content presence
docs_with_raw = db.filter_by_raw_content_presence(has_raw_content=True, limit=100)
docs_without_raw = db.filter_by_raw_content_presence(has_raw_content=False, limit=100)
# Combined filtering
judgments_missing_raw = db.filter_by_document_type_and_raw_content(
    document_type="judgment",
    has_raw_content=False,
    limit=50,
)
# Compare text fields
comparison = db.compare_text_fields(document_id="doc-123")
# Returns: {
# "document_id": "doc-123",
# "has_full_text": True,
# "has_raw_content": True,
# "full_text_length": 5000,
# "raw_content_length": 5500,
# "length_difference": 500,
# "length_ratio": 0.909
# }
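The numbers in these return values follow from simple arithmetic. The sketch below reproduces them from plain inputs, assuming that `length_ratio` is full_text length divided by raw_content length and that `length_difference` is raw minus full, which is what the example values (5000, 5500, 500, 0.909) imply. `coverage_statistics` and `compare_lengths` are hypothetical helpers, not the database methods themselves.

```python
from typing import Any, Dict, List, Optional


def coverage_statistics(documents: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count documents with a non-empty raw_content value."""
    total = len(documents)
    with_raw = sum(1 for doc in documents if doc.get("raw_content"))
    percentage = round(100.0 * with_raw / total, 2) if total else 0.0
    return {
        "total_documents": total,
        "with_raw_content": with_raw,
        "without_raw_content": total - with_raw,
        "coverage_percentage": percentage,
    }


def compare_lengths(
    document_id: str,
    full_text: Optional[str],
    raw_content: Optional[str],
) -> Dict[str, Any]:
    """Compare the lengths of the processed and raw text of one document."""
    full_len = len(full_text or "")
    raw_len = len(raw_content or "")
    return {
        "document_id": document_id,
        "has_full_text": bool(full_text),
        "has_raw_content": bool(raw_content),
        "full_text_length": full_len,
        "raw_content_length": raw_len,
        "length_difference": raw_len - full_len,
        # Assumed definition: processed length relative to raw length.
        "length_ratio": round(full_len / raw_len, 3) if raw_len else None,
    }


stats = coverage_statistics(
    [{"raw_content": "text"}] * 850 + [{"raw_content": None}] * 150
)
comparison = compare_lengths("doc-123", "x" * 5000, "y" * 5500)
print(stats["coverage_percentage"], comparison["length_ratio"])
```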
Performance Considerations¶
Indexing¶
- Build index once at initialization for fast lookups
- Index stores all dataset records in memory
- For large datasets (>100k records), consider batch processing
Batch Updates¶
- Default batch size: 100 documents
- Adjust based on Weaviate server capacity
- Monitor memory usage during bulk updates
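For intuition about what a batch size controls, the following sketch splits any iterable into fixed-size chunks without loading everything into memory first. It is a generic illustration using only the standard library; `batched` here is a local helper, and how the mapper batches internally is not specified in this document.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 250 fake document IDs processed with the default batch size of 100.
document_ids = (f"doc-{i}" for i in range(250))
batch_sizes = [len(batch) for batch in batched(document_ids, 100)]
print(batch_sizes)
```

Only one batch is materialized at a time, so peak memory scales with the batch size rather than with the dataset size.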
Vectorization¶
- raw_content field has skip_vectorization=True
- Only full_text is vectorized for semantic search
- This saves storage and computational resources
Common Patterns¶
Pattern 1: Incremental Updates¶
# Update only documents missing raw_content
missing = mapper.get_missing_raw_content_documents()
print(f"Updating {len(missing)} documents...")
updated = mapper.update_raw_content_from_dataset(
    raw_content_field="text",
    batch_size=50,
)
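Whether update_raw_content_from_dataset skips documents that are already populated is not stated here. If it does not, one way to restrict the work is to pre-select only the dataset records whose IDs appear in the missing list. The sketch below shows that selection with plain dictionaries; `select_records_for_missing` is a hypothetical helper, and it assumes each missing document carries a `document_id` that equals the dataset's `judgment_id`.

```python
from typing import Any, Dict, List


def select_records_for_missing(
    dataset_records: List[Dict[str, Any]],
    missing_documents: List[Dict[str, Any]],
    id_field: str = "judgment_id",
) -> List[Dict[str, Any]]:
    """Keep only dataset records whose ID is in the missing-document list."""
    missing_ids = {
        doc["document_id"] for doc in missing_documents if doc.get("document_id")
    }
    return [record for record in dataset_records if record.get(id_field) in missing_ids]


dataset_records = [
    {"judgment_id": "A-1", "text": "raw A"},
    {"judgment_id": "B-2", "text": "raw B"},
    {"judgment_id": "C-3", "text": "raw C"},
]
missing_documents = [{"document_id": "B-2", "title": "Judgment B"}]
to_update = select_records_for_missing(dataset_records, missing_documents)
print([record["judgment_id"] for record in to_update])
```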
Pattern 2: Dataset-Specific Mapping¶
# Tax interpretations use different ID field
mapper_tax = DatasetToWeaviateMapper(
    db=db,
    dataset_name="AI-TAX/pl-eureka-raw",
)
mapper_tax.build_index(
    id_field="id",  # Different from judgments
    secondary_id_field="docker_number",
)
Pattern 3: Cross-Reference Validation¶
# Ensure all Weaviate docs have dataset source
collection = db.legal_documents_collection
response = collection.query.fetch_objects(limit=1000)
for obj in response.objects:
    doc_id = obj.properties.get("document_id")
    dataset_record = mapper.get_dataset_record(document_id=doc_id)
    if not dataset_record:
        print(f"Warning: {doc_id} not found in dataset")
Troubleshooting¶
Issue: Index Not Built¶
Solution: Always call build_index() before using lookup methods.
Issue: Dataset Record Not Found¶
Causes:
- Incorrect ID field names
- Mismatch between dataset and Weaviate IDs
- Dataset not fully loaded
Solution: Verify ID fields and rebuild index.
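When verifying ID fields, it can help to compare the two ID sets directly and look at a few examples from each side. The following sketch does this with set operations; `diagnose_id_mismatch` is a hypothetical helper, and the ID lists would in practice come from the dataset column and from the Weaviate document_id property.

```python
from typing import Any, Dict, Iterable


def diagnose_id_mismatch(
    dataset_ids: Iterable[Any],
    weaviate_ids: Iterable[Any],
    sample_size: int = 5,
) -> Dict[str, Any]:
    """Summarize the overlap between dataset IDs and Weaviate IDs."""
    dataset_set = {str(value) for value in dataset_ids}
    weaviate_set = {str(value) for value in weaviate_ids}
    return {
        "matched": len(dataset_set & weaviate_set),
        # Sorted samples make formatting differences easy to spot by eye.
        "only_in_dataset": sorted(dataset_set - weaviate_set)[:sample_size],
        "only_in_weaviate": sorted(weaviate_set - dataset_set)[:sample_size],
    }


report = diagnose_id_mismatch(["A-1", "B-2", "C-3"], ["B-2", "C-3", "D-4"])
print(report)
```

A report with zero matches usually points to the wrong ID field, while a handful of one-sided IDs points to records that exist in only one of the two sources.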
Issue: Slow Updates¶
Solution: Increase the batch size, or run updates in parallel. For example, with a larger batch size:
mapper.update_raw_content_from_dataset(
    raw_content_field="text",
    batch_size=500,  # Increase batch size
)
Migration Guide¶
Updating Existing Weaviate Collections¶
If you have existing collections without raw_content:
- Add the field by recreating the collection (or use Weaviate schema migration)
- Run update script to populate raw_content from datasets
- Verify all documents have raw_content
# Step 1: Recreate collection (optional, field is added automatically)
python scripts/embed/create_weaviate_collections.py
# Step 2: Update raw_content
python scripts/embed/update_raw_content.py \
--dataset-name juddges/pl-court-raw \
--raw-text-field text
# Step 3: Verify
python scripts/embed/analyze_raw_content_coverage.py
Filtering and Analysis¶
Analyze Coverage¶
# Get overall statistics
python scripts/embed/analyze_raw_content_coverage.py
# Show coverage by document type
python scripts/embed/analyze_raw_content_coverage.py --by-type
# Compare text fields for specific document
python scripts/embed/analyze_raw_content_coverage.py \
--compare-document "150000000000503_I_C_001234_2020_Uz_2021-05-15_001"
Filter Documents by raw_content Presence¶
from juddges.data import WeaviateLegalDocumentsDatabase
db = WeaviateLegalDocumentsDatabase(host="localhost", port=8222, grpc_port=8085)
# Get documents WITH raw_content
with_raw = db.filter_by_raw_content_presence(has_raw_content=True, limit=100)
print(f"Found {len(with_raw)} documents with raw_content")
# Get documents WITHOUT raw_content
without_raw = db.filter_by_raw_content_presence(has_raw_content=False, limit=100)
print(f"Found {len(without_raw)} documents missing raw_content")
Get Coverage Statistics¶
stats = db.get_raw_content_statistics()
print(f"Coverage: {stats['coverage_percentage']}%")
print(f"With raw_content: {stats['with_raw_content']}/{stats['total_documents']}")
Filter by Document Type and raw_content¶
# Get judgments WITH raw_content
judgments_with_raw = db.filter_by_document_type_and_raw_content(
    document_type="judgment",
    has_raw_content=True,
    limit=100,
)
# Get tax interpretations WITHOUT raw_content
tax_without_raw = db.filter_by_document_type_and_raw_content(
    document_type="tax_interpretation",
    has_raw_content=False,
    limit=100,
)
Compare Text Fields¶
# Analyze differences between full_text and raw_content
comparison = db.compare_text_fields(document_id="your-doc-id")
if comparison:
    print(f"Full text: {comparison['full_text_length']} chars")
    print(f"Raw text: {comparison['raw_content_length']} chars")
    print(f"Ratio: {comparison['length_ratio']}")
Examples¶
Full examples available in: