Data Ingestion Guide for Juddges App¶
This guide explains how judgment data from HuggingFace is ingested into your Supabase database.
π Data Sources¶
1. Polish Judgments¶
Dataset: JuDDGES/pl-appealcourt-criminal
DOI: 10.57967/hf/8772
Description: Multi-jurisdiction legal case law dataset in standardized format. Contains Polish court decisions along with cases from other European jurisdictions.
Sample Structure:
{
"id": "12345",
"case_id": "I ACa 123/21",
"jurisdiction": "PL",
"court": "SΔ
d Apelacyjny w Warszawie",
"court_level": "Appeal",
"date": "2024-03-15",
"title": "Appeal regarding tax liability",
"text": "Full judgment text...",
"judges": ["SΔdzia A. Kowalski", "SΔdzia B. Nowak"],
"case_type": "Civil",
"keywords": ["tax", "liability", "appeal"],
"summary": "Brief summary of the case..."
}
Estimated Size: - ~50,000+ Polish cases available - Target sample: 3,000 cases (part of 6K+ goal) - Average case size: 5-15 KB
2. UK Judgments¶
Dataset: JuDDGES/en-appealcourt
DOI: 10.57967/hf/8773
Description: Complete collection of England & Wales Court of Appeal (Criminal Division) judgments. Contains 6,154 judgments with structured metadata.
Sample Structure:
{
"id": "54321",
"neutral_citation": "[2024] EWCA Crim 123",
"case_number": "202400123",
"case_name": "R v Smith",
"judgment_date": "2024-03-15",
"judgment_text": "Full judgment text...",
"judges": ["Lord Justice Smith", "Mr Justice Jones"],
"decision_type": "Judgment",
"outcome": "Appeal dismissed",
"summary": "Brief summary...",
"uri": "https://caselaw.nationalarchives.gov.uk/..."
}
Dataset Statistics: - Total cases: 6,154 judgments - Date range: Up to May 15, 2024 - Court: England & Wales Court of Appeal (Criminal Division) - Format: XML and cleaned structured data - Average case size: 8-25 KB
π Ingestion Pipeline¶
Architecture Overview¶
βββββββββββββββββββββββ
β HuggingFace Hub β
β - pl-appealcourt- β
β criminal (PL) β
β - en-appealcourt β
ββββββββββββ¬βββββββββββ
β Download via datasets library
βΌ
βββββββββββββββββββββββ
β Python Ingestion β
β Script β
β - Transform data β
β - Generate embed. β
ββββββββββββ¬βββββββββββ
β Insert via Supabase client
βΌ
βββββββββββββββββββββββ
β Supabase DB β
β (PostgreSQL) β
β - judgments table β
β - pgvector search β
βββββββββββββββββββββββ
Data Transformation¶
The ingestion script transforms raw HuggingFace data into our unified schema:
Schema Mapping:
| Supabase Field | Polish Source | UK Source |
|---|---|---|
case_number |
case_id |
neutral_citation |
jurisdiction |
'PL' (hardcoded) |
'UK' (hardcoded) |
court_name |
court |
'Court of Appeal (Criminal Division)' |
decision_date |
date |
judgment_date |
title |
title |
case_name |
full_text |
text |
judgment_text |
judges |
judges array |
judges array |
summary |
summary |
summary |
keywords |
keywords |
keywords |
Embedding Generation¶
For each judgment, we generate a vector embedding using OpenAI's text-embedding-3-small model (768 dimensions by default):
# Example embedding generation
embedding = openai.embeddings.create(
model="text-embedding-3-small",
input=full_text[:32000] # Truncate to ~8k tokens
)
Why Embeddings? - Enable semantic search: "Find similar cases about tax liability" - Better than keyword matching for legal concepts - ~$0.0001 per 1K tokens (100 judgments β $0.50)
π Running the Ingestion¶
Basic Usage¶
# Install dependencies
cd scripts
pip install -r requirements.txt
# Set environment variables
export SUPABASE_URL="https://your-project.supabase.co"
export SUPABASE_SERVICE_ROLE_KEY="your-service-role-key"
export OPENAI_API_KEY="sk-your-key" # Optional
# Ingest full target dataset (6K+ judgments)
python ingest_judgments.py --polish 3000 --uk 3000
Command-Line Options¶
# Ingest only Polish judgments
python ingest_judgments.py --polish 3000 --skip-uk
# Ingest only UK judgments
python ingest_judgments.py --uk 3000 --skip-polish
# Small test sample
python ingest_judgments.py --polish 10 --uk 10
# Development sample (faster)
python ingest_judgments.py --polish 100 --uk 100
# Skip embeddings to save API costs
python ingest_judgments.py --polish 3000 --uk 3000 --no-embeddings
Using .env File¶
Create a .env file in the scripts directory:
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_SERVICE_ROLE_KEY=your-service-role-key
OPENAI_API_KEY=sk-your-openai-key
Then run:
π Performance & Costs¶
Ingestion Time¶
| Judgments | Without Embeddings | With Embeddings |
|---|---|---|
| 10 cases | ~30 seconds | ~1 minute |
| 100 cases | ~3 minutes | ~8-10 minutes |
| 1,000 cases | ~30 minutes | ~80-100 minutes |
| 3,000 cases | ~1.5 hours | ~3-4 hours |
| 6,000 cases | ~3 hours | ~6-8 hours |
Bottleneck: OpenAI API calls for embedding generation (rate limits: 3000 req/min)
Storage Requirements¶
| Data Type | Size per 100 cases | Size per 1000 cases |
|---|---|---|
| Text data | ~1-2 MB | ~10-20 MB |
| Embeddings | ~600 KB | ~6 MB |
| Metadata | ~100 KB | ~1 MB |
| Total | ~2-3 MB | ~17-27 MB |
API Costs (OpenAI)¶
- Embedding model: text-embedding-3-small
- Cost: $0.00002 per 1K tokens
- Average judgment: 2K-5K tokens
- 100 judgments: ~$0.30-$0.50
- 1,000 judgments: ~$3-$5
- 3,000 judgments: ~$9-$15
- 6,000 judgments: ~$18-$30 (full target dataset)
π Data Quality Checks¶
After ingestion, verify data quality:
SQL Queries¶
-- Count judgments by jurisdiction
SELECT jurisdiction, COUNT(*)
FROM judgments
GROUP BY jurisdiction;
-- Check for missing embeddings
SELECT COUNT(*)
FROM judgments
WHERE embedding IS NULL;
-- Sample recent judgments
SELECT case_number, title, decision_date
FROM judgments
ORDER BY created_at DESC
LIMIT 10;
-- Check average text length
SELECT
jurisdiction,
AVG(LENGTH(full_text)) as avg_text_length,
MIN(LENGTH(full_text)) as min_text_length,
MAX(LENGTH(full_text)) as max_text_length
FROM judgments
GROUP BY jurisdiction;
Python Verification Script¶
from supabase import create_client
supabase = create_client(SUPABASE_URL, SUPABASE_KEY)
# Check total count
response = supabase.table('judgments').select('id', count='exact').execute()
print(f"Total judgments: {response.count}")
# Check by jurisdiction
pl_count = supabase.table('judgments')\
.select('id', count='exact')\
.eq('jurisdiction', 'PL')\
.execute()
print(f"Polish judgments: {pl_count.count}")
uk_count = supabase.table('judgments')\
.select('id', count='exact')\
.eq('jurisdiction', 'UK')\
.execute()
print(f"UK judgments: {uk_count.count}")
π οΈ Troubleshooting¶
Common Issues¶
1. HuggingFace Dataset Load Fails
Solution: - Check internet connection - Try again (HF can have temporary issues) - Usestreaming=True for large datasets
2. OpenAI Rate Limit
Solution: - Add delays between API calls - Reduce batch size - Upgrade OpenAI tier3. Supabase Insert Fails
Solution: - Check if data already exists - Use upsert instead of insert - Clear table and re-run4. Out of Memory
Solution: - Use streaming dataset loading - Process in smaller batches - Increase system RAMDebug Mode¶
Enable verbose logging:
import logging
logging.basicConfig(level=logging.DEBUG)
python ingest_judgments.py --polish 10 --uk 10
π Advanced Usage¶
Custom Data Transformations¶
Modify _transform_polish_judgment() or _transform_uk_judgment() to:
- Extract additional metadata
- Apply custom text cleaning
- Add domain-specific annotations
Incremental Updates¶
To add new judgments without re-ingesting:
# Check last ingested date
last_date = supabase.table('judgments')\
.select('decision_date')\
.order('decision_date', desc=True)\
.limit(1)\
.execute()
# Only ingest newer judgments
# (implement date filtering in transform functions)
Parallel Processing¶
For faster ingestion of large datasets:
from multiprocessing import Pool
def process_batch(cases):
# Process each batch in parallel
pass
with Pool(processes=4) as pool:
pool.map(process_batch, batches)
π― Next Steps¶
- Validate ingested data using SQL queries above
- Test search functionality in your application
- Monitor usage in Supabase dashboard
- Optimize queries based on search patterns
- Scale up ingestion for full datasets when ready
π Dataset Statistics¶
After ingestion, you can query statistics:
-- Judgments by year
SELECT
EXTRACT(YEAR FROM decision_date) as year,
jurisdiction,
COUNT(*) as count
FROM judgments
WHERE decision_date IS NOT NULL
GROUP BY year, jurisdiction
ORDER BY year DESC, jurisdiction;
-- Most common courts
SELECT
court_name,
COUNT(*) as count
FROM judgments
GROUP BY court_name
ORDER BY count DESC
LIMIT 10;
-- Average judgment length by jurisdiction
SELECT
jurisdiction,
AVG(LENGTH(full_text)) / 1000 as avg_kb
FROM judgments
GROUP BY jurisdiction;
Need help? Check the docs home or setup guide for more information.