Polish Judgment Dataset Curation¶
This script (curate_polish_dataset.py) implements the dataset curation pipeline for GitHub Issue #12: curating roughly 6,000 Polish court judgments whose topic coverage matches the UK corpus.
Prerequisites¶
Before running this script, ensure you have completed Issues #10 and #11:
- Issue #10: UK topic taxonomy must exist at data/uk_topics_taxonomy.json
- Issue #11: Cross-jurisdictional queries must exist at data/cross_jurisdictional_queries.json
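A quick way to confirm both prerequisite files are in place before a long run is a small check like the one below. This is a minimal sketch, not part of the script itself: the function name `missing_prerequisites` and the `root` parameter are hypothetical, and only the two file paths come from the list above.

```python
from pathlib import Path

# Paths taken from the prerequisites list; the labels are for readable messages only.
REQUIRED_FILES = {
    "Issue #10 (UK topic taxonomy)": Path("data/uk_topics_taxonomy.json"),
    "Issue #11 (cross-jurisdictional queries)": Path("data/cross_jurisdictional_queries.json"),
}


def missing_prerequisites(root: Path = Path(".")) -> list[str]:
    """Return one description per prerequisite file that is not found under root."""
    return [
        f"{label}: {root / rel_path}"
        for label, rel_path in REQUIRED_FILES.items()
        if not (root / rel_path).is_file()
    ]
```

Calling `missing_prerequisites()` from the repository root returns an empty list when both files exist; otherwise each entry names the issue whose output is missing.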
Setup¶
- Install the dependencies.
- Set your HuggingFace token in the environment.
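To verify that a token is actually visible to Python before starting, something like the following can help. It is a sketch under one assumption: `HF_TOKEN` is the environment variable read by the `huggingface_hub` library, but whether curate_polish_dataset.py reads that same variable is not confirmed here. The helper name `read_hf_token` is hypothetical.

```python
import os
from typing import Mapping, Optional


def read_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the HuggingFace token from the environment, or None if it is unset.

    Assumption: the token lives in HF_TOKEN, the variable huggingface_hub reads.
    """
    token = env.get("HF_TOKEN", "").strip()
    return token or None
```

A `None` result means the variable is missing or blank, which is worth catching before the script tries to download anything.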
Usage¶
Basic Usage¶
Testing¶
# Test with a smaller sample (faster, good for development)
python scripts/curate_polish_dataset.py --sample 100 --target 300
Custom Target¶
Outputs¶
The script creates two main files in the data/ directory:
1. polish_judgments_6k.parquet¶
Curated dataset with columns:
- id: Unique document identifier
- text: Full judgment text
- court: Court name
- date: Judgment date
- topic_primary: Primary topic label (e.g., "Criminal Sentencing")
- topic_secondary: Legal category (e.g., "Criminal Law")
- topic_id: Numeric topic ID matching UK taxonomy
- source_dataset: Source HuggingFace dataset
- query_used: Polish queries used for retrieval
- relevance_score: Document relevance score (0.0-1.0)
- signature: Case signature/sygnatura for deduplication
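The column list above can be written down as a record type, which makes it easy to sanity-check rows after loading the file. The sketch below is illustrative only: the Python types (for example, `date` and `query_used` as strings) are assumptions, since only the column names and the 0.0-1.0 score range are stated above, and `CuratedJudgment` and `validate_row` are hypothetical names.

```python
from dataclasses import dataclass, fields


@dataclass
class CuratedJudgment:
    # Field names mirror the documented columns; the types are assumed.
    id: str
    text: str
    court: str
    date: str
    topic_primary: str
    topic_secondary: str
    topic_id: int
    source_dataset: str
    query_used: str
    relevance_score: float
    signature: str


EXPECTED_COLUMNS = [f.name for f in fields(CuratedJudgment)]


def validate_row(row: dict) -> list[str]:
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    missing = [name for name in EXPECTED_COLUMNS if name not in row]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    score = row.get("relevance_score")
    if isinstance(score, (int, float)) and not 0.0 <= score <= 1.0:
        problems.append(f"relevance_score out of range: {score}")
    return problems
```

With pandas installed, rows could be obtained via `pd.read_parquet("data/polish_judgments_6k.parquet").to_dict("records")` and passed to `validate_row` one at a time.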
2. polish_dataset_stats.json¶
Curation statistics including:
- Topic distribution (UK vs Polish counts)
- Source dataset breakdown
- Coverage gaps and shortfalls
- Per-topic statistics
Data Sources¶
The script attempts to load from multiple HuggingFace datasets in priority order:
- Primary: JuDDGES/pl-court-raw-enriched (has factual_state and legal_state fields)
- Secondary: JuDDGES/pl-nsa-enriched (administrative courts)
- Fallback: JuDDGES/pl-court-raw, JuDDGES/pl-nsa
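One way to express this priority order in code is to try each enriched dataset first and fall back to its raw counterpart only if loading fails. The sketch below assumes that pairing (enriched court data with raw court data, enriched NSA data with raw NSA data); the script's actual fallback logic may differ. The `loader` argument and the function name `load_with_fallback` are hypothetical.

```python
from typing import Any, Callable, Iterable, Tuple

# (preferred, fallback) pairs built from the dataset IDs listed above.
# The pairing itself is an assumption.
SOURCES = [
    ("JuDDGES/pl-court-raw-enriched", "JuDDGES/pl-court-raw"),
    ("JuDDGES/pl-nsa-enriched", "JuDDGES/pl-nsa"),
]


def load_with_fallback(
    sources: Iterable[Tuple[str, str]],
    loader: Callable[[str], Any],
) -> dict[str, Any]:
    """Load the preferred dataset of each pair, or its fallback if that fails."""
    loaded: dict[str, Any] = {}
    for preferred, fallback in sources:
        for name in (preferred, fallback):
            try:
                loaded[name] = loader(name)
                break  # stop at the first dataset in the pair that loads
            except Exception as exc:
                print(f"Could not load {name}: {exc}")
    return loaded
```

With the `datasets` library, the loader could be `lambda name: load_dataset(name, split="train", streaming=True)`; the split name `"train"` is an assumption about these datasets.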
Algorithm¶
- Query-driven retrieval: Use Polish legal queries from Issue #11 to score documents for topic relevance
- Balanced sampling: Allocate documents proportionally to match UK topic distribution
- Deduplication: Remove duplicates based on case signatures
- Quality filtering: Filter out very short judgments (< 500 characters)
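The four steps above can be sketched in a few small functions. This is not the script's implementation: the term-overlap score is a simple stand-in because the actual scoring method is not described here, the largest-remainder rounding is one possible way to allocate proportionally, and the signature normalization is an assumption. All function names are hypothetical; only the 500-character threshold comes from the list above.

```python
MIN_TEXT_LENGTH = 500  # judgments shorter than this are filtered out


def relevance_score(text: str, query: str) -> float:
    """Stand-in score in [0.0, 1.0]: the fraction of query terms found in the text."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    words = set(text.lower().split())
    return len(terms & words) / len(terms)


def allocate_targets(uk_counts: dict[str, int], total: int) -> dict[str, int]:
    """Split `total` across topics in proportion to the UK counts."""
    uk_total = sum(uk_counts.values())
    if uk_total == 0:
        raise ValueError("UK topic counts must not all be zero")
    exact = {topic: total * count / uk_total for topic, count in uk_counts.items()}
    targets = {topic: int(value) for topic, value in exact.items()}
    # Hand out the documents lost to rounding down, largest remainder first.
    leftover = total - sum(targets.values())
    by_remainder = sorted(exact, key=lambda t: exact[t] - targets[t], reverse=True)
    for topic in by_remainder[:leftover]:
        targets[topic] += 1
    return targets


def filter_and_deduplicate(docs: list[dict]) -> list[dict]:
    """Drop short judgments, then keep the first document per case signature."""
    seen: set[str] = set()
    kept = []
    for doc in docs:
        if len(doc["text"]) < MIN_TEXT_LENGTH:
            continue
        # Assumed normalization: collapse whitespace and ignore case.
        signature = " ".join(doc["signature"].split()).casefold()
        if signature in seen:
            continue
        seen.add(signature)
        kept.append(doc)
    return kept
```

For example, `allocate_targets({"a": 50, "b": 30, "c": 20}, 10)` yields 5, 3, and 2 documents respectively. Exact token overlap is a weak signal for an inflected language like Polish, so a real scorer would likely need stemming, lemmatization, or embeddings.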
Performance Notes¶
- Uses streaming mode to handle large datasets without memory issues
- Processes documents in batches for efficient memory usage
- Text preprocessing includes normalization for better Polish text matching
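To illustrate the batching and normalization notes, the sketch below groups any iterable (such as a streamed dataset) into fixed-size batches and applies a basic normalization to text. The specific steps chosen here (Unicode NFC composition, case folding, whitespace collapsing) are assumptions about what "normalization" covers; the script may do more or less. Both function names are hypothetical.

```python
import re
import unicodedata
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of up to `size` items without materializing the whole input."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def normalize_text(text: str) -> str:
    """Compose Unicode characters, fold case, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)  # e.g. "a" + combining ogonek -> "ą"
    text = text.casefold()
    return re.sub(r"\s+", " ", text).strip()
```

NFC matters for Polish because the same letter can arrive either precomposed or as a base letter plus a combining mark, and the two forms do not compare equal as strings until normalized.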
Troubleshooting¶
Missing Prerequisites¶
Run the Issue #10 script first to generate the UK topic taxonomy.
Missing HF Token¶
Set your HuggingFace token in the environment.
Low Coverage for Specific Topics¶
Check polish_dataset_stats.json for gaps. Some specialized topics may have limited Polish coverage in the available datasets.
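When looking for gaps, it helps to compare the target and actual counts per topic directly. The key layout of polish_dataset_stats.json is not documented here, so the sketch below works on two plain dictionaries instead of the file itself; the function name `find_coverage_gaps` and the output keys are hypothetical.

```python
def find_coverage_gaps(
    targets: dict[str, int],
    actual: dict[str, int],
    threshold: float = 1.0,
) -> dict[str, dict]:
    """Return topics whose actual/target ratio is below `threshold`."""
    gaps = {}
    for topic, target in targets.items():
        got = actual.get(topic, 0)
        # A topic with no target cannot fall short.
        ratio = got / target if target else 1.0
        if ratio < threshold:
            gaps[topic] = {
                "target": target,
                "actual": got,
                "shortfall": target - got,
                "coverage_ratio": round(ratio, 2),
            }
    return gaps
```

Lowering `threshold` (for example to 0.9) reports only the topics that miss their target by a wider margin.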
Example Output¶
🏛️ Polish Judgment Curation
┌─ Polish Dataset Curation Summary ─┐
│ │
│ 📊 Dataset Overview │
│ • Total curated documents: 5,847│
│ • Target document count: 6,000 │
│ • Topics with coverage: 5/5 │
│ • Minimum per topic: 50 │
│ │
│ 📈 Source Distribution │
│ • pl-court-raw-enriched: 3,245 │
│ • pl-nsa-enriched: 2,602 │
│ │
│ 🎯 Coverage Quality │
│ • Avg coverage ratio: 0.97 │
│ • Topics with gaps: 1 │
└───────────────────────────────────┘
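This indicates a successful run: 5,847 of the 6,000 target documents were curated (about 97% of the target), all five topics have coverage, and one topic still shows a gap.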