Extraction Schema Reference

This directory contains the validation and specification documents for the JuDDGES extraction schema, which is applied to Polish legal documents.

Overview

The extraction schema defines the structured fields that the LLM extraction pipeline identifies and extracts from legal documents (court judgments and tax interpretations).

Schema Implementation: scripts/extraction/run_extraction_rest.py

Last Updated: 2025-10-11

Documents in This Directory

DOCUMENT_SCHEMA_MAPPING.md - Mapping between source documents and schema fields
dataset_weaviate_mapping.md - Mapping between dataset fields and Weaviate properties
extraction_schema_judgments.md - Extraction schema specification for court judgments
extraction_schema_tax_interpretations.md - Extraction schema specification for tax interpretations
gemini_extraction_schema.md - Gemini structured output schema definition
llm_field_extraction_schema.yaml - YAML field extraction schema used by the LLM pipeline

Key Findings

Schema Improvements Identified

Component	Status	Recommendation
Legal References	❌ Incomplete	Expand from 4 to 8 types
Decision Types	❌ Incomplete	Expand from 5 to 9 types
Court Types	❌ English terms	Use Polish, expand to 8 types
Party Types	❌ Too vague	Define 13 specific types
Tax Types	❌ Incomplete	Define 15 specific types
factual_state field	✅ Missing	Add new field for factual context
legal_state field	✅ Missing	Add new field for legal framework

Impact

Two new fields, factual_state and legal_state, capture factual and legal context
Broader enumeration coverage of the Polish legal system
Polish terminology that matches the source documents
Expected coverage improvement: from 60-80% to 85-95%

How to Use These Documents

For Developers

When implementing or updating the extraction schema:

Reference the per-document schema specifications for detailed field definitions
Use the recommended enumerations in Polish terminology

For Researchers

When analyzing extraction results:

Use these documents to understand schema design decisions
Reference the validation methodology (web search on the Polish legal system)
Compare extraction coverage against expected benchmarks
Identify gaps or areas for further improvement

For Data Scientists

When working with extracted data:

Understand the complete field structure and types
Use enumeration lists for data validation
Reference expected coverage metrics
Plan data cleaning strategies for missing fields
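
The checks listed above could be sketched roughly as follows. This is a minimal illustration, not part of the pipeline: the helper names, the sample record, and the allowed document_type values are all hypothetical, and the real enumeration lists should be taken from the per-document schema specifications.

```python
from typing import Any, Iterable


def is_empty(value: Any) -> bool:
    """Treat None, empty strings, and empty containers as missing."""
    return value is None or value == "" or value == [] or value == {}


def missing_fields(record: dict[str, Any], expected: Iterable[str]) -> list[str]:
    """Return the expected fields that are absent or empty in one record."""
    return [name for name in expected if is_empty(record.get(name))]


def has_valid_enum(value: Any, allowed: set[str]) -> bool:
    """Check a single value against an enumeration list."""
    return isinstance(value, str) and value in allowed


# Hypothetical record and enumeration values, for illustration only.
record = {"document_number": "X 1/23", "document_type": "judgment", "summary": ""}
allowed_document_types = {"judgment", "tax_interpretation"}  # assumed, not the real list

print(missing_fields(record, ["document_number", "summary", "thesis"]))
# prints ['summary', 'thesis']
print(has_valid_enum(record["document_type"], allowed_document_types))
# prints True
```

Records that fail either check can then be routed to whatever cleaning strategy suits the analysis, for example dropping them or imputing a default.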

Schema Structure

Current Schema (16 fields)

Core Identification (4 fields)
├── document_number
├── document_type
├── title
└── date_issued

High-Priority Augmentation (3 fields)
├── summary
├── thesis
└── keywords

Factual & Legal Context (2 NEW fields)
├── factual_state ← NEW
└── legal_state ← NEW

Outcome (1 field)
└── outcome (with decision_type enumeration)

Legal Content (3 fields)
├── legal_references (with type enumeration)
├── legal_concepts
└── parties (with party_type enumeration)

Structured Content (1 field)
└── legal_analysis

Document-Specific (2 fields)
├── judgment_specific (with court_type enumeration)
└── tax_interpretation_specific (with tax_type enumeration)
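
For programmatic use, the tree above can be transcribed into a simple mapping. The sketch below is only one possible representation: the names SCHEMA_GROUPS and ALL_FIELDS are assumptions made for illustration, and the actual schema definition lives in scripts/extraction/run_extraction_rest.py.

```python
# Field groups transcribed from the schema tree above. Representing them as a
# dict is an illustrative assumption, not the pipeline's actual data structure.
SCHEMA_GROUPS: dict[str, list[str]] = {
    "Core Identification": ["document_number", "document_type", "title", "date_issued"],
    "High-Priority Augmentation": ["summary", "thesis", "keywords"],
    "Factual & Legal Context": ["factual_state", "legal_state"],
    "Outcome": ["outcome"],
    "Legal Content": ["legal_references", "legal_concepts", "parties"],
    "Structured Content": ["legal_analysis"],
    "Document-Specific": ["judgment_specific", "tax_interpretation_specific"],
}

# Flatten the groups into one ordered list of top-level field names.
ALL_FIELDS: list[str] = [name for names in SCHEMA_GROUPS.values() for name in names]

# Sanity checks: the tree lists 16 distinct top-level fields.
assert len(ALL_FIELDS) == 16
assert len(set(ALL_FIELDS)) == 16
```

Keeping the group names alongside the fields makes it easy to report results per group, for instance coverage of the two new context fields separately from the core identification fields.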

Implementation

Extraction How-To Guides - Practical guides for running extraction
Structured Output Implementation - Technical details of Gemini structured output

Data Management

Weaviate Schema Management - How to update Weaviate properties

Validation Methodology

All schema improvements were validated through:

Web Search: Research on the Polish legal system, court hierarchy, tax types
Document Analysis: Review of actual Polish legal documents
Expert Consultation: Legal terminology verification
Coverage Analysis: Field population rates in test extractions
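
As a rough illustration of the coverage analysis step, field population rates can be computed as the fraction of records in which a field is present and non-empty. The function names below are hypothetical, and the 0.85 default target is an assumption taken from the lower bound of the expected 85-95% range, not a threshold defined by the project.

```python
from typing import Any


def population_rates(records: list[dict[str, Any]], fields: list[str]) -> dict[str, float]:
    """Fraction of records in which each field is present and non-empty."""
    total = len(records)
    rates: dict[str, float] = {}
    for name in fields:
        filled = sum(1 for r in records if r.get(name) not in (None, "", [], {}))
        rates[name] = filled / total if total else 0.0
    return rates


def below_target(rates: dict[str, float], target: float = 0.85) -> list[str]:
    """Names of fields whose population rate falls below the target, sorted."""
    return sorted(name for name, rate in rates.items() if rate < target)


# Tiny made-up sample; real input would be the test extraction results.
sample = [
    {"summary": "x", "thesis": ""},
    {"summary": "y", "thesis": "z"},
]
rates = population_rates(sample, ["summary", "thesis", "keywords"])
print(rates)                # prints {'summary': 1.0, 'thesis': 0.5, 'keywords': 0.0}
print(below_target(rates))  # prints ['keywords', 'thesis']
```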

Quality Assurance: Each enumeration is backed by authoritative sources from Polish legal system documentation.

Next Steps

✅ Schema validation complete
⏳ Implement schema improvements
⏳ Test on sample documents (5-10)
⏳ Validate coverage improvements
⏳ Roll out to production pipeline

This is reference documentation for the JuDDGES extraction schema, validated 2025-10-11.