Gemini LLM Extraction Schema for Legal Documents

Overview

This document defines the extraction schema used to augment fields of the Weaviate LegalDocuments collection with Gemini. The schema maps each Weaviate property name to its field specification and extraction instructions.

Schema Design Principles

Field names match Weaviate schema - Use exact property names from LegalDocuments collection
JSON storage for complex objects - Complex fields (parties, legal_references, etc.) stored as JSON strings
ISO 8601 dates - All date fields use ISO format (YYYY-MM-DD)
Language-aware extraction - Maintain original document language for extracted content
Factual extraction only - Extract explicitly stated information without interpretation

Core Document Fields

Document Identification & Metadata

{
    "document_id": "string, unique identifier extracted from document header or metadata",
    "document_number": "string, official case/document reference number (e.g., 'I ACa 123/24', 'DIS.0101.2.2024')",
    "document_type": "string, type of legal document - must be one of: 'judgment', 'tax_interpretation', 'legal_act'",
    "title": "string, official document title or generated descriptive title",
    "date_issued": "date ISO 8601 (YYYY-MM-DD), date when document was officially issued/published",
    "publication_date": "date ISO 8601 (YYYY-MM-DD), date when document was made public",
    "language": "string, ISO 639-1 language code (e.g., 'pl', 'en')",
    "country": "string, ISO 3166-1 alpha-2 country code (e.g., 'PL', 'CH', 'US')"
}

Extraction Instructions:

document_number: Look for official case numbers in header (format varies by court/jurisdiction)
document_type: Infer from document structure and terminology (wyrok=judgment, interpretacja podatkowa=tax interpretation)
date_issued: Primary date of document issuance, prefer over hearing/decision dates
language: Extract from document metadata or infer from content
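
The core-field rules above can be checked mechanically before a record is stored. The following is a minimal sketch, not part of the juddges package: `validate_core_fields` and `is_iso_date` are hypothetical helper names, and only the field names and allowed values come from the schema. The language and country checks test the shape of the code only, not membership in the ISO lists.

```python
import re
from datetime import date

ALLOWED_DOCUMENT_TYPES = {"judgment", "tax_interpretation", "legal_act"}
ISO_DATE_RE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_iso_date(value) -> bool:
    """True only for strings shaped YYYY-MM-DD that name a real calendar day."""
    if not isinstance(value, str) or not ISO_DATE_RE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:  # e.g. 2024-02-30
        return False
    return True


def validate_core_fields(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []

    if record.get("document_type") not in ALLOWED_DOCUMENT_TYPES:
        problems.append(f"document_type: unexpected value {record.get('document_type')!r}")

    # Missing dates are tolerated; only malformed ones are reported.
    for name in ("date_issued", "publication_date"):
        value = record.get(name)
        if value and not is_iso_date(value):
            problems.append(f"{name}: not a valid YYYY-MM-DD date: {value!r}")

    language = record.get("language")
    if language and not (isinstance(language, str) and re.fullmatch(r"[a-z]{2}", language)):
        problems.append(f"language: expected two lowercase letters, got {language!r}")

    country = record.get("country")
    if country and not (isinstance(country, str) and re.fullmatch(r"[A-Z]{2}", country)):
        problems.append(f"country: expected two uppercase letters, got {country!r}")

    return problems
```

Returning a list of problems rather than raising lets a caller record them (for example as quality flags) and still keep the fields that did pass.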

High-Priority LLM Augmentation Fields

1. Summary (Priority: HIGH)

{
    "summary": "string, concise 3-5 sentence summary of the legal document covering: (1) document type and issuing body, (2) main legal issue or question, (3) key facts, (4) decision/outcome, (5) primary legal basis"
}

Extraction Instructions:

Generate a balanced summary covering all key aspects
For judgments: Include parties, legal issue, court decision, and reasoning basis
For tax interpretations: Include applicant, tax question, authority's position, legal basis
Maintain factual tone without legal conclusions
Length: no more than 300 words (typically 150-300)
Use document's original language

Example (Judgment):

"Wyrok Sądu Okręgowego w Warszawie z dnia 15.01.2024 r., sygn. V C 1234/23.
Sprawa dotyczyła roszczenia konsumenta o zwrot nienależnie pobranych opłat
bankowych za prowadzenie rachunku. Sąd ustalił, że bank pobierał opłaty
niezgodnie z postanowieniami umowy, co stanowiło bezpodstawne wzbogacenie.
Powództwo uwzględniono w całości na podstawie art. 410 k.c. w zw. z art. 405 k.c."

2. Thesis (Priority: HIGH)

{
    "thesis": "string, main legal principle, rule, or holding established by the document; the key takeaway or precedential value"
}

Extraction Instructions:

Extract the core legal principle or rule
For judgments: The precedential legal conclusion or interpretation
For tax interpretations: The authority's legal position on the tax matter
Should be 1-3 sentences expressing the legal rule clearly
Focus on the "what does this document establish" question

Example:

"Pobieranie przez bank opłat za czynności bankowe niezgodnych z postanowieniami
umowy rachunku bankowego stanowi bezpodstawne wzbogacenie w rozumieniu art. 405
k.c. i obliguje do zwrotu nienależnie pobranych kwot."

3. Keywords (Priority: HIGH)

{
    "keywords": [
        "string, relevant legal keyword or concept",
        "string, area of law",
        "string, specific legal institution or mechanism"
    ]
}

Extraction Instructions:

Generate 5-15 relevant keywords covering:
    Legal domains (prawo cywilne, prawo podatkowe)
    Legal institutions (bezpodstawne wzbogacenie, interpretacja indywidualna)
    Specific concepts (umowa kredytu, klauzule abuzywne)
    Document-specific terms
Use normalized forms (singular, canonical)
Prioritize legal terminology over general terms
Include both broad and specific terms
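
Part of the keyword guidance can be enforced after extraction. The sketch below assumes a hypothetical post-processing helper, `normalize_keywords`, which trims whitespace, drops case-insensitive duplicates, and caps the list at 15 entries. It does not attempt linguistic normalization (singular or canonical forms), which would still have to come from the model or a Polish lemmatizer.

```python
def normalize_keywords(raw: list[str], max_count: int = 15) -> list[str]:
    """Trim, de-duplicate (case-insensitively), and cap a keyword list."""
    seen = set()
    result = []
    for keyword in raw:
        cleaned = " ".join(keyword.split())  # collapse inner and outer whitespace
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            result.append(cleaned)  # keep the first spelling encountered
    return result[:max_count]


keywords = normalize_keywords(
    ["prawo cywilne", "  Prawo  cywilne ", "art. 405 k.c.", "", "ochrona konsumenta"]
)
# A caller can then check the lower bound from the instructions:
has_enough = len(keywords) >= 5
```

Order is preserved, so if the model lists the most relevant keywords first, truncation drops the least relevant ones.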

Example:

["prawo cywilne", "bezpodstawne wzbogacenie", "umowa bankowa", "opłaty bankowe",
"roszczenia konsumenckie", "art. 405 k.c.", "ochrona konsumenta"]

4. Outcome (Priority: HIGH)

{
    "outcome": {
        "decision_type": "string, type of decision (e.g., 'uwzględniono', 'oddalono', 'uchylono', 'stanowisko pozytywne', 'stanowisko negatywne')",
        "decision_summary": "string, brief summary of what was decided",
        "awarded_amounts": [
            {
                "type": "string, type of award (e.g., 'kwota główna', 'odsetki', 'koszty')",
                "amount": "number, monetary amount if applicable",
                "currency": "string, currency code (e.g., 'PLN', 'EUR')",
                "recipient": "string, who receives the amount"
            }
        ],
        "legal_effect": "string, legal effect or consequence of the decision"
    }
}

Extraction Instructions:

decision_type: Classify the outcome (for judgments: granted/dismissed/modified; for interpretations: favorable/unfavorable)
decision_summary: 2-3 sentences explaining what the authority decided
awarded_amounts: Extract all monetary awards, including main amounts, interest, and costs
legal_effect: Describe the practical legal consequence (e.g., "Umowa nieważna", "Zastosowanie stawki 23% VAT")
Store as JSON string in Weaviate
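
As an illustration of "store as JSON string", the sketch below models the outcome object with dataclasses and serializes it. The class names and `outcome_to_json` are hypothetical, and the sample values in the usage lines are invented for illustration; only the field names come from the schema above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AwardedAmount:
    type: str
    amount: float
    currency: str
    recipient: str


@dataclass
class Outcome:
    decision_type: str
    decision_summary: str
    legal_effect: str
    awarded_amounts: list[AwardedAmount] = field(default_factory=list)


def outcome_to_json(outcome: Outcome) -> str:
    # ensure_ascii=False keeps Polish diacritics readable in the stored string
    return json.dumps(asdict(outcome), ensure_ascii=False)


# Illustrative values only
outcome = Outcome(
    decision_type="uwzględniono",
    decision_summary="Sąd uwzględnił powództwo w całości.",
    legal_effect="Obowiązek zwrotu pobranych opłat",
    awarded_amounts=[AwardedAmount("kwota główna", 1250.0, "PLN", "powód")],
)
outcome_json = outcome_to_json(outcome)
```

Reading the value back is a plain `json.loads(outcome_json)`, which returns nested dictionaries and lists rather than dataclass instances.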

5. Legal Concepts (Priority: MEDIUM)

{
    "legal_concepts": [
        {
            "concept_name": "string, name of legal concept (e.g., 'bezpodstawne wzbogacenie', 'force majeure')",
            "legal_area": "string, area of law (e.g., 'prawo cywilne', 'tax law')",
            "definition_context": "string, brief context of how concept is used in this document",
            "relevance": "string, how central this concept is ('primary', 'secondary', 'mentioned')"
        }
    ]
}

Extraction Instructions:

Identify all significant legal concepts discussed
Focus on concepts that are:
    Central to the legal reasoning
    Interpreted or defined in the document
    Part of the legal basis for the decision
Include both domestic and international legal concepts
Classify by relevance to the document's core issue
Store as JSON string in Weaviate

6. Legal References (Priority: HIGH)

{
    "legal_references": [
        {
            "type": "string, type of reference ('statute', 'article', 'case_law', 'regulation', 'directive', 'treaty')",
            "title": "string, title or name of the legal source",
            "article": "string, specific article/section/paragraph referenced",
            "jurisdiction": "string, jurisdiction (e.g., 'Poland', 'EU', 'Switzerland')",
            "citation": "string, full citation as it appears in document",
            "context": "string, how this reference is used in the document"
        }
    ]
}

Extraction Instructions:

Extract ALL legal citations including:
    Domestic statutes (e.g., "art. 405 k.c.", "ustawa o VAT")
    EU directives and regulations
    Court decisions (own or referenced)
    International treaties
Parse structured citations to extract components
Include context: "invoked", "distinguished", "followed", "criticized"
Normalize common abbreviations (k.c., k.p.c., k.k.)
Store as JSON string in Weaviate

Example:

[
    {
        "type": "statute",
        "title": "Kodeks cywilny",
        "article": "art. 405",
        "jurisdiction": "Poland",
        "citation": "art. 405 k.c.",
        "context": "invoked as legal basis for claim"
    }
]
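
Simple code citations can also be parsed deterministically, for example to cross-check what the model returns. The sketch below is an assumption-laden illustration: `parse_citations` is a hypothetical helper, and the pattern only covers the form "art. N" followed by one of the three abbreviations named above (k.c., k.p.c., k.k.). Paragraph signs, ranges, and other statutes are deliberately out of scope.

```python
import re

# Full titles of the three Polish codes abbreviated in the instructions
CODE_TITLES = {
    "k.c.": "Kodeks cywilny",
    "k.p.c.": "Kodeks postępowania cywilnego",
    "k.k.": "Kodeks karny",
}

CITATION_RE = re.compile(
    r"art\.\s*(?P<article>[0-9]+[a-z]?)\s+(?P<code>k\.p\.c\.|k\.c\.|k\.k\.)"
)


def parse_citations(text: str, context: str = "") -> list[dict]:
    """Extract 'art. N <code>' citations into the legal_references structure."""
    references = []
    for match in CITATION_RE.finditer(text):
        references.append(
            {
                "type": "statute",
                "title": CODE_TITLES[match.group("code")],
                "article": f"art. {match.group('article')}",
                "jurisdiction": "Poland",
                "citation": match.group(0),  # text exactly as it appears
                "context": context,
            }
        )
    return references


refs = parse_citations(
    "Powództwo uwzględniono na podstawie art. 410 k.c. w zw. z art. 405 k.c."
)
```

A regex pass like this cannot judge how a provision is used, so the `context` value still has to come from the LLM or from the caller.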

7. Parties (Priority: MEDIUM)

{
    "parties": [
        {
            "party_type": "string, role in proceeding ('plaintiff', 'defendant', 'appellant', 'appellee', 'applicant', 'third_party')",
            "party_name": "string, name of party (anonymized if necessary)",
            "party_category": "string, category ('natural_person', 'company', 'public_entity', 'ngo')",
            "representation": "string, legal representative info if available",
            "identification": "string, any relevant identifiers (company registration, tax ID)"
        }
    ]
}

Extraction Instructions:

For judgments: Extract all parties from case header
Preserve party anonymization (use initials if document uses them)
Identify party roles correctly (powód=plaintiff, pozwany=defendant)
Include legal representatives (attorneys, counsel)
For tax interpretations: Extract applicant information
Store as JSON string in Weaviate
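
The role mapping mentioned above (powód=plaintiff, pozwany=defendant) can be kept in a small lookup table. In this sketch `ROLE_MAP` and `build_party` are hypothetical names. The feminine forms and "wnioskodawca" (applicant) are additions beyond the two roles listed in the instructions, and unknown attributes are left as empty strings.

```python
# Polish role labels mapped to the party_type values of the schema.
ROLE_MAP = {
    "powód": "plaintiff",
    "powódka": "plaintiff",      # feminine form, added for illustration
    "pozwany": "defendant",
    "pozwana": "defendant",      # feminine form, added for illustration
    "wnioskodawca": "applicant",  # added for illustration
}


def build_party(role_label: str, name: str) -> dict:
    """Build one entry of the parties array from a header label and a name."""
    party_type = ROLE_MAP.get(role_label.strip().lower())
    if party_type is None:
        raise ValueError(f"Unknown party role: {role_label!r}")
    return {
        "party_type": party_type,
        "party_name": name,  # kept verbatim, so anonymized initials are preserved
        "party_category": "",
        "representation": "",
        "identification": "",
    }


party = build_party("Powód", "J. K.")
```

Raising on an unknown label makes gaps in the table visible early, instead of silently storing a wrong role.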

8. Legal Analysis (Priority: MEDIUM)

{
    "legal_analysis": {
        "facts_summary": "string, summary of key factual findings",
        "legal_issues": ["string, main legal question or issue"],
        "reasoning": "string, court's/authority's reasoning process",
        "precedents_applied": ["string, previous cases or interpretations applied"],
        "statutory_interpretation": "string, how statutes were interpreted",
        "conclusion": "string, final legal conclusion"
    }
}

Extraction Instructions:

Extract structured analysis from judgment reasoning or interpretation body
facts_summary: Key facts established by court (Stan faktyczny)
legal_issues: Main legal questions to be resolved
reasoning: Logical flow of legal reasoning (Sąd uznał, że...)
precedents_applied: References to previous decisions influencing this one
statutory_interpretation: How laws were interpreted/applied
Store as JSON string in Weaviate

9. Structured Content (Priority: MEDIUM)

{
    "structured_content": {
        "sections": [
            {
                "section_type": "string, type ('header', 'facts', 'reasoning', 'decision', 'dissent')",
                "section_title": "string, section heading",
                "content_summary": "string, brief summary of section content",
                "key_points": ["string, key points from this section"]
            }
        ],
        "document_structure": {
            "has_dissent": "boolean, whether document contains dissenting opinion",
            "has_appendices": "boolean, whether document has appendices",
            "page_count": "integer, number of pages"
        }
    }
}

Extraction Instructions:

Parse document structure into logical sections
Identify standard sections (sentencja, uzasadnienie, etc.)
Extract key points from each major section
Note presence of special elements (dissenting opinions, appendices)
Store as JSON string in Weaviate

Document Type-Specific Fields

Judgment-Specific Fields

{
    "judgment_specific": {
        "court_name": "string, full name of the court",
        "court_type": "string, type of court ('district', 'regional', 'appeal', 'supreme', 'administrative')",
        "department_name": "string, court department or division",
        "case_number": "string, full case signature",
        "hearing_date": "date ISO 8601, date of hearing if different from judgment date",
        "judgment_type": "string, type of judgment ('wyrok', 'postanowienie', 'zarządzenie')",
        "proceeding_type": "string, type of proceeding ('civil', 'criminal', 'administrative')",
        "judges": [
            {
                "name": "string, judge name",
                "role": "string, role ('presiding', 'panel_member', 'rapporteur')"
            }
        ],
        "legal_bases": ["string, legal basis for jurisdiction/decision"],
        "appeal_basis": "string, grounds for appeal if applicable",
        "is_final": "boolean, whether judgment is final/binding"
    }
}

Extraction Instructions:

Extract from judgment header (Sąd ... w składzie:)
court_name: Full official court name
judges: Parse judge list with roles (Przewodniczący, Sędziowie)
judgment_type: Distinguish wyrok (judgment) from postanowienie (order)
legal_bases: Articles establishing court jurisdiction
Store as JSON string in Weaviate

Tax Interpretation-Specific Fields

{
    "tax_interpretation_specific": {
        "interpretation_type": "string, type ('individual', 'general')",
        "tax_authority": "string, issuing tax authority name",
        "authority_level": "string, level ('local', 'regional', 'national')",
        "applicant_type": "string, type of applicant ('natural_person', 'company', 'ngo')",
        "tax_matter": "string, specific tax matter/question",
        "tax_type": "string, type of tax (VAT, CIT, PIT, etc.)",
        "fiscal_year": "string, relevant fiscal year if applicable",
        "interpretation_status": "string, status ('binding', 'non-binding', 'revoked')",
        "validity_period": {
            "start_date": "date ISO 8601",
            "end_date": "date ISO 8601"
        }
    }
}

Extraction Instructions:

Extract from interpretation metadata
interpretation_type: Individual (indywidualna) vs general (ogólna)
tax_authority: Extract from document header
tax_matter: Brief description of the tax question addressed
interpretation_status: Whether interpretation is binding on taxpayer
Store as JSON string in Weaviate

Issuing Body Information

{
    "issuing_body": {
        "name": "string, name of issuing institution (court, tax authority, etc.)",
        "type": "string, type of body ('court', 'tax_authority', 'regulatory_agency')",
        "jurisdiction": "string, territorial jurisdiction",
        "authority_level": "string, hierarchical level ('first_instance', 'appellate', 'supreme')",
        "contact_info": {
            "address": "string, institutional address",
            "website": "string, official website URL"
        }
    }
}

Document Relationships

{
    "relationships": [
        {
            "relationship_type": "string, type of relationship ('appeals', 'overturns', 'follows', 'distinguishes', 'cites')",
            "related_document_id": "string, ID of related document",
            "related_document_number": "string, official number of related document",
            "description": "string, description of relationship"
        }
    ]
}

Extraction Instructions:

Identify relationships to other documents
For appeals: Link to lower instance decisions
For references: Link to cited precedents
Store as JSON string in Weaviate

Metadata Fields (Auto-Generated)

These fields are typically auto-generated by the system but can be augmented:

{
    "source_url": "string, URL where document was sourced",
    "ingestion_date": "datetime ISO 8601, when document was ingested",
    "last_updated": "datetime ISO 8601, last modification timestamp",
    "processing_status": "string, status ('raw', 'processed', 'enriched', 'verified')",
    "confidence_score": "number (0.0-1.0), confidence in extracted information",
    "metadata": {
        "extractor_model": "string, model used for extraction (e.g., 'gemini-2.5-pro')",
        "extraction_date": "datetime ISO 8601",
        "extraction_version": "string, version of extraction schema",
        "quality_flags": ["string, any quality issues or warnings"]
    }
}
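
The nested `metadata` object could be assembled at extraction time along these lines. This is a sketch under stated assumptions: `build_extraction_metadata` is a hypothetical helper and `SCHEMA_VERSION` is a placeholder value, since this document does not define a version string.

```python
from datetime import datetime, timezone

SCHEMA_VERSION = "1.0.0"  # placeholder; the real version scheme is not specified here


def build_extraction_metadata(model_name, quality_flags=None, now=None):
    """Assemble the nested metadata object with a timezone-aware ISO 8601 timestamp."""
    now = now or datetime.now(timezone.utc)
    return {
        "extractor_model": model_name,
        "extraction_date": now.isoformat(),
        "extraction_version": SCHEMA_VERSION,
        "quality_flags": list(quality_flags or []),
    }


metadata = build_extraction_metadata("gemini-2.5-flash", ["missing_publication_date"])
```

The optional `now` argument exists only to make the timestamp reproducible in tests; in normal use it is omitted and the current UTC time is taken.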

Usage Examples

Example 1: Judgment Extraction Schema

from juddges.extraction import GeminiExtractionChain, ExtractionSchema, DocumentType

schema = ExtractionSchema(
    fields={
        # Core fields
        "document_number": "string, official case number",
        "date_issued": "date ISO 8601, judgment date",
        "court_name": "string, full court name",

        # High-priority augmentation fields
        "summary": "string, 3-5 sentence summary covering parties, issue, decision, and basis",
        "thesis": "string, main legal principle established",
        "keywords": "List[string], 5-15 relevant legal keywords",

        # Outcome
        "outcome": "JSON object with decision_type, decision_summary, awarded_amounts, legal_effect",

        # Legal content
        "legal_references": "JSON array of legal citations with type, title, article, context",
        "legal_concepts": "JSON array of legal concepts with name, area, definition_context",
        "parties": "JSON array of parties with type, name, category, representation",

        # Judgment-specific
        "judgment_specific": "JSON object with court_name, judges, case_number, legal_bases",
    },
    instructions="""
    Focus on extracting factual information from the judgment.
    For dates, use ISO 8601 format (YYYY-MM-DD).
    For complex objects (outcome, parties, legal_references), return valid JSON.
    Maintain original language (Polish) for all extracted content.
    If information is not found, use empty string or empty array.
    For legal references, extract ALL citations including article numbers.
    """,
    language="polish"
)

chain = GeminiExtractionChain(model_name="gemini-2.5-flash")
result = chain.extract(
    document_type=DocumentType.JUDGMENT,
    text=judgment_text,
    schema=schema
)

Example 2: Tax Interpretation Extraction Schema

schema = ExtractionSchema(
    fields={
        "document_number": "string, interpretation reference number",
        "date_issued": "date ISO 8601, interpretation date",
        "tax_authority": "string, issuing authority name",

        "summary": "string, summary of tax question and authority position",
        "thesis": "string, main tax law principle",
        "keywords": "List[string], tax-related keywords",

        "outcome": "JSON object with interpretation decision and legal effect",
        "legal_references": "JSON array of tax law citations",

        "tax_interpretation_specific": "JSON object with interpretation_type, applicant_type, tax_matter, tax_type",
    },
    instructions="""
    Extract tax interpretation content focusing on the legal question and authority's position.
    Identify the specific tax type and relevant tax law provisions.
    """,
    language="polish"
)

Field Coverage Analysis

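Based on an analysis of the Weaviate schema, the fields below are listed in priority order for LLM augmentation. Coverage is the approximate share of documents in which the field is already populated, so lower coverage means more documents are missing the field: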

Priority	Field	Coverage	Description
1	`summary`	0-10%	Concise document summary
2	`thesis`	0-10%	Main legal principle
3	`keywords`	0-15%	Relevant keywords/tags
4	`outcome`	0-20%	Decision outcome and effects
5	`legal_concepts`	0-20%	Legal concepts discussed
6	`legal_references`	20-40%	Legal citations and references
7	`parties`	30-50%	Party information
8	`legal_analysis`	0-10%	Structured legal analysis
9	`structured_content`	0-10%	Document structure
10	`judgment_specific`	40-60%	Judgment metadata

Integration with Weaviate

After extraction, results should be stored in Weaviate using the appropriate data types:

from juddges.data.documents_weaviate_db import WeaviateLegalDocumentsDatabase
import json

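# Extract information (chain and schema as built in the usage examples above;
# document_type and text describe the document being processed)
extracted_data = chain.extract(document_type=document_type, text=text, schema=schema)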

# Prepare for Weaviate (convert complex objects to JSON strings)
weaviate_properties = {
    "document_id": extracted_data.get("document_id"),
    "summary": extracted_data.get("summary"),
    "thesis": extracted_data.get("thesis"),
    "keywords": extracted_data.get("keywords"),  # Already a list

    # Convert complex objects to JSON strings
    "outcome": json.dumps(extracted_data.get("outcome")) if extracted_data.get("outcome") else None,
    "legal_references": json.dumps(extracted_data.get("legal_references")) if extracted_data.get("legal_references") else None,
    "parties": json.dumps(extracted_data.get("parties")) if extracted_data.get("parties") else None,
    "judgment_specific": json.dumps(extracted_data.get("judgment_specific")) if extracted_data.get("judgment_specific") else None,
}

# Update the existing Weaviate object; document_uuid is assumed to hold its UUID
with WeaviateLegalDocumentsDatabase() as db:
    collection = db.legal_documents_collection
    collection.data.update(
        uuid=document_uuid,
        properties=weaviate_properties
    )
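
The repeated `json.dumps(...) if ... else None` pattern above can be factored into one helper. The sketch below is an illustration rather than part of the juddges package: `to_weaviate_properties` is a hypothetical name, and `JSON_STRING_FIELDS` lists the fields this document marks as "Store as JSON string in Weaviate". Unlike the snippet above, it passes `ensure_ascii=False` so Polish characters are stored unescaped; that is a design choice, not a requirement stated here.

```python
import json

# Fields that this schema stores as JSON strings
JSON_STRING_FIELDS = (
    "outcome",
    "legal_concepts",
    "legal_references",
    "parties",
    "legal_analysis",
    "structured_content",
    "judgment_specific",
    "tax_interpretation_specific",
    "relationships",
)


def to_weaviate_properties(extracted: dict) -> dict:
    """Serialize complex fields to JSON strings; pass scalar and list fields through."""
    properties = {}
    for name, value in extracted.items():
        if name in JSON_STRING_FIELDS:
            # Empty objects and arrays become None, matching the snippet above
            properties[name] = json.dumps(value, ensure_ascii=False) if value else None
        else:
            properties[name] = value
    return properties


props = to_weaviate_properties(
    {
        "summary": "Wyrok...",
        "keywords": ["prawo cywilne"],
        "outcome": {"decision_type": "oddalono"},
        "parties": [],
    }
)
```

Keeping the field list in one tuple means a newly added complex field only needs to be registered once.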

Best Practices

Incremental Extraction: Start with high-priority fields (summary, thesis, keywords) before extracting complex structures
Validation: Verify that extracted dates are valid ISO 8601 and that JSON objects are well-formed
Caching: Use the LangChain SQLite cache to avoid re-extracting the same documents
Batch Processing: Use batch_extract() for processing multiple documents efficiently
Quality Control: Review confidence scores and implement human-in-the-loop for low-confidence extractions
Version Tracking: Store extraction schema version in metadata for reproducibility
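
Two of these practices, validation and quality control, lend themselves to small checks. The sketch below is illustrative only: `is_well_formed_json`, `needs_human_review`, and the 0.7 threshold are assumptions, not values defined by this schema.

```python
import json

REVIEW_THRESHOLD = 0.7  # hypothetical cut-off; tune against reviewed samples


def is_well_formed_json(value) -> bool:
    """True if value is a string (or bytes) that parses as JSON."""
    try:
        json.loads(value)
    except (TypeError, ValueError):  # JSONDecodeError is a ValueError subclass
        return False
    return True


def needs_human_review(record: dict, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Route records with a missing or low confidence_score to manual review."""
    score = record.get("confidence_score")
    return score is None or score < threshold


review_queue = [
    r["document_id"]
    for r in (
        {"document_id": "a", "confidence_score": 0.92},
        {"document_id": "b", "confidence_score": 0.41},
        {"document_id": "c"},
    )
    if needs_human_review(r)
]
```

Treating a missing score as "needs review" is the conservative choice: a record is only trusted automatically when the extractor reported a score and it clears the threshold.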
