Tutorial: Your First Legal Document Analysis with JuDDGES

Complete this hands-on tutorial to learn the fundamentals of legal document analysis using JuDDGES. You'll go from zero to analyzing real court decisions in 30-60 minutes.


Learning Objectives

By the end of this tutorial, you will be able to:

  • ✅ Set up the JuDDGES environment and dependencies
  • ✅ Load and explore Polish legal document datasets
  • ✅ Extract structured information from court judgments
  • ✅ Perform semantic search on legal documents
  • ✅ Understand the basic JuDDGES workflow
  • ✅ Run interactive legal document analysis

Estimated Time: 30-60 minutes


Prerequisites

Required Knowledge

  • Basic Python programming (variables, functions, loops)
  • Command line familiarity (running commands in terminal)
  • Basic understanding of JSON and dictionaries

Required Software

  • Python 3.10+ installed
  • Docker and Docker Compose installed
  • Git for cloning the repository
  • 16GB+ RAM recommended (8GB minimum)
  • 10GB+ free disk space

Optional

  • Google API key for Gemini extraction (used in Step 3 and Step 5)
  • GPU with CUDA support (for advanced features)

Note: Don't worry if you don't have everything yet! We'll guide you through the setup.
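
A quick way to see which of the requirements above are already in place is a small check script. This is a minimal sketch, not part of JuDDGES: check_prerequisites is a hypothetical helper name, and it only tests the Python version and whether the git and docker executables are on your PATH (it does not check RAM, disk space, or Docker Compose).

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10), tools=("git", "docker")):
    """Return a dict mapping each requirement name to True or False."""
    report = {"python": sys.version_info >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

Anything reported as MISSING can be installed before you continue with the setup steps.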


What You'll Build

In this tutorial, you'll create a complete legal document analysis pipeline:

graph LR
    A[📄 Load Dataset] --> B[🔍 Explore Documents]
    B --> C[📊 Extract Information]
    C --> D[🔎 Semantic Search]
    D --> E[📈 Visualize Results]

    style A fill:#e1f5ff
    style C fill:#fff3e0
    style D fill:#f3e5f5
    style E fill:#e8f5e9

Real-world application: This is the foundation for building legal research tools, compliance systems, and case analytics platforms.


Setup Your Environment

Step 1: Clone the Repository

Open your terminal and run:

# Clone the JuDDGES repository
git clone https://github.com/pwr-ai/JuDDGES.git
cd JuDDGES

Expected output:

Cloning into 'JuDDGES'...
remote: Enumerating objects: 1234, done.
remote: Counting objects: 100% (1234/1234), done.

Step 2: Run Automated Setup

# Linux/macOS
./setup.sh

# Windows
setup.bat

This script will:

  • Create a virtual environment in .venv/
  • Install all Python dependencies
  • Set up pre-commit hooks
  • Configure Git LFS for large files

Expected output:

✓ Creating virtual environment
✓ Installing dependencies
✓ Setting up pre-commit hooks
✓ JuDDGES setup complete!

Step 3: Activate Virtual Environment

# Linux/macOS
source .venv/bin/activate

# Windows
.venv\Scripts\activate

Your prompt should now show (.venv) indicating the environment is active.

Step 4: Set Environment Variables

Create a .env file in the project root:

# Create .env file
touch .env

Add your API keys (not needed until Step 3):

# For Gemini extraction (optional for now)
GOOGLE_API_KEY=your-google-api-key-here

# For GPU acceleration (optional)
CUDA_VISIBLE_DEVICES=0
NUM_PROC=10

Tip: You can skip the API key for now and add it later when we do extraction.
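
The tutorial scripts read the key with os.getenv, and a .env file is not exported into your shell automatically. Whether JuDDGES loads .env for you is not stated here, so the following is a minimal stdlib sketch you could call at the top of a script. load_env_file is a hypothetical helper name; it handles only simple KEY=VALUE lines and is not a full .env parser.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=VALUE pairs from a .env file into os.environ.

    Variables that are already set in the environment are left untouched.
    Returns the dict of pairs found in the file.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded
```

Calling load_env_file() before os.getenv("GOOGLE_API_KEY") makes the value from .env visible to the script without an explicit export.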


Step 1: Load Your First Dataset

Understanding JuDDGES Datasets

JuDDGES provides pre-built datasets of Polish and English legal documents. Let's start with a sample dataset.

Create Your First Script

Create a new file called my_first_analysis.py:

"""My First Legal Document Analysis with JuDDGES."""

from datasets import load_dataset
from rich.console import Console
from rich.table import Table

# Initialize rich console for beautiful output
console = Console()

# Step 1: Load a sample dataset
console.print("[bold blue]Loading Polish court decisions dataset...[/bold blue]")

# Load a small sample for quick experimentation
dataset = load_dataset(
    "JuDDGES/pl-court-raw-sample",
    split="train[:100]"  # Load only first 100 documents
)

console.print(f"[green]✓ Loaded {len(dataset)} documents[/green]")

# Step 2: Explore the dataset structure
console.print("\n[bold]Dataset Info:[/bold]")
console.print(f"Number of documents: {len(dataset)}")
console.print(f"Available fields: {list(dataset.features.keys())}")

# Step 3: Display first document
console.print("\n[bold]First Document Sample:[/bold]")
sample = dataset[0]

# Create a table for better visualization
table = Table(show_header=True, header_style="bold magenta")
table.add_column("Field", style="cyan")
table.add_column("Value", style="white")

# Show key fields (str() guards against None or non-string values such as dates)
table.add_row("ID", str(sample.get("id", "N/A")))
table.add_row("Court", str(sample.get("court", "N/A")))
table.add_row("Date", str(sample.get("judgment_date", "N/A")))
table.add_row("Case Type", str(sample.get("court_type", "N/A")))

# Show truncated text
text_preview = (sample.get("text") or "")[:200] + "..."
table.add_row("Text Preview", text_preview)

console.print(table)

console.print("\n[bold green]✓ Step 1 Complete![/bold green]")
console.print("You've successfully loaded and explored a legal dataset!")

Run Your Script

python my_first_analysis.py

Expected output:

Loading Polish court decisions dataset...
✓ Loaded 100 documents

Dataset Info:
Number of documents: 100
Available fields: ['id', 'text', 'court', 'judgment_date', 'court_type', ...]

First Document Sample:
┌─────────────┬──────────────────────────────┐
│ Field       │ Value                        │
├─────────────┼──────────────────────────────┤
│ ID          │ 12345678                     │
│ Court       │ Sąd Okręgowy w Warszawie     │
│ Date        │ 2023-03-15                   │
│ Case Type   │ civil                        │
│ Text Preview│ W imieniu Rzeczypospolitej...│
└─────────────┴──────────────────────────────┘

✓ Step 1 Complete!
You've successfully loaded and explored a legal dataset!

🎯 Checkpoint 1: Test Your Understanding

Question: How many fields does the dataset have?

Answer: run this to find out:

print(f"Number of fields: {len(dataset.features)}")
print(f"Field names: {list(dataset.features.keys())}")

The dataset typically has 15-20 fields, including id, text, court, judgment_date, and parties.

Try This: Modify the script to load 200 documents instead of 100. What happens to the loading time?
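
To compare loading times, you can wrap the call in a small timer. The sketch below is an illustration only: time_call and fake_load are hypothetical names, and fake_load stands in for load_dataset so the example runs without a download.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in for load_dataset so the sketch runs offline
def fake_load(n):
    return list(range(n))


for n in (100, 200):
    docs, seconds = time_call(fake_load, n)
    print(f"{len(docs)} documents in {seconds:.4f}s")
```

In your own script the call would look like time_call(load_dataset, "JuDDGES/pl-court-raw-sample", split="train[:200]"). Keep in mind that the datasets library caches downloads, so a second run is usually much faster than the first.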


Step 2: Explore Document Structure

Now let's dive deeper into the structure of legal documents.

Add to Your Script

"""Step 2: Explore document structure in detail."""

# Continuing from previous script...

console.print("\n[bold blue]Step 2: Exploring Document Structure[/bold blue]")

# Analyze document statistics
total_docs = len(dataset)
courts = {}
years = {}

for doc in dataset:
    # Count documents by court
    court = doc.get("court", "Unknown")
    courts[court] = courts.get(court, 0) + 1

    # Count documents by year
    date = doc.get("judgment_date", "")
    if date:
        year = date[:4]  # Extract year from YYYY-MM-DD
        years[year] = years.get(year, 0) + 1

# Display court statistics
console.print("\n[bold]Top 5 Courts by Document Count:[/bold]")
top_courts = sorted(courts.items(), key=lambda x: x[1], reverse=True)[:5]

court_table = Table(show_header=True, header_style="bold cyan")
court_table.add_column("Court Name", style="white")
court_table.add_column("Count", justify="right", style="green")

for court, count in top_courts:
    court_table.add_row(court, str(count))

console.print(court_table)

# Display year distribution
console.print("\n[bold]Documents by Year:[/bold]")
year_table = Table(show_header=True, header_style="bold cyan")
year_table.add_column("Year", style="white")
year_table.add_column("Count", justify="right", style="green")

for year in sorted(years.keys()):
    year_table.add_row(year, str(years[year]))

console.print(year_table)

# Analyze document length
text_lengths = [len(doc.get("text", "")) for doc in dataset]
avg_length = sum(text_lengths) / len(text_lengths)
min_length = min(text_lengths)
max_length = max(text_lengths)

console.print("\n[bold]Document Length Statistics:[/bold]")
console.print(f"Average length: {avg_length:,.0f} characters")
console.print(f"Shortest: {min_length:,} characters")
console.print(f"Longest: {max_length:,} characters")

console.print("\n[bold green]✓ Step 2 Complete![/bold green]")

🎯 Checkpoint 2: Exploration Exercise

Exercise: Find documents from a specific court.

# Try this code
target_court = "Sąd Okręgowy w Warszawie"
filtered_docs = [doc for doc in dataset if doc.get("court") == target_court]
console.print(f"Found {len(filtered_docs)} documents from {target_court}")

Challenge: Can you find the oldest document in the dataset? Write code to find it!

Hint
oldest_doc = min(
    dataset,
    # Treat missing or None dates as far in the future so they are never picked
    key=lambda x: x.get("judgment_date") or "9999-99-99",
)
console.print(f"Oldest: {oldest_doc['judgment_date']} from {oldest_doc['court']}")

Step 3: Extract Key Information

Now let's extract structured information from judgments using Gemini.

Prerequisites Check

Before continuing, ensure you have:

# Check if Google API key is set
echo $GOOGLE_API_KEY  # Should show your API key

If not set, add it to your .env file or export it:

export GOOGLE_API_KEY="your-api-key-here"

Extraction Script

Create extract_information.py:

"""Extract structured information from legal documents."""

import os
from datasets import load_dataset
from rich.console import Console
from rich.panel import Panel

from juddges.extraction import GeminiExtractionChain
from juddges.extraction.gemini_chain import DocumentType, ExtractionSchema

console = Console()

# Step 1: Initialize extraction chain
console.print("[bold blue]Initializing Gemini extraction chain...[/bold blue]")

chain = GeminiExtractionChain(
    model_name="gemini-2.5-flash",  # Fast and cost-effective
    api_key=os.getenv("GOOGLE_API_KEY"),
    temperature=0.0,  # Deterministic for consistency
    cache_path=".cache/tutorial_extraction.db",
)

console.print("[green]✓ Chain initialized[/green]")

# Step 2: Load a document
console.print("\n[bold blue]Loading document...[/bold blue]")
dataset = load_dataset("JuDDGES/pl-court-raw-sample", split="train[:1]")
document = dataset[0]

console.print(f"[green]✓ Loaded document from {document.get('court', 'Unknown')}[/green]")
console.print(f"Document length: {len(document['text']):,} characters")

# Step 3: Define extraction schema
console.print("\n[bold blue]Defining extraction schema...[/bold blue]")

schema = ExtractionSchema(
    fields={
        "verdict_date": "date as ISO 8601, data wydania wyroku",
        "case_signature": "string, sygnatura sprawy",
        "court": "string, nazwa sądu",
        "judge_names": "List[string], imiona i nazwiska sędziów",
        "parties": "List[string], strony postępowania",
        "verdict": "string, treść rozstrzygnięcia",
    },
    instructions=(
        "Wyodrębnij informacje faktyczne z treści wyroku. "
        "Dla dat użyj formatu ISO 8601. "
        "Dla list uwzględnij wszystkie wymienione pozycje."
    ),
    language="polish",
)

console.print("[green]✓ Schema defined with 6 fields[/green]")

# Step 4: Extract information
console.print("\n[bold blue]Extracting information...[/bold blue]")
console.print("[yellow]This may take 5-10 seconds for the first run...[/yellow]")

try:
    result = chain.extract(
        document_type=DocumentType.JUDGMENT,
        text=document["text"],
        schema=schema,
    )

    console.print("[green]✓ Extraction complete![/green]")

    # Step 5: Display results
    console.print("\n[bold]Extracted Information:[/bold]")

    for field, value in result.items():
        if isinstance(value, list):
            # str() keeps join from failing on non-string list items
            value_str = "\n  • " + "\n  • ".join(str(v) for v in value) if value else "[]"
        else:
            value_str = str(value)

        panel = Panel(
            value_str,
            title=f"[cyan]{field}[/cyan]",
            border_style="blue",
        )
        console.print(panel)

    console.print("\n[bold green]✓ Step 3 Complete![/bold green]")
    console.print("You've successfully extracted structured data from a legal document!")

except Exception as e:
    console.print(f"[red]Error during extraction: {e}[/red]")
    console.print("[yellow]Tip: Make sure GOOGLE_API_KEY is set correctly[/yellow]")

Run Extraction

python extract_information.py

Expected output:

Initializing Gemini extraction chain...
✓ Chain initialized

Loading document...
✓ Loaded document from Sąd Okręgowy w Warszawie
Document length: 8,543 characters

Defining extraction schema...
✓ Schema defined with 6 fields

Extracting information...
This may take 5-10 seconds for the first run...
✓ Extraction complete!

Extracted Information:
╭─ verdict_date ──────────────────╮
│ 2023-03-15                      │
╰─────────────────────────────────╯
╭─ case_signature ────────────────╮
│ II C 123/2023                   │
╰─────────────────────────────────╯
[... more fields ...]

✓ Step 3 Complete!
You've successfully extracted structured data from a legal document!

🎯 Checkpoint 3: Extraction Challenge

Challenge: Modify the schema to extract additional fields:

  • legal_basis: List of referenced laws
  • verdict_type: Type of verdict (e.g., "oddalono powództwo", "uwzględniono powództwo")

Try running the extraction again with your modified schema!
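
One way to approach the challenge is to build the fields mapping as a plain dictionary first and then pass it to ExtractionSchema as in the script above. The sketch below only constructs the dictionary, so it runs without an API key. The Polish descriptions for the two new fields are suggested wording, not text taken from JuDDGES.

```python
# Fields from the tutorial script, in the "type, description" pattern
base_fields = {
    "verdict_date": "date as ISO 8601, data wydania wyroku",
    "case_signature": "string, sygnatura sprawy",
    "court": "string, nazwa sądu",
    "judge_names": "List[string], imiona i nazwiska sędziów",
    "parties": "List[string], strony postępowania",
    "verdict": "string, treść rozstrzygnięcia",
}

# Hypothetical descriptions for the two additional fields
extra_fields = {
    "legal_basis": "List[string], powołane przepisy prawa",
    "verdict_type": 'string, rodzaj rozstrzygnięcia, np. "oddalono powództwo"',
}

# Merge into one mapping; later keys would override earlier ones on a clash
fields = {**base_fields, **extra_fields}
print(f"Schema now has {len(fields)} fields: {list(fields)}")
```

Passing fields=fields to ExtractionSchema, with the same instructions and language as before, should then request all eight fields in one call.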


Step 4: Search Documents Semantically

Let's set up semantic search using the Weaviate vector database.

Start Weaviate

# Navigate to weaviate directory
cd weaviate

# Start Weaviate with Docker Compose
docker compose up -d

# Check if it's running
docker compose ps

# Return to the project root for the next steps
cd ..

Expected output:

NAME                IMAGE                       STATUS
weaviate            semitechnologies/weaviate   Up 5 seconds

Semantic Search Script

Create semantic_search.py:

"""Perform semantic search on legal documents."""

from rich.console import Console
from rich.table import Table

from juddges.data.judgments_weaviate_db import JudgmentsWeaviateDB

console = Console()

# Step 1: Connect to Weaviate
console.print("[bold blue]Connecting to Weaviate vector database...[/bold blue]")

try:
    db = JudgmentsWeaviateDB(url="http://localhost:8080")
    console.print("[green]✓ Connected to Weaviate[/green]")
except Exception as e:
    console.print(f"[red]Connection failed: {e}[/red]")
    console.print("[yellow]Make sure Weaviate is running: cd weaviate && docker compose up -d[/yellow]")
    raise SystemExit(1)

# Step 2: Check database status
console.print("\n[bold blue]Checking database status...[/bold blue]")

# Get document count (this may be 0 if you haven't ingested data yet)
# We'll show the search functionality anyway

# Step 3: Perform semantic search
console.print("\n[bold blue]Performing semantic search...[/bold blue]")

# Example search query
query = "umowa kredytu we frankach szwajcarskich"

console.print(f"[cyan]Query:[/cyan] {query}")
console.print("[yellow]Searching...[/yellow]")

try:
    results = db.search_semantic(
        query=query,
        limit=5,
    )

    if results:
        console.print(f"[green]✓ Found {len(results)} relevant documents[/green]")

        # Display results
        table = Table(show_header=True, header_style="bold magenta")
        table.add_column("Rank", width=6)
        table.add_column("Court", width=30)
        table.add_column("Date", width=12)
        table.add_column("Relevance", width=10)

        for i, result in enumerate(results, 1):
            court = result.get("court", "Unknown")[:27] + "..."
            date = result.get("judgment_date", "N/A")
            # Distance is converted to similarity score (lower distance = higher relevance)
            relevance = f"{(1 - result.get('distance', 1)) * 100:.1f}%"

            table.add_row(str(i), court, date, relevance)

        console.print(table)

        # Show snippet of most relevant document
        console.print("\n[bold]Most Relevant Document Snippet:[/bold]")
        text = results[0].get("text", "")[:300] + "..."
        console.print(f"[italic]{text}[/italic]")

        console.print("\n[bold green]✓ Step 4 Complete![/bold green]")

    else:
        console.print("[yellow]No documents found in database yet.[/yellow]")
        console.print("[yellow]You can ingest documents using: python scripts/embed/simple_ingest.py[/yellow]")

except Exception as e:
    console.print(f"[red]Search failed: {e}[/red]")

Run Search

python semantic_search.py

Note: If your database is empty, you'll see a message about ingesting documents. Don't worry - the search functionality is set up correctly!
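
The Relevance column in the script is computed as (1 - distance) * 100. With cosine distance, which can exceed 1 for dissimilar vectors, that formula can go negative. The sketch below is a small helper that clamps the result. It assumes the database reports cosine distance, and distance_to_relevance is a hypothetical name, not a JuDDGES function.

```python
def distance_to_relevance(distance):
    """Map a cosine distance to a relevance percentage between 0 and 100.

    A missing distance is treated as the worst case, and the similarity is
    clamped so that distances above 1 do not yield negative percentages.
    """
    if distance is None:
        return 0.0
    similarity = 1.0 - float(distance)
    return max(0.0, min(1.0, similarity)) * 100.0


for d in (0.0, 0.25, 1.0, 1.4, None):
    print(d, f"{distance_to_relevance(d):.1f}%")
```

In the results loop you could then write relevance = f"{distance_to_relevance(result.get('distance')):.1f}%".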

🎯 Checkpoint 4: Search Exercise

Exercise: Try different search queries:

queries = [
    "umowa kredytu we frankach szwajcarskich",
    "odszkodowanie za wypadek przy pracy",
    "rozwód z orzeczeniem o winie",
]

for query in queries:
    results = db.search_semantic(query=query, limit=3)
    print(f"Query: {query} -> Found {len(results)} results")

Challenge: Can you search in English and find relevant documents?


Step 5: Visualize Results

Let's create a simple visualization dashboard.

Create Dashboard Script

Create simple_dashboard.py:

"""Simple interactive dashboard for legal document analysis."""

import os

from datasets import load_dataset
from rich.console import Console
from rich.prompt import Prompt
from rich.table import Table

from juddges.extraction import GeminiExtractionChain
from juddges.extraction.gemini_chain import DocumentType, ExtractionSchema

console = Console()

# Load dataset
console.print("[bold blue]Loading dataset...[/bold blue]")
dataset = load_dataset("JuDDGES/pl-court-raw-sample", split="train[:50]")
console.print(f"[green]✓ Loaded {len(dataset)} documents[/green]")

# Initialize extraction chain
chain = GeminiExtractionChain(
    model_name="gemini-2.5-flash",
    api_key=os.getenv("GOOGLE_API_KEY"),
    cache_path=".cache/tutorial_extraction.db",
)

# Define schema
schema = ExtractionSchema(
    fields={
        "verdict_date": "date as ISO 8601",
        "court": "string, nazwa sądu",
        "parties": "List[string], strony",
        "verdict": "string, wyrok",
    },
    language="polish",
)

def display_menu():
    """Display interactive menu."""
    console.print("\n[bold cyan]═══════════════════════════════════════[/bold cyan]")
    console.print("[bold cyan]  JuDDGES Legal Document Analyzer    [/bold cyan]")
    console.print("[bold cyan]═══════════════════════════════════════[/bold cyan]")
    console.print("\n[bold]Options:[/bold]")
    console.print("  [1] Browse documents")
    console.print("  [2] Extract information from document")
    console.print("  [3] Search by court")
    console.print("  [4] Exit")

def browse_documents():
    """Browse documents in a table."""
    table = Table(show_header=True, header_style="bold magenta")
    table.add_column("ID", width=4)
    table.add_column("Court", width=35)
    table.add_column("Date", width=12)

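    # Slicing a Dataset returns a dict of columns, so select the rows explicitly
    for i, doc in enumerate(dataset.select(range(min(10, len(dataset))))):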
        table.add_row(
            str(i),
            doc.get("court", "Unknown")[:32] + "...",
            doc.get("judgment_date", "N/A"),
        )

    console.print(table)
    console.print(f"\n[italic]Showing 10 of {len(dataset)} documents[/italic]")

def extract_from_document():
    """Extract information from selected document."""
    doc_id = Prompt.ask("\nEnter document ID (0-49)")

    try:
        doc_id = int(doc_id)
        if 0 <= doc_id < len(dataset):
            doc = dataset[doc_id]
            console.print(f"\n[bold]Extracting from document {doc_id}...[/bold]")

            result = chain.extract(
                document_type=DocumentType.JUDGMENT,
                text=doc["text"],
                schema=schema,
            )

            console.print("[green]✓ Extraction complete[/green]")
            for field, value in result.items():
                console.print(f"[cyan]{field}:[/cyan] {value}")
        else:
            console.print("[red]Invalid document ID[/red]")
    except ValueError:
        console.print("[red]Please enter a valid number[/red]")

def search_by_court():
    """Search documents by court name."""
    court_name = Prompt.ask("\nEnter court name (partial match)")

    results = [
        doc for doc in dataset
        if court_name.lower() in doc.get("court", "").lower()
    ]

    if results:
        console.print(f"\n[green]Found {len(results)} matching documents:[/green]")
        for i, doc in enumerate(results[:10]):
            console.print(f"{i+1}. {doc['court']} ({doc.get('judgment_date', 'N/A')})")
    else:
        console.print("[yellow]No matching documents found[/yellow]")

# Main loop
while True:
    display_menu()
    choice = Prompt.ask("\nSelect option", choices=["1", "2", "3", "4"])

    if choice == "1":
        browse_documents()
    elif choice == "2":
        extract_from_document()
    elif choice == "3":
        search_by_court()
    elif choice == "4":
        console.print("[bold green]Thank you for using JuDDGES![/bold green]")
        break

Run Dashboard

python simple_dashboard.py

Try This:

  1. Browse documents (option 1)
  2. Extract information from document 0 (option 2)
  3. Search for "Warszawa" (option 3)

🎯 Checkpoint 5: Final Challenge

Challenge: Enhance the dashboard with a new feature:

  • Add option to filter documents by date range
  • Add option to export extracted data to JSON
  • Add option to compare two documents

Checkpoints & Exercises

Summary Exercises

Now that you've completed all steps, test your knowledge:

Exercise 1: Data Pipeline Create a script that:

  1. Loads 20 documents
  2. Extracts information from each
  3. Saves results to a JSON file
Solution
import json
from datasets import load_dataset
from juddges.extraction import GeminiExtractionChain
from juddges.extraction.gemini_chain import DocumentType, ExtractionSchema

# Load documents
dataset = load_dataset("JuDDGES/pl-court-raw-sample", split="train[:20]")

# Initialize chain
chain = GeminiExtractionChain(model_name="gemini-2.5-flash")

# Define schema
schema = ExtractionSchema(
    fields={
        "verdict_date": "date as ISO 8601",
        "court": "string",
        "verdict": "string",
    },
    language="polish",
)

# Extract from all documents
results = []
for i, doc in enumerate(dataset):
    print(f"Processing document {i+1}/{len(dataset)}...")
    result = chain.extract(
        document_type=DocumentType.JUDGMENT,
        text=doc["text"],
        schema=schema,
    )
    results.append(result)

# Save to JSON
with open("extracted_results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

print("✓ Results saved to extracted_results.json")

Exercise 2: Document Statistics Calculate:

  • Average document length
  • Most common court
  • Documents per year distribution
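
One possible approach to Exercise 2 uses collections.Counter. The sketch below runs on hand-written sample records so it works offline. With the real dataset, you would iterate over the loaded documents instead, assuming they expose the same text, court, and judgment_date fields used earlier in this tutorial.

```python
from collections import Counter
from statistics import mean

# Hand-written sample records standing in for dataset rows
docs = [
    {"court": "Sąd Okręgowy w Warszawie", "judgment_date": "2023-03-15", "text": "abcd"},
    {"court": "Sąd Okręgowy w Warszawie", "judgment_date": "2022-01-10", "text": "ab"},
    {"court": "Sąd Rejonowy w Krakowie", "judgment_date": "2023-07-01", "text": "abcdef"},
]

# Average text length in characters; missing text counts as empty
average_length = mean(len(doc.get("text") or "") for doc in docs)

# Tally documents per court and per year (first four characters of the date)
court_counts = Counter(doc.get("court") or "Unknown" for doc in docs)
year_counts = Counter(
    str(doc["judgment_date"])[:4] for doc in docs if doc.get("judgment_date")
)

most_common_court, count = court_counts.most_common(1)[0]
print(f"Average length: {average_length:.1f} characters")
print(f"Most common court: {most_common_court} ({count})")
print(f"Documents per year: {dict(sorted(year_counts.items()))}")
```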

Exercise 3: Custom Schema Create a schema for extracting:

  • All monetary amounts mentioned
  • All dates mentioned
  • All legal article references

Troubleshooting

Issue: "ModuleNotFoundError: No module named 'juddges'"

Solution: Make sure you've installed the package:

source .venv/bin/activate  # Activate environment
pip install -e .  # Install in editable mode

Issue: "GOOGLE_API_KEY not found"

Solution: Set the environment variable:

export GOOGLE_API_KEY="your-api-key"
# Or add to .env file

Issue: "Weaviate connection failed"

Solution: Ensure Weaviate is running:

cd weaviate
docker compose up -d
docker compose ps  # Check status
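
If the container is listed as running but the script still cannot connect, you can probe Weaviate's readiness endpoint directly. This is a minimal stdlib sketch: is_weaviate_ready is a hypothetical helper, and the URL assumes the default http://localhost:8080 address used in this tutorial.

```python
import urllib.error
import urllib.request


def is_weaviate_ready(base_url="http://localhost:8080", timeout=3.0):
    """Return True if the Weaviate readiness endpoint answers with HTTP 200."""
    url = base_url.rstrip("/") + "/v1/.well-known/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, or a malformed URL
        return False


if __name__ == "__main__":
    print("Weaviate ready:", is_weaviate_ready())
```

A False result right after docker compose up -d often just means the service is still starting, so waiting a few seconds and retrying is a reasonable first step.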

Issue: "Out of memory"

Solution: Reduce the number of documents:

# Instead of loading all documents
dataset = load_dataset(..., split="train[:50]")  # Load only 50

Issue: "Extraction is slow"

Solution:

  • Use cache (it's enabled by default)
  • Use gemini-2.5-flash instead of gemini-2.5-pro
  • Process documents in smaller batches
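
The last point can be sketched with a small batching helper. This is an illustration under assumptions: batched is written by hand here (itertools.batched only exists from Python 3.12), and the list of strings stands in for document texts that you would pass to chain.extract one at a time.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter batch


# Stand-in for document texts; in practice iterate over the dataset
texts = [f"document {i}" for i in range(7)]
for number, batch in enumerate(batched(texts, 3), start=1):
    print(f"Batch {number}: {len(batch)} documents")
```

Working batch by batch lets you save partial results after each batch, so an interruption does not cost you everything processed so far.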

Summary

Congratulations! You've completed your first legal document analysis with JuDDGES.

What You've Learned

  • ✅ Environment Setup: Installed and configured JuDDGES
  • ✅ Data Loading: Loaded and explored legal document datasets
  • ✅ Document Structure: Analyzed document fields and statistics
  • ✅ Information Extraction: Extracted structured data using Gemini
  • ✅ Semantic Search: Performed vector-based document search
  • ✅ Visualization: Created an interactive analysis dashboard

Key Concepts

Concept           | Description
Dataset           | Collection of legal documents with structured metadata
Extraction Schema | Definition of what information to extract
Semantic Search   | Finding documents by meaning, not just keywords
Vector Database   | Storage system for document embeddings
Gemini Chain      | LangChain pipeline for LLM-based extraction

Your Skills Progression

Beginner → Novice → Intermediate → Advanced → Expert
^
You are here! 🎉

Next Steps

Continue Learning

Now that you've mastered the basics, explore these tutorials:

  1. Tutorial 2: Working with Legal Document Embeddings
     • Generate embeddings for documents
     • Ingest large datasets to Weaviate
     • Visualize document spaces with UMAP

  2. Tutorial 3: Fine-tuning Legal LLMs
     • Prepare instruction datasets
     • Fine-tune models with PEFT/LoRA
     • Evaluate model performance

  3. Tutorial 4: Advanced Information Extraction
     • Complex extraction schemas
     • Multi-step extraction pipelines
     • Validation and quality control

  4. Tutorial 5: End-to-End Project
     • Build a complete legal analysis system
     • Deploy to production
     • Monitor and maintain

Explore How-To Guides

Need to solve specific problems? Check out:

Understand the Architecture

Want to dive deeper? Read:

Join the Community

  • GitHub: Report issues and contribute
  • Discussions: Ask questions and share ideas
  • Documentation: Help improve our docs

Support

For questions or issues:


Last Updated: 2025-10-11 | Version: 1.0 | Status: Published

Estimated Completion Time: 30-60 minutes | Difficulty: Beginner | Prerequisites: Python basics