Evaluation Metrics¶

Comprehensive evaluation metrics for information extraction tasks with support for dates, numbers, strings, enums, and lists.

Overview¶

The juddges.evals.metrics module provides field-level evaluation metrics for comparing predicted extractions against ground truth labels. Each metric is designed for a specific data type:

Dates: Flexible date parsing and exact matching
Numbers: Numeric comparison with tolerance
Strings: ROUGE scores (unigram, bigram, longest common subsequence)
Enums: Classification accuracy with hallucination detection
Lists: Precision, recall, F1 with greedy matching

Key Features¶

Type-Specific Metrics: Optimized evaluation for each data type
Flexible Parsing: Handles date strings in the formats dateutil recognizes and numbers supplied as numeric strings
ROUGE Scores: Industry-standard text similarity metrics
Hallucination Detection: Identifies out-of-vocabulary enum predictions
List Matching: Greedy matching with standard IR metrics
Missing-Value Handling: `None` inputs are scored rather than raising an error

Usage Examples¶

Date Evaluation¶

from juddges.evals.metrics import evaluate_date

# Exact match
result = evaluate_date("2024-01-15", "2024-01-15")
print(result)  # {"match": 1}

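# Flexible parsing (any format dateutil recognizes)
# Note: dateutil's default parser knows English month names only, so a
# Polish date such as "15 stycznia 2024" fails to parse and scores 0.
result = evaluate_date("January 15, 2024", "2024-01-15")
print(result)  # {"match": 1} - parsed and matched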

# Mismatch
result = evaluate_date("2024-01-15", "2024-01-16")
print(result)  # {"match": 0}

# Missing value
result = evaluate_date(None, "2024-01-15")
print(result)  # {"match": 0}
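Because dateutil's default parser recognizes English month names only, dates written with Polish month names need normalizing before they reach evaluate_date. The following is a minimal sketch of such a step, assuming day-month-year input with genitive month names. `normalize_polish_date` and `POLISH_MONTHS` are hypothetical helpers for illustration, not part of juddges.evals.metrics.

```python
import re
from typing import Optional

# Genitive month names as they appear in written Polish dates.
POLISH_MONTHS = {
    "stycznia": 1, "lutego": 2, "marca": 3, "kwietnia": 4,
    "maja": 5, "czerwca": 6, "lipca": 7, "sierpnia": 8,
    "września": 9, "października": 10, "listopada": 11, "grudnia": 12,
}

_DATE_RE = re.compile(r"\s*(\d{1,2})\s+(\w+)\s+(\d{4})\s*")


def normalize_polish_date(text: Optional[str]) -> Optional[str]:
    """Return an ISO date string for 'D month YYYY' input, else None."""
    if text is None:
        return None
    match = _DATE_RE.fullmatch(text.lower())
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = POLISH_MONTHS.get(month_name)
    if month is None:
        return None
    # The day is zero-padded but not validated against the calendar.
    return f"{year}-{month:02d}-{int(day):02d}"


print(normalize_polish_date("15 stycznia 2024"))  # 2024-01-15
```

The normalized string can then be passed to evaluate_date. Input the helper does not recognize returns None, which a caller might handle by falling back to the original string.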

Number Evaluation¶

from juddges.evals.metrics import evaluate_number

# Exact match
result = evaluate_number(42, 42)
print(result)  # {"match": 1}

# String parsing
result = evaluate_number("1000.50", "1000.5")
print(result)  # {"match": 1}

# Tolerance-based matching
result = evaluate_number(1000.0001, 1000.0, atol=0.001)
print(result)  # {"match": 1}

# Type conversion
result = evaluate_number("42", 42)
print(result)  # {"match": 1}

String (ROUGE) Evaluation¶

from juddges.evals.metrics import evaluate_string_rouge

# Exact match
result = evaluate_string_rouge(
    "Sąd Okręgowy w Warszawie",
    "Sąd Okręgowy w Warszawie"
)
print(result)
# {
#     "rouge1": 1.0,
#     "rouge2": 1.0,
#     "rougeL": 1.0
# }

# Partial match
result = evaluate_string_rouge(
    "Sąd w Warszawie",
    "Sąd Okręgowy w Warszawie"
)
print(result)
# Each score is an F-measure (see the source in the API Reference),
# so for this pair all three values fall between 0.0 and 1.0.
# Exact numbers depend on how TorchMetrics tokenizes the text.

# Missing values
result = evaluate_string_rouge(None, "some text")
print(result)
# {
#     "rouge1": 0.0,
#     "rouge2": 0.0,
#     "rougeL": 0.0
# }
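To make the scores easier to interpret, here is a minimal sketch of the ROUGE-N F-measure computed by hand. It assumes lowercased whitespace tokens, which is a simplification: TorchMetrics applies its own text normalization, so the values it reports for the same pair may differ. `ngrams` and `rouge_n_f` are hypothetical names used only for this illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_f(predicted: str, gold: str, n: int = 1) -> float:
    """F-measure of n-gram overlap on lowercased whitespace tokens."""
    pred_counts = Counter(ngrams(predicted.lower().split(), n))
    gold_counts = Counter(ngrams(gold.lower().split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((pred_counts & gold_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)


pred, gold = "Sąd w Warszawie", "Sąd Okręgowy w Warszawie"
print(round(rouge_n_f(pred, gold, n=1), 3))  # 0.857 (precision 3/3, recall 3/4)
print(round(rouge_n_f(pred, gold, n=2), 3))  # 0.4   (precision 1/2, recall 1/3)
```

Under this tokenization, dropping a single word from the prediction lowers recall but not precision, which is why the F-measure sits above the recall figure.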

Enum Evaluation¶

from juddges.evals.metrics import evaluate_enum

# Valid prediction
result = evaluate_enum(
    predicted="Wyrok",
    gold="Wyrok",
    choices=["Wyrok", "Postanowienie", "Uchwała"]
)
print(result)
# {
#     "match": 1,
#     "predicted_in_choices": 1
# }

# Hallucination (invalid choice)
result = evaluate_enum(
    predicted="InvalidType",
    gold="Wyrok",
    choices=["Wyrok", "Postanowienie", "Uchwała"]
)
print(result)
# {
#     "match": 0,
#     "predicted_in_choices": 0  # Hallucination detected
# }

# Mismatch but valid
result = evaluate_enum(
    predicted="Postanowienie",
    gold="Wyrok",
    choices=["Wyrok", "Postanowienie", "Uchwała"]
)
print(result)
# {
#     "match": 0,
#     "predicted_in_choices": 1
# }
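The two flags can be rolled up across a dataset into an accuracy and a hallucination rate. The sketch below assumes a list of dictionaries shaped like the evaluate_enum results above. `summarize_enum_results` is a hypothetical helper, not part of the module.

```python
from typing import Dict, List


def summarize_enum_results(results: List[Dict[str, int]]) -> Dict[str, float]:
    """Compute accuracy and hallucination rate from evaluate_enum-style dicts."""
    total = len(results)
    if total == 0:
        return {"accuracy": 0.0, "hallucination_rate": 0.0}
    correct = sum(r["match"] for r in results)
    # predicted_in_choices == 0 marks a prediction outside the allowed values.
    out_of_vocab = sum(1 - r["predicted_in_choices"] for r in results)
    return {
        "accuracy": correct / total,
        "hallucination_rate": out_of_vocab / total,
    }


# The three results shown in the examples above.
results = [
    {"match": 1, "predicted_in_choices": 1},
    {"match": 0, "predicted_in_choices": 0},
    {"match": 0, "predicted_in_choices": 1},
]
print(summarize_enum_results(results))  # both values are 1/3
```

A `None` prediction also yields `predicted_in_choices` of 0 unless `None` is listed in `choices`, so this rate mixes missing predictions with invented labels unless they are filtered out first.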

List Evaluation¶

from juddges.evals.metrics import evaluate_list_greedy

# Perfect match
result = evaluate_list_greedy(
    predicted=["Art. 123", "Art. 456"],
    gold=["Art. 123", "Art. 456"]
)
print(result)
# {
#     "true_positives": 2,
#     "false_positives": 0,
#     "false_negatives": 0,
#     "precision": 1.0,
#     "recall": 1.0,
#     "f1": 1.0
# }

# Partial match
result = evaluate_list_greedy(
    predicted=["Art. 123", "Art. 789"],
    gold=["Art. 123", "Art. 456"]
)
print(result)
# {
#     "true_positives": 1,
#     "false_positives": 1,
#     "false_negatives": 1,
#     "precision": 0.5,
#     "recall": 0.5,
#     "f1": 0.5
# }

# Missing prediction (None is treated as an empty list)
result = evaluate_list_greedy(
    predicted=None,
    gold=["Art. 123"]
)
print(result)
# {
#     "true_positives": 0,
#     "false_positives": 0,
#     "false_negatives": 1,
#     "precision": 0.0,
#     "recall": 0.0,
#     "f1": 0.0
# }

API Reference¶

evaluate_date ¶

evaluate_date(predicted: str | None, gold: str | None) -> dict[str, int]

Parses dates and checks for an exact match.

PARAMETER	DESCRIPTION
`predicted`	The predicted date string. TYPE: `str \| None`
`gold`	The ground truth date string. TYPE: `str \| None`

RETURNS	DESCRIPTION
`dict[str, int]`	`{"match": 1}` if the dates match, `{"match": 0}` otherwise.

Source code in juddges/evals/metrics.py

def evaluate_date(
    predicted: str | None,
    gold: str | None,
) -> dict[str, int]:
    """
    Parses dates and checks for an exact match.

    Args:
        predicted: The predicted date string.
        gold: The ground truth date string.

    Returns:
        {"match": 1} if the dates match, {"match": 0} otherwise.
    """
    if predicted == gold:
        return {"match": 1}

    try:
        predicted_date = date_parser.parse(predicted)
        gold_date = date_parser.parse(gold)
        return {"match": int(predicted_date == gold_date)}
    except (ValueError, TypeError):
        return {"match": 0}

evaluate_number ¶

evaluate_number(predicted: Any, gold: Any, atol: float = 1e-06) -> dict[str, int]

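Compares two numbers, treating them as equal when they differ by at most `atol`.

PARAMETER	DESCRIPTION
`predicted`	The predicted number. TYPE: `Any`
`gold`	The ground truth number. TYPE: `Any`
`atol`	Absolute tolerance for the comparison. TYPE: `float` DEFAULT: `1e-06`

RETURNS	DESCRIPTION
`dict[str, int]`	`{"match": 1}` if the numbers are equal within `atol`, `{"match": 0}` otherwise (including when a value cannot be converted to `float`).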

Source code in juddges/evals/metrics.py

def evaluate_number(
    predicted: Any,
    gold: Any,
    atol: float = 1e-6,
) -> dict[str, int]:
    """
    Compares two numbers, allowing an absolute tolerance.

    Args:
        predicted: The predicted number.
        gold: The ground truth number.
        atol: Absolute tolerance for the comparison.

    Returns:
        {"match": 1} if the numbers are equal within atol, {"match": 0} otherwise.
    """
    if predicted == gold:
        return {"match": 1}
    else:
        try:
            predicted_num = float(predicted)
            gold_num = float(gold)
            return {"match": int(abs(predicted_num - gold_num) <= atol)}
        except (ValueError, TypeError):
            return {"match": 0}

evaluate_string_rouge ¶

evaluate_string_rouge(predicted: str | None, gold: str | None) -> dict[str, float]

Calculates ROUGE scores for two strings using TorchMetrics.

PARAMETER	DESCRIPTION
`predicted`	The predicted string. TYPE: `str \| None`
`gold`	The ground truth string. TYPE: `str \| None`

RETURNS	DESCRIPTION
`dict[str, float]`	A dictionary with `rouge1`, `rouge2`, and `rougeL` F-measures. All are 1.0 when the inputs are identical and 0.0 when only one input is `None`.

Source code in juddges/evals/metrics.py

def evaluate_string_rouge(
    predicted: str | None,
    gold: str | None,
) -> dict[str, float]:
    """
    Calculates ROUGE scores for two strings using TorchMetrics.

    Args:
        predicted: The predicted string.
        gold: The ground truth string.

    Returns:
        A dictionary with rouge1, rouge2, and rougeL F-measures. All are 1.0
        when the inputs are identical and 0.0 when only one input is None.
    """
    if predicted == gold:
        return {"rouge1": 1.0, "rouge2": 1.0, "rougeL": 1.0}
    elif predicted is None or gold is None:
        return {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}

    rouge = ROUGEScore()
    scores = rouge([predicted], [gold])

    return {
        "rouge1": float(scores["rouge1_fmeasure"].item()),
        "rouge2": float(scores["rouge2_fmeasure"].item()),
        "rougeL": float(scores["rougeL_fmeasure"].item()),
    }

evaluate_enum ¶

evaluate_enum(predicted: str | None, gold: str | None, choices: list[str]) -> dict[str, int]

Evaluates enum classification with hallucination detection.

PARAMETER	DESCRIPTION
`predicted`	The predicted enum value. TYPE: `str \| None`
`gold`	The ground truth enum value. TYPE: `str \| None`
`choices`	List of valid enum choices. TYPE: `list[str]`

RETURNS	DESCRIPTION
`dict[str, int]`	Dictionary with `match` (1 if `predicted` equals `gold`) and `predicted_in_choices` (0 flags a prediction outside `choices`).

Source code in juddges/evals/metrics.py

def evaluate_enum(
    predicted: str | None,
    gold: str | None,
    choices: list[str],
) -> dict[str, int]:
    """
    Evaluates enum classification with hallucination detection.

    Args:
        predicted: The predicted enum value.
        gold: The ground truth enum value.
        choices: List of valid enum choices.

    Returns:
        Dictionary with classification metrics and hallucination info.
    """
    return {
        "match": int(predicted == gold),
        "predicted_in_choices": int(predicted in choices),
    }

evaluate_list_greedy ¶

evaluate_list_greedy(predicted: list | None, gold: list | None) -> dict[str, int | float]

Evaluates list matching using a greedy approach.

todo: hungarian matching should be used instead of greedy matching in the final version

PARAMETER	DESCRIPTION
`predicted`	The predicted list. TYPE: `list \| None`
`gold`	The ground truth list. TYPE: `list \| None`

RETURNS	DESCRIPTION
`dict[str, int \| float]`	A dictionary with counts of true positives, false positives, and false negatives, plus precision, recall, and F1 score.

Source code in juddges/evals/metrics.py

def evaluate_list_greedy(predicted: list | None, gold: list | None) -> dict[str, int | float]:
    """
    Evaluates list matching using a greedy approach.

    todo: hungarian matching should be used instead of greedy matching in the final version

    Args:
        predicted: The predicted list.
        gold: The ground truth list.

    Returns:
        A dictionary with counts for true positives, false positives,
        false negatives, and precision, recall, F1-score.
    """
    if predicted is None:
        predicted = []
    if gold is None:
        gold = []

    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)

    true_positives = 0
    for item, count in pred_counts.items():
        true_positives += min(count, gold_counts.get(item, 0))

    false_positives = len(predicted) - true_positives
    false_negatives = len(gold) - true_positives

    precision = (
        true_positives / (true_positives + false_positives)
        if (true_positives + false_positives) > 0
        else 0
    )
    recall = (
        true_positives / (true_positives + false_negatives)
        if (true_positives + false_negatives) > 0
        else 0
    )
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
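The counting in the source above treats both lists as multisets and matches items by exact equality. The sketch below reproduces only that counting step with a Counter intersection, to show how duplicates and near-misses are scored. `count_matches` is a hypothetical standalone helper written for this illustration.

```python
from collections import Counter


def count_matches(predicted: list, gold: list) -> dict:
    """Multiset matching: each gold item can be matched at most once."""
    true_positives = sum((Counter(predicted) & Counter(gold)).values())
    return {
        "true_positives": true_positives,
        "false_positives": len(predicted) - true_positives,
        "false_negatives": len(gold) - true_positives,
    }


# A duplicated prediction is only credited once.
print(count_matches(["Art. 123", "Art. 123"], ["Art. 123"]))
# {'true_positives': 1, 'false_positives': 1, 'false_negatives': 0}

# Matching is exact, so a difference in case is a miss.
print(count_matches(["art. 123"], ["Art. 123"]))
# {'true_positives': 0, 'false_positives': 1, 'false_negatives': 1}
```

If case or spacing differences should not count as errors, items would need normalizing before they are passed in.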

Metric Selection Guide¶

When to Use Each Metric¶

Data Type	Metric	Use Case
Date	`evaluate_date`	Dates, timestamps
Number	`evaluate_number`	Amounts, counts, IDs
String	`evaluate_string_rouge`	Names, titles, long text
Enum	`evaluate_enum`	Categories, classifications
List	`evaluate_list_greedy`	Arrays, multi-value fields

Choosing ROUGE vs Exact Match¶

Use ROUGE when:

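Wording can vary while still sharing most words with the reference
Exact word order is not critical (ROUGE-1 ignores order; ROUGE-2 and ROUGE-L reward it)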
Partial credit is meaningful

Use Exact Match when:

Precision is critical (legal basis references)
Format is standardized (case numbers)
No variation expected
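The functions documented on this page do not include an exact-match metric for strings, so the sketch below shows one way such a check might look. `evaluate_string_exact` and its `normalize` flag are hypothetical and chosen only to mirror the `{"match": ...}` shape used by the other metrics.

```python
from typing import Dict, Optional


def evaluate_string_exact(
    predicted: Optional[str],
    gold: Optional[str],
    normalize: bool = True,
) -> Dict[str, int]:
    """Return {"match": 1} when the two strings are equal, else {"match": 0}."""
    if predicted is None or gold is None:
        # Both missing counts as agreement; one missing is a mismatch.
        return {"match": int(predicted is None and gold is None)}
    if normalize:
        # Collapse runs of whitespace and ignore case differences.
        predicted = " ".join(predicted.split()).casefold()
        gold = " ".join(gold.split()).casefold()
    return {"match": int(predicted == gold)}


print(evaluate_string_exact("II K  123/20", "II K 123/20"))  # {'match': 1}
print(evaluate_string_exact("II K 123/21", "II K 123/20"))   # {'match': 0}
```

Setting `normalize=False` gives a strict character-for-character comparison, which may suit identifiers where case is significant.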

Advanced Usage¶

Custom Tolerance for Numbers¶

# Financial amounts (2 decimal places)
result = evaluate_number(
    predicted=1000.01,
    gold=1000.00,
    atol=0.01
)

# Approximate counts
result = evaluate_number(
    predicted=103,
    gold=100,
    atol=5
)
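Since `atol` is an absolute tolerance, a single value behaves very differently for small and large numbers. The sketch below contrasts it with a relative tolerance using the standard library's `math.isclose`. `numbers_match` and its `rtol` parameter are hypothetical; evaluate_number itself accepts only `atol`.

```python
import math


def numbers_match(predicted: float, gold: float, atol: float = 1e-6, rtol: float = 0.0) -> int:
    """Return 1 if the values agree within an absolute or relative tolerance."""
    return int(math.isclose(predicted, gold, rel_tol=rtol, abs_tol=atol))


# An absolute tolerance of 0.01 accepts a prediction that is double the gold value.
print(numbers_match(0.002, 0.001, atol=0.01))              # 1

# A 1% relative tolerance rejects it but still accepts a close large value.
print(numbers_match(0.002, 0.001, atol=0.0, rtol=0.01))    # 0
print(numbers_match(1005.0, 1000.0, atol=0.0, rtol=0.01))  # 1
```

When fields span several orders of magnitude, choosing `atol` per field (as in the examples above) is the option available with evaluate_number as documented.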

Aggregating Metrics¶

from typing import Dict, List

from juddges.evals.metrics import evaluate_date

def aggregate_metrics(results: List[Dict]) -> Dict:
    """Aggregate evaluation results across multiple examples."""
    total = len(results)
    correct = sum(r["match"] for r in results)

    return {
        # Guard against an empty result list to avoid ZeroDivisionError
        "accuracy": correct / total if total else 0.0,
        "total_examples": total,
        "correct": correct,
    }

# Evaluate multiple extractions
# (predictions and ground_truth are equal-length lists of date strings or None)
results = [
    evaluate_date(pred, gold)
    for pred, gold in zip(predictions, ground_truth)
]

# Aggregate
summary = aggregate_metrics(results)
print(summary)
# {
#     "accuracy": 0.85,
#     "total_examples": 100,
#     "correct": 85
# }
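The aggregation above reads only the `match` key, so it fits date, number, and enum results. For list results, one common option is micro-averaging: pool the counts across examples and compute precision, recall, and F1 once. The sketch below assumes dictionaries carrying the count keys returned by evaluate_list_greedy. `micro_average` is a hypothetical helper.

```python
from typing import Dict, List


def micro_average(results: List[Dict]) -> Dict[str, float]:
    """Pool TP/FP/FN counts across examples, then compute P/R/F1 once."""
    tp = sum(r["true_positives"] for r in results)
    fp = sum(r["false_positives"] for r in results)
    fn = sum(r["false_negatives"] for r in results)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Count fields taken from the three list examples earlier on this page.
results = [
    {"true_positives": 2, "false_positives": 0, "false_negatives": 0},
    {"true_positives": 1, "false_positives": 1, "false_negatives": 1},
    {"true_positives": 0, "false_positives": 0, "false_negatives": 1},
]
print(micro_average(results))  # precision 0.75, recall 0.6, f1 about 0.667
```

Micro-averaging weights every list item equally. Averaging the per-example `f1` values instead would weight every example equally, which gives short lists more influence.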

Multi-Field Evaluation¶

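from typing import Dict

from juddges.evals.metrics import (
    evaluate_date,
    evaluate_enum,
    evaluate_list_greedy,
    evaluate_number,
    evaluate_string_rouge,
)

def evaluate_extraction(predicted: Dict, gold: Dict, schema: Dict) -> Dict:
    """Evaluate all fields in an extraction."""
    results = {}

    for field, config in schema.items():
        pred_val = predicted.get(field)
        gold_val = gold.get(field)

        if config["type"] == "date":
            results[field] = evaluate_date(pred_val, gold_val)
        elif config["type"] == "enum":
            results[field] = evaluate_enum(
                pred_val, gold_val, config["choices"]
            )
        elif config["type"] == "string":
            results[field] = evaluate_string_rouge(pred_val, gold_val)
        elif config["type"] == "number":
            results[field] = evaluate_number(pred_val, gold_val)
        elif config["type"] == "list":
            results[field] = evaluate_list_greedy(pred_val, gold_val)
        # Fields with an unrecognized type are skipped

    return results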

# Define schema
schema = {
    "verdict_date": {"type": "date"},
    "judgment_type": {
        "type": "enum",
        "choices": ["Wyrok", "Postanowienie", "Uchwała"]
    },
    "court_name": {"type": "string"}
}

# Evaluate
results = evaluate_extraction(predicted, gold, schema)
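The per-field results form a nested dictionary, which is awkward to store as a single table row. As a minimal sketch, the nested structure can be flattened into `field_metric` keys. `flatten_results` is a hypothetical helper, and the sample values below are illustrative rather than real output.

```python
from typing import Dict


def flatten_results(results: Dict[str, Dict]) -> Dict[str, float]:
    """Turn {field: {metric: value}} into {"field_metric": value}."""
    return {
        f"{field}_{metric}": value
        for field, metrics in results.items()
        for metric, value in metrics.items()
    }


# Example input shaped like the return value of evaluate_extraction.
nested = {
    "verdict_date": {"match": 1},
    "judgment_type": {"match": 1, "predicted_in_choices": 1},
    "court_name": {"rouge1": 1.0, "rouge2": 1.0, "rougeL": 1.0},
}
print(flatten_results(nested))
```

Each flattened dictionary can then serve as one row when results for many documents are collected.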

Performance Considerations¶

ROUGE Computation¶

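ROUGE scores are computed with TorchMetrics. As the source in the API Reference shows, each call to evaluate_string_rouge builds a new ROUGEScore and scores a single pair, so a batch is processed one pair at a time:

# Each call scores one (prediction, gold) pair
results = [
    evaluate_string_rouge(pred, gold)
    for pred, gold in zip(predictions, ground_truth)
]
# Identical pairs and None inputs return early without invoking TorchMetrics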
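When the same prediction and gold pair recurs often, as with court names, one way to cut the cost is to score each distinct pair once. The sketch below shows the idea with a stand-in scorer so that it runs without TorchMetrics. `score_unique_pairs` and `fake_scorer` are hypothetical; in practice `score_fn` would be evaluate_string_rouge.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Pair = Tuple[Hashable, Hashable]


def score_unique_pairs(pairs: List[Pair], score_fn: Callable) -> List[Dict]:
    """Call score_fn once per distinct (predicted, gold) pair."""
    cache: Dict[Pair, Dict] = {}
    for pair in pairs:
        if pair not in cache:
            cache[pair] = score_fn(*pair)
    # Copy each cached dict so callers can modify results independently.
    return [dict(cache[pair]) for pair in pairs]


# Stand-in scorer that records how often it is called.
calls = []


def fake_scorer(predicted, gold):
    calls.append((predicted, gold))
    return {"score": float(predicted == gold)}


pairs = [("a", "a"), ("a", "b"), ("a", "a")]
results = score_unique_pairs(pairs, fake_scorer)
print(len(results), len(calls))  # 3 2
```

Whether this helps depends on how repetitive the data is; for mostly unique free-text fields the cache adds little.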

Date Parsing¶

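Date parsing uses dateutil.parser, which is flexible but can be slow. Normalizing both sides to ISO date strings up front lets equal dates hit the string-equality shortcut in evaluate_date; pairs that differ are still parsed again inside the function:

# Pre-parse and normalize dates (parser.parse raises on None or unparseable input)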
from dateutil import parser

parsed_preds = [parser.parse(d) for d in date_strings]
parsed_golds = [parser.parse(d) for d in gold_strings]

results = [
    evaluate_date(str(p.date()), str(g.date()))
    for p, g in zip(parsed_preds, parsed_golds)
]

Related¶

Extraction Evaluation - End-to-end evaluation pipeline
LLM as Judge - LLM-based evaluation
Gemini Chain - Extraction pipeline
How-To: Evaluation - Evaluation guide

Common Patterns¶

Production Evaluation Pipeline¶

from typing import Dict

import pandas as pd
from juddges.evals.metrics import (
    evaluate_date,
    evaluate_string_rouge,
    evaluate_enum,
    evaluate_list_greedy
)

def evaluate_dataset(predictions_df: pd.DataFrame, schema: Dict) -> pd.DataFrame:
    """Evaluate entire dataset against schema."""
    results = []

    for idx, row in predictions_df.iterrows():
        pred = row["predicted"]
        gold = row["ground_truth"]

        # Evaluate each field
        eval_result = {"id": row["id"]}

        for field, config in schema.items():
            pred_val = pred.get(field)
            gold_val = gold.get(field)

            if config["type"] == "date":
                metrics = evaluate_date(pred_val, gold_val)
                eval_result[f"{field}_match"] = metrics["match"]

            elif config["type"] == "string":
                metrics = evaluate_string_rouge(pred_val, gold_val)
                eval_result[f"{field}_rouge1"] = metrics["rouge1"]
                eval_result[f"{field}_rouge2"] = metrics["rouge2"]
                eval_result[f"{field}_rougeL"] = metrics["rougeL"]

            # ... handle other types

        results.append(eval_result)

    return pd.DataFrame(results)
