# Dataset Card for JuDDGES/pl-court-instruct

``` {.python .cell-code}
import datasets
import transformers
import warnings
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import pandas as pd
import polars as pl
import seaborn as sns
import yaml
from datasets import load_dataset
from transformers import AutoTokenizer
from IPython.display import display


warnings.filterwarnings("ignore")
sns.set_theme("notebook")
transformers.logging.set_verbosity_error()
datasets.logging.set_verbosity_error()
datasets.utils.disable_progress_bars()

ds = load_dataset("JuDDGES/pl-court-instruct")
display(ds["train"][0])
```

 

### Data Fields


| Feature name     | Feature description                                                                                                                       | Type       |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------------|------------|
| _id              | Unique identifier of the judgement                                                                                                        | `string`   |
| prompt           | The prompt template provided for extracting information from the judgement. It contains placeholder `{context}` for the judgement content. | `string`   |
| context          | The full text content of the judgement                                                                                                    | `string`   |
| output           | The extracted information in YAML format based on the provided context                                                                    | `string`   |


### Data Splits

::: {#ee96bab3205ad17a .cell}
``` {.python .cell-code}
data = []
for split in ds.keys():
    data.append({"split": split, "# samples": len(ds[split])})

df = pd.DataFrame(data)
df["% samples"] = (df["# samples"] / df["# samples"].sum() * 100).round(2)
print(df.to_markdown(index=False))
```
:::

| split | # samples | % samples |
|-------|-----------|-----------|
| train | 238851    | 99.17     |
| test  | 2000      | 0.83      |

## Dataset Creation

For details on the dataset creation, see the paper (TBA) and the accompanying code repository.

### Curation Rationale

The dataset was created to enable cross-jurisdictional legal analytics.

### Source Data

#### Initial Data Collection and Normalization

1. Start from the raw dataset `JuDDGES/pl-court-raw`.

2. Identify metadata fields whose values also appear in the text of the judgement. The following fields were selected as extraction targets:

   * `date`
   * `judges`
   * `recorder`
   * `signature`
   * `court_name`
   * `department_name`
   * `legal_bases`

3. Data filtering: to ensure high quality of the dataset, we applied the filtering procedure described below (a simplified sketch of the final check is shown after this list).

   1. Removal of judgements with missing target values: if any target field is missing, the entire judgement is discarded (the information might still be present in the judgement text, in which case the targets would be incorrect).
   2. Cleaning the `judges` field: in some examples the judges' names were concatenated into a single string instead of forming a list of names, so we split them on conjunctions.
   3. Removal of examples whose targets do not appear in the text: due to errors inherent in the acquired data, some targets may be mistyped, so such examples are filtered out. Filtering removes 173297 examples, leaving 240851 judgements.

4. Generating instructions: after cleaning, we generate instructions for information extraction. Specifically, we use the same prompt for every document:

   ````
   You are extracting information from the Polish court judgments.
   Extract specified values strictly from the provided judgement. If information is not provided in the judgement, leave the field with null value.
   Please return the response in the identical YAML format:
   ```yaml
   court_name: "<nazwa sądu, string containing the full name of the court>"
   date: <data, date in format YYYY-MM-DD>
   department_name: "<nazwa wydziału, string containing the full name of the court's department>"
   judges: "<sędziowie, list of judge full names>"
   legal_bases: <podstawy prawne, list of strings containing legal bases (legal regulations)>
   recorder: <protokolant, string containing the name of the recorder>
   signature: <sygnatura, string contraining the signature of the judgment>
   ```
   =====
   {context}
   ======
   ````

   where `{context}` is replaced with the text of each judgement. Note that the judgements are written in Polish; to encourage the model to respond in Polish, the prompt also provides the Polish names of the fields.
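
Below is a simplified, illustrative sketch of the last filtering step (checking that each target value occurs verbatim in the judgement text). The column names (`text`, the target fields) and the helper itself are assumptions for illustration and may differ from the actual implementation in the JuDDGES repository; in particular, matching the `date` target requires format-aware comparison and is omitted here.

```python
from datasets import load_dataset

TEXT_TARGETS = ["recorder", "signature", "court_name", "department_name"]
LIST_TARGETS = ["judges", "legal_bases"]


def targets_present_in_text(record: dict) -> bool:
    # Keep a judgement only if every (non-date) target value occurs verbatim in its text.
    # NOTE: column names are assumed; see JuDDGES/pl-court-raw for the actual schema.
    text = record["text"]
    for field in TEXT_TARGETS:
        if not record[field] or record[field] not in text:
            return False
    for field in LIST_TARGETS:
        if not record[field] or any(item not in text for item in record[field]):
            return False
    return True


raw_ds = load_dataset("JuDDGES/pl-court-raw", split="train")  # split name assumed
filtered = raw_ds.filter(targets_present_in_text)
```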

#### Who are the source language producers?

The texts were produced by human legal professionals (judges and court clerks). Demographics were not analysed. The data was sourced from public court databases.

### Annotations

#### Annotation process

No annotation was performed by us; all features were provided via the source API.

#### Who are the annotators?

As above.

### Personal and Sensitive Information

The data is pseudonymized to comply with the GDPR (Art. 4(5) GDPR).

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

## Statistics

``` {.python .cell-code}
def parse_output(output: str) -> dict:
    """Parse the YAML answer stored in `output` into separate columns."""
    data = yaml.safe_load(output.replace("```yaml", "").replace("```", ""))
    data["date"] = pd.to_datetime(data["date"])
    return data


ds = ds.map(parse_output, input_columns="output", num_proc=20)

# Combine both splits into a single polars frame with a `subset` column.
pl_ds = pl.concat([pl.from_arrow(ds["train"].data.table), pl.from_arrow(ds["test"].data.table)])
pl_ds = pl_ds.with_columns(pl.Series(name="subset", values=["train"] * len(ds["train"]) + ["test"] * len(ds["test"])))
```

``` {.python .cell-code}
# Number of judgements per court.
court_distribution = (
    pl_ds.select(["subset", "court_name"])
    .group_by(["subset", "court_name"])
    .len()
    .sort("len", descending=True)
    .to_pandas()
)

ax = sns.histplot(data=court_distribution, x="len", hue="subset", log_scale=True, kde=True, stat="percent", common_norm=False)
ax.set(title="Distribution of judgments per court", xlabel="#Judgements in single court", ylabel="percent")
plt.show()
```

``` {.python .cell-code}
# Number of judgements per year.
judgements_per_year = pl_ds.select(["subset", "date"])
judgements_per_year = judgements_per_year.with_columns(judgements_per_year["date"].dt.year())
judgements_per_year = judgements_per_year.group_by(["subset", "date"]).len().sort("date").to_pandas()
judgements_per_year["%"] = judgements_per_year.groupby("subset")["len"].transform(lambda x: x / x.sum() * 100)

_, ax = plt.subplots(1, 1, figsize=(10, 5))
ax = sns.pointplot(data=judgements_per_year, x="date", y="%", hue="subset", linestyles="--", ax=ax)
ax.set(xlabel="Year", ylabel="% Judgements", title="Yearly Number of Judgements", yscale="log")
plt.xticks(rotation=90)
plt.show()
```

``` {.python .cell-code}
# Number of judges per judgement.
num_judges = pl_ds.with_columns([pl.col("judges").list.len().alias("num_judges")]).select(["subset", "num_judges"]).to_pandas()

ax = sns.histplot(data=num_judges, x="num_judges", hue="subset", bins=num_judges["num_judges"].nunique(), stat="percent", common_norm=False)
ax.set(xlabel="#Judges per judgement", ylabel="%", title="#Judges per single judgement")
plt.show()
```

``` {.python .cell-code}
# Context length measured in llama-3 tokens.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")


def tokenize(batch: dict[str, list]) -> dict[str, list[int]]:
    tokenized = tokenizer(batch["context"], add_special_tokens=False, return_attention_mask=False, return_token_type_ids=False, return_length=True)
    return {"length": tokenized["length"]}


ds = ds.map(tokenize, batched=True, batch_size=16, remove_columns=["context"], num_proc=20)

context_len_train = ds["train"].to_pandas()
context_len_train["subset"] = "train"
context_len_test = ds["test"].to_pandas()
context_len_test["subset"] = "test"
context_len = pd.concat([context_len_train, context_len_test])

ax = sns.histplot(data=context_len, x="length", bins=50, hue="subset")
ax.set(xlabel="#Tokens", ylabel="Count", title="#Tokens distribution in context (llama-3 tokenizer)", yscale="log")
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: f"{int(x / 1_000)}k"))
plt.show()
```