```python
import datasets
import transformers
import warnings
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import pandas as pd
import polars as pl
import seaborn as sns
import yaml
from datasets import load_dataset
from transformers import AutoTokenizer
from IPython.display import display

warnings.filterwarnings('ignore')
sns.set_theme("notebook")
transformers.logging.set_verbosity_error()
datasets.logging.set_verbosity_error()
datasets.utils.disable_progress_bars()
```
# Dataset Card for JuDDGES/pl-court-instruct
```python
ds = load_dataset("JuDDGES/pl-court-instruct")
display(ds["train"][0])
```
### Data Fields
| Feature name | Feature description | Type |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------------|------------|
| _id | Unique identifier of the judgement | `string` |
| prompt | The prompt template provided for extracting information from the judgement. It contains placeholder `{context}` for the judgement content. | `string` |
| context | The full text content of the judgement | `string` |
| output | The extracted information in YAML format based on the provided context | `string` |
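As a quick orientation, the sketch below shows how a single example could be turned into a complete model input and how its reference output can be parsed back into a Python dict. It reuses the `ds` object loaded above; the assumption that `output` may be wrapped in a YAML code fence is taken from the parsing code in the Statistics section further down.

```python
import yaml

# Minimal usage sketch (assumes `ds` from the loading cell above).
example = ds["train"][0]

# Fill the {context} placeholder to obtain the full model input.
model_input = example["prompt"].replace("{context}", example["context"])

# Parse the reference output (YAML, possibly wrapped in a code fence) into a dict.
reference = yaml.safe_load(example["output"].replace("```yaml", "").replace("```", ""))
print(sorted(reference.keys()))
```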
### Data Splits
```python
data = []
for split in ds.keys():
    data.append({"split": split, "# samples": len(ds[split])})

df = pd.DataFrame(data)
df["% samples"] = (df["# samples"] / df["# samples"].sum() * 100).round(2)
# print(df.to_markdown(index=False))
```
| split | # samples | % samples |
|-------|-----------|-----------|
| train | 238851    | 99.17     |
| test  | 2000      | 0.83      |
## Dataset Creation

For details on the dataset creation, see the paper (TBA) and the accompanying code repository.
### Curation Rationale
Created to enable cross-jurisdictional legal analytics.
### Source Data

#### Initial Data Collection and Normalization

We utilize the raw dataset `JuDDGES/pl-court-raw`. First, we identified metadata fields whose values are also contained in the text of the judgement. The following fields were therefore selected as extraction targets:

- `date`
- `judges`
- `recorder`
- `signature`
- `court_name`
- `department_name`
- `legal_bases`
**Data filtering:** To ensure high quality of the dataset, we performed the filtering procedure described below (an illustrative sketch follows the list).

- Removal of judgements with missing target values: if any target field has a missing value, the entire judgement is discarded (the information might still be contained in the judgement text, in which case the targets would be incorrect).
- Cleaning of the `judges` field: in some examples, the names of the judges were concatenated into a single string instead of forming a list of names, so we split them on conjunctions.
- Removal of examples whose targets do not appear in the text: due to inherent errors in the acquired data, some targets might be mistyped, hence we filter them out.

Data cleaning removes 173297 examples, and the resulting dataset consists of 240851 judgements.
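The sketch below illustrates the filtering rules described above. The target field names follow the list given earlier, but the record layout, the `text` field, and the conjunction used for splitting judge names are assumptions for illustration, not the exact implementation.

```python
# Illustrative sketch of the filtering rules (not the exact implementation).
TARGETS = ["date", "judges", "recorder", "signature", "court_name", "department_name", "legal_bases"]


def split_judges(judges: str) -> list[str]:
    # Split concatenated judge names on the Polish conjunction "i" (assumption).
    return [name.strip() for name in judges.split(" i ")]


def keep_judgement(record: dict) -> bool:
    # 1) Discard judgements with any missing target value.
    if any(record.get(field) in (None, "", []) for field in TARGETS):
        return False
    # 2) Discard judgements whose string targets do not occur in the text
    #    (only simple string fields are checked here, for brevity).
    for field in ("recorder", "signature", "court_name", "department_name"):
        if record[field] not in record["text"]:
            return False
    return True
```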
**Generating instructions:** After cleaning, we generate instructions for information extraction. Specifically, we use the same prompt for every document:

```
You are extracting information from the Polish court judgments.
Extract specified values strictly from the provided judgement. If information is not provided in the judgement, leave the field with null value.
Please return the response in the identical YAML format:
'''yaml
court_name: "<nazwa sądu, string containing the full name of the court>"
date: <data, date in format YYYY-MM-DD>
department_name: "<nazwa wydziału, string containing the full name of the court's department>"
judges: "<sędziowie, list of judge full names>"
legal_bases: <podstawy prawne, list of strings containing legal bases (legal regulations)>
recorder: <protokolant, string containing the name of the recorder>
signature: <sygnatura, string contraining the signature of the judgment>
'''
=====
{context}
======
```

where `{context}` is replaced by the text of each judgement. We highlight that the judgements are in Polish, hence, to encourage the model to respond in Polish, we provide the Polish names of the fields in the prompt.
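A hypothetical sketch of this instruction-generation step is shown below. The function name, record layout, and `text` field are illustrative assumptions; the YAML code fence around the output is inferred from the parsing code in the Statistics section.

```python
import yaml


def build_example(record: dict, prompt_template: str) -> dict:
    # Hypothetical sketch of producing one instruction example from a cleaned record.
    target_fields = ["court_name", "date", "department_name", "judges", "legal_bases", "recorder", "signature"]
    targets = {field: record[field] for field in target_fields}
    return {
        "_id": record["_id"],
        "prompt": prompt_template,  # kept with the {context} placeholder unfilled
        "context": record["text"],  # full judgement text (field name assumed)
        # targets serialized as YAML, wrapped in a code fence (see the parsing code below)
        "output": "```yaml\n" + yaml.dump(targets, allow_unicode=True, sort_keys=False) + "```",
    }
```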
#### Who are the source language producers?

The judgements were produced by human legal professionals (judges and court clerks) and sourced from public court databases. The demographics of the producers were not analysed.
### Annotations

#### Annotation process

No manual annotation was performed by us; all features were provided via the source API.

#### Who are the annotators?

As above.
### Personal and Sensitive Information

The data is pseudonymised to comply with GDPR (Art. 4(5) GDPR).
## Considerations for Using the Data

### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information

### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
[More Information Needed]
## Statistics
```python
# Inspect the structure of a single parsed output.
data = yaml.safe_load(ds["train"]["output"][0].replace("```yaml", "").replace("```", ""))
data["date"] = pd.to_datetime(data["date"])


def parse_output(output: str) -> dict:
    # Strip the YAML code fence and parse the extracted fields into columns.
    data = yaml.safe_load(output.replace("```yaml", "").replace("```", ""))
    data["date"] = pd.to_datetime(data["date"])
    return data


ds = ds.map(parse_output, input_columns="output", num_proc=20)
```
```python
# Combine both splits into a single polars frame and tag each row with its subset.
pl_ds = pl.concat([pl.from_arrow(ds["train"].data.table), pl.from_arrow(ds["test"].data.table)])
pl_ds = pl_ds.with_columns(pl.Series(name="subset", values=["train"] * len(ds["train"]) + ["test"] * len(ds["test"])))

# Number of judgements per court, per subset.
court_distribution = pl_ds.select(["subset", "court_name"]).group_by(["subset", "court_name"]).len().sort("len", descending=True).to_pandas()

ax = sns.histplot(data=court_distribution, x="len", hue="subset", log_scale=True, kde=True, stat="percent", common_norm=False)
ax.set(title="Distribution of judgments per court", xlabel="#Judgements in single court", ylabel="percent")
plt.show()
```
= pl_ds.select(["subset", "date"])[["subset", "date"]]
judgements_per_year = judgements_per_year.with_columns(judgements_per_year["date"].dt.year())
judgements_per_year = judgements_per_year.group_by(["subset", "date"]).len().sort("date")
judgements_per_year = judgements_per_year.to_pandas()
judgements_per_year "%"] = judgements_per_year.groupby("subset")["len"].transform(lambda x: x / x.sum() * 100)
judgements_per_year[
= plt.subplots(1, 1, figsize=(10, 5))
_, ax = sns.pointplot(data=judgements_per_year, x="date", y="%", hue="subset", linestyles="--", ax=ax)
ax set(xlabel="Year", ylabel="% Judgements", title="Yearly Number of Judgements", yscale="log")
ax.=90)
plt.xticks(rotation plt.show()
```python
num_judges = pl_ds.with_columns([pl.col("judges").list.len().alias("num_judges")]).select(["subset", "num_judges"]).to_pandas()

ax = sns.histplot(data=num_judges, x="num_judges", hue="subset", bins=num_judges["num_judges"].nunique(), stat="percent", common_norm=False)
ax.set(xlabel="#Judges per judgement", ylabel="%", title="#Judges per single judgement")
plt.show()
```
```python
# Measure context length in tokens with the Llama-3 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")


def tokenize(batch: dict[str, list]) -> dict[str, list[int]]:
    tokenized = tokenizer(batch["context"], add_special_tokens=False, return_attention_mask=False, return_token_type_ids=False, return_length=True)
    return {"length": tokenized["length"]}


ds = ds.map(tokenize, batched=True, batch_size=16, remove_columns=["context"], num_proc=20)
```
= ds["train"].to_pandas()
context_len_train "subset"] = "train"
context_len_train[= ds["test"].to_pandas()
context_len_test "subset"] = "test"
context_len_test[= pd.concat([context_len_train, context_len_test])
context_len
= sns.histplot(data=context_len, x="length", bins=50, hue="subset")
ax set(xlabel="#Tokens", ylabel="Count", title="#Tokens distribution in context (llama-3 tokenizer)", yscale="log")
ax.lambda x, pos: f'{int(x/1_000)}k'))
ax.xaxis.set_major_formatter(ticker.FuncFormatter( plt.show()
### Social Impact of Dataset
[More Information Needed]