Dataset preparation

The dataset preparation scripts are located in the scripts/dataset directory and should be run from the root of the repository.

1. Building the dataset

The dataset was downloaded from the open API of Polish Court Judgements. The following procedure downloads the data and stores it in MongoDB. Any script that interacts with the outside environment (storing data in MongoDB or pushing files to the Hugging Face Hub) is run outside DVC. Before downloading, make sure the following environment variables are set in the .env file:

MONGO_URI=<mongo_uri_including_password>
MONGO_DB_NAME="datasets"
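A quick pre-flight check can catch a missing variable before a long download starts. The sketch below is only an illustration: it assumes the values from .env have already been loaded into the process environment (for example by a dotenv loader), and the helper name `missing_env_vars` is hypothetical, not part of the repository.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = ("MONGO_URI", "MONGO_DB_NAME")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to the process environment; passing a plain dict
    makes the check easy to try out without touching real settings.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env_vars()` with no arguments inspects the current environment; an empty list means both MongoDB settings are present.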

Raw dataset

Download judgement metadata - this stores the metadata in the database:

PYTHONPATH=. python scripts/dataset/download_pl_metadata.py \
    --last-update-from <YYYY-MM-DD>

Download judgement text (the XML content of each judgement) - this updates the database records with the content:

PYTHONPATH=. python scripts/dataset/download_pl_additional_data.py \
    --data-type content \
    --n-jobs 10 \
    --last-update-from <YYYY-MM-DD>

Download the additional details available for each judgement - this updates the database records with the acquired details:

PYTHONPATH=. python scripts/dataset/download_pl_additional_data.py \
    --data-type details \
    --n-jobs 10 \
    --last-update-from <YYYY-MM-DD>

Map court and department IDs to their names:

PYTHONPATH=. python scripts/dataset/map_court_dep_id_2_name.py \
    --n-jobs 10 \
    --last-update-from <YYYY-MM-DD>

Remark: The mapping file available at data/datasets/pl/court_id_2_name.csv was prepared from data published at: https://orzeczenia.wroclaw.sa.gov.pl/indices
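The mapping step is essentially a lookup from an ID to a name. The following sketch shows the idea with the standard library only. The column names `court_id` and `court_name`, the sample rows, and both helper functions are assumptions made for illustration; the real layout of court_id_2_name.csv and the logic of map_court_dep_id_2_name.py may differ.

```python
import csv
import io

# Made-up sample rows with hypothetical column names.
SAMPLE_CSV = """court_id,court_name
A1,Example Court A
B2,Example Court B
"""


def load_mapping(csv_file):
    """Read an open CSV file into a {court_id: court_name} dictionary."""
    reader = csv.DictReader(csv_file)
    return {row["court_id"]: row["court_name"] for row in reader}


def add_court_name(judgement, mapping):
    """Return a copy of the record with court_name set (None if the ID is unknown)."""
    return {**judgement, "court_name": mapping.get(judgement["court_id"])}


mapping = load_mapping(io.StringIO(SAMPLE_CSV))
record = add_court_name({"court_id": "A1"}, mapping)
```

Returning `None` for unknown IDs keeps unmapped records visible instead of silently dropping them, which can help when the published index is incomplete.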

Extract the raw text from the XML content, along with judgement details that are not available through the API:

PYTHONPATH=. python scripts/dataset/extract_pl_xml.py \
    --n-jobs 10 \
    --last-update-from <YYYY-MM-DD>
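To give a rough idea of what extracting raw text from XML involves, here is a minimal sketch using the standard library. The sample document is invented and does not follow the real judgement schema, and `xml_to_text` is a hypothetical helper rather than the function used in extract_pl_xml.py.

```python
import xml.etree.ElementTree as ET

# Invented XML for illustration; the real judgement schema is different.
SAMPLE_XML = """
<judgement>
  <title>Example title</title>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</judgement>
"""


def xml_to_text(xml_content):
    """Collect all text nodes in document order, one non-empty piece per line."""
    root = ET.fromstring(xml_content.strip())
    pieces = (piece.strip() for piece in root.itertext())
    return "\n".join(piece for piece in pieces if piece)
```

Here `itertext()` walks every element in document order, so whitespace used only for indentation is stripped and skipped while the actual text is kept.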

For further processing, prepare a local dump of the dataset as a parquet file, version it with DVC, and push it to remote storage:

PYTHONPATH=. python scripts/dataset/dump_pl_dataset.py \
    --file-name data/datasets/pl/raw/raw.parquet \
    --filter-empty-content
dvc add data/datasets/pl/raw && dvc push

Generate the dataset card for pl-court-raw:

dvc repro raw_dataset_readme && dvc push

Upload the pl-court-raw dataset (with its card) to Hugging Face:

PYTHONPATH=. python scripts/dataset/push_raw_dataset.py \
    --repo-id "JuDDGES/pl-court-raw" \
    --commit-message <commit_message>

Instruction dataset

Generate the instruction dataset and upload it to Hugging Face (pl-court-instruct):
```
NUM_JOBS=8 dvc repro build_instruct_dataset
```

Generate the dataset card for pl-court-instruct:

dvc repro instruct_dataset_readme && dvc push

Upload the pl-court-instruct dataset card to Hugging Face:

PYTHONPATH=. python scripts/dataset/push_instruct_readme.py --repo-id JuDDGES/pl-court-instruct

Graph dataset

Embed the judgements with a pre-trained language model (documents are split into chunks and an embedding is computed for each chunk):
```
CUDA_VISIBLE_DEVICES=<device_number> dvc repro embed
```
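To illustrate what chunking a document means here, the sketch below splits a text into fixed-size, overlapping windows. This is only an assumption about the general technique: it works on characters, whereas the actual `embed` stage may well split by tokens, and the function name and default sizes are hypothetical.

```python
def chunk_text(text, chunk_size=512, overlap=64):
    """Split text into windows of chunk_size characters sharing `overlap` characters.

    Character-based splitting is an illustrative simplification.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once a window has reached the end of the text.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of embedding some text twice.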
Aggregate the chunk embeddings into a single embedding per document:
```
NUM_PROC=4 dvc repro embed aggregate_embeddings
```
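One common way to turn chunk embeddings into a document embedding is to average them element-wise. The sketch below shows that approach in plain Python. Mean pooling is an assumption here, since the text does not say which aggregation the `aggregate_embeddings` stage uses, and the function name is hypothetical.

```python
def aggregate_chunks(chunk_embeddings):
    """Average equally sized chunk vectors into one document vector (mean pooling)."""
    if not chunk_embeddings:
        raise ValueError("at least one chunk embedding is required")
    dim = len(chunk_embeddings[0])
    if any(len(vector) != dim for vector in chunk_embeddings):
        raise ValueError("all chunk embeddings must have the same dimension")
    count = len(chunk_embeddings)
    return [sum(vector[i] for vector in chunk_embeddings) / count for i in range(dim)]


# Two 2-dimensional chunk vectors collapse into one document vector.
doc_embedding = aggregate_chunks([[1.0, 2.0], [3.0, 4.0]])
```

Averaging gives every chunk the same weight regardless of its length; a weighted mean would be an alternative if the final chunk of a document is much shorter than the rest.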

Optionally, ingest the data into MongoDB (e.g. for vector search):

PYTHONPATH=. python scripts/embed/ingest.py --embeddings-file <embeddings>

Generate the graph dataset:
```
dvc repro embed build_graph_dataset
```

Generate the dataset card and upload it to Hugging Face (remember to log in to Hugging Face first or set the HUGGING_FACE_HUB_TOKEN environment variable):

PYTHONPATH=. python scripts/dataset/upload_graph_dataset.py \
    --root-dir <dir_to_dataset> \
    --repo-id JuDDGES/pl-court-graph \
    --commit-message <message>