Dataset preparation¶
Scripts for dataset preparation are located in the dataset directory and should be run from the root
of the repository.
1. Building the dataset¶
The dataset was downloaded from the open API of Polish Court Judgements.
The following procedure downloads the data and stores it in MongoDB. Any script that interacts with the outside environment (storing data in MongoDB or pushing files to huggingface-hub) is run outside dvc.
Before downloading, make sure the required environment variable is set in the .env file:
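A pre-flight check along the following lines can catch a missing setting before any download starts. This is only a minimal sketch using the standard library: the variable name MONGO_URI and the helper names parse_env_file and missing_vars are hypothetical placeholders, not names taken from this repository.

```python
import os
from pathlib import Path

# Hypothetical variable name; replace it with whatever the project actually requires.
REQUIRED_VARS = ["MONGO_URI"]


def parse_env_file(path):
    """Parse simple KEY=VALUE lines from a .env file, skipping blanks and comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(env_file=".env", required=REQUIRED_VARS):
    """Return required names set neither in the process environment nor in the .env file."""
    file_values = parse_env_file(env_file)
    return [
        name
        for name in required
        if not (os.environ.get(name) or file_values.get(name))
    ]
```

Calling missing_vars() from the repository root returns an empty list when everything is in place, so a script could abort early if the list is non-empty.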
Raw dataset¶
- Download judgements metadata - this will store metadata in the database:
- Download judgements text (XML content of judgements) - this will update the database with the content:
- Download additional details available for each judgement - this will update the database with the acquired details:
- Map IDs of courts and departments to court names:
  Remark: The mapping file available at data/datasets/pl/court_id_2_name.csv was prepared based
  on data published at: https://orzeczenia.wroclaw.sa.gov.pl/indices
- Extract raw text from the XML content, along with details of judgments not available through the API:
- For further processing, prepare a local dataset dump in a parquet file, version it with dvc, and push it to remote storage:
- Generate the dataset card for pl-court-raw
- Upload the pl-court-raw dataset (with card) to huggingface:
    PYTHONPATH=. python scripts/dataset/push_raw_dataset.py --repo-id "JuDDGES/pl-court-raw" --commit-message <commit_message>
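The court-mapping and text-extraction steps above might look roughly like the sketch below. It is not the repository's implementation: the XML layout, the CSV column names court_id and court_name, and the sample values are all assumptions made for illustration, and only the standard library is used.

```python
import csv
import io
import xml.etree.ElementTree as ET


def extract_text(xml_content):
    """Return the text of an XML document with all whitespace runs collapsed."""
    root = ET.fromstring(xml_content)
    # itertext() yields the text fragments of every element in document order.
    return " ".join(" ".join(root.itertext()).split())


def load_court_names(csv_file):
    """Build an id -> name lookup; the column names here are hypothetical."""
    return {row["court_id"]: row["court_name"] for row in csv.DictReader(csv_file)}


# Toy inputs standing in for one judgment's XML and for the mapping file.
sample_xml = (
    "<judgement><title>Wyrok</title>"
    "<body><p>Sąd  orzeka:</p><p>oddala powództwo.</p></body></judgement>"
)
sample_csv = io.StringIO("court_id,court_name\n0001,Example Regional Court\n")

text = extract_text(sample_xml)
court_names = load_court_names(sample_csv)

record = {"court_id": "0001", "text": text}
# Unknown IDs map to None instead of raising, so gaps in the mapping stay visible.
record["court_name"] = court_names.get(record["court_id"])
print(record)
```

Returning None for unmapped court IDs is a design choice of this sketch; a real pipeline may prefer to fail loudly instead.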
Instruction dataset¶
- Generate the instruction dataset and upload it to huggingface (pl-court-instruct)
- Generate the dataset card for pl-court-instruct
- Upload the pl-court-instruct dataset card to huggingface
Graph dataset¶
- Embed judgments with a pre-trained language model (documents are chunked and embeddings are computed per chunk)
- Aggregate the chunk embeddings into document embeddings
- Optionally, ingest the data into mongodb (e.g. for vector search)
- Generate the graph dataset
- Generate the dataset card and upload it to huggingface (remember to be logged in to huggingface or set the HUGGING_FACE_HUB_TOKEN env variable)
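The chunk-then-aggregate idea from the first two steps can be illustrated with a small standard-library sketch. Everything here is an assumption for illustration: word-window chunking, the toy_embed stand-in (a real run would call a pre-trained language model), and mean pooling as the aggregation, which the text above does not specify.

```python
def chunk_text(text, chunk_size=5, overlap=1):
    """Split text into overlapping windows of words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(words):
            break
    return chunks


def toy_embed(chunk, dim=4):
    """Stand-in for a pre-trained model: a deterministic vector built from character codes."""
    vector = [0.0] * dim
    for i, ch in enumerate(chunk):
        vector[i % dim] += ord(ch)
    return [v / len(chunk) for v in vector]


def mean_pool(chunk_embeddings):
    """Average per-chunk vectors component-wise into a single document vector."""
    if not chunk_embeddings:
        raise ValueError("at least one chunk embedding is required")
    count = len(chunk_embeddings)
    return [sum(column) / count for column in zip(*chunk_embeddings)]


document = "Sąd oddala powództwo i zasądza koszty postępowania od powoda na rzecz pozwanego"
chunks = chunk_text(document)
doc_embedding = mean_pool([toy_embed(c) for c in chunks])
print(len(chunks), len(doc_embedding))
```

Mean pooling weights every chunk equally; weighting by chunk length or using another pooling scheme would be equally valid alternatives under these assumptions.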