Embed pl-court-raw dataset and ingest to Weaviate
This document describes how to generate text embeddings from the raw documents in the JuDDGES/pl-court-raw dataset and ingest them into a Weaviate database.
Overview
The embedding workflow consists of three main steps:
- Embedding Generation: Chunk raw text, convert chunks into vector embeddings, aggregate embeddings for each judgment
- Embedding Ingestion: Upload embeddings to a Weaviate database
- Ingestion Test: Print the collections and their schemas, and run sample queries
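The generation step (chunk, embed, aggregate) can be sketched in a few lines. This is only an illustration under stated assumptions: the character-window chunking, the mean aggregation, and the names `chunk_text`, `mean_pool`, `embed_judgment`, and `toy_embed` are all hypothetical and are not taken from scripts/embed/embed_text.py, which may chunk and aggregate differently.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into overlapping character windows (sizes are illustrative)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Aggregate chunk vectors into one vector by element-wise mean (an assumption)."""
    if not vectors:
        raise ValueError("cannot aggregate an empty list of vectors")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def embed_judgment(
    text: str,
    embed: Callable[[List[str]], List[Vector]],
    **chunk_kwargs: int,
) -> Tuple[List[str], List[Vector], Vector]:
    """Return the chunks, one vector per chunk, and one aggregated vector."""
    chunks = chunk_text(text, **chunk_kwargs)
    chunk_vectors = embed(chunks)
    return chunks, chunk_vectors, mean_pool(chunk_vectors)


def toy_embed(chunks: List[str]) -> List[Vector]:
    """Stand-in for a real model: [length, number of spaces] per chunk."""
    return [[float(len(c)), float(c.count(" "))] for c in chunks]
```

In a real run, `toy_embed` would be replaced by a model call such as the `encode` method of a Sentence Transformer model; the two outputs correspond to the chunk-level and judgment-level embeddings described above.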
Prerequisites
- Deploy Weaviate instance (see embeddings_deploy_weaviate.md)
- Set up environment variables in the .env file:
WV_HOST=localhost # Weaviate host
WV_PORT=8080 # Weaviate port
WV_GRPC_PORT=8085 # Weaviate gRPC port
WV_API_KEY=<your-key> # Weaviate API key (if applicable)
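As an illustration of how these variables might be consumed, the sketch below parses .env-style text and collects the four `WV_*` values into a settings object. The `parse_dotenv` helper and the `WeaviateSettings` class are hypothetical names introduced here; the project's scripts may load the file differently (for example with a dotenv library).

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, ignoring blank lines and '#' comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "8080 # Weaviate port".
        values[key.strip()] = value.split(" #", 1)[0].strip()
    return values


@dataclass(frozen=True)
class WeaviateSettings:
    host: str
    port: int
    grpc_port: int
    api_key: Optional[str] = None  # only needed if the instance requires a key

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "WeaviateSettings":
        """Build settings from a mapping such as os.environ or parse_dotenv()."""
        return cls(
            host=env["WV_HOST"],
            port=int(env["WV_PORT"]),
            grpc_port=int(env["WV_GRPC_PORT"]),
            api_key=env.get("WV_API_KEY") or None,
        )
```

Assuming the v4 Python client is used, the resulting values map naturally onto `weaviate.connect_to_local(host=..., port=..., grpc_port=...)`, with the API key passed as auth credentials when one is set.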
Step 1: Embed documents
The embedding generation script (scripts/embed/embed_text.py) loads a dataset from the Hugging Face Hub and converts its raw text documents into vector embeddings with a Sentence Transformer model.
Running Embedding Generation
- The full configuration of embedding generation is defined in configs/embedding.yaml.
- To generate embeddings, run the command below with the desired dataset and embedding model.
- The command overrides the Hydra config, so for the embedding model use the name of a config present in configs/embedding_model, and for the dataset use its name on the Hugging Face Hub.
- The output consists of two directories, one with chunk embeddings and one with aggregated embeddings:
  - data/embeddings/<dataset_name>/<embedding_model>/chunk_embeddings
  - data/embeddings/<dataset_name>/<embedding_model>/agg_embeddings
- The script can work with multiple GPUs at once (by default it uses all available GPUs; to restrict it, list the devices in CUDA_VISIBLE_DEVICES).
CUDA_VISIBLE_DEVICES=0 NUM_PROC=10 PYTHONPATH="$PWD:$PYTHONPATH" python scripts/embed/embed_text.py \
embedding_model=mmlw-roberta-large \
dataset_name=JuDDGES/pl-court-raw \
output_dir=data/embeddings/pl-court-raw/mmlw-roberta-large
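The relationship between the command's arguments and the output layout can be written down as a small helper. This is a sketch only: `embedding_dirs` is a hypothetical function, and dropping the organisation prefix from the dataset name is an assumption inferred from the example `output_dir` above, not something the script is known to do automatically.

```python
from pathlib import PurePosixPath
from typing import Dict


def embedding_dirs(
    dataset_name: str,
    embedding_model: str,
    root: str = "data/embeddings",
) -> Dict[str, PurePosixPath]:
    """Return the output directory and its two expected subdirectories."""
    # Assumption: "JuDDGES/pl-court-raw" maps to the directory "pl-court-raw".
    short_name = dataset_name.split("/")[-1]
    base = PurePosixPath(root) / short_name / embedding_model
    return {
        "output_dir": base,
        "chunk_embeddings": base / "chunk_embeddings",
        "agg_embeddings": base / "agg_embeddings",
    }
```

Calling it with the dataset and model from the command above yields the same `output_dir` value that the command passes explicitly.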
Embedding generation can also be run with DVC, which runs embedding (unless the output is already present) for the embedding model and the datasets specified in dvc.yaml:
- JuDDGES/pl-court-raw
- JuDDGES/en-court-raw
Step 2: Ingest embeddings to Weaviate
- To upload the embeddings created in the previous step to a Weaviate database, run the command below with the same embedding_model, dataset_name, and output_dir values used for embedding generation.
- The upload is done in two steps:
  - Upload chunks with their embeddings
  - Upload judgments with their aggregated embeddings (the full dataset with aggregated embeddings is ingested)
- Overrides shown in square brackets ([+ingest_batch_size=64], [+upsert=true]) are optional; omit the brackets when passing them.
PROCESSING_PROC=10 INGEST_PROC=5 PYTHONPATH="$PWD:$PYTHONPATH" python scripts/embed/ingest_to_weaviate.py \
embedding_model=mmlw-roberta-large \
dataset_name=JuDDGES/pl-court-raw \
output_dir=data/embeddings/pl-court-raw/mmlw-roberta-large \
[+ingest_batch_size=64] \
[+upsert=true]
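To make the effect of a batch size and an upsert flag concrete, here is a minimal sketch that ingests records in batches into a plain dictionary standing in for a Weaviate collection. Everything in it is an assumption for illustration: the `batched`, `stable_id`, and `ingest` names, the `key` field, and the use of deterministic UUIDs to make re-ingestion idempotent. The actual semantics of `ingest_batch_size` and `upsert` in scripts/embed/ingest_to_weaviate.py may differ.

```python
import uuid
from typing import Dict, Iterable, Iterator, List


def batched(items: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def stable_id(key: str) -> str:
    """Derive the same UUID from the same key on every run."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


def ingest(
    store: Dict[str, dict],
    records: Iterable[dict],
    batch_size: int = 64,
    upsert: bool = True,
) -> int:
    """Write records batch by batch; return how many objects were written."""
    written = 0
    for batch in batched(records, batch_size):
        for record in batch:
            object_id = stable_id(record["key"])
            if object_id in store and not upsert:
                continue  # keep the existing object untouched
            store[object_id] = record  # insert new or overwrite existing
            written += 1
    return written
```

With deterministic IDs, running the ingestion twice leaves the store at the same size: the second run either overwrites objects (upsert enabled) or skips them (upsert disabled).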
Step 3: Test Weaviate Ingestion
Test the ingestion by running the following command, which will print all collections and their schemas, and run sample queries.
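For readers who want to inspect the database by hand, the checks described above (listing collections with their schemas and running a sample query) might look roughly like this with the v4 Weaviate Python client. This is not the project's test script; the helper names `describe_collections` and `nearest` are hypothetical, and the functions expect an already connected client object.

```python
from typing import Any, Dict, List


def describe_collections(client: Any) -> Dict[str, List[str]]:
    """Map each collection name to the names of its properties."""
    configs = client.collections.list_all()
    return {
        name: [prop.name for prop in config.properties]
        for name, config in sorted(configs.items())
    }


def nearest(
    client: Any,
    collection_name: str,
    query_vector: List[float],
    limit: int = 3,
) -> List[dict]:
    """Return the properties of the objects closest to query_vector."""
    collection = client.collections.get(collection_name)
    response = collection.query.near_vector(near_vector=query_vector, limit=limit)
    return [obj.properties for obj in response.objects]
```

The query vector must come from the same embedding model that produced the stored vectors, otherwise the distances are not meaningful.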
Notes
- The embedding model used locally should match the one configured in the Weaviate database
- For large datasets, consider adjusting the batch size and the number of processes (NUM_PROC, PROCESSING_PROC, and INGEST_PROC in the commands above)
- The chunking process can be customized through the configuration to suit the characteristics of your documents
- The code was adjusted to be memory-efficient (it uses Hugging Face datasets and Polars LazyFrame)