UMAP 2D Visualization Coordinates¶
This document describes how to calculate and update UMAP 2D coordinates for documents and chunks stored in Weaviate for visualization purposes.
Overview¶
The UMAP (Uniform Manifold Approximation and Projection) algorithm reduces high-dimensional embeddings to 2D coordinates suitable for visualization. This allows you to:
- Visualize document clusters and relationships
- Explore semantic similarity in 2D space
- Identify patterns and groupings in legal documents
- Create interactive visualizations for exploration
Prerequisites¶
- Install dependencies.
- Generated embeddings in parquet format:
  - Run the embedding pipeline to create the chunk_embeddings and agg_embeddings directories
  - These should be in a path like data/embeddings/{dataset-name}/
- Weaviate running (for updating coordinates).
Two-Step Workflow (Recommended)¶
The recommended approach is to calculate and save coordinates first, then update Weaviate separately. This allows you to:
- Review coordinates before updating
- Reuse coordinates for multiple updates
- Separate computation from database updates
Step 1: Calculate and Save Coordinates¶
Calculate UMAP coordinates from embeddings and save to parquet files:
# Calculate for both collections and save
docker compose run --rm web python scripts/embed/calculate_umap_coords.py \
--embeddings-dir data/embeddings/pl-court-raw-sample \
--output-dir umap_coords \
--skip-update
# This creates:
# - umap_coords/LegalDocuments_coords.parquet
# - umap_coords/DocumentChunks_coords.parquet
Each parquet file contains three columns:
- uuid: The deterministic UUID for the document/chunk
- x: The x coordinate from UMAP
- y: The y coordinate from UMAP
Step 2: Update Weaviate¶
Load the saved coordinates and update Weaviate:
# Update LegalDocuments collection
docker compose run --rm web python scripts/embed/calculate_umap_coords.py \
--load-from umap_coords/LegalDocuments_coords.parquet \
--collection LegalDocuments
# Update DocumentChunks collection
docker compose run --rm web python scripts/embed/calculate_umap_coords.py \
--load-from umap_coords/DocumentChunks_coords.parquet \
--collection DocumentChunks
One-Step Workflow (Alternative)¶
Calculate and update in a single step:
# Calculate and update both collections
docker compose run --rm web python scripts/embed/calculate_umap_coords.py \
--embeddings-dir data/embeddings/pl-court-raw-sample
# Calculate and update specific collection
docker compose run --rm web python scripts/embed/calculate_umap_coords.py \
--embeddings-dir data/embeddings/pl-court-raw-sample \
--collection LegalDocuments
Additional Options¶
Adjust UMAP Parameters¶
# More local structure (smaller neighborhoods)
docker compose run --rm web python scripts/embed/calculate_umap_coords.py \
--embeddings-dir data/embeddings/pl-court-raw-sample \
--output-dir umap_coords \
--skip-update \
--n-neighbors 5 --min-dist 0.01
# More global structure (larger neighborhoods)
docker compose run --rm web python scripts/embed/calculate_umap_coords.py \
--embeddings-dir data/embeddings/pl-court-raw-sample \
--output-dir umap_coords \
--skip-update \
--n-neighbors 50 --min-dist 0.5
Dry Run Mode¶
Preview changes without updating Weaviate:
docker compose run --rm web python scripts/embed/calculate_umap_coords.py \
--load-from umap_coords/LegalDocuments_coords.parquet \
--collection LegalDocuments \
--dry-run
Custom Batch Size¶
# Larger batches for faster processing (if memory allows)
docker compose run --rm web python scripts/embed/calculate_umap_coords.py \
--load-from umap_coords/LegalDocuments_coords.parquet \
--collection LegalDocuments \
--batch-size 1000
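The effect of the batch size can be pictured with a small batching helper. This is only a sketch of the general pattern: the helper name `iter_batches` is hypothetical, and the script's actual update loop may be organized differently.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int = 500) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 1,200 coordinate rows with the default batch size give three batches.
rows = list(range(1200))
batch_lengths = [len(batch) for batch in iter_batches(rows)]
print(batch_lengths)  # [500, 500, 200]
```

Larger batches mean fewer round trips to the database, at the cost of holding more rows in memory per request.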
Command Line Arguments¶
| Argument | Type | Default | Description |
|---|---|---|---|
| --embeddings-dir | str | - | Path to directory containing embedding parquet files (required for calculation mode) |
| --load-from | str | - | Load previously calculated coordinates from parquet file (alternative to --embeddings-dir) |
| --output-dir | str | - | Directory to save calculated coordinates (creates {collection}_coords.parquet files) |
| --collection | str | both | Collection to process: LegalDocuments, DocumentChunks, or both |
| --skip-update | flag | False | Only calculate and save coordinates, don't update Weaviate |
| --batch-size | int | 500 | Number of documents to update per batch |
| --n-neighbors | int | 15 | UMAP n_neighbors parameter (controls local vs global structure) |
| --min-dist | float | 0.1 | UMAP min_dist parameter (controls point spacing) |
| --dry-run | flag | False | Preview changes without updating Weaviate |
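For readers who want to see how the table maps onto code, the arguments could be declared with argparse roughly as follows. This is an illustrative sketch built from the table, not the script's actual parser. The function name `build_parser` is hypothetical, and making --embeddings-dir and --load-from mutually exclusive is an assumption based on the "alternative to" wording.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options with the defaults listed in the table."""
    parser = argparse.ArgumentParser(description="Calculate and update UMAP 2D coordinates")
    # Assumption: exactly one input source is given per run.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--embeddings-dir", type=str)
    source.add_argument("--load-from", type=str)
    parser.add_argument("--output-dir", type=str, default=None)
    parser.add_argument(
        "--collection",
        type=str,
        default="both",
        choices=["LegalDocuments", "DocumentChunks", "both"],
    )
    parser.add_argument("--skip-update", action="store_true")
    parser.add_argument("--batch-size", type=int, default=500)
    parser.add_argument("--n-neighbors", type=int, default=15)
    parser.add_argument("--min-dist", type=float, default=0.1)
    parser.add_argument("--dry-run", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--embeddings-dir", "data/embeddings/pl-court-raw-sample", "--skip-update"]
)
print(args.collection, args.batch_size, args.n_neighbors, args.min_dist, args.skip_update)
# both 500 15 0.1 True
```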
How It Works¶
Mode 1: Calculate from Embeddings¶
The script performs these phases:
- Load Embeddings from Parquet Files
  - Loads embeddings from chunk_embeddings/ or agg_embeddings/ subdirectories
  - Generates deterministic UUIDs based on document IDs (and chunk IDs for chunks)
  - Extracts embedding vectors from the parquet files
- Normalize Vectors
  - All vectors are L2-normalized before UMAP computation
  - Ensures consistent cosine distance interpretation
  - Better UMAP clustering results
- Compute UMAP Coordinates
from umap import UMAP
umap_model = UMAP(
    n_neighbors=15,    # Controls local vs global structure
    min_dist=0.1,      # Controls point spacing
    metric='cosine',   # Appropriate for normalized embeddings
    random_state=42,   # For reproducibility
    n_components=2     # 2D output
)
coordinates = umap_model.fit_transform(normalized_embeddings)
- Save Coordinates (Optional)
  - If --output-dir is specified, saves to {collection}_coords.parquet
  - Format: uuid, x, y columns
- Update Weaviate (Optional)
  - If --skip-update is not specified, updates documents in Weaviate
  - Only updates the x and y properties
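The normalization phase above can be sketched in a few lines. This is a dependency-free illustration with a hypothetical helper name, `l2_normalize`. The script itself most likely works on whole arrays at once, and returning zero vectors unchanged is an assumption made here to avoid division by zero.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)
    return [v / norm for v in vector]


unit = l2_normalize([3.0, 4.0])
print(unit)  # [0.6, 0.8]
```

After this step every vector has length 1, so only its direction influences the distances UMAP sees.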
Mode 2: Load from Saved Coordinates¶
- Load Coordinates
  - Reads parquet file with uuid, x, y columns
  - Validates file exists and contains required columns
- Update Weaviate
  - Updates documents with loaded coordinates
  - Uses deterministic UUIDs to match documents
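Matching by deterministic UUID works because the same identifier always maps to the same UUID. A minimal sketch of that idea with the standard library follows. The namespace, the key format, and the function name `deterministic_uuid` are assumptions for illustration, so the UUIDs produced here will not match the ones the project generates.

```python
import uuid
from typing import Optional

# Assumption: a fixed namespace chosen for illustration; the project may use another.
EXAMPLE_NAMESPACE = uuid.NAMESPACE_DNS


def deterministic_uuid(document_id: str, chunk_id: Optional[int] = None) -> str:
    """Derive a stable UUID from a document ID and, for chunks, a chunk ID."""
    key = document_id if chunk_id is None else f"{document_id}_chunk_{chunk_id}"
    return str(uuid.uuid5(EXAMPLE_NAMESPACE, key))


first = deterministic_uuid("doc-001")
second = deterministic_uuid("doc-001")
chunk = deterministic_uuid("doc-001", chunk_id=0)
print(first == second)  # True
print(first == chunk)   # False
```

Because the UUID is a pure function of the IDs, coordinates saved in one run can be matched to objects ingested in another.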
UMAP Parameter Tuning¶
n_neighbors¶
Controls the balance between local and global structure:
- Small values (5-10): Focus on local structure, tighter clusters
- Medium values (15-30): Balanced view (recommended)
- Large values (50+): Focus on global structure, broader patterns
min_dist¶
Controls how tightly UMAP packs points:
- Small values (0.01-0.05): Tight clusters, minimal spacing
- Medium values (0.1-0.3): Balanced spacing (recommended)
- Large values (0.5+): Looser clusters, more spread out
metric¶
We use cosine distance because:
- Embeddings represent semantic meaning
- On L2-normalized vectors, cosine distance ranks neighbors the same way as Euclidean distance, so vector length plays no role
- Works well for high-dimensional text embeddings
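The relationship between cosine and Euclidean distance on unit vectors can be checked numerically. The sketch below uses hypothetical helper names and small made-up vectors. It verifies that for L2-normalized inputs the squared Euclidean distance equals twice the cosine distance.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def squared_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = l2_normalize([1.0, 2.0, 3.0])
b = l2_normalize([-2.0, 0.5, 4.0])
# For unit vectors: ||a - b||^2 == 2 * (1 - cos(a, b))
identity_holds = math.isclose(
    squared_euclidean(a, b), 2.0 * cosine_distance(a, b), rel_tol=1e-9
)
print(identity_holds)  # True
```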
Weaviate Schema¶
The script updates these properties in both collections:
# In LegalDocuments and DocumentChunks collections
wvcc.Property(
    name="x",
    data_type=wvcc.DataType.NUMBER,
    description="X coordinate for visualization",
    skip_vectorization=True,
),
wvcc.Property(
    name="y",
    data_type=wvcc.DataType.NUMBER,
    description="Y coordinate for visualization",
    skip_vectorization=True,
),
Example Workflow¶
Complete workflow from embedding generation to visualization:
# 1. Generate embeddings (using DVC or embedding scripts)
dvc repro embed
# 2. Ingest documents to Weaviate (if not already done)
docker compose run --rm web python scripts/embed/simple_ingest.py \
configs/datasets/JuDDGES_pl-court-raw-sample.yaml
# 3. Calculate UMAP coordinates and save
docker compose run --rm web python scripts/embed/calculate_umap_coords.py \
--embeddings-dir data/embeddings/pl-court-raw-sample \
--output-dir umap_coords \
--skip-update
# 4. Review saved coordinates (optional)
python -c "
import polars as pl
df = pl.read_parquet('umap_coords/LegalDocuments_coords.parquet')
print(df.head())
print(f'X range: [{df[\"x\"].min():.2f}, {df[\"x\"].max():.2f}]')
print(f'Y range: [{df[\"y\"].min():.2f}, {df[\"y\"].max():.2f}]')
"
# 5. Update Weaviate with saved coordinates
docker compose run --rm web python scripts/embed/calculate_umap_coords.py \
--load-from umap_coords/LegalDocuments_coords.parquet \
--collection LegalDocuments
docker compose run --rm web python scripts/embed/calculate_umap_coords.py \
--load-from umap_coords/DocumentChunks_coords.parquet \
--collection DocumentChunks
# 6. Query documents with coordinates
docker compose run --rm web python -c "
from juddges.data.documents_weaviate_db import WeaviateLegalDocumentsDatabase
with WeaviateLegalDocumentsDatabase() as db:
    collection = db.legal_documents_collection
    # Fetch first 10 documents with coordinates
    response = collection.query.fetch_objects(
        limit=10,
        return_properties=['document_id', 'title', 'x', 'y']
    )
    for obj in response.objects:
        props = obj.properties
        print(f'{props[\"title\"][:50]}: ({props[\"x\"]:.2f}, {props[\"y\"]:.2f})')
"
Performance Considerations¶
Memory Usage¶
- All embedding vectors are loaded into memory at once
- For large collections (>100k documents), ensure sufficient RAM
- Embedding dimension × number of documents × bytes per value = memory required for the raw vectors
- Example: 768 dims × 100,000 docs × 4 bytes = ~307 MB
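The arithmetic above is easy to reproduce for other collection sizes. The helper name `embedding_memory_bytes` is hypothetical, and the estimate covers only the raw vectors. UMAP allocates additional working memory on top of this.

```python
def embedding_memory_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Estimate the memory needed to hold the raw embedding matrix."""
    return num_vectors * dim * bytes_per_value


size = embedding_memory_bytes(100_000, 768)
print(f"{size / 1e6:.1f} MB")  # 307.2 MB
```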
Processing Time¶
Approximate times (on standard CPU):
| Documents | Load Embeddings | UMAP | Save/Update | Total |
|---|---|---|---|---|
| 10,000 | ~30 sec | ~2 min | ~30 sec | ~3 min |
| 50,000 | ~2 min | ~10 min | ~2 min | ~14 min |
| 100,000 | ~4 min | ~30 min | ~4 min | ~38 min |
Optimization Tips¶
- Use two-step workflow: Calculate once, update multiple times if needed
- Process one collection at a time: Use --collection to avoid memory issues
- Save coordinates: Use --output-dir to preserve calculations
- Adjust batch size: Default (500) is optimized for most cases
- Run in Docker: Ensures consistent environment and resource allocation
Troubleshooting¶
No embeddings extracted¶
Problem: Script reports "No embeddings extracted"
Solutions:
- Verify embeddings directory exists and contains parquet files
- Check that chunk_embeddings/ or agg_embeddings/ subdirectories exist
- Ensure embedding generation was successful (check DVC pipeline or embedding scripts)
- Verify parquet files contain embedding or embeddings field
Memory errors¶
Problem: Out of memory during UMAP computation
Solutions:
- Process collections separately: --collection LegalDocuments
- Use --skip-update to save coordinates without immediate Weaviate update
- Increase Docker memory limit
- Consider processing in smaller batches (requires script modification)
Update failures¶
Problem: Some documents fail to update
Solutions:
- Check Weaviate logs for errors
- Verify UUIDs in saved coordinates match documents in Weaviate
- Ensure sufficient Weaviate resources
- Check network connectivity
- Use --dry-run to preview before actual update
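To check whether the UUIDs in a saved coordinates file match the objects in Weaviate, a plain set comparison is usually enough. The sketch below assumes both UUID lists have already been collected. How they are read from the parquet file and from Weaviate is left out, and the function name `compare_uuids` is hypothetical.

```python
from typing import Iterable, List, Tuple


def compare_uuids(saved: Iterable[str], stored: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (UUIDs only in the coordinates file, UUIDs only in Weaviate), each sorted."""
    saved_set, stored_set = set(saved), set(stored)
    return sorted(saved_set - stored_set), sorted(stored_set - saved_set)


# Placeholder identifiers; real values would be UUID strings.
saved_uuids = ["a", "b", "c"]
stored_uuids = ["b", "c", "d"]
only_in_file, only_in_weaviate = compare_uuids(saved_uuids, stored_uuids)
print(only_in_file)      # ['a']
print(only_in_weaviate)  # ['d']
```

Entries in the first list are likely to fail on update, since no object with that UUID exists. Entries in the second list will simply keep their old coordinates.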
Coordinates file not found¶
Problem: Cannot load coordinates from saved file
Solutions:
- Verify the file path is correct
- Ensure Step 1 (calculation) completed successfully
- Check that the parquet file contains uuid, x, y columns
- Verify file permissions
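The first and third checks can be scripted. This sketch uses hypothetical helper names and takes the column names as an argument, so it does not depend on a particular parquet reader.

```python
from pathlib import Path
from typing import Iterable, List

REQUIRED_COLUMNS = ("uuid", "x", "y")


def missing_columns(columns: Iterable[str]) -> List[str]:
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


def coords_file_exists(path: str) -> bool:
    """True if `path` points to an existing regular file."""
    return Path(path).is_file()


print(missing_columns(["uuid", "x"]))                # ['y']
print(missing_columns(["uuid", "x", "y", "title"]))  # []
```

With polars, the column names of a loaded file are available as `df.columns` after `pl.read_parquet(path)`.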
Inconsistent coordinates¶
Problem: Running calculation multiple times gives different coordinates
Solutions:
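- Some variation is expected: UMAP is a stochastic algorithm
- The script uses random_state=42, which should make repeated runs on the same embeddings, parameters, and library versions produce the same layout
- Changes to the data, parameters, or environment can still shift the layout, because coordinates are relative, not absolute
- Save coordinates after calculation to reuse the same projection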
Visualization Examples¶
After calculating coordinates, you can visualize them using various tools:
Python with Matplotlib¶
import matplotlib.pyplot as plt
import pandas as pd
from juddges.data.documents_weaviate_db import WeaviateLegalDocumentsDatabase
with WeaviateLegalDocumentsDatabase() as db:
    collection = db.legal_documents_collection
    response = collection.query.fetch_objects(
        limit=1000,
        return_properties=['document_type', 'x', 'y']
    )
# Extract coordinates and types, skipping documents without coordinates
points = [(obj.properties['x'], obj.properties['y'], obj.properties['document_type'])
          for obj in response.objects if obj.properties.get('x') is not None]
x = [p[0] for p in points]
y = [p[1] for p in points]
types = [p[2] for p in points]
# Plot, coloring each point by an integer code for its document type
plt.figure(figsize=(12, 8))
plt.scatter(x, y, c=pd.Categorical(types).codes, alpha=0.5, cmap='tab10')
plt.xlabel('UMAP Dimension 1')
plt.ylabel('UMAP Dimension 2')
plt.title('Legal Documents - UMAP Projection')
plt.colorbar(label='Document Type')
plt.savefig('umap_visualization.png', dpi=300, bbox_inches='tight')
Interactive with Plotly¶
import plotly.express as px
import pandas as pd
from juddges.data.documents_weaviate_db import WeaviateLegalDocumentsDatabase
with WeaviateLegalDocumentsDatabase() as db:
    collection = db.legal_documents_collection
    response = collection.query.fetch_objects(
        limit=5000,
        return_properties=['document_id', 'title', 'document_type', 'x', 'y']
    )
# Convert to DataFrame, skipping documents without coordinates
data = []
for obj in response.objects:
    props = obj.properties
    if props.get('x') is not None:
        data.append({
            'x': props['x'],
            'y': props['y'],
            'type': props['document_type'],
            'title': props['title'][:100],
            'id': props['document_id']
        })
df = pd.DataFrame(data)
# Create interactive plot
fig = px.scatter(
    df, x='x', y='y', color='type',
    hover_data=['title', 'id'],
    title='Legal Documents UMAP Visualization'
)
fig.update_traces(marker=dict(size=5, opacity=0.6))
fig.write_html('umap_interactive.html')
Analyze Saved Coordinates¶
import polars as pl
# Load saved coordinates
df = pl.read_parquet('umap_coords/LegalDocuments_coords.parquet')
# Basic statistics
print(f"Total coordinates: {len(df)}")
print(f"X range: [{df['x'].min():.2f}, {df['x'].max():.2f}]")
print(f"Y range: [{df['y'].min():.2f}, {df['y'].max():.2f}]")
# Preview
print(df.head())
Advanced Topics¶
Incremental Updates¶
For new documents, you have several options:
- Recalculate all coordinates: Ensures consistent projection but requires full recalculation
- Calculate separately: Create separate UMAP for new batch (coordinates won't align with existing)
- Save UMAP model: Use a saved UMAP model to transform new documents (requires script modification)
Custom Coordinate Scaling¶
Normalize coordinates to specific range:
import polars as pl
from sklearn.preprocessing import MinMaxScaler
import numpy as np
# Load coordinates
df = pl.read_parquet('umap_coords/LegalDocuments_coords.parquet')
# Scale to 0-100 range
coords = np.column_stack([df['x'].to_numpy(), df['y'].to_numpy()])
scaler = MinMaxScaler(feature_range=(0, 100))
scaled_coords = scaler.fit_transform(coords)
# Save scaled coordinates
df_scaled = pl.DataFrame({
    'uuid': df['uuid'],
    'x': scaled_coords[:, 0],
    'y': scaled_coords[:, 1]
})
df_scaled.write_parquet('umap_coords/LegalDocuments_coords_scaled.parquet')
Multiple UMAP Projections¶
Calculate different projections for different use cases:
# Local structure (for detailed exploration)
docker compose run --rm web python scripts/embed/calculate_umap_coords.py \
--embeddings-dir data/embeddings/pl-court-raw-sample \
--output-dir umap_coords/local \
--skip-update \
--n-neighbors 5 --min-dist 0.01
# Global structure (for overview)
docker compose run --rm web python scripts/embed/calculate_umap_coords.py \
--embeddings-dir data/embeddings/pl-court-raw-sample \
--output-dir umap_coords/global \
--skip-update \
--n-neighbors 50 --min-dist 0.5