Weaviate Schema Update Guide
Based on research from Weaviate documentation and community forums (2024).
Summary of Capabilities
| Operation | Possible? | Method |
|---|---|---|
| Add new properties | ✅ Yes | collection.config.add_property() |
| Modify existing property types | ❌ No | Must migrate to new collection |
| Delete properties | ❌ No | Must migrate to new collection |
| Change vectorizer | ❌ No | Must migrate to new collection |
1. Adding New Properties to Existing Collections
✅ This IS Supported
You can add new properties to existing collections after creation.
Python Code Example (v4 Client)
import weaviate
import weaviate.classes.config as wvcc
# Connect to Weaviate
client = weaviate.connect_to_local()
# Get the collection
collection = client.collections.get("LegalDocuments")
# Add new property
collection.config.add_property(
    wvcc.Property(
        name="extracted_factual_state",
        data_type=wvcc.DataType.TEXT,
        description="Extracted factual circumstances and background of the case",
        skip_vectorization=False,  # Enable for semantic search
        vectorize_property_name=False,
        index_searchable=True
    )
)
# Add another property
collection.config.add_property(
    wvcc.Property(
        name="extracted_legal_state",
        data_type=wvcc.DataType.TEXT,
        description="Extracted legal basis and applicable law",
        skip_vectorization=False,
        vectorize_property_name=False,
        index_searchable=True
    )
)
client.close()
⚠️ Important Limitations
Inverted Index Issue: When you add a property to a collection that already contains data, some filters that rely on the inverted index are unreliable for the objects that already exist:
- Filtering by the new property's length may not work correctly
- Filtering by the new property's null status may not work correctly
Why? The inverted index is built at import time. If you add a property after importing objects, the index metadata (length, null status) is not retroactively updated for the existing objects. Objects imported after the property was added are indexed as usual.
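For context, length and null-status filters depend on two collection-level inverted index options which, to our understanding, need to be enabled when the collection is created. The sketch below assumes the v4 Python client; the collection name `LegalDocumentsDemo` and the helper `create_demo_collection` are hypothetical and not part of the project code.

```python
import weaviate
from weaviate.classes.config import Configure, DataType, Property
from weaviate.classes.query import Filter

# Filters that rely on the null-state and length indexes.
# Building these objects does not require a running server.
missing_state = Filter.by_property("extracted_factual_state").is_none(True)
non_empty_state = Filter.by_property("extracted_factual_state", length=True).greater_than(0)


def create_demo_collection(client: weaviate.WeaviateClient):
    """Create a collection with null-state and length indexing enabled up front."""
    return client.collections.create(
        name="LegalDocumentsDemo",  # hypothetical name, chosen for illustration
        properties=[
            Property(name="extracted_factual_state", data_type=DataType.TEXT),
        ],
        inverted_index_config=Configure.inverted_index(
            index_null_state=True,       # needed for is_none() filters
            index_property_length=True,  # needed for length=True filters
        ),
    )
```

Against a live instance, `collection.query.fetch_objects(filters=missing_state, limit=10)` would then list objects where the property is null. Objects imported before the property existed may still be missed, for the reason given above.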
Recommended Approaches
Option 1: Add Properties Before Data Import (Best)
# 1. Add all properties to schema first
collection.config.add_property(...)
# 2. Then import data
collection.data.insert_many(objects)
Option 2: Accept Limitations (Quick)
# Add property to existing data
# Accept that inverted index filtering may be limited
collection.config.add_property(...)
Option 3: Re-import Data (Most Complete)
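# 1. Export all existing data
#    (the iterator pages through the whole collection; a single
#    fetch_objects(limit=10000) call would truncate larger collections)
objects = [obj.properties for obj in collection.iterator()]
# 2. Delete collection
client.collections.delete("LegalDocuments")
# 3. Recreate with new properties
# (use your create_collections() method)
# 4. Re-import all data through a handle to the recreated collection
#    (this sketch does not carry over the original UUIDs)
collection = client.collections.get("LegalDocuments")
collection.data.insert_many(objects)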
Option 4: Wait for Re-indexing API (Future)
- Weaviate is working on a re-indexing API
- Will allow retroactive index updates
- Not available yet (as of 2024)
2. Modifying Existing Property Types
❌ This is NOT Supported
You CANNOT:
- Change a property's data type (e.g., TEXT → TEXT_ARRAY)
- Delete properties from a collection
- Modify property configurations after creation
Official Statement: "You cannot modify existing properties after you create the collection."
Workaround: Migration to New Collection
The only way to change a property's type is to create a new collection and migrate the data into it.
Migration Process
import weaviate
import weaviate.classes.config as wvcc
from loguru import logger
from rich.progress import Progress
# Connect to Weaviate
client = weaviate.connect_to_local()
# Step 1: Export data from old collection
logger.info("Exporting data from LegalDocuments...")
old_collection = client.collections.get("LegalDocuments")
all_objects = []
cursor = None
# Paginate through all objects
while True:
    response = old_collection.query.fetch_objects(
        limit=1000,
        after=cursor
    )
    all_objects.extend(response.objects)
    if len(response.objects) < 1000:
        break
    cursor = response.objects[-1].uuid
logger.info(f"Exported {len(all_objects)} objects")
# Step 2: Delete old collection
logger.info("Deleting old LegalDocuments collection...")
client.collections.delete("LegalDocuments")
# Step 3: Create new collection with updated schema
logger.info("Creating new LegalDocuments collection with updated schema...")
client.collections.create(
    name="LegalDocuments",
    description="Collection of legal documents with updated schema",
    properties=[
        # Updated property with new type
        wvcc.Property(
            name="legal_references",
            data_type=wvcc.DataType.TEXT_ARRAY,  # Changed from TEXT to TEXT_ARRAY
            description="References to legal acts and regulations",
            skip_vectorization=True
        ),
        # ... add all other properties
    ],
    vectorizer_config=[
        # ... vectorizer config
    ]
)
# Step 4: Transform and re-import data
new_collection = client.collections.get("LegalDocuments")
logger.info("Migrating data to new collection...")
import json  # standard library; used below to decode JSON-encoded strings
with Progress() as progress:
    task = progress.add_task("Migrating...", total=len(all_objects))
    batch_size = 100
    for i in range(0, len(all_objects), batch_size):
        batch = all_objects[i:i+batch_size]
        # Transform objects to match new schema
        transformed = []
        for obj in batch:
            props = obj.properties.copy()
            # Transform property types as needed
            # Example: Convert JSON string to array
            if "legal_references" in props and isinstance(props["legal_references"], str):
                try:
                    props["legal_references"] = json.loads(props["legal_references"])
                except json.JSONDecodeError:
                    # Not valid JSON: fall back to an empty list
                    props["legal_references"] = []
            transformed.append(props)
        # Insert batch (only properties are passed, so new UUIDs are assigned)
        new_collection.data.insert_many(transformed)
        progress.update(task, advance=len(batch))
logger.info("Migration complete!")
client.close()
3. Our Specific Use Case
What We Need to Do
Based on our schema comparison, we have two options:
Option A: Add Missing Properties Only (Recommended)
Add these 2 new properties:
- extracted_factual_state (TEXT)
- extracted_legal_state (TEXT)
Pros:
- Simple, fast
- No data migration needed
- No downtime
Cons:
- Type mismatches remain (JSON strings for lists)
- Less optimal for querying
Code:
from juddges.data.documents_weaviate_db import WeaviateLegalDocumentsDatabase
import weaviate.classes.config as wvcc
with WeaviateLegalDocumentsDatabase() as db:
    collection = db.legal_documents_collection
    # Add factual_state property
    collection.config.add_property(
        wvcc.Property(
            name="extracted_factual_state",
            data_type=wvcc.DataType.TEXT,
            description="Extracted factual circumstances and background of the case",
            skip_vectorization=False,
            vectorize_property_name=False,
            index_searchable=True
        )
    )
    # Add legal_state property
    collection.config.add_property(
        wvcc.Property(
            name="extracted_legal_state",
            data_type=wvcc.DataType.TEXT,
            description="Extracted legal basis and applicable law",
            skip_vectorization=False,
            vectorize_property_name=False,
            index_searchable=True
        )
    )
Option B: Full Schema Migration (Production-Ready)
Fix all type mismatches:
- legal_references: TEXT (JSON) → TEXT_ARRAY
- legal_concepts: TEXT (JSON) → TEXT_ARRAY
- parties: TEXT (JSON) → TEXT_ARRAY
- outcome: TEXT (JSON) → TEXT
- legal_analysis: TEXT (JSON) → TEXT
- Add extracted_factual_state and extracted_legal_state
Pros:
- Proper types for optimal querying
- Native array support for list fields
- Better semantic search on text fields
- Production-ready schema
Cons:
- Requires data migration
- More complex
- Temporary downtime or dual-collection period
Required: Full migration script (see Section 2 above)
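The per-field conversions for Option B could be sketched as a pure function applied to each exported object before re-import. This is a minimal sketch under stated assumptions, not the project's migration code: the helper names (`decode_list`, `decode_text`, `transform_properties`) are hypothetical, "TEXT (JSON) → TEXT" is read here as "unwrap a JSON-encoded string", and a value that is not valid JSON is kept as plain text instead of being dropped.

```python
import json
from typing import Any

# Field groups taken from the Option B list.
LIST_FIELDS = ("legal_references", "legal_concepts", "parties")
TEXT_FIELDS = ("outcome", "legal_analysis")


def decode_list(value: Any) -> list[str]:
    """Turn a JSON-encoded list (or an already decoded list) into list[str]."""
    if isinstance(value, str):
        try:
            value = json.loads(value)
        except json.JSONDecodeError:
            pass  # not JSON: keep the raw string (assumption: better than losing it)
    if isinstance(value, list):
        return [str(item) for item in value]
    if isinstance(value, str) and value.strip():
        return [value]
    return []


def decode_text(value: Any) -> str:
    """Unwrap a JSON-encoded string; leave plain text and other JSON untouched."""
    if isinstance(value, str):
        try:
            decoded = json.loads(value)
        except json.JSONDecodeError:
            return value
        return decoded if isinstance(decoded, str) else value
    return "" if value is None else str(value)


def transform_properties(props: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of props with the Option B fields converted."""
    out = dict(props)
    for name in LIST_FIELDS:
        if name in out:
            out[name] = decode_list(out[name])
    for name in TEXT_FIELDS:
        if name in out:
            out[name] = decode_text(out[name])
    return out
```

In the Section 2 script, a function like this would replace the inline `legal_references` handling, so that all five fields go through one code path.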
4. Recommendations
For Our JuDDGES Project
Phase 1: Quick Implementation (This Week)
- ✅ Add 2 missing properties (extracted_factual_state, extracted_legal_state)
- ✅ Implement ingestion with JSON serialization for existing properties
- ✅ Test with IP Box data (43 documents)
- ✅ Validate extraction pipeline works end-to-end
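For Phase 1, list-valued fields still have to fit into TEXT properties, so they are serialized to JSON strings at ingestion time. A minimal sketch of that step, assuming each record is a plain dict; the function name `serialize_for_text_schema` is hypothetical:

```python
import json
from typing import Any


def serialize_for_text_schema(record: dict[str, Any]) -> dict[str, Any]:
    """Encode list and dict values as JSON strings so they fit TEXT properties."""
    return {
        # ensure_ascii=False keeps non-ASCII characters readable in the stored text
        key: json.dumps(value, ensure_ascii=False) if isinstance(value, (list, dict)) else value
        for key, value in record.items()
    }
```

Scalar values pass through unchanged, and `json.loads` recovers the original list when the field is read back or later migrated to TEXT_ARRAY.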
Phase 2: Production Migration (When Ready)
- ⏸️ Plan data migration window
- ⏸️ Create migration script with proper type conversions
- ⏸️ Test migration on staging/copy of data
- ⏸️ Execute production migration
- ⏸️ Validate all data migrated correctly
- ⏸️ Update ingestion code to use new types
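The "validate all data migrated correctly" step can start with a simple key comparison between the exported objects and the re-imported ones. The sketch below assumes every document carries a stable key, for example the object UUID if the migration preserves it, or a hypothetical `document_id` property; `compare_keys` is an illustrative name, not an existing helper.

```python
from typing import Iterable


def compare_keys(source_keys: Iterable[str], target_keys: Iterable[str]) -> dict[str, list[str]]:
    """Report keys lost or unexpectedly gained during a migration."""
    source = set(map(str, source_keys))
    target = set(map(str, target_keys))
    return {
        "missing": sorted(source - target),     # present before, absent after
        "unexpected": sorted(target - source),  # absent before, present after
    }
```

An empty `missing` and `unexpected` list only shows that the same documents exist on both sides; spot checks of converted fields are still needed to confirm the type conversions.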
5. Script: Add Missing Properties
The following standalone script adds both properties and skips any that already exist:
#!/usr/bin/env python3
"""
Add extracted_factual_state and extracted_legal_state properties to LegalDocuments collection.
"""
import os
import sys
from dotenv import load_dotenv
from rich.console import Console
from loguru import logger
from juddges.data.documents_weaviate_db import WeaviateLegalDocumentsDatabase
from juddges.settings import ROOT_PATH
import weaviate.classes.config as wvcc
load_dotenv(ROOT_PATH / ".env", override=True)
def add_extraction_properties():
    """Add new properties for extraction fields."""
    console = Console()
    console.print("[bold blue]Adding Extraction Properties to Weaviate Schema[/bold blue]\n")
    try:
        with WeaviateLegalDocumentsDatabase() as db:
            console.print("✅ Connected to Weaviate")
            collection = db.legal_documents_collection
            # Check existing properties
            existing_props = db.legal_documents_properties
            console.print(f"\n[cyan]Current properties ({len(existing_props)}):[/cyan]")
            console.print(", ".join(existing_props[:10]) + "...")
            # Add extracted_factual_state
            if "extracted_factual_state" not in existing_props:
                console.print("\n[yellow]Adding property: extracted_factual_state[/yellow]")
                collection.config.add_property(
                    wvcc.Property(
                        name="extracted_factual_state",
                        data_type=wvcc.DataType.TEXT,
                        description="Extracted factual circumstances and background of the case",
                        skip_vectorization=False,
                        vectorize_property_name=False,
                        index_searchable=True
                    )
                )
                console.print("[green]✓ Added extracted_factual_state[/green]")
            else:
                console.print("[dim]• extracted_factual_state already exists[/dim]")
            # Add extracted_legal_state
            if "extracted_legal_state" not in existing_props:
                console.print("\n[yellow]Adding property: extracted_legal_state[/yellow]")
                collection.config.add_property(
                    wvcc.Property(
                        name="extracted_legal_state",
                        data_type=wvcc.DataType.TEXT,
                        description="Extracted legal basis and applicable law",
                        skip_vectorization=False,
                        vectorize_property_name=False,
                        index_searchable=True
                    )
                )
                console.print("[green]✓ Added extracted_legal_state[/green]")
            else:
                console.print("[dim]• extracted_legal_state already exists[/dim]")
            # Verify
            console.print("\n[cyan]Verifying changes...[/cyan]")
            updated_props = db.legal_documents_properties
            console.print(f"[green]✓ Collection now has {len(updated_props)} properties[/green]")
            if "extracted_factual_state" in updated_props:
                console.print("[green]✓ extracted_factual_state confirmed[/green]")
            if "extracted_legal_state" in updated_props:
                console.print("[green]✓ extracted_legal_state confirmed[/green]")
            console.print("\n[bold green]✅ Schema update complete![/bold green]")
    except Exception as e:
        console.print(f"\n[red]❌ Error: {e}[/red]")
        logger.exception("Schema update failed")
        sys.exit(1)


if __name__ == "__main__":
    add_extraction_properties()
6. Reference: Property Configuration Options
When adding properties, you can configure:
wvcc.Property(
    name="property_name",  # Required: Property name
    data_type=wvcc.DataType.TEXT,  # Required: TEXT, TEXT_ARRAY, NUMBER, BOOL, etc.
    description="Description",  # Optional: Human-readable description
    skip_vectorization=False,  # Skip this property in vectorization?
    vectorize_property_name=False,  # Include property name in vector?
    index_searchable=True,  # Build the keyword (BM25/hybrid) search index? Text types only
    index_filterable=True,  # Build the index used by filters (separate from index_searchable)
    tokenization=wvcc.Tokenization.WORD,  # How text is split into tokens (e.g. WORD, FIELD)
)
Common Data Types
wvcc.DataType.TEXT # Single text string
wvcc.DataType.TEXT_ARRAY # Array of text strings
wvcc.DataType.NUMBER # Floating-point number
wvcc.DataType.INT # Integer only
wvcc.DataType.BOOL # Boolean
wvcc.DataType.DATE # Datetime with timezone (RFC 3339, e.g. 2024-01-31T00:00:00Z)
wvcc.DataType.GEO_COORDINATES # Geographic coordinates
wvcc.DataType.PHONE_NUMBER # Phone number
wvcc.DataType.BLOB # Binary data (base64-encoded)
wvcc.DataType.UUID # UUID value (not a cross-reference to another object)
7. Conclusion
For our use case:
- Immediate action: Add 2 missing properties using collection.config.add_property()
- Future consideration: Full schema migration to fix type mismatches
- Current limitation: Cannot change existing property types without migration
The good news: We can start ingesting extracted data immediately by adding just 2 properties!