Weaviate Schema Update Guide

Based on research from Weaviate documentation and community forums (2024).

Summary of Capabilities

| Operation | Possible? | Method |
| --- | --- | --- |
| Add new properties | ✅ Yes | `collection.config.add_property()` |
| Modify existing property types | ❌ No | Must migrate to new collection |
| Delete properties | ❌ No | Must migrate to new collection |
| Change vectorizer | ❌ No | Must migrate to new collection |

1. Adding New Properties to Existing Collections

✅ This IS Supported

You can add new properties to existing collections after creation.

Python Code Example (v4 Client)

import weaviate
import weaviate.classes.config as wvcc

# Connect to Weaviate
client = weaviate.connect_to_local()

# Get the collection
collection = client.collections.get("LegalDocuments")

# Add new property
collection.config.add_property(
    wvcc.Property(
        name="extracted_factual_state",
        data_type=wvcc.DataType.TEXT,
        description="Extracted factual circumstances and background of the case",
        skip_vectorization=False,  # Enable for semantic search
        vectorize_property_name=False,
        index_searchable=True
    )
)

# Add another property
collection.config.add_property(
    wvcc.Property(
        name="extracted_legal_state",
        data_type=wvcc.DataType.TEXT,
        description="Extracted legal basis and applicable law",
        skip_vectorization=False,
        vectorize_property_name=False,
        index_searchable=True
    )
)

client.close()

⚠️ Important Limitations

Inverted index issue: when you add a property to a collection that already contains data, some inverted-index features do not cover the objects imported earlier:

Filtering by the new property's length may not work correctly
Filtering by the new property's null status may not work correctly

Why? The inverted index is built at import time. If you add a property after importing objects, the index metadata (length, null status) is not retroactively updated for those existing objects. Both filters also depend on the collection's inverted index configuration having property-length and null-state indexing enabled.
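
To make the limitation concrete, here is a toy model in plain Python. It is not Weaviate's implementation: `ToyInvertedIndex` and its methods are hypothetical names, and the only behavior it models is that null-state entries are written when an object is imported and are never backfilled when a property is added later.

```python
class ToyInvertedIndex:
    """Toy model: null-state entries are written only at import time."""

    def __init__(self, properties):
        self.properties = set(properties)
        self.null_state = {}  # property name -> {object id: is_null}

    def add_property(self, name):
        # Registers the property, but does NOT revisit already imported objects.
        self.properties.add(name)

    def insert(self, object_id, values):
        # Index metadata is recorded for the properties known at import time.
        for name in self.properties:
            entries = self.null_state.setdefault(name, {})
            entries[object_id] = values.get(name) is None

    def filter_is_null(self, name, is_null):
        entries = self.null_state.get(name, {})
        return sorted(oid for oid, state in entries.items() if state == is_null)


index = ToyInvertedIndex(["title"])
index.insert("doc-1", {"title": "Judgment A"})   # imported before the new property
index.add_property("extracted_factual_state")
index.insert("doc-2", {"title": "Judgment B"})   # imported after the new property

# Both objects lack the new property, but only doc-2 has an index entry for it.
matches = index.filter_is_null("extracted_factual_state", True)
print(matches)  # ['doc-2']
```

In this model `doc-1` is invisible to the null-status filter even though its value is missing, which mirrors the behavior described above for objects imported before the property existed.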

Recommended Approaches

Option 1: Add Properties Before Data Import (Best)

# 1. Add all properties to schema first
collection.config.add_property(...)

# 2. Then import data
collection.data.insert_many(objects)

Option 2: Accept Limitations (Quick)

# Add property to existing data
# Accept that inverted index filtering may be limited
collection.config.add_property(...)

Option 3: Re-import Data (Most Complete)

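# 1. Export all existing data
# collection.iterator() pages through every object, so no fixed limit is needed
objects = [obj.properties for obj in collection.iterator()]

# 2. Delete collection
client.collections.delete("LegalDocuments")

# 3. Recreate with new properties
# (use your create_collections() method)

# 4. Re-import all data into the recreated collection
collection = client.collections.get("LegalDocuments")
collection.data.insert_many(objects)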

Option 4: Wait for Re-indexing API (Future)

Weaviate is working on a re-indexing API
Will allow retroactive index updates
Not available yet (as of 2024)

2. Modifying Existing Property Types

❌ This is NOT Supported

You CANNOT:

Change a property's data type (e.g., TEXT → TEXT_ARRAY)
Delete properties from a collection
Modify property configurations after creation

Official Statement: "You cannot modify existing properties after you create the collection."

Workaround: Migration to New Collection

The only way to change property types is to create a new collection and migrate data.

Migration Process

import weaviate
import weaviate.classes.config as wvcc
from loguru import logger
from rich.progress import Progress

# Connect to Weaviate
client = weaviate.connect_to_local()

# Step 1: Export data from old collection
logger.info("Exporting data from LegalDocuments...")
old_collection = client.collections.get("LegalDocuments")

all_objects = []
cursor = None

# Paginate through all objects
while True:
    response = old_collection.query.fetch_objects(
        limit=1000,
        after=cursor
    )

    all_objects.extend(response.objects)

    if len(response.objects) < 1000:
        break

    cursor = response.objects[-1].uuid

logger.info(f"Exported {len(all_objects)} objects")

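# Step 2: Delete old collection
# WARNING: from here until Step 4 finishes, the exported data exists only in memory.
# Consider saving all_objects to disk, or migrating into a differently named
# collection, before deleting the original.
logger.info("Deleting old LegalDocuments collection...")
client.collections.delete("LegalDocuments")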

# Step 3: Create new collection with updated schema
logger.info("Creating new LegalDocuments collection with updated schema...")

client.collections.create(
    name="LegalDocuments",
    description="Collection of legal documents with updated schema",
    properties=[
        # Updated property with new type
        wvcc.Property(
            name="legal_references",
            data_type=wvcc.DataType.TEXT_ARRAY,  # Changed from TEXT to TEXT_ARRAY
            description="References to legal acts and regulations",
            skip_vectorization=True
        ),
        # ... add all other properties
    ],
    vectorizer_config=[
        # ... vectorizer config
    ]
)

# Step 4: Transform and re-import data
new_collection = client.collections.get("LegalDocuments")

logger.info("Migrating data to new collection...")

with Progress() as progress:
    task = progress.add_task("Migrating...", total=len(all_objects))

    batch_size = 100
    for i in range(0, len(all_objects), batch_size):
        batch = all_objects[i:i+batch_size]

        # Transform objects to match new schema
        transformed = []
        for obj in batch:
            props = obj.properties.copy()

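            # Transform property types as needed
            # Example: Convert JSON string to array
            if "legal_references" in props and isinstance(props["legal_references"], str):
                import json  # local import kept for brevity; normally placed at the top of the script

                try:
                    props["legal_references"] = json.loads(props["legal_references"])
                except json.JSONDecodeError:
                    # Stored string is not valid JSON: fall back to an empty list
                    props["legal_references"] = []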

            transformed.append(props)

        # Insert batch
        # Note: inserting bare property dicts assigns new UUIDs to the objects
        new_collection.data.insert_many(transformed)
        progress.update(task, advance=len(batch))

logger.info("Migration complete!")
client.close()

3. Our Specific Use Case

What We Need to Do

Based on our schema comparison, we have two options:

Option A: Add Missing Properties Only (Recommended)

Add these 2 new properties:

extracted_factual_state (TEXT)
extracted_legal_state (TEXT)

Pros:

Simple, fast
No data migration needed
No downtime

Cons:

Type mismatches remain (JSON strings for lists)
Less optimal for querying

Code:

from juddges.data.documents_weaviate_db import WeaviateLegalDocumentsDatabase
import weaviate.classes.config as wvcc

with WeaviateLegalDocumentsDatabase() as db:
    collection = db.legal_documents_collection

    # Add factual_state property
    collection.config.add_property(
        wvcc.Property(
            name="extracted_factual_state",
            data_type=wvcc.DataType.TEXT,
            description="Extracted factual circumstances and background of the case",
            skip_vectorization=False,
            vectorize_property_name=False,
            index_searchable=True
        )
    )

    # Add legal_state property
    collection.config.add_property(
        wvcc.Property(
            name="extracted_legal_state",
            data_type=wvcc.DataType.TEXT,
            description="Extracted legal basis and applicable law",
            skip_vectorization=False,
            vectorize_property_name=False,
            index_searchable=True
        )
    )

Option B: Full Schema Migration (Production-Ready)

Fix all type mismatches:

legal_references: TEXT (JSON) → TEXT_ARRAY
legal_concepts: TEXT (JSON) → TEXT_ARRAY
parties: TEXT (JSON) → TEXT_ARRAY
outcome: TEXT (JSON) → TEXT
legal_analysis: TEXT (JSON) → TEXT
Add extracted_factual_state and extracted_legal_state
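
One way to express these conversions is a per-field converter table applied to each object's properties before re-import. This is a sketch under assumptions: it assumes the JSON-encoded fields hold either a JSON array (for the `TEXT_ARRAY` targets) or a JSON-encoded string (for the `TEXT` targets), and the names `to_text_array`, `to_text`, `CONVERTERS`, and `convert_properties` are hypothetical.

```python
import json


def to_text_array(value):
    """Return a list of strings; unparseable or non-list input becomes []."""
    if isinstance(value, list):
        return [str(item) for item in value]
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return []
        if isinstance(parsed, list):
            return [str(item) for item in parsed]
    return []


def to_text(value):
    """Unwrap a JSON-encoded string; leave any other string unchanged."""
    if value is None or not isinstance(value, str):
        return value
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        return value
    return parsed if isinstance(parsed, str) else value


CONVERTERS = {
    "legal_references": to_text_array,
    "legal_concepts": to_text_array,
    "parties": to_text_array,
    "outcome": to_text,
    "legal_analysis": to_text,
}


def convert_properties(props):
    """Apply the matching converter to each field; other fields pass through."""
    return {
        name: CONVERTERS[name](value) if name in CONVERTERS else value
        for name, value in props.items()
    }


converted = convert_properties(
    {"legal_references": '["Act A", "Act B"]', "outcome": '"granted"', "title": "X"}
)
print(converted)
# {'legal_references': ['Act A', 'Act B'], 'outcome': 'granted', 'title': 'X'}
```

Keeping the mapping in one table makes it easy to review which fields change type and to reuse the same function in both the migration script and the updated ingestion code.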

Pros:

Proper types for optimal querying
Native array support for list fields
Better semantic search on text fields
Production-ready schema

Cons:

Requires data migration
More complex
Temporary downtime or dual-collection period

Required: Full migration script (see Section 2 above)

4. Recommendations

For Our JuDDGES Project

Phase 1: Quick Implementation (This Week)

✅ Add 2 missing properties (extracted_factual_state, extracted_legal_state)
✅ Implement ingestion with JSON serialization for existing properties
✅ Test with IP Box data (43 documents)
✅ Validate extraction pipeline works end-to-end

Phase 2: Production Migration (When Ready)

⏸️ Plan data migration window
⏸️ Create migration script with proper type conversions
⏸️ Test migration on staging/copy of data
⏸️ Execute production migration
⏸️ Validate all data migrated correctly
⏸️ Update ingestion code to use new types
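
The validation step in Phase 2 could start from a check like the one below. It is only a sketch: `validate_migration` is a hypothetical helper that compares plain dictionaries of properties keyed by a stable identifier, which assumes the migration preserves such an identifier (for example a document ID property).

```python
def validate_migration(old, new, array_fields):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []

    # Every object present before the migration should still exist afterwards.
    missing = sorted(set(old) - set(new))
    if missing:
        problems.append(
            f"{len(missing)} object(s) missing after migration: {missing[:5]}"
        )

    # Fields migrated to TEXT_ARRAY should now hold lists, not JSON strings.
    for key in sorted(set(old) & set(new)):
        for field in array_fields:
            value = new[key].get(field)
            if value is not None and not isinstance(value, list):
                problems.append(
                    f"{key}: {field} is {type(value).__name__}, expected list"
                )
    return problems


old = {"doc-1": {"parties": '["A", "B"]'}, "doc-2": {"parties": "[]"}}
new = {"doc-1": {"parties": "A, B"}}  # doc-2 was lost, doc-1 was not converted

problems = validate_migration(old, new, ["parties"])
for problem in problems:
    print(problem)
```

A check like this only covers object presence and field types; comparing field contents or vectors would need additional logic.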

5. Script: Add Missing Properties

The following standalone script adds both properties and skips any that already exist:

#!/usr/bin/env python3
"""
Add extracted_factual_state and extracted_legal_state properties to LegalDocuments collection.
"""

import sys
from dotenv import load_dotenv
from rich.console import Console
from loguru import logger

from juddges.data.documents_weaviate_db import WeaviateLegalDocumentsDatabase
from juddges.settings import ROOT_PATH
import weaviate.classes.config as wvcc

load_dotenv(ROOT_PATH / ".env", override=True)


def add_extraction_properties():
    """Add new properties for extraction fields."""
    console = Console()

    console.print("[bold blue]Adding Extraction Properties to Weaviate Schema[/bold blue]\n")

    try:
        with WeaviateLegalDocumentsDatabase() as db:
            console.print("✅ Connected to Weaviate")

            collection = db.legal_documents_collection

            # Check existing properties
            existing_props = db.legal_documents_properties
            console.print(f"\n[cyan]Current properties ({len(existing_props)}):[/cyan]")
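            # Show at most 10 names; add an ellipsis only when the list was truncated
            console.print(", ".join(existing_props[:10]) + ("..." if len(existing_props) > 10 else ""))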

            # Add extracted_factual_state
            if "extracted_factual_state" not in existing_props:
                console.print("\n[yellow]Adding property: extracted_factual_state[/yellow]")
                collection.config.add_property(
                    wvcc.Property(
                        name="extracted_factual_state",
                        data_type=wvcc.DataType.TEXT,
                        description="Extracted factual circumstances and background of the case",
                        skip_vectorization=False,
                        vectorize_property_name=False,
                        index_searchable=True
                    )
                )
                console.print("[green]✓ Added extracted_factual_state[/green]")
            else:
                console.print("[dim]• extracted_factual_state already exists[/dim]")

            # Add extracted_legal_state
            if "extracted_legal_state" not in existing_props:
                console.print("\n[yellow]Adding property: extracted_legal_state[/yellow]")
                collection.config.add_property(
                    wvcc.Property(
                        name="extracted_legal_state",
                        data_type=wvcc.DataType.TEXT,
                        description="Extracted legal basis and applicable law",
                        skip_vectorization=False,
                        vectorize_property_name=False,
                        index_searchable=True
                    )
                )
                console.print("[green]✓ Added extracted_legal_state[/green]")
            else:
                console.print("[dim]• extracted_legal_state already exists[/dim]")

            # Verify
            console.print("\n[cyan]Verifying changes...[/cyan]")
            updated_props = db.legal_documents_properties
            console.print(f"[green]✓ Collection now has {len(updated_props)} properties[/green]")

            if "extracted_factual_state" in updated_props:
                console.print("[green]✓ extracted_factual_state confirmed[/green]")
            if "extracted_legal_state" in updated_props:
                console.print("[green]✓ extracted_legal_state confirmed[/green]")

            console.print("\n[bold green]✅ Schema update complete![/bold green]")

    except Exception as e:
        console.print(f"\n[red]❌ Error: {e}[/red]")
        logger.exception("Schema update failed")
        sys.exit(1)


if __name__ == "__main__":
    add_extraction_properties()
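
The two near-identical branches in the script can be collapsed into a loop. Below is a sketch of that idempotent pattern, written against plain callables so it runs without a server: `ensure_properties` is a hypothetical helper, `add` stands in for `collection.config.add_property`, and the values of `wanted` stand in for `wvcc.Property` objects.

```python
def ensure_properties(existing_names, wanted, add):
    """Call add(prop) for each wanted property that does not exist yet.

    Returns the names that were added, in the order they were processed.
    """
    known = set(existing_names)
    added = []
    for name, prop in wanted.items():
        if name in known:
            continue  # already in the schema: nothing to do
        add(prop)
        added.append(name)
    return added


# Demonstration with plain dicts and a list in place of the real client.
calls = []
wanted = {
    "extracted_factual_state": {"data_type": "text"},
    "extracted_legal_state": {"data_type": "text"},
}
added = ensure_properties(["title", "extracted_legal_state"], wanted, calls.append)
print(added)  # ['extracted_factual_state']
```

Running the helper a second time with the updated property list adds nothing, which makes the script safe to re-run.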

6. Reference: Property Configuration Options

When adding properties, you can configure:

wvcc.Property(
    name="property_name",                    # Required: Property name
    data_type=wvcc.DataType.TEXT,            # Required: TEXT, TEXT_ARRAY, NUMBER, BOOL, etc.
    description="Description",               # Optional: Human-readable description
    skip_vectorization=False,                # Skip this property in vectorization?
    vectorize_property_name=False,           # Include property name in vector?
    index_searchable=True,                   # Build the keyword (BM25/hybrid) search index? Text types only
    index_filterable=True,                   # Build the index used for filtering? (separate from index_searchable)
    tokenization=wvcc.Tokenization.WORD,     # How text is split into tokens, e.g. WORD or FIELD
)

Common Data Types

wvcc.DataType.TEXT          # Single text string
wvcc.DataType.TEXT_ARRAY    # Array of text strings
wvcc.DataType.NUMBER        # Floating-point number
wvcc.DataType.INT           # Integer only
wvcc.DataType.BOOL          # Boolean
wvcc.DataType.DATE          # Timestamp (RFC 3339 format)
wvcc.DataType.GEO_COORDINATES  # Geographic coordinates
wvcc.DataType.PHONE_NUMBER  # Phone number
wvcc.DataType.BLOB          # Binary data (base64-encoded)
wvcc.DataType.UUID          # UUID value (cross-references are configured separately)
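
When planning a schema, it can help to check which of these types a sample value would map to. The helper below is an illustrative heuristic, not Weaviate's auto-schema logic: `infer_data_type` is a hypothetical name, and it returns the enum member names as strings so the example runs without the client installed.

```python
from datetime import datetime


def infer_data_type(value):
    """Suggest a DataType member name for a Python value (heuristic only)."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "BOOL"
    if isinstance(value, int):
        return "INT"
    if isinstance(value, float):
        return "NUMBER"
    if isinstance(value, datetime):
        return "DATE"
    if isinstance(value, str):
        return "TEXT"
    if isinstance(value, list) and all(isinstance(item, str) for item in value):
        return "TEXT_ARRAY"  # note: an empty list also lands here
    raise TypeError(f"No suggestion for values of type {type(value).__name__}")


sample = {"title": "X", "page_count": 12, "score": 0.5, "parties": ["A", "B"]}
suggested = {name: infer_data_type(value) for name, value in sample.items()}
print(suggested)
# {'title': 'TEXT', 'page_count': 'INT', 'score': 'NUMBER', 'parties': 'TEXT_ARRAY'}
```

A JSON-encoded list stored as a string is reported as `TEXT`, which is exactly the kind of mismatch the migration in Section 2 is meant to fix.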

7. Conclusion

For our use case:

Immediate action: Add 2 missing properties using collection.config.add_property()
Future consideration: Full schema migration to fix type mismatches
Current limitation: Cannot change existing property types without migration

The good news: We can start ingesting extracted data immediately by adding just 2 properties!