Weaviate Vector Database Integration
Overview
Weaviate serves as the semantic search engine and vector database for JuDDGES, enabling efficient similarity search, context retrieval, and document management. This document details the integration architecture, schema design, and operational workflows.
Integration Architecture
graph TB
subgraph "Data Sources"
Docs[("Legal Documents<br/>Parquet Files")]
Embeddings[("Generated Embeddings<br/>768-dim vectors")]
end
subgraph "Weaviate Infrastructure"
subgraph "Docker Container"
WeaviateCore["Weaviate Core<br/>v1.24+"]
GRPC["gRPC Server<br/>Port 8085"]
REST["REST API<br/>Port 8080"]
GraphQL["GraphQL API<br/>Port 8080"]
end
subgraph "Storage Backend"
VectorIndex["Vector Indices<br/>HNSW Algorithm"]
ObjectStore["Object Storage<br/>Document Metadata"]
InvertedIndex["Inverted Index<br/>Keyword Search"]
end
subgraph "Modules"
Text2Vec["text2vec-transformers<br/>Embedding Module"]
QnA["qna-transformers<br/>Question Answering"]
Reranker["reranker-transformers<br/>Result Reranking"]
end
end
subgraph "Client Applications"
Ingestion["Ingestion Script<br/>ingest_to_weaviate.py"]
Inference["Inference Pipeline<br/>Context Retrieval"]
Analytics["Analytics<br/>Document Analysis"]
end
Docs --> Ingestion
Embeddings --> Ingestion
Ingestion --> REST
REST --> ObjectStore
REST --> VectorIndex
Inference --> GraphQL
GraphQL --> VectorIndex
VectorIndex --> Reranker
Reranker --> Inference
Analytics --> GraphQL
GraphQL --> InvertedIndex
style WeaviateCore fill:#e8f5e9
style VectorIndex fill:#e3f2fd
style ObjectStore fill:#f3e5f5
Schema Design
Collection: legal_documents
classDiagram
class LegalDocument {
+UUID document_id
+String signature
+String court_name
+Date judgment_date
+String document_text
+String[] judges
+String[] keywords
+String legal_basis
+String decision_type
+Float[] embedding_vector
+String metadata_json
+Timestamp created_at
+Timestamp updated_at
}
class DocumentMetadata {
+String source_file
+String language
+Integer word_count
+String document_type
+String jurisdiction
+Boolean is_published
}
LegalDocument "1" --> "1" DocumentMetadata : contains
style LegalDocument fill:#e3f2fd
style DocumentMetadata fill:#f3e5f5
Collection: document_chunks
classDiagram
class DocumentChunk {
+UUID chunk_id
+UUID document_id
+String chunk_text
+Integer chunk_index
+Integer start_char
+Integer end_char
+Float[] embedding_vector
+String chunk_type
+Float relevance_score
}
class ChunkMetadata {
+String section_title
+String paragraph_type
+Integer token_count
+String[] entities
+String[] citations
}
DocumentChunk "1" --> "1" ChunkMetadata : contains
DocumentChunk "n" --> "1" LegalDocument : belongs_to
style DocumentChunk fill:#e8f5e9
style ChunkMetadata fill:#fff3e0
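The `chunk_index`, `start_char`, and `end_char` fields above imply that every chunk records its position inside the parent document. A minimal sketch of how such records could be produced, assuming fixed-size character windows with overlap. The chunking strategy, the `chunk_document` helper, and the default sizes are illustrative assumptions; the source does not say how JuDDGES actually splits documents.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    document_id: str
    chunk_index: int
    start_char: int
    end_char: int
    chunk_text: str


def chunk_document(document_id: str, text: str, size: int = 1000, overlap: int = 200) -> list[Chunk]:
    """Split text into overlapping character windows (hypothetical strategy)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    chunks: list[Chunk] = []
    step = size - overlap  # how far each window advances
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(Chunk(document_id, len(chunks), start, end, text[start:end]))
        if end == len(text):
            break  # the last window reached the end of the document
        start += step
    return chunks
```

Because `end_char` is exclusive, `text[chunk.start_char:chunk.end_char]` always reproduces `chunk_text`, which makes it easy to map a retrieved chunk back to its place in the full judgment.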
Data Ingestion Pipeline
flowchart TD
subgraph "Ingestion Process"
Start["Start Ingestion"]
Load["Load Parquet Files<br/>• Document batches<br/>• Metadata"]
Validate["Validate Schema<br/>• Required fields<br/>• Data types"]
GenUUID["Generate UUIDs<br/>• Deterministic<br/>• Based on content hash"]
Batch["Create Batches<br/>• Size: 100 objects<br/>• Memory efficient"]
subgraph "Weaviate Operations"
Check{"Object Exists?"}
Update["Update Object<br/>• Merge metadata<br/>• Update embedding"]
Create["Create Object<br/>• New document<br/>• Full metadata"]
end
Commit["Commit Batch<br/>• Per-object results<br/>• Error handling"]
More{"More Batches?"}
Complete["Ingestion Complete<br/>• Log statistics"]
end
Start --> Load
Load --> Validate
Validate --> GenUUID
GenUUID --> Batch
Batch --> Check
Check -->|Yes| Update
Check -->|No| Create
Update --> Commit
Create --> Commit
Commit --> More
More -->|Yes| Batch
More -->|No| Complete
style Start fill:#e8f5e9
style Complete fill:#ffebee
style Check fill:#fff3e0
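Two steps of the flow above, deterministic UUID generation and batching, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in `ingest_to_weaviate.py`: the namespace URL, the choice of SHA-256, and the helper names `content_uuid` and `batched` are all hypothetical.

```python
import hashlib
import uuid
from itertools import islice
from typing import Iterable, Iterator

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/legal-documents")


def content_uuid(document_text: str) -> str:
    """Derive a stable UUID from the SHA-256 hash of the document text."""
    digest = hashlib.sha256(document_text.encode("utf-8")).hexdigest()
    return str(uuid.uuid5(NAMESPACE, digest))


def batched(objects: Iterable[dict], size: int = 100) -> Iterator[list[dict]]:
    """Yield lists of at most `size` objects without materializing the input."""
    iterator = iter(objects)
    while batch := list(islice(iterator, size)):
        yield batch
```

Since the same text always maps to the same UUID, re-running ingestion on an unchanged document hits the "Object Exists?" branch and updates it rather than creating a duplicate.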
Query Architecture
sequenceDiagram
participant Client
participant API
participant Weaviate
participant VectorIndex
participant Reranker
Client->>API: Query Request
API->>API: Parse query & extract intent
alt Semantic Search
API->>Weaviate: Vector search query
Weaviate->>VectorIndex: Find similar vectors
VectorIndex-->>Weaviate: Top-k results
Weaviate->>Reranker: Rerank results
Reranker-->>Weaviate: Reranked results
else Hybrid Search
API->>Weaviate: Hybrid query (vector + keyword)
Weaviate->>VectorIndex: Vector search
Weaviate->>Weaviate: Keyword search
Weaviate->>Weaviate: Merge & score results
else Aggregate Query
API->>Weaviate: Aggregation query
Weaviate->>Weaviate: Calculate aggregates
end
Weaviate-->>API: Query results
API->>API: Post-process results
API-->>Client: Formatted response
Query Types and Patterns
1. Semantic Search
graph LR
Query["User Query:<br/>'contracts breach damages'"]
Embed["Generate Query<br/>Embedding"]
Search["Vector Similarity<br/>Search"]
Filter["Apply Filters:<br/>• Date range<br/>• Court type<br/>• Jurisdiction"]
Rank["Rank by:<br/>• Cosine similarity<br/>• Relevance score"]
Results["Return Top-K<br/>Documents"]
Query --> Embed
Embed --> Search
Search --> Filter
Filter --> Rank
Rank --> Results
style Query fill:#e3f2fd
style Results fill:#e8f5e9
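To make the filter, rank, and top-k steps concrete, here is a brute-force, in-memory stand-in for the vector index. It is only a sketch: in the real system the HNSW index inside Weaviate does the similarity search, and the `semantic_search` function, its `court_name` filter, and the dictionary layout of `documents` are assumptions made for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(query_vector, documents, court_name=None, top_k=10):
    """Filter candidates, score them against the query, and keep the best top_k."""
    candidates = [
        doc for doc in documents
        if court_name is None or doc["court_name"] == court_name
    ]
    scored = [
        (cosine_similarity(query_vector, doc["embedding_vector"]), doc)
        for doc in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return scored[:top_k]
```

The sketch filters before scoring to avoid computing similarities for excluded documents; with an exhaustive search this returns the same results as filtering afterwards.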
2. Hybrid Search
graph TB
subgraph "Hybrid Search Components"
Input["Search Query"]
subgraph "Vector Search Path"
VEmbed["Query Embedding"]
VSearch["Vector Search<br/>α = 0.7"]
end
subgraph "Keyword Search Path"
KTokenize["Tokenization"]
KSearch["BM25 Search<br/>weight 1 − α = 0.3"]
end
Fusion["Score Fusion<br/>weighted combination"]
Output["Ranked Results"]
end
Input --> VEmbed
Input --> KTokenize
VEmbed --> VSearch
KTokenize --> KSearch
VSearch --> Fusion
KSearch --> Fusion
Fusion --> Output
style Input fill:#e3f2fd
style Output fill:#e8f5e9
style Fusion fill:#fff3e0
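The weighted combination in the fusion step can be illustrated as follows. Weaviate performs fusion server-side, so this sketch only shows the arithmetic behind a single weight α: the vector path contributes α and the keyword path 1 − α. The min-max normalization and the helper names are assumptions chosen for the example.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores to [0, 1]; a set of identical scores maps to 1.0."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {key: 1.0 for key in scores}
    return {key: (value - low) / (high - low) for key, value in scores.items()}


def fuse_scores(vector_scores, keyword_scores, alpha=0.7):
    """Return (doc_id, score) pairs ranked by alpha * vector + (1 - alpha) * keyword."""
    vector_norm = min_max_normalize(vector_scores)
    keyword_norm = min_max_normalize(keyword_scores)
    fused = {
        doc_id: alpha * vector_norm.get(doc_id, 0.0)
        + (1 - alpha) * keyword_norm.get(doc_id, 0.0)
        for doc_id in set(vector_norm) | set(keyword_norm)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

Normalizing first matters because BM25 scores are unbounded while cosine similarities are not; without it the keyword path could dominate regardless of α.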
3. Context Retrieval for RAG
flowchart LR
subgraph "RAG Context Pipeline"
Question["User Question"]
subgraph "Retrieval"
ChunkSearch["Search Chunks<br/>Top-20"]
DocSearch["Search Documents<br/>Top-5"]
Combine["Combine Results"]
end
subgraph "Processing"
Dedup["Deduplicate"]
Sort["Sort by Relevance"]
Truncate["Fit Context Window"]
end
Context["Final Context<br/>for LLM"]
end
Question --> ChunkSearch
Question --> DocSearch
ChunkSearch --> Combine
DocSearch --> Combine
Combine --> Dedup
Dedup --> Sort
Sort --> Truncate
Truncate --> Context
style Question fill:#e3f2fd
style Context fill:#e8f5e9
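The processing half of this pipeline (combine, deduplicate, sort, truncate) is simple enough to sketch directly. The hit format (`text` and `score` keys), the whitespace-based token estimate, and the `build_context` name are assumptions for illustration; a production pipeline would count tokens with the tokenizer of the target LLM.

```python
def build_context(chunk_hits, document_hits, max_tokens=2000):
    """Combine hits, drop duplicates, sort by score, and fit a token budget."""
    best = {}
    for hit in list(chunk_hits) + list(document_hits):
        key = hit["text"]
        # Deduplicate on text, keeping the highest-scoring copy.
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    ranked = sorted(best.values(), key=lambda hit: hit["score"], reverse=True)

    context, used = [], 0
    for hit in ranked:
        cost = len(hit["text"].split())  # crude whitespace token estimate
        if used + cost > max_tokens:
            break  # stop at the first passage that would overflow the window
        context.append(hit["text"])
        used += cost
    return "\n\n".join(context)
```

Stopping at the first overflow keeps the context strictly ordered by relevance; skipping the oversized passage and continuing would pack the window more tightly at the cost of that ordering.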
Performance Optimization
graph TD
subgraph "Optimization Strategies"
subgraph "Indexing"
HNSW["HNSW Index<br/>• ef_construction: 128<br/>• max_connections: 16"]
Sharding["Data Sharding<br/>• By date<br/>• By court"]
end
subgraph "Caching"
QueryCache["Query Cache<br/>• TTL: 5 minutes<br/>• LRU eviction"]
EmbedCache["Embedding Cache<br/>• Pre-computed<br/>• Frequently accessed"]
end
subgraph "Batch Operations"
BatchIngest["Batch Ingestion<br/>• 100 objects/batch<br/>• Parallel processing"]
BatchQuery["Batch Queries<br/>• Multi-get<br/>• Aggregations"]
end
subgraph "Resource Management"
ConnPool["Connection Pooling<br/>• Max: 100<br/>• Timeout: 30s"]
RateLimit["Rate Limiting<br/>• 1000 req/min<br/>• Backoff strategy"]
end
end
style HNSW fill:#e3f2fd
style QueryCache fill:#e8f5e9
style BatchIngest fill:#fff3e0
style ConnPool fill:#f3e5f5
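The query cache described above combines a time-to-live with LRU eviction. A minimal client-side sketch, assuming the cache lives in the application process; the `QueryCache` class, its `max_entries` default, and the injectable `clock` are hypothetical and not part of Weaviate.

```python
import time
from collections import OrderedDict


class QueryCache:
    """Small TTL + LRU cache for query results."""

    def __init__(self, max_entries=1024, ttl_seconds=300.0, clock=time.monotonic):
        self._entries = OrderedDict()  # key -> (stored_at, value), oldest first
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired: drop and report a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```

The default `ttl_seconds=300.0` matches the five-minute TTL in the diagram. Passing a fake `clock` makes expiry testable without sleeping.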
Docker Deployment
graph TB
subgraph "Docker Compose Stack"
Network["Docker Network<br/>weaviate_default"]
subgraph "Weaviate Service"
Container["weaviate:latest<br/>Container"]
Volumes["Volumes<br/>• /var/lib/weaviate<br/>• ./data:/data"]
Ports["Ports<br/>• 8080:8080<br/>• 8085:50051"]
Env["Environment<br/>• ENABLE_MODULES<br/>• PERSISTENCE_DATA_PATH"]
end
subgraph "Dependencies"
T2V["text2vec-transformers<br/>Container"]
QnAModule["qna-transformers<br/>Container"]
end
end
Network --> Container
Container --> Volumes
Container --> Ports
Container --> Env
Container --> T2V
Container --> QnAModule
style Network fill:#e8f5e9
style Container fill:#e3f2fd
Monitoring and Metrics
graph LR
subgraph "Metrics Collection"
Weaviate --> Prometheus["Prometheus<br/>Metrics Endpoint"]
Prometheus --> Grafana["Grafana<br/>Dashboards"]
subgraph "Key Metrics"
M1["Query Latency"]
M2["Index Size"]
M3["Memory Usage"]
M4["Query Rate"]
M5["Error Rate"]
end
Grafana --> M1
Grafana --> M2
Grafana --> M3
Grafana --> M4
Grafana --> M5
end
style Prometheus fill:#fff3e0
style Grafana fill:#e8f5e9
Error Handling
stateDiagram-v2
[*] --> Healthy
Healthy --> ConnectionError: Connection lost
ConnectionError --> Retry: Automatic
Retry --> Healthy: Success
Retry --> Retry: Failed < 3
Retry --> Error: Failed >= 3
Healthy --> RateLimit: Too many requests
RateLimit --> Backoff: Exponential
Backoff --> Healthy: After delay
Healthy --> SchemaError: Invalid data
SchemaError --> Validation: Check schema
Validation --> Transform: Fix data
Transform --> Healthy: Retry
Error --> Manual: Alert sent
Manual --> Healthy: Fixed
Error --> [*]
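The retry and backoff transitions in the state diagram can be sketched as a small wrapper. This reads the diagram as "one initial attempt plus up to three retries", which is an interpretation; the `TransientError` class, the `call_with_retry` helper, and the base delay are hypothetical.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (lost connection, rate limit)."""


def call_with_retry(operation, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Run `operation`, retrying transient failures with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error so it can be alerted on
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
```

Errors that are not transient, such as schema validation failures, are deliberately not caught here: retrying them without fixing the data would fail the same way every time.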
Best Practices
- UUID Generation: Derive deterministic UUIDs from document content so that re-ingesting a document updates it instead of duplicating it
- Batch Size: Ingest in batches of about 100 objects, the size this pipeline uses
- Connection Pooling: Reuse client connections rather than opening one per request
- Error Recovery: Retry transient errors with exponential backoff
- Schema Evolution: Version schemas and plan migrations before changing existing collections
- Monitoring: Track query latency, error rate, and index size
- Backup: Back up the Weaviate data directory regularly
API Examples
Creating a Collection
# Schema definition for legal_documents
schema = {
    "class": "LegalDocument",
    "properties": [
        {"name": "signature", "dataType": ["text"]},
        {"name": "court_name", "dataType": ["text"]},
        {"name": "judgment_date", "dataType": ["date"]},
        {"name": "document_text", "dataType": ["text"]},
        {"name": "judges", "dataType": ["text[]"]},
        {"name": "keywords", "dataType": ["text[]"]},
    ],
    "vectorizer": "none",  # Using pre-computed embeddings
}
Querying Documents
# GraphQL query for semantic search
{
  Get {
    LegalDocument(
      nearVector: {
        vector: [0.1, 0.2, ...]
        certainty: 0.8
      }
      limit: 10
    ) {
      signature
      court_name
      judgment_date
      _additional {
        certainty
        distance
      }
    }
  }
}
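In practice the query vector is the full embedding of the user query, so the query text is usually generated rather than written by hand. A minimal sketch that renders the query above for a concrete vector and wraps it in the JSON body expected by Weaviate's GraphQL endpoint (`/v1/graphql` on the REST port); the `build_near_vector_query` helper is hypothetical, and the three-element vector stands in for a real embedding.

```python
import json


def build_near_vector_query(vector, certainty=0.8, limit=10):
    """Render the nearVector query for a concrete query embedding."""
    return (
        "{ Get { LegalDocument("
        f"nearVector: {{ vector: {json.dumps(vector)}, certainty: {certainty} }}, "
        f"limit: {limit}"
        ") { signature court_name judgment_date _additional { certainty distance } } } }"
    )


# Placeholder vector for illustration; a real query uses the full embedding.
query = build_near_vector_query([0.1, 0.2, 0.3])
payload = json.dumps({"query": query})  # request body for POST /v1/graphql
```

Sending `payload` with any HTTP client and a `Content-Type: application/json` header returns the matching documents under `data.Get.LegalDocument` in the response.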