-
Notifications
You must be signed in to change notification settings - Fork 146
Description
Summary
Add an optional feature to normalize message templates by storing them in a separate lookup table and referencing them by hash in the main logs table, similar to how Seq handles event types. This would dramatically reduce database storage requirements and improve query performance for applications with high-volume logging.
Problem Statement
Currently, serilog-sinks-mssqlserver
stores the full MessageTemplate
string in every log record. This approach creates significant inefficiencies:
Storage Duplication Issue
-- Current approach - same template repeated thousands of times
INSERT INTO Logs (MessageTemplate, Message, ...) VALUES
('User {UserId} logged in from {IPAddress}', 'User 12345 logged in from 192.168.1.1', ...),
('User {UserId} logged in from {IPAddress}', 'User 67890 logged in from 10.0.0.1', ...),
('User {UserId} logged in from {IPAddress}', 'User 11111 logged in from 172.16.0.1', ...);
-- Template "User {UserId} logged in from {IPAddress}" is stored 3 times (and potentially thousands more)
Real-World Impact
- Storage waste: A typical template like
"Processing {ItemCount} items for user {UserId} in {TimeElapsed}ms"
(65 characters) repeated 10,000 times wastes 650,000 bytes (~635 KB) - With normalization: Same scenario uses 10,000 × 8 bytes (hashes) + 1 × 65 bytes (template) = 80,065 bytes (~78 KB)
- Storage savings: 88% reduction (from ~635 KB to ~78 KB) for this single template
- Index inefficiency: Large
MessageTemplate
columns hurt index performance and increase storage overhead - Query complexity: Finding all events of a specific type requires expensive string comparisons
- Backup/restore overhead: Redundant data increases backup sizes and restore times
Proposed Solution
Implement message template normalization with a two-table approach:
Database Schema Changes
New table: MessageTemplates
CREATE TABLE [MessageTemplates] (
[TemplateHash] CHAR(8) NOT NULL PRIMARY KEY, -- 32-bit hash (hex)
[Template] NVARCHAR(MAX) NOT NULL -- Original template
);
Modified main table: Logs
CREATE TABLE [Logs] (
[Id] INT IDENTITY(1,1) NOT NULL,
[TemplateHash] CHAR(8) NULL, -- Reference to MessageTemplates
[MessageTemplate] NVARCHAR(MAX) NULL, -- Legacy column (optional)
[Message] NVARCHAR(MAX) NULL,
[Level] NVARCHAR(MAX) NULL,
[TimeStamp] DATETIME NULL,
[Exception] NVARCHAR(MAX) NULL,
[Properties] NVARCHAR(MAX) NULL,
CONSTRAINT [PK_Logs] PRIMARY KEY CLUSTERED ([Id] ASC),
CONSTRAINT [FK_Logs_MessageTemplates] FOREIGN KEY ([TemplateHash])
REFERENCES [MessageTemplates]([TemplateHash])
);
Configuration Options
Add new configuration properties to enable/control the feature:
var columnOptions = new ColumnOptions();
columnOptions.MessageTemplate.UseNormalization = false; // Default: disabled for backward compatibility
columnOptions.MessageTemplate.HashAlgorithm = HashAlgorithm.Murmur3; // Hash algorithm (when enabled)
columnOptions.MessageTemplate.KeepLegacyColumn = true; // Keep original MessageTemplate column (dual-write mode)
columnOptions.MessageTemplate.AutoCreateTemplateTable = true; // Create lookup table automatically
Configuration Modes:
Mode 1: Legacy (Default) - Full Backward Compatibility
UseNormalization = false; // Feature disabled, works exactly as today
// KeepLegacyColumn not relevant
Mode 2: Dual-Write - Safe Migration
UseNormalization = true; // Feature enabled
KeepLegacyColumn = true; // Write to BOTH TemplateHash AND MessageTemplate columns
Mode 3: Hash-Only - Maximum Optimization
UseNormalization = true; // Feature enabled
KeepLegacyColumn = false; // Write ONLY to TemplateHash column
JSON configuration:
{
"columnOptionsSection": {
"messageTemplate": {
"useNormalization": false,
"hashAlgorithm": "Murmur3",
"keepLegacyColumn": true,
"autoCreateTemplateTable": true
}
}
}
Performance Considerations
SqlBulkCopy Compatibility
The implementation is designed to preserve the high-performance characteristics of serilog-sinks-mssqlserver
:
- In-memory template cache: Templates are resolved to hashes without database calls during normal operation
- Asynchronous template insertion: New templates are inserted in background without blocking the main logging pipeline
- Bulk copy preservation: The main logs table continues using SqlBulkCopy for maximum throughput
- Minimal overhead: Hash calculation and dictionary lookup add negligible latency
Cache Management
// Configuration options for cache behavior
columnOptions.MessageTemplate.CacheSize = 10000; // Max templates in memory
columnOptions.MessageTemplate.PreloadTemplates = true; // Load existing on startup
columnOptions.MessageTemplate.CacheTimeout = TimeSpan.FromHours(1); // Optional expiry
Concurrency Handling
- Uses
ConcurrentDictionary
for thread-safe template caching - Handles concurrent template insertion with
IF NOT EXISTS
- Gracefully handles primary key violations from race conditions
- No locks or synchronization points in the hot logging path
Implementation Details
Hash Algorithm
Use 32-bit Murmur3 hash (same as Seq) for consistency and proven collision resistance:
- Converts templates like
"User {UserId} logged in"
→A26D9943
- Stores as 8-character hex string for readability and SQL compatibility
- Collision probability is negligible for typical logging scenarios
Write Process
Performance-Optimized Implementation:
-
Initialization Phase
// Load existing templates into memory cache on sink startup private readonly ConcurrentDictionary<string, string> _templateCache = new(); private async Task LoadExistingTemplatesAsync() { var templates = await connection.QueryAsync<(string Hash, string Template)>( "SELECT TOP 10000 TemplateHash, Template FROM MessageTemplates"); foreach (var (hash, template) in templates) { _templateCache.TryAdd(template, hash); } }
-
Fast Template Resolution
private string GetOrCreateTemplateHash(string template) { // Fast in-memory lookup first if (_templateCache.TryGetValue(template, out string existingHash)) return existingHash; // Calculate hash for new template string newHash = ComputeHash(template); // Try to add to cache (handles concurrent access) if (_templateCache.TryAdd(template, newHash)) { // This thread "won" - responsible for DB insert _ = Task.Run(() => InsertTemplateAsync(newHash, template)); } return newHash; }
-
SQL Bulk Copy Compatibility
private async Task InsertTemplateAsync(string hash, string template) { try { // Use MERGE or IF NOT EXISTS to handle concurrent inserts await connection.ExecuteAsync(@" IF NOT EXISTS (SELECT 1 FROM MessageTemplates WHERE TemplateHash = @Hash) INSERT INTO MessageTemplates (TemplateHash, Template) VALUES (@Hash, @Template)", new { Hash = hash, Template = template }); } catch (SqlException ex) when (ex.Number == 2627) // Primary key violation { // Another process inserted this template - ignore } }
-
Batch Processing
- Templates are resolved to hashes in-memory before bulk copy
- New templates are inserted asynchronously (fire-and-forget)
- SqlBulkCopy performance is preserved for the main logs table
- No synchronous DB calls during high-throughput logging
Read Process
When legacy MessageTemplate
column is removed:
-- Query with template resolution
SELECT l.*, mt.Template as MessageTemplate
FROM Logs l
LEFT JOIN MessageTemplates mt ON l.TemplateHash = mt.TemplateHash
WHERE l.TimeStamp > '2024-01-01';
-- Query for specific event types (much faster than string comparison)
SELECT * FROM Logs WHERE TemplateHash = 'A26D9943';
Backward Compatibility Strategy
Default Behavior: Zero Breaking Changes
UseNormalization = false
by default- Existing installations continue working exactly as before
- No schema changes required unless explicitly enabled
Migration Path:
Step 1: Enable with dual-write (Recommended)
UseNormalization = true;
KeepLegacyColumn = true; // Write to both TemplateHash AND MessageTemplate
- Creates
MessageTemplates
table alongside existing schema - All log records written to both old and new columns
- Existing queries continue working unchanged
- Allows testing and validation of the new feature
Step 2: Optimize storage (Optional)
UseNormalization = true;
KeepLegacyColumn = false; // Write ONLY to TemplateHash
- Maximum storage savings
- Requires updating queries to use JOIN or accepting hash-only queries
- Can be rolled back by switching
KeepLegacyColumn = true
Migration Support
- Provide migration script to populate
MessageTemplates
from existing data - Support for analyzing existing templates and their frequency
- Clear documentation for each migration step
Benefits
Storage Optimization
- Dramatic space savings: A template used 10,000 times goes from ~635 KB to ~78 KB (88% reduction)
- Improved index performance:
CHAR(8)
indexes are much faster thanNVARCHAR(MAX)
- Reduced backup sizes: Especially significant for high-volume logging applications
Query Performance
-- Before: Expensive string comparison
SELECT * FROM Logs WHERE MessageTemplate = 'User {UserId} logged in from {IPAddress}';
-- After: Fast hash lookup
SELECT * FROM Logs WHERE TemplateHash = 'F20BA6E0';
-- Advanced analytics become practical
SELECT TemplateHash, COUNT(*) as EventCount
FROM Logs
GROUP BY TemplateHash
ORDER BY EventCount DESC;
Operational Benefits
- Template discovery: Easily identify all unique templates in your application
-- Find all templates and their usage frequency SELECT mt.Template, COUNT(l.Id) as UsageCount FROM MessageTemplates mt LEFT JOIN Logs l ON mt.TemplateHash = l.TemplateHash GROUP BY mt.Template, mt.TemplateHash ORDER BY UsageCount DESC;
- Usage analytics: Track template frequency for optimization
- Faster log analysis: Group and analyze events by type rather than parsing strings
- High-performance logging: Maintains SqlBulkCopy speed with in-memory template caching
Proven Approach
This approach is battle-tested and proven effective:
Seq Implementation
Seq (the popular Serilog log server) has used this exact approach since its inception:
- Seq does this automatically by assigning a type to Serilog events on the server-side
- Seq produces a 32-bit (Murmur) hash of the message template that can be referred to using hex literals
- Storing, and then filtering on a wordy message template isn't always ideal. Instead, it's common to create a numeric hash value from the message template, and store this with the event instead
Community Recognition
- Nicholas Blumhardt (Serilog creator) has written extensively about this pattern
- Multiple blog posts demonstrate the value of event type hashing
- Widely adopted in the structured logging community
Potential Concerns & Mitigations
Hash Collisions
- Risk: Extremely low with Murmur3 32-bit hash
- Mitigation: Monitor for collisions and provide collision detection
- Fallback: Could extend to 64-bit hash if needed in future
Performance Impact
- Concern: Additional overhead affecting SqlBulkCopy performance
- Mitigation: In-memory template caching eliminates database calls during normal operation
- Benefit: Hash lookups (8 chars) are much faster than string comparisons (potentially 100+ chars)
Memory Usage
- Concern: Template cache consuming memory
- Mitigation: Configurable cache size with reasonable defaults (10K templates ≈ 1-2MB)
- Benefit: Tiny overhead compared to storage savings achieved
Migration Complexity
- Concern: Existing installations need migration path
- Mitigation: Dual-write mode provides smooth transition
- Benefit: Opt-in feature, no forced migration
Conclusion
Message template normalization addresses a fundamental inefficiency in current database logging approaches. By following Seq's proven model, we can:
- Reduce storage requirements by 80-95% for individual templates (depends on template length and frequency)
- Improve query performance significantly
- Enable advanced log analytics that aren't practical with current approach
- Maintain full backward compatibility during transition
This feature would make serilog-sinks-mssqlserver
more suitable for production environments with high-volume logging, while providing a clear migration path for existing users.
The implementation follows established patterns, has proven benefits, and addresses real pain points experienced by users managing large log databases. I believe this would be a valuable addition to the sink that would benefit the entire Serilog community.
Would the maintainers be interested in this feature? I'm happy to contribute to the implementation if there's community interest.