Skip to content

Feature Request: Message Template Normalization for Database Storage Optimization #632

@na-pel

Description

@na-pel

Summary

Add an optional feature to normalize message templates by storing them in a separate lookup table and referencing them by hash in the main logs table, similar to how Seq handles event types. This would dramatically reduce database storage requirements and improve query performance for applications with high-volume logging.

Problem Statement

Currently, serilog-sinks-mssqlserver stores the full MessageTemplate string in every log record. This approach creates significant inefficiencies:

Storage Duplication Issue

-- Current approach - same template repeated thousands of times
INSERT INTO Logs (MessageTemplate, Message, ...) VALUES 
('User {UserId} logged in from {IPAddress}', 'User 12345 logged in from 192.168.1.1', ...),
('User {UserId} logged in from {IPAddress}', 'User 67890 logged in from 10.0.0.1', ...),
('User {UserId} logged in from {IPAddress}', 'User 11111 logged in from 172.16.0.1', ...);
-- Template "User {UserId} logged in from {IPAddress}" is stored 3 times (and potentially thousands more)

Real-World Impact

  • Storage waste: A typical template like "Processing {ItemCount} items for user {UserId} in {TimeElapsed}ms" (65 characters) repeated 10,000 times wastes 650,000 bytes (~635 KB)
  • With normalization: Same scenario uses 10,000 × 8 bytes (hashes) + 1 × 65 bytes (template) = 80,065 bytes (~78 KB)
  • Storage savings: 88% reduction (from ~635 KB to ~78 KB) for this single template
  • Index inefficiency: Large MessageTemplate columns hurt index performance and increase storage overhead
  • Query complexity: Finding all events of a specific type requires expensive string comparisons
  • Backup/restore overhead: Redundant data increases backup sizes and restore times

Proposed Solution

Implement message template normalization with a two-table approach:

Database Schema Changes

New table: MessageTemplates

CREATE TABLE [MessageTemplates] (
    [TemplateHash] CHAR(8) NOT NULL PRIMARY KEY,     -- 32-bit hash (hex)
    [Template] NVARCHAR(MAX) NOT NULL                -- Original template
);

Modified main table: Logs

CREATE TABLE [Logs] (
    [Id] INT IDENTITY(1,1) NOT NULL,
    [TemplateHash] CHAR(8) NULL,                     -- Reference to MessageTemplates
    [MessageTemplate] NVARCHAR(MAX) NULL,            -- Legacy column (optional)
    [Message] NVARCHAR(MAX) NULL,
    [Level] NVARCHAR(MAX) NULL,
    [TimeStamp] DATETIME NULL,
    [Exception] NVARCHAR(MAX) NULL,
    [Properties] NVARCHAR(MAX) NULL,
    CONSTRAINT [PK_Logs] PRIMARY KEY CLUSTERED ([Id] ASC),
    CONSTRAINT [FK_Logs_MessageTemplates] FOREIGN KEY ([TemplateHash]) 
        REFERENCES [MessageTemplates]([TemplateHash])
);

Configuration Options

Add new configuration properties to enable/control the feature:

var columnOptions = new ColumnOptions();
columnOptions.MessageTemplate.UseNormalization = false;          // Default: disabled for backward compatibility
columnOptions.MessageTemplate.HashAlgorithm = HashAlgorithm.Murmur3; // Hash algorithm (when enabled)
columnOptions.MessageTemplate.KeepLegacyColumn = true;           // Keep original MessageTemplate column (dual-write mode)
columnOptions.MessageTemplate.AutoCreateTemplateTable = true;    // Create lookup table automatically

Configuration Modes:

Mode 1: Legacy (Default) - Full Backward Compatibility

UseNormalization = false;  // Feature disabled, works exactly as today
// KeepLegacyColumn not relevant

Mode 2: Dual-Write - Safe Migration

UseNormalization = true;   // Feature enabled
KeepLegacyColumn = true;   // Write to BOTH TemplateHash AND MessageTemplate columns

Mode 3: Hash-Only - Maximum Optimization

UseNormalization = true;   // Feature enabled  
KeepLegacyColumn = false;  // Write ONLY to TemplateHash column

JSON configuration:

{
  "columnOptionsSection": {
    "messageTemplate": {
      "useNormalization": false,
      "hashAlgorithm": "Murmur3",
      "keepLegacyColumn": true,
      "autoCreateTemplateTable": true
    }
  }
}

Performance Considerations

SqlBulkCopy Compatibility

The implementation is designed to preserve the high-performance characteristics of serilog-sinks-mssqlserver:

  • In-memory template cache: Templates are resolved to hashes without database calls during normal operation
  • Asynchronous template insertion: New templates are inserted in background without blocking the main logging pipeline
  • Bulk copy preservation: The main logs table continues using SqlBulkCopy for maximum throughput
  • Minimal overhead: Hash calculation and dictionary lookup add negligible latency

Cache Management

// Configuration options for cache behavior
columnOptions.MessageTemplate.CacheSize = 10000;        // Max templates in memory
columnOptions.MessageTemplate.PreloadTemplates = true;  // Load existing on startup
columnOptions.MessageTemplate.CacheTimeout = TimeSpan.FromHours(1); // Optional expiry

Concurrency Handling

  • Uses ConcurrentDictionary for thread-safe template caching
  • Handles concurrent template insertion with IF NOT EXISTS
  • Gracefully handles primary key violations from race conditions
  • No locks or synchronization points in the hot logging path

Implementation Details

Hash Algorithm

Use 32-bit Murmur3 hash (same as Seq) for consistency and proven collision resistance:

  • Converts templates like "User {UserId} logged in"A26D9943
  • Stores as 8-character hex string for readability and SQL compatibility
  • Collision probability is negligible for typical logging scenarios

Write Process

Performance-Optimized Implementation:

  1. Initialization Phase

    // Load existing templates into memory cache on sink startup
    private readonly ConcurrentDictionary<string, string> _templateCache = new();
    
    private async Task LoadExistingTemplatesAsync()
    {
        var templates = await connection.QueryAsync<(string Hash, string Template)>(
            "SELECT TOP 10000 TemplateHash, Template FROM MessageTemplates");
        
        foreach (var (hash, template) in templates)
        {
            _templateCache.TryAdd(template, hash);
        }
    }
  2. Fast Template Resolution

    private string GetOrCreateTemplateHash(string template)
    {
        // Fast in-memory lookup first
        if (_templateCache.TryGetValue(template, out string existingHash))
            return existingHash;
            
        // Calculate hash for new template
        string newHash = ComputeHash(template);
        
        // Try to add to cache (handles concurrent access)
        if (_templateCache.TryAdd(template, newHash))
        {
            // This thread "won" - responsible for DB insert
            _ = Task.Run(() => InsertTemplateAsync(newHash, template));
        }
        
        return newHash;
    }
  3. SQL Bulk Copy Compatibility

    private async Task InsertTemplateAsync(string hash, string template)
    {
        try
        {
            // Use MERGE or IF NOT EXISTS to handle concurrent inserts
            await connection.ExecuteAsync(@"
                IF NOT EXISTS (SELECT 1 FROM MessageTemplates WHERE TemplateHash = @Hash)
                INSERT INTO MessageTemplates (TemplateHash, Template) 
                VALUES (@Hash, @Template)",
                new { Hash = hash, Template = template });
        }
        catch (SqlException ex) when (ex.Number == 2627) // Primary key violation
        {
            // Another process inserted this template - ignore
        }
    }
  4. Batch Processing

    • Templates are resolved to hashes in-memory before bulk copy
    • New templates are inserted asynchronously (fire-and-forget)
    • SqlBulkCopy performance is preserved for the main logs table
    • No synchronous DB calls during high-throughput logging

Read Process

When legacy MessageTemplate column is removed:

-- Query with template resolution
SELECT l.*, mt.Template as MessageTemplate 
FROM Logs l 
LEFT JOIN MessageTemplates mt ON l.TemplateHash = mt.TemplateHash 
WHERE l.TimeStamp > '2024-01-01';

-- Query for specific event types (much faster than string comparison)
SELECT * FROM Logs WHERE TemplateHash = 'A26D9943';

Backward Compatibility Strategy

Default Behavior: Zero Breaking Changes

  • UseNormalization = false by default
  • Existing installations continue working exactly as before
  • No schema changes required unless explicitly enabled

Migration Path:

Step 1: Enable with dual-write (Recommended)

UseNormalization = true;
KeepLegacyColumn = true;  // Write to both TemplateHash AND MessageTemplate
  • Creates MessageTemplates table alongside existing schema
  • All log records written to both old and new columns
  • Existing queries continue working unchanged
  • Allows testing and validation of the new feature

Step 2: Optimize storage (Optional)

UseNormalization = true;
KeepLegacyColumn = false;  // Write ONLY to TemplateHash
  • Maximum storage savings
  • Requires updating queries to use JOIN or accepting hash-only queries
  • Can be rolled back by switching KeepLegacyColumn = true

Migration Support

  • Provide migration script to populate MessageTemplates from existing data
  • Support for analyzing existing templates and their frequency
  • Clear documentation for each migration step

Benefits

Storage Optimization

  • Dramatic space savings: A template used 10,000 times goes from ~635 KB to ~78 KB (88% reduction)
  • Improved index performance: CHAR(8) indexes are much faster than NVARCHAR(MAX)
  • Reduced backup sizes: Especially significant for high-volume logging applications

Query Performance

-- Before: Expensive string comparison
SELECT * FROM Logs WHERE MessageTemplate = 'User {UserId} logged in from {IPAddress}';

-- After: Fast hash lookup  
SELECT * FROM Logs WHERE TemplateHash = 'F20BA6E0';

-- Advanced analytics become practical
SELECT TemplateHash, COUNT(*) as EventCount 
FROM Logs 
GROUP BY TemplateHash 
ORDER BY EventCount DESC;

Operational Benefits

  • Template discovery: Easily identify all unique templates in your application
    -- Find all templates and their usage frequency
    SELECT mt.Template, COUNT(l.Id) as UsageCount
    FROM MessageTemplates mt
    LEFT JOIN Logs l ON mt.TemplateHash = l.TemplateHash
    GROUP BY mt.Template, mt.TemplateHash
    ORDER BY UsageCount DESC;
  • Usage analytics: Track template frequency for optimization
  • Faster log analysis: Group and analyze events by type rather than parsing strings
  • High-performance logging: Maintains SqlBulkCopy speed with in-memory template caching

Proven Approach

This approach is battle-tested and proven effective:

Seq Implementation

Seq (the popular Serilog log server) has used this exact approach since its inception:

  • Seq does this automatically by assigning a type to Serilog events on the server-side
  • Seq produces a 32-bit (Murmur) hash of the message template that can be referred to using hex literals
  • Storing, and then filtering on a wordy message template isn't always ideal. Instead, it's common to create a numeric hash value from the message template, and store this with the event instead

Community Recognition

  • Nicholas Blumhardt (Serilog creator) has written extensively about this pattern
  • Multiple blog posts demonstrate the value of event type hashing
  • Widely adopted in the structured logging community

Potential Concerns & Mitigations

Hash Collisions

  • Risk: Extremely low with Murmur3 32-bit hash
  • Mitigation: Monitor for collisions and provide collision detection
  • Fallback: Could extend to 64-bit hash if needed in future

Performance Impact

  • Concern: Additional overhead affecting SqlBulkCopy performance
  • Mitigation: In-memory template caching eliminates database calls during normal operation
  • Benefit: Hash lookups (8 chars) are much faster than string comparisons (potentially 100+ chars)

Memory Usage

  • Concern: Template cache consuming memory
  • Mitigation: Configurable cache size with reasonable defaults (10K templates ≈ 1-2MB)
  • Benefit: Tiny overhead compared to storage savings achieved

Migration Complexity

  • Concern: Existing installations need migration path
  • Mitigation: Dual-write mode provides smooth transition
  • Benefit: Opt-in feature, no forced migration

Conclusion

Message template normalization addresses a fundamental inefficiency in current database logging approaches. By following Seq's proven model, we can:

  • Reduce storage requirements by 80-95% for individual templates (depends on template length and frequency)
  • Improve query performance significantly
  • Enable advanced log analytics that aren't practical with current approach
  • Maintain full backward compatibility during transition

This feature would make serilog-sinks-mssqlserver more suitable for production environments with high-volume logging, while providing a clear migration path for existing users.

The implementation follows established patterns, has proven benefits, and addresses real pain points experienced by users managing large log databases. I believe this would be a valuable addition to the sink that would benefit the entire Serilog community.


Would the maintainers be interested in this feature? I'm happy to contribute to the implementation if there's community interest.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions