[API Proposal]: DataIngestion: Document representation #6893

@adamsitnik

Background and motivation

DataIngestion is an ETL process: a DocumentReader parses a given document and represents it as a Document, which is then processed by zero or more processors, split into Chunks, and persisted in a vector store to enable vector search and RAG.

Document is a format-agnostic container that normalizes diverse input formats into a structured hierarchy. It's composed of DocumentSection objects, which can contain nested elements (including subsections). Each DocumentElement has to provide its content in Markdown format.

Markdown is extremely close to plain text, with minimal markup or formatting, but still provides a way to represent important document structure. Mainstream LLMs, such as OpenAI's GPT-4o, natively "speak" Markdown, and often incorporate Markdown into their responses unprompted. This suggests that they have been trained on vast amounts of Markdown-formatted text, and understand it well. As a side benefit, Markdown conventions are also highly token-efficient.

API Proposal

namespace Microsoft.Extensions.DataIngestion;

public abstract class DocumentElement
{
    protected internal DocumentElement(string markdown);

    protected internal DocumentElement(); // ctor used by Section where providing Markdown up-front is not mandatory
 
    public virtual string Markdown { get; }

    public string? Text { get; set; }

    public int? PageNumber { get; set; }

    public Dictionary<string, object?> Metadata { get; }
}

public sealed class DocumentSection : DocumentElement
{
    public DocumentSection(string markdown) : base(markdown);

    public DocumentSection() : base();

    public List<DocumentElement> Elements { get; }

    public override string Markdown { get; }
}

public sealed class DocumentParagraph : DocumentElement
{
    public DocumentParagraph(string markdown) : base(markdown);
}

public sealed class DocumentHeader : DocumentElement
{
    public DocumentHeader(string markdown) : base(markdown);

    public int? Level { get; set; }
}

public sealed class DocumentFooter : DocumentElement
{
    public DocumentFooter(string markdown) : base(markdown);
}

public sealed class DocumentTable : DocumentElement
{
    public DocumentTable(string markdown, string[,] cells) : base(markdown);

    // This information is useful when chunking large tables that exceed token count limit.
    public string[,] Cells { get; }
}

public sealed class DocumentImage : DocumentElement
{
    public DocumentImage(string markdown) : base(markdown);

    public ReadOnlyMemory<byte>? Content { get; set; }

    public string? MediaType { get; set; }

    public string? AlternativeText { get; set; }
}

public sealed class Document : IEnumerable<DocumentElement>
{
    public Document(string identifier);
    
    public string Identifier { get; }

    public List<DocumentSection> Sections { get; }

    public string Markdown { get; set; }

    /// <summary>
    /// Iterate over all elements in the document, including those in nested sections.
    /// </summary>
    /// <remarks>
    /// Sections themselves are not included.
    /// </remarks>
    public IEnumerator<DocumentElement> GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator();
}

API Usage

The following example uses the Document API to build a simple structured document:

Document doc = new("doc")
{
    Sections =
    {
        new DocumentSection()
        {
            Elements =
            {
                new DocumentHeader("# Section title"),
                new DocumentParagraph("This is a paragraph in section 1."),
                new DocumentParagraph("This is another paragraph in section 1."),
                new DocumentSection
                {
                    Elements =
                    {
                        new DocumentHeader("## Subsection title"),
                        new DocumentParagraph("This is a paragraph in subsection 1.1."),
                        new DocumentParagraph("This is another paragraph in subsection 1.1.")
                    }
                }
            }
        }
    }
};

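The next example iterates over all elements of a single Document and extracts their semantic content for generating embeddings:

// Document implements IEnumerable<DocumentElement>, so it can be enumerated directly.
foreach (DocumentElement element in document)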
{
    string? semanticContent = element is DocumentImage img
        ? img.AlternativeText ?? img.Text
        : element.Markdown;

    if (!string.IsNullOrEmpty(semanticContent))
    {
        yield return (element, semanticContent);
    }
}
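
The comment on DocumentTable.Cells mentions chunking large tables that exceed the token count limit. A minimal sketch of how the cells could be used for that is shown below. It is not part of the proposal: the `TableChunkingSketch` class and `SplitByRows` method are hypothetical, the sketch assumes that row 0 of the array holds the column headers, and it uses a fixed row count as a stand-in for a real token budget.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TableChunkingSketch
{
    // Splits a table into Markdown fragments holding at most maxRowsPerChunk data rows.
    // Assumption: row 0 is the header row; it is repeated at the top of every fragment.
    public static IEnumerable<string> SplitByRows(string[,] cells, int maxRowsPerChunk)
    {
        if (maxRowsPerChunk < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxRowsPerChunk));
        }

        int rows = cells.GetLength(0);
        int columns = cells.GetLength(1);

        for (int start = 1; start < rows; start += maxRowsPerChunk)
        {
            StringBuilder sb = new();
            AppendRow(sb, cells, 0, columns);

            // Markdown header separator row, e.g. "| --- | --- |".
            sb.Append('|');
            for (int c = 0; c < columns; c++)
            {
                sb.Append(" --- |");
            }
            sb.Append('\n');

            int end = Math.Min(start + maxRowsPerChunk, rows);
            for (int r = start; r < end; r++)
            {
                AppendRow(sb, cells, r, columns);
            }

            yield return sb.ToString();
        }
    }

    private static void AppendRow(StringBuilder sb, string[,] cells, int row, int columns)
    {
        sb.Append('|');
        for (int c = 0; c < columns; c++)
        {
            sb.Append(' ').Append(cells[row, c]).Append(" |");
        }
        sb.Append('\n');
    }
}
```

Repeating the header in every fragment keeps each chunk self-describing, which matters when chunks are embedded and retrieved independently. A real chunker would likely measure tokens rather than rows.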

Alternative Designs

Naming: Document may be a bit too generic. My current best idea for a different name is DocumentContent.

The fact that Document implements IEnumerable<DocumentElement> may be hard for end users to discover. Because of that, it may be easier to add an explicit Flatten method instead:

public sealed class Document
{
    public IEnumerable<DocumentElement> Flatten();
}
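
As a rough illustration of what such a method could do, here is a sketch of a depth-first traversal that yields leaf elements in document order and skips the sections themselves, matching the remarks on GetEnumerator. The types below are simplified stand-ins declared only for this sketch, `FlattenSketch` is a hypothetical helper, and the method returns an IEnumerable<DocumentElement> so that the result can be consumed with foreach.

```csharp
using System.Collections.Generic;

// Simplified stand-ins for the proposed types; only the members needed here.
public abstract class DocumentElement
{
    protected DocumentElement(string markdown) => Markdown = markdown;

    public string Markdown { get; }
}

public sealed class DocumentParagraph : DocumentElement
{
    public DocumentParagraph(string markdown) : base(markdown) { }
}

public sealed class DocumentSection : DocumentElement
{
    public DocumentSection() : base(string.Empty) { }

    public List<DocumentElement> Elements { get; } = new();
}

public static class FlattenSketch
{
    // Yields every non-section element, descending into nested sections depth-first.
    public static IEnumerable<DocumentElement> Flatten(IEnumerable<DocumentSection> sections)
    {
        foreach (DocumentSection section in sections)
        {
            foreach (DocumentElement leaf in FlattenSection(section))
            {
                yield return leaf;
            }
        }
    }

    private static IEnumerable<DocumentElement> FlattenSection(DocumentSection section)
    {
        foreach (DocumentElement element in section.Elements)
        {
            if (element is DocumentSection nested)
            {
                // Sections are containers only: recurse instead of yielding them.
                foreach (DocumentElement leaf in FlattenSection(nested))
                {
                    yield return leaf;
                }
            }
            else
            {
                yield return element;
            }
        }
    }
}
```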

Risks

Exposing a public Dictionary<string, object?> can cause serialization headaches in the future (durable document pipelines are on our radar). The main goal of the metadata is to allow storing any information provided by the DocumentReader that is specific to a given implementation. Examples:

  • ConfidenceScore: a double
  • BoundingRegions: a list of BoundingRegion structs (X, Y, Width, Height)

The DocumentElement.Metadata is not consumed by any of the chunkers we provide as of now (we focus on RAG), but it may be used by users to implement more advanced scenarios like recreating a document and preserving its structure.
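
To make the serialization concern concrete, the sketch below stores a ConfidenceScore in a Dictionary<string, object?> and round-trips it through System.Text.Json. The `BoundingRegion` struct and the `MetadataSketch` helper are hypothetical names used only for illustration. With System.Text.Json, values typed as object come back as JsonElement, so a consumer that expects a double no longer finds one after the round trip.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical reader-specific type, mirroring the BoundingRegions example above.
public readonly record struct BoundingRegion(double X, double Y, double Width, double Height);

public static class MetadataSketch
{
    // Builds metadata the way a DocumentReader implementation might.
    public static Dictionary<string, object?> CreateSample() => new()
    {
        ["ConfidenceScore"] = 0.97,
        ["BoundingRegions"] = new List<BoundingRegion> { new(10, 20, 200, 40) },
    };

    // Returns the score only when the stored value is still a double.
    public static double? TryGetConfidence(Dictionary<string, object?> metadata) =>
        metadata.TryGetValue("ConfidenceScore", out object? value) && value is double score
            ? score
            : (double?)null;

    // Serializes and deserializes the metadata; object values come back as JsonElement.
    public static Dictionary<string, object?> RoundTrip(Dictionary<string, object?> metadata)
    {
        string json = JsonSerializer.Serialize(metadata);
        return JsonSerializer.Deserialize<Dictionary<string, object?>>(json)!;
    }
}
```

Before the round trip, `TryGetConfidence` returns the stored score; afterwards it returns null because the value has become a JsonElement. A durable pipeline would need either a type-aware converter or a convention for how metadata values are read back.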

Labels: api-approved (API was approved in API review, it can be implemented), area-data-ingestion, blocking (API Review Board to prioritise)
