[API Proposal]: DataIngestion: Document representation #6893

### Background and motivation

`DataIngestion` is an ETL process: a `DocumentReader` parses a given document and represents it as a `Document`, which is then processed by zero or more processors, split into `Chunks`, and persisted in a vector store to enable vector search and RAG.

`Document` is a format-agnostic container that normalizes diverse input formats into a structured hierarchy. It's composed of `DocumentSection` objects, which can contain nested elements (including subsections). Each `DocumentElement` has to provide its content in Markdown format.

Markdown is extremely close to plain text, with minimal markup or formatting, but still provides a way to represent important document structure. Mainstream LLMs, such as OpenAI's GPT-4o, natively "speak" Markdown, and often incorporate Markdown into their responses unprompted. This suggests that they have been trained on vast amounts of Markdown-formatted text, and understand it well. As a side benefit, Markdown conventions are also highly token-efficient.

### API Proposal

```csharp
namespace Microsoft.Extensions.DataIngestion;

public abstract class DocumentElement
{
    protected internal DocumentElement(string markdown);

    // Used by DocumentSection, which is not required to provide Markdown up front.
    protected internal DocumentElement();

    public virtual string Markdown { get; }

    public string? Text { get; set; }

    public int? PageNumber { get; set; }

    public Dictionary<string, object?> Metadata { get; }
}

public sealed class DocumentSection : DocumentElement
{
    public DocumentSection(string markdown) : base(markdown);

    public DocumentSection() : base();

    public List<DocumentElement> Elements { get; }

    public override string Markdown { get; }
}

public sealed class DocumentParagraph : DocumentElement
{
    public DocumentParagraph(string markdown) : base(markdown);
}

public sealed class DocumentHeader : DocumentElement
{
    public DocumentHeader(string markdown) : base(markdown);

    public int? Level { get; set; }
}

public sealed class DocumentFooter : DocumentElement
{
    public DocumentFooter(string markdown) : base(markdown);
}

public sealed class DocumentTable : DocumentElement
{
    public DocumentTable(string markdown, string[,] cells) : base(markdown);

    // This information is useful when chunking large tables that exceed token count limit.
    public string[,] Cells { get; }
}

public sealed class DocumentImage : DocumentElement
{
    public DocumentImage(string markdown) : base(markdown);

    public ReadOnlyMemory<byte>? Content { get; set; }

    public string? MediaType { get; set; }

    public string? AlternativeText { get; set; }
}

public sealed class Document : IEnumerable<DocumentElement>
{
    public Document(string identifier);
    
    public string Identifier { get; }

    public List<DocumentSection> Sections { get; }

    public string Markdown { get; set; }

    /// <summary>
    /// Iterate over all elements in the document, including those in nested sections.
    /// </summary>
    /// <remarks>
    /// Sections themselves are not included.
    /// </remarks>
    public IEnumerator<DocumentElement> GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator();
}
```
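The comment on `DocumentTable.Cells` says the cell grid helps when a large table exceeds the token limit. The following is a minimal sketch of one way a chunker could use it, under several assumptions that are not part of the proposal: row 0 of `Cells` is treated as the header row, the limit is expressed as a row count rather than a token count, and cell values contain no `|` characters that would need escaping. `TableChunkingSketch` and `SplitRows` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of the proposed API.
public static class TableChunkingSketch
{
    // Splits a table into Markdown fragments holding at most maxRowsPerChunk data rows each.
    // Assumption: row 0 is the header row; it is repeated at the top of every fragment
    // so that each chunk remains understandable on its own.
    public static IEnumerable<string> SplitRows(string[,] cells, int maxRowsPerChunk)
    {
        if (maxRowsPerChunk < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxRowsPerChunk));
        }

        int rows = cells.GetLength(0);
        int columns = cells.GetLength(1);

        for (int start = 1; start < rows; start += maxRowsPerChunk)
        {
            StringBuilder sb = new();

            // Header row followed by the Markdown separator row.
            AppendRow(sb, cells, 0, columns);
            sb.Append('|');
            for (int c = 0; c < columns; c++)
            {
                sb.Append(" --- |");
            }
            sb.Append('\n');

            // Data rows that belong to this fragment.
            int end = Math.Min(start + maxRowsPerChunk, rows);
            for (int r = start; r < end; r++)
            {
                AppendRow(sb, cells, r, columns);
            }

            yield return sb.ToString();
        }
    }

    private static void AppendRow(StringBuilder sb, string[,] cells, int row, int columns)
    {
        sb.Append('|');
        for (int c = 0; c < columns; c++)
        {
            sb.Append(' ').Append(cells[row, c]).Append(" |");
        }
        sb.Append('\n');
    }
}
```

A real chunker would presumably measure each fragment with a tokenizer instead of counting rows. The point of the sketch is only that splitting by rows needs the grid, which cannot be recovered reliably from the flattened Markdown string.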


### API Usage

The following example uses the `Document` API to build a simple structured document:

```cs
Document doc = new("doc")
{
    Sections =
    {
        new DocumentSection()
        {
            Elements =
            {
                new DocumentHeader("# Section title"),
                new DocumentParagraph("This is a paragraph in section 1."),
                new DocumentParagraph("This is another paragraph in section 1."),
                new DocumentSection
                {
                    Elements =
                    {
                        new DocumentHeader("## Subsection title"),
                        new DocumentParagraph("This is a paragraph in subsection 1.1."),
                        new DocumentParagraph("This is another paragraph in subsection 1.1.")
                    }
                }
            }
        }
    }
};
```

The next example iterates over all elements and extracts their semantic content, which can then be used to generate embeddings:

```cs
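// `yield return` is only valid inside an iterator, so the loop is wrapped in a
// method here; the method name is illustrative.
static IEnumerable<(DocumentElement Element, string Content)> GetSemanticContent(Document document)
{
    foreach (DocumentElement element in document)
    {
        // For images, use the alternative text (falling back to Text) instead of Markdown.
        string? semanticContent = element is DocumentImage img
            ? img.AlternativeText ?? img.Text
            : element.Markdown;

        if (!string.IsNullOrEmpty(semanticContent))
        {
            yield return (element, semanticContent);
        }
    }
}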
```

### Alternative Designs

Naming: `Document` may be a bit too generic. My current best idea for a different name is `DocumentContent`.

The fact that `Document` implements `IEnumerable<DocumentElement>` may be hard for end users to discover. For that reason, it may be simpler to add an explicit `Flatten` method instead:

```csharp
public sealed class Document
{
    public IEnumerable<DocumentElement> Flatten();
}
```
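As a rough illustration of what such a method could do, here is a sketch of a depth-first traversal that yields elements in document order and skips the sections themselves, matching the enumeration semantics described in the API proposal. The types below are minimal stand-ins that reuse the proposed names, not the proposed implementation. The sketch also assumes `Flatten` returns `IEnumerable<DocumentElement>` so that it can be used directly in a `foreach` loop.

```csharp
using System.Collections.Generic;

// Minimal stand-ins for the proposed types; only the members the traversal needs.
public abstract class DocumentElement
{
    protected DocumentElement(string markdown) => Markdown = markdown;

    public string Markdown { get; }
}

public sealed class DocumentParagraph : DocumentElement
{
    public DocumentParagraph(string markdown) : base(markdown) { }
}

public sealed class DocumentSection : DocumentElement
{
    public DocumentSection() : base(string.Empty) { }

    public List<DocumentElement> Elements { get; } = new();
}

public sealed class Document
{
    public List<DocumentSection> Sections { get; } = new();

    // Depth-first traversal in document order.
    // Sections are descended into but never yielded themselves.
    public IEnumerable<DocumentElement> Flatten()
    {
        Stack<DocumentElement> pending = new();

        // Push in reverse so that the first section is popped first.
        for (int i = Sections.Count - 1; i >= 0; i--)
        {
            pending.Push(Sections[i]);
        }

        while (pending.Count > 0)
        {
            DocumentElement current = pending.Pop();

            if (current is DocumentSection section)
            {
                // Push children in reverse so that they are popped in document order.
                for (int i = section.Elements.Count - 1; i >= 0; i--)
                {
                    pending.Push(section.Elements[i]);
                }
            }
            else
            {
                yield return current;
            }
        }
    }
}
```

An explicit stack is used instead of recursive iterators so that deeply nested sections do not create a chain of nested enumerators. A recursive version would be shorter and equally valid for shallow documents.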

### Risks

Exposing a public `Dictionary<string, object?>` can cause serialization headaches in the future (durable document pipelines are on our radar). The main goal of the metadata is to store any information provided by the `DocumentReader` that is specific to a given implementation. Examples:
- `ConfidenceScore`: a double
- `BoundingRegions`: a list of `BoundingRegion` structs (X, Y, Width, Height)

`DocumentElement.Metadata` is not consumed by any of the chunkers we currently provide (we focus on RAG), but users may rely on it for more advanced scenarios, such as recreating a document while preserving its structure.
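To make the serialization concern concrete, here is a small sketch that uses `System.Text.Json` with default options. It is one possible serializer choice, not something the proposal prescribes. A value stored as `double` comes back as a `JsonElement` after a round trip, so a consumer has to know the expected type of every key. `MetadataRoundTripSketch`, `RoundTrip`, and `ReadDouble` are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper illustrating the round-trip behavior; not part of the proposed API.
public static class MetadataRoundTripSketch
{
    // Serializes the metadata to JSON and reads it back with default options.
    public static Dictionary<string, object?> RoundTrip(Dictionary<string, object?> metadata)
    {
        string json = JsonSerializer.Serialize(metadata);
        return JsonSerializer.Deserialize<Dictionary<string, object?>>(json)!;
    }

    // Reads a numeric value that may be either the original double
    // or the JsonElement produced by deserializing into object.
    public static double? ReadDouble(Dictionary<string, object?> metadata, string key)
    {
        if (!metadata.TryGetValue(key, out object? value))
        {
            return null;
        }

        if (value is double d)
        {
            return d;
        }

        if (value is JsonElement { ValueKind: JsonValueKind.Number } element)
        {
            return element.GetDouble();
        }

        return null;
    }
}
```

With this helper, `metadata["ConfidenceScore"] is double` holds before the round trip but not after it. The same applies to richer values such as a list of `BoundingRegion` structs, which would come back as a JSON array element rather than a typed list.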

