Background and motivation
DataIngestion is an ETL process in which a DocumentReader parses a given document and represents it as a Document. The Document is then processed by zero or more processors, split into Chunks, and persisted in a vector store to enable vector search and RAG.
Document is a format-agnostic container that normalizes diverse input formats into a structured hierarchy. It is composed of DocumentSection objects, which can contain nested elements (including subsections). Each DocumentElement must provide its content in Markdown format.
Markdown is extremely close to plain text, with minimal markup or formatting, but still provides a way to represent important document structure. Mainstream LLMs, such as OpenAI's GPT-4o, natively "speak" Markdown, and often incorporate Markdown into their responses unprompted. This suggests that they have been trained on vast amounts of Markdown-formatted text, and understand it well. As a side benefit, Markdown conventions are also highly token-efficient.
API Proposal
namespace Microsoft.Extensions.DataIngestion;

public abstract class DocumentElement
{
    protected internal DocumentElement(string markdown);
    protected internal DocumentElement(); // ctor used by Section where providing Markdown up-front is not mandatory

    public virtual string Markdown { get; }
    public string? Text { get; set; }
    public int? PageNumber { get; set; }
    public Dictionary<string, object?> Metadata { get; }
}

public sealed class DocumentSection : DocumentElement
{
    public DocumentSection(string markdown) : base(markdown);
    public DocumentSection() : base();

    public List<DocumentElement> Elements { get; }
    public override string Markdown { get; }
}

public sealed class DocumentParagraph : DocumentElement
{
    public DocumentParagraph(string markdown) : base(markdown);
}

public sealed class DocumentHeader : DocumentElement
{
    public DocumentHeader(string markdown) : base(markdown);

    public int? Level { get; set; }
}

public sealed class DocumentFooter : DocumentElement
{
    public DocumentFooter(string markdown) : base(markdown);
}

public sealed class DocumentTable : DocumentElement
{
    public DocumentTable(string markdown, string[,] cells) : base(markdown);

    // This information is useful when chunking large tables that exceed the token count limit.
    public string[,] Cells { get; }
}

public sealed class DocumentImage : DocumentElement
{
    public DocumentImage(string markdown) : base(markdown);

    public ReadOnlyMemory<byte>? Content { get; set; }
    public string? MediaType { get; set; }
    public string? AlternativeText { get; set; }
}

public sealed class Document : IEnumerable<DocumentElement>
{
    public Document(string identifier);

    public string Identifier { get; }
    public List<DocumentSection> Sections { get; }
    public string Markdown { get; set; }

    /// <summary>
    /// Iterate over all elements in the document, including those in nested sections.
    /// </summary>
    /// <remarks>
    /// Sections themselves are not included.
    /// </remarks>
    public IEnumerator<DocumentElement> GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator();
}
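The comment on DocumentTable.Cells notes that the cell grid helps when a large table exceeds the token count limit. As a minimal sketch of that idea (not part of the proposal), the hypothetical SplitByRows helper below splits a string[,] into smaller grids that each repeat the header row. It assumes that row 0 is the header and uses a simple row budget as a stand-in for real token counting.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the proposed API: splits a table into
// smaller tables of at most maxDataRows data rows each, repeating the
// header row (assumed to be row 0) at the top of every chunk.
static IEnumerable<string[,]> SplitByRows(string[,] cells, int maxDataRows)
{
    int rows = cells.GetLength(0), cols = cells.GetLength(1);
    for (int start = 1; start < rows; start += maxDataRows)
    {
        int count = Math.Min(maxDataRows, rows - start);
        var chunk = new string[count + 1, cols];
        for (int c = 0; c < cols; c++)
        {
            chunk[0, c] = cells[0, c]; // copy the header row
        }
        for (int r = 0; r < count; r++)
        {
            for (int c = 0; c < cols; c++)
            {
                chunk[r + 1, c] = cells[start + r, c];
            }
        }
        yield return chunk;
    }
}

string[,] table =
{
    { "Name", "Qty" },
    { "Apple", "3" },
    { "Pear", "5" },
    { "Plum", "7" },
};

foreach (string[,] part in SplitByRows(table, maxDataRows: 2))
{
    Console.WriteLine($"{part.GetLength(0) - 1} data row(s), first: {part[1, 0]}");
}
```

A real chunker would presumably size each chunk by measuring the tokens of the rendered Markdown rather than counting rows, but the header-repeating split would look much the same.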
API Usage
The following example uses the Document API to build a simple structured document:
Document doc = new("doc")
{
    Sections =
    {
        new DocumentSection
        {
            Elements =
            {
                new DocumentHeader("# Section title"),
                new DocumentParagraph("This is a paragraph in section 1."),
                new DocumentParagraph("This is another paragraph in section 1."),
                new DocumentSection
                {
                    Elements =
                    {
                        new DocumentHeader("## Subsection title"),
                        new DocumentParagraph("This is a paragraph in subsection 1.1."),
                        new DocumentParagraph("This is another paragraph in subsection 1.1.")
                    }
                }
            }
        }
    }
};
The next example iterates over all elements and extracts the semantic content of each one, to be used for generating embeddings:
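// Hypothetical wrapper method; its name and signature are illustrative only.
static IEnumerable<(DocumentElement Element, string Content)> GetSemanticContent(Document document)
{
    foreach (DocumentElement element in document)
    {
        // For images, use the alternative text (or the extracted text) instead of the Markdown.
        string? semanticContent = element is DocumentImage img
            ? img.AlternativeText ?? img.Text
            : element.Markdown;

        if (!string.IsNullOrEmpty(semanticContent))
        {
            yield return (element, semanticContent);
        }
    }
}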
Alternative Designs
Naming: Document may be a bit too generic. My current best idea for a different name is DocumentContent.
The fact that Document implements IEnumerable<DocumentElement> may be hard for end users to discover. Because of that, it may be easier to just add an explicit Flatten method:
public sealed class Document
{
    public IEnumerator<DocumentElement> Flatten();
}
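To make this alternative more concrete, here is a minimal sketch of what an explicit Flatten method might do. The Element, Section, and Paragraph types are simplified stand-ins for the proposed types, not the real API, and the sketch returns IEnumerable<T> rather than IEnumerator<T> (an assumption) so that the result can be used directly in foreach. It follows the enumeration behavior documented in the proposal: nested sections are traversed depth-first, and the sections themselves are not returned.

```csharp
using System;
using System.Collections.Generic;

var root = new Section
{
    Elements =
    {
        new Paragraph("# Section title"),
        new Section { Elements = { new Paragraph("Nested paragraph.") } },
        new Paragraph("Last paragraph."),
    }
};

// Visits the three paragraphs in document order; sections are skipped.
foreach (Element element in Flattener.Flatten(new[] { root }))
{
    Console.WriteLine(element.Markdown);
}

// Simplified stand-ins for DocumentElement, DocumentParagraph and DocumentSection.
abstract class Element
{
    protected Element(string markdown) => Markdown = markdown;
    public string Markdown { get; }
}

sealed class Paragraph : Element
{
    public Paragraph(string markdown) : base(markdown) { }
}

sealed class Section : Element
{
    public Section() : base("") { }
    public List<Element> Elements { get; } = new();
}

static class Flattener
{
    // Depth-first traversal that yields non-section elements only.
    public static IEnumerable<Element> Flatten(IEnumerable<Element> elements)
    {
        foreach (Element element in elements)
        {
            if (element is Section section)
            {
                foreach (Element nested in Flatten(section.Elements))
                {
                    yield return nested;
                }
            }
            else
            {
                yield return element;
            }
        }
    }
}
```

The recursive iterator keeps the sketch short. For very deeply nested documents, an explicit stack would avoid the cost of chained enumerators.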
Risks
Exposing a public Dictionary<string, object?> can cause serialization headaches in the future (durable document pipelines are on our radar). The main goal of the metadata is to allow storing any information provided by the DocumentReader that is specific to a given implementation. Examples:
- ConfidenceScore: a double
- BoundingRegions: a list of BoundingRegion structs (X, Y, Width, Height)
DocumentElement.Metadata is not currently consumed by any of the chunkers we provide (we focus on RAG), but users may rely on it to implement more advanced scenarios, such as recreating a document while preserving its structure.
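As a rough sketch of the metadata usage described above, a reader implementation could attach its own values and a consumer could read them back with pattern matching. The BoundingRegion record struct, the sample values, and the exact key strings below are hypothetical. Only the Dictionary<string, object?> shape comes from the proposal.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for DocumentElement.Metadata, filled in by a hypothetical reader.
var metadata = new Dictionary<string, object?>
{
    ["ConfidenceScore"] = 0.93,
    ["BoundingRegions"] = new List<BoundingRegion> { new BoundingRegion(10, 20, 200, 40) },
};

// Values are untyped, so consumers have to check the runtime type.
if (metadata.TryGetValue("ConfidenceScore", out object? value) && value is double score)
{
    Console.WriteLine($"Confidence: {score}");
}

if (metadata.TryGetValue("BoundingRegions", out value) && value is List<BoundingRegion> regions)
{
    Console.WriteLine($"First region width: {regions[0].Width}");
}

// Hypothetical reader-specific type; not part of the proposal.
readonly record struct BoundingRegion(double X, double Y, double Width, double Height);
```

This untyped round trip is also where the serialization concern shows up: without extra type information, a serializer has no way to know that the BoundingRegions entry should come back as a List<BoundingRegion>.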