ParquetRsForDotnet

ParquetRsForDotnet is a .NET 8 library for reading and writing parquet data through Rust arrow-rs and parquet-rs with explicit Arrow-native APIs.

The library focuses on two Arrow-native models:

define an explicit ParquetSchema and append write batches through ParquetFileWriter
open a parquet file through ParquetFileReader, then read columns from explicit row-group readers

Repository Layout

src/ - managed library project and source code
tests/ - managed test project and end-to-end verification tests
native/ - Rust cdylib that performs Arrow import and parquet encoding

Architecture

At a high level, the library splits responsibility across managed and native layers:

managed (src/)
- validates the public schema and incoming batches
- materializes CLR arrays into Arrow arrays when needed
- exports Arrow schema and record batches through Arrow C Data
- imports Arrow schema and arrays from native reads
- exposes a seekable source callback table over a managed Stream
- exposes a sink callback table over a managed Stream
native Rust (native/)
- imports Arrow schema and record batches via Arrow FFI for writes
- exports Arrow schema and arrays via Arrow FFI for reads
- writes parquet using parquet::arrow::ArrowWriter
- reads parquet using parquet-rs Arrow readers with row-group/column projection
- streams parquet bytes through managed sink/source callbacks

Data flow:

Write:
CLR arrays / IArrowArray
    -> managed batch validation/materialization
    -> Arrow C Data export
    -> Rust Arrow FFI import
    -> parquet-rs writer
    -> managed Stream sink

Read:
managed Stream source callbacks
    -> parquet-rs reader
    -> Arrow C Data export
    -> managed Arrow import
    -> IArrowArray

Public API

Primary public API surface:

public sealed class ParquetFileWriter : IDisposable
{
    public ParquetFileWriter(Stream output, ParquetSchema schema, ParquetWriteOptions? options = null);

    public void WriteBatch(params Array[] columns);
    public void WriteBatch(IReadOnlyList<Array> columns);

    public void WriteBatch(params IArrowArray[] columns);
    public void WriteBatch(IReadOnlyList<IArrowArray> columns);

    public void Finish();
}

public sealed class ParquetFileReader : IDisposable
{
    public ParquetFileReader(Stream input);

    public ParquetSchema Schema { get; }
    public int RowGroupCount { get; }

    public ParquetRowGroupReader OpenRowGroupReader(int rowGroupIndex);
    public ParquetRowGroupReader OpenRowGroupReader(int rowGroupIndex, params string[] projectedColumnNames);
}

public sealed class ParquetRowGroupReader : IDisposable
{
    public int RowGroupIndex { get; }
    public long RowCount { get; }
    public int ColumnCount { get; }
    public int ProjectedColumnCount { get; }
    public IReadOnlyList<string>? ProjectedColumnNames { get; }

    public IArrowArray ReadColumn(int columnIndex);
    public IArrowArray ReadColumn(string columnName);

    public T[] ReadColumn<T>(int columnIndex);
    public T[] ReadColumn<T>(string columnName);
}

Supporting public schema types:

ParquetSchema
ParquetColumn
ParquetColumnType
ParquetDecimalSettings
ParquetTimestampSettings
ParquetTimestampUnit

`ParquetWriteOptions`

public sealed class ParquetWriteOptions
{
    public int? MaxRowGroupRows { get; init; }
    public long? MaxRowGroupBytes { get; init; }

    public ParquetCompression Compression { get; init; } = ParquetCompression.Zstd;
    public bool EnableDictionaryEncoding { get; init; } = true;
    public ParquetStatisticsLevel StatisticsLevel { get; init; } = ParquetStatisticsLevel.Chunk;
    public ArrowMaterializationMode ArrowMaterializationMode { get; init; } = ArrowMaterializationMode.Default;
    public int? NativeWriteBatchSize { get; init; }

    public string? CreatedBy { get; init; }
    public IReadOnlyDictionary<string, string>? FileMetadata { get; init; }
}

Option intent:

MaxRowGroupRows / MaxRowGroupBytes
- control parquet row-group boundaries and native row-group buffering
Compression
- selects the parquet compression codec
EnableDictionaryEncoding
- enables or disables default dictionary encoding
StatisticsLevel
- controls whether parquet statistics are omitted, written per row group, or written per page
ArrowMaterializationMode
- tuning knob for CLR array-backed batch materialization
NativeWriteBatchSize
- advanced knob that optionally overrides parquet-rs encoder chunk size; it does not split managed WriteBatch(...) calls, set data page size, or define row-group boundaries
- defaults to 8_192 rows in this library when unset
CreatedBy, FileMetadata
- flow into native parquet writer properties

Usage

Appendable batch write

using var destination = File.Create("batches.parquet");

var schema = new ParquetSchema(
[
    new ParquetColumn("id", ParquetColumnType.Int32, isNullable: true),
    new ParquetColumn("name", ParquetColumnType.String, isNullable: true),
    new ParquetColumn("amount", new ParquetDecimalSettings(18, 2), isNullable: true),
    new ParquetColumn("eventTime", new ParquetTimestampSettings(ParquetTimestampUnit.Millisecond, "UTC"), isNullable: true),
]);

using var writer = new ParquetFileWriter(destination, schema, new ParquetWriteOptions
{
    Compression = ParquetCompression.Zstd,
    MaxRowGroupRows = 100_000,
    CreatedBy = "MyEngine",
});

writer.WriteBatch(
    new int?[] { 1, 2 },
    new string?[] { "first", "second" },
    new decimal?[] { 10.5m, 20.25m },
    new DateTime?[]
    {
        new DateTime(2024, 6, 1, 1, 2, 3, DateTimeKind.Utc),
        new DateTime(2024, 6, 1, 4, 5, 6, DateTimeKind.Utc),
    });

writer.WriteBatch(
    new int?[] { 3, null },
    new string?[] { "third", null },
    new decimal?[] { 30.75m, null },
    new DateTime?[]
    {
        new DateTime(2024, 6, 1, 7, 8, 9, DateTimeKind.Utc),
        null,
    });

writer.Finish();

Notes:

column order must exactly match ParquetSchema.Columns
each batch must provide all columns
all columns in a batch must have the same row count
WriteBatch accepts either CLR arrays or IArrowArray values

Appendable Arrow-backed batch write

using Apache.Arrow;
using Apache.Arrow.Memory;
using Apache.Arrow.Types;

var allocator = MemoryAllocator.Default.Value;

var schema = new ParquetSchema(
[
    new ParquetColumn("id", ParquetColumnType.Int32, isNullable: true),
    new ParquetColumn("eventTime", new ParquetTimestampSettings(ParquetTimestampUnit.Millisecond, "UTC"), isNullable: true),
]);

using var destination = File.Create("arrow-batches.parquet");
using var writer = new ParquetFileWriter(destination, schema);

using var ids = new Int32Array.Builder().Append(1).AppendNull().Append(3).Build(allocator);
using var timestamps = new TimestampArray.Builder(new TimestampType(TimeUnit.Millisecond, "UTC"))
    .Append(new DateTimeOffset(new DateTime(2024, 7, 1, 10, 0, 0, DateTimeKind.Utc)))
    .AppendNull()
    .Append(new DateTimeOffset(new DateTime(2024, 7, 1, 12, 0, 0, DateTimeKind.Utc)))
    .Build(allocator);

writer.WriteBatch(ids, timestamps);

Arrow-native row-group column read

using var input = File.OpenRead("arrow-batches.parquet");
using var reader = new ParquetFileReader(input);
using var rowGroup = reader.OpenRowGroupReader(0, "id", "eventTime");

var idColumn = (Int32Array)rowGroup.ReadColumn("id");
var eventTimeColumn = (TimestampArray)rowGroup.ReadColumn("eventTime");

Notes:

CLR read materialization is also available through ReadColumn<T>(...)
decimal CLR reads materialize as SqlDecimal / SqlDecimal?
input streams must be seekable
reads are explicit by row group, then by column
optional row-group projection is selected by column names; integer reads still use original schema ordinals, not projected positions

Build And Test

Managed builds automatically invoke cargo build for the native writer when the Rust toolchain is available at %USERPROFILE%\.cargo\bin.

Build and test the full repository:

dotnet build ParquetRsForDotnet.slnx
dotnet test ParquetRsForDotnet.slnx

Run Rust tests directly:

cd native
cargo test

NuGet Packaging

The project produces a strong-name signed NuGet package with Source Link support, net8.0 and netstandard2.0 assemblies, and bundled native libraries.

netstandard2.0 consumers use DateTime / DateTime? arrays for Date32 and Date64 columns because DateOnly is not available on .NET Standard 2.0. net8.0 consumers continue to use DateOnly / DateOnly? for date columns.

The test project also targets net472 to verify that .NET Framework consumers can run through the netstandard2.0 asset.

Windows-only pack

Build and pack with the Windows native library only:

dotnet pack src/ParquetRsForDotnet.csproj -c Release

This produces ParquetRsForDotnet.0.1.1.nupkg and ParquetRsForDotnet.0.1.1.snupkg under src/bin/Release/.

Cross-build for Linux (opt-in)

To include the Linux native library in the package, install the cross-compilation toolchain once:

winget install zig.zig
cargo install cargo-zigbuild
rustup target add x86_64-unknown-linux-gnu

Then pack with both RIDs:

dotnet pack src/ParquetRsForDotnet.csproj -c Release -p:CrossBuildLinux=true

The resulting .nupkg will contain native libraries under runtimes/win-x64/native/ and runtimes/linux-x64/native/.

For package content verification and native asset troubleshooting, see docs/how-to/package-and-native-assets.md.

Benchmarks

The repository includes a BenchmarkDotNet project under benchmarks/ focused on true-parity multi-batch writer comparisons against ParquetSharp.

Run it with:

dotnet run -c Release --project benchmarks/ParquetRsForDotnet.Benchmarks.csproj

Current benchmark coverage focuses on two representative data patterns:

mixed nullable shape with int, string, decimal, and DateTime
no-decimal nullable shape with int, string, DateTime, and int

For both patterns, the benchmark keeps parity by using:

the same cached input data on both sides
multiple batches per file
one row group per batch
the same compression and logical types

Testing Strategy

The tests intentionally verify parquet output with Parquet.Net rather than relying only on magic-byte checks.

Current coverage includes:

CLR-array batch writes
Arrow-array batch writes
multi-batch writes
temporal and decimal writes
Arrow-native row-group column reads
schema/type validation
native sink behavior
native/source reader behavior
Rust-side option and writer tests

Contributor Notes

For implementation details, ownership rules, and extension points, see docs/read-write-architecture.md.

For AI-assisted consuming-repo integration, start with docs/ai/bootstrap.md and api/public-api.md. These stable public-contract artifacts are also packed into the NuGet package.

Notes

Cargo.lock is kept in source control for reproducible native builds
Rust build artifacts live under native/target/ and are ignored
managed bin/ and obj/ folders are ignored

TSA Bug Filing

TSA bug filing is configured through tsaoptions.json. Official builds are expected to keep TSA bug filing enabled. Learn more

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
.github		.github
.vscode		.vscode
api		api
benchmarks		benchmarks
docs		docs
examples		examples
native		native
src		src
tests		tests
.gitignore		.gitignore
AGENTS.md		AGENTS.md
LICENSE		LICENSE
ParquetRsForDotnet.slnx		ParquetRsForDotnet.slnx
ParquetRsForDotnet.snk		ParquetRsForDotnet.snk
README.md		README.md
azurepipelines-coverage.yml		azurepipelines-coverage.yml
nuget.config		nuget.config

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ParquetRsForDotnet

Repository Layout

Architecture

Public API

`ParquetWriteOptions`

Usage

Appendable batch write

Appendable Arrow-backed batch write

Arrow-native row-group column read

Build And Test

NuGet Packaging

Windows-only pack

Cross-build for Linux (opt-in)

Benchmarks

Testing Strategy

Contributor Notes

Notes

TSA Bug Filing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ParquetRsForDotnet

Repository Layout

Architecture

Public API

ParquetWriteOptions

Usage

Appendable batch write

Appendable Arrow-backed batch write

Arrow-native row-group column read

Build And Test

NuGet Packaging

Windows-only pack

Cross-build for Linux (opt-in)

Benchmarks

Testing Strategy

Contributor Notes

Notes

TSA Bug Filing

About

Topics

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`ParquetWriteOptions`

Packages