ReadableWeb

Overview

Small, modular .NET 10 library and tools to extract article content, images and metadata from web pages. The solution contains extractors, abstractions, HTML parsing implementations, tests and benchmarks.

Projects

ReadableWeb.Abstractions — public interfaces for extractors and processors
ReadableWeb — composition and higher-level services
ReadableWeb.HtmlAgilityPack — HTML Agility Pack based extractor implementation
ReadableWeb.AngleSharp — AngleSharp based extractor implementation
ReadableWeb.Tests — unit tests
ReadableWeb.Benchmarks — benchmark projects
ReadableWeb.TestConsole — sample/test console app

Requirements

.NET 10 SDK
Optional: dotnet-ef or other tooling only if needed for local tasks

Usage examples

Quick extraction via the default HTTP helper:

using ReadableWeb.Extraction;

var extractor = HttpArticleExtractor.CreateDefault();
var article = await extractor.ExtractFromUrlAsync(
    "https://www.example.com/news/story");

Console.WriteLine(article.Title);
Console.WriteLine(article.Excerpt);
foreach (var image in article.Images)
{
    Console.WriteLine($"Image: {image.Url}");
}

Register the library in an ASP.NET Core or worker service using the provided DI extension:

using ReadableWeb;
using ReadableWeb.Configuration;

builder.Services.AddArticleExtraction(builder.Configuration, "ArticleExtraction", options =>
{
    options.EnableImageFileCache = true;
    options.ImageFileCachePath = Path.Combine(builder.Environment.ContentRootPath, "wwwroot/images");
});

Caching

ReadableWeb supports two complementary caching mechanisms to improve performance and reduce bandwidth when extracting articles with images:

Image file cache (local filesystem)

Enable this with EnableImageFileCache = true in the configuration or by setting the option in the DI extension. When enabled the extractor will download image assets referenced by the extracted article and persist them to the specified ImageFileCachePath on disk.
ImageFileCacheBaseUrl should be set to the public base URL segment where the cached images will be served from (for example /article-images). The extractor will rewrite image URLs in the returned ArticleContent to point at the base URL plus the cached filename.
ImageFileCachePath is a filesystem path relative to your application root (or an absolute path). Make sure the directory is writable by the process and is served by your web server (for example, place it under wwwroot in ASP.NET Core or configure static file serving for that path).
IgnoreImageDownloadErrors = true prevents extraction from failing when individual image downloads fail. When false, image download failures may surface as errors.

Example configuration for local image caching:

{
  "ArticleExtraction": {
    "Parser": "HtmlAgilityPack",
    "UseRedis": false,
    "EnableImageFileCache": true,
    "ImageFileCachePath": "wwwroot/article-images",
    "ImageFileCacheBaseUrl": "/article-images",
    "IgnoreImageDownloadErrors": true
  }
}

Distributed cache (Redis)

If UseRedis = true the library will use the configured distributed cache (typically backed by Redis) to store extract results and/or intermediate data depending on your configuration. This reduces repeated extraction work for the same URLs across multiple instances.
To use Redis, register and configure IDistributedCache (for example Microsoft.Extensions.Caching.StackExchangeRedis) in your application and make sure the ArticleExtraction configuration section enables UseRedis.

Example configuration snippet enabling Redis + image cache:

{
  "ArticleExtraction": {
    "Parser": "HtmlAgilityPack",
    "UseRedis": true,
    "EnableImageFileCache": true,
    "ImageFileCachePath": "wwwroot/article-images",
    "ImageFileCacheBaseUrl": "/article-images",
    "IgnoreImageDownloadErrors": true
  }
}

Behavior notes and operational tips

Cached images are intended to be served as static files. Ensure your web server serves files from ImageFileCachePath at the ImageFileCacheBaseUrl you configured.
The library may use deterministic filenames (for example based on a hash of the original image URL) — treat the cache directory as opaque and prefer clearing it via administrative processes when you need to invalidate images.
If you enable a distributed cache (Redis) you should configure an appropriate TTL and eviction policy on the cache side if you need automatic expiration; the library relies on the configured IDistributedCache implementation for storage semantics.
When IgnoreImageDownloadErrors is enabled the extraction will still return article text and metadata even if some images fail to download. Disable this setting during debugging to surface download issues.

Expose the extractor through a minimal API endpoint:

using ReadableWeb.Extraction;

var app = builder.Build();

app.MapGet("/api/article", async (
        [FromServices] IHttpArticleExtractor extractor,
        [FromQuery] string url,
        CancellationToken token) =>
    {
        if (string.IsNullOrWhiteSpace(url))
        {
            return Results.BadRequest("Url query parameter is required.");
        }

        var article = await extractor.ExtractFromUrlAsync(url, cancellationToken: token);
        return Results.Ok(article);
    })
   .Produces<ArticleContent>()
   .WithName("GetArticle");

app.Run();

Or wire it up in a conventional API controller:

using ReadableWeb.Extraction;
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/[controller]")]
public class ArticleController : ControllerBase
{
    private readonly IHttpArticleExtractor _extractor;

    public ArticleController(IHttpArticleExtractor extractor)
    {
        _extractor = extractor;
    }

    [HttpGet]
    public async Task<IActionResult> Get([FromQuery] string url, CancellationToken token)
    {
        if (string.IsNullOrWhiteSpace(url))
        {
            return BadRequest("Url query parameter is required.");
        }

        var article = await _extractor.ExtractFromUrlAsync(url, cancellationToken: token);
        return Ok(article);
    }
}

Example configuration section consumed by AddArticleExtraction:

{
  "ArticleExtraction": {
    "Parser": "HtmlAgilityPack",
    "UseRedis": false,
    "EnableImageFileCache": true,
    "ImageFileCachePath": "wwwroot/article-images",
    "ImageFileCacheBaseUrl": "/article-images",
    "IgnoreImageDownloadErrors": true
  }
}

Build

Restore and build all projects:

dotnet restore
dotnet build --configuration Release

Run tests

Run unit tests from solution root:

dotnet test

Run benchmarks

Benchmarks use BenchmarkDotNet. Run from the benchmark project directory:

dotnet run -c Release -p ReadableWeb.Benchmarks

Package and publish

This repository includes a GitHub Actions workflow to pack and publish NuGet packages: .github/workflows/publish-nuget.yml. The workflow builds and packs with a version based on commit count and pushes packages to NuGet when the NUGET_API_KEY secret is provided.

Dependency updates

Dependabot configuration is provided in .github/dependabot.yml to open weekly PRs for NuGet package updates.

Contributing

Open issues or PRs for bugs and improvements
Follow existing coding conventions in the repository
Update or add tests for behavior changes

License

No license file included in the repository. Add a LICENSE file if you intend to open source this code.

Contact

For local development questions, run the sample console app ReadableWeb.TestConsole or inspect tests in ReadableWeb.Tests.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

ReadableWeb

Overview

Projects

Requirements

Usage examples

Caching

Behavior notes and operational tips

Build

Run tests

Run benchmarks

Package and publish

Dependency updates

Contributing

License

Contact

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
.github		.github
ReadableWeb.Abstractions		ReadableWeb.Abstractions
ReadableWeb.AngleSharp.Tests		ReadableWeb.AngleSharp.Tests
ReadableWeb.AngleSharp		ReadableWeb.AngleSharp
ReadableWeb.Benchmarks		ReadableWeb.Benchmarks
ReadableWeb.HtmlAgilityPack		ReadableWeb.HtmlAgilityPack
ReadableWeb.TestConsole		ReadableWeb.TestConsole
ReadableWeb.Tests		ReadableWeb.Tests
ReadableWeb		ReadableWeb
.gitattributes		.gitattributes
.gitignore		.gitignore
GitVersion.yml		GitVersion.yml
LICENSE.txt		LICENSE.txt
README.md		README.md
ReadableWeb.slnx		ReadableWeb.slnx

License

robertmayordomo/ReadableWeb

Folders and files

Latest commit

History

Repository files navigation

ReadableWeb

Overview

Projects

Requirements

Usage examples

Caching

Behavior notes and operational tips

Build

Run tests

Run benchmarks

Package and publish

Dependency updates

Contributing

License

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages