
Collections with many items saving time issue #1207

Open
@santilland

Using pystac[validation] 1.8.3

I am creating collections with a large number of items and was surprised by how long it took to save them. I have done some very preliminary tests, and it seems that the save time grows much faster than linearly with the number of items in a collection.
For example, saving a catalog with one collection takes the following time, depending on item count:

Items    Time
200      0.225s
2000     5.439s
10000    105.975s

If I create 5 collections with 2000 items each, the saving time is 25s. So the same total number of items is saved, but it takes about 4 times less time when they are split across multiple collections.
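
As a rough sanity check on the numbers above, the growth rate can be estimated from the log-log slope between consecutive measurements. This is only a sketch based on the three timings reported here (the helper name `local_exponents` is made up for illustration); with so few data points it indicates a trend, nothing more.

```python
import math

# Timings reported above: (items in the single collection, save time in seconds)
timings = [(200, 0.225), (2000, 5.439), (10000, 105.975)]


def local_exponents(points):
    """Return the log-log slope between each pair of consecutive measurements."""
    return [
        math.log(t2 / t1) / math.log(n2 / n1)
        for (n1, t1), (n2, t2) in zip(points, points[1:])
    ]


exponents = local_exponents(timings)
for (n1, _), (n2, _), k in zip(timings, timings[1:], exponents):
    print(f"{n1} -> {n2} items: time grows roughly like n^{k:.2f}")

# Five independent 2000-item saves, extrapolated from the single-collection timing
estimate_5x2000 = 5 * 5.439
print(f"5 x 2000 items, estimated: {estimate_5x2000:.1f}s (measured: 25s)")
```

Both slopes come out between 1 and 2, which would point to superlinear, possibly close to quadratic, growth per collection. The 5 x 2000 estimate is also close to the measured 25s, which is consistent with the cost depending on the number of items per collection and not on the total item count.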

Any ideas why this could be happening?

Here is a very rough testing script:


import time
from datetime import (
    datetime,
    timedelta,
)
from pystac import (
    Item,
    Catalog,
    CatalogType,
    Collection,
    Extent,
    SpatialExtent,
    TemporalExtent,
)
from pystac.layout import TemplateLayoutStrategy

numdays = 10000
number_of_collections = 1
base = datetime.today()
times = [base - timedelta(days=x) for x in range(numdays)]

catalog = Catalog(
    id="test",
    description="catalog to test performance",
    title="performance test catalog",
    catalog_type=CatalogType.RELATIVE_PUBLISHED,
)

spatial_extent = SpatialExtent([
    [-180.0, -90.0, 180.0, 90.0],
])
# Placeholder open-ended interval [start, end]; recomputed from the items later
temporal_extent = TemporalExtent([[datetime.now(), None]])
extent = Extent(spatial=spatial_extent, temporal=temporal_extent)


for idx in range(number_of_collections):
    collection = Collection(
        id=f"big_collection{idx}",
        title="collection for items",
        description="some desc",
        extent=extent,
    )
    # One item per day, each covering the whole globe
    for t in times:
        item = Item(
            id=t.isoformat(),
            bbox=[-180.0, -90.0, 180.0, 90.0],
            properties={},
            geometry=None,
            datetime=t,
        )
        collection.add_item(item)
    # Replace the placeholder extent with one computed from the items
    collection.update_extent_from_items()

    catalog.add_child(collection)

strategy = TemplateLayoutStrategy(item_template="${collection}/${year}")
catalog.normalize_hrefs("https://exampleurl.com/", strategy=strategy)

start_time = time.perf_counter()
catalog.save(dest_href="../test_build/")
end_time = time.perf_counter()
print(f"Saving Time : {end_time - start_time:0.6f}" )
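
To narrow down where the time goes, the save call could be run under the standard-library profiler. The following is a minimal sketch, not something taken from pystac itself: `profile_call` is a hypothetical helper, and the commented usage assumes the `catalog` object from the script above.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile; return its result and a text report.

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


# Assumed usage with the script above (requires pystac and the catalog object):
# _, report = profile_call(catalog.save, dest_href="../test_build/")
# print(report)
```

Sorting by cumulative time should surface the functions that dominate the save, which would help tell whether the cost sits in link resolution, serialization, or file I/O.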
