441 | 441 | "data_frame = GeoDataFrame.from_arrow(table)\n",
442 | 442 | "data_frame.plot()"
443 | 443 | ]
| 444 | + }, |
| 445 | + { |
| 446 | + "cell_type": "markdown", |
| 447 | + "id": "aff144a0", |
| 448 | + "metadata": {}, |
| 449 | + "source": [ |
| 450 | + "## Performance\n", |
| 451 | + "\n", |
| 452 | + "Let's do a small investigation into the performance characteristics of the two (partitioned, non-partitioned) datasets.\n", |
| 453 | + "We've uploaded them to the bucket `stac-fastapi-geoparquet-labs-375`, which is public via [requester pays](https://docs.aws.amazon.com/AmazonS3/latest/userguide/RequesterPaysBuckets.html).\n", |
| 454 | + "In all these examples, we've limited the returned item count to `10`." |
| 455 | + ] |
| 456 | + }, |
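| | + { |
| | + "cell_type": "markdown", |
| | + "id": "b0e1f2a3", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "As a quick sanity check (a sketch, assuming you have AWS credentials configured and `s3fs` installed), we can list the partitioned dataset's top-level prefixes to see its layout.\n", |
| | + "We assume hive-style `year=` partitions here, which is what the `year` filter further down relies on." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "id": "c4d5e6f7", |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "import s3fs\n", |
| | + "\n", |
| | + "# The bucket is requester pays, so opt in to paying for the transfer.\n", |
| | + "fs = s3fs.S3FileSystem(requester_pays=True)\n", |
| | + "\n", |
| | + "# List the (assumed) hive-style partition prefixes, e.g. year=2024/.\n", |
| | + "fs.ls(\"stac-fastapi-geoparquet-labs-375/its-live-partitioned\")" |
| | + ] |
| | + }, |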
| 457 | + { |
| 458 | + "cell_type": "code", |
| 459 | + "execution_count": 1, |
| 460 | + "id": "e6da363e", |
| 461 | + "metadata": {}, |
| 462 | + "outputs": [ |
| 463 | + { |
| 464 | + "name": "stderr", |
| 465 | + "output_type": "stream", |
| 466 | + "text": [ |
| 467 | + "building \"rustac\"\n", |
| 468 | + "rebuilt and loaded package \"rustac\" in 8.977s\n" |
| 469 | + ] |
| 470 | + }, |
| 471 | + { |
| 472 | + "name": "stdout", |
| 473 | + "output_type": "stream", |
| 474 | + "text": [ |
| 475 | + "Getting the first ten items\n", |
| 476 | + "Got 10 items from the non-partitioned dataset in 7.33 seconds\n", |
| 477 | + "Got 10 items from the partitioned dataset in 1.34 seconds\n" |
| 478 | + ] |
| 479 | + } |
| 480 | + ], |
| 481 | + "source": [ |
| 482 | + "import time\n", |
| 483 | + "from rustac import DuckdbClient\n", |
| 484 | + "\n", |
| 485 | + "client = DuckdbClient()\n", |
| 486 | + "\n", |
| 487 | + "href = \"s3://stac-fastapi-geoparquet-labs-375/its-live/**/*.parquet\"\n", |
| 488 | + "href_partitioned = (\n", |
| 489 | + " \"s3://stac-fastapi-geoparquet-labs-375/its-live-partitioned/**/*.parquet\"\n", |
| 490 | + ")\n", |
| 491 | + "\n", |
| 492 | + "print(\"Getting the first ten items\")\n", |
| 493 | + "start = time.time()\n", |
| 494 | + "items = client.search(href, limit=10)\n", |
| 495 | + "print(\n", |
| 496 | + " f\"Got {len(items)} items from the non-partitioned dataset in {time.time() - start:.2f} seconds\"\n", |
| 497 | + ")\n", |
| 498 | + "\n", |
| 499 | + "start = time.time()\n", |
| 500 | + "items = client.search(href_partitioned, limit=10)\n", |
| 501 | + "print(\n", |
| 502 | + " f\"Got {len(items)} items from the partitioned dataset in {time.time() - start:.2f} seconds\"\n", |
| 503 | + ")" |
| 504 | + ] |
| 505 | + }, |
| 506 | + { |
| 507 | + "cell_type": "code", |
| 508 | + "execution_count": 2, |
| 509 | + "id": "4e631b6d", |
| 510 | + "metadata": {}, |
| 511 | + "outputs": [ |
| 512 | + { |
| 513 | + "name": "stdout", |
| 514 | + "output_type": "stream", |
| 515 | + "text": [ |
| 516 | + "Searching by year\n", |
| 517 | + "Got 10 items from 2024 from the non-partitioned dataset in 19.33 seconds\n", |
| 518 | + "Got 10 items from 2024 the partitioned dataset in 62.54 seconds\n" |
| 519 | + ] |
| 520 | + } |
| 521 | + ], |
| 522 | + "source": [ |
| 523 | + "print(\"Searching by year\")\n", |
| 524 | + "start = time.time()\n", |
| 525 | + "items = client.search(\n", |
| 526 | + " href, limit=10, datetime=\"2024-01-01T00:00:00Z/2024-12-31T23:59:59Z\"\n", |
| 527 | + ")\n", |
| 528 | + "print(\n", |
| 529 | + " f\"Got {len(items)} items from 2024 from the non-partitioned dataset in {time.time() - start:.2f} seconds\"\n", |
| 530 | + ")\n", |
| 531 | + "\n", |
| 532 | + "start = time.time()\n", |
| 533 | + "items = client.search(\n", |
| 534 | + " href_partitioned, limit=10, datetime=\"2024-01-01T00:00:00Z/2024-12-31T23:59:59Z\"\n", |
| 535 | + ")\n", |
| 536 | + "print(\n", |
| 537 | + " f\"Got {len(items)} items from 2024 the partitioned dataset in {time.time() - start:.2f} seconds\"\n", |
| 538 | + ")" |
| 539 | + ] |
| 540 | + }, |
| 541 | + { |
| 542 | + "cell_type": "markdown", |
| 543 | + "id": "e0d8965a", |
| 544 | + "metadata": {}, |
| 545 | + "source": [ |
| 546 | + "The non-partitioned dataset has much smaller files, so the search for the first ten items in 2024 didn't take as long because it didn't have to read in large datasets across the network.\n", |
| 547 | + "Let's use the `year` partitioning filter to speed things up." |
| 548 | + ] |
| 549 | + }, |
| 550 | + { |
| 551 | + "cell_type": "code", |
| 552 | + "execution_count": 3, |
| 553 | + "id": "28b83009", |
| 554 | + "metadata": {}, |
| 555 | + "outputs": [ |
| 556 | + { |
| 557 | + "name": "stdout", |
| 558 | + "output_type": "stream", |
| 559 | + "text": [ |
| 560 | + "Got 10 items from 2024 the partitioned dataset, using `year`, in 1.09 seconds\n" |
| 561 | + ] |
| 562 | + } |
| 563 | + ], |
| 564 | + "source": [ |
| 565 | + "start = time.time()\n", |
| 566 | + "items = client.search(\n", |
| 567 | + " href_partitioned,\n", |
| 568 | + " limit=10,\n", |
| 569 | + " datetime=\"2024-01-01T00:00:00Z/2024-12-31T23:59:59Z\",\n", |
| 570 | + " filter=\"year=2024\",\n", |
| 571 | + ")\n", |
| 572 | + "print(\n", |
| 573 | + " f\"Got {len(items)} items from 2024 the partitioned dataset, using `year`, in {time.time() - start:.2f} seconds\"\n", |
| 574 | + ")" |
| 575 | + ] |
| 576 | + }, |
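| | + { |
| | + "cell_type": "markdown", |
| | + "id": "d8e9f0a1", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "Much better.\n", |
| | + "The filter works because the partition column lets DuckDB prune the list of files to scan before reading any data.\n", |
| | + "As a minimal sketch (not part of **rustac**, just an illustration), here's roughly the same pruning in DuckDB directly, assuming AWS credentials from the default provider chain; depending on your DuckDB version, you may also need to enable requester-pays support in the S3 secret for this bucket." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "id": "e2f3a4b5", |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "import duckdb\n", |
| | + "\n", |
| | + "con = duckdb.connect()\n", |
| | + "con.execute(\"INSTALL httpfs; LOAD httpfs;\")\n", |
| | + "con.execute(\"INSTALL aws; LOAD aws;\")\n", |
| | + "\n", |
| | + "# Assumption: AWS credentials come from the default provider chain.\n", |
| | + "# NOTE: the bucket is requester pays; your DuckDB setup must support that.\n", |
| | + "con.execute(\"CREATE SECRET (TYPE s3, PROVIDER credential_chain)\")\n", |
| | + "\n", |
| | + "# With hive_partitioning=true, the `year = 2024` predicate prunes the\n", |
| | + "# scan to just the files under the year=2024/ prefix.\n", |
| | + "con.sql(\n", |
| | + "    \"\"\"\n", |
| | + "    SELECT count(*)\n", |
| | + "    FROM read_parquet(\n", |
| | + "        's3://stac-fastapi-geoparquet-labs-375/its-live-partitioned/**/*.parquet',\n", |
| | + "        hive_partitioning = true\n", |
| | + "    )\n", |
| | + "    WHERE year = 2024\n", |
| | + "    \"\"\"\n", |
| | + ").show()" |
| | + ] |
| | + }, |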
| 577 | + { |
| 578 | + "cell_type": "markdown", |
| 579 | + "id": "e54bdca1", |
| 580 | + "metadata": {}, |
| 581 | + "source": [ |
| 582 | + "Much better.\n", |
| 583 | + "Now let's try a spatial search.\n", |
| 584 | + "During local testing, we determined that it wasn't even worth it to try against the non-partitioned dataset, as it takes too long." |
| 585 | + ] |
| 586 | + }, |
| 587 | + { |
| 588 | + "cell_type": "code", |
| 589 | + "execution_count": 5, |
| 590 | + "id": "a9fad4df", |
| 591 | + "metadata": {}, |
| 592 | + "outputs": [ |
| 593 | + { |
| 594 | + "name": "stdout", |
| 595 | + "output_type": "stream", |
| 596 | + "text": [ |
| 597 | + "Got 10 items over Helheim Glacier from the partitioned dataset in 9.33 seconds\n" |
| 598 | + ] |
| 599 | + } |
| 600 | + ], |
| 601 | + "source": [ |
| 602 | + "helheim = {\"type\": \"Point\", \"coordinates\": [-38.2, 66.65]}\n", |
| 603 | + "\n", |
| 604 | + "start = time.time()\n", |
| 605 | + "items = client.search(href_partitioned, limit=10, intersects=helheim)\n", |
| 606 | + "print(\n", |
| 607 | + " f\"Got {len(items)} items over Helheim Glacier from the partitioned dataset in {time.time() - start:.2f} seconds\"\n", |
| 608 | + ")" |
| 609 | + ] |
| 610 | + }, |
| 611 | + { |
| 612 | + "cell_type": "markdown", |
| 613 | + "id": "34cf6b59", |
| 614 | + "metadata": {}, |
| 615 | + "source": [ |
| 616 | + "For experimentation, we've also got a [stac-fastapi-geoparquet](https://github.com/stac-utils/stac-fastapi-geoparquet/) server pointing to the same partitioned dataset.\n", |
| 617 | + "Since spatial queries take a lot of data transfer from the DuckDB client to blob storage, is it any faster to query using the **stac-fastapi-geoparquet** lambda?" |
| 618 | + ] |
| 619 | + }, |
| 620 | + { |
| 621 | + "cell_type": "code", |
| 622 | + "execution_count": 7, |
| 623 | + "id": "000e1cd9", |
| 624 | + "metadata": {}, |
| 625 | + "outputs": [ |
| 626 | + { |
| 627 | + "name": "stdout", |
| 628 | + "output_type": "stream", |
| 629 | + "text": [ |
| 630 | + "Got 10 items over Helheim Glacier from the stac-fastapi-geoparquet server in 2.25 seconds\n" |
| 631 | + ] |
| 632 | + } |
| 633 | + ], |
| 634 | + "source": [ |
| 635 | + "import rustac\n", |
| 636 | + "import requests\n", |
| 637 | + "\n", |
| 638 | + "# Make sure the lambda is started\n", |
| 639 | + "response = requests.get(\"https://stac-geoparquet.labs.eoapi.dev\")\n", |
| 640 | + "response.raise_for_status()\n", |
| 641 | + "\n", |
| 642 | + "start = time.time()\n", |
| 643 | + "items = await rustac.search(\n", |
| 644 | + " \"https://stac-geoparquet.labs.eoapi.dev\",\n", |
| 645 | + " collections=[\"its-live-partitioned\"],\n", |
| 646 | + " intersects=helheim,\n", |
| 647 | + " max_items=10,\n", |
| 648 | + ")\n", |
| 649 | + "print(\n", |
| 650 | + " f\"Got {len(items)} items over Helheim Glacier from the stac-fastapi-geoparquet server in {time.time() - start:.2f} seconds\"\n", |
| 651 | + ")" |
| 652 | + ] |
444 | 653 | }
445 | 654 | ],
446 | 655 | "metadata": {