Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/notebooks.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,4 +4,5 @@ Notebooks
.. toctree::
:maxdepth: 1

Catalog Size Inspection <notebooks/catalog_size_inspection>
Cone Search / Nearest Neighbor <notebooks/cone_search>
155 changes: 155 additions & 0 deletions docs/notebooks/catalog_size_inspection.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,155 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "dcd4a1f5",
"metadata": {},
"source": [
"# Catalog Size Inspection\n",
"\n",
"In this notebook, we look at methods to explore the size of the parquet files in a hipscat'ed catalog.\n",
"\n",
"This can be useful to determine if your partitioning will lead to imbalanced datasets.\n",
"\n",
"Author: Melissa DeLucchi (delucchi@andrew.cmu.edu)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b90e3678",
"metadata": {},
"source": [
"## Fetch file sizes\n",
"\n",
"First, we fetch the size on disk of all the parquet files in our catalog. This stage may take some time, depending on how many partitions are in your catalog, and the load characteristics of your machine."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "088dea1b",
"metadata": {},
"outputs": [],
"source": [
"from hipscat.catalog.catalog import Catalog\n",
"from hipscat.io import paths\n",
"import os\n",
"\n",
"### Change this path!!!\n",
"catalog_dir = '../../tests/data/small_sky_order1'\n",
"\n",
"### ----------------\n",
"### You probably won't have to change anything from here.\n",
"\n",
"catalog = Catalog.read_from_hipscat(catalog_dir)\n",
"\n",
"info_frame = catalog.get_pixels().copy()\n",
"\n",
"for index, partition in info_frame.iterrows():\n",
" file_name = result = paths.pixel_catalog_file(catalog_dir, partition['Norder'], partition['Npix'])\n",
" info_frame.loc[index, \"size_on_disk\"] = os.path.getsize(file_name)\n",
"\n",
"info_frame = info_frame.astype(int)\n",
"info_frame[\"gbs\"] = info_frame[\"size_on_disk\"]/(1024 * 1024 * 1024)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "85d70e83",
"metadata": {},
"source": [
"## Summarize pixels and sizes\n",
"\n",
"* healpix orders: distinct healpix orders represented in the partitions\n",
"* num partitions: total number of partition files\n",
"\n",
"**Row size data** - using the `num_rows` field in the partition info, check the balance of data. The ideal row ratio is less than 10, but having a larger ratio doesn't mean there's a problem with your data.\n",
"\n",
"* min rows: number of rows in the smallest partition\n",
"* max rows: number of rows in the larget partition\n",
"* row ratio: max/min - rough indicator of how well your data may balance when distributed across many workers\n",
"\n",
"**Size on disk data** - using the file sizes fetched above, check the balance of your data. If your rows are fixed-width (e.g. no nested arrays, and few NaNs), the ratio here should be similar to the ratio above. If they're very different, and you experience problems when parallelizing operations on your data, you may consider re-structuring the data representation.\n",
"\n",
"* min size_on_disk: smallest file (in GB)\n",
"* max size_on_disk: largest file size (in GB)\n",
"* size_on_disk ratio: max/min\n",
"* total size_on_disk: sum of all parquet catalog files (actual catalog size may vary due to other metadata files)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1f604bb3",
"metadata": {},
"outputs": [],
"source": [
"print(f'healpix orders: {info_frame[\"Norder\"].unique()}')\n",
"print(f'num partitions: {len(info_frame[\"Npix\"])}')\n",
"print('------')\n",
"print(f'min rows: {info_frame[\"num_rows\"].min()}')\n",
"print(f'max rows: {info_frame[\"num_rows\"].max()}')\n",
"print(f'row ratio: {info_frame[\"num_rows\"].max()/info_frame[\"num_rows\"].min():.2f}')\n",
"print('------')\n",
"print(f'min size_on_disk: {info_frame[\"gbs\"].min():.2f}')\n",
"print(f'max size_on_disk: {info_frame[\"gbs\"].max():.2f}')\n",
"print(f'size_on_disk ratio: {info_frame[\"gbs\"].max()/info_frame[\"gbs\"].min():.2f}')\n",
"print(f'total size_on_disk: {info_frame[\"gbs\"].sum():.2f}')"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1c5bbe0e",
"metadata": {},
"source": [
"## File size distribution\n",
"\n",
"Below we look at histograms of file sizes.\n",
"\n",
"In our initial testing, we find that there's a \"sweet spot\" file size of 100MB-1GB. Files that are smaller create more overhead for individual reads. Files that are much larger may create slow-downs when cross-matching between catalogs. Files that are *much* larger can create out-of-memory issues for dask when loading from disk.\n",
"\n",
"The majority of your files should be in the \"sweet spot\", and no files in the \"too-big\" category."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "61e5c841",
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"plt.hist(info_frame[\"gbs\"])\n",
"\n",
"bins = [0,.5,1,2,100]\n",
"labels = [\"small-ish\", \"sweet-spot\", \"big-ish\", \"too-big\"]\n",
"hist = np.histogram(info_frame[\"gbs\"], bins=bins)[0]\n",
"pcts = hist / len(info_frame)\n",
"for i in range(0, len(labels)):\n",
" print(f\"{labels[i]} \\t: {hist[i]} \\t({pcts[i]*100:.1f} %)\")"
]
}
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}