@Thorfinson
Member

This PR adds an interactive data preprocessing wizard that enables users to upload, clean, and process CSV/Parquet files directly in the browser without requiring command-line tools.

Users can now prepare datasets through a guided workflow:

  • Upload Data - Drag-and-drop CSV/Parquet files with automatic profiling and quality scoring
  • Select Columns - Choose which columns to include with visual statistics (sparklines, box plots)
  • Clean Data - Configure handling for missing values, outliers (IQR/Z-Score methods), and duplicates
  • Configure Features - Set up encoding (One-Hot/Label), scaling (Standard/MinMax/Robust), and glyph mapping (exactly 5 features with smart suggestions)
  • Projection Settings - Enable/configure PCA and t-SNE dimensionality reduction with customizable parameters
  • Review & Process - Review configuration, execute processing, and load results into the dashboard
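
For reviewers unfamiliar with the two outlier methods named in the Clean Data step, here is a minimal sketch of how IQR and Z-Score flagging typically work. It is illustrative only and is not the code in this PR: the function names, the `k=1.5` fence multiplier, and the `threshold=3.0` default are assumptions, and it uses only the Python standard library rather than pandas.

```python
import statistics


def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def zscore_outlier_mask(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds `threshold` (hypothetical helper)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column has no spread, so nothing can be an outlier.
        return [False] * len(values)
    return [abs(v - mean) / std > threshold for v in values]


sample = [10, 12, 11, 13, 12, 11, 100]
iqr_flags = iqr_outlier_mask(sample)
z_flags = zscore_outlier_mask(sample, threshold=2.0)
```

On very small columns the Z-Score method can miss obvious outliers at the usual threshold of 3, because a single extreme value also inflates the standard deviation. The IQR method is less sensitive to this.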

Key Features

  • Real-time data profiling using a Pyodide worker (pandas + scikit-learn in WebAssembly)
  • Smart defaults based on data types (automatic encoding/scaling method selection)
  • Variance-based feature suggestions for glyph visualization mapping
  • Progress tracking with live updates during processing
  • Configuration export/import as JSON for reproducibility
  • Session persistence via localStorage
  • File support: CSV and Parquet (up to 150 MB)
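
As a rough illustration of the variance-based suggestion idea, the sketch below ranks numeric columns by variance and returns the top five for glyph mapping. This is an assumption about how such a ranking could work, not the implementation in this PR: `suggest_glyph_features` is a hypothetical name, and the min-max scaling before computing variance is an added assumption so that columns with large units do not dominate the ranking.

```python
import statistics

GLYPH_FEATURE_COUNT = 5  # the wizard maps exactly 5 features to a glyph


def suggest_glyph_features(columns, count=GLYPH_FEATURE_COUNT):
    """Return up to `count` column names, highest scaled variance first.

    `columns` maps a column name to a list of numeric values (hypothetical shape).
    """
    scores = {}
    for name, values in columns.items():
        lo, hi = min(values), max(values)
        if hi == lo:
            # Constant columns carry no information for a glyph axis.
            scores[name] = 0.0
            continue
        scaled = [(v - lo) / (hi - lo) for v in values]
        scores[name] = statistics.pvariance(scaled)
    ranked = sorted(scores, key=lambda n: scores[n], reverse=True)
    return ranked[:count]


columns = {
    "a": [0, 1, 0, 1],
    "b": [0, 0, 0, 1],
    "c": [1, 1, 1, 1],  # constant, should never be suggested here
    "d": [0, 1, 2, 3],
    "e": [0, 0, 1, 1],
    "f": [5, 5, 5, 6],
}
suggested = suggest_glyph_features(columns)
```

With six candidate columns, one of them constant, the five non-constant columns are suggested and the constant one is left out.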

@Thorfinson Thorfinson requested a review from dkammer December 21, 2025 16:04