NYC Taxi Data Example

Get started with TimescaleDB using New York City taxi trip data. This example demonstrates how to handle high-volume transportation data with location-based analytics and time-series aggregations.

What You'll Learn

  • How to model high-volume transportation data with lat/lon coordinates
  • Time-series aggregations with time_bucket()
  • Optimal segmentation strategies for compression
  • Revenue and usage pattern analysis
  • Loading data with direct to columnstore for instant performance

Prerequisites

  • Docker installed
  • psql PostgreSQL client
  • 15-20 minutes

Quick Start

Step 1: Start TimescaleDB

You have two options to start TimescaleDB:

Option 1: One-line install (Recommended)

The easiest way to get started:

Important: This script is intended for local development and testing only. Do not use it for production deployments. For production-ready installation options, see the TimescaleDB installation guide.

Linux/Mac:

curl -sL https://tsdb.co/start-local | sh

This command:

  • Downloads and starts TimescaleDB (if not already downloaded)
  • Exposes PostgreSQL on port 6543 (a non-standard port to avoid conflicts with other PostgreSQL instances on port 5432)
  • Automatically tunes settings for your environment using timescaledb-tune
  • Sets up a persistent data volume

Option 2: Manual Docker command (use this on Windows)

Alternatively, you can run TimescaleDB directly with Docker:

docker run -d --name timescaledb \
    -p 6543:5432 \
    -e POSTGRES_PASSWORD=password \
    timescale/timescaledb-ha:pg18

Note: We use port 6543 (mapped to container port 5432) to avoid conflicts if you have other PostgreSQL instances running on the standard port 5432.

Wait about 1-2 minutes for TimescaleDB to download and initialize.

Step 2: Connect to TimescaleDB

Connect using psql:

psql -h localhost -p 6543 -U postgres
# When prompted, enter password: password

You should see the PostgreSQL prompt. Verify TimescaleDB is installed:

SELECT extname, extversion FROM pg_extension WHERE extname = 'timescaledb';

Expected output:

   extname   | extversion
-------------+------------
 timescaledb | 2.x.x

Prefer a GUI? If you'd rather use a graphical tool instead of the command line, you can download pgAdmin and connect to TimescaleDB using the same connection details (host: localhost, port: 6543, user: postgres, password: password).

Step 3: Create the Schema

Create the optimized hypertable by running this SQL in your psql session:

-- Create the hypertable with optimal settings for NYC Taxi data
-- This automatically enables columnstore for fast analytical queries
CREATE TABLE trips (
    vendor_id TEXT,
    pickup_boroname VARCHAR,
    pickup_datetime TIMESTAMP WITHOUT TIME ZONE NOT NULL,
    dropoff_datetime TIMESTAMP WITHOUT TIME ZONE NOT NULL,
    passenger_count NUMERIC,
    trip_distance NUMERIC,
    pickup_longitude NUMERIC,
    pickup_latitude NUMERIC,
    rate_code INTEGER,
    dropoff_longitude NUMERIC,
    dropoff_latitude NUMERIC,
    payment_type VARCHAR,
    fare_amount NUMERIC,
    extra NUMERIC,
    mta_tax NUMERIC,
    tip_amount NUMERIC,
    tolls_amount NUMERIC,
    improvement_surcharge NUMERIC,
    total_amount NUMERIC
) WITH (
    tsdb.hypertable,
    tsdb.partition_column='pickup_datetime',
    tsdb.enable_columnstore=true,
    tsdb.segmentby='pickup_boroname',
    tsdb.orderby='pickup_datetime DESC'
);

-- Create indexes 
CREATE INDEX idx_trips_pickup_time ON trips (pickup_datetime DESC);
CREATE INDEX idx_trips_borough_time ON trips (pickup_boroname, pickup_datetime DESC);

This creates a trips table with:

  • Automatic time-based partitioning on pickup_datetime
  • Columnstore enabled for fast analytical queries
  • Segmentation by pickup_boroname for optimal compression (6 boroughs)
  • Full trip details including fares, distances, and coordinates
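To double-check the result, you can inspect the hypertable catalog view (a minimal sketch; the `timescaledb_information.hypertables` view ships with TimescaleDB 2.x):

```sql
-- Confirm that trips was created as a hypertable
SELECT hypertable_name, num_chunks
FROM timescaledb_information.hypertables
WHERE hypertable_name = 'trips';
```

At this point `num_chunks` will be 0; chunks are created automatically as data arrives in Step 4.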

Step 4: Load Sample Data

First, download and decompress the sample data:

# Download the sample data
wget https://assets.timescale.com/timescaledb-datasets/nyc_taxi_sample_nov_dec_2015.csv.gz

# Decompress the CSV file
gunzip nyc_taxi_sample_nov_dec_2015.csv.gz

# This will create nyc_taxi_sample_nov_dec_2015.csv ready for loading

We provide two approaches for loading data. Choose based on your needs:

Option A: Direct to Columnstore (Recommended - Instant Performance)

This approach writes data directly to the columnstore, bypassing the rowstore entirely. You get instant analytical performance.

From command line:

psql -h localhost -p 6543 -U postgres \
  -v ON_ERROR_STOP=1 \
  -c "SET timescaledb.enable_direct_compress_copy = on;
      COPY trips FROM STDIN WITH (FORMAT csv, HEADER true);" \
  < nyc_taxi_sample_nov_dec_2015.csv

This command reads the CSV file from your local filesystem and pipes it to PostgreSQL, which loads it directly into the columnstore.

Verify data loaded:

SELECT COUNT(*) FROM trips;

Option B: Standard COPY (Fallback)

This approach loads data into the rowstore first. A background policy converts it to the columnstore (typically within 12-24 hours) for faster querying.

From command line:

psql -h localhost -p 6543 -U postgres \
  -v ON_ERROR_STOP=1 \
  -c "COPY trips FROM STDIN WITH (FORMAT csv, HEADER true);" \
  < nyc_taxi_sample_nov_dec_2015.csv

Verify data loaded:

SELECT COUNT(*) FROM trips;

Manually convert to columnstore (Optional):

If you loaded data using the standard COPY, a background process will convert your rowstore data to the columnstore within 12-24 hours. To get the best query performance right away, you can convert it manually:

DO $$
DECLARE ch TEXT;
BEGIN
    FOR ch IN SELECT show_chunks('trips') LOOP
        CALL convert_to_columnstore(ch);
    END LOOP;
END $$;

Step 5: Run Sample Queries

Now let's explore the data with some analytical queries. Run these in your psql session:

Query 1: Overall statistics

\timing on

SELECT
    COUNT(*) as total_trips,
    ROUND(SUM(fare_amount)::numeric, 2) as total_revenue,
    ROUND(AVG(fare_amount)::numeric, 2) as avg_fare,
    ROUND(AVG(trip_distance)::numeric, 2) as avg_distance
FROM trips;

Query 2: Breakdown by vendor

SELECT
    vendor_id,
    COUNT(*) as trips,
    ROUND(AVG(fare_amount)::numeric, 2) as avg_fare,
    ROUND(AVG(tip_amount)::numeric, 2) as avg_tip,
    ROUND(AVG(passenger_count)::numeric, 2) as avg_passengers
FROM trips
GROUP BY vendor_id
ORDER BY trips DESC;

Query 3: Hourly patterns using time_bucket

SELECT
    time_bucket('1 hour', pickup_datetime) AS hour,
    COUNT(*) as trips,
    ROUND(AVG(fare_amount)::numeric, 2) as avg_fare,
    ROUND(SUM(tip_amount)::numeric, 2) as total_tips
FROM trips
GROUP BY hour
ORDER BY hour DESC
LIMIT 24;

Query 4: Payment type analysis

SELECT
    payment_type,
    COUNT(*) as trip_count,
    ROUND(SUM(fare_amount)::numeric, 2) as total_revenue,
    ROUND(AVG(trip_distance)::numeric, 2) as avg_distance,
    ROUND(AVG(tip_amount)::numeric, 2) as avg_tip
FROM trips
GROUP BY payment_type
ORDER BY total_revenue DESC;

Query 5: Daily statistics by borough

SELECT
    time_bucket('1 day', pickup_datetime) AS day,
    pickup_boroname,
    COUNT(*) as trips,
    ROUND(AVG(fare_amount)::numeric, 2) as avg_fare,
    ROUND(MAX(fare_amount)::numeric, 2) as max_fare
FROM trips
GROUP BY day, pickup_boroname
ORDER BY day DESC, pickup_boroname
LIMIT 20;

Query 6: Trips by distance category

SELECT
    CASE
        WHEN trip_distance < 1 THEN 'Short (< 1 mile)'
        WHEN trip_distance < 5 THEN 'Medium (1-5 miles)'
        WHEN trip_distance < 10 THEN 'Long (5-10 miles)'
        ELSE 'Very Long (> 10 miles)'
    END as distance_category,
    COUNT(*) as trips,
    ROUND(AVG(fare_amount)::numeric, 2) as avg_fare,
    ROUND(AVG(tip_amount)::numeric, 2) as avg_tip
FROM trips
GROUP BY distance_category
ORDER BY trips DESC;

What's Happening Behind the Scenes?

Hypertables

When you create a table with tsdb.hypertable, TimescaleDB automatically:

  • Partitions your data into time-based chunks (default: 7 days per chunk)
  • Manages chunk lifecycle automatically
  • Optimizes queries to scan only relevant chunks
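Once data is loaded, you can see these chunks for yourself (a sketch using the `timescaledb_information.chunks` view):

```sql
-- List each chunk with the time range it covers
SELECT chunk_name, range_start, range_end
FROM timescaledb_information.chunks
WHERE hypertable_name = 'trips'
ORDER BY range_start;
```

A query with a WHERE clause on pickup_datetime will only scan the chunks whose ranges overlap it.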

Columnstore Compression

With tsdb.enable_columnstore=true:

  • Data is stored in a hybrid row-columnar format
  • Analytical queries only scan the columns they need (massive speedup)
  • Typical compression ratios: 90%+ for time-series data
  • Compression happens transparently - no changes to your queries
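One way to check the ratio on your own data (assuming `hypertable_compression_stats()`, which ships with TimescaleDB 2.x):

```sql
-- Compare on-disk size before and after columnstore conversion
SELECT
    pg_size_pretty(before_compression_total_bytes) AS before,
    pg_size_pretty(after_compression_total_bytes) AS after
FROM hypertable_compression_stats('trips');
```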

Direct to Columnstore

When you use SET timescaledb.enable_direct_compress_copy = on:

  • Data loads directly into compressed columnstore format
  • Bypasses the rowstore entirely
  • Instant analytical performance - no waiting for background compression
  • Perfect for bulk data loads and migrations

Segmentation

The tsdb.segmentby='pickup_boroname' setting:

  • Groups data by pickup borough within each chunk (6 unique values: Manhattan, Brooklyn, Queens, Bronx, Staten Island, EWR)
  • Improves compression ratios (similar data together)
  • Speeds up queries that filter by pickup_boroname
  • Better cardinality than vendor_id (6 values vs 2) for optimal compression
  • Automatically optimized without manual tuning

time_bucket() Function

TimescaleDB's time_bucket() is like PostgreSQL's date_trunc() but more powerful:

  • Works with any interval: 5 minutes, 1 hour, 1 day, etc.
  • Optimized for time-series queries
  • Integrates seamlessly with continuous aggregates
  • Essential for time-series analytics
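For example, the hourly query from Step 5 becomes a 5-minute breakdown just by changing the interval:

```sql
SELECT
    time_bucket('5 minutes', pickup_datetime) AS five_min,
    COUNT(*) AS trips
FROM trips
GROUP BY five_min
ORDER BY five_min DESC
LIMIT 12;
```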

Sample Queries Explained

See nyc-taxi-queries.sql for the complete set of queries. Each query demonstrates:

  1. Total trips and revenue - Simple aggregations across all data
  2. Breakdown by vendor - Segmentation analysis by taxi vendor
  3. Hourly patterns - Using time_bucket() for time-based aggregation
  4. Payment type analysis - Analyzing payment methods
  5. Daily statistics - Multi-dimensional aggregation (time + borough)
  6. Distance categories - CASE statement with aggregations

Schema Design Choices

Why these settings?

partition_column='pickup_datetime'

  • Time is the natural partition key for time-series data
  • Enables automatic chunk pruning for time-range queries
  • Default chunk interval (7 days) works well for taxi data
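If your data density suits a different interval, it can be changed with `set_chunk_time_interval()`; this affects newly created chunks only:

```sql
-- Example: switch to 1-day chunks for denser data
SELECT set_chunk_time_interval('trips', INTERVAL '1 day');
```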

segmentby='pickup_boroname'

  • Optimal cardinality with 6 borough values (Manhattan, Brooklyn, Queens, Bronx, Staten Island, EWR)
  • Frequently used in WHERE clauses and GROUP BY for location-based analytics
  • Improves compression by grouping geographically similar trips
  • Better than vendor_id (only 2 values) for compression efficiency

orderby='pickup_datetime DESC'

  • Most queries want recent data first
  • Optimizes for "latest trips" queries
  • Improves query performance for time-range scans

Continuous Aggregates (Advanced)

For real-time dashboards, you can create continuous aggregates that automatically update:

-- Create a continuous aggregate for hourly statistics by borough
CREATE MATERIALIZED VIEW trips_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', pickup_datetime) AS hour,
    pickup_boroname,
    COUNT(*) as trip_count,
    AVG(fare_amount) as avg_fare,
    SUM(fare_amount) as total_revenue,
    AVG(trip_distance) as avg_distance
FROM trips
GROUP BY hour, pickup_boroname;

-- Add a refresh policy to keep it updated
SELECT add_continuous_aggregate_policy('trips_hourly',
    start_offset => INTERVAL '2 hours',
    end_offset => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');

Now you can query trips_hourly for instant results on pre-aggregated data.
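For example, a dashboard query against the aggregate might look like:

```sql
SELECT hour, pickup_boroname, trip_count,
       ROUND(avg_fare::numeric, 2) AS avg_fare
FROM trips_hourly
ORDER BY hour DESC, pickup_boroname
LIMIT 12;
```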

Troubleshooting

Data didn't load

  • Check the CSV file path is correct
  • Ensure the CSV header matches the schema columns
  • Try loading a few rows first to test (for example, the header plus the first 10 data rows in a separate file)
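One way to build such a test file (the filename matches this example's download; adjust it if yours differs):

```shell
CSV=nyc_taxi_sample_nov_dec_2015.csv

# If the real download isn't present, create a small placeholder so the
# command can be tried (placeholder contents are made up for illustration)
[ -f "$CSV" ] || { echo "vendor_id,fare_amount"; seq 1 20 | sed 's/^/CMT,/'; } > "$CSV"

# Keep the header plus the first 10 data rows in a small test file
head -n 11 "$CSV" > nyc_taxi_sample_10rows.csv
```

Load the small file with the same COPY command from Step 4 to confirm the schema matches before loading the full dataset.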

Direct to columnstore not working

  • Verify TimescaleDB version 2.24+: SELECT extversion FROM pg_extension WHERE extname = 'timescaledb';
  • Ensure you ran SET timescaledb.enable_direct_compress_copy = on; in the same session
  • Check for error messages in the output

Queries seem slow

  • Verify columnstore is enabled: SELECT * FROM timescaledb_information.hypertables WHERE hypertable_name = 'trips';
  • Check if data is compressed: SELECT * FROM timescaledb_information.chunks WHERE hypertable_name = 'trips';
  • Ensure you're querying with time ranges (enables chunk exclusion)

Out of memory during load

  • Reduce batch size in COPY command
  • Increase Docker memory allocation
  • Consider loading data in smaller time-range batches
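As a sketch of the batching approach, you could split the CSV into smaller pieces and COPY each one in turn (file and part names here are illustrative):

```shell
CSV=nyc_taxi_sample_nov_dec_2015.csv

# If the real download isn't present, create a small placeholder
# (placeholder contents are made up for illustration)
[ -f "$CSV" ] || { echo "vendor_id,fare_amount"; seq 1 100 | sed 's/^/CMT,/'; } > "$CSV"

# Strip the header, then split the remaining rows into fixed-size parts
# (use a much larger line count, e.g. 500000, for the real dataset)
tail -n +2 "$CSV" | split -l 40 - trips_part_

# Load each part without a header row:
# for f in trips_part_*; do
#   psql -h localhost -p 6543 -U postgres \
#     -c "COPY trips FROM STDIN WITH (FORMAT csv)" < "$f"
# done
```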

Use Cases

This NYC Taxi example demonstrates patterns applicable to:

  • Ride-sharing platforms - Track trips, drivers, pricing
  • Fleet management - Vehicle tracking, route optimization
  • Delivery services - Order tracking, delivery times, driver analytics
  • Public transportation - Route analysis, passenger counts, schedule optimization
  • Urban planning - Traffic patterns, popular routes, demand forecasting
  • Logistics - Shipment tracking, route efficiency, cost analysis

Clean Up

When you're done experimenting:

If you used the one-line install:

# Stop the container
docker stop timescaledb-ha-pg18-quickstart

# Remove the container
docker rm timescaledb-ha-pg18-quickstart

# Remove the persistent data volume
docker volume rm timescaledb_data

# (Optional) Remove the Docker image
docker rmi timescale/timescaledb-ha:pg18

If you used the manual Docker command:

# Stop the container
docker stop timescaledb

# Remove the container
docker rm timescaledb

# (Optional) Remove the Docker image
docker rmi timescale/timescaledb-ha:pg18

Note: If you created a named volume with the manual Docker command, you can remove it with docker volume rm <volume_name>.

Contributing

Found an issue or want to improve this example? Contributions welcome! Open an issue or PR on GitHub.


Questions? Check out Timescale Community Forums or TimescaleDB Documentation.