Aggregating data with PostgreSQL

PostgreSQL introduced window functions back in version 8.4 (which means six years ago!), but I still feel they are a little-known feature. Let's see how they work with a real-life example!

The Problem

We have to generate data for a stock charting site. The charting library expects data in the OHLC (open, high, low, close) format, and we need to support multiple intervals (currently 1min, 5min, 15min, and 1h) within a given time period.

The Data

We have a table containing one-minute OHLC bars for stocks, going 30 days back. The table schema looks like this:

CREATE TABLE ticker_minute_bars (
  date timestamp with time zone,
  open numeric,
  low numeric,
  high numeric,
  close numeric,
  volume bigint,
  key character varying(255),
  id serial NOT NULL,
  CONSTRAINT ticker_minute_bars_pkey PRIMARY KEY (id)
);

CREATE UNIQUE INDEX index_ticker_minute_bars_on_key_and_date
  ON ticker_minute_bars
  USING btree
  (key COLLATE pg_catalog."default", date);

and it currently contains 4169221 rows of data.

The Solution

First, we need to establish the boundaries of the interval periods in the given time range. We will use the awesome generate_series function that is built into PostgreSQL.

For example:

with intervals as (
  select start, start + interval '5min' as end
  from generate_series('2015-06-11 9:30', '2015-06-11 16:00', interval '5min') as start)
select * from intervals;

will return exactly what we want:

start                  | end
-----------------------+------------------------
2015-06-11 09:30:00+02 | 2015-06-11 09:35:00+02
2015-06-11 09:35:00+02 | 2015-06-11 09:40:00+02
2015-06-11 09:40:00+02 | 2015-06-11 09:45:00+02
2015-06-11 09:45:00+02 | 2015-06-11 09:50:00+02
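One detail worth noting: generate_series includes its upper bound, so the series above also emits a row starting at 16:00, i.e. an interval that runs past the close. If that trailing bucket is unwanted, one way (a sketch, not something the rest of the article depends on) is to stop one step early:

```sql
-- generate_series is inclusive of its stop value, so subtracting one step
-- from the upper bound drops the interval that would start at market close:
select start, start + interval '5min' as end
from generate_series('2015-06-11 9:30'::timestamptz,
                     '2015-06-11 16:00'::timestamptz - interval '5min',
                     interval '5min') as start;
```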

Now that we have the boundaries that establish our sets of rows, we can dive into window functions and aggregates.

with intervals as (
  select start, start + interval '5min' as end
  from generate_series('2015-06-11 9:30', '2015-06-11 16:00', interval '5min') as start)
select distinct
  intervals.start as date,
  min(low) over w as low,
  max(high) over w as high,
  first_value(open) over w as open,
  last_value(close) over w as close,
  sum(volume) over w as volume
from intervals
  join ticker_minute_bars tmb on
    tmb.key = 'AAPL' and tmb.date >= intervals.start and tmb.date < intervals.end
-- ordering the window and widening the frame makes first_value/last_value
-- deterministic: without them the "first" and "last" rows are undefined
window w as (partition by intervals.start order by tmb.date
             rows between unbounded preceding and unbounded following)
order by intervals.start;

The aggregate calculations are performed over a window (or frame) of rows determined by the partition by intervals.start part of the query: every minute bar joined to the same interval start lands in the same window.
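As an aside, min, max and sum alone could just as well be computed with a plain GROUP BY; it is the open and close columns that need first_value and last_value, which exist only as window functions. A sketch of the GROUP BY variant covering just the three plain aggregates, using the same intervals CTE:

```sql
with intervals as (
  select start, start + interval '5min' as end
  from generate_series('2015-06-11 9:30', '2015-06-11 16:00', interval '5min') as start)
select
  intervals.start as date,
  min(low)    as low,    -- plain aggregates work fine with GROUP BY...
  max(high)   as high,
  sum(volume) as volume  -- ...but there is no built-in first()/last() aggregate
from intervals
  join ticker_minute_bars tmb on
    tmb.key = 'AAPL' and tmb.date >= intervals.start and tmb.date < intervals.end
group by intervals.start
order by intervals.start;
```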

In the end, we can easily query for different intervals and time ranges, and we get results fast (~70 ms on my laptop)!

date                | low   | high   | open   | close  | volume
--------------------+-------+--------+--------+--------+--------
2015-05-29 09:30:00 | 131.1 | 131.44 | 131.26 | 131.17 |  63132
2015-05-29 09:35:00 | 131.1 | 131.38 | 131.18 | 131.25 |  69550

Time: 75.079 ms
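Supporting the other intervals from the requirements is then just a matter of swapping the step in the CTE; the select list, join and window clause stay the same. For example, 15-minute bars (the full query, with only the two '5min' literals changed):

```sql
-- 15-minute bars: only the interval step changes
with intervals as (
  select start, start + interval '15min' as end
  from generate_series('2015-06-11 9:30', '2015-06-11 16:00', interval '15min') as start)
select distinct
  intervals.start as date,
  min(low) over w as low,
  max(high) over w as high,
  first_value(open) over w as open,
  last_value(close) over w as close,
  sum(volume) over w as volume
from intervals
  join ticker_minute_bars tmb on
    tmb.key = 'AAPL' and tmb.date >= intervals.start and tmb.date < intervals.end
window w as (partition by intervals.start order by tmb.date
             rows between unbounded preceding and unbounded following)
order by intervals.start;
```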


Adrian Serafin

