In one of my projects I’m processing a large number of XML files containing various information about sport events. I have to record all the files that made modifications to a given sport event. Since a file can sometimes be re-uploaded, I decided to use a convenient Rails helper:

SportEventFile.where(sport_event_id: 686954, file: 'file.xml').first_or_create!

It worked like a charm - quick and easy solution to my problem.

After a while I noticed an increase in processing times for a single file. A short investigation led to the first_or_create! line of code:

# 0.040000   0.000000   0.040000 (  1.555691)
Benchmark.measure do 
  SportEventFile.where(sport_event_id: 686954, file: 'file.xml').first_or_create!
end  

WOW - it took 1.5s to execute this simple function!

The next step was to run EXPLAIN ANALYZE on the underlying query:

EXPLAIN ANALYZE 
  SELECT 
    "sport_event_files".* 
  FROM 
    "sport_event_files" 
  WHERE 
    "sport_event_files"."file" = 'file.xml' AND 
    "sport_event_files"."sport_event_id" = 686954 
  ORDER BY "sport_event_files"."id";

-- query plan
Index Scan using sport_event_files_pkey on sport_event_files  (cost=0.43..279322.55 rows=1 width=63) (actual time=1226.469..1226.515 rows=1 loops=1)
 Filter: (((file)::text = 'file.xml'::text) AND (sport_event_id = 686954))
 Rows Removed by Filter: 5492868
Total runtime: 1226.542 ms
(4 rows)

Because of the ORDER BY "sport_event_files"."id" at the end (that’s the .first part of first_or_create! at work), PostgreSQL chose to use the primary key index for the sorting and to filter by the conditions afterwards. This essentially resulted in a sequential scan of a table with almost 5.5 million rows (always a bad idea ;)

Here is what happened when I removed the ORDER BY part.

EXPLAIN ANALYZE 
  SELECT 
    "sport_event_files".* 
  FROM 
    "sport_event_files" 
  WHERE 
    "sport_event_files"."file" = 'file.xml' AND 
    "sport_event_files"."sport_event_id" = 686954;

-- query plan
Index Scan using index_sport_event_files_on_file_and_sport_event_id on sport_event_files  (cost=0.56..8.58 rows=1 width=63) (actual time=0.060..0.061 rows=1 loops=1)
  Index Cond: (((file)::text = 'file.xml'::text) AND (sport_event_id = 686954))
Total runtime: 0.088 ms
(3 rows)

I ended up with an index scan on the composite index and a query time measured in fractions of a millisecond!

I rewrote my code using .exists? and everything was great again!
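
Roughly, the rewrite looks like this (a minimal sketch, assuming the same SportEventFile model and the composite index on (file, sport_event_id) visible in the plan above):

attrs = { sport_event_id: 686954, file: 'file.xml' }

# .exists? generates SELECT 1 AS one ... LIMIT 1 with no ORDER BY,
# so PostgreSQL can answer it straight from the composite index.
SportEventFile.create!(attrs) unless SportEventFile.where(attrs).exists?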

I’ve been writing a Ruby app lately (the kind that runs in the terminal and doesn’t use WEBrick as an HTTP server) and I found myself missing some of the tooling that comes with Rails, especially the console. The ability to just type rails c and play around with your code was one of the framework’s killer features back in the day.

Here’s how to do it in a plain Ruby console application :)

Sample application

Our app will contain only three files

# cat.rb
class Cat
  def make_sound
    puts 'miau' 
  end
end

# config/application.rb
require_relative '../cat'

# main.rb
require_relative 'config/application'
Cat.new.make_sound

Console script

We will follow Rails conventions and put our executables in the bin directory

#!/usr/bin/env ruby
# bin/console
require 'irb'
require 'irb/completion'
require_relative '../config/application'

IRB.start

And that’s it - we can now type bin/console and our Cat class is available!
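
A quick check inside the session (what you type, with the output shown as comments):

Cat.new.make_sound
# miau
# => nil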

Reloading code

It wouldn’t feel like a real console without being able to call reload!, so let’s try to add it!

First we need to change require_relative to load in application.rb, because the former will not load our source file again if it has already been loaded.

# config/application.rb
load File.expand_path('../../cat.rb', __FILE__)

And then we will add a reload! function to our script:

#!/usr/bin/env ruby
# bin/console
require 'irb'
require 'irb/completion'

def reload!
  load File.expand_path('../../config/application.rb', __FILE__)
end

reload!
IRB.start
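
Now if you edit cat.rb while the console is running (say, change the sound the cat makes), calling reload! will pick the change up. One caveat: load simply re-opens the class, so methods you remove from the file will stick around until you restart the console. A sample session (output as comments; the new sound is just an example):

Cat.new.make_sound
# miau
reload!             # re-runs config/application.rb, which load-s cat.rb again
# => true
Cat.new.make_sound
# purr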

I’ve been creating a lot of JSON APIs using Grape or Ruby on Rails lately. From time to time I face an issue where I need to return a large amount of data in a single response.

The usual way would be to execute the query via ActiveRecord and use a library like JBuilder or ActiveModelSerializers to generate the JSON response. It’s nice and easy, but sometimes too slow :(

In situations like this I rely on PostgreSQL to help me out with its row_to_json function.

Let’s convert the chart query from my previous post so it returns JSON that is ready to be sent to the charting library right away:

with intervals as (
  select start, start + interval '5min' as end
  from generate_series('2015-05-29 9:30', '2015-05-29 16:00', interval '5min') as start
)
select json_agg(row_to_json(ohlc)) from
(
    select distinct
      intervals.start as date,
      min(low) over w as low,
      max(high) over w as high,
      first_value(open) over w as open,
      last_value(close) over w as close,
      sum(volume) over w as volume
    from
      intervals
      join ticker_minute_bars tmb on
        tmb.key = 'AAPL.BATS' and
        tmb.date >= intervals.start and
        tmb.date < intervals.end
    window w as (partition by intervals.start)
    order by intervals.start
) as ohlc

Time: 76.968 ms

The best part is that the JSON serialization overhead compared to the original query is almost invisible!
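
For completeness, here’s a rough sketch of how such a result can be served from Rails. The controller and the CHART_SQL constant are made up for illustration; the point is that select_value returns the json_agg output as a single String, and render json: passes a String through without re-encoding it.

class ChartsController < ApplicationController
  # Hypothetical constant - paste the json_agg(row_to_json(...)) query here.
  CHART_SQL = <<-SQL.freeze
    -- the chart query from this post
  SQL

  def show
    # select_value returns the first column of the first row as a String
    # (or nil when json_agg has nothing to aggregate).
    json = ActiveRecord::Base.connection.select_value(CHART_SQL) || '[]'
    render json: json
  end
end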

Thanks!

PostgreSQL introduced window functions in version 8.4 (which means 6 years ago!), but I still feel like this is a very little-known feature. Let’s see how they work with a real-life example!

The Problem

We have to generate data for a stock charting site. The charting library expects data in the OHLC (open-high-low-close) format, and we need to support multiple intervals (currently 1min, 5min, 15min, 1h) within a given time period.

The Data

We have a table containing minute OHLC bars for stocks going 30 days back. The table schema looks like this:

CREATE TABLE ticker_minute_bars
(
  date timestamp with time zone,
  open numeric,
  low numeric,
  high numeric,
  close numeric,
  volume bigint,
  key character varying(255),
  id serial NOT NULL,
  CONSTRAINT ticker_minute_bars_pkey PRIMARY KEY (id)
);

CREATE UNIQUE INDEX index_ticker_minute_bars_on_key_and_date
  ON ticker_minute_bars
  USING btree
  (key COLLATE pg_catalog."default", date);

and it currently contains 4169221 rows of data.

The Solution

First we need to somehow establish the boundaries of the interval periods in the given time range. We will use the awesome generate_series function that is built into PostgreSQL.

For example

with intervals as (
  select start, start + interval '5min' as end
  from generate_series('2015-06-11 9:30', '2015-06-11 16:00', interval '5min') as start)
select * from intervals;

will return exactly what we want:

start                  | end
-----------------------+------------------------
2015-06-11 09:30:00+02 | 2015-06-11 09:35:00+02
2015-06-11 09:35:00+02 | 2015-06-11 09:40:00+02
2015-06-11 09:40:00+02 | 2015-06-11 09:45:00+02
2015-06-11 09:45:00+02 | 2015-06-11 09:50:00+02

Now that we have the boundaries that define our sets of rows, we can dive into window functions and aggregates.

with intervals as (
  select start, start + interval '5min' as end
  from generate_series('2015-06-11 9:30', '2015-06-11 16:00', interval '5min') as start
)
select distinct
  intervals.start as date,
  min(low) over w as low,
  max(high) over w as high,
  first_value(open) over w as open,
  last_value(close) over w as close,
  sum(volume) over w as volume
from
  intervals
  join ticker_minute_bars tmb on
    tmb.key = 'AAPL' and
    tmb.date >= intervals.start and
    tmb.date < intervals.end
window w as (partition by intervals.start)
order by intervals.start

The aggregate calculations are performed over a window (or partition) of data that is determined by the partition by intervals.start part of the query.

In the end we can easily query for different intervals and time ranges and get results fast (~70 ms on my laptop)! A Ruby-side sketch for binding those parameters follows the sample output below.

date                | low   | high   | open   | close  | volume
--------------------+-------+--------+--------+--------+-------
2015-05-29 09:30:00 | 131.1 | 131.44 | 131.26 | 131.17 | 63132
2015-05-29 09:30:00 | 131.1 | 131.38 | 131.18 | 131.25 | 69550

Time: 75.079 ms
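
And here’s the Ruby-side sketch mentioned above. The ohlc_bars helper is made up for illustration; it quotes the key, the time range and the step into the same query, so switching intervals (1min, 5min, 15min, 1h) is just a matter of passing a different step:

def ohlc_bars(key:, from:, to:, step: '5min')
  conn = ActiveRecord::Base.connection
  sql = <<-SQL
    with intervals as (
      select start, start + #{conn.quote(step)}::interval as end
      from generate_series(#{conn.quote(from)}, #{conn.quote(to)},
                           #{conn.quote(step)}::interval) as start
    )
    select distinct
      intervals.start as date,
      min(low)          over w as low,
      max(high)         over w as high,
      first_value(open) over w as open,
      last_value(close) over w as close,
      sum(volume)       over w as volume
    from
      intervals
      join ticker_minute_bars tmb on
        tmb.key = #{conn.quote(key)} and
        tmb.date >= intervals.start and
        tmb.date < intervals.end
    window w as (partition by intervals.start)
    order by intervals.start
  SQL
  conn.exec_query(sql).to_a
end

# ohlc_bars(key: 'AAPL', from: '2015-06-11 9:30', to: '2015-06-11 16:00', step: '15min')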

Thanks!