In one of my projects I’m processing large numbers of XML files containing
various information about sport events. I have to record all the files that
modified a given sport event. Since a file can sometimes be re-uploaded,
I decided to use a convenient Rails helper:
It worked like a charm - a quick and easy solution to my problem.
After a while I noticed an increase in the processing time for a single file.
A short investigation led to the first_or_create! line of code:
WOW - it took 1.5s to execute this simple function!
The next step was to run EXPLAIN ANALYZE on the underlying query:
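The query behind first_or_create! looks roughly like this (the table name appears in the plan; the condition columns and values are my reconstruction):

```sql
EXPLAIN ANALYZE
SELECT "sport_event_files".*
  FROM "sport_event_files"
 WHERE "sport_event_files"."sport_event_id" = 10939
   AND "sport_event_files"."file_name" = 'events.xml'
 ORDER BY "sport_event_files"."id" ASC
 LIMIT 1;
```

The ORDER BY ... LIMIT 1 pair comes from the .first half of first_or_create!.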
Because of the ORDER BY "sport_event_files"."id" at the end, PostgreSQL chose to
use the primary-key index for the sorting and to filter by the conditions.
This essentially resulted in a sequential scan over a table with almost 5.5 million rows (always a bad idea ;)
Here is what happened when I removed the ORDER BY part.
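Roughly, the query without the ordering (column names are the same assumptions as above; the plan change relies on an index covering the condition columns):

```sql
EXPLAIN ANALYZE
SELECT "sport_event_files".*
  FROM "sport_event_files"
 WHERE "sport_event_files"."sport_event_id" = 10939
   AND "sport_event_files"."file_name" = 'events.xml'
 LIMIT 1;
```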
I ended up with an index-only scan and a query time measured in fractions of a millisecond!
I rewrote my code using .exists? and everything was great again!
I’ve been writing a Ruby app lately (the kind that runs in the terminal and doesn’t ship
WEBrick as an HTTP server) and I found myself missing some of the tooling that is available with Rails - especially the command line.
The ability to just type rails c and play around with your code was one of the framework’s killer features
back in the day:
Here’s how to do it in a plain Ruby console application :)
Our app will contain only three files.
We will follow Rails conventions and put our executable in the bin directory.
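With a hypothetical Cat class as the example, the three files might look like this (only the bin/console name follows the Rails convention; everything else is an assumption). For illustration the sketch writes the files to a temp directory and loads them; in a real app they would simply live in lib/, config/ and bin/:

```ruby
require "fileutils"
require "tmpdir"

root = Dir.mktmpdir
FileUtils.mkdir_p(["#{root}/lib", "#{root}/config", "#{root}/bin"])

# lib/cat.rb -- an example class from our app
File.write("#{root}/lib/cat.rb", <<~RUBY)
  class Cat
    def meow
      "Meow!"
    end
  end
RUBY

# config/application.rb -- requires all of the app's source files
File.write("#{root}/config/application.rb", <<~RUBY)
  require_relative "../lib/cat"
RUBY

# bin/console -- loads the app, then drops into IRB
File.write("#{root}/bin/console", <<~RUBY)
  #!/usr/bin/env ruby
  require "irb"
  require_relative "../config/application"
  IRB.start
RUBY
File.chmod(0755, "#{root}/bin/console")

# Simulate what bin/console does (minus the interactive IRB session):
require "#{root}/config/application"
puts Cat.new.meow # => Meow!
```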
And that’s it - we can now type bin/console and our Cat class is available!
It wouldn’t feel like a real console without being able to call reload!, so let’s try to add it!
First we need to change require to load in application.rb, because the former will not load
our source file again if it has already been loaded.
And then we will add reload! function to our script:
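A runnable sketch of the idea (file names and the Cat class are the same assumptions as before; the files live in a temp directory purely for illustration):

```ruby
require "tmpdir"

root = Dir.mktmpdir
File.write("#{root}/cat.rb", 'class Cat; def meow; "Meow!"; end; end')

# config/application.rb now uses load instead of require:
APPLICATION_RB = "#{root}/application.rb"
File.write(APPLICATION_RB, "load '#{root}/cat.rb'")

load APPLICATION_RB

# In bin/console we define reload! to re-execute application.rb;
# load (unlike require) re-reads the file every time it is called.
def reload!
  load APPLICATION_RB
  true
end

puts Cat.new.meow # => Meow!

# Edit the source on disk, reload!, and the change is picked up:
File.write("#{root}/cat.rb", 'class Cat; def meow; "Purr!"; end; end')
reload!
puts Cat.new.meow # => Purr!
```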
PostgreSQL introduced window functions in version 8.4 (which means six years ago!), but I still feel this is a little-known feature. Let’s see how they work with a real-life example!
We have to generate data for a stock-charting site. The charting library expects data in the OHLC (open, high, low, close) format, and we need to support multiple intervals (currently 1min, 5min, 15min, 1h) over a given time period.
We have a table containing minute OHLC bars for stocks going 30 days back. The table schema looks like this:
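A plausible sketch of such a minute-bar table (the table name, column names and types are my assumptions, not from the original post):

```sql
CREATE TABLE ohlc_bars (
    id     bigserial   PRIMARY KEY,
    symbol text        NOT NULL,
    time   timestamptz NOT NULL,
    open   numeric     NOT NULL,
    high   numeric     NOT NULL,
    low    numeric     NOT NULL,
    close  numeric     NOT NULL,
    volume bigint      NOT NULL
);

CREATE INDEX ON ohlc_bars (symbol, time);
```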
and it currently contains 4,169,221 rows of data.
First we need to somehow establish the boundaries of the interval periods in the given time range. The awesome generate_series function that is built into PostgreSQL will return exactly what we want:
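For example, 5-minute boundaries over one trading hour (the timestamps are purely illustrative):

```sql
SELECT intervals.start,
       intervals.start + interval '5 minutes' AS interval_end
  FROM generate_series(timestamp '2015-06-01 09:30',
                       timestamp '2015-06-01 10:25',
                       interval '5 minutes') AS intervals(start);
```

Each returned row marks the start of one charting interval.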
Now that we have the boundaries that establish our set of rows, we can dive into window functions and aggregates.
The aggregate calculations are performed over a window (or partition) of data that is determined by the PARTITION BY intervals.start part of the query.
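Putting it together, a sketch of the 5-minute OHLC query, assuming a minute-bar table ohlc_bars(symbol, time, open, high, low, close) - those names, the symbol and the timestamps are illustrative. first_value and last_value pick the open and close within each partition, max and min compute the high and low, and DISTINCT collapses the per-minute duplicates to one row per interval:

```sql
SELECT DISTINCT
       intervals.start                AS time,
       first_value(bars.open)  OVER w AS open,
       max(bars.high)          OVER w AS high,
       min(bars.low)           OVER w AS low,
       last_value(bars.close)  OVER w AS close
  FROM generate_series(timestamp '2015-06-01 09:30',
                       timestamp '2015-06-01 15:55',
                       interval '5 minutes') AS intervals(start)
  JOIN ohlc_bars bars
    ON bars.time >= intervals.start
   AND bars.time <  intervals.start + interval '5 minutes'
   AND bars.symbol = 'AAPL'
-- The explicit frame is needed so last_value sees the whole
-- partition rather than stopping at the current row.
WINDOW w AS (PARTITION BY intervals.start ORDER BY bars.time
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
 ORDER BY time;
```

Switching intervals is just a matter of changing the step passed to generate_series and the interval added in the join condition.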
In the end we can easily query for different intervals and time ranges, and we get results fast (~70 ms on my laptop)!