For quite some time I have been a hardcore PostgreSQL user. I like to leverage database capabilities when possible to get faster response times in my Ruby on Rails applications. Even though PostgreSQL is still by far my favourite open source RDBMS, I’m currently working with MySQL on a daily basis and I often find myself searching for MySQL equivalents of PostgreSQL features.

For the sake of this post let’s assume the following table schema:

  CREATE TABLE orders(
    id SERIAL,
    user_id integer,
    total_amount numeric,
    CONSTRAINT orders_pkey PRIMARY KEY (id)
  );

The requirement is to return all users with the total amount of money they spent purchasing items. The implementation is pretty straightforward and the query looks the same in both database engines:

SELECT user_id, sum(total_amount) FROM orders GROUP BY user_id
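
On the Rails side the same aggregation can be expressed directly with ActiveRecord; it returns a hash keyed by user_id:

Order.group(:user_id).sum(:total_amount)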

With that in place, we are presented with a new requirement:

  • for users who made purchases totalling more than $1000 we want to get their order ids for further processing.

The first part of the task is easily solvable in both databases using HAVING:

SELECT user_id, sum(total_amount) FROM orders GROUP BY user_id HAVING sum(total_amount) > 1000

The next part can be handled in application code in a couple of different ways, for example by triggering a separate query for each user:

Order.group(:user_id).having('sum(total_amount) > 1000').pluck('user_id, sum(total_amount)').each do |user_id, _|
  # separate query to get order ids for each qualified user
  order_ids = Order.where(user_id: user_id).pluck(:id) 
end

This is a classic N+1 queries issue that will most likely have a negative impact on performance. Luckily for us, it can be solved using built-in features of both MySQL and PostgreSQL.

array_agg in PostgreSQL

In PostgreSQL we can get the aggregated values for each group using the array_agg function:

-- PostgreSQL
SELECT user_id, sum(total_amount), array_agg(id) as order_ids FROM orders GROUP BY user_id HAVING sum(total_amount) > 1000

We got all the information we want with just 1 database round-trip, awesome!
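
A minimal Rails-side sketch of consuming that query (depending on the Rails version and the pg adapter, order_ids may arrive as a Ruby array of integers or as a string like "{1,2,3}"):

Order.group(:user_id).having('sum(total_amount) > 1000').pluck('user_id, sum(total_amount), array_agg(id)').each do |user_id, _total, order_ids|
  # all order ids for this user are already here - no extra query needed
end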

group_concat in MySQL

When working with MySQL it is also possible to achieve similar results using group_concat. There are two main differences, though:

  • in Ruby code we don’t get an array but a string of ids separated by some character - a comma by default (an easy fix, shown right after this list)
  • there is a length limit; after reaching it the value will be quietly truncated (I admit that I did not read MySQL’s documentation carefully enough and was surprised by this)
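
Assuming the default comma separator, turning that string back into an array of integers is a one-liner:

order_ids.split(',').map(&:to_i)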

As shown above, the first bullet is easily solvable, but how do we get around the second one? It turns out that we can increase the maximum length allowed for group_concat results! Consider this code:

with_increased_group_concat_max_len do
  Order.group(:user_id).having('sum(total_amount) > 1000').pluck('user_id, sum(total_amount), group_concat(id)').each do |user_id, _, order_ids|
    # ...
  end
end

def with_increased_group_concat_max_len(value = 1045576)
  original = get_group_concat_max_len
  set_group_concat_max_len(value)
  yield
ensure
  set_group_concat_max_len(original)
end

def get_group_concat_max_len
  row = ActiveRecord::Base.connection.select_one <<-SQL
    SHOW VARIABLES like 'group_concat_max_len'
  SQL

  row['Value'].to_i
end

def set_group_concat_max_len(value)
  ActiveRecord::Base.connection.execute <<-SQL
    SET SESSION group_concat_max_len = #{value};
  SQL
end

We wrapped the query in a block that remembers the original value of group_concat_max_len, increases it for the duration of the query, and restores the previous value once the query has executed.

Just as with PostgreSQL, we got all the information we want with just 1 database round-trip, awesome!

WARNING

The maximum value that can be set for group_concat_max_len depends on another server variable - max_allowed_packet. I strongly recommend reading the documentation and verifying what the maximum value is for your installation.
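
A quick way to check both limits from the Rails console (values are reported in bytes):

ActiveRecord::Base.connection.select_one("SHOW VARIABLES LIKE 'max_allowed_packet'")['Value']
ActiveRecord::Base.connection.select_one("SHOW VARIABLES LIKE 'group_concat_max_len'")['Value']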

Let’s say that we have to import data from some social network. We will be provided with a CSV file, generated every 5 minutes, containing the list of posts that were added/liked/commented on since the beginning of the day, together with like and comment counts for each post.

As you can imagine, the file will grow in size during the day; we are also looking at not only inserting new posts but also updating the counts for the ones we have already imported.

Test environment

  • sample example.csv file containing 50,000 rows with the following format:
title,published_at,likes_count,comments_count,post_id
  • PostgreSQL table with unique index on external_post_id
CREATE TABLE public.posts
(
  id integer NOT NULL DEFAULT nextval('posts_id_seq'::regclass),
  title character varying,
  published_at timestamp with time zone,
  likes_count integer NOT NULL DEFAULT 0,
  comments_count integer NOT NULL DEFAULT 0,
  external_post_id integer NOT NULL,
  CONSTRAINT posts_pkey PRIMARY KEY (id)
);

CREATE UNIQUE INDEX index_posts_on_external_post_id
  ON public.posts
  USING btree
  (external_post_id);
  • ActiveRecord model
class Post < ActiveRecord::Base
  validates :title, presence: true
  validates :external_post_id, presence: true
  validates :external_post_id, uniqueness: true
end

To better emulate the requirements, all rows from the CSV file with an odd external_post_id will already be present in the database.
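
For reference, the pre-seeding can be done with something along these lines (a sketch, not the exact script used; it assumes the column order described above):

CSV.foreach(Rails.root.join('example.csv')) do |row|
  next if row[4].to_i.even? # only rows with an odd external_post_id are pre-loaded

  Post.create!(
    title: row[0],
    published_at: row[1],
    likes_count: row[2],
    comments_count: row[3],
    external_post_id: row[4]
  )
end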

Simple approach

To establish a baseline, let’s start with a very simple version that uses first_or_create, provided by ActiveRecord out of the box:

CSV.foreach(Rails.root.join('example.csv')) do |row|
  data = {
    title: row[0],
    published_at: row[1],
    likes_count: row[2],
    comments_count: row[3]
  }
  
  Post.where(external_post_id: row[4]).first_or_create(data).update(data)
end

The result is expected but still pretty disappointing:

      user     system      total         real
112.940000   9.690000 122.630000 (159.846472)

Stepping up our game

We all know there are better ways to perform bulk imports with ActiveRecord, and one of them is the activerecord-import gem. It supports PostgreSQL 9.5’s ON CONFLICT, which makes it perfect for our use case.

According to the documentation the fastest method is to use raw columns and arrays of values and that is exactly what we are going to do:

columns = [:title, :published_at, :likes_count, :comments_count, :external_post_id]
values = CSV.read(Rails.root.join('example.csv'))

Post.import columns, values, validate: false, on_duplicate_key_update: { conflict_target: [:external_post_id] }

The result is amazing - we are almost 23x faster than before!

    user     system      total         real
6.160000   0.060000   6.220000 (  6.974064)

Can we be even faster?

Yes, we can!

The fastest way to put data into PostgreSQL is to use the COPY command, but it comes with some limitations. One especially important in our context is the lack of upsert support.

Until it becomes available, we can achieve similar functionality using TEMP TABLES.

Post.connection.execute <<-SQL
  CREATE TEMP TABLE post_imports
  (
    title character varying,
    published_at timestamp with time zone,
    likes_count integer,
    comments_count integer,
    external_post_id integer
  )
SQL

File.open(Rails.root.join('example.csv'), 'r') do |file|
  Post.connection.raw_connection.copy_data %{copy post_imports from stdin with csv delimiter ',' quote '"'} do
    while line = file.gets do
      Post.connection.raw_connection.put_copy_data line
    end
  end
end

Post.connection.execute <<-SQL
  insert into posts(title, published_at, likes_count, comments_count, external_post_id)
  select title, published_at, likes_count, comments_count, external_post_id
  from post_imports
  on conflict(external_post_id) do
  update set
    title = EXCLUDED.title,
    published_at = EXCLUDED.published_at,
    likes_count = EXCLUDED.likes_count,
    comments_count = EXCLUDED.comments_count
  returning id
SQL

In the above example we create a temporary table post_imports and copy all the data into it. Next, we take advantage of the INSERT INTO ... SELECT and ON CONFLICT syntax to move the data from the temporary table to the posts table.
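
One detail worth remembering: the temporary table lives until the end of the database session. If the importing process keeps its connection open, it may be worth dropping the table explicitly once the insert is done (an optional step, not part of the measured code):

Post.connection.execute('DROP TABLE IF EXISTS post_imports')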

The code is longer and less pretty than the one using the activerecord-import gem. It’s also 10x faster!

    user     system      total         real
0.150000   0.020000   0.170000 (  0.643315)

Final thoughts

It’s not surprising that the method closest to the database turns out to be the quickest.

Being 10x faster than activerecord-import (and 248x faster than the simple approach) is very impressive, but if we look at the raw execution times we might find that roughly 7s is perfectly acceptable in most cases. Because of that I would recommend going with the activerecord-import gem, but keep in mind that we can do much better if we really need to.

Not so long ago I finished my React + ReactRouter project and was looking for a neat way to deploy its production version.

Since it’s a 100% static website, the first thing that came to my mind was Amazon S3 and their website hosting. The process is very simple:

  1. create a bucket
  2. make it publicly readable
  3. upload your html/css/js/other files
  4. enable static website hosting on the bucket
  5. change some DNS settings
  6. … you’re done - there is no point 6 ;)

To make it even simpler I used the awesome s3_website tool. It required creating a simple configuration file, and then I was able to push my work to S3.
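
A minimal s3_website.yml sketch (the bucket name and environment variable names below are just placeholders):

s3_id: <%= ENV['S3_ACCESS_KEY_ID'] %>
s3_secret: <%= ENV['S3_SECRET_ACCESS_KEY'] %>
s3_bucket: my-awesome-site.com

With that in place, a single s3_website push uploads the site to the bucket.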

Issues with routing

Unfortunately S3 website hosting wasn’t playing nice with ReactRouter. I typed http://my-awesome-site.com/products/cool-product in the browser’s address bar and got a 404 Not Found error. I realized that’s completely justified on the S3 side, since no such file was present in the bucket.

On staging I was handling such cases with Nginx configuration:

location / {
  ...
  try_files $uri.html $uri $uri/ /index.html;
  ...
}

This basically means: try to match the URI, and if that fails just serve index.html from the root directory.

Thanks to the above directive, routing was done by ReactRouter and the site was working perfectly on staging.

Intermediate solution

One possible workaround was to switch ReactRouter to use createHashHistory and set up redirect rules on the S3 bucket like this:

<RoutingRules>
  <RoutingRule>
    <Condition>
      <HttpErrorCodeReturnedEquals>404</HttpErrorCodeReturnedEquals>
    </Condition>
    <Redirect>
      <HostName>my-awesome-site.com</HostName>
      <ReplaceKeyPrefixWith>#/</ReplaceKeyPrefixWith>
    </Redirect>
  </RoutingRule>
</RoutingRules>

That way, when I typed http://my-awesome-site.com/products/cool-product I was redirected to http://my-awesome-site.com/#/products/cool-product. My site worked again, but with ugly URLs that were neither user- nor Google-friendly.

Cloudfront

I was looking through the AWS Management Console, hoping I could find some solution, and I noticed the Error Pages section in my CloudFront distribution. It turned out that I was able to override the default behaviour of the 404 HTTP error, and that was exactly what I needed. I set Response Page Path to /index.html and HTTP Response Code to 200.

(screenshot: CloudFront Error Pages configuration)

Since this is essentially the same configuration that I had in Nginx, I was able to access my site through normal URLs again :)

I’ve been working with some friends on a fun side-project lately - AppShape. It’s a product that will allow users to constantly monitor their applications’ health from both a technical and a business perspective.

We decided to split the product into 3 separate applications:

  • appshape - the user-facing web application running on Ruby on Rails
  • appshape-runner - runs user-defined checks against the users’ applications, written in Ruby
  • appshape-reporter - reports the results of these checks back to the main application, written in Ruby

At the moment we are also using two beanstalkd queues and a clockwork process to glue everything together.

This means launching 6 processes in 6 different terminal windows, constantly switching between them to check logs…

I had to come up with something better.

Foreman

I like the simplicity of the Procfile approach provided by Foreman. It’s just one command and you get colorized output, all logs in one window and the ability to kill all your processes with Ctrl+C. I wanted that for my project!

Unfortunately Foreman doesn’t support the way we’ve structured AppShape:

  • it doesn’t handle running processes from a different directory
  • it doesn’t support per-process RVM rubies/gemsets, which we are using for handling multiple Ruby versions

Subcontractor

After a short googling session I found subcontractor - a nice gem that is supposed to handle all of this.

Let’s take a look at a Procfile with the subcontractor commands included:

web:            bundle exec puma -C ./config/puma.rb
runner_queue:   beanstalkd -p 11305
reporter_queue: beanstalkd -p 11306
clock:          clockwork clock.rb
runner:         subcontract --rvm 'ruby-2.2.3@appshape-runner' --signal TERM --chdir ../appshape-runner -- bundle exec ruby main.rb
reporter:       subcontract --rvm 'ruby-2.2.3@appshape-results-reporter' --signal TERM --chdir ../appshape-reporter -- bundle exec ruby main.rb

The usage is pretty simple:

  • --rvm - sets the Ruby version and gemset
  • --signal - the signal that should be sent to the process
  • --chdir - the directory to change to before running the command
  • -- bundle exec ruby main.rb - the command that should be executed

With this setup I was able to start foreman as usual with foreman start and I could see all my processes running.

I would have been perfectly happy with the above, but I noticed that sometimes processes started via subcontractor were not killed. After a few hours of work and multiple restarts I executed ps aux | grep ruby and saw about 20 running Ruby processes.

I wasn’t able to fix the issue in a reasonable amount of time, so I decided to try another approach.

Foreman (again)

Back in the old days I used bash -lc when I wanted to run some of my Ruby programs via cron. It starts a non-interactive login shell and executes the given command. The trick here is the login option. Among other things it will source your .profile files, so RVM will be available!

Here’s my modified Procfile

web: bundle exec puma -C ./config/puma.rb
runner_queue: beanstalkd -p 11305
reporter_queue: beanstalkd -p 11306
clock: clockwork clock.rb
runner: bash -lc "cd ../appshape-runner && bundle exec ruby main.rb"
reporter: bash -lc "cd ../appshape-reporter && bundle exec ruby main.rb"

All processes are killed after Foreman stops and I can benefit from the “one command to start them all ™” setup :)

(screenshot: foreman output)

In one of my projects I’m processing large numbers of XML files containing various information about sport events. I have to record all the files that made modifications to a given sport event. Since a file can sometimes be re-uploaded, I decided to use a convenient Rails helper:

SportEventFile.where(sport_event_id: 686954, file: 'file.xml').first_or_create!

It worked like a charm - quick and easy solution to my problem.

After a while I noticed an increase in processing times for a single file. A short investigation led to the first_or_create! line of code:

# 0.040000   0.000000   0.040000 (  1.555691)
Benchmark.measure do 
  SportEventFile.where(sport_event_id: 686954, file: 'file.xml').first_or_create!
end  

WOW - it took 1.5s to execute this simple function!

The next step was to run EXPLAIN ANALYZE on the underlying query:

EXPLAIN ANALYZE 
  SELECT 
    "sport_event_files".* 
  FROM 
    "sport_event_files" 
  WHERE 
    "sport_event_files"."file" = 'file.xml' AND 
    "sport_event_files"."sport_event_id" = 686954 
  ORDER BY "sport_event_files"."id";

-- query plan
Index Scan using sport_event_files_pkey on sport_event_files  (cost=0.43..279322.55 rows=1 width=63) (actual time=1226.469..1226.515 rows=1 loops=1)
 Filter: (((file)::text = 'file.xml'::text) AND (sport_event_id = 686954))
 Rows Removed by Filter: 5492868
Total runtime: 1226.542 ms
(4 rows)

Because of the ORDER BY "sport_event_files"."id" at the end, PostgreSQL chose to use the primary key index for the sorting and filter rows by the conditions. This essentially meant walking through a table with almost 5.5 million rows (always a bad idea ;)
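
The ORDER BY is added by first_or_create! itself: it calls .first, which sorts by the primary key when the relation has no explicit order. A quick console illustration (WHERE clause abbreviated):

SportEventFile.where(sport_event_id: 686954, file: 'file.xml').order(:id).to_sql
# => SELECT "sport_event_files".* FROM "sport_event_files" WHERE ... ORDER BY "sport_event_files"."id" ASC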

Here is what happened when I removed the ORDER BY part:

EXPLAIN ANALYZE 
  SELECT 
    "sport_event_files".* 
  FROM 
    "sport_event_files" 
  WHERE 
    "sport_event_files"."file" = 'file.xml' AND 
    "sport_event_files"."sport_event_id" = 686954;

-- query plan
Index Scan using index_sport_event_files_on_file_and_sport_event_id on sport_event_files  (cost=0.56..8.58 rows=1 width=63) (actual time=0.060..0.061 rows=1 loops=1)
  Index Cond: (((file)::text = 'file.xml'::text) AND (sport_event_id = 686954))
Total runtime: 0.088 ms
(3 rows)

I ended up with an index scan on the composite index and a query time measured in fractions of a millisecond!

I rewrote my code using .exists? and everything was great again!
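
The rewritten check can look roughly like this (a sketch; .exists? issues a plain SELECT 1 ... LIMIT 1 without the troublesome ORDER BY, so the composite index gets used):

scope = SportEventFile.where(sport_event_id: 686954, file: 'file.xml')
scope.create! unless scope.exists?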