N+1 queries issue

The N+1 query issue is a common performance problem in Rails applications.

To better understand it, let’s consider a typical e-commerce application where we’re dealing with products that have prices in multiple currencies.

  class Product < ActiveRecord::Base
    has_many :prices
  end
  
  class Price < ActiveRecord::Base
    belongs_to :product
    validates :currency_code, presence: true
  end

When we want to display all products and show the price in USD, we can do something like this:

  Product.all.each do |product|
    # fires a separate query for each product
    product.prices.where(currency_code: 'USD').first.cents
  end

The above code results in one query to fetch all products and one additional query per product to fetch its price in USD:

  select * from products;
  select * from prices where product_id = 1 and currency_code = 'USD'
  select * from prices where product_id = 2 and currency_code = 'USD'
  select * from prices where product_id = 3 and currency_code = 'USD'
  -- ...

This will make your database unhappy and your application slow.

Eager loading to the rescue

The easiest way to fix N+1 queries is to use the associations’ eager loading provided by ActiveRecord (there is an excellent article describing all 3 available options and I strongly recommend reading it!).
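For quick reference, the three options are includes, preload, and eager_load; a minimal sketch of the difference (behaviour as of Rails 4/5):

  Product.includes(:prices)   # lets ActiveRecord pick the strategy: separate query, or a JOIN when the association is referenced in conditions
  Product.preload(:prices)    # always loads the association with a separate query
  Product.eager_load(:prices) # always uses a single LEFT OUTER JOIN

In our case includes does the job: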

  Product.all.includes(:prices).each do |product|
    product.prices.find { |price| price.currency_code == 'USD' }.cents
  end

Now our queries will look like this

  select * from products;
  select * from prices where product_id in (1,2,3,...)

Great, no more N+1 issue! On second look, though, we’re still loading more data than we actually need.

We only need to show prices in USD, yet we also loaded all the other currencies.

Multiple associations

In ActiveRecord we can pass a where condition to an association. That means we can define usd_prices, eur_prices, etc. and use them in includes.

Here’s refactored example:

  class Product < ActiveRecord::Base
    has_many :prices
    has_many :usd_prices, -> { where(currency_code: 'USD') }
    has_many :eur_prices, -> { where(currency_code: 'EUR') }
    # ...
  end
  
  Product.all.includes(:usd_prices).each do |product|
    product.usd_prices.first.cents
  end

and the generated queries are:

  select * from products;
  select * from prices where currency_code = 'USD' and product_id in (1,2,3,...)

This solution is almost ideal: we don’t have the N+1 issue and we only load the data we’re going to use. But it’s not flexible: if we want to handle a new currency we have to add an association, redeploy the code, and so on.

This could be acceptable for currencies, because we won’t be adding new ones very often, but there might be situations that require a more dynamic approach.

Preloading with dynamic scope

We cannot pass dynamic conditions to ActiveRecord’s associations, so we need to dig a little deeper and discover the ActiveRecord::Associations::Preloader API.

  # Eager loads the named associations for the given Active Record record(s).
  #
  # In this description, 'association name' shall refer to the name passed
  # to an association creation method. For example, a model that specifies
  # <tt>belongs_to :author</tt>, <tt>has_many :buyers</tt> has association
  # names +:author+ and +:buyers+.
  #
  # == Parameters
  # +records+ is an array of ActiveRecord::Base. This array needs not be flat,
  # i.e. +records+ itself may also contain arrays of records. In any case,
  # +preload_associations+ will preload the all associations records by
  # flattening +records+.
  #
  # +associations+ specifies one or more associations that you want to
  # preload. It may be:
  # - a Symbol or a String which specifies a single association name. For
  #   example, specifying +:books+ allows this method to preload all books
  #   for an Author.
  # - an Array which specifies multiple association names. This array
  #   is processed recursively. For example, specifying <tt>[:avatar, :books]</tt>
  #   allows this method to preload an author's avatar as well as all of his
  #   books.
  # - a Hash which specifies multiple association names, as well as
  #   association names for the to-be-preloaded association objects. For
  #   example, specifying <tt>{ author: :avatar }</tt> will preload a
  #   book's author, as well as that author's avatar.
  #
  # +:associations+ has the same format as the +:include+ option for
  # <tt>ActiveRecord::Base.find</tt>. So +associations+ could look like this:
  #
  #   :books
  #   [ :books, :author ]
  #   { author: :avatar }
  #   [ :books, { author: :avatar } ]
  ActiveRecord::Associations::Preloader.new.preload(records, associations, preload_scope = nil)

The third argument, preload_scope, is not documented, but it gives us the ability to filter prices by currency.

  products = Product.all.to_a
  ActiveRecord::Associations::Preloader.new.preload(products, :prices, Price.where(currency_code: 'USD'))
  products.each do |product|
    product.prices.first.cents
  end

The generated queries are identical to the ones from the previous example

  select * from products;
  select * from prices where currency_code = 'USD' and product_id in (1,2,3,...)

but since we’re passing the currency dynamically, we can handle a new one on the fly!
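To make this reusable, the preload call could be wrapped in a small helper that takes the currency as a parameter; a minimal sketch (the helper name is illustrative, not from the original example):

  def products_with_prices_in(currency_code)
    products = Product.all.to_a
    # Preloader API as in Rails 4/5; newer Rails versions take keyword arguments instead
    ActiveRecord::Associations::Preloader.new.preload(products, :prices, Price.where(currency_code: currency_code))
    products
  end

  products_with_prices_in('EUR').each do |product|
    product.prices.first&.cents
  end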

For quite some time I have been a hardcore PostgreSQL user. I like to leverage database capabilities when possible to get faster response times in my Ruby on Rails applications. Even though PostgreSQL is still by far my favourite open source RDBMS, I’m currently working with MySQL on a daily basis, and I often find myself searching for the MySQL equivalent of some PostgreSQL feature.

For the sake of this post let’s assume the following table schema:

  CREATE TABLE orders(
    id SERIAL,
    user_id integer,
    total_amount numeric,
    CONSTRAINT orders_pkey PRIMARY KEY (id)
  );

The requirement is to return all users with the total amount of money they spent purchasing items. The implementation is pretty straightforward and the query looks the same in both database engines:

  SELECT user_id, sum(total_amount) FROM orders GROUP BY user_id

Having that, we are presented with a new requirement:

  • for users that made purchases for more than $1000 we want to get their order ids for further processing.

The first part of the task is easily solvable in both databases using HAVING:

  SELECT user_id, sum(total_amount) FROM orders GROUP BY user_id HAVING sum(total_amount) > 1000

The second part can be handled in application code in a couple of different ways, for example by triggering a separate query for each user:

  Order.group(:user_id).having('sum(total_amount) > 1000').pluck('user_id, sum(total_amount)').each do |user_id, _|
    # separate query to get order ids for each qualified user
    order_ids = Order.where(user_id: user_id).pluck(:id) 
  end

This is the classic N+1 queries issue, which will most likely have a negative impact on performance. Luckily for us, it can be solved using built-in features of both MySQL and PostgreSQL.

array_agg in PostgreSQL

In PostgreSQL we can get aggregated values for the group using the array_agg function:

  -- PostgreSQL
  SELECT user_id, sum(total_amount), array_agg(id) as order_ids FROM orders GROUP BY user_id HAVING sum(total_amount) > 1000

We got all the information we want with just 1 database round-trip, awesome!
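The same query can be issued through ActiveRecord; a minimal sketch, assuming the Rails PostgreSQL adapter (depending on the adapter version the aggregated column may come back as a Ruby Array or as a string like "{1,2,3}"):

  Order.group(:user_id)
       .having('sum(total_amount) > 1000')
       .pluck('user_id', 'sum(total_amount)', 'array_agg(id)')
       .each do |user_id, total, order_ids|
    # normalize in case the adapter returns the aggregate as a "{1,2,3}" string
    order_ids = order_ids.tr('{}', '').split(',').map(&:to_i) if order_ids.is_a?(String)
    # further processing for this user...
  end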

group_concat in MySQL

When working with MySQL it is also possible to achieve similar results using group_concat. There are two main differences, though:

  • in Ruby code we don’t get an array but a string of ids separated by some character (a comma by default)
  • there is a length limit; after reaching it the value is quietly truncated (I admit that I did not read MySQL’s documentation carefully enough and was surprised by this)

The first point is easily solvable, but how do we get around the second one? It turns out that we can increase the maximum length allowed for group_concat results! Consider this code:

  with_increased_group_concat_max_len do
    Order.group(:user_id)
         .having('sum(total_amount) > 1000')
         .pluck('user_id, sum(total_amount), group_concat(id)')
         .each do |user_id, _, order_ids|
      # order_ids is a comma-separated string, e.g. "1,5,7"
      # ...
    end
  end

  def with_increased_group_concat_max_len(value = 1_048_576)
    original = get_group_concat_max_len
    set_group_concat_max_len(value)
    yield
  ensure
    set_group_concat_max_len(original)
  end

  def get_group_concat_max_len
    row = ActiveRecord::Base.connection.select_one <<-SQL
      SHOW VARIABLES like 'group_concat_max_len'
    SQL

    row['Value'].to_i
  end

  def set_group_concat_max_len(value)
    ActiveRecord::Base.connection.execute <<-SQL
      SET SESSION group_concat_max_len = #{value};
    SQL
  end

We wrapped the query in a block that remembers the original value of group_concat_max_len, increases it for the duration of the query, and sets it back to the previous value once the query has been executed.

Just as with PostgreSQL, we got all the information we want with just 1 database round-trip, awesome!

WARNING

The maximum value that can be set for group_concat_max_len depends on another server variable, max_allowed_packet. I strongly recommend reading the documentation and verifying what the maximum value is for your installation.
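As a quick sanity check (plain MySQL introspection, analogous to the get_group_concat_max_len helper above), you can read the current max_allowed_packet the same way:

  ActiveRecord::Base.connection.select_one("SHOW VARIABLES LIKE 'max_allowed_packet'")['Value'].to_i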

Let’s say that we have to import data from some social network. We will be provided with a CSV file, generated every 5 minutes, containing a list of posts that were added/liked/commented on since the beginning of the day, together with like and comment counts for each post.

As you can imagine, the file will grow in size during the day; we are also looking at not only inserting new posts but also updating counts for the ones we have already imported.

Test environment

  • a sample example.csv file containing 50,000 rows in the following format:

    title,published_at,likes_count,comments_count,post_id

  • a PostgreSQL table with a unique index on external_post_id:

    CREATE TABLE public.posts
    (
      id integer NOT NULL DEFAULT nextval('posts_id_seq'::regclass),
      title character varying,
      published_at timestamp with time zone,
      likes_count integer NOT NULL DEFAULT 0,
      comments_count integer NOT NULL DEFAULT 0,
      external_post_id integer NOT NULL,
      CONSTRAINT posts_pkey PRIMARY KEY (id)
    );

    CREATE UNIQUE INDEX index_posts_on_external_post_id
      ON public.posts
      USING btree
      (external_post_id);

  • an ActiveRecord model:

    class Post < ActiveRecord::Base
      validates :title, presence: true
      validates :external_post_id, presence: true
      validates :external_post_id, uniqueness: true
    end

To better emulate the requirements, all rows from the CSV file with an odd external_post_id will already be present in the database.

Simple approach

To establish a baseline, let’s start with a very simple version that uses first_or_create, provided by ActiveRecord out of the box:

require 'csv'

CSV.foreach(Rails.root.join('example.csv')) do |row|
  data = {
    title: row[0],
    published_at: row[1],
    likes_count: row[2],
    comments_count: row[3]
  }
  
  Post.where(external_post_id: row[4]).first_or_create(data).update(data)
end

The result is expected but still pretty disappointing:

      user     system      total         real
112.940000   9.690000 122.630000 (159.846472)

Stepping up our game

We all know there are better ways to perform bulk imports with ActiveRecord, and one of them is the activerecord-import gem. It supports PostgreSQL 9.5’s ON CONFLICT clause, which makes it perfect for our use case.

According to the documentation, the fastest method is to use raw columns and arrays of values, and that is exactly what we are going to do:

columns = [:title, :published_at, :likes_count, :comments_count, :external_post_id]
values = CSV.read(Rails.root.join('example.csv'))

Post.import columns, values, validate: false, on_duplicate_key_update: { conflict_target: [:external_post_id] }

The result is amazing - we are almost 23x faster than before!

    user     system      total         real
6.160000   0.060000   6.220000 (  6.974064)

Can we be even faster?

Yes, we can!

The fastest way to put data into PostgreSQL is to use the COPY command, but it comes with some limitations. One that is especially important in our context is the lack of upsert support.

Until it becomes available, we can achieve similar functionality using temporary tables.

Post.connection.execute <<-SQL
  CREATE TEMP TABLE post_imports
  (
    title character varying,
    published_at timestamp with time zone,
    likes_count integer,
    comments_count integer,
    external_post_id integer
  )
SQL

File.open(Rails.root.join('example.csv'), 'r') do |file|
  Post.connection.raw_connection.copy_data %{copy post_imports from stdin with csv delimiter ',' quote '"'} do
    while line = file.gets do
      Post.connection.raw_connection.put_copy_data line
    end
  end
end

Post.connection.execute <<-SQL
  insert into posts(title, published_at, likes_count, comments_count, external_post_id)
  select title, published_at, likes_count, comments_count, external_post_id
  from post_imports
  on conflict(external_post_id) do
  update set
    title = EXCLUDED.title,
    published_at = EXCLUDED.published_at,
    likes_count = EXCLUDED.likes_count,
    comments_count = EXCLUDED.comments_count
  returning id
SQL

In the above example we create a temporary table post_imports and copy all the data into it. Next we take advantage of the INSERT INTO ... SELECT and ON CONFLICT syntax to move the data from the temporary table to the posts table.

The code is longer and less pretty than the one using the activerecord-import gem. It’s also 10x faster!

    user     system      total         real
0.150000   0.020000   0.170000 (  0.643315)
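One refinement worth considering (my own suggestion, not part of the original benchmark): create the temp table with ON COMMIT DROP and wrap the import in a transaction, so the staging table disappears automatically when the transaction commits:

Post.transaction do
  Post.connection.execute <<-SQL
    CREATE TEMP TABLE post_imports
    (
      title character varying,
      published_at timestamp with time zone,
      likes_count integer,
      comments_count integer,
      external_post_id integer
    ) ON COMMIT DROP
  SQL

  # ... COPY the CSV data and run the INSERT ... ON CONFLICT from the example above ...
end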

Final thoughts

It’s not surprising that the method closest to the database turns out to be the quickest.

10x faster than activerecord-import (and 248x faster than the simple approach) is very impressive, but if we look at the raw execution times we might find that roughly 7 seconds is perfectly acceptable in most cases. Because of that I would recommend going with the activerecord-import gem, keeping in mind that we can do much better if we really need to.

Not so long ago I finished my React + ReactRouter project and was looking for a neat way to deploy its production version.

Since it’s a 100% static website, the first thing that came to my mind was Amazon S3 and their website hosting. The process is very simple:

  1. create a bucket
  2. make it publicly readable
  3. upload your html/css/js/other files
  4. enable static website hosting on the bucket
  5. change some DNS settings
  6. … you’re done - there is no point 6 ;)

To make it even simpler I used the awesome s3_website tool. It only required creating a simple configuration file, and I was able to push my work to S3.
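For reference, a minimal s3_website.yml looks roughly like this (the bucket name and ERB-style environment variables are illustrative):

s3_id: <%= ENV['S3_ID'] %>
s3_secret: <%= ENV['S3_SECRET'] %>
s3_bucket: my-awesome-site.com

With that in place, s3_website push uploads the site to the bucket.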

Issues with routing

Unfortunately S3 website hosting wasn’t playing nice with ReactRouter. I typed http://my-awesome-site.com/products/cool-product in the browser’s address bar and got a 404 Not Found error. I realized that’s completely justified on the S3 side, since no such file was present in the bucket.

On staging I was handling such cases with an Nginx configuration:

location / {
  ...
  try_files $uri.html $uri $uri/ /index.html;
  ...
}

This basically means: try to match the URI, and if that fails, just serve index.html from the root directory.

Thanks to the above directive, routing was handled by ReactRouter and the site worked perfectly on staging.

Intermediate solution

One possible workaround was to switch ReactRouter to use createHashHistory and set up redirect rules on the S3 bucket like this:

<RoutingRules>
  <RoutingRule>
    <Condition>
      <HttpErrorCodeReturnedEquals>404</HttpErrorCodeReturnedEquals>
    </Condition>
    <Redirect>
      <HostName>my-awesome-site.com</HostName>
      <ReplaceKeyPrefixWith>#/</ReplaceKeyPrefixWith>
    </Redirect>
  </RoutingRule>
</RoutingRules>

That way, when I typed http://my-awesome-site.com/products/cool-product I was redirected to http://my-awesome-site.com/#/products/cool-product. My site worked again, but with ugly URLs that were neither user- nor Google-friendly.

CloudFront

I was looking through the AWS Management Console, hoping I could find some solution, and I noticed the Error Pages section in my CloudFront distribution. It turned out that I was able to override the default behaviour of the 404 HTTP error, and that was exactly what I needed. I set Response Page Path to /index.html and HTTP Response Code to 200.

(screenshot: CloudFront Error Pages configuration)

Since this is essentially the same configuration that I had in Nginx, I was able to access my site through normal URLs again :)

I’ve been working with some friends on a fun side project lately - AppShape. It’s a product that will allow users to constantly monitor their applications’ health from both a technical and a business perspective.

We decided to split the product into 3 separate applications:

  • appshape - the user-facing web application, running on Ruby on Rails
  • appshape-runner - runs user-defined checks against their applications, written in Ruby
  • appshape-reporter - reports the results of these checks back to the main application, written in Ruby

At the moment we are also using two beanstalkd queues and a clockwork process to glue everything together.

This means launching 6 processes in 6 different terminal windows, constantly switching between them to check logs…

I had to come up with something better.

Foreman

I like the simplicity of the Procfile approach provided by Foreman. It’s just one command and you get colorized output, all logs in one window, and the ability to kill all your processes with CTRL+C. I wanted that for my project!

Unfortunately Foreman doesn’t support the way we’ve structured AppShape:

  • it doesn’t handle running processes from a different directory
  • it doesn’t support different RVM rubies/gemsets, which we are using for handling multiple Ruby versions

Subcontractor

After a short round of googling I found subcontractor - a nice gem that is supposed to handle all of this.

Let’s take a look at the Procfile with subcontractor commands included:

web:            bundle exec puma -C ./config/puma.rb
runner_queue:   beanstalkd -p 11305
reporter_queue: beanstalkd -p 11306
clock:          clockwork clock.rb
runner:         subcontract --rvm 'ruby-2.2.3@appshape-runner' --signal TERM --chdir ../appshape-runner -- bundle exec ruby main.rb
reporter:       subcontract --rvm 'ruby-2.2.3@appshape-results-reporter' --signal TERM --chdir ../appshape-reporter -- bundle exec ruby main.rb

The usage is pretty simple:

  • --rvm - sets the Ruby version and gemset
  • --signal - which signal should be sent to the process
  • --chdir - the directory to change to
  • -- bundle exec ruby main.rb - the command that should be executed

With this setup I was able to start foreman as usual with foreman start and I could see all my processes running.

I would have been perfectly happy with the above, but I noticed that processes started via subcontractor were sometimes not killed. After a few hours of work and multiple restarts I ran ps aux | grep ruby and saw about 20 running Ruby processes.

I wasn’t able to fix the issue in a reasonable amount of time, so I decided to try another approach.

Foreman (again)

Back in the old days I used bash -lc when I wanted to run some of my Ruby programs via cron. It starts a non-interactive login shell and executes the given command. The trick here is the login option: amongst other things it sources your .profile files, so RVM will be available!

Here’s my modified Procfile:

web: bundle exec puma -C ./config/puma.rb
runner_queue: beanstalkd -p 11305
reporter_queue: beanstalkd -p 11306
clock: clockwork clock.rb
runner: bash -lc "cd ../appshape-runner && bundle exec ruby main.rb"
reporter: bash -lc "cd ../appshape-reporter && bundle exec ruby main.rb"

All processes are killed after Foreman stops, and I can benefit from the “one command to start them all ™” setup :)

(screenshot: foreman output)