Let’s say that we have to import data from some social network. We will be provided with a CSV file, generated every 5 minutes, containing a list of posts that were added/liked/commented since the beginning of the day, together with likes’ and comments’ counts for each post.

As you can imagine, the file will grow in size during the day; we are also looking not only at inserting new posts but also at updating counts for the ones we have already imported.

Test environment

  • sample example.csv file containing 50,000 rows with the following format:
title,published_at,likes_count,comments_count,post_id
  • PostgreSQL table with unique index on external_post_id
CREATE TABLE public.posts
 (
   id integer NOT NULL DEFAULT nextval('posts_id_seq'::regclass),
   title character varying,
   published_at timestamp with time zone,
   likes_count integer NOT NULL DEFAULT 0,
   comments_count integer NOT NULL DEFAULT 0,
   external_post_id integer NOT NULL,
   CONSTRAINT posts_pkey PRIMARY KEY (id)
 );
 CREATE UNIQUE INDEX index_posts_on_external_post_id
   ON public.posts
   USING btree
   (external_post_id);
  • ActiveRecord model
class Post < ActiveRecord::Base
  validates :title, presence: true
  validates :external_post_id, presence: true
  validates :external_post_id, uniqueness: true
end

To better emulate the requirements, all rows from the CSV file with an odd external_post_id will already be present in the database.
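
A hedged sketch of how this pre-seeding could look (the article doesn’t show the setup, so the details below are assumptions; column order follows the CSV format above):

CSV.foreach(Rails.root.join('example.csv')) do |row|
  external_id = row[4].to_i
  next unless external_id.odd?

  # counts start at zero so the import has something to update later
  Post.create!(
    title: row[0],
    published_at: row[1],
    likes_count: 0,
    comments_count: 0,
    external_post_id: external_id
  )
end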

Simple approach

To establish a baseline, let’s start with a very simple version that uses first_or_create, provided by ActiveRecord out of the box:

CSV.foreach(Rails.root.join('example.csv')) do |row|
  data = {
    title: row[0],
    published_at: row[1],
    likes_count: row[2],
    comments_count: row[3]
  }
  
  Post.where(external_post_id: row[4]).first_or_create(data).update(data)
end

The result is expected but still pretty disappointing:

      user     system      total         real
112.940000   9.690000 122.630000 (159.846472)
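
In case you’re wondering where these numbers come from: they are the standard output of Ruby’s Benchmark module. A minimal sketch of the harness (the article doesn’t show its own, so treat this as an assumption):

require 'benchmark'
require 'csv'

# Benchmark.measure returns a Benchmark::Tms object; printing it gives
# the user/system/total/real line shown above
puts Benchmark.measure {
  CSV.foreach(Rails.root.join('example.csv')) do |row|
    # ... the import code under test ...
  end
}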

Stepping up our game

We all know there are better ways to perform bulk imports with ActiveRecord, and one of them is the activerecord-import gem. It supports PostgreSQL 9.5’s ON CONFLICT, which makes it perfect for our use case.

According to the documentation, the fastest method is to use raw columns and arrays of values, and that is exactly what we are going to do:

columns = [:title, :published_at, :likes_count, :comments_count, :external_post_id]
values = CSV.read(Rails.root.join('example.csv'))

Post.import columns, values, validate: false, on_duplicate_key_update: { conflict_target: [:external_post_id] }

The result is amazing - we are almost 23x faster than before!

    user     system      total         real
6.160000   0.060000   6.220000 (  6.974064)

Can we be even faster?

Yes, we can!

The fastest way to put data into PostgreSQL is to use the COPY command, but it comes with some limitations. One especially important in our context is the lack of upsert support.

Until it becomes available we can achieve similar functionality using TEMP TABLES.

Post.connection.execute <<-SQL
  CREATE TEMP TABLE post_imports
  (
    title character varying,
    published_at timestamp with time zone,
    likes_count integer,
    comments_count integer,
    external_post_id integer
  )
SQL

File.open(Rails.root.join('example.csv'), 'r') do |file|
  Post.connection.raw_connection.copy_data %{copy post_imports from stdin with csv delimiter ',' quote '"'} do
    while line = file.gets do
      Post.connection.raw_connection.put_copy_data line
    end
  end
end

Post.connection.execute <<-SQL
  insert into posts(title, published_at, likes_count, comments_count, external_post_id)
  select title, published_at, likes_count, comments_count, external_post_id
  from post_imports
  on conflict(external_post_id) do
  update set
    title = EXCLUDED.title,
    published_at = EXCLUDED.published_at,
    likes_count = EXCLUDED.likes_count,
    comments_count = EXCLUDED.comments_count
  returning id
SQL

In the above example we create a temporary table post_imports and copy all the data into it. Next we take advantage of the INSERT INTO ... SELECT ... and ON CONFLICT syntax to move the data from the temporary table to the posts table.
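
One extra note (an assumption on my part, not covered above): the temporary table lives until the end of the database session, so if the connection is reused - for example when it comes from a connection pool - it may be worth dropping it explicitly once the import is done:

Post.connection.execute('DROP TABLE IF EXISTS post_imports')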

The code is longer and less pretty than the one using the activerecord-import gem. It’s also 10x faster!

    user     system      total         real
0.150000   0.020000   0.170000 (  0.643315)

Final thoughts

It’s not surprising that the method closest to the database turns out to be the quickest.

Being 10x faster than activerecord-import (and 248x faster than the simple approach) is very impressive, but if we look at the raw execution times we might find that roughly 7 seconds is perfectly acceptable in most cases. Because of that I would recommend going with the activerecord-import gem, but keep in mind that we can do much better if we really need to.

Not so long ago I finished my React + ReactRouter project and was looking for a neat way to deploy its production version.

Since it’s a 100% static website, the first thing that came to my mind was Amazon S3 and their website hosting. The process is very simple:

  1. create a bucket
  2. make it publicly readable
  3. upload your html/css/js/other files
  4. enable static website hosting on the bucket
  5. change some DNS settings
  6. … you’re done - there is no point 6 ;)

To make it even simpler I used the awesome s3_website tool. It only required creating a simple configuration file and I was able to push my work to S3.

Issues with routing

Unfortunately S3 website hosting wasn’t playing nice with ReactRouter. I typed http://my-awesome-site.com/products/cool-product in the browser’s address bar and got a 404 Not Found error. I realized that’s completely justified on the S3 side, since that file wasn’t present in the bucket.

On staging I was handling such cases with Nginx configuration:

location / {
  ...
  try_files $uri.html $uri $uri/ /index.html;
  ...
}

This basically means: try to match the URI and, if that fails, just serve index.html from the root directory.

Thanks to the above directive, routing was handled by ReactRouter and the site worked perfectly on staging.

Intermediate solution

One possible workaround was to switch ReactRouter to use createHashHistory and set up redirect rules on the S3 bucket like this:

<RoutingRules>
  <RoutingRule>
    <Condition>
      <HttpErrorCodeReturnedEquals>404</HttpErrorCodeReturnedEquals>
    </Condition>
    <Redirect>
      <HostName>my-awesome-site.com</HostName>
      <ReplaceKeyPrefixWith>#/</ReplaceKeyPrefixWith>
    </Redirect>
  </RoutingRule>
</RoutingRules>

That way, when I typed http://my-awesome-site.com/products/cool-product I was redirected to http://my-awesome-site.com/#/products/cool-product. My site worked again, but with ugly URLs that were neither user- nor Google-friendly.

CloudFront

I was looking through the Amazon Management Console, hoping I could find some solution, and I noticed the Error Pages section in my CloudFront distribution. It turned out that I was able to override the default behaviour of the 404 HTTP error, and that was exactly what I needed. I set Response Page Path to /index.html and HTTP Response Code to 200.

[screenshot: CloudFront Error Pages configuration]

Since this is essentially the same configuration that I had in Nginx, I was able to access my site through a normal URL again :)

I’ve been working with some friends on a fun side project lately - AppShape. It’s a product that will allow users to constantly monitor their applications’ health from both a technical and a business perspective.

We decided to split the product into 3 separate applications:

  • appshape - user facing web application running on Ruby On Rails
  • appshape-runner - runs user-defined checks against their applications, written in Ruby
  • appshape-reporter - reports results of these checks back to the main application, written in Ruby

At the moment we are also using two beanstalkd queues and a clockwork process to glue everything together.

This means launching 6 processes in 6 different terminal windows, constantly switching between them to check logs…

I had to come up with something better.

Foreman

I like the simplicity of the Procfile approach provided by Foreman. It’s just one command and you get colorized output, all logs in one window and the ability to kill all your processes with CTRL+C. I wanted that for my project!

Unfortunately Foreman doesn’t support the way we’ve structured AppShape:

  • it doesn’t handle running processes from a different directory
  • it doesn’t support different RVM rubies/gemsets, which we use for handling multiple Ruby versions and gemsets

Subcontractor

After a bit of googling I found subcontractor - a nice gem that is supposed to handle all of this.

Let’s take a look at a Procfile with the subcontractor commands included:

web:            bundle exec puma -C ./config/puma.rb
runner_queue:   beanstalkd -p 11305
reporter_queue: beanstalkd -p 11306
clock:          clockwork clock.rb
runner:         subcontract --rvm 'ruby-2.2.3@appshape-runner' --signal TERM --chdir ../appshape-runner -- bundle exec ruby main.rb
reporter:       subcontract --rvm 'ruby-2.2.3@appshape-results-reporter' --signal TERM --chdir ../appshape-reporter -- bundle exec ruby main.rb

The usage is pretty simple:

  • --rvm - sets the Ruby version and gemset
  • --signal - which signal should be sent to the process
  • --chdir - the directory to change to
  • -- bundle exec ruby main.rb - the command that should be executed

With this setup I was able to start foreman as usual with foreman start and I could see all my processes running.

I would be perfectly happy with the above, but I noticed that sometimes processes started via subcontractor were not killed. After a few hours of work and multiple restarts I executed ps aux | grep ruby and saw about 20 running Ruby processes.

I wasn’t able to fix the issue in a reasonable amount of time, so I decided to try another approach.

Foreman (again)

Back in the old days I used bash -lc when I wanted to run some of my Ruby programs via cron. It starts a non-interactive login shell and executes the given command. The trick here is the login option. Amongst other things it sources your .profile files, so RVM will be available!

Here’s my modified Procfile:

web: bundle exec puma -C ./config/puma.rb
runner_queue: beanstalkd -p 11305
reporter_queue: beanstalkd -p 11306
clock: clockwork clock.rb
runner: bash -lc "cd ../appshape-runner && bundle exec ruby main.rb"
reporter: bash -lc "cd ../appshape-reporter && bundle exec ruby main.rb"

All processes are killed after Foreman stops, and I can benefit from the “one command to start them all ™” setup :)

[screenshot: Foreman output]

In one of my projects I’m processing a large number of XML files containing various information about sport events. I have to record all the files that made modifications to a given sport event. Since a file can sometimes be re-uploaded, I decided to use a convenient Rails helper:

SportEventFile.where(sport_event_id: 686954, file: 'file.xml').first_or_create!

It worked like a charm - quick and easy solution to my problem.

After a while I noticed an increase in processing times for a single file. A short investigation led to the first_or_create! line of code:

# 0.040000   0.000000   0.040000 (  1.555691)
Benchmark.measure do 
  SportEventFile.where(sport_event_id: 686954, file: 'file.xml').first_or_create!
end  

WOW - it took 1.5s to execute this simple function!

The next step was to run EXPLAIN ANALYZE on the underlying query:

EXPLAIN ANALYZE 
  SELECT 
    "sport_event_files".* 
  FROM 
    "sport_event_files" 
  WHERE 
    "sport_event_files"."file" = 'file.xml' AND 
    "sport_event_files"."sport_event_id" = 686954 
  ORDER BY "sport_event_files"."id";

-- query plan
Index Scan using sport_event_files_pkey on sport_event_files  (cost=0.43..279322.55 rows=1 width=63) (actual time=1226.469..1226.515 rows=1 loops=1)
 Filter: (((file)::text = 'file.xml'::text) AND (sport_event_id = 686954))
 Rows Removed by Filter: 5492868
Total runtime: 1226.542 ms
(4 rows)

Because of the ORDER BY "sport_event_files"."id" at the end, PostgreSQL chose to use the primary key index for the sorting and to apply the conditions as a filter. This essentially meant walking almost the entire table of nearly 5.5 million rows (always a bad idea ;)
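
The ORDER BY itself comes from ActiveRecord: first_or_create! calls #first, which implicitly orders by the primary key. A small sketch of the difference (same model; the exact SQL wording is my assumption):

# #first adds an implicit ORDER BY on the primary key
SportEventFile.where(sport_event_id: 686954, file: 'file.xml').first
# SELECT ... ORDER BY "sport_event_files"."id" ASC LIMIT 1

# #take does not, so the planner is free to use the composite index
SportEventFile.where(sport_event_id: 686954, file: 'file.xml').take
# SELECT ... LIMIT 1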

Here is what happened when I removed the ORDER BY part.

EXPLAIN ANALYZE 
  SELECT 
    "sport_event_files".* 
  FROM 
    "sport_event_files" 
  WHERE 
    "sport_event_files"."file" = 'file.xml' AND 
    "sport_event_files"."sport_event_id" = 686954;

-- query plan
Index Scan using index_sport_event_files_on_file_and_sport_event_id on sport_event_files  (cost=0.56..8.58 rows=1 width=63) (actual time=0.060..0.061 rows=1 loops=1)
  Index Cond: (((file)::text = 'file.xml'::text) AND (sport_event_id = 686954))
Total runtime: 0.088 ms
(3 rows)

I ended up with a scan on the composite index and a query time measured in fractions of a millisecond!

I rewrote my code using .exists? and everything was great again!
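
A minimal sketch of what such a rewrite could look like (my assumption - the final code isn’t shown here). exists? issues a plain SELECT 1 ... LIMIT 1 with no ORDER BY, so the composite index can be used; note that, unlike first_or_create!, this naive version does nothing to guard against races between the check and the insert:

scope = SportEventFile.where(sport_event_id: 686954, file: 'file.xml')
scope.create! unless scope.exists?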

I’ve been writing a Ruby app lately (the kind that runs in the terminal and doesn’t include WEBrick as an HTTP server) and I found myself missing some of the tooling that is available with Rails - especially the console. The ability to just type rails c and play around with your code was one of the framework’s killer features back in the day.

Here’s how to do it in a plain Ruby application :)

Sample application

Our app will contain only three files:

# cat.rb
class Cat
  def make_sound
    puts 'miau' 
  end
end

# config/application.rb
require_relative '../cat'

# main.rb
require_relative 'config/application'
Cat.new.make_sound

Console script

We will follow Rails conventions and put our executables in the bin directory:

# bin/console

#!/usr/bin/env ruby
require 'irb'
require 'irb/completion'
require_relative '../config/application'

IRB.start

And that’s it - after making the script executable (chmod +x bin/console) we can run bin/console and our Cat class is available!

Reloading code

It wouldn’t feel like a real console without being able to call reload!, so let’s try to add it!

First we need to change require_relative to load in application.rb, because require will not load our source file again if it has already been loaded.

# config/application.rb
load File.expand_path('../../cat.rb', __FILE__)

Then we will add a reload! function to our script:

# bin/console

#!/usr/bin/env ruby
require 'irb'
require 'irb/completion'

def reload!
  load File.expand_path('../../config/application.rb', __FILE__)
end

reload!
IRB.start