Let’s say that we have to import data from some social network. Every 5 minutes we will be provided with a CSV file containing the list of posts that were added/liked/commented since the beginning of the day, together with the likes’ and comments’ counts for each post.

As you can imagine, the file will grow in size during the day; we are also looking at not only inserting new posts but also updating the counts for the ones we have already imported.

Test environment

  • a sample example.csv file containing 50,000 rows with the following comma-separated columns (this is the order the code below relies on):
title,published_at,likes_count,comments_count,external_post_id
  • a PostgreSQL table with a unique index on external_post_id
CREATE TABLE public.posts
 (
   id integer NOT NULL DEFAULT nextval('posts_id_seq'::regclass),
   title character varying,
   published_at timestamp with time zone,
   likes_count integer NOT NULL DEFAULT 0,
   comments_count integer NOT NULL DEFAULT 0,
   external_post_id integer NOT NULL,
   CONSTRAINT posts_pkey PRIMARY KEY (id)
 );
 CREATE UNIQUE INDEX index_posts_on_external_post_id
   ON public.posts
   USING btree
   (external_post_id);
  • ActiveRecord model
class Post < ActiveRecord::Base
  validates :title, presence: true
  validates :external_post_id, presence: true
  validates :external_post_id, uniqueness: true
end

To better emulate the requirements, all rows from the CSV file with an odd external_post_id will already be present in the database.
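
A minimal seeding sketch for that setup (an assumption on my part; the original doesn’t show how the database was prepared):

require 'csv'

# Pre-populate the table with every row that has an odd external_post_id,
# so roughly half of the records will already exist and will have to be updated.
CSV.foreach(Rails.root.join('example.csv')) do |row|
  next unless row[4].to_i.odd?

  Post.create!(
    title: row[0],
    published_at: row[1],
    likes_count: row[2],
    comments_count: row[3],
    external_post_id: row[4]
  )
end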

Simple approach

To establish a baseline, let’s start with a very simple version that uses first_or_create, provided by ActiveRecord out of the box:

require 'csv'

CSV.foreach(Rails.root.join('example.csv')) do |row|
  data = {
    title: row[0],
    published_at: row[1],
    likes_count: row[2],
    comments_count: row[3]
  }

  # Find the post by its external id (or create it), then update its attributes.
  Post.where(external_post_id: row[4]).first_or_create(data).update(data)
end

The result is expected but still pretty disappointing:

      user     system      total         real
112.940000   9.690000 122.630000 (159.846472)
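
For reference, the timings throughout this post are in the format printed by Ruby’s Benchmark module; a minimal harness (an assumption on my part, the post only shows the output) looks like this:

require 'benchmark'

Benchmark.bm do |x|
  x.report do
    # the import code from above goes here
  end
end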

Stepping up our game

We all know there are better ways to perform bulk imports with ActiveRecord, and one of them is the activerecord-import gem. It supports the PostgreSQL 9.5 ON CONFLICT clause, which makes it perfect for our use case.

According to the documentation, the fastest method is to use raw columns and arrays of values, and that is exactly what we are going to do:

require 'csv'

columns = [:title, :published_at, :likes_count, :comments_count, :external_post_id]
values = CSV.read(Rails.root.join('example.csv'))

# Skip ActiveRecord validations and let the unique index handle conflicts.
Post.import columns, values, validate: false, on_duplicate_key_update: { conflict_target: [:external_post_id] }

The result is amazing - we are almost 23x faster than before!

    user     system      total         real
6.160000   0.060000   6.220000 (  6.974064)

Could we be even faster?

Yes, we can!

The fastest way to put data into PostgreSQL is to use the COPY command, but it comes with some limitations. One that is especially important in our context is the lack of upsert support.

Until it becomes available, we can achieve similar functionality using a TEMP TABLE.

Post.connection.execute <<-SQL
  CREATE TEMP TABLE post_imports
  (
    title character varying,
    published_at timestamp with time zone,
    likes_count integer,
    comments_count integer,
    external_post_id integer
  )
SQL

File.open(Rails.root.join('example.csv'), 'r') do |file|
  # Stream the file line by line into the temp table via COPY ... FROM STDIN.
  Post.connection.raw_connection.copy_data %{copy post_imports from stdin with csv delimiter ',' quote '"'} do
    while line = file.gets do
      Post.connection.raw_connection.put_copy_data line
    end
  end
end

Post.connection.execute <<-SQL
  insert into posts(title, published_at, likes_count, comments_count, external_post_id)
  select title, published_at, likes_count, comments_count, external_post_id
  from post_imports
  on conflict(external_post_id) do
  update set
    title = EXCLUDED.title,
    published_at = EXCLUDED.published_at,
    likes_count = EXCLUDED.likes_count,
    comments_count = EXCLUDED.comments_count
  returning id
SQL

In the above example we create a temporary table, post_imports, and copy all the data into it. Next, we take advantage of the INSERT INTO ... SELECT and ON CONFLICT syntax to move the data from the temporary table into the posts table.
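
One housekeeping note: a temporary table lives until the connection is closed, so if this import runs every 5 minutes on a long-lived connection it is worth dropping the table explicitly once the upsert finishes (or creating it with ON COMMIT DROP inside a transaction):

Post.connection.execute('DROP TABLE IF EXISTS post_imports')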

The code is longer and less pretty than the one using the activerecord-import gem. It’s also 10x faster!

    user     system      total         real
0.150000   0.020000   0.170000 (  0.643315)

Final thoughts

It’s not surprising that the method closest to the database turns out to be the quickest.

10x faster than activerecord-import (and 248x faster than the simple approach) is very impressive, but if we look at the raw execution times we might find that roughly 7 seconds is perfectly acceptable in most cases. Because of that I would recommend going with the activerecord-import gem, keeping in mind that we can do much better if we really need to.

Not so long ago I finished my React + ReactRouter project and I was looking for a neat way to deploy its production version.

Since it’s a 100% static website, the first thing that came to my mind was Amazon S3 and their website hosting. The process is very simple:

  1. create a bucket
  2. make it publicly readable
  3. upload your html/css/js/other files
  4. enable static website hosting on the bucket
  5. change some DNS settings
  6. … you’re done - there is no point 6 ;)

To make it even simpler I used the awesome s3_website tool. It only required creating a simple configuration file, and I was able to push my work to S3.

Issues with routing

Unfortunately, S3 website hosting wasn’t playing nice with ReactRouter. When I typed http://my-awesome-site.com/products/cool-product in the browser’s address bar, I got a 404 Not Found error. That’s completely justified on the S3 side, since no such file was present in the bucket.

On staging I was handling such cases with this Nginx configuration:

location / {
  ...
  try_files $uri.html $uri $uri/ /index.html;
  ...
}

This basically means: try to match the URI, and if you fail, just serve index.html from the root directory.

Thanks to the above directive, routing was handled by ReactRouter and the site worked perfectly on staging.

Intermediate solution

One possible workaround was to switch ReactRouter to createHashHistory and set up redirect rules on the S3 bucket like this:

<RoutingRules>
  <RoutingRule>
    <Condition>
      <HttpErrorCodeReturnedEquals>404</HttpErrorCodeReturnedEquals>
    </Condition>
    <Redirect>
      <HostName>my-awesome-site.com</HostName>
      <ReplaceKeyPrefixWith>#/</ReplaceKeyPrefixWith>
    </Redirect>
  </RoutingRule>
</RoutingRules>

That way, when I typed http://my-awesome-site.com/products/cool-product I was redirected to http://my-awesome-site.com/#/products/cool-product. My site worked again, but with ugly URLs that were neither user- nor Google-friendly.

Cloudfront

I was looking through the AWS Management Console, hoping I could find some solution, and I noticed the Error Pages section in my CloudFront distribution. It turned out that I could override the default behaviour of the 404 HTTP error, and that was exactly what I needed. I set Response Page Path to /index.html and HTTP Response Code to 200.

(screenshot: CloudFront Error Pages configuration)

Since this is essentially the same configuration that I had in Nginx, I was able to access my site through normal URLs again :)

I’ve been working with some friends on a fun side project lately - AppShape. It’s a product that will allow users to constantly monitor their applications’ health from both a technical and a business perspective.

We decided to split the product into 3 separate applications:

  • appshape - the user-facing web application, running on Ruby on Rails
  • appshape-runner - runs user-defined checks against their applications, written in Ruby
  • appshape-reporter - reports the results of these checks back to the main application, written in Ruby

At the moment we are also using two beanstalkd queues and a clockwork process to glue everything together.
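
For context, the clockwork process is just a small scheduler script; a rough sketch of what clock.rb might contain (the actual schedule is not part of this post, and the job name is made up):

# clock.rb
require 'clockwork'

module Clockwork
  every(1.minute, 'schedule.checks') do
    # push the user-defined checks onto the runner queue here
  end
end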

This means launching 6 processes in 6 different terminal windows, constantly switching between them to check logs…

I had to come up with something better.

Foreman

I like the simplicity of the Procfile approach provided by Foreman. It’s just one command and you get colorized output, all logs in one window, and the ability to kill all your processes with CTRL+C. I wanted that for my project!

Unfortunately, Foreman doesn’t support the way we’ve structured AppShape:

  • it doesn’t handle running processes from a different directory
  • it doesn’t support switching RVM rubies/gemsets, which we use to handle multiple Ruby versions

Subcontractor

After a bit of googling I found subcontractor, a nice gem that is supposed to handle all of this.

Let’s take a look at the Procfile with the subcontractor commands included:

web:            bundle exec puma -C ./config/puma.rb
runner_queue:   beanstalkd -p 11305
reporter_queue: beanstalkd -p 11306
clock:          clockwork clock.rb
runner:         subcontract --rvm 'ruby-2.2.3@appshape-runner' --signal TERM --chdir ../appshape-runner -- bundle exec ruby main.rb
reporter:       subcontract --rvm 'ruby-2.2.3@appshape-results-reporter' --signal TERM --chdir ../appshape-reporter -- bundle exec ruby main.rb

The usage is pretty simple:

  • --rvm - sets the Ruby version and gemset
  • --signal - which signal should be sent to the process
  • --chdir - which directory to change to
  • -- bundle exec ruby main.rb - the command that should be executed

With this setup I was able to start foreman as usual with foreman start and I could see all my processes running.

I would have been perfectly happy with the above, but I noticed that sometimes processes started via subcontractor were not killed. After a few hours of work and multiple restarts I ran ps aux | grep ruby and saw about 20 running Ruby processes.

I wasn’t able to fix the issue in a reasonable amount of time, so I decided to try another approach.

Foreman (again)

Back in the old days I used bash -lc when I wanted to run some of my Ruby programs via cron. It starts a non-interactive login shell and executes the given command. The trick here is the login option: amongst other things, it sources your .profile files, so RVM will be available!

Here’s my modified Procfile:

web: bundle exec puma -C ./config/puma.rb
runner_queue: beanstalkd -p 11305
reporter_queue: beanstalkd -p 11306
clock: clockwork clock.rb
runner: bash -lc "cd ../appshape-runner && bundle exec ruby main.rb"
reporter: bash -lc "cd ../appshape-reporter && bundle exec ruby main.rb"

All processes are killed when Foreman stops, and I get to enjoy the “one command to start them all ™” setup :)

(screenshot: Foreman output)

I’ve been writing a Ruby app lately (the kind that runs in the terminal and doesn’t include WEBrick as an HTTP server) and I found myself missing some of the tooling that comes with Rails - especially the command line. The ability to just type rails c and play around with your code was one of the framework’s killer features back in the day.

Here’s how to get the same thing in a plain Ruby console application :)

Sample application

Our app will contain only three files:

# cat.rb
class Cat
  def make_sound
    puts 'miau' 
  end
end

# config/application.rb
require_relative '../cat'

# main.rb
require_relative 'config/application'
Cat.new.make_sound

Console script

We will follow Rails conventions and put our executables in the bin directory:

#!/usr/bin/env ruby
# bin/console

require 'irb'
require 'irb/completion'
require_relative '../config/application'

IRB.start

And that’s it - after making the script executable (chmod +x bin/console) we can run bin/console and our Cat class is available!
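
A quick sanity check (the prompt depends on your IRB configuration; shown here with the default):

$ bin/console
irb(main):001:0> Cat.new.make_sound
miau
=> nil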

Reloading code

It wouldn’t feel like a real console without being able to call reload!, so let’s add it!

First we need to change require_relative to load in application.rb, because the former will not load our source file again if it has already been loaded.

# config/application.rb
load File.expand_path('../../cat.rb', __FILE__)

Then we will add a reload! method to our script:

#!/usr/bin/env ruby
# bin/console

require 'irb'
require 'irb/completion'

def reload!
  load File.expand_path('../../config/application.rb', __FILE__)
end

reload!
IRB.start

I’ve been creating a lot of JSON APIs using Grape or Ruby on Rails lately. From time to time I face an issue where I need to return a large amount of data in a single response.

The usual way would be to execute a query via ActiveRecord and use a library like JBuilder or ActiveModelSerializers to generate the JSON response. It’s nice and easy, but sometimes too slow :(
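
For comparison, the usual approach looks roughly like this inside a controller action (a sketch - the model and serializer names are my assumptions, based on the ticker_minute_bars table used below):

bars = TickerMinuteBar.where(key: 'AAPL.BATS')
render json: bars, each_serializer: TickerMinuteBarSerializer # every row is serialized in Ruby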

In situations like this I rely on PostgreSQL and its row_to_json function to help me out.

Let’s convert the chart query from my previous post so it returns JSON that is ready to be sent to the charting library right away:

with intervals as (
  select start, start + interval '5min' as end
  from generate_series('2015-05-29 9:30', '2015-05-29 16:00', interval '5min') as start
)
select json_agg(row_to_json(ohlc)) from
(
    select distinct
      intervals.start as date,
      min(low) over w as low,
      max(high) over w as high,
      first_value(open) over w as open,
      last_value(close) over w as close,
      sum(volume) over w as volume
    from
      intervals
      join ticker_minute_bars tmb on
        tmb.key = 'AAPL.BATS' and
        tmb.date >= intervals.start and
        tmb.date < intervals.end
    window w as (partition by intervals.start)
    order by intervals.start
) as ohlc

Time: 76.968 ms

The best part is that the JSON serialization overhead, compared to the original query, is almost invisible!
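
And because the database hands back the whole payload as a single value, using it from Rails is a one-liner; here is a sketch (chart_sql stands for the query above; select_value returns the JSON document as a string, so render can pass it through untouched):

chart_json = ActiveRecord::Base.connection.select_value(chart_sql)
render json: chart_json # already serialized, nothing left for Ruby to do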

Thanks!