Really fast CSV imports with Rails and PostgreSQL

Let’s say that we have to import data from some social network. We will be provided with a CSV file, generated every 5 minutes, containing a list of posts that were added, liked, or commented on since the beginning of the day, together with the likes and comments counts for each post.

As you can imagine, the file will grow in size during the day; we also not only need to insert new posts but also update the counts for the ones we have already imported.

Test environment

  • sample example.csv file containing 50,000 rows, comma-separated, with the following column layout (matching the code below):
title,published_at,likes_count,comments_count,external_post_id
  • PostgreSQL table with unique index on external_post_id
CREATE TABLE public.posts
 (
   id integer NOT NULL DEFAULT nextval('posts_id_seq'::regclass),
   title character varying,
   published_at timestamp with time zone,
   likes_count integer NOT NULL DEFAULT 0,
   comments_count integer NOT NULL DEFAULT 0,
   external_post_id integer NOT NULL,
   CONSTRAINT posts_pkey PRIMARY KEY (id)
 );
 CREATE UNIQUE INDEX index_posts_on_external_post_id
   ON public.posts
   USING btree
   (external_post_id);
  • ActiveRecord model
class Post < ActiveRecord::Base
  validates :title, presence: true
  validates :external_post_id, presence: true
  validates :external_post_id, uniqueness: true
end

To better emulate the requirements, all rows from the CSV file with an odd external_post_id will already be present in the database before each import runs.
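A quick sketch of how that seeding could be done (my illustration, assuming the column layout documented above; not necessarily the exact script used here):

require 'csv'

# Pre-load every row with an odd external_post_id so the import has to
# update roughly half of the records instead of inserting all of them.
CSV.foreach(Rails.root.join('example.csv')) do |row|
  next unless row[4].to_i.odd?

  Post.create!(
    title: row[0],
    published_at: row[1],
    likes_count: row[2],
    comments_count: row[3],
    external_post_id: row[4]
  )
end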

Simple approach

To establish a baseline, let’s start with a very simple version using first_or_create, which ActiveRecord provides out of the box:

require 'csv'

CSV.foreach(Rails.root.join('example.csv')) do |row|
  data = {
    title: row[0],
    published_at: row[1],
    likes_count: row[2],
    comments_count: row[3]
  }

  # Find the post by its external id, create it if missing,
  # then apply the latest counts.
  Post.where(external_post_id: row[4]).first_or_create(data).update(data)
end

The result is expected but still pretty disappointing. Each row costs at least a SELECT followed by an INSERT or UPDATE, so we pay the query and round-trip overhead 50,000 times over:

      user     system      total         real
112.940000   9.690000 122.630000 (159.846472)
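The timings quoted throughout this post follow the output format of Ruby’s standard Benchmark module; a minimal harness along these lines (my sketch, not necessarily the exact setup used here) produces them:

require 'benchmark'

Benchmark.bm do |x|
  x.report do
    # run one of the import variants under test here
  end
end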

Stepping up our game

We all know there are better ways to perform bulk imports with ActiveRecord, and one of them is the activerecord-import gem. It supports PostgreSQL 9.5’s ON CONFLICT clause, which makes it perfect for our use case.

According to the documentation, the fastest method is to use raw columns and arrays of values, and that is exactly what we are going to do:

require 'csv'

columns = [:title, :published_at, :likes_count, :comments_count, :external_post_id]
values  = CSV.read(Rails.root.join('example.csv'))

Post.import columns, values, validate: false,
            on_duplicate_key_update: { conflict_target: [:external_post_id] }

The result is amazing: we are almost 23x faster than before!

    user     system      total         real
6.160000   0.060000   6.220000 (  6.974064)
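A side note before moving on (not part of the original benchmark): if the single multi-row INSERT ever grows unwieldy for much larger files, activerecord-import also accepts a batch_size option that splits the work into several statements:

Post.import columns, values, validate: false, batch_size: 10_000,
            on_duplicate_key_update: { conflict_target: [:external_post_id] }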

Can we be even faster?

Yes, we can!

The fastest way to get data into PostgreSQL is the COPY command, but it comes with some limitations. The one that matters most in our context is the lack of upsert support.

Until that becomes available, we can achieve similar functionality using a temporary table.

Post.connection.execute <<-SQL
  CREATE TEMP TABLE post_imports
  (
    title character varying,
    published_at timestamp with time zone,
    likes_count integer,
    comments_count integer,
    external_post_id integer
  )
SQL

conn = Post.connection.raw_connection

File.open(Rails.root.join('example.csv'), 'r') do |file|
  # Stream the file into the staging table line by line via COPY FROM STDIN.
  conn.copy_data %{copy post_imports from stdin with csv delimiter ',' quote '"'} do
    while line = file.gets
      conn.put_copy_data line
    end
  end
end
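One caveat about the COPY invocation above: it assumes example.csv has no header row. If your export does start with one, COPY can skip it with the standard HEADER option (a variant from me, not part of the benchmark):

# Same streaming code, but PostgreSQL ignores the first (header) line:
sql = %{copy post_imports from stdin with csv header delimiter ',' quote '"'}
Post.connection.raw_connection.copy_data(sql) do
  # put_copy_data each line exactly as before
end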

Post.connection.execute <<-SQL
  insert into posts(title, published_at, likes_count, comments_count, external_post_id)
  select title, published_at, likes_count, comments_count, external_post_id
  from post_imports
  on conflict(external_post_id) do
  update set
    title = EXCLUDED.title,
    published_at = EXCLUDED.published_at,
    likes_count = EXCLUDED.likes_count,
    comments_count = EXCLUDED.comments_count
  returning id
SQL

In the example above we create a temporary table, post_imports, and copy all the data into it. We then take advantage of the INSERT INTO ... SELECT ... ON CONFLICT syntax to move the data from the temporary table into posts; the RETURNING id clause makes the ids of the inserted or updated rows available in the result, should we need them.
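One housekeeping detail worth adding (my addition, not part of the benchmark): a temp table lives for the rest of the database session, so with pooled connections a rerun would fail with "relation already exists". Dropping the staging table once the upsert has finished, or creating it with ON COMMIT DROP inside a transaction, avoids that:

# Explicit cleanup so repeated imports on a pooled connection don't collide
# with a leftover staging table from a previous run:
Post.connection.execute 'DROP TABLE IF EXISTS post_imports'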

The code is longer and less pretty than the activerecord-import version. It’s also 10x faster!

    user     system      total         real
0.150000   0.020000   0.170000 (  0.643315)

Final thoughts

It’s not surprising that the method closest to the database turns out to be the quickest.

10x faster than activerecord-import (and 248x faster than the simple approach) is very impressive, but if we look at the raw execution times we might find that roughly 7 seconds is perfectly acceptable in most cases. Because of that, I would recommend going with the activerecord-import gem, while keeping in mind that we can do much better if we really need to.

Adrian Serafin
