Really fast CSV imports with Rails and PostgreSQL
Let’s say that we have to import data from some social network. We will be provided with a CSV file, generated every 5 minutes, containing the list of posts that were added/liked/commented since the beginning of the day, together with the likes’ and comments’ counts for each post.
As you can imagine, the file will grow in size during the day; we are also looking at not only inserting new posts but also updating the counts for the ones we already imported.
Test environment
- a sample example.csv file containing 50,000 comma-delimited rows with the following columns (this is the order every code sample in this post assumes, with post_id last):
title,published_at,likes_count,comments_count,post_id
- a PostgreSQL table with a unique index on external_post_id
CREATE TABLE public.posts
(
id integer NOT NULL DEFAULT nextval('posts_id_seq'::regclass),
title character varying,
published_at timestamp with time zone,
likes_count integer NOT NULL DEFAULT 0,
comments_count integer NOT NULL DEFAULT 0,
external_post_id integer NOT NULL,
CONSTRAINT posts_pkey PRIMARY KEY (id)
);
CREATE UNIQUE INDEX index_posts_on_external_post_id
ON public.posts
USING btree
(external_post_id);
- an ActiveRecord model
class Post < ActiveRecord::Base
validates :title, presence: true
validates :external_post_id, presence: true
validates :external_post_id, uniqueness: true
end
To better emulate the requirements, all rows from the CSV file with an odd external_post_id will already be present in the database.
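To set that precondition up, we can pre-seed the table from the same file. The row filtering below is a sketch using Ruby's standard CSV library; the seeding loop in the comment uses the Post model and column names from the schema above:

```ruby
require 'csv'

# Rows whose post_id (the last column) is odd; these are the posts
# that must already exist so the import has something to update.
def odd_rows(csv_text)
  CSV.parse(csv_text).select { |row| row[4].to_i.odd? }
end

# Hypothetical seeding step, run once before benchmarking:
# odd_rows(File.read('example.csv')).each do |title, published_at, likes, comments, post_id|
#   Post.create!(title: title, published_at: published_at,
#                likes_count: likes, comments_count: comments,
#                external_post_id: post_id)
# end
```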
Simple approach
To establish a baseline, let’s start with a very simple version using first_or_create, provided by ActiveRecord out of the box:
CSV.foreach(Rails.root.join('example.csv')) do |row|
  data = {
    title: row[0],
    published_at: row[1],
    likes_count: row[2],
    comments_count: row[3]
  }
  Post.where(external_post_id: row[4]).first_or_create(data).update(data)
end
The result is expected but still pretty disappointing:
user system total real
112.940000 9.690000 122.630000 (159.846472)
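The four columns above follow the output format of Ruby's standard Benchmark module (user, system, total, and real seconds). A minimal harness for producing them might look like this, with a placeholder workload standing in for whichever import variant is being measured:

```ruby
require 'benchmark'

# Measure a block and capture user/system/total/real times.
result = Benchmark.measure do
  100_000.times { |i| i * i }  # placeholder workload
end

puts result  # prints the user/system/total/(real) columns
```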
Stepping up our game
We all know there are better ways to perform bulk imports with ActiveRecord, and one of them is the activerecord-import gem. It supports PostgreSQL 9.5’s ON CONFLICT clause, which makes it perfect for our use case.
According to the documentation, the fastest method is to use raw columns and arrays of values, and that is exactly what we are going to do:
columns = [:title, :published_at, :likes_count, :comments_count, :external_post_id]
values = CSV.read(Rails.root.join('example.csv'))
Post.import columns, values, validate: false, on_duplicate_key_update: { conflict_target: [:external_post_id] }
The result is amazing - we are almost 23x faster than before!
user system total real
6.160000 0.060000 6.220000 ( 6.974064)
Can we be even faster?

The fastest way to put data into PostgreSQL is to use the COPY command, but it comes with some limitations. One especially important in our context is the lack of upsert support.
Until it becomes available, we can achieve similar functionality using temporary tables.
Post.connection.execute <<-SQL
  CREATE TEMP TABLE post_imports
  (
    title character varying,
    published_at timestamp with time zone,
    likes_count integer,
    comments_count integer,
    external_post_id integer
  )
SQL
File.open(Rails.root.join('example.csv'), 'r') do |file|
  Post.connection.raw_connection.copy_data %{copy post_imports from stdin with csv delimiter ',' quote '"'} do
    while line = file.gets do
      Post.connection.raw_connection.put_copy_data line
    end
  end
end
Post.connection.execute <<-SQL
  insert into posts(title, published_at, likes_count, comments_count, external_post_id)
  select title, published_at, likes_count, comments_count, external_post_id
  from post_imports
  on conflict(external_post_id) do
  update set
    title = EXCLUDED.title,
    published_at = EXCLUDED.published_at,
    likes_count = EXCLUDED.likes_count,
    comments_count = EXCLUDED.comments_count
  returning id
SQL
In the above example we are creating a temporary table, post_imports, and copying all the data into it. Next, we take advantage of the INSERT INTO ... SELECT ... and ON CONFLICT syntax to move the data from the temporary table to the posts table.
The code is longer and less pretty than the one using the activerecord-import gem. It’s also 10x faster!
user system total real
0.150000 0.020000 0.170000 ( 0.643315)
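One caveat worth noting: unlike the row-by-row approaches, COPY aborts the entire batch on a single malformed line. A small pre-check before streaming can guard against that; this is a sketch using Ruby's standard CSV library, not part of the flow above, and the expected field count of 5 matches our example file:

```ruby
require 'csv'

# Yield only lines that parse as exactly the expected number of CSV
# fields; a malformed line would otherwise abort the whole COPY.
def each_valid_line(io, expected_fields: 5)
  io.each_line do |line|
    fields = CSV.parse_line(line) rescue nil
    yield line if fields && fields.length == expected_fields
  end
end
```

Lines it yields can then be passed straight to put_copy_data in place of the bare file.gets loop.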
Final thoughts
It’s not surprising that the method closest to the database turns out to be the quickest.
10x faster than activerecord-import (and 248x faster than the simple approach) is very impressive, but if we look at the raw execution times we might find that ~7s is perfectly acceptable in most cases. Because of that, I would recommend going with the activerecord-import gem, keeping in mind that we can do much better if we really need to.