Really fast CSV imports with Rails and PostgreSQL
Let’s say that we have to import data from some social network. We will be provided with CSV file generated every 5 minutes containing list of posts that were added/liked/commented since the beginning of the day together with likes’ and comments’ counts for each post.
As you can imagine the file will grow in size during the day; also we are looking at not only inserting new posts but also updating counts for the ones we already imported.
- sample example.csv file containing 50,000 rows with the following format:
- PostgreSQL table with unique index on
- ActiveRecord model
To better emulate requirements all rows from the CSV file with odd
external_post_id will be
already present in the database.
To establish baseline let’s start with very simple version that’s using first_or_create provided by ActiveRecord out of the box:
The result is expected but still pretty disappointing:
Stepping up our game
We all know there are better ways to perform bulk imports with ActiveRecord and one of them
is the activerecord-import gem. It
PostgreSQL 9.5 on conflict
which makes is perfect for out use-case.
According to the documentation the fastest method is to use raw columns and arrays of values and that is exactly what we are going to do:
The result is amazing - we are almost 23x faster then before!
Can we be even faster?
The fastest way to put data into PostgreSQL is to use the COPY command, but it comes with some limitations. One especially important in our context is the lack of upsert support.
Until it will be available we can achieve similar functionality using TEMP TABLES.
In the above example we are creating temporary table
post_imports and copying all the data in there.
Next we are taking advantage of
INSERT INTO(...) SELECT ... and
ON CONFLICT syntax
to move the data from temporary table to the
The code is longer and less pretty then the one using activerecord-import gem. It’s also 10x faster!
It’s not surprising that the method closest to the database turns out to be the quickest.
10x faster than activerecord-import (and 248x faster than simple approach) is very impressive but if we look at the raw execution times we might
6.7s is perfectly acceptable in most cases. Because of that I would recommend
going with activerecord-import gem but keep in mind that we can do much better if we really need to.