Side effects of using .first_or_create! at scale

In one of my projects I’m processing large amounts of XML files containing various information about sport events. I have to record all the files that made modifications for the given sport event. Since file can be sometimes re-uploaded I’ve decided to use convenient Rails helper:

SportEventFiles.where(sport_event_id: 686954, file: 'file.xml').first_or_create!

It worked like a charm - quick and easy solution to my problem.

After a while I noticed increase in processing times for a single file. Short investigation lead to the first_or_create! line of code:

# 0.040000   0.000000   0.040000 (  1.555691)
Benchmark.measure do 
  SportEventFile.where(sport_event_id: 686954, file: 'file.xml').first_or_create!
end  

WOW - it took 1.5s to execute this simple function!

Next step was to run explain analyze on the underlying query:

EXPLAIN ANALYZE 
  SELECT 
    "sport_event_files".* 
  FROM 
    "sport_event_files" 
  WHERE 
    "sport_event_files"."file" = 'file.xml' AND 
    "sport_event_files"."sport_event_id" = 686954 
  ORDER BY "sport_event_files"."id";

-- query plan
Index Scan using sport_event_files_pkey on sport_event_files  (cost=0.43..279322.55 rows=1 width=63) (actual time=1226.469..1226.515 rows=1 loops=1)
 Filter: (((file)::text = 'file.xml'::text) AND (sport_event_id = 686954))
 Rows Removed by Filter: 5492868
Total runtime: 1226.542 ms
(4 rows)

Because of the ORDER BY "sport_event_files"."id" at the end Postgresql choose to use index on primary key for the sorting and filter by conditions. This essentially resulted in doing seq scan of the table with almost 5.5 million rows (always a bad idea ;)

Here is what happened when I removed the ORDER BY part.

EXPLAIN ANALYZE 
  SELECT 
    "sport_event_files".* 
  FROM 
    "sport_event_files" 
  WHERE 
    "sport_event_files"."file" = 'file.xml' AND 
    "sport_event_files"."sport_event_id" = 686954;

-- query plan
Index Scan using index_sport_event_files_on_file_and_sport_event_id on sport_event_files  (cost=0.56..8.58 rows=1 width=63) (actual time=0.060..0.061 rows=1 loops=1)
  Index Cond: (((file)::text = 'file.xml'::text) AND (sport_event_id = 686954))
Total runtime: 0.088 ms
(3 rows)

I ended up with index-only scan and query time measured in fractions of millisecond!

I re-wrote my code using .exists? and everything was great again!

Adrian Serafin

Adrian Serafin
I'm experienced Ruby developer. I love PostgreSQL, Redis and optimizing performance. I strongly believe in using right tool for the job and short feedback loops. How can I help you?

group_concat as an alternative for array_agg

For quite some time I have been hardcore [PostgreSQL](https://www.postgresql.org) user. I like to leveragedatabase capabilities when poss...… Continue reading

React router on Amazon S3

Published on March 23, 2016