Solving the “Integer out of range” mystery error on a Django and Postgres app

Geocene writes a lot of software to ingests Internet of Things sensor data into backend systems. These data arrive at the backend as raw payloads like JSON or compressed binary files before being parsed and inserted into a time series-focused database. A few weeks ago, our Papertrail Slack integration alerted us that, on one of our projects, none of our payloads were being processed anymore. Every Django Q task setup to parse each payload was failing with the following error: DataError: integer out of range.

The payloads have the following structure:

timestamp: timestamp
value: float
sensor_type: integer
channel: integer

At first, we thought that a bug in the mobile app software that packages and sends the payloads to the backed was resulting in a value of sensor_type or channel that were out of range? So, maybe it was just bad payloads? But this seemed suspect. Every payload was failing, regardless of user and mobile app version. Even good payloads that had successfully parsed in the past were failing when we tried to reprocess them.

Going around the ORM and trying to access the data a little deeper, even a simple SQL insert into our Postgres table resulted in the same error! This was getting weird.

INSERT INTO metric_staging ("timestamp", "value", "channel", "sensor_type_id") 
VALUES  ("2016-06-22 19:10:25-07", 25.1, 2, 5)

Could it be the primary key?

To figure out what was going on, we dug into the design of our metric_staging table. The purpose of this metric_staging table is to serve as a temporary storage for metrics as they move down the data processing pipeline. It’s a helper table that we use to validate the payload before inserting it in the final target. The process works like this

  1. Make a new save point within the current transaction
  2. Insert the parsed payload into metric_stagging
  3. Roll back the previous command, and restore the transaction state to what it was at the time of the save point

If Step 2 passes successfully, that means the parsed payload is clean and we’re good to move it to the main time series metric table. All this to say, metric_staging is usually very small, and is never larger than one payload (~1k-1M rows).

After reviewing the design of the metric_staging table, we remembered that we had made the primary key of the table a sequence. This sequence is an automatically-incremented integer that increases with each subsequent row inserted into metric_staging.

PostgreSQL provides three serial pseudo-types SMALLSERIAL, SERIAL, and BIGSERIAL with the characteristics shown in the table below. When we designed the app, we had given metric_staging.id the SERIAL type, which tops out at 232, or about 2 billion, unique values.

Because our metric_staging table was always so small at any given point in time (remember, this table is just a temporary stopover for metrics in the data processing pipeline), we had not realized that the table’s primary key had reached its limit! Any table that has its primary key as a sequence just keeps incrementing its value even if you roll back a transaction and even if you delete all the rows. You’ll end up with an empty table, but the sequence remains unchanged. In our case, after years of data validation, the primary key exceeded the limit of 4 bytes.

Postgres Type Storage Size Range
SMALLSERIAL 2 bytes 1 to 32,767
SERIAL 4 bytes 1 to 2,147,483,647
BIGSERIAL 8 bytes 1 to 9,223,372,036,854,775,807

The Solution

The Right Way™ to fix this issue is to alter the primary key from SERIAL to BIGSERIAL. Our SERIAL key got us through about 4 years of the app’s life. So, at current ingress rates, the BIGSERIAL key should buy us another 17 billion years. By that point, the Sun will have swallowed the Earth, and our app will probably be obsolete.

ALTER SEQUENCE api_metricstaging_id_seq AS BIGINT;
ALTER TABLE api_metricstaging_id_seq ALTER id TYPE BIGINT;

This fix would solve the issue, but this bug was hot and needed a fix right away. We don’t like to apply schema changes to our tables without significant testing, and time was a wasting. So, to buy some time, we wondered if we could just reset the cursor and set the sequence value back to zero. It turned out this was an easy and safe option for our situation. ALTER SEQUENCE changes the parameters of an existing sequence generator, and this sets the counter to zero.

ALTER SEQUENCE api_metricstaging_id_seq RESTART WITH 0

Once we applied this ALTER SEQUENCE, the data pipeline started processing new payloads again. We then had the breathing room to make the schema changes, and queue up all of the failed payloads to be reprocessed.

Conclusion

Now the fix seems so simple. But, as is typical, figuring out what is wrong is so much harder than fixing the problem itself. The simple fix relied on having a solid development environment, debugging tools, and test data to go step-by-step from the first entry point (an API endpoint in this example) to the last exit point, which was a SQL command.

As a team, Geocene takes pride in responding quickly to these kinds of situations. You cannot avoid bugs, but you can lay the groundwork of good design and documentation to make sure that problems, even mysterious bugs, can be solved as quickly as possible.