The Grouparoo Blog
In Brian’s post, Building a Sync Engine, he talks about the value of using a High Water Mark to keep track of the latest bit of data you’ve imported. This approach is often a better pattern than using
Offset, especially when the underlying data might be changing. In this post, I’m gong to dive even deeper into this topic, and suggest that you should be storing you High Water Marks as strings whenever possible.
Consider the following query:
SELECT * FROM USERS WHERE UPDATED_AT >= '2020-08-27 12:00:00' ORDER BY updated_at ASC LIMIT 10
Here, we are asking for the next 10 users who have been updated since noon on August 27th. This query is a good implementation of using a High Water Mark to remember the
updated_at timestamp of the last User we saw and to get the next batch. In this example, the previous value of our High Water Mark was
There are a number of scenarios in which
2020-08-27 12:00:00 might actually not be the correct string representation of the High Water Mark. The types of bugs to watch out for fall into 2 main categories:
new Date().toISOString(), etc - all of which will generate different strings.
More insidiously, there are other issues hidden in the
stringification category - those revolving around Timezones and clock drift. When your code builds the
Date object from response from your database. Which timezone will it be using - the Timezone of your Database or the Timezone of your Application? Do you know if the database is returning timestamps in its timezone or a more global representation of time (ie:
Timestamp with Timezone in Postgres). Are the results the same in Staging vs Production... and does your ORM know the difference?
resolution class of problems is less dangerous than the
stringification problems, but it can result in duplicated reads and therefore slower imports. Consider these rows in Postgres:
We’ve got values of
2020-07-25 12:18:56.831 for
updated_at– that’s precision down to fractions of a second! However, that data is lost when the
pg package reads that row and casts it to a
new Date(). When we eventually build a string out of it to make our next query, we only get
2020-07-25 12:18:56 back. If you follow the advice in our previous post to always compare with equality (
=>) you won’t skip any rows, but you’ll read the same row back again each time.
So what’s the solution here? Knowing that we will need to convert our High Water Mark to and from a string type, we should ask the database to do the string conversion for us. This approach is called "casting" - converting data from one type format to another, and the Database is the best place to do it.
Casting the High Water Mark to a string at the database ensures:
- The string representation of the High Water Mark is in a format the database can accept.
- The string representation of the High Water Mark is in the timezone the database is already using/assuming.
- The string representation of the High Water Mark is represented with the maximum accuracy the database can use.
This turns our example query into the following:
# Postgres SELECT *, updated_at::text as __hwm FROM USERS WHERE UPDATED_AT >= '2020-08-27 12:00:00' ORDER BY updated_at ASC LIMIT 10 # MySQL SELECT *, CAST(updated_at as CHAR) as __hwm FROM USERS WHERE UPDATED_AT >= '2020-08-27 12:00:00' ORDER BY updated_at ASC LIMIT 10
We ask the database both for all the data about the rows we are selecting, and we ask for
updated_at to be cast as a string for us, returned as
__hwm. We can now use
__hwm directly in subsequent queries without any of the problems listed above.
Tagged in Connections Engineering Notes
See all of Evan Tahler's posts.
Evan is the CTO and co-founder of Grouparoo, an open source data framework that easily connects your data to business tools. Evan is an open-source innovator, and frequent speaker at software development conferences focusing on Product Management, Node.JS, Rails, and databases.
Learn more about Evan @ https://www.evantahler.com