Using TABLE API to extract large volumes of (changing) data

Import · Mar 09, 2020 · article

Quite a lot of customers use the Table API via REST to extract information out of ServiceNow, for example to consolidate it in local data warehouses. While the API itself is quite optimized and can handle larger data volumes pretty fast, we still limit the number of records returned per call to an amount that is manageable for the instance.

Questions like 'Is it Practical to Query around 250,000 Records in a Single REST Transaction?' show that getting more than the per-call limit is sometimes required. I wanted to document a way this can be achieved, keeping in mind that the data you query might be updated while you extract it.

Chunking of large data sets

For the sake of this article, assume we have a set of data and focus on sys_id as the unique identifier plus an updated timestamp. To keep it simple, we also assume that all records have been updated in the last 24 hours, so we can focus on the hours only. The data could look something like this:

sys_id   sys_updated_on (ASC)
aaaa     1:00
bbbb     1:30
cccc     2:00
dddd     2:30
eeee     3:00
ffff     3:30
gggg     4:00
hhhh     4:30
iiii     5:00
kkkk     22:00

Based on the principle of chunking data, we could use several strategies to split the volume. An obvious one is by sys_id, but that would require some form of query which returns all sys_ids at once, so that we know the whole data set. If new sys_ids are generated in between, we only get them in the next run - the next day. Also, the long list of sys_ids might already be too much for one call. So let's try another approach and use the timestamp.

Using sysparm_limit we can define how many records we want returned. Let's take 4 for this example, order by sys_updated_on, and run the query. We get these records:

sys_id   sys_updated_on (ASC)
aaaa     1:00
bbbb     1:30
cccc     2:00
dddd     2:30

If we now take the highest sys_updated_on we see in the data set (2:30) and use it as the starting point of our next query, the next call gets us these records:

sys_id   sys_updated_on (ASC)
dddd     2:30
eeee     3:00
ffff     3:30
gggg     4:00

We can loop this until we get no further records - this allows us to cycle through large volumes easily.
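A minimal sketch of this loop in Python, using only the standard library. The instance URL, table name, and field list are assumptions for illustration, and authentication is omitted; `extract_all` takes the fetch function as a parameter, so the paging logic can be tried without a live instance:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical instance and table -- adjust for your environment.
BASE_URL = "https://myinstance.service-now.com/api/now/table/incident"
CHUNK_SIZE = 4  # use a few thousand in practice


def fetch_chunk(last_seen, limit=CHUNK_SIZE):
    """One Table API call: records updated at or after last_seen, oldest first."""
    params = urllib.parse.urlencode({
        "sysparm_query": "sys_updated_on>=" + last_seen + "^ORDERBYsys_updated_on",
        "sysparm_limit": limit,
        "sysparm_fields": "sys_id,sys_updated_on,short_description",
    })
    req = urllib.request.Request(BASE_URL + "?" + params,
                                 headers={"Accept": "application/json"})
    # authentication (basic auth header, OAuth token, ...) omitted for brevity
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["result"]


def extract_all(fetch=fetch_chunk, start="1970-01-01 00:00:00"):
    """Cycle through the table chunk by chunk, coalescing duplicates by sys_id."""
    records = {}
    last_seen = start
    while True:
        chunk = fetch(last_seen)
        if not chunk:
            break  # stop condition: empty record set
        for r in chunk:
            records[r["sys_id"]] = r  # later call wins: updates overwrite
        newest = chunk[-1]["sys_updated_on"]
        # the >= query deliberately re-reads the boundary second, so we do
        # not lose records that share a timestamp across a chunk split; as a
        # consequence the last chunk repeats itself, which we detect here
        if newest == last_seen and len(chunk) < CHUNK_SIZE:
            break
        last_seen = newest
    return records
```

Note that because the query uses `>=`, the final call re-returns the records of the boundary second rather than an empty set, so we also stop once a partial chunk no longer advances the timestamp. This relies on the chunk size being larger than the number of records updated in any single second.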

Now what happens when data changes in between?

Let's assume record 'eeee' gets modified while we do our thing. As the record gets a new sys_updated_on timestamp, it simply moves to the end of our list and will be returned again once we reach that chunk of data. So the only thing we need to worry about is to overwrite - or update - the information we got for 'eeee' in the previous call. All our other calls still work fine, as none of the other records move around. If we used a simple limit and offset, all records would shift and we might actually miss some updates.

So, what can we summarize as learnings for this?

  • The remembered sys_updated_on needs to be rounded down ('floor') to the nearest second
      • This is needed because more than one record can fall into the same second; unlucky chunk splits could then lose records
  • Duplicate records will be returned because of this; use sys_id to coalesce the information
  • Do not use an offset, as this will lead to missing records
  • The chunk size must be bigger than the maximum number of records changed per second
  • The stop condition for a run is an empty record set
  • Each execution provides only a snapshot (mostly daily); multiple updates in between cannot be captured
  • Quite a lot of tables have an index on sys_updated_on, but you should still check it for your bespoke scenario. If there is no index, this will heavily impact query performance
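The offset pitfall and the coalescing rule from this list can be demonstrated with a small in-memory simulation (purely illustrative, no API calls): record 'b' is modified between two calls; offset paging then silently skips a record, while timestamp paging re-reads the boundary and simply overwrites:

```python
import copy

def by_time(rows):
    # stand-in for ORDERBYsys_updated_on
    return sorted(rows, key=lambda r: (r["updated"], r["id"]))

original = [{"id": c, "updated": t} for c, t in zip("abcdefgh", range(1, 9))]

# --- limit/offset paging: records shift under us ---
rows = copy.deepcopy(original)
page1 = by_time(rows)[0:4]                      # a, b, c, d
rows[1]["updated"] = 9                          # 'b' is modified mid-extract
page2 = by_time(rows)[4:8]                      # f, g, h, b
seen_offset = {r["id"] for r in page1 + page2}  # 'e' slid into page 1 -> lost

# --- timestamp paging: continue from the highest timestamp seen ---
rows = copy.deepcopy(original)
seen_ts = {}
call1 = by_time(rows)[:4]                       # a, b, c, d
for r in call1:
    seen_ts[r["id"]] = r
rows[1]["updated"] = 9                          # same modification
call2 = [r for r in by_time(rows) if r["updated"] >= call1[-1]["updated"]][:4]
for r in call2:                                 # d, e, f, g -- 'd' re-read
    seen_ts[r["id"]] = r
call3 = [r for r in by_time(rows) if r["updated"] >= call2[-1]["updated"]][:4]
for r in call3:                                 # g, h, b -- 'b' overwritten
    seen_ts[r["id"]] = r
```

At the end, `seen_offset` is missing 'e' entirely, while `seen_ts` holds all eight records, including the new state of 'b'.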

Scaling it up - how to run multiple threads?

Now assume we really do have to read millions of records in every cycle. Doing this with the method sketched above could still take a significant amount of time. Can we have multiple threads running in parallel?

Yes, with good planning this can work. Here is an example of how to split the workload across three threads:

Time window   0:00-2:00   2:00-4:00   4:00-6:00   6:00-8:00
Worker        Thread 1    Thread 2    Thread 3    Thread 1
Calls         1.1, 1.2    2.1, 2.2    3.1, 3.2    1.3, 1.4

As you can see, we simply split the time span we want to cover into smaller windows. In this example I used 2 hours per thread. Again, this works great, as all records getting modified would just be added to the end.

As we cannot fully guarantee when the last batch is being read, and some records might move into that window, we should define our windows such that the upper limit is always lower than anything which gets modified while we work.

I.e. limit your whole time window to something like last night 24:00. This way everything which gets modified will move to the next cycle and not just into the last thread.
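A sketch of this window planning, assuming three worker threads and yesterday's full day as the overall window. The window boundaries become extra conditions in the encoded query that bounds each thread's chunk loop; `extract_window` in the comment is a hypothetical wrapper around that loop:

```python
from datetime import datetime

def split_windows(start, end, n):
    """Split [start, end) into n equal sub-windows, one per worker thread."""
    step = (end - start) / n
    return [(start + i * step, start + (i + 1) * step) for i in range(n)]

def window_query(win_start, win_end):
    """Encoded query limiting the chunk loop to one window."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return ("sys_updated_on>=" + win_start.strftime(fmt)
            + "^sys_updated_on<" + win_end.strftime(fmt)
            + "^ORDERBYsys_updated_on")

# whole span capped at last midnight, so in-flight modifications fall
# into the next cycle instead of into the last thread's window
windows = split_windows(datetime(2020, 3, 8), datetime(2020, 3, 9), 3)

# each thread then runs the chunk loop for its own window, e.g.:
# from concurrent.futures import ThreadPoolExecutor
# with ThreadPoolExecutor(max_workers=3) as pool:
#     results = pool.map(extract_window, windows)  # extract_window: hypothetical
```

Because the windows share boundaries, the same boundary-second rule applies between threads as between chunks: the `>=` lower bound re-reads records on the edge, and coalescing by sys_id removes the duplicates when the results are merged.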

View original source

https://www.servicenow.com/community/now-platform-articles/using-table-api-to-extract-large-volumes-of-changing-data/ta-p/2309107