A Time Series Storage for Coordinates

- go geo

TL;DR; Knowing your data helps compress them better than common algorithms.

Problem

For one of my side projects, an IoT database, I wanted a specialized time series to store timestamps coupled with coordinates.
I needed a simple solution which allows live and cold compressed storage with gaps in it: IoT devices can be off for days then reappear.
But couldn’t find anything fitting my needs, the Gorilla Paper from Facebook is really nice but expects a 4 hours maximum gap between time events.

Solution

Time series for geospacial data have particular properties:

It means we could store deltas between events to reduce the space needed.

Storage

By storing the delta between two events, we need less space:
1201986030 then 10 seconds later 1201986040, the delta is 10, which can be stored in an uint8, or in an uint16 for a bigger delta value but still reducing the original needed uint32.

The next event will probably occur 10 seconds later at 1201986050, by storing deltas of deltas (so we need a signed integer), we can store 0 (zero is nice since it takes zero space) or the delta +/- !
Many time series, Prometheus included, are using this trick.

Same applies to the coordinates, a moving car or a not moving car, can be compressed by this technic.

Because we may have large gap between times or between coordinates, larger than a uint16, we may have to store a fully uncompressed value.

We still have to store something for every events indicating if it’s a zero, uint8, uint16 delta or a full value.

	Delta0  DeltaEncoding = 0b00
	Delta8  DeltaEncoding = 0b01
	Delta16 DeltaEncoding = 0b10
	Full32  DeltaEncoding = 0b11

Using two bits we can tell the timestamp encoding, same goes for latitude and longitude, reading it back with & mask.

func (d DeltaEncoding) TSDelta() DeltaEncoding {
	return d & 0b000011
}

func (d DeltaEncoding) LatDelta() DeltaEncoding {
	return d & 0b001100 >> 2
}

func (d DeltaEncoding) LngDelta() DeltaEncoding {
	return d & 0b110000 >> 4
}

Compression

Let’s recap with an example:

#timestamplatlngencoded
1120198603048.222222.3333347a4d9ee004994ce00038f75
2120198604048.222222.33333010a
3120198605048.222222.3333300
4120198606048.222212.3333414ff01
5120198607048.222202.3333500
6120198608048.222192.3333600

From #1 to #3, you can see a not moving device, with a constant timestamp will end consume only 1 byte.

From #4 to #6, a moving device with a constant speed will take 1 byte per coordinate, then end consume and share the exact same 1 control byte.

By using these simple solutions on real data, it’s compressing better than common compressors, being very easy on CPU consumption:

Let’s compare the compressed size to other algorithms:

Size: 64776     Snappy 61007    LZ4 57225       TSD 23948
Size: 18864     Snappy 18870    LZ4 18267       TSD 9883
Size: 15768     Snappy 12721    LZ4 11860       TSD 4775

Conclusion

There is nothing revolutionary here and it can be improved a lot (6 bits used on the control byte, XORing values, RLE…) but it demonstrates a simple specialized piece of software could perform better than someone else compression library.

The code for tsd is up on Github !

Hiring

Looking for a remote full time employee or contractor, contact me.