A Time Series Storage for Coordinates

Dec 11, 2019 - go geo

TL;DR; Knowing your data helps compress them better than common algorithms.

Problem

For one of my side projects, an IoT database, I wanted a specialized time series to store timestamps coupled with coordinates.
I needed a simple solution which allows live and cold compressed storage with gaps in it: IoT devices can be off for days then reappear.
But couldn’t find anything fitting my needs, the Gorilla Paper from Facebook is really nice but expects a 4 hours maximum gap between time events.

Solution

Time series for geospacial data have particular properties:

events are occurring often regularly
object is not moving at all or
object is often moving at a constant speed

It means we could store deltas between events to reduce the space needed.

Storage

One uint32 is needed to encode an uncompressed time using Unix ts
Two float32 are needed to encode an uncompressed latitude/longitude
(stored as uint32: float32 * 10e5, 0.00001 degree represents ~1 meter)

By storing the delta between two events, we need less space:
1201986030 then 10 seconds later 1201986040, the delta is 10, which can be stored in an uint8, or in an uint16 for a bigger delta value but still reducing the original needed uint32.

The next event will probably occur 10 seconds later at 1201986050, by storing deltas of deltas (so we need a signed integer), we can store 0 (zero is nice since it takes zero space) or the delta +/- !
Many time series, Prometheus included, are using this trick.

Same applies to the coordinates, a moving car or a not moving car, can be compressed by this technic.

Because we may have large gap between times or between coordinates, larger than a uint16, we may have to store a fully uncompressed value.

We still have to store something for every events indicating if it’s a zero, uint8, uint16 delta or a full value.

	Delta0  DeltaEncoding = 0b00
	Delta8  DeltaEncoding = 0b01
	Delta16 DeltaEncoding = 0b10
	Full32  DeltaEncoding = 0b11

Using two bits we can tell the timestamp encoding, same goes for latitude and longitude, reading it back with & mask.

func (d DeltaEncoding) TSDelta() DeltaEncoding {
	return d & 0b000011
}

func (d DeltaEncoding) LatDelta() DeltaEncoding {
	return d & 0b001100 >> 2
}

func (d DeltaEncoding) LngDelta() DeltaEncoding {
	return d & 0b110000 >> 4
}

Compression

Let’s recap with an example:

#	timestamp	lat	lng	encoded
1	`1201986030`	`48.22222`	`2.33333`	`47a4d9ee004994ce00038f75`
2	`1201986040`	`48.22222`	`2.33333`	`010a`
3	`1201986050`	`48.22222`	`2.33333`	`00`
4	`1201986060`	`48.22221`	`2.33334`	`14ff01`
5	`1201986070`	`48.22220`	`2.33335`	`00`
6	`1201986080`	`48.22219`	`2.33336`	`00`

From #1 to #3, you can see a not moving device, with a constant timestamp will end consume only 1 byte.

From #4 to #6, a moving device with a constant speed will take 1 byte per coordinate, then end consume and share the exact same 1 control byte.

By using these simple solutions on real data, it’s compressing better than common compressors, being very easy on CPU consumption:

Let’s compare the compressed size to other algorithms:

Size: 64776     Snappy 61007    LZ4 57225       TSD 23948
Size: 18864     Snappy 18870    LZ4 18267       TSD 9883
Size: 15768     Snappy 12721    LZ4 11860       TSD 4775

Conclusion

There is nothing revolutionary here and it can be improved a lot (6 bits used on the control byte, XORing values, RLE…) but it demonstrates a simple specialized piece of software could perform better than someone else compression library.

The code for tsd is up on Github !

Hiring

Looking for a remote full time employee or contractor, contact me.