A Time Series Storage for Coordinates
- go geo
TL;DR; Knowing your data helps compress them better than common algorithms.
Problem
For one of my side projects, an IoT database, I wanted a specialized time series to store timestamps coupled with coordinates.
I needed a simple solution which allows live and cold compressed storage with gaps in it: IoT devices can be off for days then reappear.
But couldn’t find anything fitting my needs, the Gorilla Paper from Facebook is really nice but expects a 4 hours maximum gap between time events.
Solution
Time series for geospacial data have particular properties:
- events are occurring often regularly
- object is not moving at all or
- object is often moving at a constant speed
It means we could store deltas between events to reduce the space needed.
Storage
- One
uint32
is needed to encode an uncompressed time using Unix ts - Two
float32
are needed to encode an uncompressed latitude/longitude
(stored as uint32: float32 * 10e5, 0.00001 degree represents ~1 meter)
By storing the delta between two events, we need less space:1201986030
then 10 seconds later 1201986040
, the delta is 10
, which can be stored in an uint8
, or in an uint16
for a bigger delta value but still reducing the original needed uint32
.
The next event will probably occur 10 seconds later at 1201986050
, by storing deltas of deltas (so we need a signed integer), we can store 0
(zero is nice since it takes zero space) or the delta +/- !
Many time series, Prometheus included, are using this trick.
Same applies to the coordinates, a moving car or a not moving car, can be compressed by this technic.
Because we may have large gap between times or between coordinates, larger than a uint16
, we may have to store a fully uncompressed value.
We still have to store something for every events indicating if it’s a zero, uint8
, uint16
delta or a full value.
Delta0 DeltaEncoding = 0b00
Delta8 DeltaEncoding = 0b01
Delta16 DeltaEncoding = 0b10
Full32 DeltaEncoding = 0b11
Using two bits we can tell the timestamp encoding, same goes for latitude and longitude, reading it back with &
mask.
func (d DeltaEncoding) TSDelta() DeltaEncoding {
return d & 0b000011
}
func (d DeltaEncoding) LatDelta() DeltaEncoding {
return d & 0b001100 >> 2
}
func (d DeltaEncoding) LngDelta() DeltaEncoding {
return d & 0b110000 >> 4
}
Compression
Let’s recap with an example:
# | timestamp | lat | lng | encoded |
---|---|---|---|---|
1 | 1201986030 | 48.22222 | 2.33333 | 47a4d9ee004994ce00038f75 |
2 | 1201986040 | 48.22222 | 2.33333 | 010a |
3 | 1201986050 | 48.22222 | 2.33333 | 00 |
4 | 1201986060 | 48.22221 | 2.33334 | 14ff01 |
5 | 1201986070 | 48.22220 | 2.33335 | 00 |
6 | 1201986080 | 48.22219 | 2.33336 | 00 |
From #1 to #3, you can see a not moving device, with a constant timestamp will end consume only 1 byte.
From #4 to #6, a moving device with a constant speed will take 1 byte per coordinate, then end consume and share the exact same 1 control byte.
By using these simple solutions on real data, it’s compressing better than common compressors, being very easy on CPU consumption:
Let’s compare the compressed size to other algorithms:
Size: 64776 Snappy 61007 LZ4 57225 TSD 23948
Size: 18864 Snappy 18870 LZ4 18267 TSD 9883
Size: 15768 Snappy 12721 LZ4 11860 TSD 4775
Conclusion
There is nothing revolutionary here and it can be improved a lot (6 bits used on the control byte, XORing values, RLE…) but it demonstrates a simple specialized piece of software could perform better than someone else compression library.
The code for tsd is up on Github !
Hiring
Looking for a remote full time employee or contractor, contact me.