08 Mar 2016, 11:23

A geo database for polygons, optimization

If you read this blog, you know I’ve recently released a project called regionagogo, a geo shape lookup database, described in this blogpost.

It uses the current Go S2 implementation, which is not yet as complete as the C++ implementation, for example the region coverer of a shape does not really compute cell around the shape but around the bounding box instead.

Using the shape of the polygon makes the covered cells more precise and smaller, resulting at the end to less PIP tests which are costly.

S2 over a shape
The same coverage with Go S2 would have returned 8 big cells of the same size, covering over regions.

I’ve created regionagogogen a quick and dirty command line program for OSX that takes a GeoJSON file containing your regions and then compute the database using the S2 C++ implementation, it’s for OSX only cause I’m using ObjC as a bridge to C++ which I don’t know enough.

Also note that regionagogogen includes an S2 port that works on iOS/OSX and some ObjC geo helper classes.

The Docker image fore regionagogo is also using the optimized database.

18 Feb 2016, 17:16

A geo database for polygons, foundations

On a previous post, I’ve described how to use the S2 geo library to create a fast geo database, but it was to store locations (points) and only to perform range queries, a complete geo database would have regions/polygons queries.

Looking for a solution

I had this need: querying for the countries or subregions of hundreds of coordinates per second, without relying on an external service.

One solution, using my previous technique, could have been to store every cities in the world and then perform a proximity query around my point to get the closest cities, but it works only in populated area and it’s only an approximation.

I looked into others solutions, there is some smart ideas using UTF-grid, but it’s a client side solution and also an approximation tied to the resolution of the computed grid.

S2 to the rescue

S2 cells have some nice properties, they are segments on the Hilbert curve, expressed as range of uint64, so I had the intuition the problem to perform fast region lookup could be simplified as find all mathematical segments containing my location expressed as an uint64.

S2 over a country

Using a Segment Tree datastructure, I first tried an in memory engine, using Natural Earth Data, loading the whole world countries shapes into S2 loops (a Loop represents a simple spherical polygon), transforming then into cells using the region coverer, it returns cells of different levels, add them to the segment tree.

Segment Tree

To query, simply tranform the location into an S2 Cell (level 30) and perform a stubbing query that intersects the segments, every segments crossed are cells that covered a part of a Loop.
It will reduce the problem to test a few Loop vs thousands of them, finally perform ContainsPoint against the found loops cause the point could be inside the Cell but not inside the Loop itself.

Et voilà! It works!

The segment tree structure itself is very low on memory, the loops/polygons data could be stored on disk and loaded on requests, I’ve tested a second implementation using LevelDB using this technique.

If you have a very large tree (for example cities limits for the whole world), you can even put the segment tree on a KV storage, using this paper Interval Indexing and Querying on Key-Value Cloud Stores.

Region a gogo

As a demonstration here is a working microservice called regionagogo, simply returning the country & state for a given location.
It loads geo data for the whole world and answers to HTTP queries using small amout of memory.

GET /country?lat=19.542915&lng=-155.665857

{
    "code": "US",
    "name": "Hawaii"
}

Here is a Docker image so you can deploy it on your stack.

Note that it performs really well but can be improved a lot, for example the actual Go S2 implementation is still using Rect boxing around loops, that’s why regionagogo is using a data file so it can be generated from the C++ version.

Future

This technique seems to work well for stubbing queries, region queries, geofencing …
It can be a solid foundation to create a flexible and simple geo database.

26 Jan 2016, 14:01

A fast geo database with Google S2 take #2

Six months ago, I wrote on this blog about Geohashes and LevelDB with Go, to create a fast geo database.
This post is very similar as it works the same way but replacing GeoHashes with Google S2 library for better performances.

There is an S2 Golang implementation maintened by Google not as complete as the C++ one but close.

For the storage this post will stay agnostic to avoid any troll, but it applies to any Key Value storages: LevelDB/RocksDB, LMDB, Redis…
I personnaly use BoltDB and gtreap for my experimentations.

This post will focus on Go usage but can be applied to any languages.

Or skip to the images below for visual explanations.

Why not Geohash?

Geohash is a great solution to perform geo coordinates queries but the way it works can sometimes be an issue with your data.

  • Remember geohashes are cells of 12 different widths from 5000km to 3.7cm, when you perform a lookup around a position, if your position is close to a cell’s edge you could miss some points from the adjacent cell, that’s why you have to query for the 8 neightbour cells, it means 9 range queries into your DB to find all the closest points from your location.

  • If your lookup does not fit in level 4 39km by 19.5km, the next level is 156km by 156km!

  • The query is not performed around your coordinates, you search for the cell you are in then you query for the adjacent cells at the same level/radius, based on your needs, it means it works very approximately and you can only perform ‘circles’ lookup around the cell you are in.

  • The most precise geohash needs 12 bytes storage.

  • -90 +90 and +180 -180, -0 +0 are not sides by sides prefixes.

Why S2?

S2 cells have a level ranging from 30 ~0.7cm² to 0 ~85,000,000km².
S2 cells are encoded on an uint64, easy to store.

The main advantage is the region coverer algorithm, give it a region and the maximum number of cells you want, S2 will return some cells at different levels that cover the region you asked for, remember one cell corresponds to a range lookup you’ll have to perform in your database.

The coverage is more accurate it means less read from the DB, less objects unmarshalling…

Real world study

We want to query for objects inside Paris city limits using a rectangle:

h5
Using level 5 we can’t fit the left part of the city.
We could add 3 cells (12 total DB queries ) on the left but most algorithms will zoom out to level 4.

h4
But now we are querying for the whole region.

s2
Using s2 asking for 9 cells using a rectangle around the city limits.

s2 vs h4
The zones queried by Geohash in pink and S2 in green.

Example S2 storage

Let’s say we want to store every cities in the world and perform a lookup to find the closest cities around, first we need to compute the CellId for each cities.

// Compute the CellID for lat, lng
c := s2.CellIDFromLatLng(s2.LatLngFromDegrees(lat, lng))

// store the uint64 value of c to its bigendian binary form
key := make([]byte, 8)
binary.BigEndian.PutUint64(key, uint64(c))

Big endian is needed to order bytes lexicographically, so we can seek later from one cell to the next closest cell on the Hilbert curve.

c is a CellID to the level 30.

Now we can store key as the key and a value (a string or msgpack/protobuf) for our city, in the database.

Example S2 lookup

For the lookup we use the opposite procedure, first looking for one CellID.

// citiesInCellID looks for cities inside c
func citiesInCellID(c s2.CellID) {
  // compute min & max limits for c
  bmin := make([]byte, 8)
  bmax := make([]byte, 8)
  binary.BigEndian.PutUint64(bmin, uint64(c.RangeMin()))
  binary.BigEndian.PutUint64(bmax, uint64(c.RangeMax()))

  // perform a range lookup in the DB from bmin key to bmax key, cur is our DB cursor
  var cell s2.CellID
  for k, v := cur.Seek(bmin); k != nil && bytes.Compare(k, bmax) <= 0; k, v = cur.Next() {
    buf := bytes.NewReader(k)
    binary.Read(buf, binary.BigEndian, &cell)

    // Read back a city
    ll := cell.LatLng()
    lat := float64(ll.Lat.Degrees())
    lng := float64(ll.Lng.Degrees())
    name = string(v)
    fmt.Println(lat, lng, name)
  }
}

Then compute the CellIDs for the region we want to cover.

rect := s2.RectFromLatLng(s2.LatLngFromDegrees(48.99, 1.852))
rect = rect.AddPoint(s2.LatLngFromDegrees(48.68, 2.75))

rc := &s2.RegionCoverer{MaxLevel: 20, MaxCells: 8}
r := s2.Region(rect.CapBound())
covering := rc.Covering(r)

for _, c := range covering {
    citiesInCellID(c)
}

RegionCoverer will return at most 8 cells (in this case 7 cells: 4 8, 1 7, 1 9, 1 10) that is guaranteed to cover the given region, it means we may have to exclude cities that were NOT in our rect, with func (Rect) ContainsLatLng(LatLng).

Congrats we have a working geo db.

S2 can do more with complex shapes like Polygons and includes a lot of tools to compute distances, areas, intersections between shapes…

Here is a Github repo for the data & scripts generated for the images

30 Aug 2015, 18:25

A blazing fast geo database with LevelDB, Go and Geohashes

You probably have heard of LevelDB it’s a blazing fast key value store (as a library not a daemon), that uses Snappy compression.
There is plenty of usages for it, the API is very simple at least in Go (I will be using Goleveldb).

The key is a []byte the value is a []byte so you can “get”, “put” & “delete” that’s it.

I needed a low memory low cpu system that could collect millions of geo data and query over them, Geohash has an interesting property you can encode longitude and latitude into a string : f2m616nn this hash represents the lat & long 46.770, -71.304 f2m616nn, if you shorten the string to f2m61 it still refers to the same lat & long but with less precisions f2m61.
A 4 digits hash leads to 19545 meters precision, to perfom a lookup around a position you simply query for the 8 adjacent blocks. A Geohash library for Go.

Here you would store all of your data points matching a geohash to the same set.
Problem there is no such thing as a set in LevelDB.

But there is a cursor so you can seek to a position then iterate over the next or previous one (byte ordered).
So your data could be stored that way: 4 digits geohash + a uniq id.

Then you can perform proximity lookup by searching for the 8 adjacents hashes from the position your are looking with a precision of 20km, good but not very flexible.

We can have a more generic solution, first we need a key a simple int64 uniq id.

// NewKey generates a new key using time prefixed by 'K'
func NewKey() Key {
	return NewKeyWithInt(time.Now().UnixNano())
}

// NewKeyWithInt returns a key prefixed by 'K' with value i
func NewKeyWithInt(id int64) Key {
	key := bytes.NewBufferString("K")
	binary.Write(key, binary.BigEndian, id)
	return key.Bytes()
}

Here we can encode a key with a Unix timestamp so our key is not just a key it’s also an encoded time value, it will be uniq thanks to the nanosecond precision. We are using BigEndian so it can be byte compared: older will be before newer after.

Now about geo encoding our key will be of the form:
G201508282105dhv766K��Ϸ�Y� (note the end of the key is binary encoded) You always need a prefix for your keys so you can seek and browse them without running over different keys types, here I have a G as Geo, then a string encoded date prefix, so we can search by date, but we don’t want extra precision here, it would add extra seek to LevelDB, (that’s why we have a modulo of 10 for minutes) then we add a precise geohash and finally our previous uniq id.

// NewGeoKey generates a new key using a position & a key
func NewGeoKey(latitude, longitude float64) GeoKey {
	t := time.Now().UTC()
	kint := t.UnixNano()
	kid := NewKeyWithInt(kint)
	// G + string date + geohash 6 + timestamped key 
	// G201508282105dhv766K....
	gk := geohash.EncodeWithPrecision(latitude, longitude, 6)
	ts := t.Format("2006010215")

	// modulo 10 to store 10mn interval
	m := t.Minute() - t.Minute()%10
	zkey := []byte("G" + ts + fmt.Sprintf("%02d", m) + gk)
	zkey = append(zkey, kid...)
	return zkey
}

We can now lookup by flexible date & by flexible proximity like a Redis ZRANGE, you simply need to reverse the process.

// GeoKeyPrefix return prefixes to lookup using a GeoKey and timerange
func GeoKeyPrefix(start, stop time.Time) []string {
	var res []string
	d := 10 * time.Minute
	var t time.Time
	t = start
	for {
		if t.After(stop) {
			break
		}

		key := "G" + t.Format("2006010215") + fmt.Sprintf("%02d", t.Minute()-t.Minute()%10)
		res = append(res, key)
		t = t.Add(d)
	}
	return res
}

Lookup that way:

	d := time.Duration(-10) * time.Minute
	geoPrefixs := GeoKeyPrefix(time.Now().UTC().Add(d), time.Now().UTC())

	// find adjacent hashes in m
	// 1, 5003530
	// 2, 625441
	// 3, 123264
	// 4, 19545
	// 5, 3803
	// 6, 610
	gk := geohash.EncodeWithPrecision(lat, long, 4)
	adjs := geohash.CalculateAllAdjacent(gk)
	adjs = append(adjs, gk)

	// for each adjacent blocks
	for _, gkl := range adjs {

		// for each time range modulo 10
		for _, geoPrefix := range geoPrefixs {
			startGeoKey := []byte(geoPrefix + gkl)
			iter := s.NewIterator(util.BytesPrefix(startGeoKey), nil)

			for iter.Next() {
				log.Println(iter.Value())
			}
			iter.Release()
		}
	}

It can be optimized, reducing the size of the keys, but it performs extremely well storing around 3 millions geopoints per day, using less than 3% cpu and can received hundreds of queries per second.

Oh did I forget to mention it’s running on a Raspberry Pi? :)

I could maybe turn it into a library but it’s so simple it’s probably useless.
Next blog post: what are those millions points used for?