10 Apr 2016, 23:30

Big Geo Data, Spatial Indexes and Parallelization

As geospatial datasets continue to grow, the need to query them efficiently and extract meaning from them has never been greater. I have talked a little previously about why geospatial problems are inherently complex to solve (predominantly down to the multidimensional nature of the data). To demonstrate quickly, some years ago it would have been pretty normal to want to know which of 1,000 points sit inside which of 100 polygons (post boxes in administrative boundaries, for example). Computationally that’s not too bad: if we went through sequentially and checked everything we’d be at 100,000 operations, which might sound like a lot to us but isn’t much for a computer. However, what if we were to take some large machine-generated dataset (e.g. water height sensors) or a mass social media dataset (e.g. Twitter, Instagram) instead? In some circumstances we are looking at tens if not hundreds of thousands of points. The sequential (brute-force) approach is perhaps no longer so appealing.

Thankfully this is a well-known problem, and the geospatial academic and industrial communities have come up with (amongst other things) spatial indexes. A spatial index organises data into a structure that can be traversed quickly to reach the underlying geographic data efficiently. Common spatial indexes include Quadtrees, R-trees, Geohashes and space filling curves (e.g. Z-order and Hilbert curves), some of which you may have heard of. In Big O terms (a concept I touched on in my previous post) the reduction in complexity can be significant.

For example, finding an object in a Quadtree is on average an O(log n) operation (Finkel and Bentley, 1974), compared with O(n) for sequential searching; as we traverse down through the quads to find an object, the number of levels we visit grows with log4(n) rather than with n. Quadtrees are a popular spatial indexing method and one of the first spatial data structures (de Berg, 2008), and are used a fair amount for range (bounding box) searching in various fields, including GIS. The approach with a Quadtree is to recursively divide the geometric space into four quadrants, subdividing until each quadrant holds no more than some small number of points. This is demonstrated here:

Taking the Quadtree example further, let’s discuss how this data structure helps us perform a spatial query. Say we have a large set of points (post boxes around the UK, for example) and we draw some arbitrary bounding box: which points are contained within that box? We start at the top of the Quadtree, checking which quadrant(s) intersect the bounding box, and work down the tree until we reach the points in the leaf quadrants. We then check whether each of those points actually resides within the box. This speeds up the search because we only descend into quadrants that intersect the bounding box, drastically reducing the number of points we need to check for containment compared with checking every point sequentially. For a more practical example of how indexing speeds up querying, check out Nicholas Duggan’s post entitled ‘Why Spatial Indexing is Really Important’, which covers the speedups indexing provides in PostGIS.
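To make the two steps above concrete (recursive subdivision, then intersect-and-descend searching), here is a minimal point Quadtree sketch in Go. The type and function names (Quadtree, Insert, Query and so on) are my own invention, not from any particular library:

```go
// A minimal point Quadtree: each node covers a rectangle and splits
// into four children once it holds more than `capacity` points.
package main

import "fmt"

type Point struct{ X, Y float64 }

// Rect is an axis-aligned bounding box.
type Rect struct{ MinX, MinY, MaxX, MaxY float64 }

func (r Rect) Contains(p Point) bool {
	return p.X >= r.MinX && p.X <= r.MaxX && p.Y >= r.MinY && p.Y <= r.MaxY
}

func (r Rect) Intersects(o Rect) bool {
	return r.MinX <= o.MaxX && r.MaxX >= o.MinX && r.MinY <= o.MaxY && r.MaxY >= o.MinY
}

type Quadtree struct {
	bounds   Rect
	points   []Point
	children []*Quadtree // nil until the node splits
	capacity int
}

func New(bounds Rect, capacity int) *Quadtree {
	return &Quadtree{bounds: bounds, capacity: capacity}
}

func (q *Quadtree) Insert(p Point) bool {
	if !q.bounds.Contains(p) {
		return false
	}
	if q.children == nil && len(q.points) < q.capacity {
		q.points = append(q.points, p)
		return true
	}
	if q.children == nil {
		q.split()
	}
	for _, c := range q.children {
		if c.Insert(p) {
			return true
		}
	}
	return false
}

// split divides the node into four quadrants and pushes its points down.
func (q *Quadtree) split() {
	mx := (q.bounds.MinX + q.bounds.MaxX) / 2
	my := (q.bounds.MinY + q.bounds.MaxY) / 2
	q.children = []*Quadtree{
		New(Rect{q.bounds.MinX, q.bounds.MinY, mx, my}, q.capacity),
		New(Rect{mx, q.bounds.MinY, q.bounds.MaxX, my}, q.capacity),
		New(Rect{q.bounds.MinX, my, mx, q.bounds.MaxY}, q.capacity),
		New(Rect{mx, my, q.bounds.MaxX, q.bounds.MaxY}, q.capacity),
	}
	for _, p := range q.points {
		for _, c := range q.children {
			if c.Insert(p) {
				break
			}
		}
	}
	q.points = nil
}

// Query collects points inside the search box, descending only into
// quadrants whose bounds intersect it.
func (q *Quadtree) Query(box Rect, out []Point) []Point {
	if !q.bounds.Intersects(box) {
		return out
	}
	for _, p := range q.points {
		if box.Contains(p) {
			out = append(out, p)
		}
	}
	for _, c := range q.children {
		out = c.Query(box, out)
	}
	return out
}

func main() {
	qt := New(Rect{0, 0, 100, 100}, 4)
	for _, p := range []Point{{10, 10}, {20, 30}, {55, 60}, {80, 85}, {90, 10}} {
		qt.Insert(p)
	}
	fmt.Println(len(qt.Query(Rect{0, 0, 50, 50}, nil))) // 2
}
```

A production implementation would also need deletions, duplicate handling and a maximum depth, but the core idea really is just these two recursive walks.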

Let’s take the previous Quadtree problem of bounding box searching and look at a practical JavaScript implementation, in this case an interactive example from Mike Bostock. As you can see, the app is pretty performant and helps demonstrate visually the benefit of a Quadtree data structure for spatial data.

Another innovative approach to indexing is to use what is termed a ‘space filling curve’. A space filling curve (SFC) lets you represent 2D or 3D space as a single one-dimensional line that ‘curves’ (more like snakes!) around the space. This is done by passing a continuous line through an (imaginary) finite-resolution grid laid over the geometric space. Generally the line does this in a fractal-like pattern; that is to say, it looks the same as you zoom in and out of the curve. Sounds crazy, right? Well, doing this allows us (albeit imperfectly) to maintain locality in a one-dimensional data structure. By locality we mean that points close together in our 1D data structure (such as an array) are relatively close together in 2D or 3D space. Inevitably this is never exact, and there will be times where x,y coordinates are close but their positions in the one-dimensional array are far apart (Mishra, 2011). Some examples of space filling curves include the Hilbert curve, the Peano curve and the Z-order curve. Here’s an example of a Hilbert curve:

If you imagine, we could label each block sequentially (1, 2, 3, 4 and so on, although normally in binary) and stretch the curve out to get a one-dimensional data structure. Now let’s look at how we could use a space filling curve to our advantage for a range query. We can take our one-dimensional data structure and index it using something like a B+tree. Having done that, we can take all the curve points (blocks) that sit underneath a given query’s bounding box, say blocks 4-9, look them up in the B+tree index, follow the pointers to the underlying objects, and finally verify each one actually sits within the bounding box (Dai, 2011). Perhaps not the easiest concept in the world to explain, but hopefully the following example from Jared Winick will help you understand visually how the search works. It’s important to know Winick’s example uses a Z-order curve, not a Hilbert curve as depicted above (the Z-order curve is more… z shaped).
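The block labelling for a Z-order curve can be computed directly by interleaving the bits of a cell’s column and row numbers (the result is usually called a Morton code). Here’s a short sketch of that in Go; `spread` and `Morton` are my own names for the standard bit-twiddling trick:

```go
// Z-order (Morton) encoding: interleave the bits of a grid cell's
// x and y coordinates to get its position along the curve.
package main

import "fmt"

// spread inserts a zero bit between each of the low 16 bits of v.
func spread(v uint32) uint32 {
	v &= 0xFFFF
	v = (v | v<<8) & 0x00FF00FF
	v = (v | v<<4) & 0x0F0F0F0F
	v = (v | v<<2) & 0x33333333
	v = (v | v<<1) & 0x55555555
	return v
}

// Morton interleaves x and y into a single curve position:
// x bits land in the even positions, y bits in the odd ones.
func Morton(x, y uint32) uint32 {
	return spread(x) | spread(y)<<1
}

func main() {
	// The four cells of one quadrant get a contiguous run of codes,
	// which is why a bounding box becomes (mostly) a range scan.
	fmt.Println(Morton(2, 2), Morton(3, 2), Morton(2, 3), Morton(3, 3)) // 12 13 14 15
}
```

Notice the four neighbouring cells come out as the contiguous codes 12-15: a bounding box over that quadrant is a single range lookup in the B+tree, which is exactly the property the range query above exploits.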

There are a fair few open source implementations of space filling curves for dealing with large spatial datasets: one from Google called Uzaygezen (not maintained, to my knowledge) and another called SFCurve from LocationTech (currently experimental). The Z-order curve can be used in BigTable-like systems such as HBase, Cassandra and Accumulo (on which GeoMesa is built) to turn a bounding box query into range scans over a Z-order curve index.

If you are struggling with the Space Filling Curve concept, I thoroughly recommend this (somewhat ancient!) video regarding the subject:

Lastly, what of parallelization? That is to say, can we distribute processing for intensive geospatial queries across multiple machines or processor cores? We identified in the previous blog post that the reason geo queries are so hard to parallelize is the intrinsic need to maintain locality whilst splitting the data. We can’t just split the data up in any old fashion and have it processed, because the location of the data is an integral part of the processing/querying:

"The items of ordinary commercial databases are generally isolated and lack explicit relationships; while for spatial data there are many associations, such as topological relationships, directional relationships, and metrical relationships, which should be taken into full account when the spatial data is partitioned" - Meng et al. (2007)

Luckily, space filling curves and other spatial indexing methods allow us to split data up in ways that maintain locality, preserving the relationships that are necessary in topological and geometric operations. In more academic terms:

"The big spatial data are partitioned into blocks according to the geographic space and block size threshold 1, and these blocks are uniformly distributed on cluster nodes. Then the geographically adjacent spatial objects are stored sequentially in terms of space filling curve which could preserve the geographic proximity of spatial objects." - Zhang et al (n.d.)

Indeed, this illustrates that spatial indexes not only greatly speed up spatial querying, but also allow us to partition data sensibly across nodes for operations such as MapReduce!
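As a toy illustration of the idea in that quote (not the authors’ actual method), sorting by a space filling curve lets you cut a point set into locality-preserving blocks for distribution across nodes. All names here are my own:

```go
// Locality-preserving partitioning: sort grid cells by their Z-order
// (Morton) code, then cut the sorted list into equal-sized blocks.
// Each block then covers a compact region of space, so neighbouring
// cells are not scattered across the cluster.
package main

import (
	"fmt"
	"sort"
)

type Cell struct{ X, Y uint32 }

// spread inserts a zero bit between each of the low 16 bits of v.
func spread(v uint32) uint32 {
	v &= 0xFFFF
	v = (v | v<<8) & 0x00FF00FF
	v = (v | v<<4) & 0x0F0F0F0F
	v = (v | v<<2) & 0x33333333
	v = (v | v<<1) & 0x55555555
	return v
}

func morton(c Cell) uint32 { return spread(c.X) | spread(c.Y)<<1 }

// Partition sorts cells along the Z-order curve and splits them into
// n contiguous blocks.
func Partition(cells []Cell, n int) [][]Cell {
	sort.Slice(cells, func(i, j int) bool { return morton(cells[i]) < morton(cells[j]) })
	size := (len(cells) + n - 1) / n
	var blocks [][]Cell
	for i := 0; i < len(cells); i += size {
		end := i + size
		if end > len(cells) {
			end = len(cells)
		}
		blocks = append(blocks, cells[i:end])
	}
	return blocks
}

func main() {
	cells := []Cell{{0, 0}, {7, 7}, {1, 0}, {6, 7}, {0, 1}, {7, 6}, {1, 1}, {6, 6}}
	// First block holds the cells near the origin, the second those
	// near (7,7), despite the shuffled input order.
	for _, b := range Partition(cells, 2) {
		fmt.Println(b)
	}
}
```

Real systems (e.g. the block-size thresholds in the quote above) also balance partitions by data volume rather than pure count, but the sort-then-chunk step is the essence of it.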

I know that’s been a lot to take in, but hopefully it’s helped illustrate something I’ve been trying to get my head around recently. I’d be really interested to hear your thoughts, comments and even corrections (I always want to learn more!).

14 Jan 2016, 19:16

On Well Known Text and the Web

Over the past week I have made a start on revisiting the problem of transferring spatial data out of databases into front end applications. This issue stemmed from a project I worked on at UCL called Lacuna. Lacuna was a 3D, WebGL-based geographic information system that I built for my master’s dissertation. I have since deprecated Lacuna because, in all honesty, I dived straight in at the deep end: at the time I had only a superficial knowledge of JavaScript and PHP, and the project accrued a lot of technical debt as I tried to finish my dissertation in a reasonable time frame.

With the knowledge I have accrued since then, I have decided to have another go at tackling CRUD operations on spatial databases via web services. One decision I made early on was that the backend would need to be more performant; I knew PHP 5 wasn’t going to cut it this time around. I weighed up the pros and cons of various options such as Node and PHP 7, but decided that Go was a solid choice. For the database, however, I would keep using PostGIS, due to its and PostgreSQL’s relative maturity.

Go is an interesting language, and I’m equal parts enjoying it and bashing my head on the desk playing with it for this latest project. The backend of Lacuna was predominantly responsible for getting the 3D geometries stored in PostGIS and bringing them to the front end to be rendered by Three.js.

This shouldn’t be an overly complex task, but this time around I realised that doing it properly is actually quite a significant pain. The major reason for this is Well Known Text. For those of you who aren’t familiar, Well Known Text (WKT) is a text markup language for geometries. Specifically, it is the text-based, human-readable counterpart of Well Known Binary (WKB), the standard for storing geometries in spatial databases. Let me show you an example of what Well Known Text looks like:

POLYHEDRALSURFACE Z (
  ((0 0 0, 0 1 0, 1 1 0, 1 0 0, 0 0 0)),
  ((0 0 0, 0 1 0, 0 1 1, 0 0 1, 0 0 0)),
  ((0 0 0, 1 0 0, 1 0 1, 0 0 1, 0 0 0)),
  ((1 1 1, 1 0 1, 0 0 1, 0 1 1, 1 1 1)),
  ((1 1 1, 1 0 1, 1 0 0, 1 1 0, 1 1 1)),
  ((1 1 1, 1 1 0, 0 1 0, 0 1 1, 1 1 1))
)

You can find the full specification here. As it stands there are 18 different geometry types and 4 variations on each (2D, Z, M, ZM). My job as the developer is to convert these Well Known Text fragments into something I can use client-side in a 3D geometry viewer. On the surface this seems a plausible task; someone must have hit this problem before, so there’s got to be a library out there that does this, right? It turns out there are a few, and they’re great, but inevitably they have their limitations. One recurring issue is that 3D support varies (they lack Z or ZM geometries, PolyhedralSurfaces, TIN Z, or some combination). Generally this is okay if you are doing things with the Leaflet, OpenLayers, Google or Esri APIs and working in a 2D environment. However, as the GI industry progresses we are incrementally seeing more 3D geographic information, with demand increasing for both data and systems that can handle 3D.

In reality, most web systems will most likely require that the data at some point makes its way into JSON for consumption by the ubiquity that is JavaScript. As an example, Terraformer (an open source library from Esri) tackles this by converting to GeoJSON. Again, this is great stuff, but unfortunately GeoJSON is (as it stands) a 2D specification.

OK, you say, stop being lazy: write your own parser and come up with your own JSON schema. So I am, and I’m about 40% of the way there, but I felt compelled to point out some issues I see with WKT. One of the major ones is the way the text itself is delimited. WKT doesn’t appear (to my knowledge) to leverage any obvious standard for structuring data (i.e. JSON, XML or YAML), so you more or less have to write a parser from scratch. Take the POLYHEDRALSURFACE Z example above: the X Y Z ordinates within a coordinate are delimited by spaces, while coordinates within the same ring, rings within the same part of the geometry, and separate parts of the geometry are all delimited by commas. This makes converting it into something more usable, like JSON, surprisingly tough. What characters should you split on? Is that even a good methodology? How can I ensure the integrity of the regex if I go down that path? These are of course rhetorical questions, but I hope you get my point.
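For what it’s worth, here is roughly the kind of string splitting this boils down to, reduced to a single ring. It is a deliberately fragile sketch (no nesting, no geometry tags, no EMPTY), and `ParseRing` and `Coord` are names I’ve made up:

```go
// Split one WKT ring into x y z tuples: spaces delimit ordinates,
// commas delimit coordinates. Illustrative only; real WKT needs a
// proper tokenizer that tracks nested parentheses.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type Coord struct{ X, Y, Z float64 }

// ParseRing takes something like "(0 0 0, 0 1 0, 1 1 0)" and returns
// the coordinates it contains.
func ParseRing(ring string) ([]Coord, error) {
	ring = strings.Trim(strings.TrimSpace(ring), "()")
	var out []Coord
	for _, pt := range strings.Split(ring, ",") {
		fields := strings.Fields(pt)
		if len(fields) != 3 {
			return nil, fmt.Errorf("expected x y z, got %q", pt)
		}
		var vals [3]float64
		for i, f := range fields {
			v, err := strconv.ParseFloat(f, 64)
			if err != nil {
				return nil, err
			}
			vals[i] = v
		}
		out = append(out, Coord{vals[0], vals[1], vals[2]})
	}
	return out, nil
}

func main() {
	coords, err := ParseRing("(0 0 0, 0 1 0, 1 1 0, 1 0 0, 0 0 0)")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(coords)) // 5
}
```

Even in this trivial case you can see the problem: the split characters carry all the structure, so one stray comma or paren and the whole parse silently goes wrong.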

The approach I took in the end uses a combination of string manipulation and regex, but it feels fragile (read: hacky), and by all means may not be optimal. Perhaps a better approach would be to break the text down by its parentheses, but it’s hard to tell which ones are superficial and which are integral to the structure. Unfortunately, the specification is quite dense for developers to digest in comparison to something like GeoJSON (Mapbox have poked a little fun at this in the wellknown GitHub README).

This poses another problem, though: now I’ve parsed the WKT into JSON, how should the JSON schema look? Maybe like GeoJSON? There’s no standard, so any choice feels somewhat arbitrary. The real point of this post is to open up a dialogue on how to handle WKT going forward. It is a standard that has been adopted by spatial databases, and as such is not going anywhere fast. Storing 3D data in databases makes sense, but in honesty WKT is a tough thing to transform into something meaningful for client-side apps (unlike JSON, XML or YAML). Should we be working towards better support for 3D in projects like wellknown from Mapbox, Terraformer from Esri and GeoScript? And should we be looking at ways to produce a standard for 3D geo data on the web, e.g. by extending GeoJSON? I am really keen to hear what approaches people have taken to handling WKT in development.

As an afterthought, how about something like this for modelling WKT as JSON?

    {
        dimensions: "Z",
        geometries: [
            [[ {x: 0, y: 0, z: 0}, {x: 0, y: 1, z: 0}, {x: 1, y: 1, z: 0}, {x: 1, y: 0, z: 0}, {x: 0, y: 0, z: 0}]],
            [[ {x: 0, y: 0, z: 0}, {x: 0, y: 1, z: 0}, {x: 0, y: 1, z: 1}, {x: 0, y: 0, z: 1}, {x: 0, y: 0, z: 0}]],
            [[ {x: 0, y: 0, z: 0}, {x: 1, y: 0, z: 0}, {x: 1, y: 0, z: 1}, {x: 0, y: 0, z: 1}, {x: 0, y: 0, z: 0}]],
            [[ {x: 1, y: 1, z: 1}, {x: 1, y: 0, z: 1}, {x: 0, y: 0, z: 1}, {x: 0, y: 1, z: 1}, {x: 1, y: 1, z: 1}]],
            [[ {x: 1, y: 1, z: 1}, {x: 1, y: 0, z: 1}, {x: 1, y: 0, z: 0}, {x: 1, y: 1, z: 0}, {x: 1, y: 1, z: 1}]],
            [[ {x: 1, y: 1, z: 1}, {x: 1, y: 1, z: 0}, {x: 0, y: 1, z: 0}, {x: 0, y: 1, z: 1}, {x: 1, y: 1, z: 1}]]
        ]
    }

I know what you’re thinking: this is longer than the original! That’s true, but after gzipping and transfer over the wire the difference will be negligible, because the keys (x, y, z) are a repeating pattern that compresses extremely well.