23 Dec 2016, 16:56

Visualising Facebook's Population Data on the Web

Over the past month or so (on and off!) I’ve had the opportunity to explore two really interesting ideas. The first related to a talk I saw at FOSS4G in Bonn this year by Fabian Schindler. Fabian’s talk was on how to use GeoTIFF data on the web using a pair of libraries called geotiff.js and plotty.js. You can see a post about his talk here. The talk got me thinking about what sort of GeoTIFF data could be pulled into the web.

Around the same time, a news article caught my eye about Facebook having teamed up with a research institute to create a global dataset of population densities. I later managed to find the data on the Center for International Earth Science Information Network (CIESIN) website here. From there I explored ways of making use of the datasets on the web, as per the talk I’d seen at FOSS4G.

The Problems

Upon examining the raw data I noticed some immediate problems with my plan:

  • The data is relatively large for the web - a single country could be upwards of 100 MB
  • The data has multiple bands that may not all be useful to end users
  • WebGL isn’t great with large textures

To take on all these problems it was necessary to preprocess the data into a more manageable and usable format.

Preprocessing the Data

Solving these problems was a relatively tricky endeavour involving a multistep process:

  • Extract a single band
  • Downsample the raster (reduce its resolution)
  • Split the raster into manageable chunks
  • Let the browser know, somehow, where those chunks are located

For all of these steps I made use of Python and GDAL. The band extraction was a fairly straightforward process, but the downsampling and splitting were somewhat more complicated.
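
As a rough sketch of the idea (the file names and the 10% downsampling figure here are just placeholders, not necessarily what the repo uses), both steps can be done in a single call with the GDAL Python bindings:

from osgeo import gdal

# Open the source population raster (placeholder file name)
src = gdal.Open("population_density.tif")

# Keep only the first band and downsample to 10% of the original resolution
gdal.Translate(
    "population_band1_downsampled.tif",
    src,
    bandList=[1],   # the single band we care about
    widthPct=10,    # 10% of the original width
    heightPct=10    # 10% of the original height
)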

You can see the full solutions to the downsampling and band extraction problems in the GitHub repo. Splitting the data was probably the hardest problem to solve, as I struggled to find any examples across the web of it being done without shelling out to the command line from within Python (something I wanted to avoid).

In order to split the data correctly it was necessary to subdivide the raster into a grid of a given size. For this to work we needed the top-left and bottom-right coordinates of every grid cell. After some thought on this mental puzzle, I deduced that you can create an arbitrarily sized (n by n) grid of such coordinates using the following function:

def create_tiles(minx, miny, maxx, maxy, n):
    """Create an n by n grid of tile bounds covering the given extent."""
    width = maxx - minx
    height = maxy - miny

    matrix = []

    # Walk the rows from top to bottom and the columns from left to right
    for j in range(n, 0, -1):
        for i in range(0, n):

            # Upper-left corner of this tile
            ulx = minx + (width / n) * i
            uly = miny + (height / n) * j

            # Lower-right corner of this tile
            lrx = minx + (width / n) * (i + 1)
            lry = miny + (height / n) * (j - 1)

            matrix.append([[ulx, uly], [lrx, lry]])

    return matrix
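
As a hedged example of how the grid can then be used (the file names and the 3 x 3 grid here are illustrative rather than what the repo necessarily does), each tile's corner coordinates can be passed to GDAL's projWin option to cut out that chunk:

from osgeo import gdal

src = gdal.Open("population_band1_downsampled.tif")

# Derive the raster's bounding box from its geotransform
gt = src.GetGeoTransform()
minx, maxy = gt[0], gt[3]
maxx = minx + gt[1] * src.RasterXSize
miny = maxy + gt[5] * src.RasterYSize  # gt[5] is negative for a north-up raster

# Cut the raster into a 3 x 3 grid of GeoTIFF tiles
for n, (ul, lr) in enumerate(create_tiles(minx, miny, maxx, maxy, 3)):
    gdal.Translate(
        "tile_{}.tif".format(n),
        src,
        projWin=[ul[0], ul[1], lr[0], lr[1]]  # ulx, uly, lrx, lry
    )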

Splitting the raster into tiles allows us to send it in chunks whilst avoiding a tile server or any kind of dynamic backend. I created a JSON file containing metadata for all of the resulting files, allowing us to determine their centroids and file locations before requesting them.
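
The exact format isn't reproduced here, but continuing the sketch above, a hypothetical version of that metadata file could be written out during preprocessing like so (the field names are made up for illustration):

import json

metadata = []
for n, (ul, lr) in enumerate(create_tiles(minx, miny, maxx, maxy, 3)):
    metadata.append({
        "file": "tile_{}.tif".format(n),
        # centroid of the tile, used to place it on the map
        "centroid": [(ul[0] + lr[0]) / 2, (ul[1] + lr[1]) / 2]
    })

with open("tiles.json", "w") as f:
    json.dump(metadata, f)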

Displaying the Data

Displaying the data on the frontend took a little trial and error. I used a combination of OpenLayers 3, plotty.js and geotiff.js to accomplish the end result: geotiff.js reads the GeoTIFF data, and plotty.js renders it to a canvas element that OpenLayers 3 can then place correctly on the map.

To request and handle the asynchronous loading of the data I used the Fetch API and Promises (I’ve included polyfills for both in the demo). Once all the promises have resolved, all of the TIFFs are loaded into memory. From here a select dropdown allows us to change the colours used to present the data.

The end result looks a little something like this:

Pros and Cons of this Method

Pros

  • We got to a point where we can display the data on the web
  • The data can be restyled dynamically client-side
  • No dynamic backend or file server required, just static files after preprocessing

Cons

  • Tiles are actually less size efficient than one big file, but are necessary to get the data to display
  • The downsampling factors have to be quite large to get it to be a reasonable file size
  • Even tiled, file sizes are quite large (e.g. 9 tiles at 2 MB a file == 18 MB, which is a lot for the web)

One interesting aspect of having the data client-side, as opposed to a raw image, is that you could go as far as doing basic visual analytics on the client. For example, you could display only the highest or lowest values in a given GeoTIFF.

Have a Go

The repo is located at: https://www.github.com/JamesMilnerUK/facebook-data-viz

There are instructions in the README for usage.

10 Apr 2016, 23:30

Big Geo Data, Spatial Indexes and Parallelization

As geospatial datasets continue to increase in size, the ability to query and gain meaning from them efficiently has never been more important. I have talked a little previously about why geospatial problems are inherently complex to solve (predominantly down to the multidimensional nature of the data). To demonstrate this quickly, though: some years ago it would have been pretty normal to want to know which of 1,000 points sit inside which of 100 polygons (post boxes within administrative boundaries, for example). Computationally that’s not too bad; if we went through sequentially and checked everything we’d be at 100,000 operations, which although it might sound like a lot to us, isn’t too bad for a computer. However, what if we were to take some large machine-generated dataset (e.g. water height sensors) or a mass social media dataset (e.g. Twitter, Instagram etc.) instead? We are looking at tens if not hundreds of thousands of points in some circumstances. The sequential (brute force) approach is perhaps no longer so appealing.

Thankfully this is a well known problem, and the geospatial academic and industrial community has come up with (amongst other things) spatial indexes. Spatial indexes organise data into structures that can quickly be traversed to get access to the underlying geographic data in an efficient fashion. Common indexes for spatial data include things such as Quadtrees, R-trees, Geohashes and Space Filling Curves (e.g. Z order, Hilbert Curves), some of which you may have heard of. In Big O terms (a concept I touched on in my previous post) the reduction in complexity can be significant.

For example, finding an object in a Quadtree is on average an O(log n) operation (Finkel and Bentley, 1974), compared with O(n) for a sequential search. That is to say, as we traverse down the Quadtree through each of the quads to find the object, it takes on the order of log₄(n) steps. Quadtrees are a popular spatial indexing method and one of the first spatial data structures (de Berg, 2008), and are used a fair amount for range (bounding box) searching in various fields including GIS. The approach with a Quadtree is to recursively divide the geometric space into four quadrants, and this subdivision continues until each quad holds no more than a small number of points (often just one). This is demonstrated here:

Taking the Quadtree example further, let’s discuss how this data structure helps us perform a spatial query. Say we have a large set of points and we draw some arbitrary bounding box: which points are contained within that box (it could be post boxes around the UK, for example)? We start at the top of the Quadtree, checking which quadrant(s) intersect the bounding box, and work down the tree until we reach the points held in the child quadrants. We then check whether each of those points actually resides within the box specified. This speeds up the search, as we only need to descend into quadrants that intersect the bounding box, drastically reducing the number of points we have to check for containment compared with checking every point sequentially. For a more practical example of how indexing speeds up queries, check out Nicholas Duggan’s post entitled ‘Why Spatial Indexing is Really Important’, which covers the speedups that indexing provides in PostGIS.
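
To make the idea a little more concrete, here is a minimal point Quadtree sketch in Python. To be clear, this is not taken from any of the implementations mentioned in this post; it is just an illustration of recursive subdivision plus a bounding box query that only descends into quadrants intersecting the query box:

class Quadtree:
    """A minimal point Quadtree: each node holds at most `capacity` points
    before splitting into four child quadrants."""

    def __init__(self, minx, miny, maxx, maxy, capacity=1):
        self.bounds = (minx, miny, maxx, maxy)
        self.capacity = capacity
        self.points = []
        self.children = None

    def insert(self, x, y):
        minx, miny, maxx, maxy = self.bounds
        if not (minx <= x <= maxx and miny <= y <= maxy):
            return False  # the point lies outside this quad entirely
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((x, y))
                return True
            self._subdivide()
        # hand the point to whichever child quadrant contains it
        return any(child.insert(x, y) for child in self.children)

    def _subdivide(self):
        minx, miny, maxx, maxy = self.bounds
        mx, my = (minx + maxx) / 2, (miny + maxy) / 2
        self.children = [
            Quadtree(minx, miny, mx, my, self.capacity),
            Quadtree(mx, miny, maxx, my, self.capacity),
            Quadtree(minx, my, mx, maxy, self.capacity),
            Quadtree(mx, my, maxx, maxy, self.capacity),
        ]
        # push the existing points down into the new children
        for (x, y) in self.points:
            any(child.insert(x, y) for child in self.children)
        self.points = []

    def query(self, qminx, qminy, qmaxx, qmaxy, found=None):
        """Return all points inside the query box, visiting only quadrants
        that intersect it."""
        if found is None:
            found = []
        minx, miny, maxx, maxy = self.bounds
        if qminx > maxx or qmaxx < minx or qminy > maxy or qmaxy < miny:
            return found  # this quadrant cannot contain any matches
        for (x, y) in self.points:
            if qminx <= x <= qmaxx and qminy <= y <= qmaxy:
                found.append((x, y))
        if self.children:
            for child in self.children:
                child.query(qminx, qminy, qmaxx, qmaxy, found)
        return found

# Usage: build the tree once, then run many cheap bounding box queries
# qt = Quadtree(0, 0, 100, 100); qt.insert(10, 10); qt.query(0, 0, 20, 20)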

Let’s take the previous Quadtree problem statement around bounding box searching and look at a practical JavaScript implementation, in this case an interactive example from Mike Bostock. As you can see, the app is pretty performant and helps demonstrate visually the benefit of a Quadtree data structure for spatial data.

Another innovative approach to indexing is to use what is termed a ‘space filling curve’. A space filling curve (SFC) allows you to represent 2D or 3D space as a single one-dimensional line that ‘curves’ (more like snakes!) around the space. This requires passing a continuous line through an (imaginary) finite-resolution grid laid over the geometric space. Generally the line does this in a fractal-like pattern; that is to say, it recursively looks the same as you zoom in and out of the curve. Sounds crazy, right? Well, when this is performed it allows us (albeit not perfectly) to maintain locality in a one-dimensional data structure. By locality we mean that points close together in our 1D data structure (such as an array) are relatively close together in 2D or 3D space. However, this will never be exact, and there will be times where x,y coordinates are close but their positions in the one-dimensional array are far apart (Mishra, 2011). Some examples of space filling curves include the Hilbert curve, the Peano curve and the Z-order curve. Here’s an example of a Hilbert curve:

If you imagine, we could label each block sequentially (1, 2, 3, 4 etc., although normally in binary) and stretch the curve out to get a one-dimensional data structure. Now let’s take a look at how we could use a space filling curve to our advantage for a range query. We can take our one-dimensional data structure and index it using something like a B+tree. Having done that, we can take all the curve points (blocks) that sit underneath a given query’s bounding box. From here we take those curve points, say for example blocks 4-9, query the B+tree index, follow the pointer to the underlying object and finally verify that it actually sits within the bounding box (Dai, 2011). As you can see, perhaps not the easiest concept in the world to explain, but hopefully the following example from Jared Winick will help you understand visually how the search works. It’s important to note that Winick’s example uses a Z-order curve, not a Hilbert curve as depicted above (the Z-order curve is more… z shaped).
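
As a small illustration of how a cell’s position along a Z-order curve (often called its Morton code) is computed, the bits of the cell’s x and y grid coordinates are simply interleaved. This is a generic sketch rather than code from any of the libraries mentioned below:

def z_order(x, y, bits=16):
    """Interleave the bits of grid coordinates x and y to get the cell's
    position (Morton code) along a Z-order curve."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # even bits come from x
        code |= ((y >> i) & 1) << (2 * i + 1)  # odd bits come from y
    return code

# Neighbouring cells often end up with neighbouring codes...
print(z_order(2, 5), z_order(3, 5))  # 38 39
# ...but not always: (3, 3) and (4, 4) are diagonally adjacent yet jump from 15 to 48
print(z_order(3, 3), z_order(4, 4))  # 15 48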

There are a fair few open source implementations of space filling curves for dealing with large spatial datasets: one from Google called Uzaygezen (not maintained, to my knowledge) and another called SFCurve from LocationTech (currently experimental). The Z-order curve can be used in BigTable-like systems such as HBase, Cassandra and Accumulo (which Geomesa is built on) to turn a bounding box query into range scans over a Z-order curve index.

If you are struggling with the Space Filling Curve concept, I thoroughly recommend this (somewhat ancient!) video regarding the subject:

Lastly, what of parallelization? That is to say, can we distribute the processing of intensive geospatial queries across multiple machines or processing cores? We identified in the previous blog post that the reason geo queries are so hard to parallelize is the intrinsic necessity for locality to be maintained when splitting data. That is to say, we can’t just split the data up in any old fashion and have it processed, as the location of the data is an integral part of the processing/querying:

"The items of ordinary commercial databases are generally isolated and lack explicit relationships; while for spatial data there are many associations, such as topological relationships, directional relationships, and metrical relationships, which should be taken into full account when the spatial data is partitioned" - Meng et al. (2007)

Luckily, space filling curves and other spatial indexing methods allow us to split data up in ways that maintain locality, thereby preserving the relationships that are necessary for topological and geometric operations. In more academic terms:

"The big spatial data are partitioned into blocks according to the geographic space and block size threshold 1, and these blocks are uniformly distributed on cluster nodes. Then the geographically adjacent spatial objects are stored sequentially in terms of space filling curve which could preserve the geographic proximity of spatial objects." - Zhang et al (n.d.)

Indeed, here we can see that spatial indexes not only allow us to greatly speed up spatial querying, but also allow us to partition data across nodes for operations such as MapReduce!

I know that’s been a lot to take in, but hopefully it has helped illustrate something I’ve been trying to get my head around recently. I’d be really interested in hearing your thoughts, comments and even corrections (I always want to learn more!).

20 Mar 2016, 21:44

The Hard Thing About Hard Geo Things

As spatial datasets have become larger, aggregated faster and drawn from a wider variety of sources, handling them has become a real challenge. As I’m sure many of you have experienced, many GIS operations can take a fairly long time to complete, especially with traditional desktop GIS techniques. Last week at the pub after geomob, I was discussing with a friend who works on Google’s Maps team how they deal with such large datasets, and what techniques you can use to solve problems with them.

The conversation made me think about a couple of key questions about operating on geospatial data:

  • What sort of computational complexity are we looking at for fundamental GIS operations?
  • What problems can be parallelized or are suitable for approaches such as MapReduce?

For the purpose of this post we will use Big O notation to express the complexity of a problem. Big O notation expresses how complex a problem is in relation to the size of its inputs. This allows it to be neutral of other factors such as hardware, data size (i.e. MB, GB etc.), time units and so forth. As an example, you can see in the graph above that a linear problem (green, O(n)) requires far fewer operations to complete than a quadratic problem (light blue, O(n²)).

At its most basic level, a geocode could be seen as a database lookup: the user searches for a postcode, this is checked against a database and a corresponding latitude and longitude is returned. The problem is obviously slightly more complex than this, but such a geocoder would, without any optimisations, be an O(n) problem. An O(n) problem is a linear problem whose number of operations grows in step with the number of inputs given to it (i.e. 100 inputs results in 100 operations). It actually becomes an O(log n) problem if we index the postcode column.
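
As a toy sketch of that difference (the postcodes and coordinates below are made up and only approximate), compare a linear scan with a binary search over a sorted postcode ‘index’:

import bisect

# Toy data: (postcode, (lat, lon)) pairs - the values are illustrative only
records = sorted([
    ("AB10 1AA", (57.14, -2.11)),
    ("CF10 1AA", (51.48, -3.18)),
    ("SW1A 1AA", (51.50, -0.14)),
])
codes = [code for code, _ in records]

# O(n): scan every row until the postcode is found
def geocode_linear(postcode):
    for code, coords in records:
        if code == postcode:
            return coords

# O(log n): binary search against the sorted postcode "index"
def geocode_indexed(postcode):
    i = bisect.bisect_left(codes, postcode)
    if i < len(codes) and codes[i] == postcode:
        return records[i][1]

print(geocode_indexed("SW1A 1AA"))  # (51.5, -0.14)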

Another example of a linear GIS problem is a geographic transformation (scale, rotate, translate etc.), where we loop through all of the coordinates (the inputs) and apply some mathematical function to transform each coordinate’s position. Similarly, in a coordinate system reprojection we map a function over all coordinates to transform them into another coordinate system. As we only loop through the data and operate on each item once, the problem is again linear.
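
For instance, a simple rotate-and-translate over a list of coordinates is a single pass through the data (a generic sketch, not tied to any particular GIS library):

import math

def transform(coords, dx=0.0, dy=0.0, angle=0.0):
    """Rotate (about the origin) and then translate a list of (x, y)
    coordinates. One pass over the input, so the work is O(n)."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = []
    for x, y in coords:
        rx = x * cos_a - y * sin_a  # rotation
        ry = x * sin_a + y * cos_a
        out.append((rx + dx, ry + dy))  # translation
    return out

print(transform([(1.0, 0.0)], dx=10.0, angle=math.pi / 2))  # [(10.0, 1.0)]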

However, let’s take an example problem: working out which UK postcodes belong to which Ordnance Survey grid cell. In human terms this is a trivial problem, as one can deduce very quickly and visually which postcode centroid is in which cell.

For a machine, though, the problem is slightly more complex: it needs to deduce programmatically which grid cell each postcode is in. This is really a point-in-bounding-box problem, a slightly simpler derivative of the point-in-polygon (PIP) problem. Ignoring the containment check itself for a minute, we can see that in order to solve the problem we must loop through all 1.8 million UK postcodes and determine which of the 55 Ordnance Survey grid cells each one sits in!

A naive implementation might look something like this (assuming the postcode and grid datasets are already loaded):


matches = {}

for postcode in postcodes:
    for grid in osgrid:
        # A postcode centroid can only fall in one cell, so stop at the first hit
        if pointInBoundingBox(postcode, grid):
            matches[postcode] = grid
            break

The double for loop inherently makes this an O(n * m) problem, which in the worst-case scenario (n and m of equal length) becomes an O(n²) problem. Ignoring Big O notation for a moment, we can say that this specific problem will take 99,000,000 operations to complete (1.8 million postcodes × 55 grid cells). This doesn’t take into account the added complexity of the internals of the pointInBoundingBox function (more for loops and so forth). Essentially, to obtain the answer we must compare both sets of inputs against each other, which can lead to very large numbers of operations as the inputs grow. Let’s just say, don’t be too surprised when your desktop GIS crashes solving these kinds of problems!

A slightly more nuanced approach might use some sort of R-tree, Quadtree or similar to provide a better way of dividing up and searching through the grid cells and postcode points. These data structures allow the data to be searched more efficiently, as we don’t have to loop through all of the inputs, only the parent (containing) cells.

Unfortunately, the process of building the tree is essentially a very similar problem to the original question itself: to build the tree you have to loop through all the points and all the grid cells, which is comparable in effort to solving the original question. A quote I remember from my old job regarding this situation: ‘to solve the problem, you have to solve the problem’.

Spatial indexes in spatial databases use exactly these kinds of trees to speed up spatial querying. Using a spatial index, and with it a data structure that quickly determines which section of the coordinate space a geometry sits in, allows for faster querying. You save a lot of time on the queries (for the reasons expressed above), but indexing all the data in the first place takes time in itself.
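
As a rough illustration of what this looks like in practice, here’s how the postcode/grid problem might be approached with a ready-made index (this assumes the Python rtree package, which wraps libspatialindex, and uses toy stand-in data; building the index is the up-front cost mentioned above):

from rtree import index

# Toy stand-ins for the real datasets (these would normally be loaded from file)
grid_cells = [(0, 0, 100, 100), (100, 0, 200, 100), (0, 100, 100, 200)]
postcodes = [("AA1 1AA", (42.0, 57.0)), ("BB2 2BB", (150.0, 20.0))]

# Build the R-tree over the grid cell bounding boxes
idx = index.Index()
for i, bounds in enumerate(grid_cells):
    idx.insert(i, bounds)

# Each point is now only checked against candidate cells returned by the
# index, rather than against every grid cell
results = {}
for postcode, (x, y) in postcodes:
    for i in idx.intersection((x, y, x, y)):
        results[postcode] = grid_cells[i]
        break

print(results)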

The next question is: could we parallelize this problem? That is to say, can we split it into smaller chunks and process it across multiple cores or machines? We can, but only if we don’t divide up the OS grid cells. Let me explain why. Say we send half the grid cells to one machine, along with half the points to be processed. Because each point has to be tested against the grid cell that actually contains it, we can’t just send any grid cells with any points to a machine; in the naive approach, to find out which grid cell a point resides in, we have no option but to test it against all of the grid cells. However, we can divvy up the postcode points any way we want. This isn’t too bad here, as we only have 55 grid cells, but if we had a million grid cells you can see how the computational time rapidly increases.
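
Here’s a sketch of that partitioning using Python’s multiprocessing module: every worker receives the full set of grid cells but only a slice of the postcode points (pointInBoundingBox is the same naive check as in the snippet above, and the worker count is arbitrary):

from multiprocessing import Pool

def pointInBoundingBox(point, box):
    x, y = point
    minx, miny, maxx, maxy = box
    return minx <= x <= maxx and miny <= y <= maxy

def assign_chunk(args):
    """Each worker gets ALL the grid cells but only a chunk of the points."""
    points_chunk, grid_cells = args
    matches = {}
    for postcode, point in points_chunk:
        for i, box in enumerate(grid_cells):
            if pointInBoundingBox(point, box):
                matches[postcode] = i
                break
    return matches

def parallel_assign(postcodes, grid_cells, workers=4):
    # Split the postcode points into roughly equal chunks, one per worker;
    # the grid cells are sent to every worker untouched
    chunks = [postcodes[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        results = pool.map(assign_chunk, [(c, grid_cells) for c in chunks])
    merged = {}
    for partial in results:
        merged.update(partial)
    return merged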

As you can see, the problem is inherently complex due to the multidimensional nature of geographic data and the kinds of topological questions we ask of it. Alongside this, most problems in GIS revolve around at least two datasets, and often these can be very large. Many operations, for example the topological query explained above, require a function to be performed that involves looping through the datasets. You can see another example of an O(n²) algorithm in this comparison of line intersection algorithms (Fischer et al., 2009). Brute force (testing all solutions) routing algorithms can be as bad as O(cⁿ), where c is a positive constant; in the most naive approach the problem is actually O(n!) (i.e. the factorial of the number of inputs).

Obviously there are many problems that are much harder than the geo problems expressed above, for example writing a program that can beat a human at a game of Go, which has a number of possible moves on the order of 10^(10^1023). Hopefully this post has nonetheless illustrated the complexities of handling big geo datasets. It has been demonstrated that although problems vary in complexity, many of our day-to-day GIS operations involving two datasets are of approximately O(n²) complexity without optimisations, and are also tricky to parallelize.

I would be really interested in hearing people’s insights on how industry and academia are moving forward in solving these problems. Equally, if you have any corrections or points of contention, please feel free to reach out.