23 Dec 2016, 16:56

Visualising Facebook's Population Data on the Web

Over the past month or so (on and off!) I’ve had the opportunity to explore two really interesting ideas. The first was relating to a talk I saw at FOSS4G in Bonn this year by Fabian Schindler. Fabian’s talk was on how to use GeoTIFF data in the web using a set of libraries called geotiff.js and plotty.js. You can see a post about his talk here. The talk got me thinking and interesting in what sort of GeoTIFF data could be pulled into the web.

Around the same time, a news article caught my eye about Facebook having teamed up with a research institute to create a global dataset of population densities. I later managed to find the data on the International Earth Science Information Network (CIESIN) website here. From here I explored ways of making use of the datasets in the web as per the talk I’d seen at FOSS4G.

The Problems

Upon examining the raw data I noticed some immediate problems with my plan:

  • The data is relatively large for the web - a country could be upward of 100mb
  • The data has multiple bands that may not all be useful to end users
  • WebGL isn’t great with large textures

To take on all these problems it was necessary to preprocess the data into a more manageable and usable format.

Preprocessing the Data

To solve the listed problems was a relatively tricky endeavour. There was a multistep process:

  • Extact a single band
  • Downsample the raster (reduce it’s resolution)
  • Split the raster into managable chunks
  • Allow the browser to know where these chunks are somehow

For all of these processes I made use of Python and GDAL. The band extraction was fairly straight forward process, but the downsampling and splitting were somewhat more complicated issues.

You can see the full solutions to the downsampling and band extraction problems on the GitHub repo. Splitting the data was probably the hardest problem to solve as I struggled to find any examples of this being done across the web that weren’t making call outs to the shell from within Python (something I wanted to avoid).

In order to correctly split data it was necessary to subdivide the raster into a given size grid. For this to work correctly we needed to get the top left and bottom right coordinates of all the grid cells. After some thought on solving this mental puzzle, I deduced that you can create an arbitrary (n by n) sized grid of such coordinates using the following function:

def create_tiles(minx, miny, maxx, maxy, n):

    width = maxx - minx
    height = maxy - miny

    matrix = []

    for j in range(n, 0, -1):
        for i in range(0, n):

            ulx = minx + (width/n) * i 
            uly = miny + (height/n) * j 

            lrx = minx + (width/n) * (i + 1)
            lry = miny + (height/n) * (j - 1)
            matrix.append([[ulx, uly], [lrx, lry]])

    return matrix

Splitting the tiles allows us to send the raster in chunks whilst avoiding using a tile server or any kind of dynamic backend. I created a JSON file that contained metadata for all the necessary resulting files, allowing us to determine their centroid and file location prior to requesting all of them.

Displaying the Data

Displaying the data on the frontend took a little bit of trial and error. I used a combination of OpenLayers 3, plotty.js and geotiff.js to accomplish the end result. geotiff.js allows us to read the GeoTIFF data, and plotty.js allows us to create a canvas element that can be used by OpenLayers 3 to correctly place the elements.

To request and handle the asynchronous loading of the data I used the Fetch API and Promises (I’ve included polyfills for both in the demo). Once all the promises have resolved we now have all the tiffs loaded into memory. From here we can use a select dropdown that allows us to change the colors used for presenting the data.

The end result looks a little something like this:

Pros and Cons of this Method

Pros

  • We got to a point where we can display the data in the web
  • The data can be restyled dynamically clientside
  • No dynamic backend or file server required, just static files after preprocessing

Cons

  • Tiles are actually less size efficient than one big file, but are necessary to get the data to display
  • The downsampling factors have to be quite large to get it to be a reasonable file size
  • Even tiled, file sizes are quite large (i.e. 9 tiles at 2mb a file == 18mb which is a lot for the web)

One interesting idea about having the data client side as opposed to a raw image is you could even go as far as to figure out how to do basic visual analytics on the client. For example you could find a way to display only the highest values or lowest values for a given GeoTIFF.

Have a Go

The repo is located at : https://www.github.com/JamesMilnerUK/facebook-data-viz

There are instructions in the README for usage.

20 Mar 2016, 21:44

The Hard Thing About Hard Geo Things

As spatial data sets have become larger, aggregated faster and from a wider variety of sources, handling large spatial datasets has become a challenge. As I’m sure many of you have experienced, many GIS operations can take a fairly long time to complete, especially with traditional desktop GIS techniques. Last week at the pub after geomob, I was discussing with a friend who works for Google in the Maps team about how they deal with such large datasets, and what techniques you can use to solve problems with them.

The conversation made me think about a couple of key questions about operating on geospatial data:

  • What sort of computational complexity are we looking at for fundamental GIS operations?
  • What problems can be parallelized or are suitable for approaches such as MapReduce?

For the purpose of this post we will use Big O notation to express the complexity of a problem. Big O notation expresses how complex a problem in relation to its inputs. This allows it to be neutral of all other factors such as hardware, data size (i.e. MB, GB etc), time units and so forth. As an example you can see in the graph above a linear problem (green - O(n)) requires a lot less operations to complete than a quadratic problem (light blue - O(n2)).

At its most basic level a geocode could be seen as a database lookup; the user searches for a postcode, this is checked in a database and a corresponding latitude and longitude is returned. The problem is slightly more complex than this obviously, but such a geocoder would without any optimisations be a O(n) problem. A O(n) problem is a linear problem who’s number of operations to complete is matched by the number of inputs given to it (i.e. 100 inputs results in 100 operations). This problem actually becomes a O(log n) problem if we were to index the postcode column.

Another example of a linear GIS problem is a geographic transformation (scale, rotate, translation etc) as we loop through all coordinates (inputs) and perform some mathematical function to transform the coordinates position. Similarly in a coordinate system reprojection we are mapping a function against all coordinates to transform the coordinate position into another coordinate system. As we only loop through the data and operate on it once, the problem is again linear.

However, let’s take an example problem to work out which UK postcodes belong to which Ordnance Survey grid cell. In human terms this a trivial problem, as one can deduct very quickly and visually which postcode centroid is in which cell.

However for a machine that problem is slightly more complex. It needs to programmatically deduce which grid cell it is in. This problem is actually a point in bounding box problem, which is a slightly simpler derivative of the point in polygon (PIP) problem. Ignoring this function for a minute however, we examine that in order to solve the problem we must loop through all 1.8 million UK postcodes and determine which of the 55 Ordnance Survey grid cells this sits in!

A naive pseudocode implementation might look like:


	for postcode in postcodes:
		for grid in osgrid:
			if pointInBoundingBox(postcode, grid):
				return True

The double for loop inherently makes this a O(n * m) problem which in worst case scenario a O(n2) problem i.e. n and m are of equal length. Ignoring Big O notation for a moment, we can actually say specifically for this problem it will take 99,000,000 operations to complete. This doesn’t take into account the added complexity that might be added by the internals of the pointInBoundingBox function (more for loops and so forth). Essentially, to obtain the answer we must compare both sets of inputs against each other which can lead to some very large number of operations as they increase. Let’s just say don’t be too surprised when your desktop GIS crashes solving these kinds of problems!

A slightly more nuanced approach might use some sort of R-Tree/Quadtree/etc to allow for a better way of dividing up and searching through the grid cells and postcode points. These data structures allow data to be better searched as we don’t have to loop through all of the inputs only the parent (containing) cells.

Unfortunately the process of building the tree is essentially a very similar problem to the original question itself. To build the tree you have to loop through all the points and all the grid cells which is of equal complexity to solving the original question. A quote I remember from my old job regarding this situation is to solve the problem, you have to solve the problem.

Spatial indexes on spatial databases use such trees to speed up spatial querying. Using spatial index and subsequently a data structure to quickly determine which section of a coordinate space a geometry exists in allows for faster querying. Here you save a lot of time on the queries (for reasons expressed above) but indexing all the data in the first place will take time in itself.

The next question is could we parallelize this problem? That is to say, can we split this problem into smaller chunks and process it across multiple cores or machines? We can, but only if don’t divide the OS grid cells. Let me explain why; let’s say we use half the grid cells to one machine along with half the points to be processed. Because we know the points must have the correct grid cells to pass the test, we can’t just send any grid cells and any points to the machines. In order to find out which grid cell a point resides in we have no option but to test against all the grid cells (in the naive approach). However, we can divvy up the postcode points in any way we want to. This whole process isn’t too bad as we only have 55 grid cells in this example, but let’s say we had a million grid cells, you can see how the computational time rapidly increases.

As you can see the problem is inherently complex due to the multidimensional nature of geographic data and the kinds of topological questions we ask of it. Alongside this most problems in GIS revolve around at least two data sets, and often these can be very large in size. Many operations, for example the topological query explained above require a function to be performed that requires looping through the datasets. You can see another example of a O(n2) algorithm comparing line intersection algorithms here: Fischer et al, 2009). Using bruteforce (testing all solutions) routing algorithms can be as bad as O(cn) where c is a positive constant. In the most naive approach the algorithm is problem is actually O(n!) (i.e. the factorial of the number of inputs).

Obviously there are many problems that are much harder than the geo problems expressed above, for example writing a program that can beat a human at a game of Go, which has 10 101023 possible moves. However hopefully this post has nonetheless illustrated the complexities with handling big geo datasets. It has been demonstrated that although all problems vary in complexity, many of our day to day GIS operations that involve two data sets are of approximately O(n2) complexity without optimisations, and are also tricky to parallelize.

I would be really interested in hearing people’s insights as to how the industry/academia is moving forward in solving these problems. Equally if you have any corrections or points of contentions please feel free to reach out.