Wednesday 24 October 2012

GDAL Contour

I wanted to create some contours from a DEM file today and decided to use GDAL instead of Arc or FME, and I was again pleasantly surprised by the simplicity and power of the GDAL environment. My first port of call was to read the documentation, which showed that gdal_contour's syntax does not differ greatly from the gdal_rasterize command discussed in the last post.

I decided to generate a set of 10m contours as a shapefile using the following command:
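
Assembled from the switches described below, it looked something like this:

    gdal_contour -b 1 -a elevation -snodata -9999 -i 10 ns67ne.tif contour.shp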


The switches do the following:
  • -b 1 selects the band of the image to process, which defaults to 1
  • -a elevation is the name of the contour elevation attribute that will be created
  • -snodata -9999 tells GDAL the value of nodata cells in the input raster, so they can be ignored
  • -i 10 is the elevation interval between contours
  • ns67ne.tif contour.shp are the input and output files, respectively

Monday 22 October 2012

GDAL Rasterize

I wanted to convert a vector outline of the GB coastline into a raster in order to use it as a background layer in MapGuide Maestro. I would usually do this in ArcMap, but I am trying to learn open source alternatives for these types of functions. As a result I have been playing around with GDAL and OGR for a week or so now, and have been very impressed with their power and versatility; I only wish I could run them at a UNIX shell at work instead of at the Windows command line. With these two packages, their Python bindings, NumPy, pyshp and matplotlib, I think I could begin to move away from ArcGIS without too much pain.

Version 1.8 and up of GDAL supports the creation of the output raster file when running the gdal_rasterize command, which makes this operation a single-line process. There are a number of switches and options available, but here I will only go through the ones used for this process. The command I used to convert the vector outline to a raster file was:
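
Assembled from the switches described below, the command was along these lines:

    gdal_rasterize -a ID -l GB_Fix -a_nodata -9999 -a_srs EPSG:3857 -tr 1000 1000 -ot Float32 GB_Fix.shp dest.tif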


The individual switches operate as follows:
  • -a ID indicates the attribute whose values will be burned into the raster. This can be extended using an SQL statement and the -sql switch, allowing only parts of a file to be converted
  • -l GB_Fix is the name of the layer to be converted
  • -a_nodata -9999 assigns a raster nodata value of -9999
  • -a_srs EPSG:3857 sets the coordinate system using an EPSG code
  • -tr 1000 1000 sets the raster cell size; the smaller these values are, the bigger the output file will be. I initially ran the command with 10 10 and got a file of over 15 GB
  • -ot Float32 sets the datatype of the output file, for example Byte, Float32 or Float64, depending on what the output format supports
  • GB_Fix.shp dest.tif are the input and output filenames, including extensions

This is just scratching the surface of this one command, but it serves as a starting point for any future conversions I have to perform using GDAL.

Friday 12 October 2012

Python os.walk

In my processing of OS OpenData into a useful format, I realized that I needed to traverse a lot of directories from a root folder in order to get to the files I wanted, as each map tile is contained in its own folder and these tiles are spread across a number of data CDs. The solution, as with many things in GIS, is to use Python, or more specifically the os.walk function.

For each directory it visits, os.walk returns a triplet of values: a string holding the path of the current directory, a list of the directories within it and a list of the files within it. These can be unpacked and iterated through with for loops to access the files and folders at each stage of the walk.

The syntax for this is as follows:
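
A minimal sketch (the root folder path here is just a placeholder):

    import os

    # Walk everything below the root, printing the current directory's
    # path, then the folders and files it contains
    for path, dirs, files in os.walk(r'C:\OSOpenData'):
        print(path)
        for folder in dirs:
            print(folder)
        for filename in files:
            print(filename)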


This iterates through the files and folders under the provided path and prints the paths, directories and files to the screen. From here it is possible to test for files of a particular extension or filename by replacing the print statements with conditional statements or other operations.

PyShp Attribute Types and Point files

A quick note on attribute types when using pyshp. These have always seemed really clunky to me, needing a lot of trial and error to get working, but I came across the documentation for xBase, the format used to store the attributes in shapefiles, and it is a treasure trove of information.

The following information is of particular relevance for storing geographic data as attributes, which I could never get right before:

  • C is ASCII characters
  • N is a number, stored as ASCII text and limited to around 18 characters in length
  • D is for dates in the YYYYMMDD format, with no spaces or hyphens between the sections
  • F is for floating point numbers with the same length limits as N
  • L is for logical data, stored in the shapefile's attribute table as 1 (true) or 0 (false). The values it can receive are 1, 0, y, n, Y, N, T, F or the Python built-ins True and False
Each of these datatypes is stored in an attribute field created by pyshp with code like the following:
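
For example, sketched against the pyshp 1.x API (the field name is just an example):

    import shapefile

    w = shapefile.Writer(shapefile.POINT)
    # A numeric field 18 characters long with 8 decimal places of precision
    w.field('HEIGHT', 'N', 18, 8)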



Here the two numbers are the length and the precision of the numeric value being stored. They are not needed for integers and non-numeric datatypes, but a length of 1 is needed for logical data.

I have written a small code sample which creates data from lists to show the different datatypes being used correctly in pyshp to create points:
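
A sketch of that sample, written against the pyshp 1.x API with made-up data and field names:

    import shapefile

    # Made-up sample data: coordinates plus one attribute of each type
    coords = [(531000, 181000), (532500, 182250)]
    names = ['First', 'Second']
    heights = [10.5, 22.75]
    surveyed = ['20121012', '20121013']   # D fields take YYYYMMDD strings
    checked = [True, False]               # L fields take True/False

    w = shapefile.Writer(shapefile.POINT)
    w.field('NAME', 'C', 40)        # character
    w.field('HEIGHT', 'N', 18, 8)   # number: length 18, precision 8
    w.field('SURVEYED', 'D', 8)     # date
    w.field('CHECKED', 'L', 1)      # logical: length 1

    for (x, y), name, height, date, flag in zip(coords, names, heights,
                                                surveyed, checked):
        w.point(x, y)
        w.record(name, height, date, flag)

    w.save('points')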


Hopefully I will have time soon to write up examples for polygons and lines and more complex use cases of pyshp.

Thursday 11 October 2012

Python's Timeit Module

I am working on a small script which combines different parts of the Ordnance Survey OpenData into a single file, to produce, for example, a complete outline of the UK from all of the different map tiles the OS provides its data in. I hope to post some more parts of the script and maybe a full listing over the next day or so, as I am using it as an opportunity to learn some new Python tricks.

The first task I came across was filtering out the files that I wanted to merge and storing them in a list. I automatically wrote this statement inside a for loop:
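
Reconstructed, with the root path and list name as placeholders, it was something like:

    import os

    root = r'C:\OSOpenData'  # placeholder root folder
    boundary_files = []
    for path, dirs, files in os.walk(root):
        for f in files:
            # Keep anything with 'AdministrativeBoundary' in its name
            if 'AdministrativeBoundary' in f:
                boundary_files.append(os.path.join(path, f))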


This loops through all the folders; anything with 'AdministrativeBoundary' in the name is joined to the base path and added to the list. However, I wondered whether this was the most efficient way of comparing strings when Python's re module is available, so I swapped the line for the equivalent regex:
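
The test inside the same loop becomes:

    import re

    if re.search('AdministrativeBoundary', f):
        boundary_files.append(os.path.join(path, f))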


Doubtless there are better ways of doing this whole operation, which I may return to in future posts, but for now I simply wanted to know which of these two string comparisons is faster. Because this is Python, there is a great little module for this called timeit.

At its simplest level, the timeit module can run a single statement, in this case a list comprehension converting the string hello world to upper case, and output the execution time:
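
For example (the statement itself is just an illustration):

    import timeit

    # timeit runs the statement 1,000,000 times by default
    print(timeit.timeit('[c.upper() for c in "hello world"]'))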


This time, given in seconds, is the time taken to run 1,000,000 iterations of the statement; the number of iterations can be set as follows:
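
For example, using the number argument (5000 is an arbitrary choice):

    import timeit

    print(timeit.timeit('[c.upper() for c in "hello world"]', number=5000))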


This is great if you need to compare two single line statements, or are willing to spend the time condensing a block of code into a single line statement, but more often you will want to time the execution of a block of code. This is performed by creating a module to test and calling it from timeit:
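
Something like this, where TestModule is the module under test and test() is a placeholder for whatever function it exposes:

    import timeit

    # The setup statement imports the module; the statement under test calls it
    print(timeit.timeit('TestModule.test()', setup='import TestModule',
                        number=5000))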


This has the same result as the single line statements but now imports the TestModule and executes it a defined number of times.

To try and control for timing errors and other operating system processes getting in the way of an accurate result, it is a good idea to run the timeit() command several times to get a range of results. This could be done with a loop, but there is a convenience function called repeat() which does this for you. It works in the same way as timeit(), but takes two arguments: repeat, the number of times to loop through the test, and number, as discussed above:
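
Along these lines (again with placeholder names):

    import timeit

    # Run the 5000-iteration timing five times over, returning a list of floats
    results = timeit.repeat('TestModule.test()', setup='import TestModule',
                            repeat=5, number=5000)
    print(results)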


This returns a list of results instead of a single float, upon which all manner of tests could be performed. But before I broke out scipy and matplotlib I read this in the timeit documentation:
Note: It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.
So the sensible thing to do is to take the minimum value found in the list and consider that the optimal benchmark execution time for a code block.

Back to the question posed earlier: which is faster, Python's in operator or the equivalent regex? Using what I had just learned about the timeit module, I copied the two chunks of code into a new file, wrapped them up as two functions, called them both through timeit and repeated the tests 20 times to get two minimum execution times:
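
A sketch of the comparison, assuming the two chunks were wrapped up as in_test() and re_test() in a file called filters.py (the names and the iteration count are placeholders):

    import timeit

    in_time = min(timeit.repeat('filters.in_test()', setup='import filters',
                                repeat=20, number=100))
    re_time = min(timeit.repeat('filters.re_test()', setup='import filters',
                                repeat=20, number=100))
    print(in_time, re_time)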


Upon running the code, the results were a bit disappointing: there is very little difference between the two methods, suggesting that I should stick with in and remove the import of the regex library to keep things simple.

Wednesday 10 October 2012

Coordinate Systems and Web Mapping

I have, as have many other GIS people, long struggled with the range of coordinate systems in use in different areas and locations. I am currently trying to cope with on-the-fly translations between British National Grid and WGS84 Web Mercator, the system of choice for Google Maps et al., for a web GIS I am developing at work.

I came across this blog post explaining the history of the Web Mercator projection, which served to clear up a lot of my questions and confusion.

For my future reference: use EPSG code 3857 and the following WKT parameters to correct for Google's assumption that the Earth is spherical:
PROJCS["Popular Visualisation CRS / Mercator", GEOGCS["Popular Visualisation CRS", DATUM["WGS84", SPHEROID["WGS84", 6378137.0, 298.257223563, AUTHORITY["EPSG","7059"]], AUTHORITY["EPSG","6055"]], PRIMEM["Greenwich", 0, AUTHORITY["EPSG", "8901"]], UNIT["degree", 0.0174532925199433, AUTHORITY["EPSG", "9102"]], AXIS["E", EAST], AXIS["N", NORTH], AUTHORITY["EPSG","4055"]], PROJECTION["Mercator"], PARAMETER["semi_minor",6378137], PARAMETER["False_Easting", 0], PARAMETER["False_Northing", 0], PARAMETER["Central_Meridian", 0], PARAMETER["Latitude_of_origin", 0], UNIT["metre", 1, AUTHORITY["EPSG", "9001"]], AXIS["East", EAST], AXIS["North", NORTH], AUTHORITY["EPSG","3785"]]

Monday 8 October 2012

ArcMap Label Engine Scripting


A question came up on the GIS StackExchange this morning asking about how to perform string indexing in VBScript for ArcMap 10's label engine. It made me realize how verbose VBScript is when compared to Python, and to an extent, JavaScript.

The question required the first two characters of a field to be placed on the first line of the label, the remainder of the field on the second line, and another field on the third line.

To do this in VBScript requires the Left function to get the first two characters and the Mid function combined with Len to get the remainder of the string:
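
Something like this, with placeholder field names:

    Function FindLabel ( [FIELD1], [FIELD2] )
      FindLabel = Left([FIELD1], 2) & vbNewLine & _
                  Mid([FIELD1], 3, Len([FIELD1]) - 2) & vbNewLine & _
                  [FIELD2]
    End Function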


The JavaScript code is a bit cleaner, using \n instead of the horrendous VBNewline, but still needs the substring function combined with the string's length to chop up the string:
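
The equivalent, with the same placeholder fields:

    function FindLabel ( [FIELD1], [FIELD2] )
    {
      return [FIELD1].substring(0, 2) + "\n" +
             [FIELD1].substring(2, [FIELD1].length) + "\n" +
             [FIELD2];
    }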


The Python code uses the \n newline character and simple string indexing to chop up the string: string[:2] returns the first two characters of a string and string[2:] returns everything after the first two characters until the end of the string, with no need for len(). This is a much more elegant solution, which I also think is a lot easier to read, regardless of language preference:
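
Again with placeholder field names:

    def FindLabel ( [FIELD1], [FIELD2] ):
      return [FIELD1][:2] + "\n" + [FIELD1][2:] + "\n" + [FIELD2]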