Geodata

Tip

Use QGIS to display geospatial data and to create maps in PDF or image formats (e.g., tif, png, jpg).

Geodata sources

Geospatial data can be retrieved for various purposes from different sources. Here are some of them:

Visualization

GIS software is needed to display geospatial data and many tools exist. This website primarily provides examples using QGIS. Since the use of GIS software, especially QGIS, is necessary in several places on the website, explanations on how to install QGIS are already included on the Get Started > Geospatial software page.

Tip

The BASEMENT pre-processing page features the basics of geospatial data handling with QGIS. Therefore, this introduction to numerical modeling is also a good introduction to QGIS.

Geodatabase

A geodatabase (also known as a spatial database) can store, query (e.g., using the Structured Query Language, SQL), or modify data with geographic references (geospatial data). Geospatial data primarily consist of vector data (see shapefiles), but raster data can also be stored. A geodatabase links these data with attribute tables and geographic coordinates. The special aspect of geodatabases is that users can query and manipulate the data via a (web or local) GIS (geographic information system) server. For example, software like QGIS (or ArcGIS Pro) can run queries on locally stored geodata as if they were hosted on a local server. The typical geodatabase format is .gdb, which QGIS and ArcGIS treat like a directory, and the maximum size of a .gdb file is 1 terabyte.

gdb

Fig. 24 Functional skeleton of a geodatabase.
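The idea of querying georeferenced records with SQL can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration and not a real .gdb geodatabase: the table name gauges, its columns, and all coordinates except the IWS point are hypothetical.

```python
import sqlite3

# In-memory stand-in for a geodatabase table (hypothetical schema and values).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE gauges (id INTEGER PRIMARY KEY, name TEXT, lon REAL, lat REAL)"
)
con.executemany(
    "INSERT INTO gauges (name, lon, lat) VALUES (?, ?, ?)",
    [
        ("IWS", 9.1040, 48.7442),
        ("Gauge A", 8.4000, 49.0000),
        ("Gauge B", 11.5800, 48.1400),
    ],
)


def names_in_bbox(connection, lon_min, lat_min, lon_max, lat_max):
    """Return the names of all points located inside a bounding box."""
    rows = connection.execute(
        "SELECT name FROM gauges "
        "WHERE lon BETWEEN ? AND ? AND lat BETWEEN ? AND ? "
        "ORDER BY name",
        (lon_min, lon_max, lat_min, lat_max),
    )
    return [row[0] for row in rows]


print(names_in_bbox(con, 8.0, 48.5, 10.0, 49.5))
```

A full geodatabase additionally stores geometries and spatial indices, so the bounding-box filter on plain lon and lat columns is a simplification.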

Vector data

Vector data are visually smooth and efficient for overlay operations, especially regarding shape-driven geo-information such as roads or surface delineations. Vector data are typically less storage-intensive, easier to scale, and more compatible with relational environments. Common formats are shapefiles (.shp), GeoJSON, and TIN.

The shapefile format was invented by Esri (download their PDF documentation), and shapefiles can contain:

  • Polygons (surface patches).

  • Points with x-y-z coordinates and an m field containing point data.

  • (Poly)lines consisting of line segments defined by start points and endpoints.

Shapefile

Note

The gdal.ogr driver name for shapefile handling is 'ESRI Shapefile', that is, the driver is loaded with ogr.GetDriverByName('ESRI Shapefile'). A shapefile is not just one file; it consists of three mandatory parts and, typically, a projection file:

  • a .shp file, where geometries are stored,

  • a .shx file, where indices of the geometries are stored,

  • a .dbf file containing attribute information (constitutes the attribute table), and

  • a .prj file that stores the projection (optional, but needed to georeference the data correctly).

These files need to be in the same folder and share the same base name - otherwise, the shapefile does not work. A couple of other files may occur when we manipulate a shapefile (e.g., .atx, .sb*, .shp.xml, .cpg, .mxs, .ai*, or .fb*), but we can ignore those files.
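Whether the companion files of a shapefile are complete can be checked before opening it. The following is a minimal sketch using only the standard library, assuming that .shp, .shx, and .dbf are the mandatory parts and that .prj is the optional projection file; the function names are hypothetical helpers and not part of gdal or QGIS.

```python
from pathlib import Path

# Mandatory parts of a shapefile (the .prj projection file is checked separately).
MANDATORY_SUFFIXES = (".shp", ".shx", ".dbf")


def missing_shapefile_parts(shp_path):
    """Return the mandatory suffixes for which no file exists next to shp_path."""
    base = Path(shp_path)
    return [s for s in MANDATORY_SUFFIXES if not base.with_suffix(s).exists()]


def has_projection(shp_path):
    """Check whether a .prj file with the same base name accompanies the shapefile."""
    return Path(shp_path).with_suffix(".prj").exists()
```

For example, missing_shapefile_parts("rivers.shp") returns an empty list only if rivers.shp, rivers.shx, and rivers.dbf are located in the same folder (rivers.shp is a placeholder file name).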

Shapefile vector data typically have an attribute table (similar to a geodatabase) in which each polygon, line, or point object can be assigned an attribute value. Attributes are defined by columns along with their names (column headers) and can have numeric (e.g., float, double, int, or long), text (string), or date/time (e.g., yyyymmdd or HH:MM:SS) formats.

shapefile presentation

Shapefile versus geodatabase

A shapefile can be understood as a competing format to a geodatabase. Which file format is better? Strictly speaking, both a geodatabase and a shapefile can perform similar operations, but a shapefile requires more storage space for similar contents, cannot store combined date and time values, and supports neither raster files nor Null (no data) values. So basically, we are better off with geodatabases, but shapefiles are popular and many geospatial operations focus on shapefile manipulations.

Triangulated Irregular Network (TIN)

A triangulated irregular network (TIN) represents a surface consisting of multiple triangles. In hydraulic engineering and water resources research, one of the most important uses of a TIN is the generation of computational meshes for numerical models (e.g., on this website’s BASEMENT tutorial). In such models, a TIN consists of lines and nodes forming georeferenced, three-dimensionally sloped triangles of the surface, which represent a digital elevation model (DEM). TIN nodes have georeferenced coordinates and potentially more attribute information such as node IDs and elevation. The advantage of a TIN DEM over a raster DEM is that it requires less storage space. Alas, manipulating a TIN is not as easy as manipulating a raster. The figure below shows an example TIN created with matplotlib.tri.TriAnalyzer (https://matplotlib.org/3.1.1/api/tri_api.html#matplotlib.tri.TriAnalyzer) and based on a showcase from the matplotlib docs. The file ending of a TIN is .TIN.

tin-illu

Fig. 25 Illustration of a TIN.
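How a TIN represents a sloped surface through nodes and triangles can be sketched in plain Python. The example below stores nodes with x-y-z coordinates and triangles as tuples of node IDs, and linearly interpolates the elevation inside a triangle with barycentric weights. All node coordinates are hypothetical, and the data layout is an illustrative assumption, not the .TIN file format.

```python
# Nodes: id -> (x, y, z); triangles: tuples of three node ids (hypothetical values).
nodes = {
    0: (0.0, 0.0, 10.0),
    1: (10.0, 0.0, 12.0),
    2: (0.0, 10.0, 14.0),
    3: (10.0, 10.0, 11.0),
}
triangles = [(0, 1, 2), (1, 3, 2)]


def barycentric_weights(p, a, b, c):
    """Return the barycentric weights of the 2D point p in triangle a-b-c."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    w_a = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    w_b = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return w_a, w_b, 1.0 - w_a - w_b


def elevation_at(x, y):
    """Interpolate z linearly in the triangle containing (x, y); None if outside."""
    for tri in triangles:
        a, b, c = (nodes[i] for i in tri)
        weights = barycentric_weights((x, y), a, b, c)
        if all(w >= -1e-12 for w in weights):  # small tolerance for edge points
            return sum(w * n[2] for w, n in zip(weights, (a, b, c)))
    return None


print(elevation_at(2.0, 2.0))
```

The loop over all triangles is only practical for tiny meshes; computational meshes with many triangles usually rely on a spatial index to find the enclosing triangle.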

GeoJSON

Note

The gdal.ogr driver name for GeoJSON handling is 'GeoJSON', that is, the driver is loaded with ogr.GetDriverByName('GeoJSON').

GeoJSON is an open format for representing geographic data with simple feature access standards, where JSON denotes JavaScript Object Notation (read more about JSON file manipulation in the Python intro on this website). The GeoJSON file name ending is .geojson and a file typically has the following structure:

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [9.104028940200806, 48.74417005744522]
      },
      "properties": {
        "name": "IWS"
      }
    }
  ]
}

Visit geojson.io to build a customized GeoJSON file. While GeoJSON can provide height information (z values) as a properties value, the still rather young TopoJSON format is a more suitable offshoot for encoding geospatial topology.
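The GeoJSON structure shown above can also be built and read with Python's built-in json module. The following minimal sketch reproduces the IWS point feature; the helper function point_feature is a hypothetical name introduced here for illustration.

```python
import json


def point_feature(lon, lat, **properties):
    """Build a GeoJSON Feature with a Point geometry (longitude first, then latitude)."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": properties,
    }


collection = {
    "type": "FeatureCollection",
    "features": [point_feature(9.104028940200806, 48.74417005744522, name="IWS")],
}

# Text that could be written to a file with the .geojson ending.
geojson_text = json.dumps(collection, indent=2)

# Reading works the other way round: parse the text and loop over the features.
for feature in json.loads(geojson_text)["features"]:
    lon, lat = feature["geometry"]["coordinates"]
    print(feature["properties"]["name"], lon, lat)
```

Note that GeoJSON coordinates are ordered longitude first and latitude second, which is why the helper takes its arguments in that order.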

Gridded cell (raster) data

Raster datasets store pixel values (cells), which require large storage space, but have a simple structure. A big advantage of rasters is the possibility to perform powerful geospatial and statistical analyses. Common raster formats are, among others, .tif (GeoTIFF), GRID (a folder with BND, HDR, STA, VAT, and other files), .flt (floating points), ASCII (American Standard Code for Information Interchange), and many more image-like file types.

Tip

Preferably use the GeoTIFF format in raster analyses. A GeoTIFF dataset typically includes a .tif file (with heavy data) and a .tfw file (a six-line plain text world file containing georeference information).
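The six lines of a world file define an affine transformation from pixel indices to map coordinates. The sketch below assumes the common world file convention, in which the lines hold the coefficients A, D, B, E, C, and F (x pixel size, two rotation terms, negative y pixel size, and the x and y coordinates of the center of the upper-left pixel). The example values and function names are hypothetical.

```python
def parse_world_file(text):
    """Read the six world file coefficients in their stored order A, D, B, E, C, F."""
    a, d, b, e, c, f = (float(line) for line in text.split())
    return a, d, b, e, c, f


def pixel_to_map(col, row, coefficients):
    """Convert a pixel (column, row) to the map coordinates of the pixel center."""
    a, d, b, e, c, f = coefficients
    return a * col + b * row + c, d * col + e * row + f


# Hypothetical world file: 50-m cells, no rotation,
# upper-left pixel center at x=500025, y=5399975.
example_tfw = "50.0\n0.0\n0.0\n-50.0\n500025.0\n5399975.0\n"
coefficients = parse_world_file(example_tfw)
print(pixel_to_map(2, 3, coefficients))  # (500125.0, 5399825.0)
```

The y pixel size is negative because row numbers increase downward while northing values increase upward.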

Note

The gdal driver name for GeoTIFF handling is 'GTiff', that is, the driver is loaded with gdal.GetDriverByName('GTiff').

raster file illustration GeoTiff

Fig. 26 Illustration of the Natural Earth’s NE1_50M_SR_W.tif raster zoomed on Nepal, with point and line shapefiles indicating major cities and country borders, respectively. Take note of the tile-like appearance of the grid, where each tile corresponds to one raster cell (the 50M in the file name denotes Natural Earth’s 1:50 million map scale, not a 50-m cell size).

Projections and coordinate systems

In geospatial data analyses, a projection represents an approach to flatten (a part of) the globe. In this flattening process, the latitudinal (North/South) and longitudinal (West/East) coordinates of a location on the three-dimensional (3D) globe are projected onto the coordinates of a two-dimensional (2D) map. Projecting 3D coordinates onto 2D coordinates causes distortions, and a variety of projection systems is used in geospatial analyses. In practice, this means that if we use geospatial data files with different projections, a distortion effect propagates into all subsequent calculations.

It is absolutely crucial to avoid distortion effects by ensuring that the same projections and coordinate systems are applied to all geospatial data used. This consistency starts with the creation of a new geospatial layer (e.g., a point vector shapefile) in QGIS and should be maintained in all program codes. To specify a projection or coordinate system in QGIS, click on Project > Properties > CRS tab and select a COORDINATE_SYSTEM. For example, an appropriate coordinate system for central Europe is ESRI:31493 (read more in the QGIS docs). Projected systems may vary with regions (local coordinate systems) and can be looked up, for example, at epsg.io or spatialreference.org.

In shapefiles, information about the projection is stored in a .prj file (recall definitions in the geospatial data section), which is a plain text file. The Open Geospatial Consortium (OGC) and Esri use Well-Known Text (WKT) files for standard descriptions of coordinate systems, and such a WKT-formatted .prj file can look like this:

PROJCS["unknown",GEOGCS["GCS_unknown",
DATUM["D_Unknown_based_on_GRS80_ellipsoid",SPHEROID["GRS_1980",6378137.0,298.257222101]],
PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]],
PROJECTION["Lambert_Conformal_Conic"], PARAMETER["False_Easting",6561666.66666667],
                             ..., UNIT["US survey foot",0.304800609601219]]

In GeoJSON files, the standard coordinate system is WGS84 according to the developer’s specifications. The units and measures defined in the WKT-formatted .prj file also determine the units of WKB (Well-Known Binary) definitions of geometries such as line length (e.g., in meters, feet, or many more), or polygon area (square meters, square kilometers, acres, and many more).
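To find out which units a .prj file implies, the UNIT entries of its WKT string can be extracted with a regular expression. The sketch below assumes a WKT layout like the one shown above, where the angular unit of the geographic coordinate system comes first and the linear unit of the projected system comes last. The WKT string in the example is deliberately shortened for illustration and is not a complete projection definition.

```python
import re

# Matches entries such as UNIT["Degree",0.0174532925199433]
UNIT_PATTERN = re.compile(r'UNIT\["([^"]+)",\s*([-+0-9.eE]+)')


def wkt_units(wkt):
    """Return (name, conversion factor) pairs of all UNIT entries in order of appearance."""
    return [(name, float(factor)) for name, factor in UNIT_PATTERN.findall(wkt)]


# Shortened, illustrative WKT string (not a complete projection definition).
wkt = (
    'PROJCS["example",GEOGCS["GCS_example",'
    'UNIT["Degree",0.0174532925199433]],'
    'UNIT["US survey foot",0.304800609601219]]'
)

angular, linear = wkt_units(wkt)
print(linear)  # ('US survey foot', 0.304800609601219)
```

The conversion factor of the linear unit states how many meters one unit corresponds to, so lengths in this example would be reported in US survey feet.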

Tip

To ensure that all geometries are measured in meters and powers of meters, use EPSG:3857 (formerly 900913 - g00glE) to define the WKT-formatted projection file.
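How geographic coordinates turn into metric map coordinates can be illustrated with the spherical Mercator formulas that underlie EPSG:3857. The following minimal sketch projects WGS84 longitude and latitude (in degrees) to x and y in meters, assuming the WGS84 semi-major axis as sphere radius; the function name is hypothetical, and a library such as gdal would normally handle such transformations.

```python
import math

EARTH_RADIUS = 6378137.0  # WGS84 semi-major axis in meters, used by EPSG:3857


def lonlat_to_web_mercator(lon, lat):
    """Project WGS84 longitude/latitude (degrees) to EPSG:3857 x/y (meters)."""
    x = EARTH_RADIUS * math.radians(lon)
    y = EARTH_RADIUS * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


# Coordinates of the IWS point from the GeoJSON example.
x, y = lonlat_to_web_mercator(9.104028940200806, 48.74417005744522)
print(round(x, 1), round(y, 1))
```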