Structured Objects

This lesson starts with some background information about XML and what XML has to do with workbooks and JSON: XML is the abbreviation of ExtensibleMarkupLanguage that defines rules for encoding documents. XML was designed for straightforward usage over the internet and we encounter XML documents all the time, on websites (e.g., XHTML), in the shape office documents (e.g., Office Open XML such as docx, pptx, or xlsx), or podcasts (e.g., RSS). The strength of the XML format is its characteristic of being both machine-readable (i.e., a computer can process XML files) and human-readable (i.e., we can read it like a newspaper). In simple words, entering formulas in an xlsx workbook is simultaneously machine-readable and human-readable, since humans and computers can interpret and evaluate the formulas in this XML frame. Other file formats such as JSON (JavaScript Object Notation) resemble XML and this is why we look at how Python can extract information from and export information to both formats. In water resources engineering and research, we are often interested in the exchange of information with office workbooks (xlsx files), or with JSON files, which provide, for example, boundary conditions for a numerical model.

Workbook (xlsx) handling

Why do we want to communicate with workbooks at all? We have already seen that Python is much more powerful than office programs for the systematic analysis of datasets. However, Python requires the abstraction of data fields in our minds, for example to visualize the structure of a nested list. For this reason, data from and for marketing, your (future) boss, or public entities are often required to have visually easy-to-use workbook formats, which can be overlooked quickly. Still, we want to leverage the content of such workbook information efficiently with Python and we want to produce visually simplistic output that anyone can read without any Python knowledge.

We have already seen that pandas provides easy routines for importing and exporting data from and to workbooks, respectively (recall the data processing & file handling page). pandas uses xlrd, XlsxWriter or openpyxl depending on what is available in the current Python environment. Both packages have different behaviors, in particular when it comes to writing new or in existing workbooks. This is why the most robust approach is the direct use of one particular workbook handling library, where openpyxl is one of the most powerful options (note: this assertion is subjective) and this section introduces workbook handling with openpyxl.

This introduction uses the following workbook-related terms:

  • workbook is the main xlsx file we work with (also called spreadsheet);

  • sheet is tabular content of a workbook and one workbook can have multiple sheets;

  • columns are vertical lines in a sheet;

  • rows are horizontal lines in a sheet;

  • cells are elements of a sheet.

Create a workbook

openpyxl has a Workbook class that enables to create and fill workbooks with data. Typically, an instance of the Workbook class is called wb and worksheets contain the letters ws.

import openpyxl as oxl
wb = oxl.Workbook()  # create a Workbook instance
ws = wb.active  # activate worksheet
ws.titel = "Gaussian 2D"  # name worksheet
ws["A1"] = "Gaussian sample data"  # write to cell A1

# generate some data
x, y = np.meshgrid(np.linspace(-1, 1, 20), np.linspace(-1, 1, 20))
dis = np.sqrt(x * x + y * y)
sigma, mu = 1.0, 0.0
gaussian = np.exp(-((dis - mu) ** 2 / (2.0 * sigma ** 2)))

# write data to worksheet
m, n = gaussian.shape
for i in range(1, m):
    for j in range(1, n):
        _ = ws.cell(row=i+1, column=j, value=gaussian[i-1, j-1])

print("Workbook data in cell A2: "+ str(ws["A2"].value))
print("Corresponds to np.array value: " + str(gaussian[0, 0]))

# save and close (destruct object) workbook
wb.save(filename="data/python_workbook.xlsx")
wb.close()
Workbook data in cell A2: 0.3678794411714422
Corresponds to np.array value: 0.3678794411714422
img/py-xlsx.png

Read and manipulate an existing workbook

Attention

When openpyxl opens an existing

workbook, it cannot read graphical objects (e.g., graphs, shapes, or images). For this reason, saving the workbook with the same name will make graphical objects disappear.

openpyxl reads existing workbooks with openpyxl.load_workbook(filename=str()). This function accepts optional keyword arguments, where the most important are:

  • read_only=BOOLEAN decides whether a workbook is opened in read only mode. A workbook can only be manipulated if read_only=False (this is the default option, which can be useful to handle large files or to ensure that graphical objects are not lost).

  • write_only=BOOLEAN decides whether a workbook is opened in write only mode. If write_only=True (the default is False), no data can be read from a workbook, but writing data is significantly faster (i.e., this option is useful to write large datasets).

  • data_only=BOOLEAN determines if cell formulae or cell data will be read. For example, when a workbook cell’s content is =PI(), data_only=False (this is the default option) reads the cell value as =PI() and data_only=True reads the cell value as 3.14159265359.

  • keep_vba=BOOLEAN controls whether Visual Basic elements (macros) are kept or not. The default is keep_vba=False (i.e., no preservation). Still, keep_vba=True will not enable a modification of Visual Basic elements.

If read_only=False, we can manipulate cell values and also cell formats, including data formats (e.g., date, time, and many more), font properties (and many more cell styles), or colors in HEX Color Code (find your favorite color here). The following example opens the above created python_workbook.xlsx, adds a new worksheet, illustrates the implementation of some cell styles and fills it with randomized discharge measurements.

import datetime
from openpyxl.styles import Font, Alignment, PatternFill
wb = oxl.load_workbook(filename="data/python_workbook.xlsx", read_only=False)
ws = wb.create_sheet(title="Discharge")

# define title styles
title_font = Font(name="Tahoma", size="11", bold=True, italic=True, color="C1D0DE")
title_fill = PatternFill(fill_type="solid", start_color="050505", end_color="073AD4")
title_align = Alignment(horizontal='center', vertical='bottom', text_rotation=0,
                        wrap_text=False, shrink_to_fit=False, indent=0)

date_time_format = "yyyy-mm-dd hh:mm:ss"
ws["A1"] = "Date-Time (%s)" % date_time_format

title_cell_flow = ws["B1"]
title_cell_flow.value = "Discharge (CMS)"
title_cell_flow.font = title_font
title_cell_flow.fill = title_fill
title_cell_flow.alignment = title_align

# define time period and time delta of 1 hour = 3600 seconds
current_date_time = datetime.datetime(2040, 12, 24, 0, 0)
dt = datetime.timedelta(seconds=3600)

# write random discharges to workbooks
for row in ws.iter_rows(min_row=2, max_row=26, min_col=1, max_col=2):
    row[0].value = current_date_time
    row[0].number_format = date_time_format
    row[1].value = np.random.random_sample(size=None) * 100
    row[1].number_format = "0.00"
    current_date_time += dt

wb.save("data/python_workbook_reloaded.xlsx")
wb.close()
img/py-xlsx-reloaded.png

The below code block provides the short helper function read_columns to read only one or more columns into a (nested) list (reads until the maximum number of rows, defined by ws.rows, in a workbook is reached). A similar function can be written for reading rows.

def read_columns(ws, start_row=0, columns="ABC"):
    return [ws["{}{}".format(column, row)].value for row in range(start_row, int(ws.rows.__sizeof__()) + 1) for column in columns]

# example usage:
wb = oxl.load_workbook(filename="data/python_workbook.xlsx", read_only=False)
ws = wb.active
col_D = read_columns(ws, start_row=2, columns="D")
col_F = read_columns(ws, start_row=2, columns="F")
wb.close()

Formulae in workbooks

The optional keyword argument data_only=False enables reading workbook formula instead of cell values. However, not all workbook formulae are recognized by openpyxl and in the case of doubts, a dirty try-and-error approach is the only remedy. As an example, change SQRT in the below example to the formula in question.

from openpyxl.utils import FORMULAE
print("SQRT" in FORMULAE)
True

(Un)merge cells

Merging and un-merging cells is a popular office function for style purposes and openpyxl also provides functions to perform merge operations:

ws.merge_cells(start_row=1, end_row=3, start_column=1, end_column=2)
ws.unmerge_cells(start_row=1, end_row=3, start_column=1, end_column=2)

Charts (plots)

In the unlikely event that you want to insert plots directly into workbooks (`matplotlib <https://hydro-informatics.github.io/hypy_pyplot.html#matplotlib>`__ is way more powerful), openpyxl offers features for this purpose as well. To illustrate the creation of an area chart, the below code block re-uses the first column of random values in the previously created python_workbook.xlsx.

from openpyxl.chart import AreaChart, Reference, Series

wb = oxl.load_workbook(filename="data/python_workbook.xlsx", read_only=False)
ws = wb.active

chart = AreaChart()
chart.title = "Random Gaussian"
chart.style = 10
chart.x_axis.title = "Cell row"
chart.y_axis.title = "Random value (-)"

col_D = Reference(ws, min_col=4, min_row=2, max_row=20)
col_F = Reference(ws, min_col=6, min_row=2, max_row=20)

chart.add_data(col_F, titles_from_data=False)
chart.add_data(col_D, titles_from_data=False)

ws.add_chart(chart, "B2")

wb.save("data/python_workbook_chart.xlsx")
wb.close()
img/py-xlsx-plot.png

Other workbook charts are available and their implementation (still: why would you?) is explained in the openpyxl docs.

Customize workbook manipulation

There are many ways of modifying workbooks and openpyxl provides close-to “shovel-ready” methods to manipulate workbooks. Still, in order to avoid re-reading this lesson every time you want to manipulate a workbook, it is much more convenient to have your own workbook manipulation classes ready. For example, use custom Read and Write classes, where Read is the parent class of the Write class (see inheritance of classes). The Read class may contain tailored functions for reading specific columns, rows, or arrays. The below code block illustrates a basic example for such Read and Write classes with the above read_columns function implemented as a method of the Read class.

import openpyxl as oxl

class Read:
    def __init__(workbook_name="", *args, **kwargs):
        read_only = kwargs.get("read_only")
        data_only = kwargs.get("data_only")
        sheet_name = kwargs.get("data_only")
        self.wb = oxl.load_workbook(filename=workbook_name, read_only=read_only, data_only=data_only)
        if sheet_name:
            self.ws = self.wb.worksheets[worksheet]
        else:
            self.ws = self.wb[self.wb.sheetnames[0]]

    def read_columns(self, start_row=0, columns="ABC"):
        return [self.ws["{}{}".format(column, row)].value for row in range(start_row, len(self.ws.rows) + 1) for column
                in columns]

    def __call__(self):
        print(dir(self))


class Write(Read):
    def __init__(workbook_name="", *args, **kwargs):
        data_only = kwargs.get("data_only")
        sheet_name = kwargs.get("data_only")
        Read.__init__(workbook_name=workbook_name, read_only=False, data_only=data_only, sheet_name=sheet_name)

An extended example script with more complex Read and Write classes can be downloaded from the course repository.

Challenge

What are your favorite fonts, table

colors and layouts? Write your own Read and Write classes with formatting methods to have a personal template ready to be used at any time.

An example from water resources engineering and research

The ecological restoration or enhancement of rivers requires, among other data, information on preferred water depths and flow velocities of target fish species. This information is established by biologists and then often provided in the shape of so-called habitat suitability index (HSI) curves in workbook formats. As water resources researchers and engineers, we produce geospatially explicit data of water depth and flow velocity with numerical models. The output of two- or three-dimensional numerical models is way too large for being handled with office applications. So we need an advanced tool such as Python to handle the geospatially explicit data and to read and interpolate HSI curves from workbooks. How does that look like technically? The course exercises will help you to find out …

Exercise

Get more familiar with workbook handling in the Sediment transport (1D) exercise.

JSON

JavaScript Object Notation (JSON) files have a similar structure to XML and enable the structured storage of (human-readable) data. For example, the numerical code BASEMENT v.3.x (read more on the numerical modelling pages) uses a model.json and a simulation.json file to store model setup parameters such as material properties. In water resources engineering and research, we often want to automate running numerical models, which involves the optimization of model parameters stored in json files. This is where Python steps in with the JSON package and pandasJSON modules.

JSON file structure

A JSON file consists of two types of data structures, which are dictionary objects and arrays in the form of lists of values. The dictionary objects in a JSON file correspond to the same format that we already know: Pairs of keys (names) and values embraced by curly brackets (braces) {"name": value}. The value can be a string, numeric, a comma-separated list [] (array) of data, or another dictionary. The following example shows a JSON file called river_struct.json with a RIVER key that has a nested dictionary as value. The value-dictionary contains three keys (NAME, GEOMETRY, and HYDRAULICS).

Tip

Take a couple of minutes to understand the

elements of river_struct.json. What is the purpose of the FLOWBOUNDARIES in GEOMETRY? How could the FLOWBOUNDARIES be related to the BOUNDARY key of HYDRAULICS? What units could the FRICTION values correspond to? Can you find the river on a map?

{
    "RIVER": {
        "NAME": "Vanilla Flow",
        "GEOMETRY": {
            "REGIONS": [
                {
                  "type": "wet",
                  "name": "riverbed"
                },
                {
                  "type": "dry",
                  "name": "floodplain"
                }
            ],
            "FLOWBOUNDARIES": [
                {
                  "name": "Inflow",
                  "nodes": [1, 3, 7, 31]
                },
                {
                  "name": "Outflow",
                  "nodes": [89, 90, 76, 69, 95]
                }
            ]
        },
        "HYDRAULICS": {
            "BOUNDARY": [
                {
                    "discharge_file": "/simulation/directory/Inflow.txt",
                    "name": "Inflow",
                    "slope": 0.005,
                    "type": "hydrograph"
                },
                {
                    "name": "Outflow",
                    "type": "zero_gradient"
                }
            ],
            "FRICTION": {
                "cobble": 20.0,
                "gravel": 26.0,
                "sand": 41
            }
        },
        "LOCATION": [48.744079, 9.103928]
    }
}
{'RIVER': {'NAME': 'Vanilla Flow',
  'GEOMETRY': {'REGIONS': [{'type': 'wet', 'name': 'riverbed'},
    {'type': 'dry', 'name': 'floodplain'}],
   'FLOWBOUNDARIES': [{'name': 'Inflow', 'nodes': [1, 3, 7, 31]},
    {'name': 'Outflow', 'nodes': [89, 90, 76, 69, 95]}]},
  'HYDRAULICS': {'BOUNDARY': [{'discharge_file': '/simulation/directory/Inflow.txt',
     'name': 'Inflow',
     'slope': 0.005,
     'type': 'hydrograph'},
    {'name': 'Outflow', 'type': 'zero_gradient'}],
   'FRICTION': {'cobble': 20.0, 'gravel': 26.0, 'sand': 41}},
  'LOCATION': [48.744079, 9.103928]}}

Read (decode) and write (encode) JSON files with the json library

JSON files can be implemented in many programming languages including HTML and Python. This is also the reason why Jupyter notebooks (as used in this course) can be run in Python and displayed as a web page. Python has a built-in json library that enables JSON decoding and encoding. The `json <https://anaconda.org/jmcmurray/json>`__ library provides a json.dumps(DATA) method to “dump” (i.e., encode) data in JSON format. Vice versa, the json.load() method reads data from JSON files. The following example shows a JSON file called river_struct.json with a RIVER key that has a nested dictionary as value. The value-dictionary contains three keys (NAME, GEOMETRY, and HYDRAULICS).

import json
# create arbitrary nested data (list, dictionary, tuple)
data_for_json = ["list_element1", {"dict_key": ("tuple_element", "text", 1.0, None)}]

# create a json file
json_file = open("data/my-first.json", mode="w+")
# encode the random nested data list in json format and write to file
json_file.write(json.dumps(data_for_json))
# close file
json_file.close()

# re-open the json file to read data
with open("data/my-first.json", mode="r") as re_opened_file:
    raw_data = re_opened_file.readline()

# decode json data in a Python variable
data_from_json = json.loads(raw_data)
print(json.dumps(data_from_json))
["list_element1", {"dict_key": ["tuple_element", "text", 1.0, null]}]

The Python docs provide more options and descriptions on using the json library. However, here we will (once again) make use of the pandas library, which offers some powerful features for handling json data.

Read (decode) and write (encode) JSON files with pandas

pandas (recall data and file handling) enables reading JSON files into its convenient table format with an embedded usage of the json library. The following code block uses the `pandas.read_json(FILE) <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html>`__ method to read the above shown river_struct.json sample file.

river = pd.read_json("data/river_struct.json")
print(river)
                                                        RIVER
GEOMETRY    {'REGIONS': [{'type': 'wet', 'name': 'riverbed...
HYDRAULICS  {'BOUNDARY': [{'discharge_file': '/simulation/...
LOCATION                                [48.744079, 9.103928]
NAME                                             Vanilla Flow

Since a river without data is like ice cream without taste, we will add (random) data of flow characteristics to the data structure. Let us assume that we have used the data from river_struct.json to simulate a stationary discharge in a two-dimensional numerical model. As a result we have two regular grids (arrays) with data on flow velocity and flow depth. Now, we want to append both the flow depth and flow velocity arrays in the form of a result structure (dictionary in the river_struct.json and give the river a new name.

# create random data
h = np.random.weibull(np.arange(0,100)).reshape(10, 10)
u = np.random.weibull(np.arange(0,100)).reshape(10, 10)

# append RESULTS row to pandas dataframe
river_dict = river.to_dict()
river_dict["RIVER"].update({"RESULTS": {"flow_depth": h, "flow_velocity": u}})
updated_river = pd.DataFrame.from_dict(river_dict)

# re-NAME RIVER
updated_river["RIVER"]["NAME"] = "Honey river"
print(updated_river)

# export to JSON
updated_river.to_json("data/river_results.json")
                                                        RIVER
GEOMETRY    {'REGIONS': [{'type': 'wet', 'name': 'riverbed...
HYDRAULICS  {'BOUNDARY': [{'discharge_file': '/simulation/...
LOCATION                                [48.744079, 9.103928]
NAME                                              Honey river
RESULTS     {'flow_depth': [[0.0, 1.8204724788697753, 0.98...
img/py-json-file.png

Exercise

Get more familiar with JSON file

handling in the geospatial ecohydraulics exercise (requires understanding the chapter on geospatial Python).