Reading
The functions read_excel() and read_csv() do what they say on the
tin. They can read StarTable data from files as well as from text streams.
They generate tuples of (block_type, block), where
* block is a StarTable block
* block_type is a BlockType enum that indicates which type of StarTable block block is. Possible values: DIRECTIVE, TABLE, TEMPLATE_ROW, METADATA, and BLANK.
Reading example
Let’s make some CSV StarTable data, put it in a text stream, and read that stream. (Reading from a CSV file is no different.)
from io import StringIO
from textwrap import dedent
from pdtable import read_csv
csv_data = StringIO(
dedent(
"""\
author: ;XYODA ;
purpose:;Save the galaxy
***gunk
grok
jiggyjag
**places;
all
place;distance;ETA;is_hot
text;km;datetime;onoff
home;0.0;2020-08-04 08:00:00;1
work;1.0;2020-08-04 09:00:00;0
beach;2.0;2020-08-04 17:00:00;1
**farm_animals;;;
your_farm my_farm other_farm;;;
species;n_legs;avg_weight;
text;-;kg;
chicken;2;2;
pig;4;89;
cow;4;200;
unicorn;4;NaN;
"""
)
)
# Read the stream. Syntax is the same when reading a CSV file.
# Reader function returns a generator; unroll it in a list for convenience.
block_list = list(read_csv(csv_data))
assert len(block_list) == 4
The reader generates tuples of (BlockType, block). Note that
blank/comment lines are read but not parsed.
>>> for bt, b in block_list:
...     print(bt, type(b))
...
BlockType.METADATA <class 'pdtable.auxiliary.MetadataBlock'>
BlockType.DIRECTIVE <class 'pdtable.auxiliary.Directive'>
BlockType.TABLE <class 'pdtable.proxy.Table'>
BlockType.TABLE <class 'pdtable.proxy.Table'>
Here’s one of the tables.
>>> t = block_list[2][1]
>>> assert t.name == "places"
>>> print(t)
**places
all
place [text] distance [km] ETA [datetime] is_hot [onoff]
0 home 0.0 2020-08-04 08:00:00 True
1 work 1.0 2020-08-04 09:00:00 False
2 beach 2.0 2020-08-04 17:00:00 True
We can pick the tables (and leave out blocks of other types) like this:
from pdtable import BlockType
tables = [b for bt, b in block_list if bt == BlockType.TABLE]
assert len(tables) == 2
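To sort a whole stream of blocks by type in one pass, rather than filtering for a single type, something like the following minimal sketch could work. The BlockType enum here is a stand-in defined locally so that the snippet runs without pdtable installed, group_by_type is a hypothetical helper name, and plain strings play the role of real block objects.

```python
from collections import defaultdict
from enum import Enum, auto


class BlockType(Enum):
    """Local stand-in for pdtable's BlockType (assumption: same member names)."""

    DIRECTIVE = auto()
    TABLE = auto()
    TEMPLATE_ROW = auto()
    METADATA = auto()
    BLANK = auto()


def group_by_type(blocks):
    """Collect blocks from (block_type, block) pairs into a dict keyed by block type."""
    grouped = defaultdict(list)
    for block_type, block in blocks:
        grouped[block_type].append(block)
    return dict(grouped)


# Strings stand in for real block objects.
demo_blocks = [
    (BlockType.METADATA, "author/purpose"),
    (BlockType.DIRECTIVE, "gunk"),
    (BlockType.TABLE, "places"),
    (BlockType.TABLE, "farm_animals"),
]
grouped = group_by_type(demo_blocks)
print(grouped[BlockType.TABLE])
```

With the real library, the list returned by list(read_csv(...)) could be passed in place of demo_blocks.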
Filtered reading
Blocks can be filtered prior to parsing by passing a callable as the
filter argument to the reader functions. This can reduce reading
time substantially when reading only a few tables from an otherwise
large file or stream.
The callable must accept two arguments: block type and block name. Only
those blocks for which the filter callable returns True are fully
parsed. Other blocks are parsed only superficially: only the block’s
top-left cell is parsed, which is just enough to recognize the block’s type and name and
pass them to the filter callable. This avoids the much more expensive task
of parsing the entire block, e.g. the values in all columns and rows of
a large table.
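To make the idea of superficial parsing concrete, here is a simplified illustration of how a block's type and name can be read off its top-left cell alone. This is not pdtable's actual parser: peek_block_header and wants are hypothetical names, block types are plain strings, and only the two markers that appear in the example data above (** and ***) are recognized.

```python
from typing import Tuple


def peek_block_header(first_line: str, sep: str = ";") -> Tuple[str, str]:
    """Classify a block from its top-left cell alone (simplified: two markers only)."""
    top_left = first_line.split(sep, 1)[0].strip()
    if top_left.startswith("***"):  # check the longer marker first
        return "DIRECTIVE", top_left[3:]
    if top_left.startswith("**"):
        return "TABLE", top_left[2:]
    return "OTHER", top_left


def wants(kind: str, name: str) -> bool:
    """Example filter: accept only the table named 'places'."""
    return kind == "TABLE" and name == "places"


# Only the first line of each block is inspected; the rest is never touched.
first_lines = ["***gunk", "**places;", "**farm_animals;;;"]
to_parse = [line for line in first_lines if wants(*peek_block_header(line))]
print(to_parse)
```

Only the blocks that end up in to_parse would then need the expensive full parse.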
Let’s design a filter that only accepts tables whose name contains the
word 'animal'.
def is_table_about_animals(block_type: BlockType, block_name: str) -> bool:
    return block_type == BlockType.TABLE and "animal" in block_name
Now let’s see what happens when we use this filter when re-reading the same CSV text stream as before.
>>> csv_data.seek(0)
>>> block_list = list(read_csv(csv_data, filter=is_table_about_animals))
>>> assert len(block_list) == 1
>>> block_list[0][1]
**farm_animals
my_farm, your_farm, other_farm
species [text] n_legs [-] avg_weight [kg]
0 chicken 2.0 2.0
1 pig 4.0 89.0
2 cow 4.0 200.0
3 unicorn 4.0 NaN
Non-table blocks were ignored, and so were table blocks that weren’t animal-related.
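If the set of wanted tables is only known at run time, the filter can be built by a small factory instead of being written by hand. The sketch below assumes nothing beyond the two-argument filter signature described above; make_table_name_filter is a hypothetical helper, and BlockType is a local stand-in so that the snippet runs without pdtable installed.

```python
from enum import Enum, auto
from typing import Callable, Iterable


class BlockType(Enum):
    """Local stand-in for pdtable's BlockType (only the members used here)."""

    DIRECTIVE = auto()
    TABLE = auto()


def make_table_name_filter(wanted: Iterable[str]) -> Callable[[BlockType, str], bool]:
    """Return a filter callable that accepts only tables with one of the given names."""
    wanted_names = frozenset(wanted)

    def accept(block_type: BlockType, block_name: str) -> bool:
        return block_type == BlockType.TABLE and block_name in wanted_names

    return accept


only_places = make_table_name_filter(["places"])
print(only_places(BlockType.TABLE, "places"))
print(only_places(BlockType.TABLE, "farm_animals"))
print(only_places(BlockType.DIRECTIVE, "places"))
```

With the real BlockType, the resulting callable could be passed as filter=only_places to the reader functions.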
Handling directives
StarTable Directive blocks are intended to be interpreted and handled at
read time, and then discarded. The client code (i.e. you) is responsible
for doing this. Handling directives is done outside of the pdtable
framework. This would typically be done by a function that consumes a
BlockIterator (the output of the reader functions), processes the
directive blocks encountered therein, and in turn yields a processed
BlockIterator, as in:
def handle_some_directive(bi: BlockIterator, *args, **kwargs) -> BlockIterator:
    ...
Usage would then be:
block_it = handle_some_directive(read_csv('foo.csv'), args...)
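Because each handler takes a block iterator and yields a block iterator, several handlers can be stacked. The following is a minimal sketch of that pattern under stated assumptions: FakeDirective is a hypothetical stand-in for a directive block (a name plus its lines), block types are plain strings, and make_collector and chain_handlers are illustrative names that are not part of pdtable.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass
class FakeDirective:
    """Hypothetical stand-in for a directive block: a name plus its lines."""

    name: str
    lines: List[str]


Pair = Tuple[str, object]  # (block type, block); the type is a plain string here
Handler = Callable[[Iterable[Pair]], Iterator[Pair]]


def make_collector(directive_name: str, sink: List[str]) -> Handler:
    """Build a handler that consumes one kind of directive and passes everything else on."""

    def handler(blocks: Iterable[Pair]) -> Iterator[Pair]:
        for block_type, block in blocks:
            if block_type == "DIRECTIVE" and block.name == directive_name:
                sink.extend(block.lines)  # handle it; do not re-emit it
            else:
                yield block_type, block

    return handler


def chain_handlers(blocks: Iterable[Pair], handlers: Iterable[Handler]) -> Iterable[Pair]:
    """Apply handlers left to right; each wraps the iterator produced by the previous one."""
    return reduce(lambda it, h: h(it), handlers, blocks)


notes: List[str] = []
todos: List[str] = []
stream = [
    ("DIRECTIVE", FakeDirective("note", ["hello"])),
    ("TABLE", "places"),
    ("DIRECTIVE", FakeDirective("todo", ["feed unicorn"])),
    ("DIRECTIVE", FakeDirective("gunk", ["grok"])),
]
handlers = [make_collector("note", notes), make_collector("todo", todos)]
remaining = list(chain_handlers(stream, handlers))
print(len(remaining), notes, todos)
```

Directives that no handler claims (gunk in this case) pass through untouched, so the application can decide later whether to ignore them or complain.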
Let’s imagine a directive named include, which the client
application is meant to interpret as: include the contents of the listed
StarTable CSV files along with the contents of the file you’re reading
right now. Such a directive could look like:
***include
bar.csv
baz.csv
Note that there’s nothing magical about the name “include”; it isn’t
StarTable syntax. This name, and how such a directive should be
interpreted, is entirely defined by the application. We could just as
easily imagine a rename_tables directive requiring certain table
names to be amended upon reading.
But let’s stick to the “include” example for now. The application wants
to interpret the include directive above as: read all the blocks from the two additional CSV
files listed and throw them in the pile. Perhaps even recursively,
i.e. similarly handle any include directives encountered in
bar.csv and baz.csv and so on. A handler for this could be
designed to be used as:
block_it = handle_includes(read_csv('foo.csv'), input_dir='./some_dir/', recursive=True)
and look like this:
import functools
from pathlib import Path

from pdtable import BlockIterator, BlockType, Directive, read_csv


def handle_includes(bi: BlockIterator, input_dir, recursive: bool = False) -> BlockIterator:
    """Handles 'include' directives, optionally recursively.

    'include' directives must contain a list of files located in directory 'input_dir'.

    Optionally handles 'include' directives recursively. No check is done for circular references.
    For example, if file1.csv includes file2.csv, and file2.csv includes file1.csv, then infinite
    recursion ensues upon reading either file1.csv or file2.csv with 'recursive' set to True.

    Args:
        bi:
            A block iterator returned by read_csv
        input_dir:
            Path of directory in which include files are located.
        recursive:
            Handle 'include' directives recursively, i.e. 'include' directives in files themselves
            read as a consequence of an 'include' directive, will be handled. Default is False.

    Yields:
        A block iterator yielding blocks from...
        * if recursive, the entire tree of files in 'include' directives.
        * if not recursive, the top-level file and those files listed in its 'include' directive (if
          any).
    """
    # When recursing, included files are passed through this same handler;
    # otherwise their blocks are passed on unchanged.
    deep_handler = (
        functools.partial(handle_includes, input_dir=input_dir, recursive=recursive)
        if recursive
        else lambda x: x
    )
    for block_type, block in bi:
        if block_type == BlockType.DIRECTIVE:
            directive: Directive = block
            if directive.name == "include":
                # Don't emit this directive block; handle it.
                for filename in directive.lines:
                    yield from deep_handler(read_csv(Path(input_dir) / filename))
            else:
                yield block_type, block
        else:
            yield block_type, block
This is only an example implementation. You’re welcome to poach it, but
note that no check is done for circular references! E.g. if bar.csv
contains an include directive pointing to foo.csv, then calling
handle_includes with recursive=True will result in infinite
recursion. You may want to perform such a check if relevant for your
application.
Alternatively, look into the section on Loading input sets below, which includes a more robust implementation
of the include-directive.
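If you would rather add the circular-reference check to a hand-rolled handler, one possible approach is to remember which files have already been read and skip them on a second encounter. The sketch below is an assumption-laden illustration, not pdtable code: the reader is passed in as a parameter (read_blocks) so that the example runs against in-memory data, FakeDirective is a hypothetical stand-in for a directive block, and block types are plain strings.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, Iterator, List, Optional, Set, Tuple


@dataclass
class FakeDirective:
    """Hypothetical stand-in for a directive block: a name plus its lines."""

    name: str
    lines: List[str]


Pair = Tuple[str, object]  # (block type, block); the type is a plain string here
Reader = Callable[[Path], Iterable[Pair]]


def handle_includes_once(
    blocks: Iterable[Pair],
    input_dir: Path,
    read_blocks: Reader,
    seen: Optional[Set[Path]] = None,
) -> Iterator[Pair]:
    """Recursively expand 'include' directives, reading each file at most once."""
    seen = set() if seen is None else seen
    for block_type, block in blocks:
        if block_type == "DIRECTIVE" and block.name == "include":
            for filename in block.lines:
                path = Path(input_dir) / filename
                if path in seen:
                    continue  # already read: skipping it breaks the cycle
                seen.add(path)
                yield from handle_includes_once(read_blocks(path), input_dir, read_blocks, seen)
        else:
            yield block_type, block


# In-memory "files" that include each other, standing in for foo.csv and bar.csv.
files = {
    Path("d/foo.csv"): [("TABLE", "foo_table"), ("DIRECTIVE", FakeDirective("include", ["bar.csv"]))],
    Path("d/bar.csv"): [("TABLE", "bar_table"), ("DIRECTIVE", FakeDirective("include", ["foo.csv"]))],
}


def fake_reader(path: Path) -> Iterable[Pair]:
    return files[path]


top = Path("d/foo.csv")
result = list(handle_includes_once(fake_reader(top), Path("d"), fake_reader, seen={top}))
print(result)
```

Note that the caller seeds seen with the top-level file. Whether a repeated include should be skipped silently, as here, or reported as an error is an application decision.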
Loading input sets
The term “reading” has been used above to describe the low-level process of reading a single input location.
We will use the term “loading” for the process of reading a full input set, possibly from multiple storage backends,
while recording information about the input locations for each table. Support for “loading” in this sense is
implemented in the io.load and table_origin modules. These modules are built as layers on top of the
fundamental readers, and a given project can choose to not use either, to use only table_origin or to use both.
The design is intended to be flexible enough to support most scenarios, while still being convenient to use
for the most basic use-case of reading a folder of files.
As an example of using the modules, this is how you would use the convenience function load_files to load
a filesystem input set based on a root folder and a file name pattern:
bundle = pdtable.TableBundle(pdtable.load_files(
root_folder=inputs,
file_name_start_pattern="(input|setup)_"))
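To get a feel for which files such a pattern selects, the snippet below applies the same regular expression to a few made-up file names. It assumes, as the parameter name suggests, that the pattern is matched against the start of each file name; how load_files applies it internally is not shown here, so treat this as an illustration only.

```python
import re

# Same pattern as in the load_files call above.
start_pattern = re.compile(r"(input|setup)_")

# Hypothetical file names in the root folder.
names = ["input_foo.csv", "setup_bar.xlsx", "bar.csv", "my_input_x.csv"]

# re.match only matches at the beginning of the string.
selected = [name for name in names if start_pattern.match(name)]
print(selected)
```

Under this assumption a file such as bar.csv is not picked up directly and would only enter the input set through an include directive in another file.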
The included load information can be accessed via the .metadata field:
input_location = bundle['bar_table'].metadata.origin.input_location
The resulting object can be inspected via its string representation:
>>> print(input_location.interactive_identifier)
>>> print('\n'.join(str(input_location.load_specification).split(';')))
Row 0 of 'bar.csv'
included as "bar.csv" from Row 0 of 'input_foo.csv'"
included as "input_foo.csv" from <root_folder: C:\Code\github_corp\pdtable\pdtable\test\io\input\with_include>"
included as "/" from <root>"
Alternatively, the function make_location_trees can be used to generate a tree representation of the load
process for an entire table bundle:
>>> for tree in pdtable.io.load.make_location_trees(bundle):
...     print(tree)
<root_folder: C:\Code\github_corp\pdtable\pdtable\test\io\input\with_include>
Row 0 of 'input_foo.csv'
bar_abs.csv
**bar_abs_table
bar.csv
**bar_table
For detailed information about these modules, please refer to the source code and the docs/diagrams folder in the
repository.