Reading ======= The functions ``read_excel()`` and ``read_csv()`` do what it says on the tin. They can read StarTable data from files as well as text streams. They generate tuples of (``block_type``, ``block``) where * ``block`` is a StarTable block * ``block_type`` is a ``BlockType`` enum that indicates which type of StarTable block ``block`` is. Possible values: ``DIRECTIVE``, ``TABLE``, ``TEMPLATE_ROW``, ``METADATA``, and ``BLANK``. Reading example --------------- Let’s make some CSV StarTable data, put it in a text stream, and read that stream. (Reading from a CSV file is no different.) .. code:: python from io import StringIO from pdtable import read_csv csv_data = StringIO( dedent( """\ author: ;XYODA ; purpose:;Save the galaxy ***gunk grok jiggyjag **places; all place;distance;ETA;is_hot text;km;datetime;onoff home;0.0;2020-08-04 08:00:00;1 work;1.0;2020-08-04 09:00:00;0 beach;2.0;2020-08-04 17:00:00;1 **farm_animals;;; your_farm my_farm other_farm;;; species;n_legs;avg_weight; text;-;kg; chicken;2;2; pig;4;89; cow;4;200; unicorn;4;NaN; """ ) ) # Read the stream. Syntax is the same if reading CSV file. # Reader function returns a generator; unroll it in a list for convenience. block_list = list(read_csv(csv_data)) assert len(block_list) == 4 The reader generates tuples of ``(BlockType, block)``. Note that blank/comment lines are read but not parsed. >>> for bt, b in block_list: ... print(bt, type(b)) ... BlockType.METADATA BlockType.DIRECTIVE BlockType.TABLE BlockType.TABLE Here’s one of the tables. >>> t = block_list[2][1] >>> assert t.name == "places" >>> print(t) **places all place [text] distance [km] ETA [datetime] is_hot [onoff] 0 home 0.0 2020-08-04 08:00:00 True 1 work 1.0 2020-08-04 09:00:00 False 2 beach 2.0 2020-08-04 17:00:00 True We can pick the tables (and leave out blocks of other types) like this: .. code:: python from pdtable import BlockType tables = [b for bt, b in block_list if bt == BlockType.TABLE] assert len(tables) == 2 Filtered reading ---------------- Blocks can be filtered prior to parsing, by passing a callable as ``filter`` argument to the reader functions. This can reduce reading time substantially when reading only a few tables from an otherwise large file or stream. The callable must accept two arguments: block type, and block name. Only those blocks for which the filter callable returns ``True`` are fully parsed. Other blocks are parsed only superficially, meaning that only the block’s top-left cell is parsed, which is just enough to recognize block's type and name to pass to the filter callable, thus avoiding the much more expensive task of parsing the entire block, e.g. the values in all columns and rows of a large table. Let’s design a filter that only accepts tables whose name contains the word ``'animal'``. .. code:: python def is_table_about_animals(block_type: BlockType, block_name: str) -> bool: return block_type == BlockType.TABLE and "animal" in block_name Now let’s see what happens when we use this filter when re-reading the same CSV text stream as before. >>> csv_data.seek(0) >>> block_list = list(read_csv(csv_data, filter=is_table_about_animals)) >>> assert len(block_list) == 1 >>> block_list[0][1] **farm_animals my_farm, your_farm, other_farm species [text] n_legs [-] avg_weight [kg] 0 chicken 2.0 2.0 1 pig 4.0 89.0 2 cow 4.0 200.0 3 unicorn 4.0 NaN Non-table blocks were ignored, and so were table blocks that weren’t animal-related. Handling directives ------------------- StarTable Directive blocks are intended to be interpreted and handled at read time, and then discarded. The client code (i.e. you) is responsible for doing this. Handling directives is done outside of the ``pdtable`` framework. This would typically done by a function that consumes a ``BlockIterator`` (the output of the reader functions), processes the directive blocks encountered therein, and in turn yields a processed ``BlockIterator``, as in: :: def handle_some_directive(bi: BlockIterator, *args, **kwargs) -> BlockIterator: ... Usage would then be: :: block_it = handle_some_directive(read_csv('foo.csv'), args...) Let’s imagine a directive named ``include``, which the client application is meant to interpret as: include the contents of the listed StarTable CSV files along with the contents of the file you’re reading right now. Such a directive could look like: :: ***include bar.csv baz.csv Note that there’s nothing magical about the name “include”; it isn’t StarTable syntax. This name, and how such a directive should be interpreted, is entirely defined by the application. We could just as easily imagine a ``rename_tables`` directive requiring certain table names to be amended upon reading. But let’s stick to the “include” example for now. The application wants to interpret the ``include`` directive above as: read all the blocks from the two additional CSV files listed and throw them in the pile. Perhaps even recursively, i.e. similarly handle any ``include`` directives encountered in ``bar.csv`` and ``baz.csv`` and so on. A handler for this could be designed to be used as:: block_it = handle_includes(read_csv('foo.csv'), input_dir='./some_dir/', recursive=True) and look like this: .. code:: python import functools from pdtable import BlockIterator, Directive def handle_includes(bi: BlockIterator, input_dir, recursive: bool = False) -> BlockIterator: """Handles 'include' directives, optionally recursively. Handles 'include' directives. 'include' directives must contain a list of files located in directory 'input_dir'. Optionally handles 'include' directives recursively. No check is done for circular references. For example, if file1.csv includes file2.csv, and file2.csv includes file1.csv, then infinite recursion ensues upon reading either file1.csv or file2.csv with 'recursive' set to True. Args: bi: A block iterator returned by read_csv input_dir: Path of directory in which include files are located. recursive: Handle 'include' directives recursively, i.e. 'include' directives in files themselves read as a consequence of an 'include' directive, will be handled. Default is False. Yields: A block iterator yielding blocks from... * if recursive, the entire tree of files in 'include' directives. * if not recursive, the top-level file and those files listed in its 'include' directive (if any). """ deep_handler = ( functools.partial(handle_includes, input_dir=input_dir, recursive=recursive) if recursive else lambda x: x ) for block_type, block in bi: if block_type == BlockType.DIRECTIVE: directive: Directive = block if directive.name == "include": # Don't emit this directive block; handle it. for filename in directive.lines: yield from deep_handler(read_csv(Path(input_dir) / filename)) else: yield block_type, block else: yield block_type, block This is only an example implementation. You’re welcome to poach it, but note that no check is done for circular references! E.g. if ``bar.csv`` contains an ``include`` directive pointing to ``foo.csv`` then calling ``handle_includes`` with ``recursive=True`` will result in a stack overflow. You may want to perform such a check if relevant for your application. Alternatively, look into the section on `Loading input sets`_ below, which includes a more robust implementation of the ``include``-directive. Loading input sets ------------------ The term "reading" has been used above to describe the low-level process of reading a single input location. We will use the term "loading" for the process of reading a full input set, possibly from multiple storage backends, while recording information about the input locations for each table. Support for "loading" in this sense is implemented in the ``io.load`` and ``table_origin`` modules. These modules are built as layers on top of the fundamental readers, and a given project can choose to not use either, to use only ``table_origin`` or to use both. The design is intended to be flexible enough to support most scenarios, while still being convenient to use for the most basic use-case of reading a folder of files. As an example of using the modules, this is how you would use the convenience function ``load_files`` to load an filesystem input-set based on a root folder and a file name pattern:: bundle = pdtable.TableBundle(pdtable.load_files( root_folder=inputs, file_name_start_pattern="(input|setup)_")) The included load information can be accessed via the ``.metadata``-field as:: input_location = bundle['bar_table'].metadata.origin.input_location The resulting object can be inspected via the string-representation:: >>> print(input_location.interactive_identifier) >>> print('\n'.join(str(input_location.load_specification).split(';'))) Row 0 of 'bar.csv' included as "bar.csv" from Row 0 of 'input_foo.csv'" included as "input_foo.csv" from " included as "/" from " Alternatively, the function ``make_location_trees`` can be used to generate a tree-representaion of the load process for an entire table bundle:: >>> for tree in pdtable.io.load.make_location_trees(bundle): ... print(tree) Row 0 of 'input_foo.csv' bar_abs.csv **bar_abs_table bar.csv **bar_table For detailed information about the module, please refer to the source code and the ``docs/diagrams`` folder in the repository.