Tatu Projects Journal
Themes in my MSc thesis

Last update: 2024-06-29

Initially the major themes in my thesis were literate programming and notebook interfaces (essentially cell execution). Transclusion came onto my radar while I was writing it, and while it is only discussed briefly in the thesis, it is a major research avenue for me to continue working on. Cell execution didn't inspire me much in the end, but it is a curious way to build UIs. An honorable mention for future inspiration goes to Fog and Klokmose conceptualizing literate computing.

Literate programming

At its core, literate programming lifts code out of the structure imposed by a programming language and rearranges it, row by row, into an arbitrarily chosen structure. Knuth's vision for this arbitrary structure was a narrative one: a structure best read and understood by humans. In Knuth's implementation, arbitrary structures are born of laying code in nodes and then linking these nodes. Noweb essentially standardized this method in the 90s.
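The node-and-link idea can be sketched in a few lines. This is a minimal illustration, not how WEB, Noweb or Babel are actually implemented: the `<<name>>` reference syntax is only loosely modelled on Noweb, and the `tangle` function and the node names are hypothetical.

```python
import re

# Hypothetical reference syntax, loosely modelled on Noweb's <<name>>.
# A reference must sit alone on its line, optionally indented.
REF = re.compile(r"^(?P<indent>[ \t]*)<<(?P<name>[^<>]+)>>[ \t]*$")


def tangle(nodes, root):
    """Expand the node `root` into plain code by following references.

    Assumes the references form no cycles; a cycle would recurse forever.
    """
    out = []
    for line in nodes[root].splitlines():
        m = REF.match(line)
        if m is None:
            out.append(line)
            continue
        # A reference line is replaced by the referenced node, re-indented.
        for sub in tangle(nodes, m.group("name")).splitlines():
            out.append(m.group("indent") + sub if sub else sub)
    return "\n".join(out)


# Nodes written in narrative order: the interesting part first,
# the scaffolding that ties them together last.
nodes = {
    "greeting": 'print("hello, " + name)',
    "setup": 'name = "world"',
    "main": "<<setup>>\n<<greeting>>",
}

tangled = tangle(nodes, "main")
print(tangled)
```

The document order (greeting, setup, main) is free to follow the narrative, while the tangled output follows the order the language needs.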

Org and Babel work as an extensive literate programming platform. Babel produces (tangles, per Knuth) codebases and Org then produces (weaves) readable documents, all from one or more Org files. The whole Org ecosystem is available to enhance the documentation side. The Emacs programming ecosystem does the same for the coding side, albeit in a more limited fashion.

However, there is a critical issue with any Noweb-style literate programming system, especially problematic in the context of professional software engineering. Any kind of modification to the tangled codebase is impossible to integrate back into the Org file without copy-pasting by hand. This rules out any changes to the codebase by e.g. refactoring tools or non-Org IDEs designed to work on language-standard codebases. There are no modern IDEs that work on Org or Noweb syntax and match e.g. debugging or hinting between the tangled codebase and the nodes in the source document.

This issue follows at least in part from Noweb syntax allowing one node to be copied to multiple parts of a produced codebase. Which of those copies is the authoritative one when working backwards? To my great relief, there is a recent and actively developed tool in the Org ecosystem that skirts around this issue while still allowing an arbitrary order: org-transclusion.
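The backwards ambiguity can be made concrete with a toy case. The node body, the function names and the edit below are all made up for illustration; the point is only that two copies of one node can diverge once the tangled file is edited directly.

```python
# One node, referenced from two places by a hypothetical literate document.
node_body = "log('checkpoint')"

# The tangled file therefore contains two copies of the node.
tangled_lines = [
    "def load():",
    "    " + node_body,
    "def save():",
    "    " + node_body,
]

# Someone edits only the second copy in the tangled file,
# e.g. with an IDE that knows nothing about the document.
tangled_lines[3] = "    log('saved')"

# Working backwards, each copy now proposes a different body for the node.
candidates = sorted({tangled_lines[1].strip(), tangled_lines[3].strip()})
print(candidates)
```

With two distinct candidates and one node in the document, a tool has no principled way to pick the new node body without asking a human.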

Transclusion

Transclusive style reverses the flow: instead of producing the codebase from the document, the document now mirrors an existing codebase. If literate programming is about arbitrary ordering and not about single-sourcing codebases and documentation, transclusion could be considered a modern, relevant form of literate programming in a world where no professional IDE supports Noweb-style literate programming.

Without Noweb, the transclusive style can no longer deduplicate code, but arguably deduplication is done better by programming languages themselves post-Pascal. Transcluded nodes mirror content in a codebase, and the nodes can be edited in Org. This has one core problem: how to delimit and pattern match the content in the codebase. Org-transclusion has only rudimentary plaintext or line number matching, but eventually this will be done with regexes. Transclusive style might then reflect back on the codebase: arbitrary-looking delimiters might have to be added to it when matching on e.g. a function name and the next function delimiter (e.g. '}') following it is not satisfactory.
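The delimiting problem can be sketched with plain regexes. This does not use org-transclusion or reflect its matching syntax; the `// begin` and `// end` marker comments, the function names and the C-like sample are assumptions made up for the example.

```python
import re

# A small C-like function with a nested block.
SOURCE = """\
int add(int a, int b) {
    if (a < 0) {
        return b;
    }
    return a + b;
}
"""


def naive_span(src, name):
    """Match from the function name to the first '}' after it."""
    m = re.search(re.escape(name) + r"\(.*?\}", src, flags=re.DOTALL)
    return m.group(0) if m else None


def marked_span(src, label):
    """Match the text between hypothetical marker comments in the codebase."""
    pattern = (
        r"// begin " + re.escape(label) + r"\n(.*?)// end " + re.escape(label)
    )
    m = re.search(pattern, src, flags=re.DOTALL)
    return m.group(1) if m else None


# The naive match stops at the inner '}' and loses the rest of the function.
naive = naive_span(SOURCE, "add")

# Explicit markers delimit the node reliably, at the cost of extra
# comments that exist in the codebase only for the document's sake.
MARKED = "// begin add\n" + SOURCE + "// end add\n"
marked = marked_span(MARKED, "add")
```

The marker variant is the kind of arbitrary-looking delimiter the paragraph above refers to: it works, but the codebase now carries traces of the document that mirrors it.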

Transclusion is a compromise, if a very pragmatic one. There are no silver bullets.

  • While arbitrary ordering allows for a narrative marriage between documentation and programming, without interlinking nodes the web-like structure is lost. This closes some doors a hypothetical literate programming IDE could use to ease reading and traversing the literate document.
  • The pragmatic choice to allow using outside IDEs to change code means a split between the literate document and the codebase. The literate document is harder to keep up to date as a developer's focus moves to the non-literate codebase and its IDE. In this sense, transclusion compromises on literate programming's core benefit and drifts towards separate documentation.

But I think these compromises are essential as long as there are no advanced literate programming IDEs that seamlessly integrate a more complex system like Noweb. Transclusive style can work as a platform for an incremental technological transition towards better realizing the core benefits.

Notebook interface

The vehicle for exploring literate programming in my thesis was integrating traits from computational notebooks into a novel literate programming style. The essential notebook trait is cellular execution, where you can run selected nodes with arbitrary tools, usually a REPL.

Executing nodes in a literate document seems to have some utility, as it allows abstracting functionality behind a single UI (the notebook interface). In my thesis I used it for visual testing with a headless browser instance, for unit testing, and for setting up editor and runtime environments for the program. However, none of the code could both go into the codebase and be run inside the interface. As I wrote Haskell in this case, the REPL environment essentially prevents it. On the other hand, I'm not even sure what a piece of practical code would look like that could both be cell executed and be put into a codebase.