≅ - SiSU for documents - structuring, publishing in multiple formats & search

ℹ - A short description

SiSU is an object-centric, lightweight markup based, document structuring, parser, publishing and search tool for document collections. It is command line oriented and generates static content that is also made searchable at an object level through an SQL database.

SiSU markup helps define (delineate) text objects which are numbered sequentially by the program for object citation. Breaking the document into objects provides interesting possibilities. These object numbers provide the possibility of citing/locating text precisely across different document formats and different languages (assuming the document has been translated). For search it also makes it possible to identify precisely where within in each document search criteria is met in the form of an index. Additionally the use of objects (and that objects are numbered) frees the possibility to represent the document in the manner considered most suitable to a specific document format (whilst retaining its structural (and citation) integrity).

Δ - SiSU project source

Δ SiSU projects repo (git)
- https://git.sisudoc.org

Δ SiSU (scribe): document publishing (multiple formats + search)
- https://git.sisudoc.org/sisu

Δ SiSU markup samples in document pods for sisu (scribe)
- https://git.sisudoc.org/sisu-markup

⌘ - SiSU Spine markup sample output

To give an idea of how this works here is a small collection of documents marked up for and generated by the software. The curation of topics for a collection of specialized related documents would benefit from a consistently applied bespoke ontology or thesaurus.
The documents presented are documents that have been released under various creative commons licences, in the public domain, or the author's work, with the exception of one that is under GPL and the old abandoned Debian live-manual

⌘ Authors (software curated from provided document header metadata)
- https://sisudoc.org/spine/authors.html

⌘ Topics (software curated from provided document header metadata)
- https://sisudoc.org/spine/topics.html

፨ - SiSU Spine search

፨ Search (granular search of text objects)
- https://sisudoc.org/spine_search

ℹ - SiSU description

Here is a description that has been used for the original sisu (scribe):

With minimal preparation of a plain-text (UTF-8) file, using sisu markup syntax in your text editor of choice, SiSU can generate various document formats, most of which share a common object numbering system for locating content, including plain text, HTML, XHTML, XML, EPUB, OpenDocument text (ODF:ODT), LaTeX, PDF files, and populate an SQL database with objects (roughly paragraph-sized chunks) so searches may be performed and matches returned with that degree of granularity. Think of being able to finely match text in documents, using common object numbers, across different output formats (same object identifier for pdf, epub or html) and across languages if you have translations of the same document (same object identifier across languages). For search, your criteria is met by these documents at these locations within each document (equally relevant across different output formats and languages). To be clear (if obvious) page numbers provide none of this functionality. Object numbering is particularly suitable for "published" works (finalized texts as opposed to works that are frequently changed or updated) for which it provides a fixed means of reference of content. Document outputs can also share provided semantic meta-data.

...

SiSU is less about document layout than it is about finding a way using little markup to construct an abstract representation of a document that makes it possible to produce multiple representations of it which may be rather different from each other and used for different purposes, whether layout and publishing, scrollworthy online viewing/ reading, or content search. To be able to take advantage from its minimal preparation starting point of some of the strengths of rather different established ways of representing documents for different purposes, whether for search (relational database, or indexed flat files generated for that purpose whether of complete documents, or say of files made up of objects), online or other electronic viewing (e.g. html, xml, epub), or paper publication (e.g. pdf via latex)...

The solution arrived at is to extract structural information about the document (document sections and headings within the document, available through pattern matching or markup) and tracking objects (which primarily are defined units of text such as paragraphs, headings, tables, verse, etc. but also images) which can be reconstituted as the same documents with relevant object identification numbers so text (objects) can be referenced across different output formats and presentations.

SiSU generates tables of content, and through its markup the means for metadata to be provided for the generation of book style indexes for a document (that again due to document object numbers are the same and equally relevant across all document formats). Per document classifying/organizing metadata can also be provided for automated document curation.

... there have also been working experiments with sisu markup source, two way conversion/representation of sisu document markup source in mind-mapping (software kdissert was used for its strong focus on producing documents (now apparently called semantik)); also po4a software for translators has been used successfuly in its regular text mode for sisu markup in translation, (which is more an attribute of po4a than of sisu, but) which is of interest due to sisu/spine's object citation numbering being available across translations. Open Document Format text (odf:odt), has been an output, but much more interesting (and requested by potential users of sisu/spine) would be the ability of a word processor to save text/a document in sisu markup, making alternative document processing and presentations with sisu possible.

also worth mention, in the relatively long history of this project, there has been work done on extracting hash representations of each object, that could hypothetically be shared to prove the content of a document without sharing its content, or of identifying which objects change; these hashes can also be used as unique identifiers in a database or as identifying filenames if individual objects are saved.

SiSU has evolved, the current implementation focuses on one primary use-case, books and literary writings. However the concept on which it is based has wider application. Here is a prevously posted souvenir from my encounter with an IBM software evaluator in London June 2004 that came about through a chance encounter with an IBM manager at a Linux Expo, who was curious about my interest in Gnu/Linux with my legal background... on hearing that I also wrote software, he suggested, maybe IBM should have a look at it. I was interested, the meeting was set up... with an IBM, Software Innovations evaluator
His response after the meeting:

"Ralph
Good to meet with you today, I was very impressed with your software.
[colleague's name (also posted to an IBM colleague)] - in summary - Ralph has built an application that runs on linux and takes ASCII documents and pulls them apart in to the smallest constituent parts, storing them as XML, PDF and HTML, the HTML are hyperlinked up so the document can be browsed in its full form. the format and text data created is stored in a database.
This has potential in any place that needs the power of full text search whilst holding the structural concepts of the document i.e. legal, pharma, education, research.. which ones we need to figure out, ..."

Special interest was expressed in the search implications of SiSU. To paraphrase, the company has document management systems dealing with hundreds of thousands of texts, these tell you which documents match your search criteria, but cannot inform you where within a text these matches were found without opening the documents. This is achieved through defining document objects and making them the building block of the document, trackable document objects (that can be placed back in the context of the document or corpus of documents if part of a collection). SiSU's early design was to - abstract documents to their structure, and identified objects, numbered in a citable way (as pointed out document object hashes can be of use for the purpose).

ℹ - SiSU Spine

SiSU Spine is the new generator for documents prepared in sisu markup, written in D as opposed to the original sisu which was first shared in Ruby.

Spine code has not as yet been made publicly available.

As compared with the original sisu generator sisu spine:

- Spine uses the same document markup for the document body, but uses yaml for document headers (which contains document metadata and configuration details), the original sisu has a bespoke markup for headers.

- Spine (written in D) is considerably faster at generating native output than sisu (written in Ruby), on last test at least 60 times faster (what took 1 minute takes 1 second; 1 hour a minute :-) (admittedly some time ago, ruby has been getting faster, hopefully this is not over over promising).

- Spine produces fewer document outputs types than sisu (html, epub, (odt, latex) and populates sql db for search)

- As regards non-native output, so far Spine has greater separation of what it does and largely leaves calling the external program to the user, e.g.: latex output is a native output in the sense that it is generated directly by spine, but the pdfs that can be produced from these are produced through use of an external program xelatex, which produces fine output but is a very much slower process.

- (where both produce the same output type, generally) Spine generally produces more up to date output format representations.

ralph.amissah www since 1993 ;-)