-*- mode: org -*-
#+TITLE: configuration git
#+DESCRIPTION: documents - structuring, various output representations & search
#+FILETAGS: :spine:hub:
#+AUTHOR: Ralph Amissah
#+EMAIL: [[mailto:ralph.amissah@gmail.com][ralph.amissah@gmail.com]]
#+COPYRIGHT: Copyright (C) 2015 - 2023 Ralph Amissah
#+LANGUAGE: en
#+STARTUP: content hideblocks hidestars noindent entitiespretty
#+PROPERTY: header-args :exports code
#+PROPERTY: header-args+ :noweb yes
#+PROPERTY: header-args+ :results no
#+PROPERTY: header-args+ :cache no
#+PROPERTY: header-args+ :padline no
#+PROPERTY: header-args+ :mkdirp yes
#+OPTIONS: H:3 num:nil toc:t \n:t ::t |:t ^:nil -:t f:t *:t
* homepage index.html
#+HEADER: :tangle "../markup/sisudoc-spine-bespoke-output/html/homepage.index.html"
#+BEGIN_SRC html
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>≅ SiSU project sisudoc.org</title>
<link href="./css/html_seg.css" rel="stylesheet" />
</head>
<body>
<h1>≅ - SiSU for documents - structuring, publishing in multiple
formats & search</h1>
<h2>ℹ - A short description</h2>
<p>
SiSU is an object-centric, lightweight markup based, document structuring,
parser, publishing and search tool for document collections. It is command line
oriented and generates static content that is currently made searchable at an
object level through an SQL database.
Markup helps define (delineate) objects (primarily various types of text block)
which are tracked in sequence, substantive objects being numbered sequentially
by the program for object citation.
</p>
<h2>Δ - SiSU project source</h2>
<p>
<a href="./projects">
Δ SiSU projects repo (git)
</a><br>
- <a href="https://git.sisudoc.org">
https://git.sisudoc.org
</a><br>
</p>
<p>
<a href="./projects/sisu">
Δ SiSU (scribe): document publishing (multiple formats + search)
</a><br>
- <a href="https://git.sisudoc.org/sisu">
https://git.sisudoc.org/sisu
</a><br>
</p>
<p>
<a href="./projects/sisu-markup">
Δ SiSU markup samples in document pods for sisu (scribe)
</a><br>
- <a href="https://git.sisudoc.org/sisu-markup">
https://git.sisudoc.org/sisu-markup
</a><br>
</p>
<h2>⌘ - SiSU Spine markup sample output</h2>
<p>
To give an idea of how this works, here is a small collection of documents marked
up for and generated by the software. The curation of topics for a collection of
specialized related documents would benefit from a consistently applied bespoke
ontology or thesaurus.<br> The documents presented have been released under
various creative commons licences, are in the public domain, or are the
author's work, with the exception of one that is under GPL and the old abandoned
Debian live-manual.
</p>
<p>
<a href="./authors.html">
⌘ Authors
</a>
(software curated from provided document header metadata)<br>
- <a href="./authors.html">
https://sisudoc.org/spine/authors.html
</a>
</p>
<p>
<a href="./topics.html">
⌘ Topics
</a>
(software curated from provided document header metadata)<br>
- <a href="./topics.html">
https://sisudoc.org/spine/topics.html
</a>
</p>
<h2>፨ - SiSU Spine search</h2>
<p>
<a href="./spine_search">
፨ Search
</a>
(granular search of text objects)<br>
- <a href="https://sisudoc.org/spine_search">
https://sisudoc.org/spine_search
</a>
</p>
<div class="p">
<!-- SiSU Spine Search -->
<form action="https://sisudoc.org/spine_search" target="_top" method="POST" accept-charset="UTF-8" id="search">
<input type="text" name="sf" size="24" maxlength="255">
<input type="hidden" name="db" value="spine.search.db">
<input type="hidden" name="sml" value="1000">
<input type="hidden" name="ec" value="on">
<input type="hidden" name="url" value="on">
<button type="submit" form="search"> ㏈ ፨ </button>
</form>
<!-- SiSU Spine Search -->
</div>
<h2>ℹ - SiSU description</h2>
<p>
<b>Summary.</b> An object is a unit of text within a document, the most common
being a paragraph. Objects include individual headings, paragraphs, tables, and
grouped text of various types such as code blocks and, within poems, verse.
Objects have properties and attributes; of particular significance are headings
and their levels, which provide document structure. A heading is an object with a
hierarchical value that conceptually contains other objects (such as paragraphs
and possibly sub-headings etc.). Objects are tracked sequentially as they relate
to each other within a document, and substantive objects are numbered
sequentially for citation purposes. Notably footnotes are not objects in
themselves, rather belonging to the object from which they are referenced, and
following their own numbering sequence. From heading objects (linked) tables of
content may be generated, and if additional metadata is provided book type
indexes can be generated that link back to the objects to which they relate.
</p>
<p>
<b>Unpacking this a bit further.</b> SiSU as a concept independent of its markup
language and the parsers that have been implemented, is based on the following
ideas:
</p>
<p>
<b>Object-Centricity. On objects:</b> In SiSU objects are the fundamental unit
from which larger constructs within a document and the document itself are built.
Breaking the document into objects provides interesting possibilities.
</p>
<p>
<b>Objects are fundamental building blocks:</b> Conceptually within SiSU,
objects are the building blocks or individual units of construction of a
document. Objects are usually blocks of text, the most common of which is the
paragraph; other examples include individual headings, tables, and grouped text
of various types, which include code blocks and verse within poems. An object
could also, for example, be an image. Objects can be
formatted and placed as needed, providing flexibility and enabling multiple
types of representation across disparate formats and text receptacles, examples
including html, epub, latex (in the past mind-maps) and sql (populated at an
object level, and thereby providing search with that degree of granularity).
</p>
<p>
<b>Sequential. Objects have sequence:</b> That objects have sequence goes
largely without saying; this follows authorship and is part of the definition of
a document and of how a document is written to convey meaning.
</p>
<p>
<b>Object Numbers & Citation. Substantive objects are numbered for citation
purposes:</b> Most objects within a document are meant by the author to be a
substantive part of the document. All such objects are numbered sequentially and
can be referenced thereby for citation purposes.
Object numbers provide the possibility of citing/locating text precisely across
different document formats and different languages (assuming the document has
been translated). For search they also make it possible to identify precisely
where search criteria are met within each document, in the form of an index, or
to view those precise text objects before deciding which documents are of
interest. Additionally the use of objects (and that objects are numbered) frees
the possibility to represent the document in the manner considered most suitable
to a specific document format whilst retaining its structural (and citation)
integrity.
</p>
<p>
<b>Characteristics. Objects have properties and attributes:</b> Objects have
properties (and may have attributes). By properties I here refer to the
fundamental type of object, be it a heading, a paragraph, table, verse etc.
Attributes extend further and may include other things that one might wish to
associate with the object (examples, not necessarily currently available or
implemented in SiSU, might include formatting, such as whether it is indented, or
metadata, e.g. the associated language, or programming language for a code block).
</p>
<p>
<b>Document structure. Heading objects hold document structure:</b> Heading
objects hold document structure through their heading level property. The types
of document of interest to SiSU have structure that is captured by the heading
level property. Headings are individual objects like any other with the
additional properties that (i) they may be regarded as containing the other
objects following them sequentially (until the next heading of a similar or
higher level), heading objects may include other headings (sub-headings), and
(ii) that they have a hierarchy, the root "heading" being the document
title.<br>A complication was introduced to provide greater flexibility across
document output formats. Headings have two sets of levels: the level under which
substantive text occurs, which would be a chapter or segment level, and above
that in the hierarchy, if needed, document section separators: book, section,
part.
</p>
<p>
<b>Non-objects.</b> Most but not all parts of a document are treated as objects.
Notably footnotes are not objects in themselves, rather belonging to the object
from which they are referenced, and following their own numbering sequence. From
heading objects (linked) tables of content may be generated, and if additional
metadata is provided book type indexes can be generated that link back to the
objects to which they relate.
</p>
<p>
<b>The Document Header.</b> SiSU documents have headers which contain document
metadata, at a minimum the document title and author. In addition the document
header may contain markup instruction (e.g. how to identify headings within the
document, in which case those headings need not be found and treated
accordingly).
</p>
<p>
SiSU parsers have now been implemented a couple of times, in different
programming paradigms and languages; the chosen markup has been left unchanged,
though the document headers have been modified.
This is the core of sisu, beyond which there is more, but largely in the form of
choices based on ... existing output formats and of implementation detail:
deciding what attributes of objects, or within objects, should be supported, and
extending markup to allow for the generation of book indexes if tagging is
provided.
</p>
<h2>ℹ - SiSU Historical Descriptions</h2>
<p>
Here is a description that has been used for the original sisu (scribe):
</p>
<p>
With minimal preparation of a plain-text (UTF-8) file, using sisu markup syntax
in your text editor of choice, SiSU can generate various document formats, most
of which share a common object numbering system for locating content, including
plain text, HTML, XHTML, XML, EPUB, OpenDocument text (ODF:ODT), LaTeX, PDF
files, and populate an SQL database with objects (roughly paragraph-sized
chunks) so searches may be performed and matches returned with that degree of
granularity. Think of being able to finely match text in documents, using common
object numbers, across different output formats (same object identifier for pdf,
epub or html) and across languages if you have translations of the same document
(same object identifier across languages). For search, your criteria are met by
these documents at these locations within each document (equally relevant across
different output formats and languages). To be clear (if obvious) page numbers
provide none of this functionality. Object numbering is particularly suitable
for "published" works (finalized texts as opposed to works that are frequently
changed or updated) for which it provides a fixed means of reference of content.
Document outputs can also share provided semantic meta-data.
</p>
<h3>...</h3>
<p>
SiSU is less about document layout than it is about finding a way using little
markup to construct an abstract representation of a document that makes it
possible to produce multiple representations of it which may be rather different
from each other and used for different purposes, whether layout and publishing,
scrollworthy online viewing/reading, or content search. The aim is to be able, from
this minimal-preparation starting point, to take advantage of some of the strengths
of rather different established ways of representing documents for different
purposes, whether for search (relational database, or indexed flat files
generated for that purpose whether of complete documents, or say of files made
up of objects), online or other electronic viewing (e.g. html, xml, epub), or
paper publication (e.g. pdf via latex)...
</p>
<p>
The solution arrived at is to extract structural information about the document
(document sections and headings within the document, available through pattern
matching or markup) and to track objects (which primarily are defined units of
text such as paragraphs, headings, tables, verse, etc. but also images) which
can be reconstituted as the same documents with relevant object identification
numbers so text (objects) can be referenced across different output formats and
presentations.
</p>
<p>
SiSU generates tables of content, and through its markup provides the means for
metadata to be supplied for the generation of book style indexes for a document (that
again due to document object numbers are the same and equally relevant across
all document formats). Per document classifying/organizing metadata can also be
provided for automated document curation.
</p>
<p>
... there have also been working experiments with sisu markup source, two way
conversion/representation of sisu document markup source in mind-mapping
(software kdissert was used for its strong focus on producing documents (now
apparently called semantik)); also po4a software for translators has been used
successfully in its regular text mode for sisu markup in translation (which is
more an attribute of po4a than of sisu, but) which is of interest due to
sisu/spine's object citation numbering being available across translations. Open
Document Format text (odf:odt), has been an output, but much more interesting
(and requested by potential users of sisu/spine) would be the ability of a word
processor to save text/a document in sisu markup, making alternative document
processing and presentations with sisu possible.
</p>
<p>
Also worth mentioning: in the relatively long history of this project, there has
been work done on extracting hash representations of each object, that could
hypothetically be shared to prove the content of a document without sharing its
content, or of identifying which objects change; these hashes can also be used
as unique identifiers in a database or as identifying filenames if individual
objects are saved.
</p>
<p>
SiSU has evolved; the current implementation focuses on one primary use-case,
books and literary writings. However the concept on which it is based has wider
application. Here is a previously posted souvenir from my encounter with an IBM
software evaluator in London in June 2004, which came about through a chance
encounter with an IBM manager at a Linux Expo, who was curious about my interest
in Gnu/Linux with my legal background... on hearing that I also wrote software,
he suggested, maybe IBM should have a look at it. I was interested, the meeting
was set up... with an IBM Software Innovations evaluator.<br>His response after
the meeting:
</p>
<p>
"Ralph<br>Good to meet with you today, I was very impressed with your
software.<br><i>[colleague's name (also posted to an IBM colleague)]</i> - in
summary - Ralph has built an application that runs on linux and takes ASCII
documents and pulls them apart in to the smallest constituent parts, storing
them as XML, PDF and HTML, the HTML are hyperlinked up so the document can be
browsed in its full form. the format and text data created is stored in a
database.<br>This has potential in any place that needs the power of full text
search whilst holding the structural concepts of the document i.e. legal,
pharma, education, research.. which ones we need to figure out, ..."
</p>
<p>
Special interest was expressed in the search implications of SiSU. To
paraphrase, the company has document management systems dealing with hundreds of
thousands of texts; these tell you which documents match your search criteria,
but cannot inform you where within a text these matches were found without
opening the documents. SiSU achieves this by defining document objects and
making them the building blocks of the document: trackable document objects (that
can be placed back in the context of the document or corpus of documents if part
of a collection). SiSU's early design was to abstract documents to their
structure and identified objects, numbered in a citable way (as pointed out,
document object hashes can also be of use for the purpose).
</p>
<h2>ℹ - SiSU Spine</h2>
<p>
SiSU Spine is the new generator for documents prepared in sisu markup, written
in D as opposed to the original sisu which was first shared in Ruby.
</p>
<p>
Spine code has not as yet been made publicly available.
</p>
<p>
Compared with the original sisu generator, sisu spine:
</p>
<p>
- Spine uses the same document markup for the document body, but uses yaml for
document headers (which contain document metadata and configuration details),
whereas the original sisu has a bespoke markup for headers.
</p>
<p>
- Spine (written in D) is considerably faster at generating native output than
sisu (written in Ruby), on last test at least 60 times faster (what took 1
minute takes 1 second; 1 hour a minute :-) (admittedly some time ago, ruby has
been getting faster; hopefully this is not over-promising).
</p>
<p>
- Spine produces fewer document output types than sisu (html, epub, (odt,
latex) and populates sql db for search).
</p>
<p>
- As regards non-native output, so far Spine has greater separation of what it
does and largely leaves calling external programs to the user, e.g. latex
output is a native output in the sense that it is generated directly by spine,
but the pdfs that can be produced from it are produced through use of an
external program, xelatex, which produces fine output but is a much slower
process.
</p>
<p>
- Where both produce the same output type, Spine generally produces more up to
date output format representations.
</p>
<hr>
<p class="tiny"><i>
ralph.amissah www since 1993 ;-)
</i></p>
<hr>
<h2>Some external links of interest</h2>
<h3>Development</h3>
<h4>Programming</h4>
<p>
[ <a href="https://dlang.org/">
D - (dlang) general purpose, multi-paradigm, fast C like programming language
</a> ]
[ <a href="https://code.dlang.org/">
dub - package registry
</a> ]
[ <a href="https://forum.dlang.org/group/general">
community discussion (mail list frontend)
</a> ]<br>
</p>
<p>
[ <a href="https://www.ruby-lang.org/en/">
Ruby
</a> ]
[ <a href="https://rubygems.org/">
Gems
</a> ]<br>
[ <a href="https://crystal-lang.org/">
Crystal
</a> ]<br>
</p>
<h4>SQL DB</h4>
<p>
[ <a href="https://sqlite.org/index.html">
Sqlite - an sql database engine
</a> ]<br>
[ <a href="https://www.postgresql.org/">
PostgreSQL
</a> ]<br>
</p>
<h4>Markup</h4>
<p>
[ <a href="https://www.w3.org/html/">
HTML
</a> ]
[ <a href="https://html.spec.whatwg.org/multipage/">
multipage current spec
</a> ]
[ <a href="https://dom.spec.whatwg.org/">
dom current spec
</a> ]<br>
[ <a href="https://www.w3.org/publishing/epub32/">
Epub
</a> ]<br>
[ <a href="https://www.w3.org/Style/CSS/">
css - cascading style sheets
</a> ]<br>
</p>
<p>
[ <a href="https://opendocumentformat.org/">
OpenDocument Format
</a> ]<br>
</p>
<p>
[ <a href="https://www.latex-project.org/get/">
LaTeX
</a> ]<br>
</p>
<p>
[ <a href="https://po4a.org/index.php.en">
po4a - maintain translations
</a> ]<br>
</p>
<h4>Operating System Distributions</h4>
<p>
[ <a href="https://nixos.org/">
NixOS - linux based operating system built on the Nix declarative, reproducible and reliable, build system
</a> ]
[ <a href="https://github.com/NixOS/nixpkgs">
nixpkgs (packages @ github)
</a> ]
[ <a href="https://search.nixos.org/packages?channel=unstable&from=0&size=100&sort=relevance&query=">
package search
</a> ]
[ <a href="https://discourse.nixos.org/">
community discussion (discourse)
</a> ]<br>
Gnu [ <a href="https://guix.gnu.org/">
Guix
</a> ]
[ <a href="https://guix.gnu.org/en/packages/">
packages
</a> ]
<br>
</p>
<p>
[ <a href="https://debian.org/">
Debian - the universal operating system distribution
</a> ]<br>
[ <a href="https://www.devuan.org/">
Devuan
</a> ]<br>
</p>
<p>
[ <a href="https://archlinux.org/">
Arch Linux
</a> ]
[ <a href="https://wiki.archlinux.org/">
Arch Wiki
</a> ]<br>
</p>
<hr>
<h2>Extraneous (external) links of personal interest</h2>
<h4>Workspace</h4>
<h5>Shell</h5>
<p>
[ <a href="https://www.zsh.org/">
zsh
</a> ]<br>
[ <a href="https://starship.rs/">
starship - customizable cross-shell prompt
</a> ]<br>
</p>
<h5>Terminal</h5>
<p>
[ <a href="https://gnunn1.github.io/tilix-web/">
tilix
</a> ]
[ <a href="https://alacritty.org/">
alacritty
</a> ]<br>
</p>
<h5>Terminal Multiplexer</h5>
<p>
[ <a href="https://github.com/tmux/tmux">
tmux (github)
</a> ]
[ <a href="https://www.gnu.org/software/screen/">
screen
</a> ]<br>
</p>
<h5>Window Manager</h5>
<p>
[ <a href="https://i3wm.org/">
i3wm
</a> ]
[ <a href="https://swaywm.org/">
sway
</a> ]<br>
</p>
<h5>Text Editors</h5>
<p>
Gnu Emacs
[ <a href="https://github.com/hlissner/doom-emacs">
Doom Emacs (github)
</a> ]
[ <a href="https://orgmode.org/">
Org-Mode - your life in plain text & literate programming
</a> ]
[ <a href="https://github.com/emacs-evil/evil">
Evil-Mode
</a> ]<br>
</p>
<p>
[ <a href="https://www.vim.org/">
Vim
</a> ]
[ <a href="https://neovim.io/">
NeoVim
</a> ]<br>
</p>
<h5>Source Control Manager</h5>
<p>
[ <a href="https://git-scm.com/">
Git
</a> ]<br>
</p>
<h5>Browsers</h5>
<p>
[ <a href="https://vieb.dev/">
vieb
</a> ]
[ <a href="https://fanglingsu.github.io/vimb/">
vimb
</a> ]<br>
[ <a href="https://brave.com/">
brave
</a> ]<br>
</p>
<h3>Search</h3>
<p>
[ <a href="https://duckduckgo.com/">
DuckDuckGo
</a> ]
[ <a href="https://yubnub.org/">
YubNub
</a> ]<br>
</p>
<h3>eMail</h3>
<p>
[ <a href="https://www.migadu.com/">
Migadu
</a> ]<br>
</p>
<p>
[ <a href="https://notmuchmail.org/">
NotmuchMail
</a> ]<br>
</p>
<h3>Forges</h3>
<p>
[ <a href="https://sourcehut.org/">
Sourcehut
</a> ]<br>
</p>
<p>
[ <a href="https://codeberg.org/">
CodeBerg
</a> ]<br>
</p>
<p>
[ <a href="https://github.com">
GitHub
</a> ]
[ <a href="https://gitlab.com">
GitLab
</a> ]<br>
</p>
<h3>Software Archives</h3>
<p>
[ <a href="https://www.softwareheritage.org/">
Software Heritage - the universal software archive
</a> ]<br>
</p>
<hr>
<p class="tiny"><i>
ralph.amissah www since 1993 ;-)
</i></p>
</body>
</html>
#+END_SRC
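* sketch of the object model described in the homepage text
The homepage text tangled above describes a document as a sequence of objects,
with substantive objects numbered for citation, heading objects containing the
objects that follow them, and a hash available for each object. What follows is
only a minimal sketch of that idea in Python, not code from sisu or spine: the
names DocObject, build_objects and contained_by are hypothetical, every input
item is assumed to be a substantive object, a smaller level number is assumed to
mean a higher heading, and SHA-256 is an assumed choice of hash.
#+HEADER: :tangle no
#+BEGIN_SRC python
import hashlib
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class DocObject:
    """One document object (hypothetical structure, for illustration only)."""
    kind: str             # property: "heading", "para", "table", ...
    text: str             # the object's text content
    level: Optional[int]  # heading level (1 = root); None for non-headings
    ocn: int              # sequential object citation number, starting at 1
    digest: str           # SHA-256 hex digest of the text (assumed hash choice)


def build_objects(items: Iterable[Tuple[str, str, Optional[int]]]) -> List[DocObject]:
    """Number (kind, text, level) items in document order and hash each one."""
    objects = []
    for ocn, (kind, text, level) in enumerate(items, start=1):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        objects.append(DocObject(kind, text, level, ocn, digest))
    return objects


def contained_by(objects: List[DocObject], heading_ocn: int) -> List[DocObject]:
    """Return the objects following a heading, up to (not including) the next
    heading of the same or a higher level."""
    if not 1 <= heading_ocn <= len(objects):
        raise IndexError("no object numbered %d" % heading_ocn)
    head = objects[heading_ocn - 1]  # numbers are 1-based and sequential
    if head.kind != "heading" or head.level is None:
        raise ValueError("object %d is not a heading" % heading_ocn)
    contained = []
    for obj in objects[heading_ocn:]:
        if obj.kind == "heading" and obj.level is not None and obj.level <= head.level:
            break
        contained.append(obj)
    return contained


if __name__ == "__main__":
    sample = [
        ("heading", "Book Title", 1),
        ("heading", "Chapter One", 2),
        ("para", "First paragraph.", None),
        ("para", "Second paragraph.", None),
        ("heading", "Chapter Two", 2),
        ("para", "Third paragraph.", None),
    ]
    objs = build_objects(sample)
    for o in objs:
        print(o.ocn, o.kind, o.digest[:8], o.text)
    print([o.ocn for o in contained_by(objs, 2)])
#+END_SRC
In this sketch a heading's contents are derived from sequence and level alone,
so nothing in the numbering depends on any particular output format.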