The Document

Sinleqiunnini does not store transliterated documents as text files. Instead, it splits the components of a cuneiform document across different database tables for more efficient information retrieval [link].

Nevertheless, to simplify the task of compiling and entering data, the system is equipped with a robust "parser" that accepts as input a simple text file formatted according to a relatively easy-to-remember "diplomatic notation". This is a shallow markup system that helps the program differentiate words and sign values and map them onto a database-driven architecture [refer to sample 1 and sample 2].

From text to DB: "tokenization"

sample text RE 67

From the user's perspective, a key concept is the use of white-spaces to separate semantic units, that is, tokens that the parser interprets as individual elements (i.e., words). Every word, as well as every epigraphic notation¹, is assigned a unique identifier (ID).

id	area_id	notation
14353	4201	DIŠ=-bu-ul-la
14354	4201	DUMU
14355	4201	ib-ri-a-da-li
14356	4202	u₃
14357	4202	MUNUS=-aš-tar-um-mi
14358	4202	DAM-šu₂
14359	4203	MUNUS=-ba-aʾ-la-ki-mi
14360	4203	DUMU.=MI₂-šu₂-nu
...	...	...

This segmentation process is of primary importance for the DAPCA's search engine, as it allows dealing with any token in the system as an isolated unit. It permits the application of a wide set of search strategies to individual elements of a document (e.g., regex, fuzzy search, similarity measures, etc.). At the same time, it is always possible to keep these elements in their context and perform complex searches, for instance, for syntagmata or "chains of words."

Likewise, the system keeps track of the "coordinates" of each of these tokens, so that the document can be reassembled in full for display on screen.
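The tokenization and "coordinates" described above can be illustrated with a minimal sketch. It is not the actual DAPCA code: the `Token` fields `line` and `position`, and the functions `tokenize` and `rebuild`, are hypothetical names chosen for this example, and the real database schema (with its `area_id` column) is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int        # unique identifier (numbering scheme assumed)
    line: int      # 0-based index of the line in the input (assumed coordinate)
    position: int  # 0-based index of the token within its line (assumed coordinate)
    notation: str  # the token exactly as typed


def tokenize(lines, first_id=1):
    """Split each line on white-space and give every token an ID and coordinates."""
    tokens = []
    next_id = first_id
    for line_no, line in enumerate(lines):
        for pos, chunk in enumerate(line.split()):
            tokens.append(Token(next_id, line_no, pos, chunk))
            next_id += 1
    return tokens


def rebuild(tokens):
    """Reassemble the lines from the stored coordinates."""
    rows = {}
    for tok in sorted(tokens, key=lambda t: (t.line, t.position)):
        rows.setdefault(tok.line, []).append(tok.notation)
    return [" ".join(rows[key]) for key in sorted(rows)]


lines = ["DIŠ=-bu-ul-la DUMU ib-ri-a-da-li", "u₃ MUNUS=-aš-tar-um-mi DAM-šu₂"]
tokens = tokenize(lines, first_id=14353)
```

Because each token is stored on its own, it can be searched in isolation, while `rebuild(tokens)` restores the original lines from the coordinates alone.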

Final result:

The parser

The following examples show how a "plain text" format (i.e., .txt) should be structured to be accepted by the parser.

Sample text 1

Diplomatic notation: sample 1

@obverse
 A.ŠÀ ma-["la ma"]-s,u2-ú
 i-na t_*BI-*IQ-mi URU g_ra-ab-ba-=KI
 n_1-1/2 IKU GÍD.DA
 n_1-1/2 IKU ru-up-šu
 ÚS.SA.DU AN.TA DUMU.=MEŠ p_za-bi-hi
 ÚS.SA.DU KI.TA DUMU.=MEŠ p_a-mur-ri
 SAG.KI n_1.KÁM p_iš-bi-ᵈda-gan
 DUMU p_na-ap-ši
 SAG.KI n_2.KÁM {DUMU.=MEŠ} p_da-gal-li DUMU p_ir-am-ᵈda-gan
[A.ŠÀ] aš-ri-iš-ma i-na Top_N_sí-ip-hu
n_[X] IKU GÍD.DA
n_[X] ši-id-du ru-up-šu
[Ú]S.SA.DU AN.TA DUMU.=MEŠ p_da-gal-li DUMU p_ir-am-ᵈda-gan
[Ú]S.SA.DU KI.TA DUMU.=MEŠ p_at-tu-wa
[SA]G.KI n_1.KÁM p_a-wi-ru DUMU p_il-la-ti
@bottom
[SAG].KI n_2.KÁM URU.=KI
ša DN_ᵈNIN.URTA
ù LÚ=.MEŠ=-ši-bu-ut
@reverse
GN_[UR]U e-mar-=KI
[b]e-lu-ú A.ŠÀ.=HI.=A
p_ᵐir-am-ᵈda-gan
DUMU p_il-la-ti
a-na n_1 me-at n_5/6 ma-na KÙ.BABBAR
[Š]ÁM TIL.LA
A.ŠÀ i-ša-am KÙ.BABBAR-^pa mah-rù
ŠÀ-šu-nu DU₁₀-^a-^ab ša ur-ra-am še-ra-am
n_2 A.ŠÀ.=HI.=A i-ba-qa-rù
n_1 li-im KÙ.BABBAR a-na DN_ᵈNIN.URTA
n_1 li-im KÙ.BABBAR a-na URU.=KI
Ì.LÁ.E.MEŠ
IGI p_ab-ba-nu DUMU p_ᵈIM-GAL
IGI p_píl-su-ᵈda-gan ŠEŠ-šu
IGI p_ᵈen-ma-lik ŠEŠ-šu-ma
IGI p_ᵈra-ša-ap-la-i DUMU p_ki-ir-ra
IGI p_ab-da DUMU p_hi-e-mi
[IGI] p_ša-dì-da DUMU p_ᵈda-gan-ka
[IGI] p_ša-dì-da DUMU p_i3-lí-a-bi
@left
[IGI p_i-]t[úr]-ᵈD[a-gan DUMU p_i]a-ah-ṣi-E[N]
[IGI p_ir-ib-ᵈIM DUMU p_ha-t]a-ni
[IGI p_x-x-x-x w_DU]B.SAR

Sample text 2

Diplomatic notation: sample 2

@obverse
$blank space(2)
 É-^tu₄ ma-la ma-ṣú-ú
 n_25 i-na am-ma-ti GÍD.DA-šú
 n_23 i-na am-ma-ti ru-up!-šú
 ZAG-šu É.UDUN ša DUMU.=MEŠ p_ᵐga-ni
 GÙB-šú É DUMU.=MEŠ p_ᵐba-at-ta
 pa-nu-šú p_ᵐᵈKUR-a-bu DUMU p_ga-ni
 EGIR-šú p_ᵐat!-tu₄ DUMU p_zu-Ba-la
 É ša p_ᵐa-hi-ᵈKUR ù p_ᵐÌR-DINGIR.=MEŠ DUMU p_ib-ni-be
 KI p_ᵐa-hi-ᵈKUR ù p_ᵐÌR-DINGIR.=MEŠ DUMU p_ib-ni-be
p_ᵐab-du DUMU p_zu-aš-tar-ti DUMU p_qa-ba-ri
a-na n_31 GÍN KÙ.BABBAR É-^ta₅ iš-am
ma-an-nu-me-e ur-ra-am še-ra-am É-^ta₅
⸢i⸣-pa-qa-ru KÙ.BABBAR.=MEŠ TÉŠ.BI
@bottom
a-na p_ᵐab-dì DUMU p_zu-aš-tar-ti
li-din É-^ta₅ lil-qì
@reverse
$ruling
ù a-nu-ma a-šar KÙ.BABBAR.=MEŠ e-ru-bu
n_20 GÍN KÙ.BABBAR.=MEŠ a-na p_ᵐib-ni-ia DUMU p_ma-di-Te
n_10 GÍN KÙ.BABBAR.=MEŠ a-na DAM p_ᵐᵈKUR-a-bi DUMU p_ga-ni
n_1 GÍN KÙ.BABBAR a-na p_ᵐAD-DIRI DUMU p_da-a-i
$ruling
a-nu-ma ṭup-pu la-be-ru ša É an-ni-i
ha-liq šum-ma i-na EGIR u₄-mi ú-še-lu-šu
ṭup-pu an-nu-ú i-hap-pè-e-šú
$ruling
NA₄=.KIŠIB p_ᵐa-hi-ma-lik __ NA₄=.KIŠIB p_ᵐa-hi-ᵈKUR
$seal(1) ________________________ $seal(1)
WN_LÚ=.UGULA __________ DUMU p_ib-ni-ᵈKUR EN É
_____ NA₄=.KIŠIB
$seal(1)
__ p_ᵐBe-li DUMU p_Ba-ia
@top
IGI p_ᵐi-mu-ut-ha-ma-dì DUMU p_ᵈKUR-GAL!(MA.AŠ)
IGI p_ᵐam-za-hi DUMU p_eh-li-ia
IGI p_ᵐEN-ma-lik DUMU p_ṣa!-al-mì
________ IGI p_ᵐÌR-DINGIR.=MEŠ DUMU p_ib-ni-be EN É

1. Physical surfaces of tablet

These are self-explanatory:

@obverse
@bottom
@reverse
@top
@left

Every text file must therefore begin with one of these @ tags; otherwise, the parser raises an exception and explicitly prompts the user. A line that begins with one of these tags needs nothing else: anything after the @obverse tag, for example, will be ignored or, in the worst case, will produce an error.

This marking does not require a line number.
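As an illustration of the rule above, the following sketch checks that a text begins with a surface tag and that every @ line is one of the five listed tags. The function name `check_surfaces` is hypothetical, and the sketch takes the strict option (extra text after a tag is an error), which is only one of the two behaviours mentioned above.

```python
SURFACE_TAGS = {"@obverse", "@bottom", "@reverse", "@top", "@left"}


def check_surfaces(text):
    """Return the surface tags found, in order; raise ValueError on problems."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("@"):
        raise ValueError("the file must begin with a surface tag such as @obverse")
    found = []
    for ln in lines:
        if ln.startswith("@"):
            # Strict assumption: the tag must stand alone on its line.
            if ln not in SURFACE_TAGS:
                raise ValueError(f"unknown or malformed surface tag: {ln!r}")
            found.append(ln)
    return found
```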

2. Free-text markers

Free-text markers can be inserted to record various aspects of the document as well as metatextual elements. A marker is created by placing the $ character before the note.

This marking does not require a line number.

This markup includes some helpers for better document layout:

$ruling is replaced by a horizontal line. Multiple horizontal lines can be represented by placing multiple $rulings. For example:
10 <text in transliteration>
$ruling
$ruling
11 <text in transliteration>
If parentheses follow the markup, the number included in the parentheses tells the system the number of lines the "note" should occupy. For example:
- $break(3) tells the system that we have a break that corresponds roughly to three lines of text.
- $blank(2) tells the system that we have a portion of the tablet left blank that occupies the equivalent of two lines of text.
- $seal(4) indicates that the space occupied by the seal corresponds to four lines.

Additionally, any information can be added with the $ prefix, for instance $the beginning of the column is broken or $an unknown number of columns destroyed, possibly in combination with the (n) suffix described above. This information will be searchable, but please note that it affects the text layout.

It is recommended to use this tool sparingly.
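The structure of a $ marker (free text, optionally followed by a line count in parentheses) can be sketched with a regular expression. This is an assumption about how such a line might be read, not the parser's actual rule: `parse_marker` is a hypothetical helper, and it only handles a marker that occupies a whole line.

```python
import re

# "$", then free text, then an optional "(n)" giving the number of lines occupied.
MARKER_RE = re.compile(r"^\$(?P<text>.*?)(?:\((?P<lines>\d+)\))?\s*$")


def parse_marker(line):
    """Return (text, line_count); line_count is None when no (n) is given."""
    match = MARKER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a $ marker: {line!r}")
    count = match.group("lines")
    return match.group("text").strip(), int(count) if count else None
```

For example, `parse_marker("$break(3)")` yields the text `break` with a count of 3, while `parse_marker("$ruling")` yields `ruling` with no count.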

Annotations

For a proper annotation system, see the discussion in: ...

3. Line numbers

Each transliterated line must begin with a line number, as is customary in Assyriological tradition. Currently, there are two possibilities:

a simple numeral: 1
a numeral followed by a single quote character after breaks: 1'

Customization

A set of parser-specific rules prevents the system from accepting anything other than numbers and/or numbers + single-quote as a line label. Additional rules can, however, be added to allow it to accept different line formats.

The numbering sequence is essentially free: one can decide to start with 1' after a break or to continue with the previous sequence (e.g., ... 10 / $break / 11'...).

Note

Regardless of how one chooses to name the line numbers, the system internally stores their order, which is determined by the order of the lines in the text file. The line numbers should be considered simple labels.

3.1. Line number separators

Every line number must be followed by a white-space (i.e. \s) or, alternatively, by a tab separator (i.e. \t). This allows the parser to understand where the line number section ends and the transliteration begins.
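The two accepted label formats and the separator rule can be expressed as a single pattern. The sketch below is illustrative only; `LINE_RE` and `split_line` are hypothetical names, and the parser's own rules may differ in detail.

```python
import re

# A label is digits with an optional single quote, then spaces or tabs, then the text.
LINE_RE = re.compile(r"^(?P<label>\d+'?)[ \t]+(?P<text>\S.*)$")


def split_line(line):
    """Separate the line label from the transliteration that follows it."""
    match = LINE_RE.match(line.rstrip())
    if match is None:
        raise ValueError(f"line does not start with a valid line number: {line!r}")
    return match.group("label"), match.group("text")
```

The label is returned as a string because, as noted above, line numbers are simple labels; ordering comes from the position of the line in the file.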

4. Transliteration

4.1. Graphic relationships

Character	Function	Example
[carriage return]	line boundary
[space]	word boundary	ša ur-ra-am še-ra-am
-	sign boundary	i-ša-am
.	intra-logographic boundary	ÚS.SA.DU AN.TA
+	used for ligatures	i+na
x or ×	for inclusions	AB×ḪA₂
_	for blank-spaces	[____ i-]na

4.1.1. Breaks and lacunae

Tip

White-spaces in digital transliterations are often neglected, whereas they are of primary importance, for instance, for material philology.

Please refer to the following cases:

White-spaces for scribal layout

8 u₃ ša EDIN ḪA.LA-ia ma-la it-ti ŠEŠ.=MEŠ-ia
9 i-kaš-ša-da-an-ni ______________ lil-qe₃
$ruling
10 a-nu-ma a-šar KU₃.BABBAR.=MEŠ u₃ ŠE.=MEŠ ḫu-bul₂-‹‹la››-li-ia i-ru-ub
11 n_10 GIN₂ KU₃.BABBAR.=MEŠ a-na le-et p_DIŠ=-zu-ba-la DUMU p_a-ḫi-ma-lik
12 n_10 _ MIN ________________ a-na le-et p_DIŠ=-DINGIR=-KUR-ta-li-iḫ DUMU p_zi-ik-ri-DINGIR=-KUR
13 n_10 _ MIN ________________ a-na le-et p_DIŠ=-še-i-DINGIR=-KUR DUMU w_tar-ta-ni

White-spaces for tablet fractures

@obverse
1 [___________________________________]-⸢x⸣
2 [_________________________________ t]a-a-an-=ḪI.=A
3 [________________________________ ]x x IKU.=ḪI.=A
4 [_______________________________ ]-im-i
5 [_______________________________ ]x-ma p_eḫ-⸢li-DINGIR⸣
6 [_______________________________ i]l-la-ak

4.2. Modifiers

Character(s)	Function	Example
=-	Preposed determinatives are followed by = and the sign boundary -	LU2=-mu-ti-ia
-=	Postposed determinatives are preceded by = which in turn is preceded by the sign boundary designation -	ra-ab-ba-an-=KI
=.	more complex determinatives	LU2=.MEŠ=-ši-bu-ut
.=	more complex determinatives	A.ŠÀ.=HI.=A
^-	Preposed phonetic complements are followed by the symbols ^-	li^-lil-lik
-^	Postposed phonetic complements are preceded by the symbol -^	URU-^li3
*	In front of uninterpreted signs	*BI-*IQ-mi or *bi-*iq-mi
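The boundary characters of 4.1 and the modifiers of 4.2 combine in a regular way: a sign written before = or ^ is preposed, and a sign written after them is postposed. The sketch below applies that reading to a single word. It is a simplification under stated assumptions: `classify_signs` is a hypothetical helper, and it ignores brackets, semantic classifiers, ligatures, and the * marker.

```python
import re


def classify_signs(word):
    """Split a word on '-' and '.' and label each sign by its modifier."""
    result = []
    for part in re.split(r"[-.]", word):
        if not part:
            continue
        if part.endswith("="):
            result.append((part[:-1], "preposed determinative"))
        elif part.startswith("="):
            result.append((part[1:], "postposed determinative"))
        elif part.endswith("^"):
            result.append((part[:-1], "preposed phonetic complement"))
        elif part.startswith("^"):
            result.append((part[1:], "postposed phonetic complement"))
        else:
            result.append((part, "sign"))
    return result
```

With this reading, `LU2=.MEŠ=-ši-bu-ut` yields two preposed determinatives (LU2, MEŠ) followed by three plain signs, and `A.ŠÀ.=HI.=A` yields two plain signs followed by two postposed determinatives.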

4.3. Condition of the text

Character(s)	Function	Example	Info
x	unreadable signs	x or x-x-x or x x x
X	a single unreadable number	[X] li-im KU3.BABBAR
[ ]	as usual
⸢ ⸣	as usual, but the half brackets must enclose the entire sign	ma-⸢la ma⸣-ṣu₂-u₂ is clearer than ma-l⸢a m⸣a-ṣu₂-u₂
⸤ ⸥	as usual, but the half brackets must enclose the entire sign		available but not in use in DAPCA
{}	for erasures	{DUMU.=MEŠ} da-gal-li
< >	added by a modern editor
<< >>	mistakenly written by the scribe
[()]	indicates that there may or may not be a sign present in a break	[x-(x)-x]	deprecated! →
()	Alternatives, actual signs, explanatory names, etc.	mu-sa!(u2)-ra
?	after the sign for uncertain reading
!	after the sign for abnormal graphic writing. When possible, the actual sign must be reported[^7].	mu-sa!-ra or mu-sa!(u2)-ra
*	before uninterpreted (uppercase) signs	i-na *BI-*IQ-mi
°	new readings: follow each sign	a-na pa°-ni°	deprecated! →

Notes

In broken contexts, it is preferable to indicate the actual, visible space with a series of underscores, for example [________], rather than attempting to predict the number of missing characters. Therefore, although the notation [x (x) x] or [x x x] is accepted by the system, it is arbitrary at best and less preferable than the first one.
Since the system is a multi-user platform, this type of marking (e.g., a-na pa°-ni°), which is perfectly acceptable in printed publications, raises doubts as to who actually entered an alternative reading. It can be used, but sparingly, and should be replaced by the annotation system [*ref].

4.5. Punctuation

Character	Function	Sign	Example
\	Glossenkeil[^9]	GAM
:	gloss marker[^10]		KUR.=MEŠ :nu-ku-ur-ti
/	“new line” marker to be used either alone or within a word		a-ḫi-ma-lik / ŠEŠ-šu or E₂ u₃ ḫa-ab-la i-ša-/am

5. Semantic classifiers

To mark certain semantic domains, the program accepts the following classifiers, which are placed in front of words:

Code	Alternative	Function	Example
p_	PN_	masculine personal name	p_za-bi-hi or p_ir-am-d_ᵈda-gan
f_	PNF_	female personal name	f_al-ḫa-ti
d_	DN_	divine name	DN_ᵈNIN.URTA
g_	GN_	geographical name	g_ra-ab-ba-=KI
t_	Top_N_	topographical feature	Top_N_sí-ip-hu
n_	NUM_	numerals	n_1+1/2
w_	WN_	"work" name	w_DUB.SAR or w_LU2=.DUB.SAR
m_	MN_	month name	ITI m_ᵈḫal-ma

IMPORTANT! - These semantic classifiers must always precede any other element of the word. Thus, for instance, in the case of [X+1], the classifiers must also precede the initial square bracket: n_[X+1].
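Since a classifier always comes first, it can be detached from a token with a simple prefix check. The following sketch is illustrative: the `CLASSIFIERS` mapping copies the table above, while `split_classifier` is a hypothetical helper that only looks at the start of the token (a classifier embedded later in a word, as in p_ir-am-d_ᵈda-gan, is left untouched).

```python
CLASSIFIERS = {
    "p_": "masculine personal name", "PN_": "masculine personal name",
    "f_": "female personal name", "PNF_": "female personal name",
    "d_": "divine name", "DN_": "divine name",
    "g_": "geographical name", "GN_": "geographical name",
    "t_": "topographical feature", "Top_N_": "topographical feature",
    "n_": "numerals", "NUM_": "numerals",
    "w_": '"work" name', "WN_": '"work" name',
    "m_": "month name", "MN_": "month name",
}


def split_classifier(token):
    """Return (function, remainder); function is None if no classifier is present."""
    # Try longer codes first so that e.g. "Top_N_" is matched as a whole.
    for prefix in sorted(CLASSIFIERS, key=len, reverse=True):
        if token.startswith(prefix):
            return CLASSIFIERS[prefix], token[len(prefix):]
    return None, token
```

Because the classifier precedes even an opening bracket, `split_classifier("n_[X+1]")` returns the function `numerals` and the remainder `[X+1]`.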

6. Language

... forthcoming

7. Allowed Characters ³

The system automatically checks for valid characters and will return an error message if unknown glyphs are used. It also recognizes "shortcuts", sequences of glyphs automatically changed to expected glyphs. For example, the combination of [ and " (i.e. [" ) is replaced by ⸢, TOP LEFT HALF BRACKET.

For a complete list of these shortcuts, see the "alternative" column of the following table.

In any case, please prepare the text files with the desired Unicode glyphs. A "Virtual keyboard" button can be used to insert unusual characters into the texts on the "Insert a new tablet" webpage.

char	alternative	U. cat	name	code
0-9		all numbers
a-z		all lowercase ascii characters
A-Z		all uppercase ascii characters
ḫ	h	Ll	LATIN SMALL LETTER H WITH BREVE BELOW	U+1E2B
Ḫ	H	Lu	LATIN CAPITAL LETTER H WITH BREVE BELOW
š	sz or sh	Ll	LATIN SMALL LETTER S WITH CARON	U+0161
Š	SZ or SH	Lu	LATIN CAPITAL LETTER S WITH CARON
ṣ	s,	Ll	LATIN SMALL LETTER S WITH DOT BELOW	U+1E63
Ṣ	S,	Lu	LATIN CAPITAL LETTER S WITH DOT BELOW
ṭ	t,	Ll	LATIN SMALL LETTER T WITH DOT BELOW	U+1E6D
Ṭ	T,	Lu	LATIN CAPITAL LETTER T WITH DOT BELOW
_		Pc	LOW LINE / underscore	U+005F
-		Pd	HYPHEN-MINUS	U+002D
,		Po	COMMA	U+002C
:		Po	COLON	U+003A
!		Po	EXCLAMATION MARK	U+0021
?		Po	QUESTION MARK	U+003F
.		Po	FULL STOP	U+002E
'		Po	APOSTROPHE	U+0027
"		Po	QUOTATION MARK	U+0022
‹	<	Pi	SINGLE LEFT-POINTING ANGLE QUOTATION MARK	U+2039
›	>	Pf	SINGLE RIGHT-POINTING ANGLE QUOTATION MARK	U+203A
(		Ps	LEFT PARENTHESIS	U+0028
)		Pe	RIGHT PARENTHESIS	U+0029
[		Ps	LEFT SQUARE BRACKET	U+005B
]		Pe	RIGHT SQUARE BRACKET	U+005D
{		Ps	LEFT CURLY BRACKET	U+007B
}		Pe	RIGHT CURLY BRACKET	U+007D
@		Po	COMMERCIAL AT	U+0040
/		Po	SOLIDUS	U+002F
\		Po	REVERSE SOLIDUS	U+005C
⸢	["	Ps	TOP LEFT HALF BRACKET	U+2E22
⸣	"]	Pe	TOP RIGHT HALF BRACKET	U+2E23
⸤	[,	Ps	BOTTOM LEFT HALF BRACKET	U+2E24
⸥	,]	Pe	BOTTOM RIGHT HALF BRACKET	U+2E25
+		Sm	PLUS SIGN	U+002B
×	x	Sm	MULTIPLICATION SIGN	U+00D7
|		Sm	VERTICAL LINE	U+007C
=		Sm	EQUALS SIGN	U+003D
;		Po	SEMICOLON	U+003B
*		Po	ASTERISK	U+002A
^		Sk	CIRCUMFLEX ACCENT	U+005E
%		Po	PERCENT SIGN	U+0025
°		So	DEGREE SIGN	U+00B0
₀	0	No	SUBSCRIPT ZERO	U+2080
₁	1	No	SUBSCRIPT ONE	U+2081
₂	2	No	SUBSCRIPT TWO	U+2082
²		No	SUPERSCRIPT TWO	U+00B2
₃	3	No	SUBSCRIPT THREE	U+2083
₄	4	No	SUBSCRIPT FOUR	U+2084
₅	5	No	SUBSCRIPT FIVE	U+2085
₆	6	No	SUBSCRIPT SIX	U+2086
₇	7	No	SUBSCRIPT SEVEN	U+2087
₈	8	No	SUBSCRIPT EIGHT	U+2088
₉	9	No	SUBSCRIPT NINE	U+2089
ᵈ	DINGIR=-	Lm	MODIFIER LETTER SMALL D	U+1D48
ᶠ	MUNUS=-	Lm	MODIFIER LETTER SMALL F	U+1DA0
ᵐ	DIŠ=-	Lm	MODIFIER LETTER SMALL M	U+1D50
ʾ	'	Lm	MODIFIER LETTER RIGHT HALF RING	U+02BE
ᵪ		Lm	GREEK SUBSCRIPT SMALL LETTER CHI	U+1D6A
ᵧ		Lm	GREEK SUBSCRIPT SMALL LETTER GAMMA	U+1D67
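The shortcut replacement and the character check described in this section can be sketched as follows. This is a simplified illustration, not the system's implementation: `expand_shortcuts` and `unknown_glyphs` are hypothetical names, only the multi-character shortcuts from the "alternative" column are handled, the rule that digits directly after a letter become subscripts is an assumption, and white-space is added to the allowed set although the table does not list it.

```python
import re
import string

# Multi-character shortcuts taken from the "alternative" column of the table.
SHORTCUTS = [
    ('["', "⸢"), ('"]', "⸣"), ("[,", "⸤"), (",]", "⸥"),
    ("sz", "š"), ("sh", "š"), ("SZ", "Š"), ("SH", "Š"),
    ("s,", "ṣ"), ("S,", "Ṣ"), ("t,", "ṭ"), ("T,", "Ṭ"),
]
SUBSCRIPTS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")

# Non-ASCII-alphanumeric characters listed in the table.
EXTRA = "ḫḪšŠṣṢṭṬ_-,:!?.'\"‹›()[]{}@/\\⸢⸣⸤⸥+×|=;*^%°₀₁₂₃₄₅₆₇₈₉²ᵈᶠᵐʾᵪᵧ"
# Assumption: space, tab, and newline are accepted as separators.
ALLOWED = set(string.ascii_letters + string.digits + EXTRA + " \t\n")


def expand_shortcuts(text):
    """Replace typed shortcuts with the expected glyphs."""
    for typed, glyph in SHORTCUTS:
        text = text.replace(typed, glyph)
    # Assumption: digits written directly after a letter are sign indices.
    return re.sub(
        r"(?<=[^\W\d_])\d+",
        lambda m: m.group().translate(SUBSCRIPTS),
        text,
    )


def unknown_glyphs(text):
    """Return the sorted list of characters that are not in the allowed set."""
    return sorted(set(text) - ALLOWED)
```

Note that this check is stricter than the parser described here: accented vowels such as ú are not in the table and would be reported by `unknown_glyphs`, whereas the parser generally accepts them and converts them in the background.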

u2-ú
Explanation of note 2
Accented vowels are not present in the list, even though they are generally accepted by the parser. In any case, in the background, they are replaced by combinations of letters and subscript digits (e.g., É > E₂, ì > i₃, DU10 > DU₁₀).