The Document
Sinleqiunnini does not store transliterated documents as text files, but splits the various components of a cuneiform document into different database tables for more efficient process of information retrieval [link].
Nevertheless, to simplify the task of compiling and entering data, the system is equipped with a robust "parser" that accepts as input a simple text file formatted according to a relatively easy-to-remember "diplomatic notation". It is a shallow markup system employed to help the program understand how to differentiate the different words and signs values, and how to manage them into a database-driven architecture [refer to sample1 e sample 2]
From text to DB: "tokenizazion"
From the user's perspective, a key concept is the use of white-spaces to separate semantic units, that is tokens, that the parser will interpret as individual elements (i.e., words). Any word, as well as any epigraphic notation1, will be distinguished by a unique identifier (ID).
| id | area_id | notation |
|---|---|---|
| 14353 | 4201 | DIŠ=-bu-ul-la |
| 14354 | 4201 | DUMU |
| 14355 | 4201 | ib-ri-a-da-li |
| 14356 | 4202 | u₃ |
| 14357 | 4202 | MUNUS=-aš-tar-um-mi |
| 14358 | 4202 | DAM-šu₂ |
| 14359 | 4203 | MUNUS=-ba-aʾ-la-ki-mi |
| 14360 | 4203 | DUMU.=MI₂-šu₂-nu |
| ... | ... | ... |
This segmentation process is of primary importance for the DAPCA's search engine, as it allows dealing with any token in the system as an isolated unit. It permits the application of a wide set of search strategies to individual elements of a document (e.g., regex, fuzzy search, similarity measures, etc.). At the same time, it is always possible to keep these elements in their context and perform complex searches, for instance, for syntagmata or "chains of words."
Likewise, the system keeps track of the "coordinates" of each of these tokens, thus allowing for the reorganization of the document in its integrity for printing on screen.
The parser
The following examples show how a "plain text" format (i.e., .txt) should be structured to be accepted by the parser.
Sample text 1
Diplomatic notation: sample 1
Sample text 2
Diplomatic notation: sample 2
1. Physical surfaces of tablet
Those are self-explanatory:
@obverse@bottom@reverse@top@left
Accordingly, every text file must therefore begin with an @. Otherwise, the parser raises an exception and explicitly prompts the user.
In all cases where a line begins with one of these tags, there is no need to add anything else. For example, anything after the @obverse tag will be ignored or, in the worst case, will produce an error.
This marking does not require a line number.
2. Free-text markers
Freely text-based markers can be inserted to indicate various aspects of both the document and metatextual elements. This feature is enabled by placing the $ character before the footnote.
This marking does not require a line number.
This markup includes some helpers for better document layout:
-
$ruling is replaced by a horizontal line. Multiple horizontal lines can be represented by placing multiple $rulings. For example:
-
If parentheses follow the markup, the number included in the parentheses tells the system the number of lines the "note" should occupy. For example:
$break(3)tells the system that we have a break that corresponds roughly to three lines of text.$blank(2)tells the system that we have a portion of the tablet left blank that occupies the equivalent of two lines of text.$seal(4)indicates that the space occupied by the seal corresponds to four lines.
Additionally, one can add any information with the $ prefix. For instance, $the beginning of the column is broken or $an unknown number of columns destroyed, eventually in combination with the (n) alike. This information will be searchable, but please note that it affects the text layout.
It is recommended to use this tool sparingly.
Annotations
For a proper annotation system, see the discussion in: ...
3. Line numbers
Each transliterated line must begin with a line number, as is customary in Assyriological tradition. Currently, there are two possibilities:
- a simple numeral: 1
- a numeral followed by a single quote character after breaks: 1'
Customization
A set of parser-specific rules prevents the system from accepting anything other than numbers and/or numbers + single-quote as a line label. Additional rules can, however, be added to allow it to accept different line formats.
The sequence of numbers is virtually free, that is, one can decide to start with 1' after breaks or to continue with the previous number sequence (e.g., ... 10 / $break / 11'...)
Note
Regardless of how one chooses to name the line numbers, the system internally stores their order, which is determined by the order of the lines in the text file. The line numbers should be considered simple labels.
3.1. Line number separators
After every line number, a white-space (i.e. \s) -- or eventually a tab separator (i.e. \t) -- must follow. This allows the parser to understand where the line number section ends and the transliteration begins.
4. Transliteration
4.1. Graphic relationships
| Character | Function | Example |
|---|---|---|
| [carriage return] | line boundary | |
| [space] | word boundary | ša ur-ra-am še-ra-am |
| - | sign boundary | i-ša-am |
| . | intra-logographic boundary | ÚS.SA.DU AN.TA |
| + | used for ligatures | i+na |
| x or × | for inclusions | AB×ḪA₂ |
| _ | for blank-spaces | [____ i-]na |
4.1.1. Breaks and lacunae
Tip
White-spaces in digital transliterations are often neglected, whereas they are of primary importance, for instance, for material philology.
Please refer to the following cases:
- White-spaces for scribal layout
8 u₃ ša EDIN ḪA.LA-ia ma-la it-ti ŠEŠ.=MEŠ-ia
9 i-kaš-ša-da-an-ni ______________ lil-qe₃
$ruling
10 a-nu-ma a-šar KU₃.BABBAR.=MEŠ u₃ ŠE.=MEŠ ḫu-bul₂-‹‹la››-li-ia i-ru-ub
11 n_10 GIN₂ KU₃.BABBAR.=MEŠ a-na le-et p_DIŠ=-zu-ba-la DUMU p_a-ḫi-ma-lik
12 n_10 _ MIN ________________ a-na le-et p_DIŠ=-DINGIR=-KUR-ta-li-iḫ DUMU p_zi-ik-ri-DINGIR=-KUR
13 n_10 _ MIN ________________ a-na le-et p_DIŠ=-še-i-DINGIR=-KUR DUMU w_tar-ta-ni
- White-spaces for tablet fractures
@obverse
1 [___________________________________]-⸢x⸣
2 [_________________________________ t]a-a-an-=ḪI.=A
3 [________________________________ ]x x IKU.=ḪI.=A
4 [_______________________________ ]-im-i
5 [_______________________________ ]x-ma p_eḫ-⸢li-DINGIR⸣
6 [_______________________________ i]l-la-ak
4.2. Modifiers
| Character(s) | Function | Example |
|---|---|---|
| =- | Preposed determinatives are followed by = and the sign boundary - | LU2=-mu-ti-ia |
| -= | Postposed determinatives are preceded by = which in turn is preceded by the sign boundary designation - | ra-ab-ba-an-=KI |
| =. | more complex determinatives | LU2=.MEŠ=-ši-bu-ut |
| .= | more complex determinatives | A.ŠÀ.=HI.=A |
| ^- | Preposed phonetic complements are followed by the symbols ^- | li^-lil-lik |
| -^ | Postposed phonetic complements are preceded by the symbol -^ | URU-^li3 |
| * | In front to uninterpreted signs | *BI-*IQ-mi or *bi-*iq-mi |
4.3. Condition of the text
| Character(s) | Fuction | Example | Info |
|---|---|---|---|
| x | unreadable signs | x or x-x-x or x x x | |
| X | a single unreadable number | [X] li-im KU3.BABBAR | |
| [ ] | as usual | ||
| ⸢ ⸣ | as usual, but the half brackets must keep the entire sign | ma-⸢la ma⸣-ṣu₂-u₂ is clearer than ma-l⸢a m⸣a-ṣu₂-u₂ |
|
| ⸤ ⸥ | as usual, but the half brackets must keep the entire sign | available but not in use in DAPCA | |
| {} | for erasures | {DUMU.=MEŠ} da-gal-li | |
| < > | added by a modern editor | ||
| << >> | mistakenly written by the scribe | ||
| [()] | indicates that there may or may not be a sign present in a break | [x-(x)-x] | deprecated! → |
| () | Alternatives, actual signs, explanatory names, etc. | mu-sa!(u2)-ra | |
| ? | after the sign for uncertain reading | ||
| ! | after the sign for abnormal graphic writing. When possible, the actual sign must be reported[^7]. | mu-sa!-ra or mu-sa!(u2)-ra | |
| * | before uninterpreted (uppercase) signs | i-na *BI-*IQ-mi | |
| ° | new readings: follow each sign | a-na pa°-ni° | deprecated! → |
Notes
- In broken contexts, it is preferable to indicate the actual, visible space with a series of underscores, for example
[________], rather than attempting to predict the number of missing characters. Therefore, although the notation[x (x) x]or[x x x]is accepted by the system, it is arbitrary at best and less preferable than the first one. - Since the system is a multi-user platform, this type of marking (e.g.,
a-na pa°-ni°), which is perfectly acceptable in printed publications, raises doubts as to who actually entered an alternative reading. It can be used, but sparingly, and should be replaced by the annotation system [*ref].
4.5. Punctuation
| Character | Function | Sign | Example |
|---|---|---|---|
| \ | Glossenkeil[^9] | GAM | |
| : | gloss marker[^10] | KUR.=MEŠ :nu-ku-ur-ti | |
| / | “new line” marker to be used either alone or within a word |
a-ḫi-ma-lik / ŠEŠ-šu or E₂ u₃ ḫa-ab-la i-ša-/am |
5. Semantic classifiers
To indicate some domains, the program accepts the following classifiers to be placed in front of words:
| Code | Alternative | Function | Example |
|---|---|---|---|
| p_ | PN_ | masculine personal name | p_za-bi-hi or p_ir-am-d_ᵈda-gan |
| f_ | PNF_ | female personal name | f_al-ḫa-ti |
| d_ | DN_ | divine name | DN_ᵈNIN.URTA |
| g_ | GN_ | geographical name | g_ra-ab-ba-=KI |
| t_ | Top_N_ | topographical feature | Top_N_sí-ip-hu |
| n_ | NUM_ | numerals | n_1+1/2 |
| w_ | WN_ | "work" name | w_DUB.SAR or w_LU2=.DUB.SAR |
| m_ | MN_ | month name | ITI m_ᵈḫal-ma |
IMPORTANT! - These semantic classifiers must always precede any other element of the word. Thus, for instance, in the case of [X+1], the classifiers must also precede the initial square bracket: n_[X+1].
6.Language
... forthcoming
7. Allowed Characters 3
The system automatically checks for valid characters and will return an error message if unknown glyphs are used. It also recognizes "shortcuts", sequences of glyphs automatically changed to expected glyphs. For example, the combination of [ and " (i.e. [" ) is replaced by ⸢, TOP LEFT HALF BRACKET.
For a complete list of these combining characters, see the following table, column "alternative".
In any case, please prepare the text files with the desired Unicode glyphs. A "Virtual keyboard" button can be used to insert unusual characters into the texts on the "Insert a new tablet" webpage.
| char | alternative | U. cat | name | code |
|---|---|---|---|---|
| 0-9 | all numbers | |||
| a-z | all lowercase ascii characters | |||
| A-Z | all uppercase ascii characters | |||
| ḫ | h | Ll | LATIN SMALL LETTER H WITH BREVE BELOW | U+1E2B |
| Ḫ | H | Lu | LATIN CAPITAL LETTER H | |
| š | sz or sh | Ll | LATIN SMALL LETTER S WITH CARON | U+0161 |
| Š | SZ or SH | Lu | LATIN CAPITAL LETTER S WITH CARON | |
| ṣ | s, | Ll | LATIN SMALL LETTER S WITH DOT BELOW | U+1E63 |
| Ṣ | S, | Lu | LATIN CAPITAL LETTER S WITH DOT BELOW | |
| ṭ | t, | Ll | LATIN SMALL LETTER T WITH DOT BELOW | U+1E6D |
| Ṭ | T, | Lu | LATIN CAPITAL LETTER T WITH DOT BELOW | |
| _ | Pc | LOW LINE / underscore | U+005F | |
| - | Pd | HYPHEN-MINUS | U+002D | |
| , | Po | COMMA | U+002C | |
| : | Po | COLON | U+003A | |
| ! | Po | EXCLAMATION MARK | U+0021 | |
| ? | Po | QUESTION MARK | U+003F | |
| . | Po | FULL STOP | U+002E | |
| ' | Po | APOSTROPHE | U+0027 | |
| " | Po | QUOTATION MARK | U+0022 | |
| ‹ | \< | Pi | SINGLE LEFT-POINTING ANGLE QUOTATION MARK | U+2039 |
| › | > | Pf | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK | U+203A |
| ( | Ps | LEFT PARENTHESIS | U+0028 | |
| ) | Pe | RIGHT PARENTHESIS | U+0029 | |
| [ | Ps | LEFT SQUARE BRACKET | U+005B | |
| ] | Pe | RIGHT SQUARE BRACKET | U+005D | |
| { | Ps | LEFT CURLY BRACKET | U+007B | |
| } | Pe | RIGHT CURLY BRACKET | U+007D | |
| @ | Po | COMMERCIAL AT | U+0040 | |
| / | Po | SOLIDUS | U+002F | |
| \ | Po | REVERSE SOLIDUS | U+005C | |
| ⸢ | [" | Ps | TOP LEFT HALF BRACKET | U+2E22 |
| ⸣ | "] | Ps | TOP RIGHT HALF BRACKET | U+2E23 |
| ⸤ | [, | Ps | BOTTOM LEFT HALF BRACKET | U+2E24 |
| ⸥ | ,] | Ps | BOTTOM RIGHT HALF BRACKET | U+2E25 |
| + | Sm | PLUS SIGN | U+002B | |
| × | x | Sm | MULTIPLICATION SIGN | U+00D7 |
| | | Sm | VERTICAL LINE | U+007C | |
| = | Sm | EQUALS SIGN | U+003D | |
| ; | Po | SEMICOLON | U+003B | |
| * | Po | ASTERISK | U+002A | |
| ^ | Sk | CIRCUMFLEX ACCENT | U+005E | |
| % | Po | PERCENT SIGN | U+0025 | |
| ° | So | DEGREE SIGN | U+00B0 | |
| ₀ | 0 | No | SUBSCRIPT ZERO | U+2080 |
| ₁ | 1 | No | SUBSCRIPT ONE | U+2081 |
| ₂ | 2 | No | SUBSCRIPT TWO | U+2082 |
| ² | No | SUPERSCRIPT TWO | U+00B2 | |
| ₃ | 3 | No | SUBSCRIPT THREE | U+2083 |
| ₄ | 4 | No | SUBSCRIPT FOUR | U+2084 |
| ₅ | 5 | No | SUBSCRIPT FIVE | U+2085 |
| ₆ | 6 | No | SUBSCRIPT SIX | U+2086 |
| ₇ | 7 | No | SUBSCRIPT SEVEN | U+2087 |
| ₈ | 8 | No | SUBSCRIPT EIGHT | U+2088 |
| ₉ | 9 | No | SUBSCRIPT NINE | U+2089 |
| ᵈ | DINGIR=- | Lm | MODIFIER LETTER SMALL D | U+1D48 |
| ᶠ | MUNUS=- | Lm | MODIFIER LETTER SMALL F | U+1DA0 |
| ᵐ | DIŠ=- | Lm | MODIFIER LETTER SMALL M | U+1D50 |
| ʾ | ' | Lm | MODIFIER LETTER REVERSED GLOTTAL STOP | U+02BE |
| ᵪ | Lm | MODIFIER LETTER SMALL CHI | U+1D6A | |
| ᵧ | Lm | MODIFIER LETTER SMALL GREEK GAMMA | U+1D67 |



