transliterate

Transliteration engine
git clone git://lumidify.org/git/transliterate.git

commit 4d2bba71bfcc746fc1adabcf7d0cc08d54652817
parent 410d20484d0ceda89cfcad455b727bac02d071d9
Author: lumidify <nobody@lumidify.org>
Date:   Thu, 26 Mar 2020 10:38:15 +0100

Add more explanation to documentation

Diffstat:
M transliterate.pl | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 98 insertions(+), 6 deletions(-)

diff --git a/transliterate.pl b/transliterate.pl
@@ -1392,7 +1392,7 @@ currently being processed.
 
 The unknown word window is opened any time a word could not be
 replaced.
-Both the context from the original language and the context from the
+Both the context from the original script and the context from the
 transliterated version (so far) is shown. If a part of the text is
 selected in one of the text boxes and "Use selection as word" is
 pressed for the appropriate box, the selected text is used for the
@@ -1413,6 +1413,18 @@ configuration.
 Adds the word typed in the text box beside "Add replacement" to the
 selected table file and re-runs the replacement on the current line.
 
+Note that this could be made to be faster by simply replacing the
+word directly in the text instead of running the entire replacement
+again. The problem is that adding new words can, on occasion, have
+undesired side effects, and it is better to see those immediately,
+rather than waiting until a later date.
+
+One problem is that the word is just written directly to the file
+and there is no undo. This is the way it currently is and will
+probably not change very soon. If a mistake is made, the word can
+always be removed again manually from the list and "Reload config"
+pressed.
+
 The filtering for which table files are actually shown here is
 currently a bit rudimentary. First, all paths that are used in the
 C<table> statements in the config are put into a list. Then, the
@@ -1438,8 +1450,76 @@ while since the entire word database has to be reloaded.
 
 Prints the current line number to the terminal and exits the program.
 
+The program can always be started again at this line number using
+the C<--start> option if needed.
+
 =back
 
+=head1 INTERNALS/EXAMPLES
+
+This section was added to explain to the user how the transliteration
+process works internally since that may be necessary to understand
+why certain words are replaced the way they are.
+
+First off, the process works line-by-line, i.e. no B<match> statement
+will ever match anything that crosses the end of a line.
+
+Each line is initially stored as one chunk which is marked as
+untransliterated. Then, all B<match>, B<matchignore>, and B<replace>
+(or, rather, B<group>) statements are executed in the order they
+appear in the config file. Whenever a word/match is replaced, it
+is split off into a separate chunk which is marked as transliterated.
+A chunk marked as transliterated I<is entirely ignored by any
+replacement statements that come afterwards>. Note that C<beginword>
+and C<endword> can always match at the boundary between an
+untransliterated and transliterated chunk. This is to facilitate
+automated replacement of certain grammatical constructions. For instance:
+
+If the string C<a-> could be attached as a prefix to any word and needed
+to be replaced as C<b-> everywhere, it would be quite trivial to add
+a match statement 'match "a-" "b-" beginword'. If run on the text
+C<a-word>, where C<word> is some word that should be transliterated
+as C<word_replaced>, and the group replace statement for the word comes
+after the match statement given above, the following would happen:
+First, the match statement would replace C<a-> and split the text into
+the two chunks C<b-> and C<word>, where C<b-> is already marked as
+transliterated. Since C<word> is now separate, it will be matched
+by the group replace statement later, even if it has C<beginword> set
+and would normally not match if C<a-> came before it. Thus, the final
+output will be C<b-word_replaced>, allowing for the uniform replacement
+of the prefix instead of having to add each word twice, once with and
+once without the prefix.
+
+In certain cases, this behavior may not be desired. Consider, for
+instance, a prefix C<c-> which cannot be replaced uniformly as in the
+example above due to differences in the source and destination script.
+Since it cannot be replaced uniformly, two words C<word1> and C<word2>
+would both need to be specified separately with replacements for
+C<c-word1> and C<c-word2>. If, however, the prefix C<c-> has an
+alternate spelling C<c > (without the hyphen), it would be very useful
+to be able to automatically recognize that as well. This is where the
+C<nofinal> attribute for the B<match> statements comes in. If there is
+a match statement 'match "c " "c-" beginword nofinal', the replaced
+chunk is B<not> marked as transliterated, so after executing this
+statement on the text C<c word1>, there will still only be one chunk,
+C<c-word1>, allowing for the regular word replacements to function
+properly.
+
+Once all the replacement statements have been processed, each chunk
+of text that is not marked as transliterated yet is split based on
+the B<split> pattern specified in the config and all actual characters
+matched by the B<split> pattern are marked as transliterated (this
+usually means all the spaces, newlines, quotation marks, etc.). Any
+remaining words/text chunks that are still marked as untransliterated are
+now processed by the unknown word window. If one of these remaining
+unknown chunks is present in the file specified by the B<ignore>
+statement in the config, it is simply ignored and later printed out
+as is. After all untransliterated words have either had a replacement
+added or been ignored, any words with multiple replacement choices are
+processed by the word choice window. Once this is all done, the final
+output is written to the output file and the process is repeated with
+the next line.
+
 =head1 CONFIGURATION
 
 These are the commands accepted in the configuration file.
@@ -1449,8 +1529,19 @@ The C<match>, C<matchignore>, and C<replace> commands are executed in the order
 they are specified, except that all C<replace> commands within the same group
 are replaced together.
 
-Note that any duplicate words found will cause the user to be prompted
-to choose one option every time the word is replaced in the input text.
+The B<match> and B<matchignore> statements accept any RegEx strings and
+are thus very powerful. The B<group> statements only work with the
+non-RegEx words from the tables, but are very efficient for large numbers
+of words and should thus be used for the main bulk of the words.
+
+Any duplicate words found will cause the user to be prompted to choose
+one option every time the word is replaced in the input text.
+
+Note that any regex strings specified in the config should B<not>
+contain capture groups, as that would break the C<endword> functionality
+since this is also implemented internally using capture groups. Capture
+groups are also entirely pointless in the config since they currently
+cannot be used as part of the replacement string in B<match> statements.
 
 =over 8
 
@@ -1521,7 +1612,7 @@ Note that if C<< <filename> >> is not an absolute path, it is taken to be relati
 to the location of the configuration file.
 
 The table files simply consist of C<tablesep>-separated values, with the word in the
-original language first and the replacement word second. The replacement word
+original script first and the replacement word second. The replacement word
 can optionally have several parts separated by C<choicesep>, which will cause the user
 to be prompted to choose one of the options.
 
@@ -1537,7 +1628,7 @@ If C<noroot> is set, the root forms of the words are not kept.
 If the replacement for a word ending contains C<choicesep>, it is split and each part is
 combined with the root form separately and the user is prompted to choose one of the
 options later. it is thus possible to allow multiple choices for the ending if
-there is a distinction in the replacement language but not in the source language.
+there is a distinction in the replacement script but not in the source script.
 Note that each of the root words is also split into its choices (if necessary) during the
 expanding, so it is possible to use C<choicesep> in both the endings and root words.
 
@@ -1559,7 +1650,8 @@ C<match> or C<replace> commands.
 =item B<matchignore> <regex string> [beginword] [endword]
 
 Performs a RegEx match in the same manner as C<match>, except that the original
-match is used as the replacement instead of specifying a replacement string.
+match is used as the replacement instead of specifying a replacement string, i.e.
+whatever is matched is just marked as transliterated without changing it.
 
 =item B<group> [beginword] [endword]
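
The chunk handling described in the INTERNALS/EXAMPLES section of this diff can be illustrated with a minimal Python sketch. It is not the Perl code in transliterate.pl. The names Chunk, apply_match and render are hypothetical, and modelling beginword as "at the start of a chunk or after whitespace" is an assumption made only so that the two documented examples can be reproduced.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    done: bool = False  # True once the chunk is marked as transliterated


def apply_match(chunks, pattern, replacement, beginword=False, nofinal=False):
    """Apply one hypothetical match rule to every untransliterated chunk."""
    # Assumption: beginword means "start of a chunk or preceded by whitespace".
    regex = re.compile((r"(?<!\S)" if beginword else "") + pattern)
    out = []
    for chunk in chunks:
        if chunk.done:
            out.append(chunk)  # transliterated chunks are ignored by later rules
            continue
        if nofinal:
            # Replace in place: the chunk stays untransliterated and unsplit.
            out.append(Chunk(regex.sub(lambda _m: replacement, chunk.text)))
            continue
        pos = 0
        for m in regex.finditer(chunk.text):
            if m.start() > pos:
                out.append(Chunk(chunk.text[pos:m.start()]))
            out.append(Chunk(replacement, done=True))  # split off, marked done
            pos = m.end()
        if pos < len(chunk.text):
            out.append(Chunk(chunk.text[pos:]))
    return out


def render(chunks):
    """Join the chunk texts back into one line."""
    return "".join(c.text for c in chunks)


# Prefix example: 'match "a-" "b-" beginword', then a word rule with beginword.
line = apply_match([Chunk("a-word")], "a-", "b-", beginword=True)
line = apply_match(line, "word", "word_replaced", beginword=True)
print(render(line))  # b-word_replaced

# nofinal example: 'match "c " "c-" beginword nofinal' leaves one open chunk.
line2 = apply_match([Chunk("c word1")], "c ", "c-", beginword=True, nofinal=True)
print(line2)
```

In this sketch the word rule only matches C<word> because the earlier prefix rule split it into its own chunk. Applied directly to C<a-word>, the same rule with beginword set finds nothing, which mirrors the behavior the documentation describes.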