transliterate

Transliteration engine
git clone git://lumidify.org/git/transliterate.git

commit 938f18f29430df37213508816b5d6a7efde23ed5
parent df15f30a4f0f70756b5fb4dde9b188087105e973
Author: lumidify <nobody@lumidify.org>
Date:   Thu, 26 Mar 2020 14:49:29 +0100

Fix style of documentation

Diffstat:
D .gitignore                     |   1 -
A tests/data/endings.txt         |   3 +++
A tests/data/endings_choices.txt |   2 ++
A tests/data/ignore.txt          |   1 +
A tests/data/words.txt           |  10 ++++++++++
A tests/data/words1.txt          |  10 ++++++++++
M transliterate.pl               | 144 ++++++++++++++++++++++++++++++++++++++++---------------------------------------
7 files changed, 99 insertions(+), 72 deletions(-)

diff --git a/.gitignore b/.gitignore
@@ -1 +0,0 @@
-data
diff --git a/tests/data/endings.txt b/tests/data/endings.txt
@@ -0,0 +1,3 @@
+end1 end1_replaced
+end2 end2_replaced
+end3 end3_replaced
diff --git a/tests/data/endings_choices.txt b/tests/data/endings_choices.txt
@@ -0,0 +1,2 @@
+end1,end1r1|end1r2
+end2,end2r
diff --git a/tests/data/ignore.txt b/tests/data/ignore.txt
@@ -0,0 +1 @@
+ignore
diff --git a/tests/data/words.txt b/tests/data/words.txt
@@ -0,0 +1,10 @@
+word0 word0_replaced$word0_replaced2
+word1 word1_replaced
+word2 word2_replaced
+word3 word3_replaced
+word4 word4_replaced
+word5 word5_replaced
+word6 word6_replaced
+word7 word7_replaced
+word8 word8_replaced
+word9 word9_replaced
diff --git a/tests/data/words1.txt b/tests/data/words1.txt
@@ -0,0 +1,10 @@
+word0,word0_replaced|word0_replaced2
+word1,word1_replaced
+word2,word2_replaced
+word3,word3_replaced
+word4,word4_replaced
+word5,word5_replaced
+word6,word6_replaced
+word7,word7_replaced
+word8,word8_replaced
+word9,word9_replaced
diff --git a/transliterate.pl b/transliterate.pl
@@ -1286,20 +1286,20 @@ Start the transliteration engine with the given file as input.

 =over 8

-=item B<< --output <filename> >>
+=item B<--output> <filename>

 Sets the output file to print to.

-If the file exists already and C<--force> is not set, the user is asked
+If the file exists already and B<--force> is not set, the user is asked
 if the file should be overwritten or appended to.

-B<Default:> STDOUT (print to terminal)
+B<Default:> C<STDOUT> (print to terminal)

-=item B<< --config <filename> >>
+=item B<--config> <filename>

 Sets the configuration file to use.

-B<Default:> "config"
+B<Default:> C<config>

 =item B<--checkduplicates>

@@ -1331,18 +1331,18 @@ prompts.

 This option is only useful for automatic testing of the transliteration engine.

-If C<--nochoices> is enabled, each word in the input with multiple choices will
+If B<--nochoices> is enabled, each word in the input with multiple choices will
 be output, along with the number of choices (can be used to test the proper
-functioning of C<choicesep> in the config file).
+functioning of B<choicesep> in the config file).

-If C<--nounknowns> is enabled, each unknown word in the input is printed
-(can be used to test that the C<ignore> options are working correctly).
+If B<--nounknowns> is enabled, each unknown word in the input is printed
+(can be used to test that the B<ignore> options are working correctly).

 =item B<--force>

 Always overwrites the output and error file without asking.

-=item B<< --start <line number> >>
+=item B<--start> <line number>

 Starts at the given line number instead of the beginning of the file.

@@ -1351,19 +1351,19 @@ printed out.
 This is the current line that was being processed, so it has
 not been printed to the output file yet and thus the program must
 be resumed at that line, not the one afterwards.

-=item B<< --errors <filename> >>
+=item B<--errors> <filename>

 Specifies a file to write errors in. Note that this does not refer to
 actual errors, but to any words that were temporarily ignored
 (i.e. words for which "Ignore: This run" was clicked).
 If no file is specified, nothing is written. If a file is specified
-that already exists and C<--force> is not set, the user is prompted
+that already exists and B<--force> is not set, the user is prompted
 for action.

 =item B<--help>

-Display the full documentation.
+Displays the full documentation.

 =back

@@ -1427,9 +1427,9 @@ pressed.

 The filtering for which table files are actually shown here is
 currently a bit rudimentary. First, all paths that are used in the
-C<table> statements in the config are put into a list. Then, the
+B<table> statements in the config are put into a list. Then, the
 paths corresponding to any tables used for word endings in the
-C<expand> statements are removed from the list, and that is what
+B<expand> statements are removed from the list, and that is what
 is shown in this window. The reason for removing those paths from the
 list is that it gets somewhat confusing when all the tables that are
 only used for word endings are also in the list, and it
@@ -1438,7 +1438,7 @@ one of those files. If necessary, the word can always be added
 manually and the config reloaded. Note also that only actual table
 paths are shown here, not the tables themselves - this is not
 necessarily a one-to-one mapping since new tables can be
-generated with the C<expand> statements.
+generated with the B<expand> statements.

 =item Reload config

@@ -1451,7 +1451,7 @@ while since the entire word database has to be reloaded.

 Prints the current line number to the terminal and exits the program.
 The program can always be started again at this line number using
-the C<--start> option if needed.
+the B<--start> option if needed.

 =back

@@ -1470,39 +1470,39 @@ untransliterated. Then, all B<match>, B<matchignore>, and B<replace>
 appear in the config file. Whenever a word/match is replaced, it is
 split off into a separate chunk which is marked as transliterated.
 A chunk marked as transliterated I<is entirely ignored by any
-replacement statements that come afterwards>. Note that C<beginword>
-and C<endword> can always match at the boundary between an
+replacement statements that come afterwards>. Note that B<beginword>
+and B<endword> can always match at the boundary between an
 untransliterated and transliterated chunk. This is to facilitate
 automated replacement of certain grammatical constructions. For instance:

-If the string C<a-> could be attached as a prefix to any word and needed
-to be replaced as C<b-> everywhere, it would be quite trivial to add
-a match statement 'match "a-" "b-" beginword'. If run on the text
-C<a-word>, where C<word> is some word that should be transliterated
-as C<word_replaced>, and the group replace statement for the word comes
+If the string "a-" could be attached as a prefix to any word and needed
+to be replaced as "b-" everywhere, it would be quite trivial to add
+a match statement C<'match "a-" "b-" beginword'>. If run on the text
+"a-word", where "word" is some word that should be transliterated
+as "word_replaced", and the group replace statement for the word comes
 after the match statement given above, the following would happen:
-First, the match statement would replace C<a-> and split the text into
-the two chunks C<b-> and C<word>, where C<b-> is already marked as
-transliterated. Since C<word> is now separate, it will be matched
-by the group replace statement later, even if it has C<beginword> set
-and would normally not match if C<a-> came before it. Thus, the final
-output will be C<b-word_replaced>, allowing for the uniform replacement
+First, the match statement would replace "a-" and split the text into
+the two chunks "b-" and "word", where "b-" is already marked as
+transliterated. Since "word" is now separate, it will be matched
+by the group replace statement later, even if it has B<beginword> set
+and would normally not match if "a-" came before it. Thus, the final
+output will be "b-word_replaced", allowing for the uniform replacement
 of the prefix instead of having to add each word twice, once with and
 once without the prefix.

 In certain cases, this behavior may not be desired. Consider, for
-instance, a prefix C<c-> which cannot be replaced uniformly as in the
+instance, a prefix "c-" which cannot be replaced uniformly as in the
 example above due to differences in the source and destination script.
-Since it cannot be replaced uniformly, two words C<word1> and C<word2>
+Since it cannot be replaced uniformly, two words "word1" and "word2"
 would both need to be specified separately with replacements for
-C<c-word1> and C<c-word2>. If, however, the prefix C<c-> has an
-alternate spelling C<c > (without the hyphen), it would be very useful
+"c-word1" and "c-word2". If, however, the prefix "c-" has an
+alternate spelling "c " (without the hyphen), it would be very useful
 to be able to automatically recognize that as well. This is where the
-C<nofinal> attribute for the B<match> statements comes in. If there is
-a match statement 'match "c " "c-" beginword nofinal', the replaced
+B<nofinal> attribute for the B<match> statements comes in. If there is
+a match statement C<'match "c " "c-" beginword nofinal'>, the replaced
 chunk is B<not> marked as transliterated, so after executing this
-statement on the text C<c word1>, there will still only be one chunk,
-C<c-word1>, allowing for the regular word replacements to function
+statement on the text "c word1", there will still only be one chunk,
+"c-word1", allowing for the regular word replacements to function
 properly.

 Once all the replacement statements have been processed, each chunk
@@ -1525,8 +1525,8 @@ the next line.
 These are the commands accepted in the configuration file.
 Any parameters in square brackets are optional.

-The C<match>, C<matchignore>, and C<replace> commands are executed in
-the order they are specified, except that all C<replace> commands within
+The B<match>, B<matchignore>, and B<replace> commands are executed in
+the order they are specified, except that all B<replace> commands within
 the same group are replaced together.

 The B<match> and B<matchignore> statements accept any RegEx strings and
@@ -1538,10 +1538,12 @@ Any duplicate words found will cause the user to be prompted to choose
 one option every time the word is replaced in the input text.

 Note that any regex strings specified in the config should B<not>
-contain capture groups, as that would break the C<endword> functionality
+contain capture groups, as that would break the B<endword> functionality
 since this is also implemented internally using capture groups. Capture
 groups are also entirely pointless in the config since they currently
 cannot be used as part of the replacement string in B<match> statements.
+Lookaheads and lookbehinds are fine, though, and could be useful in
+certain cases.

 =over 8

@@ -1551,50 +1553,50 @@ Sets the RegEx string to be used for splitting words. This is only used
 for splitting the words which couldn't be replaced after all replacement
 has been done, before prompting the user for unknown words.

-Note that C<split> should probably always contain at least C<\n>, since
+Note that B<split> should probably always contain at least C<\n>, since
 otherwise all of the newlines will be marked as unknown words. Usually,
 this will be included anyways through C<\s>.

-Note also that C<split> should probably include the C<+> RegEx-quantifier
+Note also that B<split> should probably include the C<+> RegEx-quantifier
 since that allows the splitting function in the end to ignore several
 splitting characters right after each other (e.g. several spaces) in one
 go instead of splitting the string again for every single one of them.
 This shouldn't actually make any difference functionality-wise, though.

-B<Default:> \s (all whitespace)
+B<Default:> C<\s> (all whitespace)

 =item B<beforeword> <regex string>

-Sets the RegEx string to be matched before a word if C<beginword> is set.
+Sets the RegEx string to be matched before a word if B<beginword> is set.

-B<Default:> \s
+B<Default:> C<\s>

 =item B<afterword> <regex string>

-Sets the RegEx string to be matched after a word if C<endword> is set.
+Sets the RegEx string to be matched after a word if B<endword> is set.

-Note that C<afterword> should probably always contain at least C<\n>,
-since otherwise words with C<endword> set will not be matched at the
+Note that B<afterword> should probably always contain at least C<\n>,
+since otherwise words with B<endword> set will not be matched at the
 end of a line.

-C<beforeword> and C<afterword> will often be exactly the same, but
+B<beforeword> and B<afterword> will often be exactly the same, but
 they are left as separate options in case more fine-tuning is needed.

-B<Default:> \s
+B<Default:> C<\s>

 =item B<tablesep> <string>

 Sets the separator used to split the lines in the table files into the
 original and replacement word.

-B<Default:> Tab
+B<Default:> C<Tab>

 =item B<choicesep> <string>

 Sets the separator used to split replacement words into multiple choices
 for prompting the user.

-B<Default:> $
+B<Default:> C<$>

 =item B<ignore> <filename>

@@ -1606,14 +1608,14 @@ add words to it from the unknown word window.
 =item B<table> <table identifier> <filename>

 Load the table from C<< <filename> >>, making it available for later use in the
-C<expand> and C<replace> commands using the identifier C<< <table identifier> >>.
+B<expand> and B<replace> commands using the identifier C<< <table identifier> >>.

 Note that if C<< <filename> >> is not an absolute path, it is taken to be relative to
 the location of the configuration file.

-The table files simply consist of C<tablesep>-separated values, with the word in the
+The table files simply consist of B<tablesep>-separated values, with the word in the
 original script first and the replacement word second. The replacement word
-can optionally have several parts separated by C<choicesep>, which will cause the
+can optionally have several parts separated by B<choicesep>, which will cause the
 user to be prompted to choose one of the options.

 =item B<expand> <table identifier> <new table identifier> <word ending table> [noroot]
@@ -1623,17 +1625,17 @@ the word endings in C<< <word ending table> >>, saving the result as a table wit
 identifier C<< <new table identifier> >>. If C<same> is specified as C<< <new table identifier> >>,
 the original C<< <table identifier> >> is used instead.

-If C<noroot> is set, the root forms of the words are not kept.
+If B<noroot> is set, the root forms of the words are not kept.

-If the replacement for a word ending contains C<choicesep>, it is split and each part
+If the replacement for a word ending contains B<choicesep>, it is split and each part
 is combined with the root form separately and the user is prompted to choose one of the
 options later. it is thus possible to allow multiple choices for the ending if there is
 a distinction in the replacement script but not in the source script.
 Note that each of the root words is also split into its choices (if necessary)
-during the expanding, so it is possible to use C<choicesep> in both the endings
+during the expanding, so it is possible to use B<choicesep> in both the endings
 and root words.

-Note that the paths of all tables used for <word ending table> are removed from the
+Note that the paths of all tables used for C<< <word ending table> >> are removed from the
 list of paths that is later shown in the unknown word window. See L</"UNKNOWN WORD
 WINDOW"> for details.

@@ -1641,32 +1643,32 @@ WINDOW"> for details.

 Perform a RegEx match using the given C<< <regex string> >>, replacing it with
 C<< <replacement string> >>. Note that the replacement cannot contain any RegEx
-(e.g. groups) in it. C<beginword> and C<endword> specify whether the match must
+(e.g. groups) in it. B<beginword> and B<endword> specify whether the match must
 be at the beginning or ending of a word, respectively, using the RegEx specified
-in C<beforeword> and C<afterword>. If C<nofinal> is set, the string is not marked
+in B<beforeword> and B<afterword>. If B<nofinal> is set, the string is not marked
 as transliterated after the replacement, allowing it to be modified by subsequent
-C<match> or C<replace> commands.
+B<match> or B<replace> commands.

 =item B<matchignore> <regex string> [beginword] [endword]

-Performs a RegEx match in the same manner as C<match>, except that the original
+Performs a RegEx match in the same manner as B<match>, except that the original
 match is used as the replacement instead of specifying a replacement string, i.e.
 whatever is matched is just marked as transliterated without changing it.

 =item B<group> [beginword] [endword]

-Begins a replacement group. All C<replace> commands must occur between C<group>
-and C<endgroup>, since they are then grouped together and replaced in one go.
-C<beginword> and C<endword> act in the same way as specified for C<match> and
-apply to all C<replace> statements in this group.
+Begins a replacement group. All B<replace> commands must occur between B<group>
+and B<endgroup>, since they are then grouped together and replaced in one go.
+B<beginword> and B<endword> act in the same way as specified for B<match> and
+apply to all B<replace> statements in this group.

 =item B<replace> <table identifier>

 Replace all words in the table with the identifier C<< <table identifier> >>,
-using the C<beginword> and C<endword> settings specified by the current group.
+using the B<beginword> and B<endword> settings specified by the current group.

-Note that a table must have been loaded (or generated using C<expand>)
-before being used in a C<replace> statement.
+Note that a table must have been loaded (or generated using B<expand>)
+before being used in a B<replace> statement.

 =item B<endgroup>