transliterate

Transliteration engine
git clone git://lumidify.org/transliterate.git

commit ae7e56e40e4b336326877b1fcf40f6aa0044799c
parent 8798eb993523df956ad8b76f2ce500bbbe65870a
Author: lumidify <nobody@lumidify.org>
Date:   Wed,  1 Apr 2020 14:14:44 +0200

Update documentation for diacritics

Diffstat:
M transliterate.pl | 107 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------
1 file changed, 77 insertions(+), 30 deletions(-)

diff --git a/transliterate.pl b/transliterate.pl
@@ -659,7 +659,7 @@ sub interpret_config {
 	# ignore is the path to the ignore file, ignore_words the actual table
 	$config{"ignore"} = "";
 	$config{"ignore_words"} = {};
-	$config{"split"} = "\\s";
+	$config{"split"} = "\\s+";
 	$config{"beforeword"} = "\\s";
 	$config{"afterword"} = "\\s";
 	$config{"tablesep"} = "\t";
@@ -1160,6 +1160,8 @@ sub replace_group {
 	@$substrings = @substrings_new;
 }
 
+# Perform all replacements on $word, first removing all
+# diacritics specified in the config
 sub replace_strip_diacritics {
 	my ($config, $word) = @_;
 	foreach my $diacritic (@{$config->{"diacritics"}}) {
@@ -1519,18 +1521,43 @@ The possible actions are:
 "Permanently" saves the word in the ignore file specified in the
 configuration.
 
+=item Retry without diacritics
+
+Removes all diacritics specified in the L<config|/"CONFIGURATION">
+from the currently selected word and re-transliterates just that
+word. The result is then pasted into the text box beside
+"Add replacement" so it can be added to a table. This is only a
+sort of helper for languages like Urdu in which words often can
+be written with or without diacritics. If the "base form" without
+diacritics is already in the tables, this button can be used to
+quickly find the transliteration instead of having to type it
+out again. Any part of the word that couldn't be transliterated
+is just pasted verbatim into the text box (but after the
+diacritics have been removed).
+
+Note that the selection can still be modified after this, before
+pressing "Add to list". This could potentially be useful if a word
+is in a table that is expanded using "noroot" because "Retry without
+diacritics" would only work with the full word (with the ending),
+but only the stem should be added to the list. If that is the case,
+"Retry without diacritics" could be pressed with the whole word
+selected, but the ending could be removed before actually pressing
+"Add to list".
+
 =item Add to list
 
 Adds the word typed in the text box beside "Add replacement" to the
-selected table file and re-runs the replacement on the current line.
-All table files that do not have B<nodisplay> set are shown as
-options, see L</"CONFIGURATION">.
-
-Note that this could be made to be faster by simply replacing the
-word directly in the text instead of running the entire replacement
-again. The problem is that adding new words can, on occasion, have
-undesired side effects, and it is better to see those immediately,
-rather than waiting until a later date.
+selected table file as the replacement for the word currently selected
+and re-runs the replacement on the current line. All table files that
+do not have B<nodisplay> set are shown as options, see L</"CONFIGURATION">.
+
+Note that this always re-transliterates the entire line afterwards.
+This is to allow more flexibility. Consider, for instance, a compound
+word of which the first part is also a valid single word. If the
+entire line was not re-transliterated, it would be impossible to
+add a replacement for that entire compound word and have it take
+effect during the same run since the first part of the word would
+not even be available for transliteration anymore.
 
 One problem is that the word is just written directly to the file
 and there is no undo. This is the way it currently is and will
@@ -1604,25 +1631,29 @@ statement on the text "c word1", there will still only be one chunk,
 properly.
 
 Once all the replacement statements have been processed, each chunk
-of text that is not marked as transliterated yet is split based on
-the B<split> pattern specified in the config and all actual characters
-matched by the B<split> pattern are marked as transliterated (this
-usually means all the spaces, newlines, quotation marks, etc.). Any
-remaining words/text chunks that are still marked as untransliterated are
-now processed by the unknown word window. If one of these remaining
-unknown chunks is present in the file specified by the B<ignore>
-statement in the config, it is simply ignored and later printed out
-as is. After all untransliterated words have either had a replacement
-added or been ignored, any words with multiple replacement choices are
-processed by the word choice window. Once this is all done, the final
-output is written to the output file and the process is repeated with
-the next line.
+of text that is not marked as transliterated yet is "trimmed" based on
+the B<split> pattern specified in the config. This means that all
+"lone" split characters are marked as transliterated and any other
+untransliterated chunks have leading or trailing split characters
+marked as transliterated. At this point, only chunks of actual text that
+have not been transliterated are still marked as untransliterated.
+These are now processed by the L<unknown word window|/"UNKNOWN WORD WINDOW">.
+If one of these remaining unknown chunks is present in the file
+specified by the B<ignore> statement in the config, it is simply ignored
+and later printed out as is. After all untransliterated words have either
+had a replacement added or been ignored, any words with multiple replacement
+choices are processed by the word choice window. Once this is all done,
+the final output is written to the output file and the process is
+repeated with the next line. Note that the entire process is started
+again each time a word is added to a table or the config is reloaded
+from the L<unknown word window|/"UNKNOWN WORD WINDOW">.
 
 =head1 CONFIGURATION
 
 These are the commands accepted in the configuration file.
 Any parameters in square brackets are optional.
-Comments are started with C<#>.
+Comments are started with C<#>. Strings (filenames, regex strings, etc.)
+are enclosed in double quotes ("").
 
 The B<match>, B<matchignore>, and B<replace> commands are executed in
 the order they are specified, except that all B<replace> commands within
@@ -1668,12 +1699,17 @@ otherwise all of the newlines will be marked as unknown words. Usually, this
 will be included anyways through C<\s>.
 
 Note also that B<split> should probably include the C<+> RegEx-quantifier
-since that allows the splitting function in the end to ignore several
-splitting characters right after each other (e.g. several spaces) in one
-go instead of splitting the string again for every single one of them.
-This shouldn't actually make any difference functionality-wise, though.
+since that allows the splitting function in the end to also mark several
+splitting characters in a row as transliterated.
+
+This is named a bit confusingly since it was originally used to split
+the string completely based on the given pattern in the end. This was
+changed later, so a better name now would be "trim", but it's already
+called this way, so I don't feel like changing it. See the last
+paragraph of L</"INTERNALS/EXAMPLES"> for a short description of how
+the trimming works.
 
-B<Default:> C<\s> (all whitespace)
+B<Default:> C<\s+> (all whitespace)
 
 =item B<beforeword> <regex string>
 
@@ -1794,6 +1830,16 @@ before being used in a B<replace> statement.
 
 End a replacement group.
 
+=item B<diacritics> <diacritic> [...]
+
+Adds the given list of diacritics to the list of diacritics that will be removed
+from a word when "Retry without diacritics" is pressed in the
+L<unknown word window|/"UNKNOWN WORD WINDOW">.
+
+There are quite advanced Unicode algorithms that could be used to compare words
+while ignoring diacritics, but I do not know if it would be possible to use any
+of those with the current way this engine works.
+
 =back
 
 =head1 BUGS
@@ -1816,7 +1862,8 @@ on-the-fly replacing doesn't work.
 In general, I have tested the GUI code much less than the rest since you can't
 really test it automatically very well.
 
-The diacritic handling code is very rudimentary.
+The code is generally quite nasty, especially the parts belonging to the GUI.
+Don't look at it.
 
 Tell me if you find any bugs.
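
Note on the B<split> default change: the documentation in this commit describes "trimming" as marking lone split characters, and leading or trailing split characters of a chunk, as transliterated. The following is only a rough sketch of that idea, not the code from transliterate.pl. The function name trim_chunk and its return convention are hypothetical and exist purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of transliterate.pl): separate an
# untransliterated chunk into leading split characters, remaining
# text, and trailing split characters, based on a split pattern.
sub trim_chunk {
	my ($chunk, $split) = @_;
	# A chunk matched entirely by the pattern is "lone" split
	# characters: nothing is left over as an unknown word.
	if ($chunk =~ /\A(?:$split)\z/) {
		return ($chunk, "", "");
	}
	my ($lead, $trail) = ("", "");
	# Strip one match of the pattern from each end.
	if ($chunk =~ s/\A($split)//) { $lead = $1; }
	if ($chunk =~ s/($split)\z//) { $trail = $1; }
	return ($lead, $chunk, $trail);
}

foreach my $pattern ("\\s", "\\s+") {
	my ($lead, $text, $trail) = trim_chunk("  word  ", $pattern);
	print "$pattern: [$lead][$text][$trail]\n";
}
```

In this sketch, the pattern C<\s> removes only one space from each end of "  word  ", so the remaining text is " word " and the surrounding spaces would still be treated as part of an unknown word. With C<\s+>, the whole run of spaces is removed at each end and only "word" remains, which is presumably why the default was changed.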