NAME

transliterate.pl - Transliterate text files

SYNOPSIS

transliterate.pl [options] [input file]

Start the transliteration engine with the given file as input. The input file defaults to STDIN if no filename is given.

OPTIONS

--output <filename>

Sets the output file to print to.

If the file exists already and --force is not set, the user is asked if the file should be overwritten or appended to.

Default: STDOUT (print to terminal)

--config <filename>

Sets the configuration file to use.

Default: config

--checkduplicates

Prints all duplicate words within single table files and across tables that are replaced within the same group, then exits the program.

Note that this simply prints all duplicates, even ones that are legitimate. When duplicates are found during normal operation of the program, they are simply combined in exactly the same way as the regular word choices.

Also note that the words are still added as possible choices, which may be slightly confusing. If, for instance, a word "word" is stored in the tables "tablea", "tableb", and "tablec" with the replacements "a", "b", and "c", the first duplicate message will say that the first occurrence was in table "tablea" with the replacement "a", and the second duplicate message will say that the first occurrence was in table "tablea" with the replacement "a$b" (assuming $ is the value set as choicesep in the config). This is just something to be aware of.

On that note, before duplicates are checked between tables in the same replacement group, duplicates inside the same file are already replaced, so that might be a bit confusing as well.

--dumptables

Prints the words of all tables that don't have nodisplay set.

This is mainly meant to be used for generating word lists in order to use them in a spell checker. Note that the words printed here are in UTF-8, normalized to NFC (Unicode Normalization Form C, canonical composition), so the list may not be ideal when the spell-checked text is not in the same form.
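As a rough illustration of why the normalization form matters, here is a minimal Python sketch (not part of transliterate.pl). It only uses the standard unicodedata module, and the sample word is an arbitrary assumption.

```python
import unicodedata

# The same word, once composed (NFC) and once decomposed (NFD).
composed = unicodedata.normalize("NFC", "cafe\u0301")
decomposed = unicodedata.normalize("NFD", "caf\u00e9")

# Both render identically, but a naive comparison treats them as different.
same_raw = composed == decomposed

# Normalizing both sides to one form before comparing avoids the mismatch.
same_normalized = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(len(composed), len(decomposed), same_raw, same_normalized)
```

A spell checker that compares code points directly would behave like the same_raw comparison, so it may be worth normalizing the checked text to NFC first.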

--nochoices

Disables prompting for the right word when multiple replacement words exist.

This can be used to "weed out" all the unknown words before commencing the laborious task of choosing the right word every time multiple options exist.

--nounknowns

Disables prompting for the right word when a word is not found in the database.

This can be used together with --nochoices to perform a quick test of how well the actual engine is working without having to click through all the prompts.

--debug

Prints information helpful for debugging problems with the match and group statements.

For each match or group statement which replaces anything, the original statement is printed (the format is a bit different than in the config) and each actual word that's replaced is printed.

--debugspecial

This option is only useful for automatic testing of the transliteration engine.

If --nochoices is enabled, each word in the input with multiple choices will be output, along with the number of choices (can be used to test the proper functioning of choicesep in the config file).

If --nounknowns is enabled, each unknown word in the input is printed (can be used to test that the ignore options are working correctly).

--force

Always overwrites the output and error files without asking.

--start <line number>

Starts at the given line number instead of the beginning of the file.

Note: when "Stop processing" is pressed, the current line number is printed out. This is the current line that was being processed, so it has not been printed to the output file yet and thus the program must be resumed at that line, not the one afterwards.

--errors <filename>

Specifies a file to write errors to. Note that this does not refer to actual errors, but to any words that were temporarily ignored (i.e. words for which "Ignore: This run" was clicked).

If no file is specified, nothing is written. If a file is specified that already exists and --force is not set, the user is prompted for action.

--help

Displays the full documentation.

DESCRIPTION

transliterate.pl will read the given input file and transliterate it based on the given configuration file, prompting the user for action if a word has multiple replacement options or is not found in the database.

See "CONFIGURATION" for details on what is possible.

Note that this is not some sort of advanced transliteration engine which understands the grammar of the language and tries to guess words based on that. This is only a glorified find-and-replace program with some extra features to make it useful for transliterating text using large wordlists.

WARNING: All input data is assumed to be UTF-8!

WORD CHOICE WINDOW

The word choice window is opened any time one word has multiple replacement options and prompts the user to choose one.

For each word with multiple options, the user must choose the right option and then press "Accept changes" to finalize the transliteration of the current line. The button to accept changes is selected by default, so it is possible to just press enter instead of manually clicking it. Before the line is finalized, the user may press "Undo" to undo any changes on the current line.

"Skip word" just leaves it as is. This shouldn't be needed in most cases since choicesep should always be set to a character that doesn't occur normally in the text anyways.

"Open in unknown word window" will open the unknown word window with the current word selected. This is meant as a helper if you notice that another word choice needs to be added.

Warning: This is very inconsistent and buggy! Since the unknown word window is just opened directly, it isn't modified to make more sense for this situation. Whenever "Add replacement" is pressed, the whole line is re-transliterated as usual, but the word choice window is opened again right afterwards. If you just want to go back to the word choice window, press the ignore button for "whole line" since that shouldn't break anything. There are weird inconsistencies, though - for instance, if you delete all words in the tables, then press "Reload config", the line will be re-transliterated and none of the words will actually be found, but it will still go on because control passes back to the word choice window no matter what. Also, none of the word choices that were already done on this line are saved since the line is restarted from the beginning. As I said, it's only there as a helper and is very buggy/inconsistent. Maybe I'll make everything work better in a future release.

"Stop processing" will exit the program and print the line number that was currently being processed.

UNKNOWN WORD WINDOW

The unknown word window is opened any time a word could not be replaced.

Both the context from the original script and the context from the transliterated version (so far) are shown. If a part of the text is selected in one of the text boxes and "Use selection as word" is pressed for the appropriate box, the selected text is used for the action that is taken subsequently. "Reset text" resets the text in the text box to its original state (except for the highlight because I'm too lazy to do that).

The possible actions are:

Ignore

"This run" only ignores the word until the program exits, while "Permanently" saves the word in the ignore file specified in the configuration. "Whole line" stops asking for unknown words on this line and prints the line out as it originally was in the file. Note that any words in the original line that contain choicesep will still cause the word choice window to appear due to the way it is implemented. Just press "Skip word" if that happens.

Retry without <display name>

Removes all characters specified in the corresponding retrywithout statement in the config from the currently selected word and re-transliterates just that word. The result is then pasted into the text box beside "Add replacement" so it can be added to a table. This is only a sort of helper for languages like Urdu in which words often can be written with or without diacritics. If the "base form" without diacritics is already in the tables, this button can be used to quickly find the transliteration instead of having to type it out again. Any part of the word that couldn't be transliterated is just pasted verbatim into the text box (but after the characters have been removed).

Note that the selection can still be modified after this, before pressing "Add to list". This could potentially be useful if a word is in a table that is expanded using "noroot" because for instance "Retry without diacritics" would only work with the full word (with the ending), but only the stem should be added to the list. If that is the case, "Retry without diacritics" could be pressed with the whole word selected, but the ending could be removed before actually pressing "Add to list".

A separate button is shown for every retrywithout statement in the config.

Add to list

Adds the word typed in the text box beside "Add replacement" to the selected table file as the replacement for the word currently selected and re-runs the replacement on the current line. All table files that do not have nodisplay set are shown as options (see "CONFIGURATION").

Warning: This simply appends the word and its replacement to the end of the file, so it will cause an error if there was no newline ("\n") at the end of the file before.
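One way to guard against this is to check the table files before starting a run. The following Python sketch is a hypothetical helper, not something shipped with transliterate.pl; the function name and the throwaway file are assumptions made for the demonstration.

```python
import os
import tempfile

def ensure_trailing_newline(path):
    """Add a final newline to a non-empty file that lacks one.

    Returns True if the file was changed.
    """
    with open(path, "rb+") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return False  # Empty file: nothing to fix.
        f.seek(-1, os.SEEK_END)
        if f.read(1) == b"\n":
            return False  # Already ends with a newline.
        f.write(b"\n")
        return True

# Demonstration on a throwaway file standing in for a table file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"word\treplacement")  # No trailing newline.
changed = ensure_trailing_newline(tmp.name)
changed_again = ensure_trailing_newline(tmp.name)
with open(tmp.name, "rb") as f:
    content = f.read()
os.remove(tmp.name)
print(changed, changed_again, content)
```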

Note that this always re-transliterates the entire line afterwards. This is to allow more flexibility. Consider, for instance, a compound word of which the first part is also a valid single word. If the entire line was not re-transliterated, it would be impossible to add a replacement for that entire compound word and have it take effect during the same run since the first part of the word would not even be available for transliteration anymore.

One problem is that the word is just written directly to the file and there is no undo. This is the way it currently is and will probably not change very soon. If a mistake is made, the word can always be removed again manually from the list and "Reload config" pressed.

Reload config

Reloads the configuration file along with all tables and re-runs the replacement on the current line. Note that this can take a short while since the entire word database has to be reloaded.

Stop processing

Prints the current line number to the terminal and exits the program.

The program can always be started again at this line number using the --start option if needed.

INTERNALS/EXAMPLES

This section was added to explain to the user how the transliteration process works internally since that may be necessary to understand why certain words are replaced the way they are.

First off, the process works line-by-line, i.e. no match or replace statement will ever match anything that crosses the end of a line.

Each line is initially stored as one chunk which is marked as untransliterated. Then, all match, matchignore, and replace (or, rather, group) statements are executed in the order they appear in the config file. Whenever a word/match is replaced, it is split off into a separate chunk which is marked as transliterated. A chunk marked as transliterated is entirely ignored by any replacement statements that come afterwards. Note that beginword and endword can always match at the boundary between an untransliterated and transliterated chunk. This is to facilitate automated replacement of certain grammatical constructions. For instance:

If the string "a-" could be attached as a prefix to any word and needed to be replaced as "b-" everywhere, it would be quite trivial to add a match statement 'match "a-" "b-" beginword'. If run on the text "a-word", where "word" is some word that should be transliterated as "word_replaced", and the group replace statement for the word comes after the match statement given above, the following would happen: First, the match statement would replace "a-" and split the text into the two chunks "b-" and "word", where "b-" is already marked as transliterated. Since "word" is now separate, it will be matched by the group replace statement later, even if it has beginword set and would normally not match if "a-" came before it. Thus, the final output will be "b-word_replaced", allowing for the uniform replacement of the prefix instead of having to add each word twice, once with and once without the prefix.

In certain cases, this behavior may not be desired. Consider, for instance, a prefix "c-" which cannot be replaced uniformly as in the example above due to differences in the source and destination script. Since it cannot be replaced uniformly, two words "word1" and "word2" would both need to be specified separately with replacements for "c-word1" and "c-word2". If, however, the prefix "c-" has an alternate spelling "c " (without the hyphen), it would be very useful to be able to automatically recognize that as well. This is where the nofinal attribute for the match statements comes in. If there is a match statement 'match "c " "c-" beginword nofinal', the replaced chunk is not marked as transliterated, so after executing this statement on the text "c word1", there will still only be one chunk, "c-word1", allowing for the regular word replacements to function properly.
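The two prefix examples above can be modeled with a small Python sketch. This is a simplified illustration, not the actual Perl implementation: the function names are made up, beginword is approximated by only matching at the start of a chunk, table words must equal a whole chunk, and the replacement "c-word1_replaced" is a hypothetical value.

```python
import re

def apply_match(chunks, pattern, replacement, final=True):
    """Replace `pattern` at the start of each untransliterated chunk.

    chunks is a list of (text, done) pairs; done=True means "transliterated".
    Matching only at the chunk start is a simplification of beginword.
    """
    result = []
    for text, done in chunks:
        m = None if done else re.match(pattern, text)
        if m is None:
            result.append((text, done))
        elif final:
            # Split the replacement off as a finished chunk.
            result.append((replacement, True))
            rest = text[m.end():]
            if rest:
                result.append((rest, False))
        else:
            # nofinal: rewrite in place and keep the chunk open for later rules.
            result.append((replacement + text[m.end():], False))
    return result

def apply_table(chunks, table):
    """Replace open chunks that equal a table word (whole-chunk match only)."""
    return [
        (table[text], True) if not done and text in table else (text, done)
        for text, done in chunks
    ]

def render(chunks):
    return "".join(text for text, _ in chunks)

table = {"word": "word_replaced", "c-word1": "c-word1_replaced"}

# Final match: "a-" is split off, so "word" is matched on its own afterwards.
chunks = apply_match([("a-word", False)], r"a-", "b-")
out_final = render(apply_table(chunks, table))

# nofinal: "c word1" becomes the single open chunk "c-word1".
chunks = apply_match([("c word1", False)], r"c ", "c-", final=False)
out_nofinal = render(apply_table(chunks, table))

print(out_final, out_nofinal)
```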

Once all the replacement statements have been processed, each chunk of text that is not marked as transliterated yet is split based on the split pattern specified in the config and all actual characters matched by the split pattern are marked as transliterated (this usually means all the spaces, newlines, quotation marks, etc.). Any remaining words/text chunks that are still marked as untransliterated are now processed by the unknown word window. If one of these remaining unknown chunks is present in the file specified by the ignore statement in the config, it is simply ignored and later printed out as is. After all untransliterated words have either had a replacement added or been ignored, any words with multiple replacement choices are processed by the word choice window. Once this is all done, the final output is written to the output file and the process is repeated with the next line. Note that the entire process is started again each time a word is added to a table or the config is reloaded from the unknown word window.
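The splitting step described above can be sketched in Python as follows. This is only an approximation under stated assumptions: the function name is hypothetical, the default pattern is the documented default for split, and the pattern must not contain capture groups of its own (which the config forbids anyway).

```python
import re

def split_unknown(text, split_pattern=r"\s+"):
    """Split an untransliterated chunk; separators count as finished."""
    chunks = []
    # Wrapping the pattern in a group makes re.split keep the separators.
    for i, part in enumerate(re.split(f"({split_pattern})", text)):
        if part:
            # Odd positions hold the text matched by the split pattern.
            chunks.append((part, i % 2 == 1))
    return chunks

chunks = split_unknown("foo  bar\n")
# Whatever is still open here would be handed to the unknown word window.
unknown = [text for text, done in chunks if not done]
print(chunks, unknown)
```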

CONFIGURATION

These are the commands accepted in the configuration file. Any parameters in square brackets are optional. Comments are started with #. Strings (filenames, regex strings, etc.) are enclosed in double quotes ("").

The match, matchignore, and replace commands are executed in the order they are specified, except that all replace commands within the same group are executed together.

The match and matchignore statements accept any RegEx strings and are thus very powerful. The group statements only work with the non-RegEx words from the tables, but are very efficient for large numbers of words and should thus be used for the main bulk of the words.

Any duplicate words found will cause the user to be prompted to choose one option every time the word is replaced in the input text.

Note that any regex strings specified in the config should not contain capture groups, as that would break the endword functionality since this is also implemented internally using capture groups. Capture groups are also entirely pointless in the config since they currently cannot be used as part of the replacement string in match statements. Lookaheads and lookbehinds are fine, though, and could be useful in certain cases.

All tables must be loaded before they are used, or there will be an error that the table does not exist.

Warning: If a replace statement is located before an expand statement that would have impacted the table used, there will be no error but the expand statement won't have any impact.

Basic rule of thumb: Always put the table statements before the expand statements and the expand statements before the replace statements.

split <regex string>

Sets the RegEx string to be used for splitting words. This is only used for splitting the words which couldn't be replaced after all replacement has been done, before prompting the user for unknown words.

Note that split should probably always contain at least \n, since otherwise all of the newlines will be marked as unknown words. Usually, this will be included anyway through \s.

Note also that split should probably include the + RegEx-quantifier since that allows the splitting function in the end to ignore several splitting characters right after each other (e.g. several spaces) in one go instead of splitting the string again for every single one of them. This shouldn't actually make any difference functionality-wise, though.

Default: \s+ (all whitespace)

beforeword <regex string>

Sets the RegEx string to be matched before a word if beginword is set.

Default: \s

afterword <regex string>

Sets the RegEx string to be matched after a word if endword is set.

Note that afterword should probably always contain at least \n, since otherwise words with endword set will not be matched at the end of a line.

beforeword and afterword will often be exactly the same, but they are left as separate options in case more fine-tuning is needed.

Default: \s

tablesep <string>

Sets the separator used to split the lines in the table files into the original and replacement word.

Default: Tab

choicesep <string>

Sets the separator used to split replacement words into multiple choices for prompting the user.

Default: $

comment <string>

If enabled, anything after <string> will be ignored on all lines in the input file. This will not be displayed in the unknown word window or word choice window but will still be printed in the end, with the comment character removed (that seems to be the most sensible thing to do).

Note that this is really just a "dumb replacement", so there's no way to prevent a line with the comment character from being ignored. Just try to always set this to a character that does not occur anywhere in the text (or don't use the option at all).

ignore <filename>

Sets the file of words to ignore.

This has to be set even if the file is just empty because the user can add words to it from the unknown word window.

table <table identifier> <filename> [nodisplay] [revert]

Load the table from <filename>, making it available for later use in the expand and replace commands using the identifier <table identifier>.

If nodisplay is set, the filename for this table is not shown in the unknown word window. If, however, the same filename is loaded again for another table that does not have nodisplay set, it is still displayed.

If revert is set, the original and replacement words are switched. This can be useful for creating a config for transliterating in the opposite direction with the same database. I don't know why I called it "revert" since it should actually be called "reverse". I guess I was a bit confused.

Note that if <filename> is not an absolute path, it is taken to be relative to the location of the configuration file.

The table files simply consist of tablesep-separated values, with the word in the original script first and the replacement word second. Both the original and replacement word can optionally have several parts separated by choicesep. If the original word has multiple parts, it is separated and each of the parts is added to the table with the replacement. If the replacement has multiple parts, the user will be prompted to choose one of the options during the transliteration process. If the same word occurs multiple times in the same table with different replacements, the replacements are automatically added as choices that will be handled by the word choice window.
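To make the file format concrete, here is a minimal Python sketch of how such a table could be parsed. It is not the program's actual loader: the function name and sample lines are hypothetical, the separators are the documented defaults, and skipping identical repeated replacements is an assumption.

```python
def load_table(lines, tablesep="\t", choicesep="$"):
    """Parse table lines into {original: [replacement choices]}."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        originals, replacements = line.split(tablesep, 1)
        # Each part of the original word gets the same replacement(s).
        for original in originals.split(choicesep):
            choices = table.setdefault(original, [])
            for replacement in replacements.split(choicesep):
                # Repeated words accumulate choices instead of overwriting.
                if replacement not in choices:
                    choices.append(replacement)
    return table

lines = ["w1$w2\tr1\n", "w3\tr2$r3\n", "w1\tr4\n"]
table = load_table(lines)
print(table)
```

Any word that ends up with more than one entry in its list would be handled by the word choice window.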

If, for whatever reason, the same table is needed twice, but with different endings, the table can simply be loaded twice with different IDs. If the same path is loaded, the table that has already been loaded will be reused. Note that this feature was added before adding revert, so the old table is used even if it had revert set and the new one doesn't. This is technically a problem, but I don't know of any real-world case where it would be a problem, so I'm too lazy to change it. Tell me if it actually becomes a problem for you.

WARNING: Don't load the same table file both with and without revert in the same config! When a replacement word is added through the GUI, the program has to know which way to write the words. Currently, whenever a table file is loaded with revert anywhere in the config (even if it is loaded without revert in a different place), words will automatically be written as if revert was on. I cannot currently think of any reason why someone would want to load a file both with and without revert in the same config, but I still wanted to add this warning just in case.

expand <table identifier> <word ending table> [noroot]

Expand the table <table identifier>, i.e. generate all the word forms using the word endings in <word ending table>. The result is stored under the same identifier <table identifier>.

Note: There used to be a <new table identifier> argument to create a new table in case one table had to be expanded with different endings. This has been removed because it was a bit ugly, especially since there wasn't a proper mapping from table IDs to filenames anymore. If this functionality is needed, the same table file can simply be loaded multiple times. See the table section above.

If noroot is set, the root forms of the words are not kept.

If the replacement for a word ending contains choicesep, it is split and each part is combined with the root form separately and the user is prompted to choose one of the options later. It is thus possible to allow multiple choices for the ending if there is a distinction in the replacement script but not in the source script. Note that each of the root words is also split into its choices (if necessary) during the expanding, so it is possible to use choicesep in both the endings and root words.
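The expansion can be pictured with the following Python sketch. It is a simplified model, not the real implementation: the function name and sample data are hypothetical, the order of the generated choices is an assumption, and collisions between expanded forms and existing words are ignored.

```python
def expand(table, endings, noroot=False, choicesep="$"):
    """Combine every root word with every ending.

    table and endings map an original string to a replacement string that
    may hold several choices joined by choicesep.
    """
    # With noroot, the root forms themselves are dropped from the result.
    expanded = {} if noroot else dict(table)
    for root, root_repl in table.items():
        for ending, ending_repl in endings.items():
            combos = [
                r + e
                for r in root_repl.split(choicesep)
                for e in ending_repl.split(choicesep)
            ]
            expanded[root + ending] = choicesep.join(combos)
    return expanded

table = {"root": "R1$R2"}
endings = {"s": "x$y"}
with_root = expand(table, endings)
without_root = expand(table, endings, noroot=True)
print(with_root, without_root)
```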

match <regex string> <replacement string> [beginword] [endword] [nofinal]

Perform a RegEx match using the given <regex string>, replacing it with <replacement string>. Note that the replacement cannot contain any RegEx (e.g. groups) in it. beginword and endword specify whether the match must be at the beginning or ending of a word, respectively, using the RegEx specified in beforeword and afterword. If nofinal is set, the string is not marked as transliterated after the replacement, allowing it to be modified by subsequent match or replace commands.

matchignore <regex string> [beginword] [endword]

Performs a RegEx match in the same manner as match, except that the original match is used as the replacement instead of specifying a replacement string, i.e. whatever is matched is just marked as transliterated without changing it.

group [beginword] [endword]

Begins a replacement group. All replace commands must occur between group and endgroup, since they are then grouped together and replaced in one go. beginword and endword act in the same way as specified for match and apply to all replace statements in this group.

replace <table identifier> [override]

Replace all words in the table with the identifier <table identifier>, using the beginword and endword settings specified by the current group.

Unless override is set on the latter table, if the same word occurs in two tables with different replacements, both are automatically added as choices. See "WORD CHOICE WINDOW".

override can be useful if the same database is used for both directions and one direction maps multiple words to one word, but in the other direction this word should always default to one of the choices. In that case, a small table with these special cases can be created and put at the end of the main group statement with override set. This is technically redundant since you could just add a special group with only the override table in it earlier in the config, but it somehow seems cleaner this way.
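The effect of override within a group can be sketched in Python like this. The function and table names are hypothetical and each table is reduced to a plain word-to-replacement mapping, so this only illustrates the merging rule described above, not the actual data structures.

```python
def merge_group(tables):
    """Merge (table, override) pairs in order into {word: [choices]}."""
    merged = {}
    for table, override in tables:
        for word, replacement in table.items():
            if override or word not in merged:
                # First occurrence, or an override table: replace outright.
                merged[word] = [replacement]
            elif replacement not in merged[word]:
                # Same word, different replacement: keep both as choices.
                merged[word].append(replacement)
    return merged

main = {"w": "a", "v": "x"}
extra = {"w": "b"}
special = {"w": "b"}

choices = merge_group([(main, False), (extra, False)])
forced = merge_group([(main, False), (special, True)])
print(choices, forced)
```

Without override the word "w" would trigger the word choice window; with override on the later table it silently resolves to "b".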

Note that a table must have been loaded before being used in a replace statement.

endgroup

End a replacement group.

retrywithout <display name> [character] [...]

Adds a button to the unknown word window to retry the replacements on the selected word, first removing the given characters. The button is named "<display name>" and located after the "Retry without" label. Whatever is found with the replacements is pasted into the regular text box for the "Add replacement" functionality.

This can be used as an aid when, for instance, words can be written with or without certain diacritics. If the actual word without diacritics is already in the database and there is a retrywithout statement for all the diacritics, the button can be used to quickly find the replacement for the word instead of having to type it out manually. The same goes for compound words that can be written with or without a space.

It is also possible to specify retrywithout without any characters, which just adds a button that takes whatever word is selected and retries the replacements on it. This can be useful if you want to manually edit words and quickly see if they are found with the edits in place.

Note that all input text is first normalized to the Unicode canonical decomposition form (NFD) so that diacritics can be removed individually.

Also note that all buttons are currently just dumped in the GUI without any sort of wrapping, so they'll run off the screen if there are too many. Tell me if this becomes a problem. I'm just too lazy to change it right now.

Small warning: This only removes the given characters from the word selected in the GUI, not from the tables. Thus, this only works if the version of the word without any of the characters is already present in the tables. It would be useful when handling diacritics if the program could simply make a comparison while completely ignoring diacritics, but I haven't figured out a nice way to implement that yet.
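A minimal Python sketch of the idea behind retrywithout follows. It is an illustration only: the function name and table contents are hypothetical, and the lookup is reduced to a whole-word dictionary access instead of the real replacement process.

```python
import unicodedata

def retry_without(word, chars, table):
    """Strip `chars` from `word` after NFD normalization, then look it up."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if ch not in chars)
    # Fall back to the stripped word if the table has no entry for it.
    return table.get(stripped, stripped)

# Hypothetical data: the table only knows the form without the diacritic.
table = {"resume": "RESUME"}
acute = "\u0301"  # COMBINING ACUTE ACCENT

found = retry_without("r\u00e9sum\u00e9", [acute], table)
missing = retry_without("caf\u00e9", [acute], table)
print(found, missing)
```

The second call shows the limitation from the warning above: since no stripped form of the word is in the table, the stripped text is simply returned as is.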

Historical note: This was called diacritics in a previous version and only allowed removal of diacritics. This is exactly the same functionality, just generalized to allow removal of any characters with different buttons.

targetdiacritics <diacritic> [...]

This was only added to simplify transliteration from Hindi to Urdu with the same database. When this is set, the choices in the word choice window are sorted in descending order based on the number of diacritics from this list that are matched in each choice. This is so that when transliterating from Hindi to Urdu, the choice with the most diacritics is always at the top.

Additionally, if there are exactly two choices for a word and one of them contains diacritics but the other one doesn't, the one containing diacritics is automatically taken without ever prompting the user. This is, admittedly, a very language-specific feature, but I couldn't think of a simple way of adding it without building it directly into the actual program.
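The sorting and the automatic pick can be sketched in Python as follows. This is a guess at the behavior based only on the description above: the function name is hypothetical, diacritics are counted by number of occurrences (an assumption), and the sample strings use Latin letters with Arabic combining marks purely as placeholders.

```python
def order_choices(choices, diacritics):
    """Sort choices by descending diacritic count; auto-pick when clear."""
    def count(choice):
        return sum(choice.count(d) for d in diacritics)

    # sorted() is stable, so ties keep their original relative order.
    ordered = sorted(choices, key=count, reverse=True)
    auto = None
    if len(ordered) == 2 and count(ordered[0]) > 0 and count(ordered[1]) == 0:
        # Exactly two choices and only one has diacritics: take it directly.
        auto = ordered[0]
    return ordered, auto

diacritics = ["\u0650", "\u064e"]  # ARABIC KASRA, ARABIC FATHA

two, picked = order_choices(["ktab", "k\u0650tab"], diacritics)
three, none_picked = order_choices(
    ["a", "a\u0650", "a\u0650\u064e"], diacritics
)
print(two, picked, three, none_picked)
```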

Note that due to the way this is implemented, it will not take any effect if --nochoices is enabled.

The attentive reader will notice at this point that most of the features in this program were added specifically for dealing with Urdu and Hindi, which does appear to make sense, considering that this program was written specifically for transliterating Urdu to Hindi and vice versa (although not quite as much vice versa).

BUGS

Although it may not seem like it, one of the ugliest parts of the program is the GUI functionality that allows the user to add a replacement word. The problem is that all information about the expand and replace statements has to be kept in order to properly handle adding a word to one of the files and simultaneously adding it to the currently loaded tables without reloading the entire config. The way it currently works, the replacement word is directly written to the file, then all expand statements that would have impacted the words from this file are redone (just for the newly added word) and the resulting words are added to the appropriate tables (or, technically, the appropriate 'trie'). Since a file can be mapped to multiple table IDs and a table ID can occur in multiple replace statements, this is more complicated than it sounds, and thus it is very likely that there are bugs lurking here somewhere. Do note that "Reload config" will always reload the entire configuration, so that's safe to do even if the on-the-fly replacing doesn't work.

In general, I have tested the GUI code much less than the rest since you can't really test it automatically very well.

The code is generally quite nasty, especially the parts belonging to the GUI. Don't look at it.

Tell me if you find any bugs.

SEE ALSO

perlre, perlretut

LICENSE

Copyright (c) 2019, 2020, 2021 lumidify <nobody[at]lumidify.org>

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.