Notes - transliterate_data - Data for Urdu<->Hindi transliteration


1 NOTE REGARDING THE TABLES
2
3 The word tables are split into two groups, nouns_adjectives and verbs. Within each group, the tables are divided according to the way in which the stems are inflected. The two 'irregular.txt' files are for any word that is not to be expanded/inflected.
4
5 Note: When adding new words to the tables, it is important to understand WHAT to add. In the irregular.txt tables, the whole word is added. In all other tables, only the stem is added; the program then adds the inflections.
6
7 An example from each table is given below. On the left is the stem, on the right one of the inflections/expansions.
8
9 VERBS
10
11 irregular سیوں گا सियूँगा > [no expansion]
12 regular_consonant_ending ابال उबाल > ابالنا उबालना
13 regular_ending_in_a_o آزما आज़मा > آزمانا आज़माना
14
15 NOUNS/ADJECTIVES
16
17 adjectiveregular_a_i آدھ आध > آدھا आधा
18 irregular آئین आईन > [no expansion]
19 ahmasc آلود आलूद > آلودہ आलूदा
20 aishortmasc افع अफ़ > افعی अफ़इ
21 amasc آٹ आट > آٹا आटा
22 an آٹھو आठव > آٹھواں आठवाँ
23 cfem آتش आतिश > آتشیں आतिशें
24 cmasc آبشار आबशार > آبشاروں आबशारों
25 ifem آباد आबाद > آبادی आबादी
26 ifemshort مورت मूर्त > مورتی मूर्ति
27 imasc آدم आदम > آدمی आदमी
28 o_a_staysfem ابتدا इब्तिदा > ابتداؤں इब्तिदाओं
29 u_staysfem آرز आरज़ > آرزو आरज़ू
30 o_a_staysmasc دانا दाना > داناؤں दानाओं
31 u_staysmasc آنس आँस > آنسو आँसू
32 ui_oi_ai_mascfem ابتدا इब्तिदा > ابتدائی इब्तिदाई
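
The stem/expansion idea above can be illustrated with a minimal Python sketch. It is not the program's actual code: the function name `expand` and the dictionary `EXAMPLE_SUFFIXES` are hypothetical, and the suffixes are simply read off the single example given for each table. The real program presumably generates more inflections per table than the one shown here.

```python
# Suffix pairs (Urdu, Hindi) read off the examples in this note.
# Assumption: one suffix per table, purely for illustration.
EXAMPLE_SUFFIXES = {
    "regular_consonant_ending": ("نا", "ना"),
    "amasc": ("ا", "ा"),
    "cmasc": ("وں", "ों"),
}


def expand(table, urdu_stem, hindi_stem):
    """Return the (Urdu, Hindi) forms generated for one table entry."""
    if table == "irregular":
        # Irregular tables hold whole words, so nothing is appended.
        return [(urdu_stem, hindi_stem)]
    urdu_suffix, hindi_suffix = EXAMPLE_SUFFIXES[table]
    return [(urdu_stem + urdu_suffix, hindi_stem + hindi_suffix)]


if __name__ == "__main__":
    for pair in expand("regular_consonant_ending", "ابال", "उबाल"):
        print(pair)
```

This also shows why the wrong kind of entry is harmful: a full word placed in a stem table would have suffixes appended to it.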
33
34 TABLES IN DATA FOLDER
35
36 The data folder contains a number of further tables that deal with punctuation, exceptions and special cases:
37
38 ignore: words that are ignored permanently.
39 punctuation: for conversion of punctuation.
40 misc_beginword.ur_hi: word parts ("prefixes") at the beginning of word compounds
41 misc_endword: word parts ("suffixes") at the end of word compounds
42 special: special cases (no beginword endword)
43 exceptions_beginword_endword.ur_hi: override multiple choices for common words found in the preceding tables.
44 exceptions_beginword.hi_ur: exceptions which need to be replaced before the following match statements.
45 exceptions_beginword_endword.hi_ur: override multiple choices for common words found in the preceding tables.
46 pairs_middle_e_o: The Persian Genitive े- (eg मुल्के-मिसर) conflicts with word pairs whose first word also ends in े-, such as नवासे-नवासियाँ. These word pairs are regular inflections and do not contain a Persian Genitive, so in Urdu script the first word of the pair ends in ے + space and not ِ + space. Word pairs conflicting with the Persian Genitive have been put into the new file 'pairs.middle_e_o'. Word pairs with و at the end of the first word have also been placed here, eg دو ایک दो-एक, as these conflict with the rule regarding the copula و linking words in Urdu.
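
The role of the pairs table can be sketched in Python as follows. This is only an illustration of the lookup order described above, not the program's implementation: the set `PAIRS`, the function `classify_e_compound` and the returned labels are all hypothetical, and the table content is limited to the example from this note.

```python
# Hypothetical in-memory stand-in for the pairs table; the single entry
# is the example word pair given in this note.
PAIRS = {"नवासे-नवासियाँ"}

E_MATRA = "\u0947"  # Devanagari vowel sign e


def classify_e_compound(compound, pairs=PAIRS):
    """Decide how a hyphenated Hindi compound ending its first word in e is treated."""
    first, sep, _rest = compound.partition("-")
    if not sep or not first.endswith(E_MATRA):
        return "not_applicable"
    if compound in pairs:
        # Regular inflection: in Urdu the first word ends in ے + space.
        return "regular_pair"
    # Otherwise treat it as a Persian Genitive: zer + space in Urdu.
    return "persian_genitive"
```

The pairs lookup has to come before the genitive rule, which is why a wrongly added pair would silently change the output for a genuine genitive.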
47
48 CAREFUL: If you add the wrong words to these tables, you can mess up the conversion process!
49
50 THE CONFIG FILES
51 There are two config files.
52
53 config.hi_ur: the config to use when converting Hindi to Urdu.
54 config.ur_hi: the config to use when converting Urdu to Hindi.
55
56 NOTE: The tables in the data folder relating only to one of these two configs are labelled accordingly, ie xxxxx.hi_ur.txt or xxxxx.ur_hi.txt
57
58 Tables that are not labelled either way are used by both config files.
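
The labelling convention can be expressed as a small filter. The Python sketch below is an assumption about how one might select tables by file name; the function `tables_for` and the file names used with it are hypothetical and only follow the xxxxx.hi_ur.txt / xxxxx.ur_hi.txt pattern described above.

```python
def tables_for(direction, filenames):
    """Return the table files relevant to one conversion direction.

    direction is "hi_ur" or "ur_hi". Files labelled for the other
    direction are skipped; unlabelled files apply to both.
    """
    if direction not in ("hi_ur", "ur_hi"):
        raise ValueError("direction must be 'hi_ur' or 'ur_hi'")
    other = "ur_hi" if direction == "hi_ur" else "hi_ur"
    return [name for name in filenames if not name.endswith(f".{other}.txt")]


if __name__ == "__main__":
    names = ["ignore.txt", "misc_beginword.ur_hi.txt", "exceptions_beginword.hi_ur.txt"]
    print(tables_for("ur_hi", names))
```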
59
60 !!!THINGS TO KEEP IN MIND!!!
61
62 * -से needs to be handled manually, because in most cases it is the postposition से and not the 'adjective' से. के-से can be handled through search/replace. The remaining cases are best found by reading through the text.
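
As an aid for the manual pass, the lines that need attention can be listed first. The Python sketch below is a coarse helper, not part of the program: the function `find_se_candidates` is hypothetical, it assumes that the cases in question are written with a hyphen as in the examples above, and it only reports lines containing -से that is not part of के-से. Deciding whether each hit is the postposition or the 'adjective' still has to be done by reading.

```python
import re

# "-से" not immediately preceded by "के" (के-से is handled by search/replace).
SE_PATTERN = re.compile(r"(?<!के)-से")


def find_se_candidates(text):
    """Return (line_number, line) for every line with a -से to check by hand."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if SE_PATTERN.search(line):
            hits.append((number, line))
    return hits


if __name__ == "__main__":
    sample = "वह घर-से आया\nउस के-से लोग"
    for number, line in find_se_candidates(sample):
        print(number, line)
```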
63
64 * Also make sure you have gtk2-perl installed!
65
66

	git clone git://lumidify.org/transliterate_data.git (fast, but not encrypted)
	git clone https://lumidify.org/transliterate_data.git (encrypted, but very slow)
	git clone git://4kcetb7mo7hj6grozzybxtotsub5bempzo4lirzc3437amof2c2impyd.onion/transliterate_data.git (over tor)