transliterate_data

Data for Urdu<->Hindi transliteration
git clone git://lumidify.org/transliterate_data.git (fast, but not encrypted)
git clone https://lumidify.org/transliterate_data.git (encrypted, but very slow)
git clone git://4kcetb7mo7hj6grozzybxtotsub5bempzo4lirzc3437amof2c2impyd.onion/transliterate_data.git (over tor)

Notes (4031B)


      1 NOTE REGARDING THE TABLES
      2 
      3 The word tables are divided into two groups, nouns_adjectives and verbs. Within each group, the tables are divided according to how the stems are inflected. The two 'irregular.txt' files hold any word that is not to be expanded/inflected.
      4 
      5 Note: When adding new words to the tables, it is important to understand WHAT to add. In the irregular.txt tables, the whole word is added. In all other tables, only the stem is added; the program then adds the inflections.
      6 
      7 An example from each table is given below. On the left is the stem, on the right one of the inflections/expansions. 
      8 
      9 VERBS
     10 
     11 irregular	سیوں گا	सियूँगा	> [no expansion]
     12 regular_consonant_ending	ابال	उबाल	>	ابالنا	उबालना
     13 regular_ending_in_a_o	آزما	आज़मा	>	آزمانا	आज़माना
     14 
     15 NOUNS/ADJECTIVES
     16 
     17 adjectiveregular_a_i	آدھ	आध > آدھا	आधा	
     18 irregular	آئین	आईन	> [no expansion]
     19 ahmasc	آلود	आलूद	> آلودہ	आलूदा
     20 aishortmasc	افع	अफ़	> افعی	अफ़इ
     21 amasc	آٹ	आट	> آٹا	आटा
     22 an	آٹھو	आठव	> آٹھواں	आठवाँ
     23 cfem	آتش	आतिश	> آتشیں	आतिशें
     24 cmasc	آبشار	आबशार	> آبشاروں	आबशारों
     25 ifem	آباد	आबाद	> آبادی	आबादी
     26 ifemshort	مورت	मूर्त	> مورتی	मूर्ति
     27 imasc	آدم	आदम	> آدمی	आदमी
     28 o_a_staysfem	ابتدا	इब्तिदा	> ابتداؤں	इब्तिदाओं
     29 u_staysfem	آرز	आरज़	> آرزو	आरज़ू
     30 o_a_staysmasc	دانا	दाना	> داناؤں	दानाओं
     31 u_staysmasc	آنس	आँस	> آنسو	आँसू
     32 ui_oi_ai_mascfem	ابتدا	इब्तिदा	> ابتدائی	इब्तिदाई
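
The difference between whole-word tables and stem tables can be sketched roughly as follows. This is only an illustration in Python, not the program's actual code: the function name `expand` and the ending lists are assumptions, and the endings are inferred solely from the examples above (the real lists are certainly longer).

```python
# Hypothetical sketch: the program's real ending lists are not shown in these notes.
def expand(stem_ur, stem_hi, endings):
    """Return {urdu_form: hindi_form} for one stem and a list of (ur, hi) endings."""
    return {stem_ur + end_ur: stem_hi + end_hi for end_ur, end_hi in endings}

# Endings inferred from the examples above (assumed and incomplete).
VERB_CONSONANT_ENDINGS = [("نا", "ना")]
AMASC_ENDINGS = [("ا", "ा")]

lookup = {}
# irregular.txt: the whole word is stored and nothing is appended.
lookup["آئین"] = "आईन"
# Stem tables: only the stem is stored; the endings are appended here.
lookup.update(expand("ابال", "उबाल", VERB_CONSONANT_ENDINGS))
lookup.update(expand("آٹ", "आट", AMASC_ENDINGS))
```

Under these assumptions the bare stems never appear in `lookup`, which is why adding a full word to a stem table (or a stem to irregular.txt) would produce wrong forms.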
     33 
     34 TABLES IN DATA FOLDER
     35 
     36 The data folder contains a number of further tables to cope with punctuation, exceptions and special cases:
     37 
     38 ignore: words that are ignored permanently
     39 punctuation: conversion of punctuation
     40 misc_beginword.ur_hi: word parts ("prefixes") at the beginning of word compounds
     41 misc_endword: word parts ("suffixes") at the end of word compounds
     42 special: special cases (no beginword endword)
     43 exceptions_beginword_endword.ur_hi: override multiple choices for common words found in the preceding tables.
     44 exceptions_beginword.hi_ur: exceptions which need to be replaced before the following match statements.
     45 exceptions_beginword_endword.hi_ur: override multiple choices for common words found in the preceding tables.
     46 pairs_middle_e_o: The Persian Genitive े- (eg मुल्के-मिसर) conflicts with word pairs that contain the same sequence, such as नवासे-नवासियाँ. These word pairs are regular inflections and do not contain a Persian Genitive, so in Urdu script the first word of the pair ends in ے + space and not ِ + space. Word pairs conflicting with the Persian Genitive have been put into this file. Word pairs with و at the end of the first word have also been placed here, eg دو ایک	दो-एक, as these conflict with the rule regarding the copula و linking words in Urdu.
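
The role of the pairs table can be sketched as a check that runs before the genitive/copula rules. This is a rough Python illustration under stated assumptions: the function name `classify`, the return labels and the in-memory set are hypothetical, and the set holds only the two Hindi pairs quoted in the note above.

```python
# Hypothetical sketch; the names and return labels are assumptions.
PAIRS = {"नवासे-नवासियाँ", "दो-एक"}  # entries quoted in the note above

VOWEL_SIGN_E = "\u0947"  # Devanagari vowel sign e
VOWEL_SIGN_O = "\u094b"  # Devanagari vowel sign o

def classify(token, pairs=PAIRS):
    """Say which kind of handling a Hindi token would need."""
    if token in pairs:
        return "pair"   # regular word pair: take the form from the pairs table
    if VOWEL_SIGN_E + "-" in token or VOWEL_SIGN_O + "-" in token:
        return "rule"   # candidate for the Persian genitive / copula rule
    return "plain"      # no conflict, normal lookup
```

With these inputs, `classify("मुल्के-मिसर")` gives `"rule"` while `classify("नवासे-नवासियाँ")` gives `"pair"`, which is the distinction the table exists to make.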
     47 
     48 CAREFUL: If you add the wrong words to these tables, you can mess up the conversion process!
     49 
     50 THE CONFIG FILES
     51 There are two config files.
     52 
     53 config.hi_ur: the config to use when converting Hindi to Urdu.
     54 config.ur_hi: the config to use when converting Urdu to Hindi.
     55 
     56 NOTE: The tables in the data folder relating only to one of these two configs are labelled accordingly, ie xxxxx.hi_ur.txt or xxxxx.ur_hi.txt
     57 
     58 Tables which are not labelled in either way relate to both config files.
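
The labelling convention can be sketched as a small filter. This is a Python illustration only: the function name `tables_for` is hypothetical, and the file names in the example (including the `.txt` extension on `punctuation`) are assumed from the pattern described above rather than taken from a directory listing.

```python
# Hypothetical sketch of the naming convention; not the program's own code.
def tables_for(direction, filenames):
    """Keep tables labelled for `direction` plus tables with no direction label."""
    labels = ("hi_ur", "ur_hi")
    selected = []
    for name in filenames:
        # A direction label appears as its own dot-separated part of the name.
        found = [part for part in name.split(".") if part in labels]
        if not found or direction in found:
            selected.append(name)
    return selected

# Assumed example file names following the xxxxx.hi_ur.txt pattern.
names = [
    "exceptions_beginword.hi_ur.txt",
    "misc_beginword.ur_hi.txt",
    "punctuation.txt",
]
```

Here `tables_for("hi_ur", names)` keeps the first and last entries, and `tables_for("ur_hi", names)` keeps the last two.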
     59 
     60 !!!THINGS TO KEEP IN MIND!!!!
     61 
     62 * -से needs to be handled manually, as in most cases it is the postposition से and not the 'adjective' से. के-से can be done through search/replace. It is better to find the remaining cases by reading through the text.
     63 
     64 * Also make sure you have gtk2-perl installed!
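
For the -से item above, a small helper can at least list the places that need a look. This is a hedged Python sketch, not part of the program: the function name `find_se` and the sample sentence are made up, and the helper only locates tokens, it does not decide how to convert them.

```python
import re

# Any run of non-space characters that ends in -से.
SE_PATTERN = re.compile(r"\S+-से(?!\S)")

def find_se(text):
    """Return (token, needs_manual_check) for every token ending in -से."""
    return [(m.group(0), m.group(0) != "के-से") for m in SE_PATTERN.finditer(text)]

sample = "घर-से आया और उस के-से लोग"  # made-up sample text
hits = find_se(sample)
```

Tokens flagged `False` are the के-से cases that can go through search/replace; everything flagged `True` still has to be read in context.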
     65 
     66