Encoding Arabic letters with their diacritics (if they exist)


I'm working on a deep learning project that uses an RNN, and I want to encode the data before it's fed to the network. The input is Arabic verses, whose diacritics are treated as separate characters in Python. If a character is followed by a diacritic, I should encode the two together as one number; otherwise I encode the character alone.

Since this has to be done for millions of verses, I was hoping to use map with a lambda. However, I cannot iterate over two characters at a time, i.e., I was hoping for something like:

map(lambda ch, next_ch: encode(ch + next_ch) if is_diacritic(next_ch) else encode(ch), verse)

My intention behind the question is to find the fastest way to accomplish the above. There is no restriction to lambda functions, but answers based on plain for loops are not what I'm looking for.

A close example for non-Arabic speakers: assume you want to encode the following text:

 XXA)L_I!I%M<LLL>MMQ*Q

You want to encode each letter together with the character following it if that character is a special character; otherwise you encode the letter alone.

Output:

['X', 'X', 'A)', 'L_', 'I!', 'I%', 'M<', 'L', 'L', 'L>', 'M', 'M', 'Q*', 'Q']

For Arabic speakers:

Verse example:

"قفا نبك من ذِكرى حبيب ومنزل بسِقطِ اللّوى بينَ الدَّخول فحَوْمل"

Diacritics are the small symbols above or below the letters (e.g., ّ , ْ ).


[Update]

The range of diacritics starts at 0x64B (1611) and ends at 0x652 (1618).

The letters range from 0x621 (1569) to 0x63A (1594) and from 0x641 (1601) to 0x64A (1610).

A letter can have at most one diacritic.
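These ranges translate directly into predicate functions. A minimal sketch (the function names are my own):

```python
def is_diacritic(ch):
    """True if ch is an Arabic diacritic (U+064B through U+0652)."""
    return 0x064B <= ord(ch) <= 0x0652

def is_letter(ch):
    """True if ch is an Arabic letter (U+0621-U+063A or U+0641-U+064A)."""
    return 0x0621 <= ord(ch) <= 0x063A or 0x0641 <= ord(ch) <= 0x064A
```

For example, `is_diacritic('\u0651')` (shadda) is True, while `is_letter('\u0642')` (qaf) is True.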


Extra information:

An encoding methodology similar to what I'm doing represents the binary form of the verse as a matrix with shape (number of bits needed, number of characters in the verse). Both the number of bits and the number of characters are calculated after each letter is combined with its diacritic, if one exists.

For example, assume the verse is the following, and diacritics are special characters:

X+Y_XX+YYYY_

The different alphabet combinations are:

['X', 'X+', 'X_', 'Y', 'Y+', 'Y_']  

Hence I need at least 3 bits to represent these 6 characters, so the number of bits needed is 3.
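The bit count is simply the ceiling of log2 of the alphabet size:

```python
import math

alphabet = ['X', 'X+', 'X_', 'Y', 'Y+', 'Y_']
bits_needed = math.ceil(math.log2(len(alphabet)))  # ceil(log2(6)) == 3
```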

Consider the following encodings:

{
'X' : '000',
'X+': '001',
'X_': '010',
'Y' : '011',
'Y+': '100',
'Y_': '101',
}

And I get to represent the example in a matrix as (binary representation is vertical):

X+     Y_    X    X+    Y    Y    Y    Y_
0      1     0    0     0    0    0    1
0      0     0    0     1    1    1    0
1      1     0    1     1    1    1    1

Which is why I am looking to combine diacritics with letters first.
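As a sketch of how such a matrix could be built with numpy (the code dictionary is the one above; extracting bits via right-shifts is my own choice of method):

```python
import numpy as np

codes = {'X': 0b000, 'X+': 0b001, 'X_': 0b010,
         'Y': 0b011, 'Y+': 0b100, 'Y_': 0b101}
tokens = ['X+', 'Y_', 'X', 'X+', 'Y', 'Y', 'Y', 'Y_']  # "X+Y_XX+YYYY_" combined

vals = np.array([codes[t] for t in tokens], dtype=np.uint8)
n_bits = 3
# One column per combined character, most significant bit on top
matrix = (vals[None, :] >> np.arange(n_bits - 1, -1, -1)[:, None]) & 1
```

The resulting matrix has shape (3, 8), and its first two columns are [0, 0, 1] ('X+') and [1, 0, 1] ('Y_'), matching the example above.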


Note: the answers to "Iterate over a string 2 (or n) characters at a time in Python" and "Iterating each character in a string using Python" do not give the intended result.

I'm going to throw my hat into the ring with numpy here. You can convert a string into a usable format with

arr = np.array([verse]).view(np.uint32)

You can mask the locations where the following character is diacritical:

mask = np.empty(arr.shape, dtype=bool)
np.bitwise_and((arr[1:] > lower), (arr[1:] < upper), out=mask[:-1])
mask[-1] = False

Here, the range (lower, upper) is a made-up check for diacritics; implement the actual check however you like. In this example, I used the full-blown form of bitwise_and with a preallocated empty array to avoid a potentially expensive append of the last element.

Now if you have a numerical method for encoding your code points to a number, which I'm sure you can vectorize, you can do something like:

combined = combine(letters=arr[mask], diacritics=arr[1:][mask[:-1]])

To get the remaining, uncombined characters, you have to remove both the diacritics and the characters they bind to. The easiest way I can think of is smearing the mask one position to the right and negating it. Again, I assume that you have a vectorized method to encode the single characters as well:

smeared = mask.copy()
smeared[1:] |= mask[:-1]
single = encode(arr[~smeared])
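To see what the smearing does, here is a toy mask (True marks a letter followed by a diacritic):

```python
import numpy as np

mask = np.array([True, False, False, True, False])
smeared = mask.copy()
smeared[1:] |= mask[:-1]
# smeared now flags the letters *and* their trailing diacritics:
# [True, True, False, True, True]
```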

Combining the results into a final array is conceptually simple but takes a couple of steps. The result will be np.count_nonzero(mask) elements shorter than the input, since the diacritics are removed. Each combined character therefore has to be shifted left by the number of diacritics that precede it. Here's one way to do it:

ind = np.flatnonzero(mask)
nnz = ind.size
ind -= np.arange(nnz)

output = np.empty(arr.size - nnz, dtype='U1')
output[ind] = combined

# mask of unmodified elements
out_mask = np.ones(output.size, dtype=bool)
out_mask[ind] = False
output[out_mask] = single
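A toy example of the index arithmetic: after the diacritics are dropped, each combined character lands at its original position minus the number of diacritics removed before it:

```python
import numpy as np

mask = np.array([True, False, False, True, False, False])
ind = np.flatnonzero(mask)   # positions of letters with diacritics: [0, 3]
ind -= np.arange(ind.size)   # shift left by diacritics already removed: [0, 2]
```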

The reason I'm suggesting numpy is that it should be able to handle a few million characters in a matter of seconds this way. Getting the output back as a string should be straightforward.
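For instance, one way to turn a contiguous array of single characters back into a Python string is a dtype view (a sketch of the trick, assuming a 'U1' array):

```python
import numpy as np

out = np.array(['a', 'b', 'c'], dtype='U1')
s = out.view(f'U{out.size}').item()  # view the whole buffer as one string
```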

Suggested Implementation

I have been pondering your question and decided to play with some timings and possible implementations. My idea was to map the Unicode characters in 0x0621-0x063A and 0x0641-0x064A (26 + 10 = 36 letters) into the lower 6 bits of a uint16, and the characters 0x064B-0x0652 (8 diacritics) to the values 1-8 in the bits above that (with 0 meaning "no diacritic"), assuming these are in fact the only diacritics you need:

def encode_py(char):
    char = ord(char) - 0x0621
    if char >= 0x20:
        char -= 5
    return char

def combine_py(char, diacritic):
    return encode_py(char) | ((ord(diacritic) - 0x064A) << 6)

In numpy terms:

def encode_numpy(chars):
    chars = chars - 0x0621
    return np.subtract(chars, 5, where=chars >= 0x20, out=chars)

def combine_numpy(chars, diacritics):
    chars = encode_numpy(chars)
    chars |= (diacritics - 0x064A) << 6
    return chars

You can choose to encode further to shorten the representation slightly, but I would not recommend it. This representation has the advantage of being verse-independent, so you can compare portions of different verses, as well as not worry about which representation you're going to get depending on how many verses you encoded together. You can even mask out the top bits of all the codes to compare the raw characters, without diacritics.
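For example, comparing the raw letters of two encoded arrays while ignoring diacritics (the values here are illustrative):

```python
import numpy as np

a = np.array([7 | (7 << 6), 28], dtype=np.uint16)  # letter 7 with diacritic 7, letter 28
b = np.array([7, 28], dtype=np.uint16)             # same letters, no diacritics
same_letters = (a & 0x3F) == (b & 0x3F)            # masks out the diacritic bits
```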

So let's say your verse is a collection of randomly generated characters in those ranges, with each letter followed by at most one randomly generated diacritic. We can generate a string of length around a million quite easily for comparison purposes:

import random

random.seed(0xB00B5)

alphabet = list(range(0x0621, 0x063B)) + list(range(0x0641, 0x064B))
diacritics = list(range(0x064B, 0x0653))

alphabet = [chr(x) for x in alphabet]
diacritics = [chr(x) for x in diacritics]

def sample(n=1000000, d=0.25):
    while n:
        yield random.choice(alphabet)
        n -= 1
        if n and random.random() < d:
            yield random.choice(diacritics)
            n -= 1

data = ''.join(sample())

This data has completely randomly distributed characters, with approximately a 25% chance of any character being followed by a diacritic. It takes just a few seconds to generate on my not-too-overpowered laptop.

The numpy conversion would look like this:

def convert_numpy(verse):
    arr = np.array([verse]).view(np.uint32)
    mask = np.empty(arr.shape, dtype=bool)
    mask[:-1] = (arr[1:] >= 0x064B)
    mask[-1] = False

    combined = combine_numpy(chars=arr[mask], diacritics=arr[1:][mask[:-1]])

    smeared = mask.copy()
    smeared[1:] |= mask[:-1]
    single = encode_numpy(arr[~smeared])

    ind = np.flatnonzero(mask)
    nnz = ind.size
    ind -= np.arange(nnz)

    output = np.empty(arr.size - nnz, dtype=np.uint16)
    output[ind] = combined

    # mask of unmodified elements
    out_mask = np.ones(output.size, dtype=bool)
    out_mask[ind] = False
    output[out_mask] = single

    return output

Benchmarks

And now let's %timeit to see how it goes. First, here are the other implementations. I convert everything to a numpy array or a list of integers for fair comparison. I've also made minor modifications to make the functions return lists of the same quantities to validate accuracy:

from itertools import tee, zip_longest
from functools import reduce

def is_diacritic(c):
    return ord(c) >= 0x064B

def pairwise(iterable, fillvalue):
    """ Slightly modified itertools pairwise recipe
    s -> (s0,s1), (s1,s2), (s2, s3), ... 
    """
    a, b = tee(iterable)
    next(b, None)
    return zip_longest(a, b, fillvalue=fillvalue)

def combine_py2(char, diacritic):
    return char | ((ord(diacritic) - 0x064A) << 6)

def convert_FHTMitchell(verse):
    def convert(verse):
        was_diacritic = False  # variable to keep track of diacritics -- stops us checking same character twice

        # fillvalue will not be encoded but ensures last char is read
        for this_char, next_char in pairwise(verse, fillvalue='-'):
            if was_diacritic:  # last next_char (so this_char) is diacritic
                was_diacritic = False
            elif is_diacritic(next_char):
                yield combine_py(this_char, next_char)
                was_diacritic = True
            else:
                yield encode_py(this_char)

    return list(convert(verse))

def convert_tobias_k_1(verse):
    return reduce(lambda lst, x: lst + [encode_py(x)] if not is_diacritic(x) else lst[:-1] + [combine_py2(lst[-1], x)], verse, [])

def convert_tobias_k_2(verse):
    res = []
    for x in verse:
        if not is_diacritic(x):
            res.append(encode_py(x))
        else:
            res[-1] = combine_py2(res[-1], x)
    return res

def convert_tobias_k_3(verse):
    return [combine_py(x, y) if y and is_diacritic(y) else encode_py(x) for x, y in zip_longest(verse, verse[1:], fillvalue="") if not is_diacritic(x)]

Now for the timings:

%timeit result_FHTMitchell = convert_FHTMitchell(data)
338 ms ± 5.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit result_tobias_k_1 = convert_tobias_k_1(data)
Aborted, took > 5min to run. Appears to scale quadratically with input size: not OK!

%timeit result_tobias_k_2 = convert_tobias_k_2(data)
357 ms ± 4.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit result_tobias_k_3 = convert_tobias_k_3(data)
466 ms ± 4.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit result_numpy = convert_numpy(data)
30.2 µs ± 162 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

A comparison of the resulting arrays/lists shows that they are equal as well:

np.array_equal(result_FHTMitchell, result_tobias_k_2)  # True
np.array_equal(result_tobias_k_2, result_tobias_k_3)   # True
np.array_equal(result_tobias_k_3, result_numpy)        # True

I'm using array_equal here because it performs all the necessary type conversions to verify the actual data.

So the moral of the story is that there are lots of ways to do this, and parsing a few million characters shouldn't be prohibitively expensive on its own, until you get into cross-referencing and other truly time-consuming tasks. The main thing to take away is not to use reduce on lists, since you will be reallocating far more than you need to. Even a simple for loop will work fine for your purposes; although numpy is much faster than the other implementations, it doesn't give a decisive advantage at this scale.

Decoding

For the sake of completeness, here is a function to decode your results:

def decode(arr):
    mask = (arr > 0x3F)
    nnz = np.count_nonzero(mask)
    ind = np.flatnonzero(mask) + np.arange(nnz)

    diacritics = (arr[mask] >> 6) + 41  # 41 == 0x064A - 0x0621
    characters = (arr & 0x3F)
    characters[characters >= 27] += 5   # restore the gap between the letter ranges

    output = np.empty(arr.size + nnz, dtype='U1').view(np.uint32)
    output[ind] = characters[mask]
    output[ind + 1] = diacritics

    output_mask = np.zeros(output.size, dtype=bool)
    output_mask[ind] = output_mask[ind + 1] = True
    output[~output_mask] = characters[~mask]

    output += 0x0621

    return output.base.view(f'U{output.size}').item()

As a side note, the work I did here inspired this question: Converting numpy arrays of code points to and from strings


map does not seem to be the right tool for the job. You do not want to map characters to other characters, but group them together. Instead, you might try reduce (or functools.reduce in Python 3). Here, I use isalpha to test what kind of character it is; you might need something else.

>>> from functools import reduce
>>> is_diacritic = lambda x: not x.isalpha()
>>> verse = "XXA)L_I!I%M<LLL>MMQ*Q"
>>> reduce(lambda lst, x: lst + [x] if not is_diacritic(x) else lst[:-1] + [lst[-1] + x], verse, [])
['X', 'X', 'A)', 'L_', 'I!', 'I%', 'M<', 'L', 'L', 'L>', 'M', 'M', 'Q*', 'Q']

However, this is barely readable and creates lots of intermediate lists. Better to just use a boring old for loop, even if you explicitly asked for something else:

res = []
for x in verse:
    if not is_diacritic(x):
        res.append(x)
    else:
        res[-1] += x

By iterating over pairs of consecutive characters, e.g. using zip(verse, verse[1:]) (yielding (1,2), (2,3), ..., not (1,2), (3,4), ...), you could indeed also use a list comprehension, but I'd still vote for the for loop for readability:

>>> from itertools import zip_longest
>>> [x + y if is_diacritic(y) else x
...  for x, y in zip_longest(verse, verse[1:], fillvalue="")
...  if not is_diacritic(x)]
['X', 'X', 'A)', 'L_', 'I!', 'I%', 'M<', 'L', 'L', 'L>', 'M', 'M', 'Q*', 'Q']

You could even do the same using map and lambda, but you'd also need to filter first, with another lambda, making the whole thing orders of magnitude uglier and harder to read.
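Since groupby came up in the comments: a stateful key function can also group each letter with its trailing special characters, though it's arguably no clearer than the loop. A sketch, using the same is_diacritic test:

```python
from itertools import groupby

def group_with_diacritics(verse):
    is_diacritic = lambda x: not x.isalpha()
    key = 0
    def keyfunc(ch):
        nonlocal key
        if not is_diacritic(ch):
            key += 1  # start a new group at every letter
        return key
    return [''.join(g) for _, g in groupby(verse, keyfunc)]

group_with_diacritics("XXA)L_I!I%M<LLL>MMQ*Q")
# ['X', 'X', 'A)', 'L_', 'I!', 'I%', 'M<', 'L', 'L', 'L>', 'M', 'M', 'Q*', 'Q']
```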


You're not reading two characters at a time, and even if you were, map would not split them into two parameters for the lambda.

from itertools import tee, zip_longest

def pairwise(iterable, fillvalue):
    """ Slightly modified itertools pairwise recipe
    s -> (s0,s1), (s1,s2), (s2, s3), ... 
    """
    a, b = tee(iterable)
    next(b, None)
    return zip_longest(a, b, fillvalue=fillvalue)

def encode_arabic(verse):

    was_diacritic = False  # variable to keep track of diacritics -- stops us checking same character twice

    # fillvalue will not be encoded but ensures last char is read
    for this_char, next_char in pairwise(verse, fillvalue='-'):

        if was_diacritic:  # last next_char (so this_char) is diacritic
            was_diacritic = False

        elif is_diacritic(next_char):
            yield encode(this_char + next_char)
            was_diacritic = True

        else:
            yield this_char

encode_arabic(verse)  # returns a generator like map -- wrap in list / string.join / whatever


Comments
  • is verse a list or a generator?
  • note that map works well with builtins. When you have to create your own function, generator comprehension is faster & more readable
  • @Jean-FrançoisFabre I've edited the question, check it :)
  • I will have a look at that.
  • Why exactly does it have to be map with lambda instead of a for loop or any other approach, maybe a list comprehension or something using groupby?
  • Could you please clarify the masking part?
  • What is the second variable?
  • Are you sure that in combine(letters=arr[mask], diacritics=arr[1:][mask[:-1]]) it shouldn't be letters=arr[~mask]? Note the ~.
  • @Andrew I'm splitting the array into two portions: character-diacritic pairs and standalone characters. second was just arr[1:] from a prior edit (fixed now). The mask therefore marks the characters followed by a diacritic. And that means arr[1:][mask[:-1]] are the diacritic characters. arr[~mask] includes both diacritics and standalone characters.
  • @Andrew. Your problem is interesting. I'll run an example on some random data and post a benchmark of this version and the others if I have time today.
  • Is this faster than what I'm looking for? Yeah, I said I was hoping for :)
  • I see -- How about this version using lazy iterators.
  • In fact, with the pairwise recipe, you could also use a list comprehension (or even map and filter with two lambdas, but... ugh): [x+y if is_diacritic(y) else x for x, y in pairwise(verse, "") if not is_diacritic(x)]