ALFA (Arabic letter frequency analysis)

[May 3, 2008] ALFAS name is now changed to ALWFA
[April 15, 2008] This is the first post of this article. Please help make it better with your feedback. This date-stamped line will appear here until it matures.

Motivation

While implementing a new invention of mine in software (a new keyboard layout for express typing in Arabic for QWERTY typists), I badly needed an Arabic letter frequency distribution to make informed decisions about key mappings. Strangely, I could not find any site on the Internet with a table of Arabic letter frequencies like those available for English [1,2,3]! So I had to build my own software, ALWFA.

This is indeed surprising, since the very notion of "frequency analysis" started with the Arabs about 1000 years ago, with the eminent scholar Abu Yousuf Ya'qoub ibn Ishaaq Al-Kindi, or Al-Kindi for short [4,5,6]. Al-Kindi (800 - 873 AD) wrote over 250 books on subjects spanning more than the entire spectrum of degree programs offered by university departments of sciences and humanities! And he did so without computers to help him type, software to cut years of computation down to a few milliseconds, or the Internet to help him draw on the intelligence of others! Compared to that, what is running a frequency counter on the Quran or some other volume of text?

More than anything else, I would like to see a frequency distribution dating back to that ninth century, to compare accuracy and methodology given the primitive tools available then. In [5], it says "They realized the rarest letters in Arabic and the most common letters: the letters 'a' and 'l' are the most common in Arabic, whereas the letter 'j' appears only a tenth as frequency". While it is easy to agree that  ا  and  ل  are the most common letters when scanning any Arabic text, the statement about the letter  ج  (Jeem) is off by a factor of 10! In fact, its frequency is more like a "hundredth". So, who exactly do the authors of [5] refer to by "They" in the above quote? Where did they get such data from? Any idea?

In this initial work, frequency analysis is conducted on several sources providing more than five million letters in total, to get a fairly stable Arabic letter frequency distribution. All sources are in Arabic, of course. Along the way, the number of words and the number of pages are also recorded for statistical purposes. A lot of interesting future work awaits, especially on the Quran. Please keep coming back to these pages.

First things first: What gets counted in ALWFA?

The Arabic alphabet chiefly consists of 28 primary letters; these are letters 1 to 28 in Table 1. However, when writing in Arabic, the eight modified letters listed in positions 29 to 36 of the same table are used just as much. If we lump these 8 modified forms back into the primary list based on shape similarity, we end up with the listing shown in Table 2. For accurate frequency analysis, ALWFA doesn't lump; it leaves lumping for the user to do if needed. From now on, only Table 1 (the unlumped version) is referred to.

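Table 1: The Arabic alphabet. Letters 1 to 28 are the primary letters. Letters 29 to 36 are the modified letters.

| No | Letter | No | Letter |
|---|---|---|---|
| 1 | ا | 19 | غ |
| 2 | ب | 20 | ف |
| 3 | ت | 21 | ق |
| 4 | ث | 22 | ك |
| 5 | ج | 23 | ل |
| 6 | ح | 24 | م |
| 7 | خ | 25 | ن |
| 8 | د | 26 | هـ |
| 9 | ذ | 27 | و |
| 10 | ر | 28 | ي |
| 11 | ز | 29 | أ |
| 12 | س | 30 | إ |
| 13 | ش | 31 | آ |
| 14 | ص | 32 | ء |
| 15 | ض | 33 | ة |
| 16 | ط | 34 | ؤ |
| 17 | ظ | 35 | ى |
| 18 | ع | 36 | ئ |

Table 2: The Arabic alphabet, with modified letters lumped into their primary forms based on letter shape.

| No | Letter | No | Letter |
|---|---|---|---|
| 1 | ا , أ , إ , آ , ء | 19 | غ |
| 2 | ب | 20 | ف |
| 3 | ت , ة | 21 | ق |
| 4 | ث | 22 | ك |
| 5 | ج | 23 | ل |
| 6 | ح | 24 | م |
| 7 | خ | 25 | ن |
| 8 | د | 26 | هـ |
| 9 | ذ | 27 | و , ؤ |
| 10 | ر | 28 | ي , ى , ئ |
| 11 | ز | 29 | أ |
| 12 | س | 30 | إ |
| 13 | ش | 31 | آ |
| 14 | ص | 32 | ء |
| 15 | ض | 33 | ة |
| 16 | ط | 34 | ؤ |
| 17 | ظ | 35 | ى |
| 18 | ع | 36 | ئ |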
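
Since ALWFA leaves lumping to the user, here is a minimal sketch of how one might fold the eight modified letters into their primary forms, following the grouping of Table 2. The names `LUMP` and `lump`, and the sample counts, are hypothetical and only for illustration; this is not ALWFA's own code.

```python
from collections import Counter

# Modified letter -> primary letter, following the grouping in Table 2.
LUMP = {
    "\u0623": "\u0627",  # alif with hamza above -> alif
    "\u0625": "\u0627",  # alif with hamza below -> alif
    "\u0622": "\u0627",  # alif with madda -> alif
    "\u0621": "\u0627",  # hamza -> alif
    "\u0629": "\u062A",  # teh marbuta -> teh
    "\u0624": "\u0648",  # waw with hamza -> waw
    "\u0649": "\u064A",  # alif maksura -> yeh
    "\u0626": "\u064A",  # yeh with hamza -> yeh
}

def lump(counts):
    """Fold the counts of modified letters into their primary letters."""
    lumped = Counter()
    for letter, n in counts.items():
        lumped[LUMP.get(letter, letter)] += n
    return lumped

# Made-up counts, for illustration only.
sample = {"\u0627": 10, "\u0623": 3, "\u0625": 2, "\u062A": 5, "\u0629": 1, "\u0628": 4}
for letter, n in lump(sample).items():
    print(letter, n)
```

Keeping the raw 36-way counts and lumping afterwards means either view can be produced from the same data.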

 

Arabic Letter Frequency using only the Quran as input source

First, let's consider the frequency distribution using only the Quran as source of input [7]. Table 3 below presents the letter frequency data based on the 114 suras of the Quran. The listing is sorted according to the hexadecimal unicode value of the characters. Following is a description of the five columns of Table 3:

Table 3: Arabic letter frequency distribution sorted according to the character's unicode value. The source of data is exclusively from the Quran.

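| # | Hex | Letter | Frequency | Percentage | # | Hex | Letter | Frequency | Percentage |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0621 | ء | 1525 | 0.461 | 19 | 0633 | س | 6012 | 1.818 |
| 2 | 0622 | آ | 1730 | 0.523 | 20 | 0634 | ش | 2124 | 0.642 |
| 3 | 0623 | أ | 9119 | 2.757 | 21 | 0635 | ص | 2072 | 0.626 |
| 4 | 0624 | ؤ | 785 | 0.237 | 22 | 0636 | ض | 1686 | 0.510 |
| 5 | 0625 | إ | 5086 | 1.538 | 23 | 0637 | ط | 1273 | 0.385 |
| 6 | 0626 | ئ | 1151 | 0.348 | 24 | 0638 | ظ | 853 | 0.258 |
| 7 | 0627 | ا | 43324 | 13.099 | 25 | 0639 | ع | 9405 | 2.844 |
| 8 | 0628 | ب | 11491 | 3.474 | 26 | 063A | غ | 1221 | 0.369 |
| 9 | 0629 | ة | 2363 | 0.714 | 27 | 0641 | ف | 8748 | 2.645 |
| 10 | 062A | ت | 10501 | 3.175 | 28 | 0642 | ق | 7034 | 2.127 |
| 11 | 062B | ث | 1414 | 0.428 | 29 | 0643 | ك | 10497 | 3.174 |
| 12 | 062C | ج | 3317 | 1.003 | 30 | 0644 | ل | 38191 | 11.547 |
| 13 | 062D | ح | 4140 | 1.252 | 31 | 0645 | م | 26735 | 8.084 |
| 14 | 062E | خ | 2497 | 0.755 | 32 | 0646 | ن | 27271 | 8.246 |
| 15 | 062F | د | 5991 | 1.811 | 33 | 0647 | ه | 14850 | 4.490 |
| 16 | 0630 | ذ | 4932 | 1.491 | 34 | 0648 | و | 24813 | 7.502 |
| 17 | 0631 | ر | 12403 | 3.750 | 35 | 0649 | ى | 2603 | 0.787 |
| 18 | 0632 | ز | 1599 | 0.483 | 36 | 064A | ي | 21977 | 6.645 |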
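
The Percentage column appears to be each letter's count divided by the total number of counted letters. A small sketch of that arithmetic follows, using a few counts copied from Table 3. The total of 330733 is simply the sum of the 36 Frequency entries of that table; the function name `percentage` is an assumption for illustration.

```python
# A few (letter, count) pairs copied from Table 3.
counts = {
    "\u0627": 43324,  # alif
    "\u0644": 38191,  # lam
    "\u062C": 3317,   # jeem
    "\u0624": 785,    # waw with hamza
}

# Sum of all 36 Frequency entries in Table 3.
TOTAL_LETTERS = 330733

def percentage(count, total=TOTAL_LETTERS):
    """Share of `count` in `total`, as a percentage rounded to 3 decimals."""
    return round(100 * count / total, 3)

for letter, n in counts.items():
    print(letter, n, percentage(n))
```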

 

Table 4 below shows the same information as Table 3, but with the letters sorted from the most frequent to the least frequent. Note how the letter Jeem appears in the 21st position, with a frequency of about one hundredth. That is, given that the average word length in the Quran is 4.25 letters, Jeem can be expected to occur only about once every 100 letters, or once every 23.53 words. Complete letter, word, and average-word-length statistics for the 114 suras of the Quran are found in the Quran statistics page.
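
The "once every 23.53 words" figure can be checked with a few lines of arithmetic. This is only a sketch; the variable names are made up, and the inputs are the Jeem percentage from the table and the 4.25 average word length quoted above.

```python
JEEM_PERCENT = 1.003     # Jeem's share of all letters in the Quran tables
AVG_WORD_LENGTH = 4.25   # average letters per word, as quoted in the text

# Rounded view: one Jeem per 100 letters.
words_per_jeem_rounded = 100 / AVG_WORD_LENGTH

# Using the tabulated percentage instead of a flat 1%.
letters_per_jeem = 100 / JEEM_PERCENT
words_per_jeem = letters_per_jeem / AVG_WORD_LENGTH

print(round(words_per_jeem_rounded, 2))  # 23.53
print(round(letters_per_jeem, 1))        # 99.7
print(round(words_per_jeem, 2))          # 23.46
```

The two results differ only because 1.003% is not exactly one hundredth.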

Table 4: Arabic letter frequency distribution sorted according to the most frequent. The source of data is exclusively from the Quran.

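| # | Hex | Letter | Frequency | Percentage | # | Hex | Letter | Frequency | Percentage |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0627 | ا | 43324 | 13.099 | 19 | 0630 | ذ | 4932 | 1.491 |
| 2 | 0644 | ل | 38191 | 11.547 | 20 | 062D | ح | 4140 | 1.252 |
| 3 | 0646 | ن | 27271 | 8.246 | 21 | 062C | ج | 3317 | 1.003 |
| 4 | 0645 | م | 26735 | 8.084 | 22 | 0649 | ى | 2603 | 0.787 |
| 5 | 0648 | و | 24813 | 7.502 | 23 | 062E | خ | 2497 | 0.755 |
| 6 | 064A | ي | 21977 | 6.645 | 24 | 0629 | ة | 2363 | 0.714 |
| 7 | 0647 | ه | 14850 | 4.490 | 25 | 0634 | ش | 2124 | 0.642 |
| 8 | 0631 | ر | 12403 | 3.750 | 26 | 0635 | ص | 2072 | 0.626 |
| 9 | 0628 | ب | 11491 | 3.474 | 27 | 0622 | آ | 1730 | 0.523 |
| 10 | 062A | ت | 10501 | 3.175 | 28 | 0636 | ض | 1686 | 0.510 |
| 11 | 0643 | ك | 10497 | 3.174 | 29 | 0632 | ز | 1599 | 0.483 |
| 12 | 0639 | ع | 9405 | 2.844 | 30 | 0621 | ء | 1525 | 0.461 |
| 13 | 0623 | أ | 9119 | 2.757 | 31 | 062B | ث | 1414 | 0.428 |
| 14 | 0641 | ف | 8748 | 2.645 | 32 | 0637 | ط | 1273 | 0.385 |
| 15 | 0642 | ق | 7034 | 2.127 | 33 | 063A | غ | 1221 | 0.369 |
| 16 | 0633 | س | 6012 | 1.818 | 34 | 0626 | ئ | 1151 | 0.348 |
| 17 | 062F | د | 5991 | 1.811 | 35 | 0638 | ظ | 853 | 0.258 |
| 18 | 0625 | إ | 5086 | 1.538 | 36 | 0624 | ؤ | 785 | 0.237 |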

 

Methodology

To arrive accurately at the statistics exhibited in the tables above, the following assumptions and sources are used:

Examples

Let's consider two suras of the Quran and look at some interesting statistics: Sura 1 and Sura 18. Sura 1 is made up of 7 verses, which are in turn made up of 29 words, which are in turn made up of 143 letters. Notice that the statistics section in [10] counts 139 letters, not 143. ALWFA counts 4 extra letters because the letter  ا  is explicitly included in the words  الْعَالَمِين ,  مَـالِكِ ,  الصِّرَاطَ  and  صِرَاطَ  in the source text [7]. In Quranic script, these words would be written as  الْعَــٰـلَمِينَ ,  مَــٰـلِكِ ,  الصِّرَٰطَ  and  صِرَٰط , where an implicit  ٰ  (alif mamdoodah) ensures that the correct pronunciation is maintained.
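
To see why the two spellings give different letter counts, here is a minimal sketch of a letter counter. It assumes that only the 36 code points that appear in the tables (U+0621 to U+063A and U+0641 to U+064A) are counted, so diacritics and the small superscript alif (U+0670) are skipped. That rule is inferred from the tables, not taken from ALWFA's source, and `count_letters` is a hypothetical name.

```python
# The 36 counted letters, inferred from the code points listed in the tables.
ARABIC_LETTERS = (
    {chr(cp) for cp in range(0x0621, 0x063B)}    # U+0621 .. U+063A
    | {chr(cp) for cp in range(0x0641, 0x064B)}  # U+0641 .. U+064A
)

def count_letters(text):
    """Count only the 36 letters; diacritics and other marks are ignored."""
    return sum(1 for ch in text if ch in ARABIC_LETTERS)

# The same word written with an explicit alif (U+0627) ...
explicit = "\u0635\u0650\u0631\u064E\u0627\u0637\u064E"
# ... and with the superscript alif (U+0670) used in Quranic script.
quranic = "\u0635\u0650\u0631\u064E\u0670\u0637\u064E"

print(count_letters(explicit))  # 4
print(count_letters(quranic))   # 3
```

One extra letter per word, over the four words listed above, accounts for the difference between 143 and 139.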

A lot of interesting statistics can be drawn from Sura 18, known as Surat Al-kahf. On a typical Quran (i.e., مصحف المدينة النبوية. See [11] below), where the 114 Quran suras are spread over 604 pages, the following holds:

Arabic Letter Frequency using general sources

This work won't be complete without gathering statistical data from several other sources. The following famous Arabic sources are used:

Collectively, these sources add up to 3,378 pages, generating 1,297,259 words, or 5,122,132 letters. Table 5 shows the letter frequency distribution for this data.

Table 5: Arabic letter frequency distribution sorted according to the most frequent. The source of input data is from several texts and volumes with over five million letters.

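| # | Code (decimal) | Letter | Frequency | Percentage | # | Code (decimal) | Letter | Frequency | Percentage |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1575 | ا | 640362 | 12.502 | 19 | 1577 | ة | 72952 | 1.424 |
| 2 | 1604 | ل | 618465 | 12.074 | 20 | 1609 | ى | 66064 | 1.290 |
| 3 | 1606 | ن | 338646 | 6.611 | 21 | 1580 | ج | 62900 | 1.228 |
| 4 | 1605 | م | 333858 | 6.518 | 22 | 1589 | ص | 53229 | 1.039 |
| 5 | 1610 | ي | 325831 | 6.361 | 23 | 1573 | إ | 51033 | 0.996 |
| 6 | 1608 | و | 297064 | 5.800 | 24 | 1584 | ذ | 49066 | 0.958 |
| 7 | 1607 | ه | 260110 | 5.078 | 25 | 1579 | ث | 44683 | 0.872 |
| 8 | 1576 | ب | 239062 | 4.667 | 26 | 1582 | خ | 40535 | 0.791 |
| 9 | 1585 | ر | 215286 | 4.203 | 27 | 1588 | ش | 37606 | 0.734 |
| 10 | 1593 | ع | 205591 | 4.014 | 28 | 1586 | ز | 26779 | 0.523 |
| 11 | 1571 | أ | 147879 | 2.887 | 29 | 1591 | ط | 25422 | 0.496 |
| 12 | 1601 | ف | 145287 | 2.836 | 30 | 1590 | ض | 22640 | 0.442 |
| 13 | 1602 | ق | 137803 | 2.690 | 31 | 1594 | غ | 16693 | 0.326 |
| 14 | 1583 | د | 136497 | 2.665 | 32 | 1569 | ء | 15932 | 0.311 |
| 15 | 1578 | ت | 133497 | 2.606 | 33 | 1574 | ئ | 14390 | 0.281 |
| 16 | 1587 | س | 126270 | 2.465 | 34 | 1592 | ظ | 8959 | 0.175 |
| 17 | 1603 | ك | 104403 | 2.038 | 35 | 1570 | آ | 7492 | 0.146 |
| 18 | 1581 | ح | 95474 | 1.864 | 36 | 1572 | ؤ | 4372 | 0.085 |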
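
Note that the code column of Table 5 holds decimal Unicode code points, whereas Tables 3 and 4 list them in hexadecimal. A short sketch for converting between the two, so that rows can be matched across tables (the helper name `to_hex` is made up for illustration):

```python
def to_hex(code_point):
    """Format a decimal code point as four uppercase hex digits."""
    return format(code_point, "04X")

# A few decimal codes taken from Table 5.
for dec in (1575, 1604, 1580):
    print(dec, to_hex(dec), chr(dec))
```

For example, 1575 becomes 0627, which is the entry for  ا  in Tables 3 and 4.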

 

Here is the data from the Quran and from the sources above, displayed side by side in Table 6. Line chart figures are available in this page. If you spend some time analyzing the variations in the data, you could conclude that it all makes sense. As an example, notice how the letter  آ  ranks 27th on the Quran list but 35th on the Others list. This is because the divine Quran scripture makes more use of this letter than earthly writing does. In fact, the letter  آ  is often equally represented by the digraph  ءا  (as in  قرءان  as opposed to  قرآن ). In general, however, the two line charts in the figure, or the two lists in Table 6, simply show that the statistics are more similar than different, as should be expected.

Table 6: A comparative display of input sources for frequency distribution analysis.

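| # | Letter | Quran% | Letter | Others% | # | Letter | Quran% | Letter | Others% |
|---|---|---|---|---|---|---|---|---|---|
| 1 | ا | 13.099 | ا | 12.502 | 19 | ذ | 1.491 | ة | 1.424 |
| 2 | ل | 11.547 | ل | 12.074 | 20 | ح | 1.252 | ى | 1.290 |
| 3 | ن | 8.246 | ن | 6.611 | 21 | ج | 1.003 | ج | 1.228 |
| 4 | م | 8.084 | م | 6.518 | 22 | ى | 0.787 | ص | 1.039 |
| 5 | و | 7.502 | ي | 6.361 | 23 | خ | 0.755 | إ | 0.996 |
| 6 | ي | 6.645 | و | 5.800 | 24 | ة | 0.714 | ذ | 0.958 |
| 7 | ه | 4.490 | ه | 5.078 | 25 | ش | 0.642 | ث | 0.872 |
| 8 | ر | 3.750 | ب | 4.667 | 26 | ص | 0.626 | خ | 0.791 |
| 9 | ب | 3.474 | ر | 4.203 | 27 | آ | 0.523 | ش | 0.734 |
| 10 | ت | 3.175 | ع | 4.014 | 28 | ض | 0.510 | ز | 0.523 |
| 11 | ك | 3.174 | أ | 2.887 | 29 | ز | 0.483 | ط | 0.496 |
| 12 | ع | 2.844 | ف | 2.836 | 30 | ء | 0.461 | ض | 0.442 |
| 13 | أ | 2.757 | ق | 2.690 | 31 | ث | 0.428 | غ | 0.326 |
| 14 | ف | 2.645 | د | 2.665 | 32 | ط | 0.385 | ء | 0.311 |
| 15 | ق | 2.127 | ت | 2.606 | 33 | غ | 0.369 | ئ | 0.281 |
| 16 | س | 1.818 | س | 2.465 | 34 | ئ | 0.348 | ظ | 0.175 |
| 17 | د | 1.811 | ك | 2.038 | 35 | ظ | 0.258 | آ | 0.146 |
| 18 | إ | 1.538 | ح | 1.864 | 36 | ؤ | 0.237 | ؤ | 0.085 |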
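
One way to make the comparison concrete is to compute how far each letter moves between the two rankings. The sketch below is only an illustration. The two orderings are the code points read off Tables 4 and 5 in rank order, and every name in it is made up for the example.

```python
# Rank order of the 36 letters in the Quran data (hex codes from Table 4).
QURAN_ORDER = [chr(cp) for cp in (
    0x0627, 0x0644, 0x0646, 0x0645, 0x0648, 0x064A, 0x0647, 0x0631, 0x0628,
    0x062A, 0x0643, 0x0639, 0x0623, 0x0641, 0x0642, 0x0633, 0x062F, 0x0625,
    0x0630, 0x062D, 0x062C, 0x0649, 0x062E, 0x0629, 0x0634, 0x0635, 0x0622,
    0x0636, 0x0632, 0x0621, 0x062B, 0x0637, 0x063A, 0x0626, 0x0638, 0x0624,
)]

# Rank order in the general sources (decimal codes from Table 5).
OTHERS_ORDER = [chr(cp) for cp in (
    1575, 1604, 1606, 1605, 1610, 1608, 1607, 1576, 1585,
    1593, 1571, 1601, 1602, 1583, 1578, 1587, 1603, 1581,
    1577, 1609, 1580, 1589, 1573, 1584, 1579, 1582, 1588,
    1586, 1591, 1590, 1594, 1569, 1574, 1592, 1570, 1572,
)]

def ranks(order):
    """Map each letter to its 1-based position in the ordering."""
    return {letter: i for i, letter in enumerate(order, start=1)}

quran_rank = ranks(QURAN_ORDER)
others_rank = ranks(OTHERS_ORDER)

# Positive shift: the letter ranks lower (is rarer) in the general sources.
shift = {letter: others_rank[letter] - quran_rank[letter] for letter in QURAN_ORDER}

# The five letters whose rank changes the most.
for letter in sorted(shift, key=lambda l: abs(shift[l]), reverse=True)[:5]:
    print(letter, quran_rank[letter], others_rank[letter], shift[letter])
```

With these orderings, the largest shift belongs to  آ , which moves from 27th to 35th place.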

 

Interesting Findings

While searching for frequency analysis on Arabic letters, I stumbled upon some findings and sites you may be interested to know about.

Quran Frequency Analysis, in English!

In [12], statistics are computed on an English translation of the Quran! It is hard to see the significance of such an effort when the analysis is done on a mere translation. On one hand, it shows the importance of, and the interest invested in, analyzing the Quran scripture from a computational viewpoint; on the other hand, such efforts would surely produce results that are entirely in line with a frequency analysis of any typical English text.

Cracking Ciphers

Literally about 1000 years ago, a complete method of cryptanalysis was explained in sufficient detail, and in handwriting! Would you not like to see it? Check [13].

Inaccuracies Unexplained!

As I said in the beginning, before writing ALWFA, I spent some time looking for any resources on Arabic letter frequency analysis or anything close. I found one! It is a 2005 paper authored by someone who was a PhD candidate in England at the time. In Table 1 of her paper, she lists 24 suras, all of which are made up of more than 1,000 words. The source of data she picked for her experiments is interesting: the whole Quran is transliterated word by word, presumably to help non-Arabic speakers get a handle on correct pronunciation. What I found not to add up, however, is the number of words listed for each sura in her table. The numbers are sometimes off by 100! I examined Surat Al-kahf as listed in the source she used by pasting it into MS Word to count the words. There are 1693 words if you strictly count from Verse 1 to Verse 110. Subtracting 110 from that number, to remove the verse numbers from the count, gives 1583! This is the number arrived at by ALWFA (and by me, page by page, word by word). However, she lists 1489 words for the same sura! I will share any feedback with you if she responds to the email I sent her asking how her data came into being.
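
The verse-number correction described above can also be done in code rather than by subtraction. The following is a sketch under simple assumptions: words are separated by whitespace and verse numbers appear as standalone digit tokens. The sample string is made up, and `count_words` is a hypothetical helper, not part of ALWFA.

```python
def count_words(text):
    """Count whitespace-separated tokens, skipping purely numeric ones."""
    return sum(1 for token in text.split() if not token.isdigit())

# Made-up snippet with verse numbers mixed in, for illustration only.
sample = "1 first verse words 2 second verse"
print(count_words(sample))  # 5

# The arithmetic from the paragraph above.
tokens_counted = 1693   # tokens counted for Verses 1 to 110, numbers included
verse_numbers = 110
words = tokens_counted - verse_numbers
print(words)            # 1583
print(words - 1489)     # 94, the gap to the figure listed in the paper
```

A verse number written with punctuation attached (such as "1.") would not be skipped by `isdigit`, so real input may need extra cleaning.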

References

[1] http://en.wikipedia.org/wiki/Letter_frequencies

[2] http://en.wikipedia.org/wiki/Frequency_analysis

[3] http://www.simonsingh.com/The_Black_Chamber/frequencyanalysis.html

[4] http://en.wikipedia.org/wiki/Alkindi

[5] http://cs-exhibitions.uni-klu.ac.at/index.php?id=279

[6] http://www.muslimheritage.com/topics/default.cfm?ArticleID=372

[7] http://www.oneummah.net/quran/quran.html. I have copied all suras in Arabic and saved them as a .docx file (that is, in Microsoft Office Word 2007 format). I zipped all the suras for downloading. You can find information about the Quran on its Wikipedia page; maybe you can help with the "cleanup" or adding "citations" tasks they seek at the top of that page.

[8] http://en.wikipedia.org/wiki/Arabic_alphabet

[9] http://www.unicode.org/charts/PDF/U0600.pdf

[10] http://en.wikipedia.org/wiki/Al-Fatiha. Let me know if you would like to help and provide interesting statistics such as that you find in [10] for all 114 suras of the Quran. Note that all suras can be accessed by number in the table shown in the Statistics.

[11] http://www.qurancomplex.com/Quran/display/Display.asp. If you are browsing using IE, you will be guided towards installing any missing plug-ins you need to view the pages of the Quran. If you are using Firefox like me, then although you are prompted to download a missing plug-in, you won't know what to do with it unless you are an expert, or unless you follow the instructions given to me by a cyberspace helper in the Firefox forums (known as GingerBread Guy), which say:

  1. Save the KFCFont.XPI file to your desktop.
  2. Rename it to KFCFont.xpi
  3. In Firefox, go to Tools > Add-ons > Extensions.
  4. Drag and drop the file from the desktop to the Add-ons window. Nothing will appear to happen. Wait a few moments, close the Add-ons window, then exit and restart Firefox.
  5. In the address bar, type about:plugins and press Enter. Verify that the plug-in is installed (see screenshot). If the plug-in appears in the list, you may load the site now (see screenshot).

Removing (i.e., uninstalling) this particular plug-in is not straightforward! GingerBread Guy to the rescue again:

Please note that there is no way to uninstall the plug-in from Firefox. If you want to remove it, follow these steps:

  1. Exit Firefox.
  2. Open the Firefox plug-in directory (usually C:\Program Files\Mozilla Firefox\plugins) and delete FontDown.dll, npKFCFont.dll and npKFCFont.xpt
  3. Click the Start button, then click Run. Type %windir%\system32\ and press OK. Delete the "KFC" folder.

[12] http://www.intratext.com/IXT/ENG0027/_STAT.HTM

[13] http://www.simonsingh.net/The_Black_Chamber/crackingsubstitution.html

Contact Me

Any comments? Any inaccuracy? Interesting additions or links? Please do not hesitate to communicate any remarks to enhance the quality of these pages to me. Thank you.