ALFA (Arabic letter frequency analysis)
[May 3, 2008] ALFA's name is now changed to ALWFA.
[April 15, 2008] This is the first post of this article. Please help make it better with your feedback. This date-stamped line will appear here until it matures.
Motivation
While implementing my new invention in software, a new keyboard layout for express typing in Arabic for QWERTY typists, I badly needed an Arabic letter frequency distribution to make informed decisions about key mappings. Strangely, I couldn't find any site on the Internet with a table of Arabic letter frequencies like those available for English [1,2,3]! So I had to build my own software, ALWFA.
This is indeed surprising, since the very notion of "frequency analysis" started with the Arabs about 1000 years ago, with the eminent scholar Abu Yousuf Ya'qoub ibn Ishaaq Al-Kindi, or Al-Kindi for short [4,5,6]. Al-Kindi (800 - 873 AD) wrote over 250 books on subjects spanning more than the entire spectrum of degree programs offered by university departments in the sciences and humanities! And he did so without computers to help him type, without software to cut years of computation down to a few milliseconds, and without the Internet to help him draw on the intelligence of others! Compared with that, what is running a frequency counter on the Quran or some other volume of text? More than anything else, I would like to see a frequency distribution dating back to that ninth century, to compare accuracy and methodology given the primitive tools available then.
In [5], it says "They realized the rarest letters in Arabic and the most common letters: the letters 'a' and 'l' are the most common in Arabic, whereas the letter 'j' appears only a tenth as frequency". Whereas it is easy to concur that ا and ل are the most common letters when scanning any Arabic text, the statement about the letter ج (Jeem) is off by a factor of 10! In fact, its frequency is more like a "hundredth". So, whom precisely do the authors of [5] refer to by "They" in the above quote? Where did they get such data from? Any idea?
In this initial work, frequency analysis is conducted on several sources providing more than five million letters of input in total, to get a fairly stable Arabic letter frequency distribution. All sources are in Arabic, of course. Along the way, the number of words and the number of pages are also recorded for statistical purposes. A lot of interesting future work awaits, especially on the Quran. Please keep coming back to these pages.
First things first: What gets counted in ALWFA?
Chiefly, the Arabic alphabet consists of 28 primary letters; these are letters 1 to 28 in Table 1. However, when writing in Arabic, the eight modified letters listed in positions 29 to 36 of the same table are used just as much. If we lump these eight modified forms back into the primary list based on shape similarity, we end up with the listing shown in Table 2. For accurate frequency analysis, ALWFA doesn't lump; it leaves lumping for the user to do if needed. From now on, references are made only to the version on the left, Table 1.
Table 1: The Arabic alphabet. Letters 1 to 28 are the primary letters. Letters 29 to 36 are the modified letters.
Table 2: The Arabic alphabet, with modified letters lumped into their primary forms based on letter shape.
Arabic Letter Frequency using only the Quran as input source
First, let's consider the frequency distribution using only the Quran as the source of input [7]. Table 3 below presents the letter frequency data based on the 114 suras of the Quran. The listing is sorted according to the hexadecimal Unicode value of the characters. Following is a description of the five columns of Table 3:
- Column 1 is used for numerical referencing
- Column 2 shows the hexadecimal Unicode value that's used in ordering the letters
- Column 3 shows the Arabic letter (harf) whose frequency is counted
- Column 4 shows the frequency of each letter among the 330733 letters counted in the entire book of the Quran
- Finally, Column 5 shows the frequencies of Column 4 as percentages. As an example calculation, consider Letter 7 (i.e., Letter ا , named alif); dividing the frequency by the total count of letters in the Quran and multiplying by 100 (i.e., 43324 / 330733 * 100) gives the percentage 13.099, rounded to three digits after the decimal point.
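The percentage column can be reproduced with a few lines of Python. This is only an illustrative sketch, not ALWFA's code: the function name `percentage` is made up for this example, and the two counts are the ones quoted above.

```python
def percentage(count: int, total: int, digits: int = 3) -> float:
    """Return `count` as a percentage of `total`, rounded to `digits` decimals."""
    return round(count / total * 100, digits)


QURAN_TOTAL_LETTERS = 330733  # total letters counted in the Quran
ALIF_COUNT = 43324            # frequency of alif (U+0627)

print(percentage(ALIF_COUNT, QURAN_TOTAL_LETTERS))  # 13.099
```

The same call with any other frequency and the same total gives the matching entry of the percentage column.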
Table 3: Arabic letter frequency distribution sorted according to the character's unicode value. The source of data is exclusively from the Quran.
| # | Hex | Letter | Frequency | Percentage | # | Hex | Letter | Frequency | Percentage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0621 | ء | 1525 | 0.461 | 19 | 0633 | س | 6012 | 1.818 |
| 2 | 0622 | آ | 1730 | 0.523 | 20 | 0634 | ش | 2124 | 0.642 |
| 3 | 0623 | أ | 9119 | 2.757 | 21 | 0635 | ص | 2072 | 0.626 |
| 4 | 0624 | ؤ | 785 | 0.237 | 22 | 0636 | ض | 1686 | 0.510 |
| 5 | 0625 | إ | 5086 | 1.538 | 23 | 0637 | ط | 1273 | 0.385 |
| 6 | 0626 | ئ | 1151 | 0.348 | 24 | 0638 | ظ | 853 | 0.258 |
| 7 | 0627 | ا | 43324 | 13.099 | 25 | 0639 | ع | 9405 | 2.844 |
| 8 | 0628 | ب | 11491 | 3.474 | 26 | 063A | غ | 1221 | 0.369 |
| 9 | 0629 | ة | 2363 | 0.714 | 27 | 0641 | ف | 8748 | 2.645 |
| 10 | 062A | ت | 10501 | 3.175 | 28 | 0642 | ق | 7034 | 2.127 |
| 11 | 062B | ث | 1414 | 0.428 | 29 | 0643 | ك | 10497 | 3.174 |
| 12 | 062C | ج | 3317 | 1.003 | 30 | 0644 | ل | 38191 | 11.547 |
| 13 | 062D | ح | 4140 | 1.252 | 31 | 0645 | م | 26735 | 8.084 |
| 14 | 062E | خ | 2497 | 0.755 | 32 | 0646 | ن | 27271 | 8.246 |
| 15 | 062F | د | 5991 | 1.811 | 33 | 0647 | ه | 14850 | 4.490 |
| 16 | 0630 | ذ | 4932 | 1.491 | 34 | 0648 | و | 24813 | 7.502 |
| 17 | 0631 | ر | 12403 | 3.750 | 35 | 0649 | ى | 2603 | 0.787 |
| 18 | 0632 | ز | 1599 | 0.483 | 36 | 064A | ي | 21977 | 6.645 |
Table 4 below shows the same information as Table 3, but with the letters sorted from the most frequent to the least frequent. Note how the letter Jeem appears in the 21st position, and that its frequency is about one hundredth (1.003%). That is, given that the average length of a word in the Quran is 4.25 letters, Jeem can be expected to occur only once in every 23.53 words, or once every 100 letters. Complete letter/word/average-word-length statistics about the 114 suras of the Quran are found in the Quran statistics page.
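The arithmetic behind "once every 100 letters" and "once in every 23.53 words" can be sketched as follows. The function names are hypothetical, and the sketch assumes letters are spread evenly through the text, which is only a rough approximation.

```python
def letters_per_occurrence(percent: float) -> float:
    """Average number of letters between occurrences of a letter with this share."""
    return 100.0 / percent


def words_per_occurrence(percent: float, avg_word_length: float) -> float:
    """Same spacing, expressed in words of the given average length."""
    return letters_per_occurrence(percent) / avg_word_length


AVG_WORD_LENGTH = 4.25  # average word length in the Quran, as stated above

# Using the rounded share of one hundredth (1%) for Jeem:
print(round(letters_per_occurrence(1.0)))                    # 100
print(round(words_per_occurrence(1.0, AVG_WORD_LENGTH), 2))  # 23.53
```

Plugging in the unrounded 1.003% instead of 1% changes the result only slightly.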
Table 4: Arabic letter frequency distribution sorted according to the most frequent. The source of data is exclusively from the Quran.
| # | Hex | Letter | Frequency | Percentage | # | Hex | Letter | Frequency | Percentage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0627 | ا | 43324 | 13.099 | 19 | 0630 | ذ | 4932 | 1.491 |
| 2 | 0644 | ل | 38191 | 11.547 | 20 | 062D | ح | 4140 | 1.252 |
| 3 | 0646 | ن | 27271 | 8.246 | 21 | 062C | ج | 3317 | 1.003 |
| 4 | 0645 | م | 26735 | 8.084 | 22 | 0649 | ى | 2603 | 0.787 |
| 5 | 0648 | و | 24813 | 7.502 | 23 | 062E | خ | 2497 | 0.755 |
| 6 | 064A | ي | 21977 | 6.645 | 24 | 0629 | ة | 2363 | 0.714 |
| 7 | 0647 | ه | 14850 | 4.490 | 25 | 0634 | ش | 2124 | 0.642 |
| 8 | 0631 | ر | 12403 | 3.750 | 26 | 0635 | ص | 2072 | 0.626 |
| 9 | 0628 | ب | 11491 | 3.474 | 27 | 0622 | آ | 1730 | 0.523 |
| 10 | 062A | ت | 10501 | 3.175 | 28 | 0636 | ض | 1686 | 0.510 |
| 11 | 0643 | ك | 10497 | 3.174 | 29 | 0632 | ز | 1599 | 0.483 |
| 12 | 0639 | ع | 9405 | 2.844 | 30 | 0621 | ء | 1525 | 0.461 |
| 13 | 0623 | أ | 9119 | 2.757 | 31 | 062B | ث | 1414 | 0.428 |
| 14 | 0641 | ف | 8748 | 2.645 | 32 | 0637 | ط | 1273 | 0.385 |
| 15 | 0642 | ق | 7034 | 2.127 | 33 | 063A | غ | 1221 | 0.369 |
| 16 | 0633 | س | 6012 | 1.818 | 34 | 0626 | ئ | 1151 | 0.348 |
| 17 | 062F | د | 5991 | 1.811 | 35 | 0638 | ظ | 853 | 0.258 |
| 18 | 0625 | إ | 5086 | 1.538 | 36 | 0624 | ؤ | 785 | 0.237 |
Methodology
To arrive accurately at the statistics exhibited in the tables above, the following assumptions and sources are used:
- All the suras are obtained online from Al-Quran Al-Kareem site [7].
- [7] adopts the following approaches which ALWFA is tailored for:
- A normal alif is consistently used instead of alif mamdoodah where appropriate
Example. In Sura 2, Verse 2, [7] would write الْكِتَابُ as opposed to الْكِتَـٰبُ , which is more suitable when conducting frequency analysis. That said, note that in some cases as in one of the names of Allah سبحانه وتعالى -- الرَّحْمـَٰن -- the shape is retained as is as opposed to scripting it as الرَّحْمـَان .
- The letter و (i.e., Letter 27 in Table 1), when used for the purposes of conjunction (which means "and" in English) is not counted as an independent word; rather, as a prefix to whatever it precedes.
Example: In Sura 3, Verse 33, it is most typical to write إِنَّ اللّهَ اصْطَفَى آدَمَ وَنُوحًا وَآلَ إِبْرَاهِيمَ وَآلَ عِمْرَانَ عَلَى الْعَالَمِينَ as opposed to separating the three conjunctive و from the nouns they precede; the verse as a result is composed of 11 words.
- The calling digraph -- يا -- is counted as a whole word, and not as part of the word it precedes.
Example: In Sura 19, Verses 42 and 46, [7] writes يَا أَبَتِ and يَا إِبْراهِيمُ , as opposed to the style followed in the Quranic scripture, يَـٰـأَبَتِ and يَـٰـإِبْراهِيمُ . Therefore, ALWFA counts two words in each case instead of one, and counts 5 letters and 9 letters in total, respectively, instead of 4 and 8 letters.
- Whereas each verse in all the suras in [7] is preceded with the verse number followed by a period, ALWFA is made completely oblivious to such numbers and periods.
Example: ALWFA counts exactly 14 words in Sura 103.
- The letters and the words of the opening verse -- بسم الله الرحمن الرحيم -- are counted only in Sura 1, but are exempt from counting in all other suras that start with them, as should be the case.
- ALWFA is made completely oblivious to diacritics (harakaat, or tashkeel) when counting letter frequencies.
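To make the counting rules above concrete, here is a minimal Python sketch of a letter counter that ignores diacritics. It is not ALWFA's actual code. It assumes the letters of interest are exactly the 36 code points listed in Table 3 (U+0621 to U+063A and U+0641 to U+064A), and the names `ARABIC_LETTERS` and `count_letters` are made up for this example.

```python
from collections import Counter

# Assumption: the letters of interest are the 36 code points of Table 3,
# i.e. U+0621..U+063A and U+0641..U+064A. Diacritics (U+064B and above),
# the tatweel (U+0640), digits, spaces and punctuation fall outside this set.
ARABIC_LETTERS = frozenset(
    chr(cp) for cp in (*range(0x0621, 0x063B), *range(0x0641, 0x064B))
)


def count_letters(text: str) -> Counter:
    """Count only the 36 Arabic letters, skipping every other character."""
    return Counter(ch for ch in text if ch in ARABIC_LETTERS)


# "ya abati" with its diacritics, written with escapes to keep the source ASCII.
sample = "\u064A\u064E\u0627 \u0623\u064E\u0628\u064E\u062A\u0650"
counts = count_letters(sample)
print(sum(counts.values()))  # 5 letters, as in the Sura 19 example
```

Because diacritics and verse numbers simply fall outside the letter set, no separate stripping step is needed in this sketch.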
Examples
Let's consider two suras of the Quran and look at some interesting statistics: Sura 1 and Sura 18. Sura 1 is made up of 7 verses, which are in turn made up of 29 words, which are in turn made up of 143 letters. Notice that the statistics section in [10] counts 139 letters and not 143. ALWFA counts 4 extra letters because the letter ا is explicitly included in the words الْعَالَمِين , مَـالِك ِ, الصِّرَاطَ and صِرَاطَ , following the convention of [7] described above. In Quranic scripture, these words would be scripted as الْعَــٰـلَمِينَ , مَــٰـلِك ِ, الصِّرَٰطَ and صِرَٰط , where an implicit ٰ (alif mamdoodah) ensures that the correct pronunciation is maintained.
A lot of interesting statistics can be drawn from Sura 18, known as Surat Al-kahf. In a typical printed Quran (i.e., مصحف المدينة النبوية; see [11] below), where the 114 suras are spread over 604 pages, the following holds:
- The sura is printed over 11 pages and four lines
- Each full page is composed of 15 lines when not including headings of suras, so the sura is made up of 169 lines comprising 110 verses (a'ayaat; singular a'aya)
- The 110 verses are made up of 1583 words, which implies that each line holds an average of 9.367 words. It is worth noting that counting the words straight from a typical Quran text such as [11] gives 1579, whereas [7] counts 4 extra words in Verses 42, 49, 86 and 94, all of which are in the form of the separate Arabic calling digraph يا .
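The line and word figures for Surat Al-kahf can be checked with a short calculation. The variable names below are only illustrative; the numbers are the ones given in the list above.

```python
FULL_PAGES = 11       # full pages occupied by the sura
LINES_PER_PAGE = 15   # lines per full page, headings excluded
EXTRA_LINES = 4       # lines beyond the full pages
WORDS = 1583          # words counted from [7]

total_lines = FULL_PAGES * LINES_PER_PAGE + EXTRA_LINES
words_per_line = round(WORDS / total_lines, 3)

print(total_lines)     # 169
print(words_per_line)  # 9.367
```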
Arabic Letter Frequency using general sources
This work wouldn't be complete without gathering statistical data from several other sources. The following famous Arabic sources are used:
- The first seven volumes of the series البداية والنهاية (The Beginning and The End) by Ibn Katheer. Altogether, these seven volumes fill 2,855 pages, containing 1,096,047 words and 4,326,031 letters.
- The book of sirah الرحيق المختوم (The Sealed Nectar; sirah means the life of Prophet Mohammad صلى الله عليه وسلم) by Almubarakfouri. The book is spread over 284 pages, containing 134,662 words and 553,740 letters.
- The book تحفة العروس (The Masterpiece for the Bride) by Al-shuri. The book is spread over 239 pages, containing 66,550 words and 242,361 letters.
Collectively, these sources add up to 3,378 pages, 1,297,259 words, or 5,122,132 letters. Table 5 shows the letter frequency distribution for this data.
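As a quick sanity check, the collective totals can be recomputed from the per-source figures. This is just a sketch; the dictionary `SOURCES` and its English keys are labels chosen for this example.

```python
# (pages, words, letters) for each source, as listed above
SOURCES = {
    "The Beginning and The End, volumes 1-7": (2855, 1096047, 4326031),
    "The Sealed Nectar": (284, 134662, 553740),
    "The Masterpiece for the Bride": (239, 66550, 242361),
}

# Sum each column across the three sources.
pages, words, letters = (sum(column) for column in zip(*SOURCES.values()))

print(pages, words, letters)  # 3378 1297259 5122132
```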
Table 5: Arabic letter frequency distribution sorted according to the most frequent. The source of input data is from several texts and volumes with over five million letters.
| # | Hex | Letter | Frequency | Percentage | # | Hex | Letter | Frequency | Percentage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0627 | ا | 640362 | 12.502 | 19 | 0629 | ة | 72952 | 1.424 |
| 2 | 0644 | ل | 618465 | 12.074 | 20 | 0649 | ى | 66064 | 1.290 |
| 3 | 0646 | ن | 338646 | 6.611 | 21 | 062C | ج | 62900 | 1.228 |
| 4 | 0645 | م | 333858 | 6.518 | 22 | 0635 | ص | 53229 | 1.039 |
| 5 | 064A | ي | 325831 | 6.361 | 23 | 0625 | إ | 51033 | 0.996 |
| 6 | 0648 | و | 297064 | 5.800 | 24 | 0630 | ذ | 49066 | 0.958 |
| 7 | 0647 | ه | 260110 | 5.078 | 25 | 062B | ث | 44683 | 0.872 |
| 8 | 0628 | ب | 239062 | 4.667 | 26 | 062E | خ | 40535 | 0.791 |
| 9 | 0631 | ر | 215286 | 4.203 | 27 | 0634 | ش | 37606 | 0.734 |
| 10 | 0639 | ع | 205591 | 4.014 | 28 | 0632 | ز | 26779 | 0.523 |
| 11 | 0623 | أ | 147879 | 2.887 | 29 | 0637 | ط | 25422 | 0.496 |
| 12 | 0641 | ف | 145287 | 2.836 | 30 | 0636 | ض | 22640 | 0.442 |
| 13 | 0642 | ق | 137803 | 2.690 | 31 | 063A | غ | 16693 | 0.326 |
| 14 | 062F | د | 136497 | 2.665 | 32 | 0621 | ء | 15932 | 0.311 |
| 15 | 062A | ت | 133497 | 2.606 | 33 | 0626 | ئ | 14390 | 0.281 |
| 16 | 0633 | س | 126270 | 2.465 | 34 | 0638 | ظ | 8959 | 0.175 |
| 17 | 0643 | ك | 104403 | 2.038 | 35 | 0622 | آ | 7492 | 0.146 |
| 18 | 062D | ح | 95474 | 1.864 | 36 | 0624 | ؤ | 4372 | 0.085 |
And here is the data from the Quran and from the sources above, displayed side by side in Table 6. Line chart figures are available in this page. If you spend some time analyzing the variations in the data, you could conclude that it all makes sense. As an example, notice how the letter آ ranks 27th on the Quran list but 35th on the Others list. This is because the divine Quran scripture makes more use of this letter than Earthly writing does. In fact, the letter آ is frequently represented equally well by the digraph ءا (as in قرءان as opposed to قرآن ). Generally, however, the two line charts in the figure, or the two lists in Table 6, simply show that the statistics are more similar than different, as should be expected.
Table 6: A comparative display of input sources for frequency distribution analysis.
| # | Letter | Quran% | Letter | Others% | # | Letter | Quran% | Letter | Others% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | ا | 13.099 | ا | 12.502 | 19 | ذ | 1.491 | ة | 1.424 |
| 2 | ل | 11.547 | ل | 12.074 | 20 | ح | 1.252 | ى | 1.290 |
| 3 | ن | 8.246 | ن | 6.611 | 21 | ج | 1.003 | ج | 1.228 |
| 4 | م | 8.084 | م | 6.518 | 22 | ى | 0.787 | ص | 1.039 |
| 5 | و | 7.502 | ي | 6.361 | 23 | خ | 0.755 | إ | 0.996 |
| 6 | ي | 6.645 | و | 5.800 | 24 | ة | 0.714 | ذ | 0.958 |
| 7 | ه | 4.490 | ه | 5.078 | 25 | ش | 0.642 | ث | 0.872 |
| 8 | ر | 3.750 | ب | 4.667 | 26 | ص | 0.626 | خ | 0.791 |
| 9 | ب | 3.474 | ر | 4.203 | 27 | آ | 0.523 | ش | 0.734 |
| 10 | ت | 3.175 | ع | 4.014 | 28 | ض | 0.510 | ز | 0.523 |
| 11 | ك | 3.174 | أ | 2.887 | 29 | ز | 0.483 | ط | 0.496 |
| 12 | ع | 2.844 | ف | 2.836 | 30 | ء | 0.461 | ض | 0.442 |
| 13 | أ | 2.757 | ق | 2.690 | 31 | ث | 0.428 | غ | 0.326 |
| 14 | ف | 2.645 | د | 2.665 | 32 | ط | 0.385 | ء | 0.311 |
| 15 | ق | 2.127 | ت | 2.606 | 33 | غ | 0.369 | ئ | 0.281 |
| 16 | س | 1.818 | س | 2.465 | 34 | ئ | 0.348 | ظ | 0.175 |
| 17 | د | 1.811 | ك | 2.038 | 35 | ظ | 0.258 | آ | 0.146 |
| 18 | إ | 1.538 | ح | 1.864 | 36 | ؤ | 0.237 | ؤ | 0.085 |
Interesting Findings
While searching for frequency analysis on Arabic letters, I stumbled upon some findings and sites you may be interested to know about.
Quran Frequency Analysis, in English!
In [12], statistics are computed on an English translation of the Quran! It is hard to see the significance of such an effort when the analysis is done on a mere translation. On the one hand, it shows the importance of, and the interest invested in, analyzing the Quran scripture from a computational viewpoint; on the other hand, such efforts would surely produce results that are in total harmony with frequency analysis of any typical English text.
Cracking Ciphers
About 1000 years ago, a complete method of cryptanalysis was explained in sufficient detail, and in handwriting! Would you not like to see it? Check [13].
Inaccuracies Unexplained!
As I said in the beginning, before writing ALWFA, I wandered for some time looking for any resources on Arabic letter frequency analysis or anything close. I found one: a 2005 paper authored by a then PhD candidate in England. In Table 1 of her paper, she lists 24 suras, all of which are made up of more than 1000 words. The source of data she picked for her experiments is interesting: the whole Quran is transliterated word by word, presumably to help non-Arabic speakers get a handle on correct pronunciation. What I found not to add up, however, is the number of words listed for each sura in her table. The numbers are sometimes off by 100! I examined Surat Al-kahf as listed in the source she used by pasting it into MS Word to count the words. There are 1693 words if you strictly count from Verse 1 to Verse 110. Subtracting 110 from that number to remove the verse numbers from the statistics gives 1583! This is the number arrived at by ALWFA (and by me, page by page, word by word). However, she lists 1489 words for the same sura! I will share any feedback with you if she responds to the email I sent her asking how her data came into being.
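The word-count check described above can be sketched in Python. This is a hypothetical illustration, not the procedure used by ALWFA or by MS Word: it assumes that each verse number appears as its own whitespace-separated token made of digits with an optional trailing period, and the sample text is a made-up placeholder.

```python
import re

# Assumption: a verse number is a run of digits optionally followed by a period.
VERSE_NUMBER = re.compile(r"\d+\.?")


def count_words(text: str) -> int:
    """Count whitespace-separated tokens, ignoring verse-number tokens."""
    return sum(1 for token in text.split() if not VERSE_NUMBER.fullmatch(token))


# Made-up placeholder text standing in for two numbered verses.
sample = "1. alpha beta gamma 2. delta epsilon"
print(len(sample.split()), count_words(sample))  # 7 5

# The same subtraction applied to the figures quoted above for Surat Al-kahf.
print(1693 - 110)  # 1583
```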
References
[1] http://en.wikipedia.org/wiki/Letter_frequencies
[2] http://en.wikipedia.org/wiki/Frequency_analysis
[3] http://www.simonsingh.com/The_Black_Chamber/frequencyanalysis.html
[4] http://en.wikipedia.org/wiki/Alkindi
[5] http://cs-exhibitions.uni-klu.ac.at/index.php?id=279
[6] http://www.muslimheritage.com/topics/default.cfm?ArticleID=372
[7] http://www.oneummah.net/quran/quran.html. I have copied all suras in Arabic and saved them as a .docx file (that is, in Microsoft Office Word 2007 format). I zipped all the suras for downloading. You can find information about the Quran in its page on Wikipedia; maybe you can help with the "cleanup" or adding "citations" tasks they seek at the top of that page.
[8] http://en.wikipedia.org/wiki/Arabic_alphabet
[9] http://www.unicode.org/charts/PDF/U0600.pdf
[10] http://en.wikipedia.org/wiki/Al-Fatiha. Let me know if you would like to help and provide interesting statistics, such as those found in [10], for all 114 suras of the Quran. Note that all suras can be accessed by number in the table shown in the Statistics.
[11] http://www.qurancomplex.com/Quran/display/Display.asp. If you are browsing using IE, you will be guided towards installing any missing plug-ins you need to view the pages of the Quran. If you are using Firefox like me, then although you are prompted to download a missing plug-in, you won't know what to do with it unless you are an expert, or unless you follow the instructions given to me by a cyberspace helper in the Firefox forums (known as GingerBread Guy), which say:
- Save the KFCFont.XPI file to your desktop.
- Rename it to KFCFont.xpi
- In Firefox, go to Tools > Add-ons > Extensions.
- Drag and drop the file from the desktop to the Add-ons window. Nothing will appear to happen. Wait a few moments, close the Add-ons window, then exit and restart Firefox.
- In the address bar, type about:plugins and press Enter. Verify that the plug-in is installed (see screenshot). If the plug-in appears in the list, you may load the site now (see screenshot).
Removing (i.e., uninstalling) this particular plug-in is not straightforward! GingerBread Guy to the rescue again:
Please note that there is no way to uninstall the plug-in from Firefox. If you want to remove it, follow these steps:
- Exit Firefox.
- Open the Firefox plug-in directory (usually C:\Program Files\Mozilla Firefox\plugins) and delete FontDown.dll, npKFCFont.dll and npKFCFont.xpt
- Click the Start button, then click Run. Type %windir%\system32\ and press OK. Delete the "KFC" folder.
[12] http://www.intratext.com/IXT/ENG0027/_STAT.HTM
[13] http://www.simonsingh.net/The_Black_Chamber/crackingsubstitution.html
Contact Me
Any comments? Any inaccuracies? Interesting additions or links? Please do not hesitate to send me any remarks that would enhance the quality of these pages. Thank you.
You can always reach me by email at mmadi@intellaren.com.