Intellaren

A study of Arabic letter frequency analysis

This article exists in other translations [ Id: ar00001  عربي ], also accessible through the articles page.

Motivation

While designing Intellark, Intellaren's Arabic keyboard layout that lets you type in Arabic using your English typing knowledge, we badly needed Arabic letter frequency data to make informed decisions about some key mappings. Strangely, we could not find any resource on the Internet offering a table of Arabic letter frequencies, as exists for English [1,2,3]. This prompted the birth of Intellyze, Intellaren's Arabic letter and word frequency analyzer, which later enabled us to present this basic study of Arabic letter frequency.

This is surprising, since the very notion of "frequency analysis" originated with the Arabs about 1,000 years ago, in the work of the eminent scholar Abu Yusuf Ya'qoub ibn Ishaaq Al-Kindi [4,5,6]. Al-Kindi (800 - 873 AD) wrote over 250 books on subjects spanning what universities now offer as degree programs in the sciences and humanities, and he did so without computers to help him type, software to cut years of computation down to a few milliseconds, or the Internet to let him draw on the knowledge of others. Next to that, running a frequency counter on the Quran or some other large text is a small task indeed. More than anything else, we would like to see a frequency distribution dating as far back as the ninth century, to compare its accuracy and methodology given the primitive tools available then.

In [5], it says "They realized the rarest letters in Arabic and the most common letters: the letters 'a' and 'l' are the most common in Arabic, whereas the letter 'j' appears only a tenth as frequency". While it is easy to agree that  ا  and  ل  are the most common letters when scanning through Arabic text, the statement about  ج  (Jeem) is off by a factor of 10: by our counts, Jeem accounts for roughly a hundredth of all letters, not a tenth. So, who exactly do the authors of [5] mean by "They" in the above quote? What is the source of their findings? That remains to be seen.

In this work, frequency analysis is conducted on several sources totaling more than five million letters, to obtain a fairly stable Arabic letter frequency distribution. The document concludes by reporting some interesting findings made along the way.

First things first: What gets counted in input text?

The Arabic alphabet consists chiefly of 28 primary letters, listed as letters 1 to 28 in Table 1. However, when writing in Arabic, the eight modified letters listed in positions 29 to 36 of the same table are used as well. If we lump these eight modified forms back onto the primary letters based on shape or phonetic similarity, we end up with the listing shown in Table 2. For accurate frequency analysis, Intellyze does not lump; it leaves lumping to the user, if needed. Note that the ordering of the alphabet used in these tables is more logical than the one used by the Unicode standard.

 

Table 1: The Arabic alphabet. Letters 1 to 28 are the primary letters. Letters 29 to 36 are the modified letters.

 

Table 2: The Arabic alphabet, with modified letters lumped onto their primary forms.
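Lumping can be applied after counting, as a simple post-processing step. The following is a minimal Python sketch, not Intellyze's actual rule set: since Table 2 is not reproduced here, the set of modified letters and every mapping in the hypothetical `LUMP_MAP` (including where the standalone hamza and the teh marbuta go) are assumptions made for illustration.

```python
from collections import Counter

# Primary letters used as lumping targets.
ALEF, WAW, YEH, HEH = "\u0627", "\u0648", "\u064A", "\u0647"

# Hypothetical lumping table (assumption, not taken from the article's Table 2).
LUMP_MAP = {
    "\u0621": ALEF,  # standalone hamza -> alef (assumed)
    "\u0622": ALEF,  # alef with madda -> alef
    "\u0623": ALEF,  # alef with hamza above -> alef
    "\u0625": ALEF,  # alef with hamza below -> alef
    "\u0624": WAW,   # waw with hamza -> waw
    "\u0626": YEH,   # yeh with hamza -> yeh
    "\u0649": YEH,   # alef maksura -> yeh (shape similarity, assumed)
    "\u0629": HEH,   # teh marbuta -> heh (shape similarity, assumed)
}

def lump(counts):
    """Fold the counts of modified letters into their primary letters."""
    lumped = Counter()
    for letter, n in counts.items():
        lumped[LUMP_MAP.get(letter, letter)] += n
    return lumped

# Made-up counts, only to show the effect of lumping.
raw = Counter({ALEF: 5, "\u0623": 2, "\u0629": 1, HEH: 3})
lumped = lump(raw)
print(lumped[ALEF], lumped[HEH])  # 7 4
```

Keeping the raw counts and lumping only on demand mirrors the choice described above: the unlumped data stays available, and a different mapping (for example, sending teh marbuta to teh on phonetic grounds) can be tried without recounting.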

Arabic Letter Frequency using only the Quran as input source

First, let's consider the frequency distribution using only the Quran as the input source. Table 3 below presents the letter frequency data based on the 114 suras of the Quran. The listing is sorted according to the Unicode standard. The three columns of Table 3 are as follows:

  • Column Letter shows the Arabic letter (حرف) whose frequency is counted
  • Column Frequency shows the number of occurrences of each letter among the 330,709 letters counted in the entire Quran
  • Finally, Column Percentage shows the value recorded in Column Frequency as a percentage of the total. As an example calculation, consider Letter ا : dividing its frequency by the total count of letters in the Quran and multiplying by 100 (i.e., 43,542 / 330,709 * 100) gives the percentage 13.17.


Table 3: Arabic letter frequency distribution sorted according to the Unicode standard. See the Quran Sura Statistics page (QSS) for a coverage of how the data are computed.
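The counting and percentage steps described for Table 3 can be sketched in a few lines of Python. This is a minimal illustration, not Intellyze's actual implementation: the function names are hypothetical, and the choice of which code points count as letters (U+0621 to U+063A and U+0641 to U+064A, 36 characters in total) is an assumption.

```python
from collections import Counter

def is_arabic_letter(ch):
    """Assumed letter set: U+0621..U+063A and U+0641..U+064A (36 code points)."""
    return "\u0621" <= ch <= "\u063A" or "\u0641" <= ch <= "\u064A"

def letter_frequencies(text):
    """Count each Arabic letter; spaces, digits and diacritics are skipped."""
    return Counter(ch for ch in text if is_arabic_letter(ch))

def percentage(count, total):
    """Share of the total, in percent, rounded to two decimals."""
    return round(count / total * 100, 2)

# Sample input: the two words beh-seen-meem and alef-lam-lam-heh.
sample = "\u0628\u0633\u0645 \u0627\u0644\u0644\u0647"
counts = letter_frequencies(sample)
total = sum(counts.values())

by_unicode = sorted(counts.items())   # ordering used for Table 3
by_frequency = counts.most_common()   # ordering used for Table 4

print(total)                          # 7
print(percentage(43542, 330709))      # 13.17, the article's example for alef
```

The same `counts` object serves both orderings, so the Unicode-sorted and frequency-sorted listings are two views of one count.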

Table 4 below shows the same information as Table 3, but with the letters sorted from most to least frequent. Note that the letter Jeem appears in the 21st position and accounts for about one hundredth of all letters. That is, given that the average word length in the Quran is 4.25 letters, Jeem can be expected to occur only once in every 23.53 words (100 / 4.25), or once every 100 letters. This again refutes the statement quoted from [5] above. Complete letter, word, and average-word-length statistics for the 114 suras of the Quran are found on the QSS page.


Table 4: Arabic letter frequency distribution sorted according to letter frequency in descending order.
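The "once in every 23.53 words" figure for Jeem follows from two numbers quoted in the text. A small Python check, with hypothetical variable names and the rounded inputs taken as given:

```python
# Both inputs are the rounded figures stated in the article.
avg_word_length = 4.25    # average letters per word in the Quran
letters_per_jeem = 100    # about one Jeem per 100 letters

# Convert a spacing measured in letters into a spacing measured in words.
words_per_jeem = letters_per_jeem / avg_word_length
print(round(words_per_jeem, 2))  # 23.53
```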

Arabic Letter Frequency using general sources

This work would not be complete without gathering statistical data from several other sources. The following well-known Arabic sources are used:

  • The first seven volumes of the series  البداية والنهاية   (The Beginning and The End) by Ibn Katheer. Together, these seven volumes fill 2,855 pages, with 1,096,047 words and 4,326,031 letters.
  • The sirah book   الرحيق المختوم   (The Sealed Nectar; sirah means the life of Prophet Mohammad  صلى الله عليه وسلم) by Almubarakfuri. The book runs to 284 pages, with 134,662 words and 553,740 letters.
  • The book  تحفة العروسين  (The Masterpiece of the Brides) by Al-shuri. The book runs to 239 pages, with 66,550 words and 242,361 letters.

Collectively, these sources add up to 3,378 pages, 1,297,259 words, and 5,122,132 letters. Table 5 gives the letter frequency distribution for this data.


Table 5: Arabic letter frequency distribution sorted according to letter frequency. The source of input data is several texts containing together over five million letters.
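The combined totals quoted for the three sources can be re-derived from the per-book figures. The Python sketch below does only that; the dictionary layout and names are illustrative, and the average word length it prints is computed here from the article's totals rather than reported by the article.

```python
# (pages, words, letters) for each source, as stated in the article.
sources = {
    "The Beginning and The End, vols. 1-7": (2855, 1096047, 4326031),
    "The Sealed Nectar": (284, 134662, 553740),
    "The Masterpiece of the Brides": (239, 66550, 242361),
}

# Sum each column across the three books.
pages, words, letters = (sum(column) for column in zip(*sources.values()))
print(pages, words, letters)  # 3378 1297259 5122132

# Derived figure (not stated in the article): letters per word overall.
avg_word_length = letters / words
print(round(avg_word_length, 2))
```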

Figures 1 and 2 below show histogram renderings of the data provided in Table 5.

 



Figure 1: Arabic letter frequency distribution of Table 5 data, sorted according to Unicode value.

 

 



Figure 2: Arabic letter frequency distribution of Table 5 data, sorted according to frequency of letters.

Line charts that compare the frequency of letters in the Quran to the frequency of letters in the aforementioned sources are available on the comparative statistics page. The statistics generator tool Intellyze was used to generate all the frequency data for the Arabic letters.

Interesting Findings

While searching for frequency analysis on Arabic letters, we stumbled upon some findings that may be of interest.

Quran Frequency Analysis, in English!

In [7], statistics are computed on an English translation of the Quran! It is hard to see the significance of such an effort when the analysis is done on a mere translation. On the one hand, it shows the importance of, and the interest invested in, analyzing the Quran from a computational viewpoint; on the other hand, such efforts merely produce results that match a frequency analysis of any typical English text.

Cracking Ciphers

About 1,000 years ago, a complete method of cryptanalysis was explained in sufficient detail, and in handwriting! Wouldn't you like to take a look? Check [8].

Inaccuracies Unexplained!

As mentioned at the beginning, before creating Intellyze we spent some time looking for existing resources on Arabic letter frequency analysis. We found one: a 2005 paper authored by a then PhD candidate in England. Table 1 of her paper lists 24 suras. Her choice of data source is interesting: a version of the Quran transliterated word by word, presumably to help non-Arabic speakers with correct pronunciation. What we found to be inaccurate is the number of words listed for each sura in that table; the numbers are sometimes off by around 100. We examined Surat-Alkahf as an example. The paper records 1,489 words for Surat-Alkahf, whereas Intellyze counts 1,583 words (a number we also confirmed by hand, page by page, word by word). We will share any feedback if she responds to the inquiry we sent her about how her data were obtained.

Contact Us

Any comments? Inaccuracies? Interesting additions or links? Please do not hesitate to send us your thoughts to enhance the quality and accuracy of this document. Thank you.

 

Comments (4)
ID: 132
By: Mohsen
October 13, 2013
Votes: +0
Re: Expanding the analysis

Hi Mark. Thank you for your note. When I did the initial analysis and wrote the articles you see here I thought I was just setting the stage for more work, analyses, articles and research to come. However, apart from the three precious comments you see below I am getting no feedback or interactivity that such work is desired. I would love to, but... to whom?

That said, it is always on my mind to carry this work further to develop better tools and software in collaboration with all interested.

I am familiar with the notion of digraphs, digrams and n-grams in general from the days I worked with cryptography. It would definitely be a nice next-step to consider those in further research and articles, maybe... inShaAllah.

ID: 131
By: mark mayzner
October 13, 2013
Votes: +0
Expanding the analysis

Have you considered expanding your analysis of letter frequencies to digram, trigram,
tetragram, etc., frequencies and which take into account word length and letter position?
.
As far as I know this has not been done for Arabic. This has been done for English, e.g.,
see: "http://norvig.com/mayzner.html".

ID: 82
By: beny
October 28, 2010
Votes: +0
...

i love this website

ID: 78
By: Nuurul
October 26, 2010
Votes: +0
Excellent Effort

I think this effort is to be congratulated to. I also did a research on this scope. Keep up the excellent works, to develop a better software.



Intellaren Software Inc.
info@intellaren.com