A study of Arabic letter frequency analysis
While designing Intellark (Intellaren's Arabic keyboard layout), the new keyboard layout that lets you type in Arabic using your existing English typing knowledge, we needed Arabic letter frequency data to make informed decisions about some key mappings. Strangely, we could not find any resource on the Internet with a table of Arabic letter frequencies comparable to those readily available for English [1,2,3]! This prompted the birth of Intellyze, Intellaren's Arabic letter and word frequency analyzer, which later enabled us to present this basic study of Arabic letter frequency.
This is indeed surprising, since the very notion of "frequency analysis" originated with the Arabs about 1,000 years ago, in the work of the eminent scholar Abu Yusuf Ya'qoub ibn Ishaaq Al-Kindi [4,5,6]. Al-Kindi (800 - 873 AD) wrote over 250 books on subjects spanning what universities now offer as degree programs in departments of sciences & humanities, and he did so without computers to help him type, without software to cut years of computation down to a few milliseconds, and without the Internet to help him draw on the knowledge of others! Next to that, what is running a frequency counter over the Quran or some other volume of text? More than anything else, we would like to see a frequency distribution dating as far back as the ninth century, to compare its accuracy and methodology given the primitive tools available then. In , it says "They realized the rarest letters in Arabic and the most common letters: the letters 'a' and 'l' are the most common in Arabic, whereas the letter 'j' appears only a tenth as frequency". Whereas it is easy to concur that ا and ل are the most common letters when scanning through Arabic text, the statement about ج (Jeem) is off by a factor of 10! In fact, its frequency is more like a "hundredth" (not a "tenth")! So, to whom precisely do the authors in  refer by "They" in the above quote? What is the source of their findings? That remains to be seen.
In this work, frequency analysis is conducted on several sources, providing more than five million letters of input in total, to obtain a fairly stable Arabic letter frequency distribution. The document concludes by reporting some interesting findings about the Arabic letters.
First things first: What gets counted in input text?
The Arabic alphabet consists chiefly of 28 primary letters; these are letters 1 to 28 in Table 1. However, when writing in Arabic, the eight modified letters listed in positions 29 to 36 of the same table are used just the same. If we lump these 8 modified forms back into the primary list based on shape or phonetic similarity, we end up with the listing shown in Table 2. For accurate frequency analysis, Intellyze doesn't lump; it leaves lumping for the user to do if needed. Note that the ordering of the alphabet used here is more logical than the one used by the Unicode standard.
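To make the counting rule concrete, here is a minimal sketch in Python. It is not Intellyze's implementation. The names `PRIMARY`, `MODIFIED`, `LUMP`, and `letter_frequencies` are hypothetical. The choice of the eight modified forms, and especially the lumping map, are our assumptions for illustration and may differ from what Tables 1 and 2 list.

```python
from collections import Counter

# The 28 primary letters, alif through ya (assumed to match positions 1 to 28).
PRIMARY = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"

# Eight modified forms (assumed to match positions 29 to 36):
# hamza, alif madda, alif with hamza above, waw with hamza,
# alif with hamza below, ya with hamza, ta marbuta, alif maqsura.
MODIFIED = "\u0621\u0622\u0623\u0624\u0625\u0626\u0629\u0649"

ALL_LETTERS = set(PRIMARY + MODIFIED)

# Hypothetical lumping map (by shape or sound); adjust it to your own needs.
LUMP = {
    "\u0621": "\u0627",  # hamza                 -> alif
    "\u0622": "\u0627",  # alif madda            -> alif
    "\u0623": "\u0627",  # alif with hamza above -> alif
    "\u0625": "\u0627",  # alif with hamza below -> alif
    "\u0624": "\u0648",  # waw with hamza        -> waw
    "\u0626": "\u064A",  # ya with hamza         -> ya
    "\u0649": "\u064A",  # alif maqsura          -> ya
    "\u0629": "\u0647",  # ta marbuta            -> ha
}


def letter_frequencies(text, lump=False):
    """Return {letter: relative frequency} over the Arabic letters in text.

    Anything that is not one of the 36 letters (spaces, digits,
    diacritics, Latin characters) is ignored.
    """
    letters = [ch for ch in text if ch in ALL_LETTERS]
    if lump:
        letters = [LUMP.get(ch, ch) for ch in letters]
    counts = Counter(letters)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}
```

Calling `letter_frequencies(text)` keeps all 36 forms separate, which is the no-lumping behaviour described above; passing `lump=True` folds the modified forms into their assumed primary letters before counting.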
Arabic Letter Frequency using only the Quran as input source
First, let's consider the frequency distribution using only the Quran as a source of input. Table 3 below presents the letter frequency data based on the 114 suras of the Quran. The listing is sorted according to the Unicode standard. Following is a description of the three columns of Table 3:
Table 4 below shows the same information as Table 3, but with the letters sorted from most to least frequent. Note that the letter Jeem appears in the 21st position, and that its frequency is one hundredth. That is, given that the average word length in the Quran is 4.25 letters, Jeem can be expected to occur only once in every 23.53 words (100 / 4.25), or once every 100 letters. This again refutes the statement highlighted in  above. Complete letter, word, and average-word-length statistics for the 114 suras of the Quran are found in the QSS page.
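The arithmetic behind the Jeem figure can be checked with a few lines of Python. This is only a worked illustration using the two numbers quoted above (a frequency of one hundredth and an average word length of 4.25 letters); the variable names are ours.

```python
# Figures quoted in the text above.
jeem_frequency = 1 / 100      # share of all letters that are Jeem
avg_word_length = 4.25        # average letters per word in the Quran

# On average, one Jeem every 1 / frequency letters.
letters_per_jeem = 1 / jeem_frequency

# Convert letters to words using the average word length.
words_per_jeem = letters_per_jeem / avg_word_length

print(f"one Jeem every {letters_per_jeem:.0f} letters")
print(f"one Jeem every {words_per_jeem:.2f} words")
```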
Arabic Letter Frequency using general sources
This work would not be complete without gathering statistical data from several other sources. The following famous Arabic sources are used:
Collectively, these sources add up to 3,378 pages, containing 1,297,259 words, or 5,122,132 letters. Table 5 gives the letter frequency distribution for this data.
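As a quick sanity check on these totals, the sketch below derives the average word length and the per-page averages from the three figures quoted above. The variable names are ours, and the derived averages are not reported in the original tables.

```python
# Totals quoted in the text above.
pages = 3_378
words = 1_297_259
letters = 5_122_132

avg_word_length = letters / words   # letters per word
words_per_page = words / pages
letters_per_page = letters / pages

print(f"average word length: {avg_word_length:.2f} letters")
print(f"words per page:      {words_per_page:.1f}")
print(f"letters per page:    {letters_per_page:.1f}")
```

The average word length comes out a little under 4 letters, somewhat shorter than the 4.25 quoted for the Quran.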
Figures 1 and 2 below show histogram renderings of the data provided in Table 5.
Line charts that compare the frequency of letters in the Quran to the frequency of letters in the aforementioned sources are available in the comparative statistics page. The statistics generator tool Intellyze was used to generate all the frequency data of the Arabic letters.
While searching for frequency analysis on Arabic letters, we stumbled upon some findings that may be of interest.
Quran Frequency Analysis, in English!
In , statistics are computed on an English-translated version of the Quran! It is hard to see the significance of such an effort when the analysis is done on a mere translation. On one hand, it shows the importance of, and the interest invested in, analyzing the Quran scripture from a computational viewpoint; on the other hand, such efforts would merely produce results that are in complete agreement with a frequency analysis of any typical English text.
Cracking Ciphers
Literally, about 1,000 years ago, a complete method of performing cryptanalysis was explained in sufficient detail, and in handwriting! Wouldn't you like to take a look? Check .
Inaccuracies Unexplained!
As we mentioned at the beginning, before attempting to create Intellyze we spent some time looking for any resources on Arabic letter frequency analysis. We found one: a 2005 paper authored by someone who was, at that time, a PhD candidate in England. Table 1 of her paper lists 24 suras. The data source she picks for her experiments is interesting: the whole Quran transliterated word by word, presumably to help non-Arabic speakers get a handle on correct pronunciation. What we found to be inaccurate is the number of words listed for each sura in that table. The numbers are sometimes off by 100! We examined Surat-Alkahf as an example. The paper records 1,489 words in Surat-Alkahf, but Intellyze computes 1,583 (and this is also the number we arrived at by counting page by page, word by word). If she responds to the inquiry we sent her on how her data came into being, we will share her feedback with you.
Any comments? Inaccuracies? Interesting additions or links? Please do not hesitate to send us your thoughts to enhance the quality and accuracy of this document. Thank you.
Hi Mark. Thank you for your note. When I did the initial analysis and wrote the articles you see here I thought I was just setting the stage for more work, analyses, articles and research to come. However, apart from the three precious comments you see below I am getting no feedback or interactivity that such work is desired. I would love to, but... to whom?
That said, it is always on my mind to carry this work further to develop better tools and software in collaboration with all interested.
I am familiar with the notion of digraphs, digrams and n-grams in general from the days I worked with cryptography. It would definitely be a nice next-step to consider those in further research and articles, maybe... inShaAllah.
Have you considered expanding your analysis of letter frequencies to digram, trigram, tetragram, etc., frequencies that take into account word length and letter position?
As far as I know this has not been done for Arabic. This has been done for English, e.g.,