Python Readability Index (NLP)
In this tutorial, we will talk about the readability index in Python (NLP). Readability focuses on how the chosen words are put together into sentences and paragraphs. In NLP, the readability of a text depends on the complexity of its vocabulary and syntax. Different readability formulas estimate how difficult a text is to read; the values they produce are called readability scores. These scores help the programmer write sentences that are easily comprehensible and engaging for the audience.
Following are the various methods to find the readability score (a plain-arithmetic sketch of each formula appears after this list):
- The Dale-Chall Formula:
The Dale-Chall formula is a numeric estimate of the difficulty readers will have with a text. It is based on a list of 3,000 words that a fourth-grade American student can understand; those words are not considered difficult, and every word outside the list is.
Formula:
Raw Score = 0.1579 * DW + 0.0496 * AS
Raw Score = reading grade of a reader who can comprehend the text at 3rd grade or below
DW = percentage of difficult words
AS = average sentence length in words
- The Gunning Fog Formula:
Find the average sentence length by dividing the total number of words by the number of sentences. Hard words are words with more than two syllables; the number of hard words divided by the total number of words in the passage, times 100, gives the percentage of hard words.
Formula:
Grade Level = 0.4 * (AS + PH)
AS = average sentence length
PH = percentage of hard words
- The SMOG Formula:
SMOG is a simple method used to determine the difficulty level of written material: the grade at or above which people can be expected to understand the text, generally sixth grade or less. Take 10 sentences each from the start, the middle, and the end of the passage, then count every word with three or more syllables. This count is the one used in the formula.
SMOG Grade = 3 + Square Root of Polysyllable Count
SMOG tends to predict scores about two grades higher than the Dale-Chall formula.
- The Flesch Formula:
This formula tells you what level of education someone needs to understand the text. It gives a score between 0 and 100, from which the grade can be read off; higher scores mean easier text.
Formula:
Score = 206.835 - (1.015 × ASL) - (84.6 × ASW)
Here,
ASL = average sentence length (words per sentence)
ASW = average word length in syllables (syllables per word)
Now we will see how to implement these readability indexes in the following program.
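As a quick sanity check before the full program, here is a minimal sketch that applies each formula to hand-counted statistics. All of the counts below are made-up example values for a hypothetical passage, not measurements of a real text.

# Plain-arithmetic sanity check of the four formulas.
# All counts are hypothetical example values.
words, sentences = 120, 8          # assumed totals for a sample passage
difficult, polysyllables = 18, 10  # assumed difficult / 3+ syllable word counts
syllables = 170                    # assumed total syllable count

asl = words / sentences            # average sentence length: 15.0
dw = difficult / words * 100       # percentage of difficult words: 15.0

# Dale-Chall: add the 3.6365 adjustment when difficult words exceed 5%
dale_chall = 0.1579 * dw + 0.0496 * asl
if dw > 5:
    dale_chall += 3.6365           # about 6.75 for these counts

# Gunning Fog: 0.4 * (average sentence length + percentage of hard words)
gunning_fog = 0.4 * (asl + polysyllables / words * 100)      # about 9.33

# Simplified SMOG: 3 + square root of the polysyllable count
smog = 3 + polysyllables ** 0.5                              # about 6.16

# Flesch Reading Ease: higher scores mean easier text
flesch = 206.835 - 1.015 * asl - 84.6 * (syllables / words)  # about 71.76

print(dale_chall, gunning_fog, smog, flesch)

The full implementation with spaCy and textstat follows.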
import spacy
# legacy_round is provided by textstat (its location is version-dependent)
from textstat.textstat import textstatistics, legacy_round
import textstat


def break_sentences(text):
    # Load the small English model and split the text into sentences.
    # (spacy.load('en') is deprecated; use the model name directly.)
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    return list(doc.sents)


def word_count(text):
    sentences = break_sentences(text)
    words = 0
    for sentence in sentences:
        words += len([token for token in sentence])
    return words


def sentence_count(text):
    sentences = break_sentences(text)
    return len(sentences)


def avg_sentence_length(text):
    words = word_count(text)
    sentences = sentence_count(text)
    average_sentence_length = float(words / sentences)
    return average_sentence_length


def syllables_count(word):
    return textstatistics().syllable_count(word)


def avg_syllables_per_word(text):
    syllable = syllables_count(text)
    words = word_count(text)
    ASPW = float(syllable) / float(words)
    return legacy_round(ASPW, 1)


def difficult_words(text):
    words = []
    sentences = break_sentences(text)
    for sentence in sentences:
        words += [str(token) for token in sentence]
    # A word is difficult if it is not on textstat's easy-word list and has
    # two or more syllables. The accessor below is a private, name-mangled
    # part of textstat and may differ between versions.
    easy_words = textstatistics()._textstatistics__get_lang_easy_words()
    diff_words_set = set()
    for word in words:
        syllable_count = syllables_count(word)
        if word not in easy_words and syllable_count >= 2:
            diff_words_set.add(word)
    return len(diff_words_set)


def poly_syllable_count(text):
    # Count words with three or more syllables.
    count = 0
    words = []
    sentences = break_sentences(text)
    for sentence in sentences:
        words += [token for token in sentence]
    for word in words:
        syllable_count = syllables_count(str(word))
        if syllable_count >= 3:
            count += 1
    return count


def flesch_reading_ease(text):
    FRE = (206.835 - float(1.015 * avg_sentence_length(text))
           - float(84.6 * avg_syllables_per_word(text)))
    return legacy_round(FRE, 2)


def gunning_fog(text):
    # Estimate the percentage of hard words, then apply 0.4 * (AS + PH).
    per_diff_words = (difficult_words(text) / word_count(text) * 100) + 5
    grade = 0.4 * (avg_sentence_length(text) + per_diff_words)
    return grade


def smog_index(text):
    # SMOG needs at least 3 sentences to give a meaningful estimate.
    if sentence_count(text) >= 3:
        poly_syllab = poly_syllable_count(text)
        SMOG = (1.043 * (30 * (poly_syllab / sentence_count(text))) ** 0.5
                + 3.1291)
        return legacy_round(SMOG, 1)
    else:
        return 0


def dale_chall_readability_score(text):
    words = word_count(text)
    # Count of words that are NOT difficult.
    count = words - difficult_words(text)
    if words > 0:
        per = float(count) / float(words) * 100
        diff_words = 100 - per  # percentage of difficult words
        raw_score = (0.1579 * diff_words) + (0.0496 * avg_sentence_length(text))
        # Adjust the score upward when more than 5% of the words are difficult.
        if diff_words > 5:
            raw_score += 3.6365
        return legacy_round(raw_score, 2)
test_data = (
    "It is very easy to not feel obliged. "
    "The feeling of entitlement is what keeps people not kind to people. "
    "In this world, mental health and your mind is very important."
)

print(textstat.flesch_reading_ease(test_data))
OUTPUT: 52.23
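The same passage can also be run through the functions defined above. The exact values depend on your spaCy model and textstat version, so treat these calls as a usage sketch rather than reference output.

print(flesch_reading_ease(test_data))
print(gunning_fog(test_data))
print(smog_index(test_data))
print(dale_chall_readability_score(test_data))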
In conclusion, we can see how readability scores help us make our NLP models more user friendly: by measuring how comprehensible the generated text is, we can tune our models to produce output that the audience can actually read and engage with.