Python Readability Index (NLP)
In this tutorial, we will talk about the readability index in Python (NLP). Readability focuses on how the chosen words are put together into sentences and paragraphs. In NLP, the readability of a text depends on the complexity of its vocabulary and syntax. Different readability formulas estimate how difficult a text is to read; the values they produce are called readability scores. These scores help the programmer write sentences that are easily comprehensible and engaging for the audience.
Following are the various methods to find the readability score (a plain-arithmetic sketch of each formula appears after this list):
- The Dale-Chall Formula:
The Dale-Chall formula is a numeric estimate of the difficulty readers will have with a text. It is based on a list of 3,000 words that a fourth-grade American student can understand; those words are not considered difficult, and every word outside the list is.
Formula:
Raw Score = 0.1579 * DW + 0.0496 * AS
Raw Score = reading grade of a reader who can comprehend the text at 3rd grade or below
DW = percentage of difficult words
AS = average sentence length in words
- The Gunning Fog Formula:
Find the average sentence length by dividing the total number of words by the number of sentences. Hard words are words with more than two syllables; the number of hard words divided by the total number of words in the passage, times 100, gives the percentage of hard words.
Formula:
Grade Level = 0.4 * (AS + PH)
AS = average sentence length
PH = percentage of hard words
- The SMOG Formula:
SMOG is a simple method used to determine the difficulty level of written material: the grade at or above which people can be expected to understand the text, generally sixth grade or less. Take 10 sentences each from the start, the middle, and the end of the passage, then count every word with three or more syllables. This count is the one used in the formula.
SMOG Grade = 3 + Square Root of Polysyllable Count
SMOG tends to predict scores about two grades higher than the Dale-Chall formula.
- The Flesch Formula:
This formula tells you what level of education someone needs to understand the text. It gives a score between 0 and 100, from which the grade can be read off; higher scores mean easier text.
Formula:
Score = 206.835 - (1.015 × ASL) - (84.6 × ASW)
Here,
ASL = average sentence length (words per sentence)
ASW = average word length in syllables (syllables per word)
Now we will see how to implement these readability indexes in the following program.
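As a quick sanity check before the full program, here is a minimal sketch that applies each formula to hand-counted statistics. All of the counts below are made-up example values for a hypothetical passage, not measurements of a real text.

# Plain-arithmetic sanity check of the four formulas.
# All counts are hypothetical example values.
words, sentences = 120, 8          # assumed totals for a sample passage
difficult, polysyllables = 18, 10  # assumed difficult / 3+ syllable word counts
syllables = 170                    # assumed total syllable count

asl = words / sentences            # average sentence length: 15.0
dw = difficult / words * 100       # percentage of difficult words: 15.0

# Dale-Chall: add the 3.6365 adjustment when difficult words exceed 5%
dale_chall = 0.1579 * dw + 0.0496 * asl
if dw > 5:
    dale_chall += 3.6365           # about 6.75 for these counts

# Gunning Fog: 0.4 * (average sentence length + percentage of hard words)
gunning_fog = 0.4 * (asl + polysyllables / words * 100)      # about 9.33

# Simplified SMOG: 3 + square root of the polysyllable count
smog = 3 + polysyllables ** 0.5                              # about 6.16

# Flesch Reading Ease: higher scores mean easier text
flesch = 206.835 - 1.015 * asl - 84.6 * (syllables / words)  # about 71.76

print(dale_chall, gunning_fog, smog, flesch)

The full implementation with spaCy and textstat follows.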
import spacy
# legacy_round is provided by textstat (its location is version-dependent)
from textstat.textstat import textstatistics, legacy_round
import textstat


def break_sentences(text):
    # Load the small English model and split the text into sentences.
    # (spacy.load('en') is deprecated; use the model name directly.)
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    return list(doc.sents)


def word_count(text):
    sentences = break_sentences(text)
    words = 0
    for sentence in sentences:
        words += len([token for token in sentence])
    return words


def sentence_count(text):
    sentences = break_sentences(text)
    return len(sentences)


def avg_sentence_length(text):
    words = word_count(text)
    sentences = sentence_count(text)
    average_sentence_length = float(words / sentences)
    return average_sentence_length


def syllables_count(word):
    return textstatistics().syllable_count(word)


def avg_syllables_per_word(text):
    syllable = syllables_count(text)
    words = word_count(text)
    ASPW = float(syllable) / float(words)
    return legacy_round(ASPW, 1)


def difficult_words(text):
    words = []
    sentences = break_sentences(text)
    for sentence in sentences:
        words += [str(token) for token in sentence]
    # A word is difficult if it is not on textstat's easy-word list and has
    # two or more syllables. The accessor below is a private, name-mangled
    # part of textstat and may differ between versions.
    easy_words = textstatistics()._textstatistics__get_lang_easy_words()
    diff_words_set = set()
    for word in words:
        syllable_count = syllables_count(word)
        if word not in easy_words and syllable_count >= 2:
            diff_words_set.add(word)
    return len(diff_words_set)


def poly_syllable_count(text):
    # Count words with three or more syllables.
    count = 0
    words = []
    sentences = break_sentences(text)
    for sentence in sentences:
        words += [token for token in sentence]
    for word in words:
        syllable_count = syllables_count(str(word))
        if syllable_count >= 3:
            count += 1
    return count


def flesch_reading_ease(text):
    FRE = (206.835 - float(1.015 * avg_sentence_length(text))
           - float(84.6 * avg_syllables_per_word(text)))
    return legacy_round(FRE, 2)


def gunning_fog(text):
    # Estimate the percentage of hard words, then apply 0.4 * (AS + PH).
    per_diff_words = (difficult_words(text) / word_count(text) * 100) + 5
    grade = 0.4 * (avg_sentence_length(text) + per_diff_words)
    return grade


def smog_index(text):
    # SMOG needs at least 3 sentences to give a meaningful estimate.
    if sentence_count(text) >= 3:
        poly_syllab = poly_syllable_count(text)
        SMOG = (1.043 * (30 * (poly_syllab / sentence_count(text))) ** 0.5
                + 3.1291)
        return legacy_round(SMOG, 1)
    else:
        return 0


def dale_chall_readability_score(text):
    words = word_count(text)
    # Count of words that are NOT difficult.
    count = words - difficult_words(text)
    if words > 0:
        per = float(count) / float(words) * 100
        diff_words = 100 - per  # percentage of difficult words
        raw_score = (0.1579 * diff_words) + (0.0496 * avg_sentence_length(text))
        # Adjust the score upward when more than 5% of the words are difficult.
        if diff_words > 5:
            raw_score += 3.6365
        return legacy_round(raw_score, 2)
test_data = (
    "It is very easy to not feel obliged. "
    "The feeling of entitlement is what keeps people not kind to people. "
    "In this world, mental health and your mind is very important."
)

print(textstat.flesch_reading_ease(test_data))
OUTPUT: 52.23
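The same passage can also be run through the functions defined above. The exact values depend on your spaCy model and textstat version, so treat these calls as a usage sketch rather than reference output.

print(flesch_reading_ease(test_data))
print(gunning_fog(test_data))
print(smog_index(test_data))
print(dale_chall_readability_score(test_data))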
In conclusion, we can see how readability scores help us make our NLP models more user friendly: by measuring how comprehensible the generated text is, we can tune our models to produce output that the audience can actually read and engage with.