
Using natural language processing to analyze religious text

Religion has sparked many a war over differences in belief and interpretation. And yet, the religious texts on which these belief systems are built are quite similar. Analyzing that textual data is difficult, though: the texts span a multitude of languages, often don’t follow typical or set syntax, and, above all, exist in overwhelming numbers. Comparing all religious texts is a virtually impossible task for any human and a complex one for computers, which makes it an intriguing natural language processing problem.

Comparing all religious texts is a virtually impossible task for any human, so this becomes an intriguing natural language processing problem.

Natural language processing (NLP) is a subset of artificial intelligence focused on processing and understanding textual information through machine learning. A few different NLP techniques are commonly used, including term frequency analysis (tracking the number of times certain words appear), sentiment analysis (classifying text as positive or negative based on its connotations), and topic extraction (determining the main topics discussed in the text). In a study by researcher Daniel McDonald at Utah Valley University, various religious texts were analyzed and compared, including the Bible, Qur’an, and Torah. In this study, the primary NLP technique was topic extraction.
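The simplest of these techniques, term frequency analysis, can be sketched in a few lines of Python using only the standard library. The sample verse and the word pattern below are illustrative choices, not taken from the study:

```python
from collections import Counter
import re

def term_frequencies(text):
    """Count how often each word appears, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

verse = "In the beginning God created the heaven and the earth."
freqs = term_frequencies(verse)
print(freqs.most_common(2))  # [('the', 3), ('in', 1)]
```

Real systems would also strip out common function words like “the” (so-called stop words) before counting, so that content words dominate the frequencies.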

First, the text was divided into sentence chunks, and words were tagged based on their content and part of speech. McDonald’s research focused on just the verbs and nouns used. After this initial round of processing, a second pass through the text combined similar content tags to make more encompassing topic tags. Topics were then ranked by the frequency of their terms, with the highest-frequency topics considered the most relevant and only those kept. Additionally, the topics were divided based on whether they were noun or verb topics. Some examples of noun topics were animals, family relationships, and Earth, so these consisted of words like “sheep,” “brother,” and “world.” Some examples of verb topics were to amuse, to appear, and to conjecture, with words like “dazzle,” “arose,” and “knoweth.” The analysis determined how much overlap existed between different texts in both noun and verb topics.
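The second pass described above, mapping individual words to broader topic tags and then ranking topics by term frequency, can be sketched as follows. The tiny topic lexicon here is a hypothetical stand-in for the study’s actual content tags; the example words are the ones quoted in the article:

```python
from collections import Counter

# Hypothetical lexicon mapping nouns to broader topic tags,
# standing in for the study's two-pass content tagging.
NOUN_TOPICS = {
    "sheep": "animals", "ox": "animals",
    "brother": "family relationships", "father": "family relationships",
    "world": "Earth", "land": "Earth",
}

def rank_noun_topics(words):
    """Tag each known noun with its topic, then rank topics by term frequency."""
    tags = [NOUN_TOPICS[w] for w in words if w in NOUN_TOPICS]
    return Counter(tags).most_common()

sample = ["sheep", "brother", "world", "land", "ox", "sheep"]
print(rank_noun_topics(sample))
# [('animals', 3), ('Earth', 2), ('family relationships', 1)]
```

In practice the part-of-speech tagging step would be done with an NLP library rather than a hand-built lexicon, but the ranking logic is the same.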

Texts not often associated with each other, like the New Testament and Tao Te Ching, have significant topic overlap.

With noun topic overlap, the books most similar were the Torah and Old Testament, with the Book of Mormon and Old Testament following close behind; the first pair had 80 percent overlap while the second had 78 percent. The Tao Te Ching, a book about the fundamentals of Taoism, and the Torah were most different in noun topic similarity, with an overlap of only 27 percent. Verb overlap scores produced similar results, with the Torah and Old Testament still the most similar. The verb overlap scores between the Tao Te Ching and Torah were still low, but the Greater Holy Assembly and the Rig Veda, an ancient Indian collection of Vedic hymns, beat them out for the lowest score.
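The study’s exact overlap metric isn’t described in detail here, but one plausible way to score overlap between two texts’ topic sets is Jaccard similarity: shared topics divided by all topics. The topic sets below are invented purely to illustrate the calculation and are not the study’s actual data:

```python
def topic_overlap(topics_a, topics_b):
    """Jaccard similarity between two topic sets, as a percentage."""
    a, b = set(topics_a), set(topics_b)
    return 100 * len(a & b) / len(a | b)

# Illustrative (made-up) topic sets for two texts.
text_a = {"animals", "family relationships", "Earth", "covenant"}
text_b = {"animals", "family relationships", "Earth", "kingship", "covenant"}
print(round(topic_overlap(text_a, text_b)))  # 80
```

Other measures, such as weighting topics by term frequency, would change the scores but not the overall approach.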

Although this approach compares texts in only one way and doesn’t fully capture the complexity of religious texts, it still shows how similar they are. Interestingly, even texts not often associated with each other, like the New Testament and the Tao Te Ching, surpassed 50 percent similarity and thus have significant topic overlap. In the future, additional NLP techniques, such as sentiment analysis, could be applied to probe similarity further and help explain why these texts are perceived the way they are.

For further reading:
https://journals.uvu.edu/index.php/jbi/article/view/130