Notes from Limmud 2012
Computerised Bible Criticism
[Standard disclaimer: All views not in square brackets are those of the speaker, not myself. Accuracy of transcription is not guaranteed.]
After the research presented in his previous talk, people came up to the speaker and said: "Can you do that for the Bible?" For a long time, the speaker avoided doing so; but eventually he did it.
In the case of the Bible, what people really want to know is: can you isolate different aspects of it displaying consistent authorial features?
The speaker regards this work as having no theological implications. Firstly because he does not make any claim as to the number of authors of the document. He claims simply that if you want to split a document into a number of authorial components, this is the best way of doing it. But it's up to you to interpret the result. If you tell him to split Moby-Dick into three parts, he'll split it into three; if you tell him to split it into seventeen, he'll split it into seventeen.
Secondly, if you tell him you should not see multiple styles reflected [lacuna]. But he sees no reason why prophetic books should not display multiple styles.
The story has modest beginnings: He wrote a paper appearing in Proceedings of the Association of Computational Linguistics, "Unsupervised Decomposition of a Document into Authorial Components". In the last paragraph, in the conclusion, he said, "We find that our split corresponds to the expert consensus regarding P and non-P for over 90% of the verses in the Pentateuch." This sounds technical but actually was like throwing a grenade into the room.
The press picked up on this: Haaretz: "Israeli software supports theory that Bible was written by multiple authors". [other headlines, too fast for me to catch.] It then appeared in over a thousand papers worldwide. There was not a single person he might have wanted not to see this who did not see it.
Enter the realm of damage limitation. He got invited to the TV show "London and Persian [something]". He started by explaining his disclaimer, as above. They said, "Oh, we get it." He thought he had done a fantastic job of explaining everything, and went home feeling really good about the show... and then watched the result, and the caption underneath, as he's saying it, read "Orthodox scholar proves Bible written by multiple authors".
So, what did he actually do? Let's start with showing how the method works. Consider two books known to be written by different authors. See if, without knowing which chapter belongs to which book, they can be assigned correctly by the method.
Take every pair of chapters and ask how similar they are, and continue doing this on all pairs. This divides the chapters into two clusters.
To do this, we need a way to assess how similar two chapters are. Take one verse from each book (Jeremiah and Ezekiel in the case study), and count their use of words in the verse: על, ולא, כל, והיה. Some words appearing in one verse do not appear in the other; this can be generalised to chapters as well. The simplest way is to use all the words in Tenach, but there are some tricks for selecting useful words. Typically this would be done in a way that is generic and does not require domain knowledge. (It's all done mechanically.)
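[A minimal sketch of the word-counting idea, for illustration only. The speaker did not say which similarity measure he used; cosine similarity over raw word counts is my assumption, and the three toy "chapters" are made-up strings, not real verses.]

```python
from collections import Counter
from math import sqrt

def word_counts(text):
    """Count how often each whitespace-separated word occurs in a text unit."""
    return Counter(text.split())

def cosine_similarity(a, b):
    """Cosine between two word-count vectors (0.0 means no shared words)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy "chapters" built from the function words mentioned in the talk.
chapters = {
    "A1": "על כל הארץ והיה כל העם",
    "A2": "והיה על כל העם",
    "B1": "ולא שמעו ולא ראו",
}
counts = {name: word_counts(text) for name, text in chapters.items()}

# Similarity for every unordered pair of chapters.
similarity = {
    (x, y): cosine_similarity(counts[x], counts[y])
    for x in chapters for y in chapters if x < y
}
```

[The `similarity` table is what a clustering step would then consume; the talk did not specify which clustering algorithm was used.]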
What do we get? Answer: a clustering that is useless for separating the chapters.
The reason it didn't work is there are so many different ways you can split up 100 chapters. The algorithm doesn't know that what we want is to split it up according to which book the chapter belongs to. We need to pick a way of splitting it up which captures the difference between writing styles.
The way to do this is to use synonyms: One author might use, say, מַטֵה for staff or tribe, and the other שֶׁבֶט. And so forth for lots of other synonyms. Now, how to find synonyms? Well, the KJV always translates synonymous Hebrew words into the same English words, so we can use a concordance of the KJV. (Of course, we need to look for word roots rather than inflected forms.) Occasionally there will be glitches, where phrases have a word in common but which do not mean the same thing, but these can be identified and factored out.
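[A rough sketch of how synonym choice could be turned into a similarity score. The three synonym sets are the ones mentioned in these notes; the scoring rule (fraction of shared synonym sets on which two chunks make the same choice) is my own simplification, not necessarily the speaker's, and the input is assumed to be already reduced to word roots.]

```python
# Synonym sets taken from the examples in these notes (unpointed roots).
SYNSETS = {
    "staff/tribe": ("מטה", "שבט"),
    "midst": ("תוך", "קרב"),
    "garment": ("בגד", "שמלה"),
}

def synonym_profile(roots):
    """For each synonym set, record which of its members occur in the chunk."""
    present = set(roots)
    return {name: present & set(members) for name, members in SYNSETS.items()}

def synonym_similarity(roots_a, roots_b):
    """Fraction of synonym sets used by both chunks on which their choices match.

    Returns None when the chunks share no synonym set, so there is no evidence.
    """
    profile_a = synonym_profile(roots_a)
    profile_b = synonym_profile(roots_b)
    shared = [n for n in SYNSETS if profile_a[n] and profile_b[n]]
    if not shared:
        return None
    agree = sum(1 for n in shared if profile_a[n] == profile_b[n])
    return agree / len(shared)

# Hypothetical chunks, given as lists of roots.
chunk_x = ["מטה", "תוך", "בגד"]
chunk_y = ["מטה", "תוך", "שמלה"]
chunk_z = ["שבט", "קרב"]
```

[Here `chunk_x` and `chunk_y` agree on two of their three shared synonym sets, while `chunk_x` and `chunk_z` disagree on both of theirs.]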
When we do this, we end up with two clusters which almost but not quite separate the two works.
A few further tricks improve the results even more: These clusters are not uniform. Within them there are some chapters that are really similar, and some which are not so similar, but are more similar than to the other cluster. So, try considering only the cores of the clusters: the mistakes are all in the peripheral chapters. This makes the results just about perfect.
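[One way to picture "cores", sketched under my own assumptions: rank each chapter by how much more similar it is, on average, to its own cluster than to the other one, and keep only the top fraction. The `keep` fraction, the helper names and the toy similarity values are all hypothetical.]

```python
def pair(a, b):
    """Order-independent key for a pair of items."""
    return frozenset((a, b))

def mean_similarity(item, others, sim):
    """Average similarity of item to the given items, excluding itself."""
    others = [o for o in others if o != item]
    if not others:
        return 0.0
    return sum(sim[pair(item, o)] for o in others) / len(others)

def cluster_cores(clusters, sim, keep=0.7):
    """Keep the members that sit most clearly inside their own cluster."""
    cores = {}
    for label, members in clusters.items():
        outsiders = [m for other, ms in clusters.items() if other != label for m in ms]

        def margin(item):
            # Positive margin: closer to own cluster than to the rest.
            return mean_similarity(item, members, sim) - mean_similarity(item, outsiders, sim)

        ranked = sorted(members, key=margin, reverse=True)
        cores[label] = ranked[:max(1, int(len(members) * keep))]
    return cores

# Toy similarities: a3 is a peripheral member of cluster A.
sim = {
    pair("a1", "a2"): 0.9, pair("a1", "a3"): 0.5, pair("a2", "a3"): 0.5,
    pair("b1", "b2"): 0.9,
    pair("a1", "b1"): 0.1, pair("a1", "b2"): 0.1,
    pair("a2", "b1"): 0.1, pair("a2", "b2"): 0.1,
    pair("a3", "b1"): 0.45, pair("a3", "b2"): 0.45,
}
clusters = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"]}
cores = cluster_cores(clusters, sim)
```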
The problem with all of this is that it takes chapter divisions for granted. Now, if you want to divide the chumash [Pentateuch] in the best possible way, it's not going to be divided by chapters (which were introduced by the authors of the Vulgate, and have nothing to do with the Jewish tradition)! We have one long unsegmented text, which we would like to segment.
So, continuing with our case study, where we know what's the right answer, let's create a new book, Jer-iel. First generate a random number from 1 to 100, and take that number of verses from Jeremiah. Now generate a second random number and take that number of verses from Ezekiel, etc, until all the verses are used.
At the end we have one big munged book where the divisions can be anywhere. Can we now again divide this up into Jeremiah and Ezekiel? This is obviously a much harder problem.
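[The construction of the munged book is simple enough to sketch. This follows the recipe as I noted it down (alternating runs of 1 to 100 verses); the function name, the fixed seed and the placeholder verse labels are my own.]

```python
import random

def interleave_books(book_a, book_b, max_run=100, seed=0):
    """Merge two verse lists by alternating random-length runs from each.

    Returns a list of (source_label, verse) pairs; the labels are kept only
    so that a later split can be checked against the truth.
    """
    rng = random.Random(seed)
    books = [("A", list(book_a)), ("B", list(book_b))]
    position = [0, 0]
    merged = []
    turn = 0
    while position[0] < len(books[0][1]) or position[1] < len(books[1][1]):
        label, verses = books[turn]
        run = rng.randint(1, max_run)  # inclusive on both ends
        start = position[turn]
        merged.extend((label, verse) for verse in verses[start:start + run])
        position[turn] = min(len(verses), start + run)
        turn = 1 - turn  # switch to the other book
    return merged

# Placeholder verses standing in for Jeremiah and Ezekiel.
jeremiah = [f"J{i}" for i in range(250)]
ezekiel = [f"E{i}" for i in range(300)]
jeriel = interleave_books(jeremiah, ezekiel)
```

[Once one book runs out, its turns contribute nothing and the remainder of the other book is appended in further runs.]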
We want to reduce the problem into the solved one we've used already. So let's start by segmenting the text into chunks of equal length. Now ignore the fact that many of these will include text by more than one author, and use synonym-based similarity to obtain initial clustering.
Choose the most representative chapters in each cluster. Some of these will be purely taken from one book.
Now determine the most salient differences among the clusters. (We're looking at words, now, not synonyms.) We discover subtle differences such as that one author prefers הִנֵה and one וְהִנֵה.
Finally use these differences to assign each individual verse to one thread.
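[To make the last two steps concrete, here is a sketch that scores words by smoothed log-odds between the two cluster cores and then assigns a verse by summing the scores of its words. Log-odds scoring is my stand-in; the speaker did not say what classifier he used. The toy core verses are invented strings built from the telltale words he showed, and the `threshold` parameter is hypothetical.]

```python
from collections import Counter
from math import log

def word_log_odds(core_a, core_b, smoothing=1.0):
    """Score each word: positive favours cluster A, negative favours cluster B."""
    counts_a = Counter(w for verse in core_a for w in verse.split())
    counts_b = Counter(w for verse in core_b for w in verse.split())
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)
    return {
        w: log((counts_a[w] + smoothing) / total_a)
           - log((counts_b[w] + smoothing) / total_b)
        for w in vocab
    }

def assign_verse(verse, log_odds, threshold=0.5):
    """Assign a verse to "A" or "B", or None if the evidence is too weak."""
    score = sum(log_odds.get(w, 0.0) for w in verse.split())
    if score > threshold:
        return "A"
    if score < -threshold:
        return "B"
    return None

# Invented core "verses" using the telltale words from the talk.
core_a = ["הנה הדבר הזה", "הנה כן הדבר"]
core_b = ["והנה אדם אחד", "והנה אני"]
log_odds = word_log_odds(core_a, core_b)
```

[Verses whose score stays within the threshold are left unassigned, which matches the observation that not every verse ends up clearly in one thread.]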
How do we know this works? When we apply it to Jeremiah and Ezekiel we end up with two sets, 97% of one belonging to Jeremiah and 97% of the other to Ezekiel. (Though only eighty-something percent of the verses belong to either of the two clusters.)
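[The two kinds of figure quoted here (how pure each cluster is, and what share of verses got assigned at all) can be computed as follows. This is my own sketch with made-up counts, not the speaker's evaluation code.]

```python
from collections import Counter

def coverage_and_purity(results):
    """results: list of (true_book, assigned_cluster or None) per verse.

    Returns the share of verses that were assigned, and for each cluster
    its dominant book together with that book's share of the cluster.
    """
    assigned = [(book, cluster) for book, cluster in results if cluster is not None]
    coverage = len(assigned) / len(results)
    purity = {}
    for cluster in {c for _, c in assigned}:
        books = Counter(book for book, c in assigned if c == cluster)
        top_book, top_count = books.most_common(1)[0]
        purity[cluster] = (top_book, top_count / sum(books.values()))
    return coverage, purity

# Made-up toy outcome: 25 verses, 5 of them left unassigned.
results = (
    [("Jer", 0)] * 9 + [("Ezek", 0)] * 1
    + [("Ezek", 1)] * 8 + [("Jer", 1)] * 2
    + [("Jer", None)] * 5
)
coverage, purity = coverage_and_purity(results)
```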
[Audience question: What happens if you try to divide the text into three clusters? Answer: nothing amazing.]
Some telltale words: The first cluster has words including הנה, הזה, הזאת, כן and הדבר; the second והנה, אחד, אדם, אני and אלי. This works well for differentiating some books, less well for others:
|Jeremiah/Ezekiel||82%||? [too fast for me]|
These results use only the first 39 chapters of Isaiah, to avoid problems resulting from the fact the later chapters are thought to come from different authors.
FWIW, trying it on Isaiah puts the first 32 chapters into one cluster, and everything from Chapter 40 onwards into the other; the chapters in between are not clearly categorised.
[I asked: Does this method prove whether Trito-Isaiah is actually Deutero-Isaiah writing twenty years later? The speaker said: I don't know how to formulate that with my method; I'll have to think about it. Me: Some homework for you!]
The same method works if we create a munged book out of k > 2 constituent books. But, we need to be given the k: We don't yet know how many components there are, only how to best split into a given number of components.
So, what happens if we apply this method to the Chumash? Let's keep it simple and split it into two parts.
There are many synonyms in the Chumash: תוך/קרב, e.g. ויאבדו מתוך הקהל (Num 16) versus עד תום כל תדור... מקרב המחנה (Deut. 2); בגד/שמלה etc.
In the first phase, we get a rough split of chapters into two clusters based on synonym usage (treating the chapters as if they were homogeneous chunks even though he knows they are not). Then check which words characterise the vocabulary in each cluster.
We end up with one cluster using words like ישראל, וכל, הכהן, על, עליו, ועל, לפני, etc, and one using כי, ויאמר, לא, אנכי, etc [I ran out of time to take down more before he moved the slide on].
Now we use differences in word usage to obtain a split of all the verses. To a large extent, the split we obtain is between narrative portions and legal portions.
But we obtain something more remarkable. Apply this to the Korach narrative [Numbers 16]. The splits correspond to the two stories going on in the narrative. One involves Korach and takes place mostly at the [lacuna]. The other involves rabble-rousers Dathan and Abiram. Just looking at the linguistic elements exactly divides these stories. Biblical scholars have noticed this as well, and claim there are two different stories of different authorship that have been edited together here.
Classification of every verse in the first four books of the Torah (forcing classification of the non-clear verses) corresponds very well to the critics' division into P and non-P verses (ignoring places where the assignations of verses by the Bible critics Driver and Freedman do not agree with each other).
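[The comparison described here amounts to measuring agreement only on verses where the two critics agree with each other. A sketch, with invented labels, and assuming the two clusters have already been mapped onto "P" and "nonP":]

```python
def agreement_with_consensus(predicted, expert_1, expert_2):
    """Share of verses matching the experts, counting only verses on which
    the two experts agree with each other. Returns None if they never agree."""
    consensus = [
        (p, e1)
        for p, e1, e2 in zip(predicted, expert_1, expert_2)
        if e1 == e2
    ]
    if not consensus:
        return None
    matches = sum(1 for p, e in consensus if p == e)
    return matches / len(consensus)

# Invented per-verse labels, purely to exercise the function.
predicted = ["P", "P", "nonP", "nonP", "P"]
expert_1 = ["P", "P", "nonP", "P", "nonP"]
expert_2 = ["P", "nonP", "nonP", "P", "nonP"]
score = agreement_with_consensus(predicted, expert_1, expert_2)
```

[In this toy case the experts disagree on the second verse, so it is dropped, and the method matches them on two of the remaining four.]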
There is one major exception, which is Genesis chapter 1, which the consensus among all Bible scholars is P. However our method, no matter how you twist it, is convinced it is non-P. The reason it is convinced is that [lacuna], which occurs many times in Gen 1, is non-P. Also because of the use of אֱלֹהִים for "God". [Huh? I thought אֱלֹהִים was supposed to be indicative of P!] Bible critics claim that until Moses is at the Burning Bush and God tells Moses his name, he uses אֱלֹהִים, and then stops using that and starts using God's name.
The number of Bible scholars he will convince about Genesis 1 is exactly zero, but he is sticking to his guns.
[Looking at the graphs, I can see a few other smaller exceptions.]
What about trying to subdivide the text further? Answer: If you split it into three, or split non-P into two, in neither case do you get anything like the split into J and E; and indeed amongst Bible scholars a growing group of people think that the J/E split does not give you anything of theological value. Though if you include Deuteronomy, a three-way classification does split neatly into P, D and neither.
So, in conclusion, what should we make of this?
Answer: Not too much.
Firstly, we do not claim that we know the right number of clusters, only that once you choose the number you want, we'll split the text in the best possible way.
Secondly, this sort of analysis is irrelevant to theological considerations. If you are committed to Divine authorship of the Chumash, you can just use these stylistic differences as the basis for exegesis. Indeed, Mordechai Breuer said as much.
[Hmm, this talk has given me pause for thought, coming as it did a matter of hours after I stood in front of forty people and talked about why I don't believe in the Documentary Hypothesis (using the Samaritan text of the Torah as evidence). I don't think the speaker's work actually adds to the evidence in favour of the Documentary Hypothesis—as I see it, his analytic method was simply latching onto the same cues as the critics behind the Documentary Hypothesis did manually from the nineteenth century—but possibly I need to refine my thoughts and find a happy medium. Perhaps I should read Breuer too, though I imagine I'd find Breuer's presumed Sinaitic-origin dogmatism as annoying as the intransigence of pro-Documentary Hypothesis apologists...]