Notes from Limmud 2012

Uncovering forgeries in Jewish literature

Professor Moshe Koppel

[Standard disclaimer: All views not in square brackets are those of the speaker, not myself. Accuracy of transcription is not guaranteed.]

Case study 1: Documents by Rashba or the Ritva

As a simple example of authorship attribution, to demonstrate the methodology, the speaker showed a responsum from a book of responsa by the Ritva, a Spanish rabbi of the thirteenth century. It is generally regarded as his work. However, the Beit Yosef cites this responsum as written by the Rashba, another thirteenth-century Spanish rabbi and the Ritva's teacher.

What features do we want, to be able to identify the author? Answer: Words that are used consistently by a given author regardless of document type. This eliminates topic-dependent words. (For example, you could not use "football" to distinguish between two different English authors: they would only use the word if they were writing about football, so their level of use of it would not tell you which had written a document.) So we want to find words which cross topics. Such words might be used differently by different authors. Our challenge is to identify these.

There's a class of words in every language called function words: connector words, like "and", "the", "if", "of". If you look at an author's use of these, you learn something about their writing style.
A famous example is the series of newspaper articles in favour of the American constitution known as the Federalist Papers. These were cowritten by James Madison, Alexander Hamilton and John Jay, but signed as "Publius", the pen name they all used. Each paper was written by only one of them, and eventually their authorship came out, but there were twelve papers that both Madison and Hamilton claimed to have written.

In the 1960s, statisticians decided all twelve documents were written by Madison, based on a list of seventy function words. The most telling was that one used "while", the other "whilst".
Syntax: Different authors structure their sentences differently. How does one quantify syntactical features in a text? Use automated methods to assign a part of speech to every word, and then look at sequences of parts of speech.
Morphology. Especially in Semitic languages, lots of information is rolled into the morphology (e.g. "and to him" is all a combination of prefix and suffix in Semitic languages).
Complexity of sentences and words
Idiosyncrasies. A famous example: the Unabomber published tracts elucidating his philosophy. Through careful textual analysis of idiosyncratic usage, a man named Donald Foster identified him. (His brother also turned him in.)

In most cases, it's enough to use common words that aren't tied to a topic. In the text in hand:

Word:	אֶת	עַל	כַּאֲשֶׁר	שֶׁלוֹ	וְכֵן	אֶלָא
Occurrences:	0	2	1	1	0	1

[There were more words, but I didn't have time to get them down.] This sequence of numbers can then be represented graphically.
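[Not from the talk: a minimal sketch of how such a row of counts might be produced. The function name, the English word list and the sample sentence are all made up for illustration; the same counting would apply to a list of unpointed Hebrew words.]

```python
import re
from collections import Counter

def function_word_counts(text, words):
    """Count how often each word in `words` occurs in `text`.

    Tokenisation here is deliberately crude (runs of letters, lower-cased);
    real Hebrew text would also need prefixes and vowel points handled.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    return [counts[w] for w in words]

# Hypothetical word list and sample text, for illustration only.
WORDS = ["the", "and", "of", "while", "whilst"]
SAMPLE = "While the cat sat on the mat, the dog and the bird slept."

print(function_word_counts(SAMPLE, WORDS))  # [4, 1, 0, 1, 0]
```

[Each document then becomes one such list of numbers, i.e. one point on the graph.]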

[Graph with two clusters of points: dark ones showing documents by Rashba,
and light ones showing documents by Ritva, for two words.]

A real example would involve more words, so more dimensions in the graph. We want to find a line that separates these two authors. <speaker adds line to graph> Once you have that, you add a new, anonymous document to your graph, and the dividing line allows you to assign it to an author.
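[Not from the talk: a small sketch of finding such a dividing line for two words. I have used a perceptron because it fits in a few lines; the speaker later mentions an SVM, which is a different way of choosing the line. The counts below are invented toy data, not real Rashba or Ritva figures.]

```python
def train_perceptron(points, labels, epochs=100):
    """Find w, b such that the sign of w[0]*x + w[1]*y + b separates the two labels.

    `labels` are -1 or +1. Only guaranteed to settle if a separating line exists.
    """
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x, y), label in zip(points, labels):
            predicted = 1 if w[0] * x + w[1] * y + b > 0 else -1
            if predicted != label:
                # Nudge the line towards classifying this point correctly.
                w[0] += label * x
                w[1] += label * y
                b += label
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def classify(w, b, point):
    return 1 if w[0] * point[0] + w[1] * point[1] + b > 0 else -1

# Invented (count of word 1, count of word 2) pairs: -1 = author A, +1 = author B.
points = [(1, 5), (2, 6), (1, 7), (2, 4), (6, 1), (7, 2), (5, 1), (6, 3)]
labels = [-1, -1, -1, -1, 1, 1, 1, 1]

w, b = train_perceptron(points, labels)
print(classify(w, b, (6, 2)))  # 1: the anonymous document falls on author B's side
```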

How to know this works? Use cross-validation: Keep some known documents in our pockets, and check if the classifier assigns them correctly. In this case, nearly 100% of documents not seen at the learning stage are then correctly classified.
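[Not from the talk: a sketch of the "keep some documents in our pockets" check, written as k-fold cross-validation. The nearest-centroid classifier and the toy counts are my own stand-ins; any classifier with a fit step and a predict step could be plugged in.]

```python
def fit_centroids(samples, labels):
    """Average the training points of each label into one centroid per label."""
    centroids = {}
    for label in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, sample):
    """Return the label whose centroid is closest to `sample`."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(sample, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def k_fold_accuracy(samples, labels, k=4):
    """Hold out every k-th document in turn, train on the rest, score the held-out ones."""
    n = len(samples)
    correct = 0
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train_x = [samples[i] for i in range(n) if i not in held_out]
        train_y = [labels[i] for i in range(n) if i not in held_out]
        model = fit_centroids(train_x, train_y)
        correct += sum(predict(model, samples[i]) == labels[i] for i in held_out)
    return correct / n

# Invented word counts for documents of known authorship.
samples = [(1, 5), (2, 6), (1, 7), (2, 4), (6, 1), (7, 2), (5, 1), (6, 3)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

print(k_fold_accuracy(samples, labels))  # 1.0 on this cleanly separated toy data
```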

Of the five disputed documents attributed to Ritva, two turn out to be by Rashba.

The Ritva has an aggressive writing style: his markers include פשוט, ברור, מפורש, דעתי, כראוי, כדאיתא. The Rashba's markers include: שאלתי, אמרת, נמי, בריש, כלומר.

Further applications

[We] have used similar methods to pick out the true author of a text from among 10,000 candidates.

This has many interesting security applications (in terms of identifying anonymous texts).

One can also use these methods to determine:

Gender (with about 80% accuracy—women use many more pronouns)
Age (accuracy of 77%, with reasonable breakdown into teenagers, twenty-somethings, thirty-pluses)
Native language
Personality type (neurotic types use first person more)

...though some of these may be culturally dependent.

The same features turn up in fiction. (The baselines are different, but the biases remain.) Some authors are better than others at getting their characters to speak in character in this respect.

The speaker was put to a live test by the BBC: he was given three documents and had to identify their authors' gender on air an hour later. One document was obviously male, one obviously female. The third document was female at a surface level, but male at a more subtle level, and the speaker tentatively, but correctly, identified it as being by a transsexual.

Case study 2: Genizat Harson

In 1916 a trove of letters by early Chassidic leaders began circulating in the antiquities market. It included lots by the founder of Chassidism, the Baal Shem Tov, by whom we only have one authenticated letter. It was claimed they had recently been released from an archive in Leningrad.

Most scholars and some Rabbis declared them forgeries, but the then Lubavitcher Rebbe insisted they were authentic, קודש קדשים (Holy of Holies) he said. (About half the letters were by the first Lubavitcher Rebbe, R. Shneur Zalman of Liadi.) However, the Gerrer Rebbe said ‫"זִיוּף!"‬—forged. The paper and ink had previously been shown to date from the early twentieth century; the Lubavitcher Rebbe said irrelevant; they'd been copied over. So, are they authentic?

Two grandsons of prominent Lubavitchers, who were convinced these letters were forgeries, brought the speaker these letters. (The speaker is a grandson of Gerrer chassidim, so solving the problem involves a revival of the inter-sect dispute from a century ago!)

We have many authentic letters by R. Shneur Zalman of Liadi, the Baal haTanya.

There are two possibilities: that all the letters [in the trove] are forged, or that all the letters are written by the person who signed them. So, we should use the letters attributed to the Baal haTanya: If they look like his authentic letters they shouldn't look like the letters of the Baal Shem Tov.

The attributed letters say וכו׳ for "etc", and י״ח (short for יחיה נצח "he should live forever"). Genuine letters of the Baal haTanya say כו׳ and ש״י (short for שיחיה). The non-Baal haTanya letters also say וכו׳ and י״ח.

In other words, there is absolutely no doubt that all of these letters are forgeries.

Gur wins out.

Case study 3: תּוֹרָה לִשְׁמָהּ: By the Ben Ish Ḥai or Yeḥezqel Kaḥli

Yoseph Ḥayyim of Baghdad, the Ben Ish Ḥai, was the Chief Rabbi of Baghdad, dying a little over a hundred years ago. He is heavily influential amongst the Sephardim (and is the bad guy for R. Ovadya Yosef, who has taken it upon himself to cleanse halacha of the kabbalistic influences largely introduced by the Ben Ish Ḥai).

One of his books is a book of responsa, רַב פְּעָלִים. He also published a second book of responsa, תּוֹרָה לִשְׁמָהּ, which he claimed he found in the archive of the main shul in Baghdad.

The first responsum starts אמר הקטן יחזקאל כחלי "Yeḥezqel Kaḥli the Little said", continuing "I began writing this book in 1682 ‫(התמ״ב)‬." (The Ben Ish Ḥai was born in the middle of the nineteenth century.) Who was Yeḥezqel Kaḥli? Since he wrote so many responsa, he must have been important. But no one has ever heard of him elsewhere.

He signs at the bottom והיה זה שלום; this is not the Ben Ish Ḥai's customary sign-off יאיר עינינו באור תורתו. In another book, the Ben Ish Ḥai says "This is my opinion and if you don't believe me, even Yeḥezqel Kaḥli says so", and signs off, as he always does, יאיר עינינו באור תורתו.

But since the Ben Ish Ḥai is the only person to have seen this manuscript prior to his publishing it, maybe the Ben Ish Ḥai wrote it himself? In which case it's problematic that he cites Yeḥezqel Kaḥli as a supporting authority.

A responsum by a contemporary [of us], Pinḥas Zviri, a student of R. Ovadya Yosef, cites a letter in תּוֹרָה לִשְׁמָהּ. He says he spoke to the great rabbis of the generation to get their opinion as to whether it was authentic or not. Their opinion was divided.

This is a harder issue than the one we were dealing with before. It's not here a case of whether this was written by known person A or B, but whether it was written by A or an unknown. This is a much harder problem to solve. So, can we distinguish רַב פְּעָלִים from תּוֹרָה לִשְׁמָהּ?

The answer involves learning a model for רַב פְּעָלִים versus תּוֹרָה לִשְׁמָהּ. If, after providing a subset, cross-validation "fails" (so that we can't distinguish רַב פְּעָלִים from תּוֹרָה לִשְׁמָהּ), they are probably by the same author.

When the speaker tried that, he obtained 98% cross-validation accuracy. In other words, תּוֹרָה לִשְׁמָהּ is easily distinguished from רַב פְּעָלִים.

So, does that mean תּוֹרָה לִשְׁמָהּ and רַב פְּעָלִים are by two different authors? Answer: No! How, then, to solve the problem?

What about chronology? Does a person's style change over time?
Superficial differences, due to [lacuna]
Thematic differences
Chronological drift
Different purposes or contexts
Deliberate ruses—the author went out of his way to change his style.

These would be enough to allow differentiation between two books even if they were by the same author. We call these differences "masks".

You don't even have to work too hard to fool the prior method. All you need to do is be consistent with the sign-offs, and the two works are easily distinguished. How, then, to eliminate such shallow distinguishers?

For example, the American author Nathaniel Hawthorne wrote two novels, The Scarlet Letter and The House of the Seven Gables. The two books are distinguishable because the words "he" and "she" appear in different proportions in the two books: the gender of the characters in the two books is different.

The solution may be described mathematically as:

Learn models for X vs. S (and for X vs. each impostor).
For each of these, drop the k best features (k = 5, 10, 15, ...), where "best" means highest weight in the SVM, and learn again.

I.e. drop the most distinguishing features and run the comparison again, and again and again iteratively. When you do this, you get a gradual degradation in your ability to distinguish the authors. Once you have done this, you can compare curves: how fast your ability to distinguish degrades.

When you have two books written by the same author, the degradation is extremely fast; when they are written by different authors, you can still tell them apart because they do everything differently, not just a few shallow things.
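[Not from the talk: a rough sketch of the drop-and-relearn loop, to show the shape of the two kinds of curve. Everything here is a stand-in of mine: the data is synthetic, a nearest-centroid classifier replaces the SVM, and "best" features are ranked by the gap between the two class averages rather than by SVM weight.]

```python
import random

def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def loo_accuracy(docs_a, docs_b, keep):
    """Leave-one-out accuracy of a nearest-centroid classifier on the features in `keep`."""
    labelled = [(d, 0) for d in docs_a] + [(d, 1) for d in docs_b]
    correct = 0
    for i, (doc, label) in enumerate(labelled):
        train = [x for j, x in enumerate(labelled) if j != i]
        cents = [centroid([[d[f] for f in keep] for d, l in train if l == cls])
                 for cls in (0, 1)]
        point = [doc[f] for f in keep]
        dists = [sum((p - c) ** 2 for p, c in zip(point, cent)) for cent in cents]
        correct += (0 if dists[0] <= dists[1] else 1) == label
    return correct / len(labelled)

def unmasking_curve(docs_a, docs_b, step=5, rounds=4):
    """Measure accuracy, drop the `step` most distinguishing features, and repeat."""
    keep = list(range(len(docs_a[0])))
    ca, cb = centroid(docs_a), centroid(docs_b)
    curve = []
    for _ in range(rounds):
        curve.append(loo_accuracy(docs_a, docs_b, keep))
        keep.sort(key=lambda f: abs(ca[f] - cb[f]), reverse=True)
        keep = keep[step:]  # discard the current best distinguishers
    return curve

def make_docs(rng, means, n=40):
    """Synthetic documents: one noisy value per feature around the given means."""
    return [[rng.gauss(m, 1.0) for m in means] for _ in range(n)]

rng = random.Random(0)
n_features = 20
base = [0.0] * n_features
shallow = [6.0 if f < 3 else 0.0 for f in range(n_features)]  # differs on 3 features only
deep = [3.0] * n_features                                     # differs on every feature

same_curve = unmasking_curve(make_docs(rng, base), make_docs(rng, shallow))
diff_curve = unmasking_curve(make_docs(rng, base), make_docs(rng, deep))
print("shallow differences only:", same_curve)
print("differences throughout:  ", diff_curve)
```

[In the first pair the accuracy should collapse towards chance as soon as the three "mask" features are dropped, whereas in the second it should stay high, since every remaining feature still carries a difference.]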

[Audience member: Is this your methodology, or is it well known? Speaker's answer: Both: It is my methodology, and it is now well known.]

[Graph showing curves of תּוֹרָה לִשְׁמָהּ compared against various other books from the same period.
The curve of תּוֹרָה לִשְׁמָהּ against רַב פְּעָלִים drops way faster than the others.]

Conclusion: There is no doubt that תּוֹרָה לִשְׁמָהּ was written by the Ben Ish Ḥai himself.

One other proof: The author is called יחזקאל כחלי, the Ben Ish Ḥai's name is יוסף חיים. Now, יחזקאל has the same gematria as יוסף and כחלי as חיים. The Ben Ish Ḥai was into this kind of thing.
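[Not from the talk: a quick check of the arithmetic, using the standard letter values, with final letters counted the same as their ordinary forms. The function name is my own.]

```python
# Standard gematria values; final forms take the value of the ordinary letter.
GEMATRIA = {
    "א": 1, "ב": 2, "ג": 3, "ד": 4, "ה": 5, "ו": 6, "ז": 7, "ח": 8, "ט": 9,
    "י": 10, "כ": 20, "ל": 30, "מ": 40, "נ": 50, "ס": 60, "ע": 70, "פ": 80, "צ": 90,
    "ק": 100, "ר": 200, "ש": 300, "ת": 400,
    "ך": 20, "ם": 40, "ן": 50, "ף": 80, "ץ": 90,
}

def gematria(word):
    """Sum the letter values of a Hebrew word, ignoring any other characters."""
    return sum(GEMATRIA.get(ch, 0) for ch in word)

for a, b in [("יחזקאל", "יוסף"), ("כחלי", "חיים")]:
    print(gematria(a), gematria(b))
# 156 156
# 68 68
```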

iddewes

It does sound like it was an interesting talk.

green_knight

Utterly fascinating, thanks for sharing this.

Has anyone tried to apply this to Shakespeare yet?

<strokes can of worms>

lethargic_man

Heh. <googles> Possibly not... though I wouldn't be surprised if it's on their to-do list.

I'll be opening an even wormier can of worms in the next set of notes I post.

Lethargic Man (anag.)