Methodology and editorial approach

How the text is preserved and reviewed

This page explains the principles behind The Olam’s digital restoration work, beginning with the 1886 edition of the Amharic Bible (Abu Rumi). It is intended to make the project’s editorial decisions clear without turning the reader itself into a technical report.

Purpose of the project

The purpose of this work is to preserve historic Amharic Bible texts in a reliable digital form. The current focus is the 1886 Amharic Bible; the reader also supports parallel comparison with the 1962 Amharic Bible.

The project is a preservation and proofreading effort. It is not intended as a doctrinal revision, modernization, denominational reinterpretation, or rewriting of the historic text.

The guiding principle is simple: correct clear digitization and typographical problems where the evidence is strong; preserve historic spelling and orthography where they belong to the edition; and preserve the historic reading when the evidence is uncertain.

Source text and edition identity

The 1886 Amharic Bible is treated as a historic edition. Its wording, spelling, orthography, and textual character are the object of preservation, not obstacles to be modernized away. The project does not normalize historic spellings simply to match later usage.

The reader may provide tools such as search, parallel viewing, and verse images, but those tools are meant to support the text rather than alter its identity.

Role of the 1962 Amharic Bible

The 1962 Amharic Bible is available for parallel comparison in the reader. It is used as a comparison edition, not as a replacement for the 1886 text. Each edition should remain distinct.

OCR and digital restoration

OCR, or optical character recognition, is the process of converting page images into machine-readable text. Because historic printed Amharic text can be difficult for OCR systems, the raw output must be reviewed and corrected carefully.

The project’s digital text is therefore not presented as raw OCR output. It is a reviewed digital text produced through OCR, correction, and proofreading.

Source image

The image preserves the appearance of the historic printed source and helps reviewers verify difficult readings.

Reviewed digital text

The digital text reflects the reviewed reading after OCR correction and proofreading decisions have been applied.

Context-based correction

Some OCR errors are obvious from the immediate word, verse, and surrounding context. In such cases, missing or misread characters may be corrected when the intended reading is clear.

Proofreading and review

This has not been solely an automated OCR project. A team of volunteer reviewers has carried out multiple rounds of proofreading and review to identify OCR errors, typographical problems, suspicious word forms, red-letter text issues, and places where the digital text needed further checking.

Reviewers compared the digital text against the source images throughout the edition, checked unusual forms, reported possible errors, and gave special attention to passages where the red-letter text needed to be verified. They helped distinguish clear corrections from readings that should be preserved. Their work has been essential to moving the text from machine-generated output toward a carefully reviewed preservation edition.

The review process has also helped shape the project’s editorial discipline: correct what is clear, investigate what is suspicious, and preserve uncertain readings rather than forcing unnecessary changes.

During active review stages, corrections may continue to be made. For that reason, the official website and reader are intended to remain the authoritative source for the current state of the edition.

Editorial decisions

The project makes limited editorial decisions so the historic text can be read and searched reliably in digital form. These decisions are intended to support preservation, not replace it.

Arabic numerals for navigation

Chapter and verse numbers are shown using Arabic numerals such as 1, 2, and 3 instead of Geʽez numerals such as ፩፣ ፪፣ and ፫. This supports modern navigation, search, and reference handling. It does not imply modernization of the biblical text itself.
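As an illustration of the numeral mapping described above, the following is a minimal Python sketch that converts a Geʽez numeral string to an integer. It is not the project’s own code: the function name `geez_to_int` is hypothetical, and the sketch assumes values below 10,000, which comfortably covers chapter and verse numbers.

```python
# Hypothetical helper, not taken from the project's codebase.
# Ethiopic digits and numbers occupy U+1369..U+137C in Unicode.
ONES = {chr(0x1369 + i): i + 1 for i in range(9)}          # ፩..፱ -> 1..9
TENS = {chr(0x1372 + i): (i + 1) * 10 for i in range(9)}   # ፲..፺ -> 10..90
HUNDRED = "\u137b"                                         # ፻ -> 100


def geez_to_int(numeral: str) -> int:
    """Convert a Geʽez numeral below 10,000 to an integer."""
    if not numeral:
        raise ValueError("empty numeral")
    total = 0   # value accumulated from completed hundreds
    group = 0   # tens and ones seen since the last hundred sign
    for ch in numeral:
        if ch in ONES:
            group += ONES[ch]
        elif ch in TENS:
            group += TENS[ch]
        elif ch == HUNDRED:
            # A bare hundred sign means 100; otherwise it multiplies the group.
            total += (group or 1) * 100
            group = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + group


print(geez_to_int("\u1369"))              # ፩
print(geez_to_int("\u1372\u136a"))        # ፲፪
print(geez_to_int("\u137b\u1376"))        # ፻፶
```

The sketch only changes how a reference number is displayed or indexed. It does not touch the verse text, which is consistent with the point that Arabic numerals are a navigation aid rather than a modernization of the text.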

Correction of clear OCR errors

Clear OCR errors are corrected when the intended character or word can be determined with confidence from the source image, the word form, or the immediate verse context. These corrections are not intended as orthographical modernization or spelling normalization of the historic text.

Completion of missing characters

When OCR has dropped characters and the missing reading is clear from the word and surrounding context, the missing characters may be restored in the digital text.

No orthographical modernization

Historic spelling and orthographic forms are not changed simply because they differ from later usage or modern expectations. The goal is to correct clear digitization, OCR, printing, and typographical problems while preserving the spelling character of the 1886 edition.

Printing and typographical errors

Obvious printing mistakes and typographical errors may be corrected when the intended reading can be determined with confidence. These are handled cautiously because the project’s first duty is preservation.

Preservation when uncertain

When a word or form is uncertain, especially if it is not found in available dictionaries or references, the historic form is preserved rather than silently changed. This includes unusual spellings or orthographic forms that may reflect the edition itself. Uncertainty is not treated as permission to rewrite the text.

In short: corrections are made only where the evidence is strong; historic orthography is not modernized; uncertain readings are preserved.

Character Error Rate (CER) and accuracy

Character Error Rate, usually abbreviated as CER, is one way the project measures OCR accuracy. It compares a machine-generated text against a verified reference text and measures how many character-level edits are needed to make the OCR output match the reference.

CER = (substitutions + insertions + deletions) / total characters in the reference text
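The formula above can be sketched in a few lines of Python. This is an illustrative implementation, not the project’s measurement tooling: the function names are hypothetical, and it assumes that a "character" is one Unicode code point and that no normalization is applied before comparison. The edit count is the standard Levenshtein distance, which is the minimum total number of substitutions, insertions, and deletions.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Minimum substitutions + insertions + deletions turning hypothesis into reference."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            cost = 0 if ref_ch == hyp_ch else 1
            curr.append(min(
                prev[j] + 1,         # character missing from the OCR output
                curr[j - 1] + 1,     # extra character in the OCR output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate of an OCR hypothesis against a verified reference."""
    if not reference:
        # A zero-character reference has no meaningful CER.
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One misread character in a four-character reference.
print(cer("abcd", "abxd"))
```

Because the denominator is the length of the reference text, the result depends on which pages and characters are in scope, so a CER figure is only meaningful alongside its measurement scope.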

The project has used CER progression to track improvement across OCR and automated cleanup stages. A falling CER is useful evidence that the generated text is moving closer to the verified reading, especially when the measurement scope is clearly stated. CER is not shown here as a measure of later proofreading, because at that point the work becomes editorial correction rather than an OCR-performance measurement.

Stage	Date	Pages	Reference characters	CER	Notes
Initial OCR baseline	2025-11-29	128	135,015	0.035907	Early measured OCR output.
Early automated cleanup	2025-11-30	128	135,028	0.011183	Substantial improvement after initial cleanup passes.
Run 11 improvement	2025-12-03	185	183,307	0.007272	Best measured Run 11 result before later tuning.
Run 12 final cleanup	2025-12-09	416	293,584	0.005804	Improved result after additional automated cleanup and tuning.
Run 13 baseline refinement	2025-12-13	417	293,826	0.004966	Stable improvement after RunPod baseline correction.
Run 14 baseline refinement	2025-12-14	417	293,821	0.004241	Best measured Run 14 RunPod baseline variant.
Latest automated OCR measurement	2025-12-18	436	294,038	0.004156	Latest listed automated OCR/CER measurement.

This table summarizes representative milestones rather than every experimental run. Test runs with tiny samples, zero-character scope, or known broken settings are intentionally omitted. The values should be read as automated OCR/CER measurements, not as final post-proofreading accuracy.
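To give a sense of scale, a CER value can be turned back into an approximate number of character edits by multiplying it by the reference character count. The short Python sketch below does this for the rows of the table above. The edit counts are derived by rounding and are not figures reported by the project; the names `MILESTONES` and `implied_edits` are illustrative.

```python
# (stage, reference characters, CER) copied from the table above.
MILESTONES = [
    ("Initial OCR baseline", 135_015, 0.035907),
    ("Early automated cleanup", 135_028, 0.011183),
    ("Run 11 improvement", 183_307, 0.007272),
    ("Run 12 final cleanup", 293_584, 0.005804),
    ("Run 13 baseline refinement", 293_826, 0.004966),
    ("Run 14 baseline refinement", 293_821, 0.004241),
    ("Latest automated OCR measurement", 294_038, 0.004156),
]


def implied_edits(reference_chars: int, cer: float) -> int:
    """Approximate character edits implied by a CER value (rounded)."""
    return round(reference_chars * cer)


for stage, chars, rate in MILESTONES:
    edits = implied_edits(chars, rate)
    print(f"{stage}: about {edits} edits, roughly 1 per {chars // edits} characters")
```

For example, the initial baseline works out to roughly 4,848 character edits over 135,015 reference characters, and the latest listed measurement to roughly 1,222 over 294,038. Because the rows cover different page counts, each figure describes its own measurement scope.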

Limits of the work

No digital restoration can remove every uncertainty from a historic text. Some readings may remain difficult because of image quality, print condition, unusual word forms, historic orthography, or limits in available reference materials.

The project’s approach is to be transparent about those limits and to prefer preservation over overconfident correction.