Split long introductory sentences #951
Conversation
Pull request overview
This PR addresses issue #626 by implementing sentence-level splitting for long non-Scripture USFM segments (>200 characters) to improve translation quality. The solution splits long introductory paragraphs and other non-verse content into individual sentences before translation, then recombines them afterwards while preserving structure.
Changes:
- Refactored translation data structures (`SentenceTranslation`, `SentenceTranslationGroup`, `TranslatedDraft`, `DraftGroup`) into a new module `translation_data_structures.py` to support new combining operations
- Introduced `UsfmTextRowCollection` and `TranslatedTextRowCollection` classes to handle splitting and recombining of USFM text rows
- Created `NLTKSentenceTokenizer` wrapper with caching to standardize sentence tokenization across the codebase
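The split-then-recombine flow described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the 200-character threshold comes from the PR description, while the function names and the regex splitter (standing in for the NLTK punkt tokenizer) are assumptions.

```python
import re

MAX_SEGMENT_LEN = 200  # threshold from the PR description; the constant name is an assumption


def split_long_segment(text: str, max_len: int = MAX_SEGMENT_LEN) -> list[str]:
    """Split a long non-Scripture segment into sentences; leave short ones intact.

    A naive regex splitter stands in here for the NLTK punkt tokenizer the PR uses.
    """
    if len(text) <= max_len:
        return [text]
    # Split after sentence-final punctuation followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())


def recombine(sentences: list[str]) -> str:
    """Rejoin per-sentence translations into a single segment after translation."""
    return " ".join(s.strip() for s in sentences)
```

Each sentence would be translated independently between the two calls, then `recombine` restores a single row so the surrounding USFM structure is preserved.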
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| `silnlp/common/translation_data_structures.py` | New file containing refactored translation data structures with `combine()` methods to merge split sentence translations |
| `silnlp/common/usfm_utils.py` | Added `UsfmTextRowCollection` to split long non-verse rows and `TranslatedTextRowCollection` to recombine translations |
| `silnlp/common/utils.py` | Added `NLTKSentenceTokenizer` class with language-aware sentence splitting and instance caching |
| `silnlp/common/translator.py` | Refactored to use the new `UsfmTextRowCollection`; removed old data structure classes; cleaned up USFM processing logic |
| `silnlp/nmt/translate.py` | Updated imports to reference the new `translation_data_structures` module |
| `silnlp/nmt/hugging_face_config.py` | Updated imports and changed return type from `list` to `SentenceTranslationGroup` |
| `silnlp/nmt/config.py` | Updated imports for `SentenceTranslationGroup` |
| `silnlp/common/translate_google.py` | Updated to use the new `SentenceTranslationGroup` class and updated imports |
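The table mentions that `NLTKSentenceTokenizer` does language-aware splitting with instance caching. One common shape for such a cache is sketched below; only the class name comes from the PR, while the method names, the cache layout, and the regex stand-in for the punkt model (which the real class would load via `nltk.data.load`) are assumptions.

```python
import re
from typing import Dict, List


class NLTKSentenceTokenizer:
    """Language-aware sentence splitter with one cached instance per language."""

    _instances: Dict[str, "NLTKSentenceTokenizer"] = {}

    def __init__(self, lang: str) -> None:
        # In the real class, loading the punkt model for `lang` would happen here.
        self.lang = lang

    @classmethod
    def get(cls, lang: str = "english") -> "NLTKSentenceTokenizer":
        # Cache one tokenizer per language so the model load happens only once.
        if lang not in cls._instances:
            cls._instances[lang] = cls(lang)
        return cls._instances[lang]

    def tokenize(self, text: str) -> List[str]:
        # Regex stand-in for the punkt tokenizer's sentence splitting.
        return re.split(r"(?<=[.!?])\s+", text.strip())
```

Caching matters here because the tokenizer is shared across the codebase: repeated `get()` calls for the same language return the same object instead of reloading the model.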
benjaminking
left a comment
@benjaminking made 9 comments and resolved 9 discussions.
Reviewable status: 0 of 8 files reviewed, all discussions resolved.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
benjaminking
left a comment
@benjaminking made 5 comments and resolved 5 discussions.
Reviewable status: 0 of 8 files reviewed, all discussions resolved.
TaperChipmunk32
left a comment
@TaperChipmunk32 made 1 comment.
Reviewable status: 0 of 8 files reviewed, all discussions resolved.
ddaspit
left a comment
@ddaspit reviewed 8 files and all commit messages, and made 2 comments.
Reviewable status: complete! All files reviewed, all discussions resolved (waiting on benjaminking).
silnlp/common/utils.py line 243 at r3 (raw file):
    self._tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

def _initialize(self) -> None:
Small nit: this method feels like it should be a classmethod, since it doesn't reference any instance fields/methods.
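Concretely, the suggested change might look like the sketch below. The method body is not shown in the quoted diff, so a placeholder comment stands in for it; only the `classmethod` decorator is the point of the suggestion.

```python
class NLTKSentenceTokenizer:
    @classmethod
    def _initialize(cls) -> None:
        # Since no instance state is read or written, a classmethod (or even a
        # staticmethod) makes the intent explicit and allows calling
        # NLTKSentenceTokenizer._initialize() without an instance.
        # ... one-time setup would go here ...
        pass
```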
benjaminking
left a comment
@benjaminking made 1 comment.
Reviewable status: 7 of 8 files reviewed, all discussions resolved (waiting on ddaspit).
silnlp/common/utils.py line 243 at r3 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Small nit: this method feels like it should be a `classmethod`, since it doesn't reference any instance fields/methods.
Done.
ddaspit
left a comment
@ddaspit reviewed 1 file and all commit messages.
Reviewable status: complete! All files reviewed, all discussions resolved (waiting on benjaminking).
This PR fixes #626 by splitting long non-Scripture segments in USFM documents using the NLTK sentence tokenizer and recombining the segments after translation. This also required some refactoring, which accounts for many of the changes.