Split long introductory sentences #951
Conversation
Pull request overview
This PR addresses issue #626 by implementing sentence-level splitting for long non-Scripture USFM segments (>200 characters) to improve translation quality. The solution splits long introductory paragraphs and other non-verse content into individual sentences before translation, then recombines them afterwards while preserving structure.
Changes:
- Refactored translation data structures (`SentenceTranslation`, `SentenceTranslationGroup`, `TranslatedDraft`, `DraftGroup`) into a new module `translation_data_structures.py` to support new combining operations
- Introduced `UsfmTextRowCollection` and `TranslatedTextRowCollection` classes to handle splitting and recombining of USFM text rows
- Created `NLTKSentenceTokenizer` wrapper with caching to standardize sentence tokenization across the codebase
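The split-then-recombine flow described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the 200-character threshold comes from the PR description, while the function names and the regex splitter (standing in for the NLTK punkt tokenizer) are assumptions.

```python
import re

MAX_SEGMENT_LEN = 200  # threshold from the PR description; the constant name is an assumption


def split_long_segment(text: str, max_len: int = MAX_SEGMENT_LEN) -> list[str]:
    """Split a long non-Scripture segment into sentences; leave short ones intact.

    A naive regex splitter stands in here for the NLTK punkt tokenizer the PR uses.
    """
    if len(text) <= max_len:
        return [text]
    # Split after sentence-final punctuation followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())


def recombine(sentences: list[str]) -> str:
    """Rejoin per-sentence translations into a single segment after translation."""
    return " ".join(s.strip() for s in sentences)
```

Each sentence would be translated independently between the two calls, then `recombine` restores a single row so the surrounding USFM structure is preserved.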
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| `silnlp/common/translation_data_structures.py` | New file containing refactored translation data structures with `combine()` methods to merge split sentence translations |
| `silnlp/common/usfm_utils.py` | Added `UsfmTextRowCollection` to split long non-verse rows and `TranslatedTextRowCollection` to recombine translations |
| `silnlp/common/utils.py` | Added `NLTKSentenceTokenizer` class with language-aware sentence splitting and instance caching |
| `silnlp/common/translator.py` | Refactored to use the new `UsfmTextRowCollection`; removed old data structure classes; cleaned up USFM processing logic |
| `silnlp/nmt/translate.py` | Updated imports to reference the new `translation_data_structures` module |
| `silnlp/nmt/hugging_face_config.py` | Updated imports and changed return type from `list` to `SentenceTranslationGroup` |
| `silnlp/nmt/config.py` | Updated imports for `SentenceTranslationGroup` |
| `silnlp/common/translate_google.py` | Updated to use the new `SentenceTranslationGroup` class and updated imports |
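The table mentions that `NLTKSentenceTokenizer` does language-aware splitting with instance caching. One common shape for such a cache is sketched below; only the class name comes from the PR, while the method names, the cache layout, and the regex stand-in for the punkt model (which the real class would load via `nltk.data.load`) are assumptions.

```python
import re
from typing import Dict, List


class NLTKSentenceTokenizer:
    """Language-aware sentence splitter with one cached instance per language."""

    _instances: Dict[str, "NLTKSentenceTokenizer"] = {}

    def __init__(self, lang: str) -> None:
        # In the real class, loading the punkt model for `lang` would happen here.
        self.lang = lang

    @classmethod
    def get(cls, lang: str = "english") -> "NLTKSentenceTokenizer":
        # Cache one tokenizer per language so the model load happens only once.
        if lang not in cls._instances:
            cls._instances[lang] = cls(lang)
        return cls._instances[lang]

    def tokenize(self, text: str) -> List[str]:
        # Regex stand-in for the punkt tokenizer's sentence splitting.
        return re.split(r"(?<=[.!?])\s+", text.strip())
```

Caching matters here because the tokenizer is shared across the codebase: repeated `get()` calls for the same language return the same object instead of reloading the model.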
benjaminking
left a comment
@benjaminking made 9 comments and resolved 9 discussions.
Reviewable status: 0 of 8 files reviewed, all discussions resolved.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
benjaminking
left a comment
@benjaminking made 5 comments and resolved 5 discussions.
Reviewable status: 0 of 8 files reviewed, all discussions resolved.
TaperChipmunk32
left a comment
@TaperChipmunk32 made 1 comment.
Reviewable status: 0 of 8 files reviewed, all discussions resolved.
ddaspit
left a comment
@ddaspit reviewed 8 files and all commit messages, and made 2 comments.
Reviewable status: complete! All files reviewed, all discussions resolved (waiting on benjaminking).
silnlp/common/utils.py line 243 at r3 (raw file):
    self._tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

def _initialize(self) -> None:
Small nit: this method feels like it should be a classmethod, since it doesn't reference any instance fields/methods.
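Concretely, the suggested change might look like the sketch below. The method body is not shown in the quoted diff, so a placeholder comment stands in for it; only the `classmethod` decorator is the point of the suggestion.

```python
class NLTKSentenceTokenizer:
    @classmethod
    def _initialize(cls) -> None:
        # Since no instance state is read or written, a classmethod (or even a
        # staticmethod) makes the intent explicit and allows calling
        # NLTKSentenceTokenizer._initialize() without an instance.
        # ... one-time setup would go here ...
        pass
```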
benjaminking
left a comment
@benjaminking made 1 comment.
Reviewable status: 7 of 8 files reviewed, all discussions resolved (waiting on ddaspit).
silnlp/common/utils.py line 243 at r3 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Small nit: this method feels like it should be a `classmethod`, since it doesn't reference any instance fields/methods.
Done.
ddaspit
left a comment
@ddaspit reviewed 1 file and all commit messages.
Reviewable status: complete! All files reviewed, all discussions resolved (waiting on benjaminking).
This PR fixes #626 by splitting long non-Scripture segments in USFM documents using the NLTK sentence tokenizer and recombining the segments after translation. This also required some refactoring, which accounts for many of the changes.