Cleaning code by TheoMoins · Pull Request #92 · SupervisedStylometry/SuperStyl

TheoMoins · 2025-11-27T15:58:44Z

Progressive refactoring to reduce code duplication:

Refactor `read_clean` and `read_clean_split` into one function

PR not ready yet!

codecov · 2025-11-27T16:00:03Z

Codecov Report

❌ Patch coverage is 87.34177% with 70 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.02%. Comparing base (0117851) to head (b3079a8).
⚠️ Report is 11 commits behind head on master.

Files with missing lines	Patch %	Lines
superstyl/config.py	82.52%	36 Missing ⚠️
superstyl/load.py	84.21%	12 Missing ⚠️
superstyl/preproc/pipe.py	91.89%	12 Missing ⚠️
superstyl/svm.py	72.22%	10 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           master      #92       +/-   ##
===========================================
+ Coverage   74.08%   87.02%   +12.93%     
===========================================
  Files           9       10        +1     
  Lines         656      863      +207     
===========================================
+ Hits          486      751      +265     
+ Misses        170      112       -58

Flag	Coverage Δ
unittests	`87.02% <87.34%> (+12.93%)`	⬆️

TheoMoins · 2025-11-28T15:43:06Z

Created several config classes to better handle all the parameters.

The initial idea was to avoid propagating every parameter through the different functions, which is painful when one wants to add a new parameter and makes every function signature heavy. However, I did not convert everything, in order to keep the API backward compatible.

Merged the two functions, `load` and `load_from_config`, into one function that can use either a config or parameters given individually.

Note that to use several concatenated feature sets, the only way is to provide the information through the config class.
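To illustrate the "config or individual parameters" idea, here is a minimal sketch. It is not the code from this PR: the `Config` stand-in, the `resolve_config` helper, and the choice to raise an error when both styles are mixed are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Config:
    # Hypothetical stand-in for the real config class, reduced to two fields.
    paths: List[str] = field(default_factory=list)
    features: List[Dict[str, Any]] = field(
        default_factory=lambda: [{"type": "words", "n": 1}]
    )


def resolve_config(
    paths: Optional[List[str]] = None,
    config: Optional[Config] = None,
    feats: str = "words",
    n: int = 1,
) -> Config:
    """Return a Config built from either a ready-made config or loose parameters."""
    if config is not None:
        # Assumption: mixing both styles is treated as an error in this sketch.
        if paths is not None:
            raise ValueError("Pass either `config` or individual parameters, not both")
        return config
    # Individual parameters can only describe a single feature set.
    return Config(paths=list(paths or []), features=[{"type": feats, "n": n}])


# Legacy-style call: one feature set described by loose parameters.
single = resolve_config(paths=["a.txt"], feats="chars", n=3)

# Several concatenated feature sets can only be expressed through the config.
multi = resolve_config(
    config=Config(
        paths=["a.txt"],
        features=[{"type": "words", "n": 1}, {"type": "chars", "n": 3}],
    )
)
```

Keeping the loose parameters as keyword arguments is what would let existing callers keep working while new code passes a config object.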

TheoMoins · 2025-12-01T16:35:51Z

SuperStyl Configuration Parameters

Corpus Configuration (`CorpusConfig`)

Parameter	Type	Default	Description
`paths`	`List[str]`	`[]`	List of paths to text files to load
`format`	`str`	`"txt"`	File format. Options: `"txt"`, `"xml"`, `"tei"`, `"txm"`
`identify_lang`	`bool`	`False`	Automatically detects the language of each text (uses langdetect)

Feature Configuration (`FeatureConfig`)

Parameter	Type	Default	Description
`name`	`Optional[str]`	`None`	Name identifying the configuration for multi-feature extractions
`type`	`str`	`"words"`	Type of features to extract. Options: `"words"`, `"chars"`, `"affixes"`, `"lemma"`, `"pos"`, `"met_line"`, `"met_syll"`
`n`	`int`	`1`	N-gram length (e.g., 3 for trigrams)
`k`	`int`	`5000`	Maximum number of most frequent features to keep
`freq_type`	`str`	`"relative"`	Type of frequencies. Options: `"relative"`, `"absolute"`, `"binary"`
`feat_list`	`Optional[List]`	`None`	Predefined list of features to use (for training on a test set)
`feat_list_path`	`Optional[str]`	`None`	Path to a JSON or TXT file containing a predefined features list
`embedding`	`Optional[str]`	`None`	Path to a Word2Vec embedding file (txt format) for extracting semantic frequencies
`neighbouring_size`	`int`	`10`	Number of semantic neighbors to consider in the embedding
`culling`	`float`	`0`	Minimum percentage of samples containing a feature to keep it (0-100)
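As an illustration of the `culling` parameter, the sketch below keeps only the features that occur in at least a given percentage of samples. The helper name `cull_features`, the dict-based sample representation, and the inclusive threshold (`>=`) are assumptions for the example, not SuperStyl's actual implementation.

```python
from collections import Counter
from typing import Dict, List


def cull_features(samples: List[Dict[str, int]], culling: float) -> List[str]:
    """Return features present in at least `culling` percent of the samples."""
    if not samples:
        return []
    # Count in how many samples each feature occurs at least once.
    doc_freq = Counter()
    for counts in samples:
        doc_freq.update(feat for feat, count in counts.items() if count > 0)
    total = len(samples)
    return sorted(
        feat for feat, seen in doc_freq.items() if 100 * seen / total >= culling
    )


samples = [
    {"the": 3, "cat": 1},
    {"the": 2, "dog": 1},
    {"the": 1, "cat": 2},
    {"the": 4},
]
kept = cull_features(samples, culling=50)
```

With `culling=0` every observed feature is kept, which matches the default in the table above.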

Sampling Configuration (`SamplingConfig`)

Parameter	Type	Default	Description
`enabled`	`bool`	`False`	Enables text sampling into segments
`units`	`str`	`"words"`	Sampling unit. Options: `"words"`, `"verses"`
`size`	`int`	`3000`	Size of each segment (in words or verses depending on `units`)
`step`	`Optional[int]`	`None`	Step size between segments (default = `size` for non-overlapping segments)
`max_samples`	`Optional[int]`	`None`	Maximum number of segments per author/class (random selection if exceeded)
`random`	`bool`	`False`	Uses random sampling with replacement instead of continuous sliding
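The interaction between `size` and `step` can be sketched as a sliding window over a token list. This is only an illustration: the function name `sample_windows` is hypothetical, and dropping a trailing segment shorter than `size` is an assumption, not necessarily what SuperStyl does. `max_samples` and `random` are not covered here.

```python
from typing import List, Optional, Sequence


def sample_windows(
    tokens: Sequence[str], size: int = 3000, step: Optional[int] = None
) -> List[Sequence[str]]:
    """Cut `tokens` into segments of `size` items, advancing by `step` each time."""
    # With no step given, segments do not overlap (step equals size).
    step = size if step is None else step
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    # Assumption: an incomplete trailing segment is dropped.
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, step)]


tokens = [f"w{i}" for i in range(10)]
non_overlapping = sample_windows(tokens, size=4)      # starts at 0 and 4
overlapping = sample_windows(tokens, size=4, step=2)  # starts at 0, 2, 4, 6
```

A `step` smaller than `size` yields overlapping segments, so the same text produces more samples at the cost of them being correlated.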

Normalization Configuration (`NormalizationConfig`)

Parameter	Type	Default	Description
`keep_punct`	`bool`	`False`	Preserves punctuation and uppercase/lowercase distinction
`keep_sym`	`bool`	`False`	Preserves punctuation, case, digits, symbols, and diacritical marks (disables Unidecode)
`no_ascii`	`bool`	`False`	Disables ASCII conversion via Unidecode (useful for non-Latin alphabets)

SVM Configuration (`SVMConfig`)

Parameter	Type	Default	Description
`cross_validate`	`Optional[str]`	`None`	Cross-validation method. Options: `"leave-one-out"`, `"k-fold"`, `"group-k-fold"` or `None`
`k`	`int`	`0`	Number of folds for k-fold (0 = default 10) or number of groups for group-k-fold
`dim_reduc`	`Optional[str]`	`None`	Dimensionality reduction. Options: `"pca"` or `None`
`norms`	`bool`	`True`	Applies StandardScaler and Normalizer to the pipeline
`balance`	`Optional[str]`	`None`	Strategy for imbalanced data. Options: `"downsampling"`, `"Tomek"`, `"upsampling"`, `"SMOTE"`, `"SMOTETomek"` or `None`
`class_weights`	`bool`	`False`	Uses balanced class weights (inversely proportional to class sizes)
`kernel`	`str`	`"LinearSVC"`	SVM kernel type. Options: `"LinearSVC"`, `"linear"`, `"sigmoid"`, `"rbf"`, `"poly"`
`final_pred`	`bool`	`False`	Trains the final model on the entire training set for final predictions
`get_coefs`	`bool`	`False`	Extracts and visualizes the most important coefficients for each class (LinearSVC only)
`plot_rolling`	`bool`	`False`	Generates rolling stylometry plots (requires `final_pred=True` and sampling)
`plot_smoothing`	`int`	`3`	Window size for smoothing the rolling plot (0 to disable)

Main Configuration (`Config`)

Parameter	Type	Default	Description
`corpus`	`CorpusConfig`	See CorpusConfig	Corpus configuration
`features`	`List[FeatureConfig]`	`[FeatureConfig()]`	List of feature configurations (allows multiple simultaneous extractions)
`sampling`	`SamplingConfig`	See SamplingConfig	Sampling configuration
`normalization`	`NormalizationConfig`	See NormalizationConfig	Normalization configuration
`svm`	`SVMConfig`	See SVMConfig	SVM configuration
`output_prefix`	`Optional[str]`	`None`	Optional prefix for output files
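One way such nested configuration classes could be written is with dataclasses. The sketch below is an assumption about the shape, not the PR's code: it reproduces only a subset of the fields and defaults from the tables above, and the PR may not use dataclasses at all.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CorpusConfig:
    paths: List[str] = field(default_factory=list)
    format: str = "txt"
    identify_lang: bool = False


@dataclass
class FeatureConfig:
    # Subset of the documented fields, with the documented defaults.
    name: Optional[str] = None
    type: str = "words"
    n: int = 1
    k: int = 5000
    freq_type: str = "relative"
    culling: float = 0


@dataclass
class Config:
    # default_factory gives every Config its own nested objects, so editing
    # one instance never leaks into another.
    corpus: CorpusConfig = field(default_factory=CorpusConfig)
    features: List[FeatureConfig] = field(
        default_factory=lambda: [FeatureConfig()]
    )
    output_prefix: Optional[str] = None


first, second = Config(), Config()
first.features.append(FeatureConfig(name="char_3grams", type="chars", n=3))
```

Adding a new parameter then means adding one field with a default, rather than touching every function signature along the call chain.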

Usage Examples

Example 1: Minimal Configuration

{
  "corpus": {
    "paths": ["data/texts/*.txt"],
    "format": "txt"
  },
  "features": [
    {
      "type": "words",
      "n": 1
    }
  ]
}

Example 2: Advanced Configuration

{
  "corpus": {
    "paths": ["data/texts/*.txt"],
    "format": "txt",
    "identify_lang": true
  },
  "features": [
    {
      "name": "word_1grams",
      "type": "words",
      "n": 1,
      "k": 3000,
      "freq_type": "relative",
      "culling": 5
    },
    {
      "name": "char_3grams",
      "type": "chars",
      "n": 3,
      "k": 5000,
      "freq_type": "relative"
    }
  ],
  "sampling": {
    "enabled": true,
    "units": "words",
    "size": 1000,
    "step": 500,
    "max_samples": 10
  },
  "normalization": {
    "keep_punct": false,
    "keep_sym": false,
    "no_ascii": false
  },
  "svm": {
    "cross_validate": "k-fold",
    "k": 10,
    "norms": true,
    "balance": "SMOTE",
    "kernel": "LinearSVC",
    "final_pred": true,
    "get_coefs": true
  }
}
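To show how a file like Example 1 might be combined with the documented defaults, here is a small sketch using only the standard library. The `with_defaults` helper and the decision to reject unknown keys are hypothetical, and only the corpus, sampling, and a few feature defaults are reproduced.

```python
import json
from typing import Any, Dict

# Defaults copied from the parameter tables (subset only).
SECTION_DEFAULTS: Dict[str, Dict[str, Any]] = {
    "corpus": {"paths": [], "format": "txt", "identify_lang": False},
    "sampling": {
        "enabled": False, "units": "words", "size": 3000,
        "step": None, "max_samples": None, "random": False,
    },
}
FEATURE_DEFAULTS: Dict[str, Any] = {
    "name": None, "type": "words", "n": 1, "k": 5000,
    "freq_type": "relative", "culling": 0,
}


def with_defaults(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in missing keys of a parsed config with the documented defaults."""
    resolved: Dict[str, Any] = {}
    for section, defaults in SECTION_DEFAULTS.items():
        given = raw.get(section, {})
        # Assumption: unknown keys are rejected so that typos surface early.
        unknown = set(given) - set(defaults)
        if unknown:
            raise KeyError(f"Unknown keys in '{section}': {sorted(unknown)}")
        resolved[section] = {**defaults, **given}
    # An absent or empty feature list falls back to one default feature set.
    features = raw.get("features") or [{}]
    resolved["features"] = [{**FEATURE_DEFAULTS, **feat} for feat in features]
    return resolved


minimal = json.loads(
    '{"corpus": {"paths": ["data/texts/*.txt"], "format": "txt"},'
    ' "features": [{"type": "words", "n": 1}]}'
)
resolved = with_defaults(minimal)
```

Under these assumptions, the minimal example resolves to unsampled word unigrams with `k` left at 5000.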

refactor read_clean and read_clean_split into one function

f366068

TheoMoins added 5 commits November 27, 2025 17:23

add tests for select part

fbbe850

Adding a config class to better handle the pile of parameters

9609d4d

Merge load and load_from_config functionnality

f8b14ce

creating tons of errors oupsi

bc93751

Fix tests

267a09c

TheoMoins added 3 commits December 1, 2025 12:10

Simplify functions arguments of load_corpus and train_svm

95c0e91

Simplify function calls

03e3418

fix tests

b3079a8

Jean-Baptiste-Camps merged commit 7744aef into master Jan 14, 2026
6 checks passed

Jean-Baptiste-Camps merged 9 commits into master from refactoring_code

2 participants