Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #92 +/- ##
===========================================
+ Coverage 74.08% 87.02% +12.93%
===========================================
Files 9 10 +1
Lines 656 863 +207
===========================================
+ Hits 486 751 +265
+ Misses 170 112 -58
The initial idea was to avoid propagating every parameter through the different functions, which is painful whenever one wants to add a new parameter and makes all the function signatures heavy. I did not convert everything, though, in order to keep the API backward compatible.
Note that if one wants to use several concatenated feature sets, the only way is to provide that information through the config class.
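To illustrate the trade-off described above, here is a minimal sketch, assuming a hypothetical `extract` entry point and a cut-down `FeatureConfig`; neither is taken from the SuperStyl code base. The function receives a single config object but still accepts the old keyword arguments, so existing callers keep working.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class FeatureConfig:
    """Cut-down, hypothetical stand-in for the real config class."""
    type: str = "words"
    n: int = 1
    k: int = 5000


def extract(text: str, config: Optional[FeatureConfig] = None,
            **legacy_kwargs) -> FeatureConfig:
    """Hypothetical entry point; returns the effective configuration.

    Old-style calls such as extract(text, n=3) still work: the keyword
    arguments override the fields of the default or supplied config.
    """
    config = config or FeatureConfig()
    if legacy_kwargs:
        # replace() raises TypeError for unknown field names, so a typo in
        # a legacy argument is reported instead of being silently ignored.
        config = replace(config, **legacy_kwargs)
    return config
```

Adding a new parameter then only means adding one field to the dataclass; the signatures of the functions that pass the config along do not change.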
SuperStyl Configuration Parameters
Corpus Configuration (CorpusConfig)
| Parameter | Type | Default | Description |
|---|---|---|---|
| paths | List[str] | [] | List of paths to text files to load |
| format | str | "txt" | File format. Options: "txt", "xml", "tei", "txm" |
| identify_lang | bool | False | Automatically detects the language of each text (uses langdetect) |
Feature Configuration (FeatureConfig)
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | Optional[str] | None | Name identifying the configuration for multi-feature extractions |
| type | str | "words" | Type of features to extract. Options: "words", "chars", "affixes", "lemma", "pos", "met_line", "met_syll" |
| n | int | 1 | N-gram length (e.g., 3 for trigrams) |
| k | int | 5000 | Maximum number of most frequent features to keep |
| freq_type | str | "relative" | Type of frequencies. Options: "relative", "absolute", "binary" |
| feat_list | Optional[List] | None | Predefined list of features to use (for training on a test set) |
| feat_list_path | Optional[str] | None | Path to a JSON or TXT file containing a predefined features list |
| embedding | Optional[str] | None | Path to a Word2Vec embedding file (txt format) for extracting semantic frequencies |
| neighbouring_size | int | 10 | Number of semantic neighbors to consider in the embedding |
| culling | float | 0 | Minimum percentage of samples containing a feature to keep it (0-100) |
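The interplay of `n`, `k`, `culling` and `freq_type` can be sketched as follows. This is an illustration in plain Python, not SuperStyl's implementation: the helper names are hypothetical, culling is assumed to be applied after the top-`k` cut, and relative frequencies are assumed to be divided by the number of n-grams in the sample.

```python
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n: int = 1) -> List[str]:
    """Contiguous n-grams of a token (or character) list, joined by '_'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def select_features(docs: List[List[str]], n: int = 1, k: int = 5000,
                    culling: float = 0.0) -> List[str]:
    """Keep the k most frequent n-grams, then drop those present in fewer
    than `culling` percent of the samples (order of steps is an assumption)."""
    per_doc = [Counter(ngrams(tokens, n)) for tokens in docs]
    totals: Counter = Counter()
    for counts in per_doc:
        totals.update(counts)
    selected = []
    for feat, _ in totals.most_common(k):
        share = 100.0 * sum(feat in counts for counts in per_doc) / len(docs)
        if share >= culling:
            selected.append(feat)
    return selected


def frequencies(tokens: List[str], feats: List[str], n: int = 1,
                freq_type: str = "relative") -> Dict[str, float]:
    """One row of the feature table for a single sample."""
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())
    row = {}
    for feat in feats:
        c = counts[feat]
        if freq_type == "absolute":
            row[feat] = float(c)
        elif freq_type == "binary":
            row[feat] = 1.0 if c else 0.0
        elif freq_type == "relative":
            row[feat] = c / total if total else 0.0
        else:
            raise ValueError(f"unknown freq_type: {freq_type!r}")
    return row
```

For character n-grams, the same functions can be called with `list(text)` as the token list.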
Sampling Configuration (SamplingConfig)
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | False | Enables text sampling into segments |
| units | str | "words" | Sampling unit. Options: "words", "verses" |
| size | int | 3000 | Size of each segment (in words or verses depending on units) |
| step | Optional[int] | None | Step size between segments (default = size for non-overlapping segments) |
| max_samples | Optional[int] | None | Maximum number of segments per author/class (random selection if exceeded) |
| random | bool | False | Uses random sampling with replacement instead of continuous sliding |
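How `size`, `step` and `max_samples` relate can be shown with a short sliding-window sketch. It is a simplified illustration with a hypothetical function name: it works on one token list rather than per author/class, assumes that a trailing incomplete segment is dropped, and does not cover the `random` mode.

```python
import random
from typing import List, Optional


def sample_segments(tokens: List[str], size: int = 3000,
                    step: Optional[int] = None,
                    max_samples: Optional[int] = None,
                    seed: Optional[int] = None) -> List[List[str]]:
    """Cut a token list into windows of `size` tokens, `step` tokens apart."""
    step = step or size  # step=None means non-overlapping segments
    segments = [tokens[i:i + size]
                for i in range(0, len(tokens) - size + 1, step)]
    if max_samples is not None and len(segments) > max_samples:
        # Too many segments: keep a random selection of max_samples of them.
        segments = random.Random(seed).sample(segments, max_samples)
    return segments
```

With `size=1000` and `step=500`, consecutive segments overlap by half; leaving `step` unset yields disjoint segments.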
Normalization Configuration (NormalizationConfig)
| Parameter | Type | Default | Description |
|---|---|---|---|
| keep_punct | bool | False | Preserves punctuation and uppercase/lowercase distinction |
| keep_sym | bool | False | Preserves punctuation, case, digits, symbols, and diacritical marks (disables Unidecode) |
| no_ascii | bool | False | Disables ASCII conversion via Unidecode (useful for non-Latin alphabets) |
SVM Configuration (SVMConfig)
| Parameter | Type | Default | Description |
|---|---|---|---|
| cross_validate | Optional[str] | None | Cross-validation method. Options: "leave-one-out", "k-fold", "group-k-fold" or None |
| k | int | 0 | Number of folds for k-fold (0 = default 10) or number of groups for group-k-fold |
| dim_reduc | Optional[str] | None | Dimensionality reduction. Options: "pca" or None |
| norms | bool | True | Applies StandardScaler and Normalizer to the pipeline |
| balance | Optional[str] | None | Strategy for imbalanced data. Options: "downsampling", "Tomek", "upsampling", "SMOTE", "SMOTETomek" or None |
| class_weights | bool | False | Uses balanced class weights (inversely proportional to class sizes) |
| kernel | str | "LinearSVC" | SVM kernel type. Options: "LinearSVC", "linear", "sigmoid", "rbf", "poly" |
| final_pred | bool | False | Trains the final model on the entire training set for final predictions |
| get_coefs | bool | False | Extracts and visualizes the most important coefficients for each class (LinearSVC only) |
| plot_rolling | bool | False | Generates rolling stylometry plots (requires final_pred=True and sampling) |
| plot_smoothing | int | 3 | Window size for smoothing the rolling plot (0 to disable) |
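Several SVM options constrain each other (`get_coefs` needs LinearSVC, `plot_rolling` needs `final_pred` and sampling, `k=0` means 10 folds). A possible consistency check is sketched below; the function names are hypothetical, and whether the PR performs such validation at all is an assumption.

```python
from typing import List, Optional

# Allowed values as listed in the table above.
CROSS_VALIDATE = {None, "leave-one-out", "k-fold", "group-k-fold"}
KERNELS = {"LinearSVC", "linear", "sigmoid", "rbf", "poly"}
BALANCE = {None, "downsampling", "Tomek", "upsampling", "SMOTE", "SMOTETomek"}


def kfold_splits(k: int = 0) -> int:
    """Number of folds for k-fold: 0 stands for the default of 10."""
    return 10 if k == 0 else k


def check_svm_options(cross_validate: Optional[str] = None,
                      kernel: str = "LinearSVC",
                      balance: Optional[str] = None,
                      final_pred: bool = False,
                      get_coefs: bool = False,
                      plot_rolling: bool = False,
                      sampling_enabled: bool = False) -> List[str]:
    """Return a list of human-readable problems (empty if consistent)."""
    problems = []
    if cross_validate not in CROSS_VALIDATE:
        problems.append(f"unknown cross_validate: {cross_validate!r}")
    if kernel not in KERNELS:
        problems.append(f"unknown kernel: {kernel!r}")
    if balance not in BALANCE:
        problems.append(f"unknown balance strategy: {balance!r}")
    if get_coefs and kernel != "LinearSVC":
        problems.append("get_coefs is only available with kernel='LinearSVC'")
    if plot_rolling and not (final_pred and sampling_enabled):
        problems.append("plot_rolling requires final_pred=True and sampling")
    return problems
```

Returning all problems at once, rather than raising on the first, lets a user fix a configuration file in one pass.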
Main Configuration (Config)
| Parameter | Type | Default | Description |
|---|---|---|---|
| corpus | CorpusConfig | See CorpusConfig | Corpus configuration |
| features | List[FeatureConfig] | [FeatureConfig()] | List of feature configurations (allows multiple simultaneous extractions) |
| sampling | SamplingConfig | See SamplingConfig | Sampling configuration |
| normalization | NormalizationConfig | See NormalizationConfig | Normalization configuration |
| svm | SVMConfig | See SVMConfig | SVM configuration |
| output_prefix | Optional[str] | None | Optional prefix for output files |
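The nesting described in this table could be expressed with dataclasses roughly as below. This is a sketch under assumptions, not the PR's code: only the class and field names come from the tables, the sampling, normalization and SVM sections are left out for brevity, and whether the real classes are dataclasses is not stated here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CorpusConfig:
    # Mutable defaults such as [] need default_factory so that
    # instances do not share the same list object.
    paths: List[str] = field(default_factory=list)
    format: str = "txt"
    identify_lang: bool = False


@dataclass
class FeatureConfig:
    name: Optional[str] = None
    type: str = "words"
    n: int = 1
    k: int = 5000
    freq_type: str = "relative"


@dataclass
class Config:
    corpus: CorpusConfig = field(default_factory=CorpusConfig)
    # Default is a list holding one default feature block: [FeatureConfig()].
    features: List[FeatureConfig] = field(
        default_factory=lambda: [FeatureConfig()])
    output_prefix: Optional[str] = None
```

Several concatenated feature sets are then just several entries in `features`, for example `Config(features=[FeatureConfig(name="w1"), FeatureConfig(name="c3", type="chars", n=3)])`.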
Usage Examples
Example 1: Minimal Configuration
{
"corpus": {
"paths": ["data/texts/*.txt"],
"format": "txt"
},
"features": [
{
"type": "words",
"n": 1
}
]
}
Example 2: Advanced Configuration
{
"corpus": {
"paths": ["data/texts/*.txt"],
"format": "txt",
"identify_lang": true
},
"features": [
{
"name": "word_1grams",
"type": "words",
"n": 1,
"k": 3000,
"freq_type": "relative",
"culling": 5
},
{
"name": "char_3grams",
"type": "chars",
"n": 3,
"k": 5000,
"freq_type": "relative"
}
],
"sampling": {
"enabled": true,
"units": "words",
"size": 1000,
"step": 500,
"max_samples": 10
},
"normalization": {
"keep_punct": false,
"keep_sym": false,
"no_ascii": false
},
"svm": {
"cross_validate": "k-fold",
"k": 10,
"norms": true,
"balance": "SMOTE",
"kernel": "LinearSVC",
"final_pred": true,
"get_coefs": true
}
}
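Both examples omit most keys and rely on the defaults from the tables above. The sketch below shows one way such a file could be resolved: parse the JSON, fill each feature block with the documented defaults, and reject unknown keys. The function name and the strictness about unknown keys are assumptions, not behaviour confirmed by the PR.

```python
import json
from typing import Any, Dict, List

# Defaults copied from the FeatureConfig table.
FEATURE_DEFAULTS: Dict[str, Any] = {
    "name": None, "type": "words", "n": 1, "k": 5000,
    "freq_type": "relative", "feat_list": None, "feat_list_path": None,
    "embedding": None, "neighbouring_size": 10, "culling": 0,
}


def resolve_features(raw_json: str) -> List[Dict[str, Any]]:
    """Return the feature blocks of a JSON config with defaults filled in."""
    data = json.loads(raw_json)
    # A missing "features" key falls back to a single default block.
    resolved = []
    for entry in data.get("features", [{}]):
        unknown = set(entry) - set(FEATURE_DEFAULTS)
        if unknown:
            raise ValueError(f"unknown feature option(s): {sorted(unknown)}")
        # Values given in the file override the defaults.
        resolved.append({**FEATURE_DEFAULTS, **entry})
    return resolved
```

Applied to Example 1, this yields a single block equal to the defaults, since `"type": "words"` and `"n": 1` are already the default values.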
Progressive improvement of code factorization:
PR not ready yet!