Replicated softmax#137

Open
Fede-Rausa wants to merge 13 commits into master from replicated_softmax

Conversation

@Fede-Rausa
Collaborator

This PR adds the Replicated Softmax (a restricted Boltzmann machine for topic modeling) and the Over-Replicated Softmax (a deep version of the RSM) to the OCTIS list of topic models.
Both can currently be evaluated only with classification, diversity, and coherence metrics, but they also support a fast local computation of the perplexity upper bound (which OCTIS already supports for LDA through the Gensim implementation).

@silviatti
Collaborator

Hi Federico, regarding the Python 3.8 error: it seems that some libraries have dropped support for Python 3.8, so I think it's time to deprecate it. Could you please remove Python 3.8 from python-publish.yml and any other related references?

Also, please add tests to ensure everything works as expected. I don't currently have time to verify the implementation manually, but adding tests will give us more confidence and serve as guardrails. You can follow the same patterns as in: https://github.com/MIND-Lab/OCTIS/blob/master/tests/test_octis.py and https://github.com/MIND-Lab/OCTIS/blob/master/tests/test_optimization.py.

Lastly, could you run a formatter (e.g., ruff) on the files you added?

Thanks,

Silvia

@silviatti self-requested a review November 24, 2025 00:20
Collaborator Author

@Fede-Rausa left a comment


Following your advice, the old tests now pass, and the build works for Python 3.9 and 3.10. I will soon supply the test functions for the RSM and oRSM, plus better formatting. Thank you for the support.

Best regards,

Federico

Collaborator

@silviatti left a comment


Please review the comments in the code. I recommend running a linter to check for unused functions, duplicated code blocks, and other potential issues. I also asked ChatGPT for a quick review and it identified several points worth examining.

I am not an expert in RSM, so I can only validate those findings to a certain extent. I suggest doing a similar review yourself to further assess and verify potential issues in the implementation.

Comment on lines +361 to +377
def _get_topic_word_matrix(self):
"""
Return the topic representation of the words
"""
w_vh, w_v, w_h = self.W
topic_word_matrix = w_vh.T
normalized = []
for words_w in topic_word_matrix:
minimum = min(words_w)
words = words_w - minimum
normalized.append([float(i) / sum(words) for i in words])
topic_word_matrix = np.array(normalized)
return topic_word_matrix

def _get_topic_word_matrix0(self):
"""
Return the topic representation of the words
Collaborator


these functions are duplicated

visible_sample = self.sample_softmax(visible_probs, D)
return visible_sample

def sample_h2(self, h1):
Collaborator


is this function used? if not, remove

mu1 = self.v_and_h2_to_h1(v, h2)
mu2 = self.h1_to_softmax(mu1)

if (old_mu2 - mu2).sum() < self.epsilon:
Collaborator


are you sure about this? I asked ChatGPT and it flagged this check as incorrect because:

  • positive and negative differences may cancel out
  • it does not use absolute values

It should be:

if np.linalg.norm(old_mu2 - mu2) < self.epsilon:

or

if np.sum(np.abs(old_mu2 - mu2)) < self.epsilon:
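The cancellation issue is easy to demonstrate with a minimal sketch (values made up for illustration): the mean-field update changes substantially, yet the signed sum still reports convergence, while an L1 check does not.

```python
import numpy as np

# made-up values: a large mean-field update whose positive and
# negative differences cancel out exactly
old_mu2 = np.array([0.5, 0.5])
mu2 = np.array([0.9, 0.1])

signed = (old_mu2 - mu2).sum()        # ~0.0: wrongly signals convergence
l1 = np.sum(np.abs(old_mu2 - mu2))    # 0.8: the update was actually large

epsilon = 1e-3
assert signed < epsilon        # the original check would stop here
assert not (l1 < epsilon)      # the L1 check correctly keeps iterating
```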

Collaborator Author


I agree, that's a big point. The error is also present in the original version proposed by dongwookim. I may have to tune the default value of epsilon, making it larger to allow faster convergence.

Collaborator Author


I have run the tests and convergence is reached with a reasonably small epsilon, so I have applied the changes.

else:
self.dtm = dtm
if doval:
self.val_dtm = np.log(1 + val_dtm)
Collaborator


is it correct that you log-transform the validation set?

Collaborator Author


No, it's indeed wrong, I will correct it. The log transform should be applied only when the user requests it, for both training and validation (it should be treated as a preprocessing step).

w_vh, w_v, w_h = self.W
D = v.sum(axis=1)
energy = np.outer((D + self.M), w_h) + (v @ w_vh) * np.reshape(
(1 + self.M / D), (-1, 1)
Collaborator


can D=0 lead to a division by zero error?

Collaborator Author


Yes, absolutely. But it isn't supposed to happen: every document observed in the dataset should have at least one word, and if that word, or all the words in the document, weren't in the vocabulary, it would raise an error. However, this cannot happen in either the training or the validation set, since the softmax layer is initialized from the vocabulary and the vocabulary is built from the entire corpus (validation included). So this is a problem only when using the model on documents with no words, or on documents preprocessed differently from OCTIS.
To ensure that each element of D is positive, there are two ways:

way 1: throw an error like this:

assert not np.any(D == 0), 'D vector contains 0. All documents should have at least one word in the vocabulary'

This can slow down training, since every batch iteration would have to evaluate the condition D == 0. It may make more sense to run the check once at the start of training, but the problem would persist for new documents in validation or out of corpus.

way 2: replace (1 + self.M / D) with (1 + self.M / (D+1))

This easily solves the problem, but it introduces a small bias in the strength of the prior (M is the fixed length of the prior version of the document: when D > M the observed document is stronger than its prior version and has more impact on the topic estimates, and vice versa when M > D).
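A minimal sketch of way 1, run once at setup time rather than per batch (the function name is hypothetical, not part of the PR):

```python
import numpy as np

def check_positive_doc_lengths(dtm):
    """Fail fast if any document has zero in-vocabulary words.

    Intended to run once when the document-term matrix is set up,
    avoiding a per-batch D == 0 check during training.
    """
    D = np.asarray(dtm).sum(axis=1)
    if np.any(D == 0):
        raise ValueError(
            "All documents should have at least one word in the vocabulary"
        )
    return D
```

As noted above, this catches empty documents in the corpus at setup, but documents seen only at prediction time would still need the same guard.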

Collaborator Author


I have added:

D = self.dtm.sum(axis=1)
assert not np.any(D == 0), 'all the documents should have positive length'

inside set_structure_from_dtm to prevent this from happening (v is a batch of the dtm).

else:
self.dtm = dtm
if doval:
self.val_dtm = np.log(1 + val_dtm)
Collaborator


see similar comment in oRSM

"""
mfh = self.visible2hidden(dtm)
vprob = self.hidden2visible(mfh)
lpub = np.exp(-np.nansum(np.log(vprob) * dtm) / np.sum(dtm))
Collaborator


is this log perplexity or simply perplexity?

Collaborator Author


Simply perplexity; I have to either correct the name or remove the np.exp.
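For reference, a minimal sketch of the quantity computed above (the helper name is hypothetical), with clipping to guard against log(0); dropping the final np.exp would yield the log-perplexity instead:

```python
import numpy as np

def perplexity_from_probs(vprob, dtm, eps=1e-12):
    """Perplexity from reconstructed word probabilities and word counts.

    vprob: per-document word probabilities (rows sum to 1)
    dtm:   document-term counts of the same shape
    Clipping vprob replaces the np.nansum workaround for zero probabilities.
    """
    vprob = np.clip(vprob, eps, None)
    log_ppl = -np.sum(dtm * np.log(vprob)) / np.sum(dtm)
    return np.exp(log_ppl)
```

Sanity check: with uniform probabilities over a vocabulary of V words, the perplexity equals V regardless of the counts.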

given a document term matrix
"""
mfh = self.visible2hidden(dtm)
vprob = self.hidden2visible(mfh)
Collaborator


possibly clip vprob (e.g. vprob = np.clip(vprob, 1e-12, None)) to avoid vprob = 0 and NaN values later?

Collaborator


also, can np.sum(dtm) = 0?

Collaborator Author


For the first problem, your clip solution is more elegant and efficient than my sad np.nansum operation, so I will use it.
For the second, the answer is similar to the D = 0 problem above: it shouldn't be possible for dtm to be a matrix of zeros. The dtm has as many columns as there are words in the vocabulary initialized from the preprocessed corpus. To handle the case where the vocabulary is given and the corpus contains empty documents, an assert check can be added.

Maybe an error can be thrown this way?

assert np.sum(dtm) > 0, 'All the documents in the corpus seem to be empty for the given vocabulary'

DTM[i, id] = count
return DTM

class oRSM_model(object):
Collaborator


I would recommend implementing a shared superclass which implements the shared functions, while you override the model-specific functions. This prevents you from having a lot of duplicated code e.g. _get_topic_word_matrix, get_topics, softmax, multinomial_sample, set_train_hyper, etc.
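A rough sketch of what such a superclass could look like (all class and method bodies here are hypothetical placeholders; only the structure matters):

```python
import numpy as np

class ReplicatedSoftmaxBase:
    """Shared machinery for the RSM and oRSM models (illustrative sketch)."""

    @staticmethod
    def softmax(x):
        # subtract the row max for numerical stability before exponentiating
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def get_topics(self, vocab, topk=10):
        # rank the top-k words per topic from the topic-word matrix
        twm = self._get_topic_word_matrix()
        return [[vocab[i] for i in np.argsort(row)[::-1][:topk]]
                for row in twm]

class RSM_model(ReplicatedSoftmaxBase):
    def _get_topic_word_matrix(self):
        raise NotImplementedError  # model-specific shallow weights go here

class oRSM_model(ReplicatedSoftmaxBase):
    def _get_topic_word_matrix(self):
        raise NotImplementedError  # deep variant overrides this
```

Each subclass overrides only the model-specific pieces, so helpers like softmax or get_topics live in exactly one place.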

Collaborator Author


Ok, I will add a third file, RS_utils.py, together with RSM.py and oRSM.py in the models folder.

Collaborator Author


I have called the file RS_class.py; the shared superclass is implemented there. I hope I've done it the way you meant. Many functions (like set_train_hyper) are quite different between the two classes, despite sharing names.

setup.py Outdated
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
#'Programming Language :: Python :: 3.7',
Collaborator


just remove

@Fede-Rausa
Collaborator Author

Hi Silvia, thanks for the review, it surfaced important problems.
I will try to solve everything and answer each question as soon as possible.

@Fede-Rausa
Collaborator Author

Hi Silvia, I have moved the duplicated functions into RS_class.py and removed them from the other scripts. Let me know if I have done everything correctly, or if I need to fix anything else. Thank you for your time.

Federico

