Conversation
Hi Federico, regarding the Python 3.8 error: it seems that some libraries have dropped support for Python 3.8, so I think it's time to deprecate it. Could you please remove Python 3.8 from python-publish.yml and any other related references? Also, please add tests to ensure everything works as expected. I don't currently have time to verify the implementation manually, but adding tests will give us more confidence and serve as guardrails. You can follow the same patterns as in: https://github.com/MIND-Lab/OCTIS/blob/master/tests/test_octis.py and https://github.com/MIND-Lab/OCTIS/blob/master/tests/test_optimization.py. Lastly, could you run a formatter (e.g., ruff) on the files you added? Thanks, Silvia
silviatti
left a comment
Please review the comments in the code. I recommend running a linter to check for unused functions, duplicated code blocks, and other potential issues. I also asked ChatGPT for a quick review and it identified several points worth examining.
I am not an expert in RSM, so I can only validate those findings to a certain extent. I suggest doing a similar review yourself to further assess and verify potential issues in the implementation.
octis/models/oRSM.py
Outdated
def _get_topic_word_matrix(self):
    """
    Return the topic representation of the words
    """
    w_vh, w_v, w_h = self.W
    topic_word_matrix = w_vh.T
    normalized = []
    for words_w in topic_word_matrix:
        minimum = min(words_w)
        words = words_w - minimum
        normalized.append([float(i) / sum(words) for i in words])
    topic_word_matrix = np.array(normalized)
    return topic_word_matrix

def _get_topic_word_matrix0(self):
    """
    Return the topic representation of the words
These functions are duplicated.
octis/models/oRSM.py
Outdated
    visible_sample = self.sample_softmax(visible_probs, D)
    return visible_sample

def sample_h2(self, h1):
Is this function used? If not, remove it.
octis/models/oRSM.py
Outdated
mu1 = self.v_and_h2_to_h1(v, h2)
mu2 = self.h1_to_softmax(mu1)

if (old_mu2 - mu2).sum() < self.epsilon:
Are you sure about this? I asked ChatGPT and it says it's incorrect because:
- differences may cancel out (positive and negative entries)
- it does not use absolute values
It should be:
if np.linalg.norm(old_mu2 - mu2) < self.epsilon:
or:
if np.sum(np.abs(old_mu2 - mu2)) < self.epsilon:
I agree, that's a big point. That error is also present in the original version proposed by dongwookim. Maybe I will have to tune the default value of epsilon, making it larger to allow faster convergence.
I have run the tests: convergence is reached with a reasonably small epsilon, so I have applied the change.
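For reference, a minimal sketch of the corrected convergence check discussed above (hedged: `old_mu2`, `mu2`, and `epsilon` are stand-ins for the actual attributes in oRSM.py, not the exact implementation):

```python
import numpy as np

def has_converged(old_mu2, mu2, epsilon=1e-6):
    """Convergence test for the mean-field update.

    Uses the L1 norm of the difference, so positive and negative
    entries cannot cancel out the way they can with
    (old_mu2 - mu2).sum().
    """
    return np.sum(np.abs(old_mu2 - mu2)) < epsilon

# The buggy check can report convergence even when mu2 changed,
# because the signed differences cancel:
old = np.array([0.5, 0.5])
new = np.array([0.6, 0.4])
# (old - new).sum() == 0.0, but has_converged(old, new) is False.
```

The `np.linalg.norm` variant behaves the same way qualitatively; only the scale of a sensible `epsilon` differs.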
octis/models/oRSM.py
Outdated
else:
    self.dtm = dtm
if doval:
    self.val_dtm = np.log(1 + val_dtm)
Is it correct that you log-transform the validation set?
No, it is indeed wrong; I will correct it. The log transform should be applied only when the user requests it, for both training and validation (it should be considered a preprocessing step).
w_vh, w_v, w_h = self.W
D = v.sum(axis=1)
energy = np.outer((D + self.M), w_h) + (v @ w_vh) * np.reshape(
    (1 + self.M / D), (-1, 1)
Can D = 0 lead to a division-by-zero error?
Yes, absolutely. But it isn't supposed to happen: every document observed in the dataset should have at least one word, and if that word (or all the words in the document) were missing from the vocabulary, an error would already be raised. This cannot happen in the training or validation sets, since the softmax layer is initialized from the vocabulary and the vocabulary is built from the entire corpus (validation included). So it is a problem only when using the model on documents with no words, or on documents preprocessed differently from OCTIS.
To ensure that each element of D is positive, there are two ways:
Way 1: raise an error, e.g.:
assert not np.any(D == 0), 'D vector contains 0. All the documents should have at least one word in the vocabulary'
This can slow down training, since every batch iteration would have to evaluate the condition D == 0. It may make more sense to check once at the start of training, but the problem would persist for new documents in validation or out of the corpus.
Way 2: replace (1 + self.M / D) with (1 + self.M / (D + 1)).
This solves the problem easily, but introduces a small bias in the strength of the prior (M is the fixed length of the prior version of the document: when D > M, the observed document is stronger than its prior version and has more impact on the topic estimates, and vice versa when M > D).
I have added:
D = self.dtm.sum(axis=1)
assert not np.any(D == 0), 'all the documents should have positive length'
inside set_structure_from_dtm to prevent this from happening (v is a batch of the dtm).
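A sketch of such a guard, assuming `dtm` is a NumPy document-term matrix (the function name and message mirror the discussion above; they are illustrative, not the exact code in set_structure_from_dtm):

```python
import numpy as np

def check_document_lengths(dtm):
    """Fail fast if any document has no words in the vocabulary.

    Note the assertion must be `not np.any(D == 0)`: it should FAIL
    when a zero-length document is present, not when none is.
    """
    D = dtm.sum(axis=1)
    assert not np.any(D == 0), 'all the documents should have positive length'
    return D
```

Running this once, when the structure is set from the dtm, avoids the per-batch cost of "way 1" while still catching empty documents before any division by D.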
octis/models/RSM.py
Outdated
else:
    self.dtm = dtm
if doval:
    self.val_dtm = np.log(1 + val_dtm)
See the similar comment in oRSM.
octis/models/RSM.py
Outdated
| """ | ||
| mfh = self.visible2hidden(dtm) | ||
| vprob = self.hidden2visible(mfh) | ||
| lpub = np.exp(-np.nansum(np.log(vprob) * dtm) / np.sum(dtm)) |
Is this log-perplexity or simply perplexity?
Simply perplexity; I have to correct the name or remove the np.exp.
given a document term matrix
"""
mfh = self.visible2hidden(dtm)
vprob = self.hidden2visible(mfh)
Possibly clip vprob (e.g. vprob = np.clip(vprob, 1e-12, None)) to avoid vprob = 0 and NaN values later?
Also, can np.sum(dtm) = 0?
For the first problem, your clip solution is more elegant and efficient than my clumsy np.nansum workaround, so I will use it.
For the second, the answer is similar to the D = 0 problem above: dtm should never be a matrix of zeros, since its columns correspond to the words of the vocabulary initialized from the preprocessed corpus. To handle the case where the vocabulary is given and the corpus contains empty documents, an assert check can be added. Maybe an error can be raised in this way?
assert np.sum(dtm) > 0, 'All the documents in the corpus seem to be empty for the given vocabulary'
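Putting the two fixes together, a hedged sketch of the perplexity computation (here `vprob` is passed in directly instead of coming from visible2hidden/hidden2visible, just to exercise the formula; names are illustrative):

```python
import numpy as np

def perplexity(dtm, vprob, eps=1e-12):
    """Per-word perplexity of a document-term matrix dtm under
    reconstruction probabilities vprob (same shape as dtm).

    Clipping replaces the np.nansum workaround: vprob entries equal
    to 0 would otherwise produce -inf / NaN inside the log.
    """
    total_words = np.sum(dtm)
    assert total_words > 0, (
        'All the documents in the corpus seem to be empty '
        'for the given vocabulary'
    )
    vprob = np.clip(vprob, eps, None)
    return np.exp(-np.sum(np.log(vprob) * dtm) / total_words)
```

As a sanity check, a uniform vprob over a vocabulary of V words gives perplexity exactly V, and a zero entry in vprob now yields a large but finite value instead of NaN.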
octis/models/oRSM.py
Outdated
        DTM[i, id] = count
    return DTM

class oRSM_model(object):
I would recommend implementing a shared superclass that holds the shared functions, while each model overrides the model-specific ones. This prevents a lot of duplicated code, e.g. _get_topic_word_matrix, get_topics, softmax, multinomial_sample, set_train_hyper, etc.
Ok, I will add a third file, RS_utils.py, alongside RSM.py and oRSM.py in the models folder.
I have called the file RS_class.py; the shared superclass is implemented there. I hope I've done it the way you mean. Many functions (like set_train_hyper) are quite different between the two classes, despite their common names.
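A minimal sketch of the refactoring pattern being suggested (class and method names are illustrative, not the actual contents of RS_class.py):

```python
import numpy as np

class RSBase:
    """Shared superclass: functions common to RSM and oRSM live here."""

    @staticmethod
    def softmax(x):
        # Numerically stable softmax over the last axis.
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def get_topics(self):
        # Shared logic, built on the model-specific hook below.
        return self._get_topic_word_matrix()

    def _get_topic_word_matrix(self):
        raise NotImplementedError  # each model overrides this

class RSM_model(RSBase):
    def _get_topic_word_matrix(self):
        # Model-specific: derive the matrix from this model's weights.
        w_vh, w_v, w_h = self.W
        return w_vh.T
```

Functions whose bodies genuinely differ between the two models (like set_train_hyper) can simply stay overridden in each subclass; only the truly identical ones move up.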
setup.py
Outdated
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
#'Programming Language :: Python :: 3.7',
Hi Silvia, thanks for the review; it surfaces important problems.
…ect RS and oRS to handle errors of empty documents.
Hi Silvia, I have moved the duplicated functions into RS_class.py and removed them from the other scripts. Let me know if I have done everything the right way, or if I have to correct something else. Thank you for your time. Federico
Adding Replicated Softmax (a restricted Boltzmann machine for topic modeling) and Over-Replicated Softmax (a deep version of the RSM) to the OCTIS topic-model list.
Both can currently be evaluated only on classification, diversity, and coherence metrics, but they also support a fast local computation of the perplexity upper bound (which is likewise supported for LDA in the gensim implementation used by OCTIS).