Where can I find the topics of the Reuters dataset? #12072
Update: As this topic gained some traction in internet discussions, and was even referenced from the official Keras documentation (https://keras.io/api/datasets/reuters/), I collected all code and data from this investigation and put it here: https://github.com/SteffenBauer/KerasTools/tree/master/Reuters_Analysis

In case it might be useful: over the last year I wrote a small library with some tools for Keras for my personal deep learning explorations. I was interested in the exact mappings for all the Keras datasets; you can find the corresponding dataset decoding module from my library here: https://github.com/SteffenBauer/KerasTools/blob/master/KerasTools/datasets/decode.py

I got the Reuters topic mapping by transcribing the Reuters entries back to human-readable form, sorting by topic label frequency, and matching them with the label topics found here: https://martin-thoma.com/nlp-reuters/ The resulting label mappings for the Keras Reuters dataset are collected in the Reuters_Analysis repository linked above.
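For reference, transcribing an entry back to human-readable form can be done with the standard Keras API. This is only a minimal sketch, not the code from the linked repository; it assumes the default `load_data` arguments, where data indices are offset by 3 relative to `get_word_index` because indices 0, 1, and 2 are reserved:

```python
from tensorflow import keras

# Load the Keras Reuters dataset and the word index with default arguments.
(x_train, y_train), (x_test, y_test) = keras.datasets.reuters.load_data()
word_index = keras.datasets.reuters.get_word_index()

# Invert the mapping; indices in the encoded data are offset by 3 because
# 0, 1, 2 are reserved for padding, start-of-sequence, and out-of-vocabulary.
index_to_word = {index + 3: word for word, index in word_index.items()}
index_to_word[0] = "<pad>"
index_to_word[1] = "<start>"
index_to_word[2] = "<unk>"

# Transcribe the first training entry back to human-readable text.
print(" ".join(index_to_word.get(i, "<unk>") for i in x_train[0]))
print("topic label:", y_train[0])
```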
@SteffenBauer Thanks for the reply. I already saw the https://martin-thoma.com blog post, but the number of examples in each topic doesn't match Keras' Reuters dataset; for example, the per-class counts differ between the two.
@hadifar: Yes, I never found any instance of the Reuters dataset with 11228 entries anywhere other than in Keras. When the Keras dataset was produced, there must have been some kind of pre-processing / pruning, so a direct matching between the Keras set and the one at martin-thoma.com is not possible. My topic mapping is therefore only a 'reverse engineering' result, and it needed a lot of manual matching: I used the number of entries per topic as a hint where to look deeper, then directly inspected several re-transcribed entries visually, trying to figure out which category they match best. After some iterations I ended up at the result above, which should match the real categories.

Here is the Jupyter notebook that I used to identify the categories:

But yes, I would also be very interested in more detailed information on how the Keras Reuters set was generated.
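As an illustration of the frequency-based matching described above, here is a rough sketch of how the per-label counts in the Keras set can be tallied; this is not the notebook mentioned above, and the matching against the topic names from martin-thoma.com still has to be done by hand:

```python
from collections import Counter

import numpy as np
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.reuters.load_data()

# Count how many entries each of the 46 integer labels has in the whole set.
counts = Counter(np.concatenate([y_train, y_test]).tolist())

# Print labels sorted by frequency; these counts serve as the hint for matching
# against the topic frequencies listed at https://martin-thoma.com/nlp-reuters/.
for label, count in counts.most_common():
    print(f"label {label:2d}: {count} entries")
```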
I just browsed the commit history of datasets.reuters, and it looks like older versions indeed contained the code that was used to parse the Reuters-21578 dataset into the reuters.pkl file, but it was removed 3 years ago: 71952f2#diff-4e341a06492281a7032f4fe4ecf6a3f7 So it should be possible to investigate further how the Keras Reuters dataset was derived from the official data.
Looks like the old parsing code still works. (The only change to the original code was that I needed to sort the filename list, so that it starts to parse with the first archive file.) So this could really be the real topic mapping, directly derived from the original data.
If you are interested in the code, I created a gist: https://gist.github.com/SteffenBauer/2444afea5ea844119b3985685e6aac29
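Not the gist or the original Keras preprocessing script, but a minimal sketch of the filename-sorting point above, assuming the Reuters-21578 `.sgm` files have been unpacked into a local `reuters21578/` directory (the directory name and the regular expressions here are illustrative):

```python
import glob
import re

# Parse the original Reuters-21578 SGML files in a deterministic order:
# sorting the file list ensures parsing starts with the first archive file,
# so the resulting entry order is reproducible.
documents = []
for filename in sorted(glob.glob("reuters21578/*.sgm")):
    with open(filename, encoding="latin-1") as f:
        raw = f.read()
    for doc in re.findall(r"<REUTERS.*?</REUTERS>", raw, flags=re.S):
        topics_match = re.search(r"<TOPICS>(.*?)</TOPICS>", doc, flags=re.S)
        topics = re.findall(r"<D>(.*?)</D>", topics_match.group(1)) if topics_match else []
        body = re.search(r"<BODY>(.*?)</BODY>", doc, flags=re.S)
        # Keep only entries that have at least one topic and a body text.
        if topics and body:
            documents.append((topics, body.group(1)))

print(len(documents), "documents with topics and a body")
```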
@SteffenBauer Very nice investigation 👍
A last remark: the discrepancy in the per-topic counts is probably explained by martin-thoma using a different percentage for the test set than Keras. Keras splits off 20% for the test set, while martin-thoma uses percentages between ~20% and ~30%.
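For reference, the Keras split ratio is controlled by the `test_split` argument of `load_data` (0.2 by default); a quick sketch to check the resulting proportions:

```python
from tensorflow import keras

# Keras reserves test_split of the entries for the test set (default 0.2).
(x_train, y_train), (x_test, y_test) = keras.datasets.reuters.load_data(test_split=0.2)

total = len(x_train) + len(x_test)
print("total entries:", total)                # 11228, as discussed in this issue
print("test fraction:", len(x_test) / total)  # ~0.2
```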
Imported from GitHub PR #17635

From discussions and references from:
- #12072 (comment)
- https://martin-thoma.com/nlp-reuters/

Add documentation:
- Explain the word indices returned from keras `keras.datasets.reuters.get_word_index`.
- Add a helper function to return `ylabels` for label data.

Copybara import of the project:
-- 3c2ac2d by Kevin Hu <hxy9243@gmail.com>: update documentation to keras reuters dataset
-- d08241c by Kevin Hu <hxy9243@gmail.com>: format the code
-- b1fcf1b by Kevin Hu <hxy9243@gmail.com>: address PR reviews on formatting
-- d85556e by Kevin Hu <hxy9243@gmail.com>: fix lint errors
-- d29df56 by Kevin Hu <hxy9243@gmail.com>: address PR review

Merging this change closes #17635
FUTURE_COPYBARA_INTEGRATE_REVIEW=#17635 from hxy9243:master d29df56
PiperOrigin-RevId: 515713085
In the Keras Reuters dataset there are 11228 instances, while the dataset's webpage lists 21578. Even in the reference paper there are more than 11228 examples after pruning.
Unfortunately, there is no information about the Reuters dataset in the Keras documentation. Is it possible to clarify how this dataset was gathered and what the topic labels are? It is mentioned that there are 46 topics, but what is the category for, e.g., topic number 32?