[Feature Update] Optional raw text support #484
Thanks @Jn-Huang for the inputs! Why do we want to save the optional raw texts into a dictionary? IMO, a more consistent way is to treat every kind of raw text as a normal (node/edge/graph) attribute. For example, if the nodes are papers, then titles are one node attribute, and abstracts are another node attribute. What do you think?
@xingjian-zhang Thank you for your comment!
@xingjian-zhang Hi! I have updated the implementation as we discussed. Could you please take a look when you are available? Thanks!
The metadata is now updated as
where the raw texts related to nodes are saved as node attributes. Dataset contributors can save such raw text by defining extra node attributes:
Users who want to load a dataset with raw text can simply do
And the raw texts are stored in
Output:
Note, here we cannot save raw texts in
Testing
Similar tests were conducted for this version of the implementation.
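To make the node-attribute layout discussed in this comment concrete, here is a minimal sketch of how raw texts could sit next to ordinary node attributes in the metadata. The attribute names (`PaperTitle`, `PaperAbstract`), the file names, the keys, and the helper `split_files` are all hypothetical and chosen only for illustration; the `optional file` key follows the PR description, but this is not the PR's actual `metadata.json`.

```python
# Hypothetical metadata fragment: attribute names, file names and keys are
# assumptions for illustration, not the PR's actual metadata.json.
metadata = {
    "data": {
        "Node": {
            # Ordinary node attribute, always loaded.
            "NodeFeature": {"file": "cora__graph.npz", "key": "node_feats"},
            # Raw texts stored as normal node attributes, marked optional.
            "PaperTitle": {
                "optional file": "cora__raw_text.optional.npz",
                "key": "paper_title",
            },
            "PaperAbstract": {
                "optional file": "cora__raw_text.optional.npz",
                "key": "paper_abstract",
            },
        },
        "Edge": {
            "_Edge": {"file": "cora__graph.npz", "key": "edge"},
        },
    }
}


def split_files(meta):
    """Return (required, optional) file names referenced by the metadata."""
    required, optional = set(), set()
    for level in meta["data"].values():      # Node / Edge / Graph
        for attr in level.values():          # individual attributes
            if "file" in attr:
                required.add(attr["file"])
            elif "optional file" in attr:
                optional.add(attr["optional file"])
    return sorted(required), sorted(optional)


required_files, optional_files = split_files(metadata)
```

With this layout a loader can decide per file, rather than per attribute type, whether a download is needed, which is one way to keep raw texts out of the default download.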
In this PR, we add or update the following:
- `RawText`, in parallel with `Node`, `Edge` and `Graph`.
- `gli/raw_text_utils.py`. This file helps with saving datasets that have raw texts.
- The `Cora` dataset's metadata, to accommodate the raw texts.
- `test_metadata.py`, to accommodate the cases where we use `optional file` instead of `file` in `metadata.json`.
Subsequently, after this PR is merged, we will update other datasets with raw text: `ogbn-arxiv`, `pubmed`, `ogbn-product`, `arxiv-2023`.
Description (Outdated, see the comments below for the newest version)
For dataset contributors with raw text.
The updated `Cora` metadata comes with an extra entry `RawText`:
Generating this extra entry is simple: dataset contributors can save a dictionary of raw texts by passing `raw_text_attrs` to `save_graph`:
More generally, dataset contributors can also define other raw text dictionaries, such as `EdgeRawText`.
For users who want to load a dataset with raw text.
Users can load the dataset with raw text by passing an optional argument `load_raw_text` to `get_gli_dataset`. This argument triggers a download of the `.npz` file for raw text, if it has not been downloaded yet.
The raw text is then returned in the dictionary `data.NodeRawText['RawText_NodeRawText']`:
Output:
Related Issue
See issue.
Motivation and Context
Support the loading of raw text in the GLI framework.
How Has This Been Tested?
Get a dataset, but do not load raw text. Raw text file should not be downloaded.
At the same time the raw text file is not downloaded.
Get a dataset and load raw text. Raw text file should be downloaded.
Load a dataset that has no raw text. The dataset should simply be loaded without raw text.
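The three scenarios above can be sketched with a small stand-in loader. This is not the GLI implementation: `FAKE_REMOTE`, `fetch`, `load_dataset`, `run_scenarios` and the file names are hypothetical, and the "download" is simulated by copying bytes into a temporary cache directory. The only behavior borrowed from the PR is that optional raw-text files are fetched only when `load_raw_text` is set.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for a remote store: file name -> contents.
FAKE_REMOTE = {
    "cora__graph.npz": b"graph-data",
    "cora__raw_text.optional.npz": b"raw-text-data",
}


def fetch(name, cache_dir, log):
    """Copy `name` into cache_dir unless it is already cached."""
    target = Path(cache_dir) / name
    if not target.exists():
        target.write_bytes(FAKE_REMOTE[name])
        log.append(name)  # record that a (simulated) download happened
    return target


def load_dataset(cache_dir, required, optional=(), load_raw_text=False):
    """Always fetch required files; fetch optional raw-text files on request.

    Returns the list of files downloaded by this call.
    """
    log = []
    for name in required:
        fetch(name, cache_dir, log)
    if load_raw_text:
        for name in optional:
            fetch(name, cache_dir, log)
    return log


def run_scenarios():
    required = ["cora__graph.npz"]
    optional = ["cora__raw_text.optional.npz"]
    results = {}
    with tempfile.TemporaryDirectory() as cache:
        # 1. Do not load raw text: the raw text file must not be downloaded.
        results["no_raw_text"] = load_dataset(cache, required, optional)
        results["raw_text_cached"] = (Path(cache) / optional[0]).exists()
        # 2. Load raw text: only the missing raw text file is downloaded.
        results["with_raw_text"] = load_dataset(
            cache, required, optional, load_raw_text=True
        )
    with tempfile.TemporaryDirectory() as cache:
        # 3. Dataset with no raw text: loading proceeds without it.
        results["dataset_without_raw_text"] = load_dataset(
            cache, required, load_raw_text=True
        )
    return results
```

Returning the list of downloaded files from each call keeps the "should not be downloaded" checks explicit, instead of inferring them from the cache directory alone.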