
[Feature Update] Optional raw text support #484

Open
wants to merge 15 commits into
base: main

Conversation

Jn-Huang
Collaborator

@Jn-Huang Jn-Huang commented Mar 4, 2024

In this PR, we

  1. Support saving and loading raw text in GLI datasets. In the metadata, raw text occupies an entry RawText, parallel to Node, Edge and Graph.
  2. Add helper functions in gli/raw_text_utils.py to process raw text from the source. This file helps with saving datasets that contain raw text.
  3. Update the notebook that generates the Cora dataset's metadata to accommodate the raw text.
  4. Update test_metadata.py to accommodate the cases where metadata.json uses optional file instead of file.

After this PR is merged, we will update the other datasets that have raw text: ogbn-arxiv, pubmed, ogbn-product, arxiv-2023.

Description (Outdated, See below comments for the newest version)

For dataset contributors with raw text.

Updated Cora metadata comes with an extra entry RawText:

{
    "description": "CORA dataset.",
    "data": {
        "Node": {
            "NodeFeature": {
                "description": "Node features of Cora dataset, 1/0-valued vectors.",
                "type": "int",
                "format": "SparseTensor",
                "file": "cora__graph__Node_NodeFeature__7032c9c380d1889061dcbbcd76b8c427.sparse.npz"
            },
        },
        # Edge, Graph entries..
        <------------ New ------------ >
        "RawText": {
            "NodeRawText": {
                "description": "Raw text of title, abstract and label of each node in Cora dataset, dict of list of strings.",
                "type": "Dict",
                "format": "Dict[str, list[str]]",
                "optional file": "cora__graph__835178b65ba8cfdfb9c91f33c6260506.optional.npz",
                "key": "RawText_NodeRawText"
            }
        }
        <------------ New ------------ >
    }
}

Generating this extra entry is simple: dataset contributors save a dictionary of raw texts by passing raw_text_attrs to save_graph:

<------------ New ------------ >
raw_text_attrs = [
    Attribute(
        "NodeRawText",
        raw_text_dict,
        "Raw text of title, abstract and label of each node in Cora dataset, dict of list of strings.",
        "Dict",
        'Dict[str, list[str]]'
    )
]
<------------ New ------------ >

metadata = save_graph(
    name="Cora",
    edge=edge,
    num_nodes=graph.num_nodes(),
    node_attrs=node_attrs,
    raw_text_attrs=raw_text_attrs, # <--- New
    description="CORA dataset."
)

The same mechanism is general: dataset contributors can also define other raw text dictionaries, such as EdgeRawText.

For users who want to load a dataset with raw text.

Users can load the dataset with raw text by passing the optional argument load_raw_text to get_gli_dataset. This argument downloads the .npz file for raw text if it has not been downloaded yet.

dataset = get_gli_dataset("cora",
                          "NodeClassification",
                          load_raw_text=True, # <--- New
                          verbose=True)

The raw text is returned in the dictionary data.NodeRawText['RawText_NodeRawText']:

data = dataset[0]
for key, item in data.NodeRawText['RawText_NodeRawText'].items():
    print(key, item[:1])

Output:

title ['Title: The megaprior heuristic for discovering protein sequence patterns  ']
abs ['Abstract: Several computer algorithms for discovering patterns in groups of protein sequences are ...']
label ['Neural Networks']

Related Issue

See issue.

Motivation and Context

Support the loading of raw text in GLI framework.

How Has This Been Tested?

Get a dataset, but do not load raw text. Raw text file should not be downloaded.

In [3]: dataset = get_gli_dataset("cora", "NodeClassification", verbose=True)
Saving to: ‘/Users/jinhuang/Documents/research/gli/datasets/Cora/cora__task_node_classification_1__41e167258678b585872679839ce9c40f.npz’

Saving to: ‘/Users/jinhuang/Documents/research/gli/datasets/cora/cora__graph__Node_NodeFeature__7032c9c380d1889061dcbbcd76b8c427.sparse.npz’

Saving to: ‘/Users/jinhuang/Documents/research/gli/datasets/Cora/cora__graph__6c912909fa18eff10797210ea5e485fe.npz’

Saving to: ‘/Users/jinhuang/Documents/research/gli/datasets/Cora/cora__graph__Graph_NodeList__23bbef862fd6037395412eb03b4e1d9c.sparse.npz’

CORA dataset.
All data files already exist. Skip downloading.
Node classification on CORA dataset. Planetoid split.

At the same time the raw text file is not downloaded.

Get a dataset and load raw text. Raw text file should be downloaded.

In [4]: dataset = get_gli_dataset("cora", "NodeClassification", load_raw_text=True, verbose=True)

Saving to: ‘/Users/jinhuang/Documents/research/gli/datasets/cora/cora__graph__835178b65ba8cfdfb9c91f33c6260506.optional.npz’

/Users/jinhuang/Documents/research/gli/datasets/cora/cora__task_node_classification_1__41e167258678b585872679839ce9c40f.npz already exists. Skip downloading.
/Users/jinhuang/Documents/research/gli/datasets/cora/cora__graph__Node_NodeFeature__7032c9c380d1889061dcbbcd76b8c427.sparse.npz already exists. Skip downloading.
/Users/jinhuang/Documents/research/gli/datasets/cora/cora__graph__6c912909fa18eff10797210ea5e485fe.npz already exists. Skip downloading.
/Users/jinhuang/Documents/research/gli/datasets/cora/cora__graph__Graph_NodeList__23bbef862fd6037395412eb03b4e1d9c.sparse.npz already exists. Skip downloading.
CORA dataset.
All data files already exist. Skip downloading.
Node classification on CORA dataset. Planetoid split.

In [6]: data = dataset[0]

In [7]: data.NodeRawText['RawText_NodeRawText'].keys()
Out[7]: dict_keys(['title', 'abs', 'label'])

Load a dataset that has no raw text with load_raw_text=True. The dataset should simply be loaded without raw text.

In [3]: dataset = get_gli_dataset("pubmed", "NodeClassification", load_raw_text=True)
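
The three cases above follow one rule: files listed under file are always fetched, while files listed under optional file are fetched only when load_raw_text=True. The rule can be sketched as below, assuming the metadata layout shown in this PR. The helper files_to_fetch and the toy file names are hypothetical and only illustrate the behavior; this is not the code GLI uses internally.

```python
def files_to_fetch(metadata, load_raw_text=False):
    """Hypothetical helper: list the data files a loader would need.

    Assumes each attribute entry stores its file name under either
    "file" (always needed) or "optional file" (needed only on request).
    """
    names = []

    def walk(node):
        if not isinstance(node, dict):
            return
        if "file" in node:
            names.append(node["file"])
        if load_raw_text and "optional file" in node:
            names.append(node["optional file"])
        for value in node.values():
            walk(value)

    walk(metadata.get("data", {}))
    # Several attributes may share one file; keep the first occurrence only.
    return list(dict.fromkeys(names))


# Toy metadata with made-up file names, mirroring the structure in this PR.
with_text = {"data": {
    "Node": {"NodeFeature": {"file": "features.sparse.npz"}},
    "RawText": {"NodeRawText": {"optional file": "raw_text.optional.npz",
                                "key": "RawText_NodeRawText"}},
}}
without_text = {"data": {
    "Node": {"NodeFeature": {"file": "features.sparse.npz"}},
}}

print(files_to_fetch(with_text))                         # raw text skipped
print(files_to_fetch(with_text, load_raw_text=True))     # raw text included
print(files_to_fetch(without_text, load_raw_text=True))  # nothing extra to fetch
```

With this rule, a dataset that has no optional entries behaves the same whether or not load_raw_text is set, which matches the third test case.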

@Jn-Huang Jn-Huang marked this pull request as ready for review March 5, 2024 00:20
@Jn-Huang Jn-Huang changed the title (On-going draft PR) [Feature Update] Optional raw text support [Feature Update] Optional raw text support Mar 5, 2024
@xingjian-zhang xingjian-zhang self-requested a review March 5, 2024 14:21
@xingjian-zhang
Collaborator

Thanks @Jn-Huang for the inputs!

Why do we want to save the optional raw texts into a dictionary? In my opinion, a more consistent way is to treat every kind of raw text as a normal (node/edge/graph) attribute. For example, if the nodes are papers, then titles are one node attribute and abstracts are another node attribute.

What do you think?

@Jn-Huang
Collaborator Author

Jn-Huang commented Mar 5, 2024

@xingjian-zhang Thank you for your comment!
Yes, I think this is a good point. Saving the node raw texts as extra node attributes will be better. I will update this PR.

@Jn-Huang
Collaborator Author

Jn-Huang commented Mar 7, 2024

@xingjian-zhang Hi! I have updated the implementation as we discussed. Could you please take a look when you are available? Thanks!

The metadata is now updated as

{
    "description": "CORA dataset.",
    "data": {
        "Node": {
            "NodeFeature": {
                "description": "Node features of Cora dataset, 1/0-valued vectors.",
                "type": "int",
                "format": "SparseTensor",
                "file": "cora__graph__Node_NodeFeature__7032c9c380d1889061dcbbcd76b8c427.sparse.npz"
            },
            <------------ New ------------ >
            "NodeRawTextTitle": {
                "description": "Raw text of title of each node in Cora dataset, list of strings.",
                "type": "str",
                "format": "List[str]",
                "optional file": "cora__graph__Node_NodeRawTextTitle__4a9ad6575f5acfe3b828fe66f072bd5c.optional.npz",
                "key": "Node_NodeRawTextTitle"
            },
            "NodeRawTextAbstract": {
                "description": "Raw text of abstract of each node in Cora dataset, list of strings.",
                "type": "str",
                "format": "List[str]",
                "optional file": "cora__graph__Node_NodeRawTextAbstract__d0e5436087314624c74a9f040d6f394f.optional.npz",
                "key": "Node_NodeRawTextAbstract"
            },
            "NodeRawTextLabel": {
                "description": "Raw text of label of each node in Cora dataset, list of strings.",
                "type": "str",
                "format": "List[str]",
                "optional file": "cora__graph__Node_NodeRawTextLabel__06d184316789acc0902db2b8c1472f95.optional.npz",
                "key": "Node_NodeRawTextLabel"
            }
            <------------ New ------------ >
        },
        # Other Attributes
    }
}

where the raw texts related to nodes are saved as node attributes.

Dataset contributors can save such raw text by defining extra node attributes:

node_attrs = [
    Attribute(
        "NodeFeature",
        node_feats,
        "Node features of Cora dataset, 1/0-valued vectors.",
        "int",
        "SparseTensor",
    ),
    <------------ New ------------ >
    Attribute(
        "NodeRawTextTitle",
        raw_text_dict["title"],
        "Raw text of title of each node in Cora dataset, list of strings.",
        "str",
        "List[str]"
    ),
    Attribute(
        "NodeRawTextAbstract",
        raw_text_dict["abs"],
        "Raw text of abstract of each node in Cora dataset, list of strings.",
        "str",
        "List[str]"
    ),
    Attribute(
        "NodeRawTextLabel",
        raw_text_dict["label"],
        "Raw text of label of each node in Cora dataset, list of strings.",
        "str",
        "List[str]"
    )
    <------------ New ------------ >
]
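
Since each raw text attribute is a list with one string per node, it can help to check the lists before building the attributes. The following is a minimal sketch of such a check; check_raw_text is a hypothetical helper that is not part of GLI, and the toy raw_text_dict stands in for the real Cora data.

```python
def check_raw_text(raw_text_dict, num_nodes):
    """Hypothetical check: every field holds exactly one string per node."""
    for field, texts in raw_text_dict.items():
        if len(texts) != num_nodes:
            raise ValueError(
                f"{field!r} has {len(texts)} entries, expected {num_nodes}")
        if not all(isinstance(text, str) for text in texts):
            raise TypeError(f"{field!r} must contain only strings")
    return True


# Toy data for a graph with two nodes (made-up values).
toy_raw_text_dict = {
    "title": ["Title: Paper A", "Title: Paper B"],
    "abs": ["Abstract: About A.", "Abstract: About B."],
    "label": ["Neural Networks", "Theory"],
}

check_raw_text(toy_raw_text_dict, num_nodes=2)
```

A check of this kind fails early if a field was truncated or misaligned while processing the source text, before any file is written.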

Users who want to load a dataset with raw text can simply do

dataset = get_gli_dataset("cora",
                          "NodeClassification",
                          load_raw_text=True, # <--- New
                          verbose=True)
data = dataset[0]

And the raw texts are stored in

data.NodeRawTextTitle[0], data.NodeRawTextAbstract[0], data.NodeRawTextLabel[0]

Output:

('Title: The megaprior heuristic for discovering protein sequence patterns  ',
 'Abstract: Several computer algorithms for discovering patterns in groups of protein sequences are in use that are based on fitting the parameters of a statistical model to a group of related sequences. These include hidden Markov model (HMM) algorithms for multiple sequence alignment, and the MEME and Gibbs sampler algorithms for discovering motifs. These algorithms are sometimes prone to producing models that are incorrect because two or more patterns have been combined. The statistical model produced in this situation is a convex combination (weighted average) of two or more different models. This paper presents a solution to the problem of convex combinations in the form of a heuristic based on using extremely low variance Dirichlet mixture priors as part of the statistical model. This heuristic, which we call the megaprior heuristic, increases the strength (i.e., decreases the variance) of the prior in proportion to the size of the sequence dataset. This causes each column in the final model to strongly resemble the mean of a single component of the prior, regardless of the size of the dataset. We describe the cause of the convex combination problem, analyze it mathematically, motivate and describe the implementation of the megaprior heuristic, and show how it can effectively eliminate the problem of convex combinations in protein sequence pattern discovery. ',
 'Neural Networks')
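
As an illustration of how these per-node lists might be consumed, the sketch below builds one text record per selected node. It assumes the three lists are aligned by node index, as the indexing with [0] above suggests. The toy lists stand in for data.NodeRawTextTitle, data.NodeRawTextAbstract and data.NodeRawTextLabel, and node_documents is a hypothetical helper.

```python
# Toy stand-ins for the three raw text attributes (made-up values).
titles = ["Title: Paper A  ", "Title: Paper B  ", "Title: Paper C  "]
abstracts = ["Abstract: About A. ", "Abstract: About B. ", "Abstract: About C. "]
labels = ["Neural Networks", "Theory", "Neural Networks"]


def node_documents(node_ids, titles, abstracts, labels):
    """Hypothetical helper: join title and abstract for each requested node."""
    docs = []
    for i in node_ids:
        docs.append({
            "node": i,
            # Strip the trailing whitespace seen in the raw strings.
            "text": f"{titles[i].strip()}\n{abstracts[i].strip()}",
            "label": labels[i],
        })
    return docs


docs = node_documents([0, 2], titles, abstracts, labels)
for doc in docs:
    print(doc["node"], doc["label"], repr(doc["text"]))
```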

Note that we cannot save raw texts in data.ndata, because dgl enforces that each element in ndata is a tensor, and it is not good practice to save lists of strings as tensors.

Testing

Similar tests were conducted for this version of the implementation.

2 participants