[Feature Update] Optional raw text support #484

Jn-Huang · 2024-03-04T05:57:24Z

In this PR, we

Support the saving and loading of raw text in GLI datasets. In metadata, raw text will occupy an entry RawText in parallel with Node, Edge and Graph.
Add helper functions in gli/raw_text_utils.py to process raw text from its source. This file helps with saving datasets that have raw text.
Update the notebook that generates the Cora dataset's metadata to include the raw text.
Update test_metadata.py to handle the case where metadata.json uses optional file instead of file.

After this PR is merged, we will update the other datasets that have raw text: ogbn-arxiv, pubmed, ogbn-product, arxiv-2023.

Description (Outdated, See below comments for the newest version)

For dataset contributors with raw text.

Updated Cora metadata comes with an extra entry RawText:

{
    "description": "CORA dataset.",
    "data": {
        "Node": {
            "NodeFeature": {
                "description": "Node features of Cora dataset, 1/0-valued vectors.",
                "type": "int",
                "format": "SparseTensor",
                "file": "cora__graph__Node_NodeFeature__7032c9c380d1889061dcbbcd76b8c427.sparse.npz"
            },
        },
        # Edge, Graph entries..
        <------------ New ------------ >
        "RawText": {
            "NodeRawText": {
                "description": "Raw text of title, abstract and label of each node in Cora dataset, dict of list of strings.",
                "type": "Dict",
                "format": "Dict[str, list[str]]",
                "optional file": "cora__graph__835178b65ba8cfdfb9c91f33c6260506.optional.npz",
                "key": "RawText_NodeRawText"
            }
        }
        <------------ New ------------ >
    }
}
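To make the file / optional file distinction concrete, here is a minimal sketch that walks a metadata dictionary shaped like the one above and separates the two kinds of entries. The helper name split_files is hypothetical and is not part of GLI; the sample dictionary is a trimmed copy of the Cora metadata.

```python
from typing import List, Tuple


def split_files(metadata: dict) -> Tuple[List[str], List[str]]:
    """Return (required, optional) file names found under metadata["data"]."""
    required, optional = [], []
    for group in metadata.get("data", {}).values():  # e.g. "Node", "RawText"
        if not isinstance(group, dict):
            continue
        for attr in group.values():  # e.g. "NodeFeature", "NodeRawText"
            if not isinstance(attr, dict):
                continue
            if "file" in attr:
                required.append(attr["file"])
            elif "optional file" in attr:
                optional.append(attr["optional file"])
    return required, optional


# Trimmed copy of the Cora metadata shown above.
metadata = {
    "description": "CORA dataset.",
    "data": {
        "Node": {
            "NodeFeature": {
                "file": "cora__graph__Node_NodeFeature__7032c9c380d1889061dcbbcd76b8c427.sparse.npz"
            }
        },
        "RawText": {
            "NodeRawText": {
                "optional file": "cora__graph__835178b65ba8cfdfb9c91f33c6260506.optional.npz",
                "key": "RawText_NodeRawText",
            }
        },
    },
}

required, optional = split_files(metadata)
print(required)
print(optional)
```

A loader could fetch everything in required unconditionally and fetch optional only when raw text is requested; whether GLI structures its download step this way is an assumption.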

Generating this extra entry is simple: dataset contributors save a dictionary of raw texts by passing raw_text_attrs to save_graph:

<------------ New ------------ >
raw_text_attrs = [
    Attribute(
        "NodeRawText",
        raw_text_dict,
        "Raw text of title, abstract and label of each node in Cora dataset, dict of list of strings.",
        "Dict",
        'Dict[str, list[str]]'
    )
]
<------------ New ------------ >

metadata = save_graph(
    name="Cora",
    edge=edge,
    num_nodes=graph.num_nodes(),
    node_attrs=node_attrs,
    raw_text_attrs=raw_text_attrs, # <--- New
    description="CORA dataset."
)

More generally, dataset contributors can also define other raw text dictionaries, such as EdgeRawText.

For users who want to load a dataset with raw text.

Users can load the dataset with raw text by passing the optional argument load_raw_text to get_gli_dataset. This argument downloads the .npz file for raw text if it has not been downloaded yet.

dataset = get_gli_dataset("cora",
                          "NodeClassification",
                          load_raw_text=True, # <--- New
                          verbose=True)

The raw text is returned in the dictionary data.NodeRawText['RawText_NodeRawText']:

data = dataset[0]
for key, item in data.NodeRawText['RawText_NodeRawText'].items():
    print(key, item[:1])

Output:

title ['Title: The megaprior heuristic for discovering protein sequence patterns  ']
abs ['Abstract: Several computer algorithms for discovering patterns in groups of protein sequences are ...']
label ['Neural Networks']

Related Issue

See issue.

Motivation and Context

Support the loading of raw text in the GLI framework.

How Has This Been Tested?

Get a dataset, but do not load raw text. Raw text file should not be downloaded.

In [3]: dataset = get_gli_dataset("cora", "NodeClassification", verbose=True)
Saving to: ‘/Users/jinhuang/Documents/research/gli/datasets/Cora/cora__task_node_classification_1__41e167258678b585872679839ce9c40f.npz’

Saving to: ‘/Users/jinhuang/Documents/research/gli/datasets/cora/cora__graph__Node_NodeFeature__7032c9c380d1889061dcbbcd76b8c427.sparse.npz’

Saving to: ‘/Users/jinhuang/Documents/research/gli/datasets/Cora/cora__graph__6c912909fa18eff10797210ea5e485fe.npz’

Saving to: ‘/Users/jinhuang/Documents/research/gli/datasets/Cora/cora__graph__Graph_NodeList__23bbef862fd6037395412eb03b4e1d9c.sparse.npz’

CORA dataset.
All data files already exist. Skip downloading.
Node classification on CORA dataset. Planetoid split.

At the same time the raw text file is not downloaded.
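The check above can also be scripted. The following is a minimal sketch, assuming optional files can be recognized by the .optional.npz suffix seen in the logs; the helper downloaded_optional_files is hypothetical, and a temporary directory stands in for the real dataset folder.

```python
import tempfile
from pathlib import Path


def downloaded_optional_files(dataset_dir):
    """List the *.optional.npz files present in dataset_dir, sorted by name."""
    return sorted(p.name for p in Path(dataset_dir).glob("*.optional.npz"))


# Demo on a throwaway directory standing in for the dataset folder.
with tempfile.TemporaryDirectory() as tmp:
    # State after loading without raw text: only a regular data file exists.
    Path(tmp, "cora__graph__6c912909fa18eff10797210ea5e485fe.npz").touch()
    before = downloaded_optional_files(tmp)

    # State after loading with raw text: the optional file is present too.
    Path(tmp, "cora__graph__835178b65ba8cfdfb9c91f33c6260506.optional.npz").touch()
    after = downloaded_optional_files(tmp)

print(before)
print(after)
```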

Get a dataset and load raw text. Raw text file should be downloaded.

In [4]: dataset = get_gli_dataset("cora", "NodeClassification", load_raw_text=True, verbose=True)

Saving to: ‘/Users/jinhuang/Documents/research/gli/datasets/cora/cora__graph__835178b65ba8cfdfb9c91f33c6260506.optional.npz’

/Users/jinhuang/Documents/research/gli/datasets/cora/cora__task_node_classification_1__41e167258678b585872679839ce9c40f.npz already exists. Skip downloading.
/Users/jinhuang/Documents/research/gli/datasets/cora/cora__graph__Node_NodeFeature__7032c9c380d1889061dcbbcd76b8c427.sparse.npz already exists. Skip downloading.
/Users/jinhuang/Documents/research/gli/datasets/cora/cora__graph__6c912909fa18eff10797210ea5e485fe.npz already exists. Skip downloading.
/Users/jinhuang/Documents/research/gli/datasets/cora/cora__graph__Graph_NodeList__23bbef862fd6037395412eb03b4e1d9c.sparse.npz already exists. Skip downloading.
CORA dataset.
All data files already exist. Skip downloading.
Node classification on CORA dataset. Planetoid split.

In [6]: data = dataset[0]

In [7]: data.NodeRawText['RawText_NodeRawText'].keys()
Out[7]: dict_keys(['title', 'abs', 'label'])

Load a dataset that has no raw text with load_raw_text=True. The dataset is simply loaded without raw text.

In [3]: dataset = get_gli_dataset("pubmed", "NodeClassification", load_raw_text=True)

xingjian-zhang · 2024-03-05T14:30:15Z

Thanks @Jn-Huang for the inputs!

Why do we want to save the optional raw texts into a dictionary? Imo, a more consistent way is to treat every kind of raw text as a normal (node/edge/graph) attribute. For example, if the nodes are papers, then titles are one node attribute, and abstracts are another node attribute.

What do you think?

Jn-Huang · 2024-03-05T15:47:43Z

@xingjian-zhang Thank you for your comment!
Yes, I think this is a good point; saving the node raw texts as extra node attributes will be better. I will update this PR.

Jn-Huang · 2024-03-07T03:04:29Z

@xingjian-zhang Hi! I have updated the implementation as we discussed. Could you please take a look when you are available? Thanks!

The metadata is now updated as

{
    "description": "CORA dataset.",
    "data": {
        "Node": {
            "NodeFeature": {
                "description": "Node features of Cora dataset, 1/0-valued vectors.",
                "type": "int",
                "format": "SparseTensor",
                "file": "cora__graph__Node_NodeFeature__7032c9c380d1889061dcbbcd76b8c427.sparse.npz"
            },
            <------------ New ------------ >
            "NodeRawTextTitle": {
                "description": "Raw text of title of each node in Cora dataset, list of strings.",
                "type": "str",
                "format": "List[str]",
                "optional file": "cora__graph__Node_NodeRawTextTitle__4a9ad6575f5acfe3b828fe66f072bd5c.optional.npz",
                "key": "Node_NodeRawTextTitle"
            },
            "NodeRawTextAbstract": {
                "description": "Raw text of abstract of each node in Cora dataset, list of strings.",
                "type": "str",
                "format": "List[str]",
                "optional file": "cora__graph__Node_NodeRawTextAbstract__d0e5436087314624c74a9f040d6f394f.optional.npz",
                "key": "Node_NodeRawTextAbstract"
            },
            "NodeRawTextLabel": {
                "description": "Raw text of label of each node in Cora dataset, list of strings.",
                "type": "str",
                "format": "List[str]",
                "optional file": "cora__graph__Node_NodeRawTextLabel__06d184316789acc0902db2b8c1472f95.optional.npz",
                "key": "Node_NodeRawTextLabel"
            }
            <------------ New ------------ >
        },
        # Other Attributes
}

where the raw texts related to nodes are saved as node attributes.
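Each raw text attribute above names an .npz file and a key inside it. As a rough illustration of that layout, the sketch below round-trips a list of strings through an .npz file with NumPy. It is an assumption that GLI writes its optional files in a comparable way; the helpers save_text_attr and load_text_attr, the file name, and the sample titles are all made up for this example.

```python
import os
import tempfile

import numpy as np


def save_text_attr(path, key, texts):
    """Store a list of strings under `key` in an .npz file."""
    np.savez(path, **{key: np.array(texts, dtype=str)})


def load_text_attr(path, key):
    """Read back the list of strings stored under `key`."""
    with np.load(path) as data:
        return data[key].tolist()


titles = ["Title: first paper", "Title: second paper"]  # made-up sample data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.optional.npz")  # hypothetical file name
    save_text_attr(path, "Node_NodeRawTextTitle", titles)
    loaded = load_text_attr(path, "Node_NodeRawTextTitle")

print(loaded == titles)
```

Storing the strings as a NumPy unicode array avoids pickled object arrays, so the file can be read with the default allow_pickle=False.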

Dataset contributors can save such raw text by defining extra node attributes:

node_attrs = [
    Attribute(
        "NodeFeature",
        node_feats,
        "Node features of Cora dataset, 1/0-valued vectors.",
        "int",
        "SparseTensor",
    ),
    <------------ New ------------ >
    Attribute(
        "NodeRawTextTitle",
        raw_text_dict["title"],
        "Raw text of title of each node in Cora dataset, list of strings.",
        "str",
        "List[str]"
    ),
    Attribute(
        "NodeRawTextAbstract",
        raw_text_dict["abs"],
        "Raw text of abstract of each node in Cora dataset, list of strings.",
        "str",
        "List[str]"
    ),
    Attribute(
        "NodeRawTextLabel",
        raw_text_dict["label"],
        "Raw text of label of each node in Cora dataset, list of strings.",
        "str",
        "List[str]"
    )
    <------------ New ------------ >
]
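Since each raw text attribute is a per-node list, it can help to verify the lists before wrapping them in Attribute objects. This is a minimal sketch of such a check; check_raw_text_columns is a hypothetical helper, not a GLI function, and the two-node dictionary is made-up data that reuses the title, abs and label keys from the Cora example.

```python
def check_raw_text_columns(raw_text_dict, num_nodes):
    """Raise ValueError unless every column is a list of num_nodes strings."""
    for name, texts in raw_text_dict.items():
        if len(texts) != num_nodes:
            raise ValueError(
                f"{name!r} has {len(texts)} entries, expected {num_nodes}"
            )
        if not all(isinstance(t, str) for t in texts):
            raise ValueError(f"{name!r} contains non-string entries")


# Made-up two-node example using the same keys as the Cora raw text dict.
raw_text_dict = {
    "title": ["Title: A", "Title: B"],
    "abs": ["Abstract: A", "Abstract: B"],
    "label": ["Neural Networks", "Theory"],
}

check_raw_text_columns(raw_text_dict, num_nodes=2)  # passes silently
```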

Users who want to load a dataset with raw text can simply do

dataset = get_gli_dataset("cora",
                          "NodeClassification",
                          load_raw_text=True, # <--- New
                          verbose=True)
data = dataset[0]

And the raw texts are stored in

data.NodeRawTextTitle[0], data.NodeRawTextAbstract[0], data.NodeRawTextLabel[0]

Output:

('Title: The megaprior heuristic for discovering protein sequence patterns  ',
 'Abstract: Several computer algorithms for discovering patterns in groups of protein sequences are in use that are based on fitting the parameters of a statistical model to a group of related sequences. These include hidden Markov model (HMM) algorithms for multiple sequence alignment, and the MEME and Gibbs sampler algorithms for discovering motifs. These algorithms are sometimes prone to producing models that are incorrect because two or more patterns have been combined. The statistical model produced in this situation is a convex combination (weighted average) of two or more different models. This paper presents a solution to the problem of convex combinations in the form of a heuristic based on using extremely low variance Dirichlet mixture priors as part of the statistical model. This heuristic, which we call the megaprior heuristic, increases the strength (i.e., decreases the variance) of the prior in proportion to the size of the sequence dataset. This causes each column in the final model to strongly resemble the mean of a single component of the prior, regardless of the size of the dataset. We describe the cause of the convex combination problem, analyze it mathematically, motivate and describe the implementation of the megaprior heuristic, and show how it can effectively eliminate the problem of convex combinations in protein sequence pattern discovery. ',
 'Neural Networks')

Note that we cannot store raw text in data.ndata, because DGL requires every entry of ndata to be a tensor, and encoding lists of strings as tensors is not good practice.

Testing

Similar tests were conducted for this version of the implementation.

Jn-Huang added 7 commits March 4, 2024 00:48

changes for optional raw text support

a9e0b9e

fix style issue

9bb6c07

fix panda version

9822f32

add datasets package

211560c

remove raw_text_utils in gli.init

7336d6b

remove redundent prints

8f11f22

fix error handling of raw test

8dda830

Jn-Huang marked this pull request as ready for review March 5, 2024 00:20

Jn-Huang changed the title ~~(On-going draft PR) [Feature Update] Optional raw text support~~ [Feature Update] Optional raw text support Mar 5, 2024

xingjian-zhang self-requested a review March 5, 2024 14:21

Jn-Huang mentioned this pull request Mar 5, 2024

[FEATURE REQUEST] Raw Text Support for GLI; New Dataset Arxiv-2023 #480

Open

Jn-Huang added 7 commits March 6, 2024 20:34

save raw text as node attributes

5dfa9f9

add pandas into env

b4c5ab2

fix doc test

bd91a99

fix yaml for doc test

611cc57

fix doc test env

05fedf9

update doc env

a218d9a

fix test env

dfa1895

finialize notebook

7447d64
