Metadata Wrappers and Repositories
All metadata wrappers are stored inside XML documents called metadata repositories. Each repository can hold any number of wrappers, and each wrapper is defined inside of one repository. Wrappers can inherit from wrappers in other repositories.
We have two conventions for organizing wrappers into repositories. For highly complicated websites like IMDB, we have one repository that holds all wrappers required to store IMDB information. For less complicated websites that share a common super-type, we often group them together. For example, we place all products and services in one repository. Use your best judgement on how to organize your wrappers.
A repository is stored in an XML document. Typically, we store the source file in ./BigSemanticsWrapperRepository/BigSemanticsWrappers/MmdRepository/mmdrepository/repositorySources.
Each repository must have a unique name and package attribute.
Example (from the document named "example.xml" stored in ./BigSemanticsWrapperRepository/BigSemanticsWrappers/MmdRepository/mmdrepository/repositorySources):
<meta_metadata_repository name="relevant_name"
package="ecologylab.bigsemantics.generated.library.example" default_user_agent_name="chrome_2">
</meta_metadata_repository>
In addition to name and package, a repository may also have a default_user_agent_name, which may affect which page is obtained for a given URL (e.g., mobile vs. desktop).
Each metadata wrapper corresponds directly to a meta_metadata field in a repository. Each meta_metadata field may contain a mixture of data fields, selectors, and operations.
A meta_metadata field should have the following attributes:
- name - A unique name
- type or extends - Use type when the wrapper inherits all fields from a parent wrapper and adds no new fields. Use extends when the wrapper inherits all fields from a parent wrapper and also adds new fields.
- parser - For wrappers operating on HTML documents, the parser should be set to "xpath". For wrappers operating on XML documents, the parser should be set to "direct".
<meta_metadata name="league_champion" extends="compound_document" parser="xpath">
  <selector url_path_tree="http://gameinfo.na.leagueoflegends.com/en/game-info/champions/" />
  <example_url url="http://gameinfo.na.leagueoflegends.com/en/game-info/champions/diana/" />
  <example_url url="http://gameinfo.na.leagueoflegends.com/en/game-info/champions/caitlyn/" />
  <scalar name="title" label="champion_name">
    <xpath>//div[@id='champ_header']/h1</xpath>
  </scalar>
  <scalar name="epithet" scalar_type="String">
    <xpath>//div[@id='champ_header']/h3</xpath>
  </scalar>
  <scalar name="description" label="lore">
    <xpath>//div[@id='champion-lore']//div[@class='gs-container']/div[1]</xpath>
  </scalar>
</meta_metadata>
With dozens of wrappers to choose from, each wrapper needs some way of being uniquely selected for a given page, unless it is only used as the type of a composite or collection field. See selectors for details.
For most wrappers, you should include at least two example_url elements that point to representative webpages the wrapper will be used on.
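To tie the repository and wrapper pieces together, here is a minimal sketch of a repository that groups two site wrappers under a shared super-type, following the grouping convention described earlier. Every wrapper name, package name, URL, and XPath in it is hypothetical. It also assumes that a wrapper used only as a parent needs no selector, which this page does not state explicitly.

```xml
<!-- Hypothetical repository: all names, URLs, and XPaths are illustrative only. -->
<meta_metadata_repository name="example_shops"
    package="ecologylab.bigsemantics.generated.library.example_shops">

  <!-- Shared super-type. Assumption: no selector is needed because it is only inherited from. -->
  <meta_metadata name="example_product" extends="compound_document" parser="xpath">
    <scalar name="price" scalar_type="String"/>
  </meta_metadata>

  <!-- First site wrapper, selected by its URL prefix, with two example pages. -->
  <meta_metadata name="shop_a_product" extends="example_product" parser="xpath">
    <selector url_path_tree="http://shop-a.example.com/products/" />
    <example_url url="http://shop-a.example.com/products/1" />
    <example_url url="http://shop-a.example.com/products/2" />
    <scalar name="price">
      <xpath>//span[@class='price']</xpath>
    </scalar>
  </meta_metadata>

  <!-- Second site wrapper in the same repository, with a different URL prefix. -->
  <meta_metadata name="shop_b_product" extends="example_product" parser="xpath">
    <selector url_path_tree="http://shop-b.example.com/items/" />
    <example_url url="http://shop-b.example.com/items/1" />
    <example_url url="http://shop-b.example.com/items/2" />
    <scalar name="price">
      <xpath>//div[@id='item-price']</xpath>
    </scalar>
  </meta_metadata>

</meta_metadata_repository>
```

Because the two url_path_tree values differ, each site wrapper can be selected unambiguously for pages under its own prefix.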
What do we mean when we say a wrapper inherits from another wrapper?
Practically, we mean that a wrapper may implement all fields provided by its parent. For example, if the parent wrapper includes
<scalar name="review" scalar_type="String"/>
then the child can use
<scalar name="review">
<xpath>//div[@id='review']/text()</xpath>
</scalar>
without needing to specify that "review" is of type "String". The child could also specify additional attributes for the scalar, such as a label.
However, a wrapper must extend another wrapper in order to add new fields.
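The difference between type and extends can be sketched with three wrappers. This is only an illustration under stated assumptions: the wrapper names, URLs, XPaths, and the "rating" field are hypothetical, and the fragments would sit inside a meta_metadata_repository element.

```xml
<!-- Hypothetical wrappers: all names, URLs, and XPaths are illustrative only. -->

<!-- Parent: declares the "review" field and its scalar type once. -->
<meta_metadata name="example_review" extends="compound_document" parser="xpath">
  <scalar name="review" scalar_type="String"/>
</meta_metadata>

<!-- type: fills in the inherited "review" field but adds no new fields. -->
<meta_metadata name="site_a_review" type="example_review" parser="xpath">
  <selector url_path_tree="http://site-a.example.com/reviews/" />
  <scalar name="review">
    <xpath>//div[@id='review']/text()</xpath>
  </scalar>
</meta_metadata>

<!-- extends: fills in the inherited "review" field and adds a new "rating" field. -->
<meta_metadata name="site_b_review" extends="example_review" parser="xpath">
  <selector url_path_tree="http://site-b.example.com/reviews/" />
  <scalar name="review">
    <xpath>//div[@class='review-body']</xpath>
  </scalar>
  <scalar name="rating" scalar_type="String">
    <xpath>//span[@class='rating']</xpath>
  </scalar>
</meta_metadata>
```

In this sketch, only site_b_review declares a scalar_type on a field of its own, since "rating" is not provided by the parent.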