Dimitri van Hees edited this page Mar 18, 2016 · 2 revisions

Discoverability

For data to be crawled, it first needs to be discovered. After all, if a crawler doesn't know the data exists, it simply won't crawl it. There are different ways to notify crawlers of the existence of data.

XML sitemaps

A generic way to reach search engines is to provide XML sitemaps containing links to all the data resources. These XML sitemaps can be submitted to search engines using the engine's webmaster tools or other notification mechanisms. The location of the XML sitemaps should also be listed in the robots.txt file, the file in the root of a website that gives instructions to robots (crawlers).
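As a rough illustration, the sketch below builds a small sitemap and a robots.txt that points to it. The example.org URLs and the function names `build_sitemap` and `robots_txt` are assumptions made for this example, not part of our setup.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML document (as a string) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def robots_txt(sitemap_url):
    """Return a minimal robots.txt that allows all crawlers and names the sitemap."""
    return "User-agent: *\nAllow: /\nSitemap: " + sitemap_url + "\n"


# Hypothetical resource URLs on a placeholder domain.
resource_urls = [
    "https://example.org/gemeente/GM0307",
    "https://example.org/gemeente/GM0308",
]
sitemap_xml = build_sitemap(resource_urls)
robots = robots_txt("https://example.org/sitemap.xml")
```

The sitemap would be served at the URL named in the `Sitemap:` line, and the robots.txt at the root of the same site.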

Internal links

Internal links in the dataset imply that there are other resources in the dataset as well. Crawlers can use these links to navigate to those resources, for example when pagination is in place.
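To make this concrete, here is a minimal sketch of how a crawler might collect the internal links on a page, including a pagination link. It is not how any particular search engine works; the HTML fragment, the example.org URLs and the names `LinkCollector` and `internal_links` are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the absolute URLs of all <a href="..."> links on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the URL of the page itself.
            self.links.append(urljoin(self.base_url, href))


def internal_links(base_url, html):
    """Return only the links that stay on the same host as base_url."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    return [link for link in collector.links if urlparse(link).netloc == host]


# Hypothetical listing page with a detail link, a "next page" link and an external link.
page = """
<a href="/gemeente/GM0307">Detail</a>
<a rel="next" href="?page=2">Next page</a>
<a href="https://other.example.com/">Elsewhere</a>
"""
found = internal_links("https://example.org/gemeente", page)
```

Here `found` holds the detail page and the second listing page; the link to the other host is dropped.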

Incoming links

If external sites link to your resources, chances are that crawlers already know these sites and will follow the links to your resources. This is quite hard to achieve for new datasets, because the external sites need to know that the data exists as well. You could, for example, submit your datasets to external portals or known link directories (in the early days this happened on a large scale at the Dutch Startpagina pages or DMOZ.org).

Pagerank

Pagerank is a score a crawler grants to a certain resource based on different (unknown) factors. One of the known factors is the number of incoming links: a crawler reasons that the more external websites link to a single resource, the more important that resource probably is, so it gets a higher pagerank. As we didn't have any incoming links during our research period, we couldn't influence the pagerank. Besides, it appears that pagerank is mainly used to determine the position within the search results. For our research we are not interested in top positions; we want crawlers to crawl and interpret the data. Because of that, we disregarded this crawlability dependency.

Indexation period

The indexation period is the time a crawler takes before it starts indexing your pages after discovering the data. For our own website it took almost 10 months before Google indexed our structured data, while for other websites it may take much less. This depends on different factors, none of which we can influence.

Indexation frequency

The indexation frequency is how often a crawler visits. Large news sites have a high frequency, which means that crawlers come by multiple times per day or even per hour, but new websites which are less newsworthy may be glad if a crawler comes by every day. Using the XML sitemap you can give some hints about how current your data is, but it's unknown whether a crawler follows these hints.
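The hints mentioned above are the optional `lastmod` and `changefreq` elements of the sitemap protocol. The sketch below builds a single sitemap entry carrying both; the URL, the date and the function name `url_entry` are assumptions for this example.

```python
import datetime
import xml.etree.ElementTree as ET

# Values the sitemap protocol allows for <changefreq>.
ALLOWED_CHANGEFREQ = {
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
}


def url_entry(loc, last_modified, changefreq):
    """Return one <url> sitemap entry with freshness hints, as a string."""
    if changefreq not in ALLOWED_CHANGEFREQ:
        raise ValueError("unsupported changefreq: " + changefreq)
    entry = ET.Element("url")
    ET.SubElement(entry, "loc").text = loc
    # The protocol expects a W3C date; a plain ISO date (YYYY-MM-DD) is accepted.
    ET.SubElement(entry, "lastmod").text = last_modified.isoformat()
    ET.SubElement(entry, "changefreq").text = changefreq
    return ET.tostring(entry, encoding="unicode")


# Hypothetical resource and modification date.
entry_xml = url_entry(
    "https://example.org/gemeente/GM0307",
    datetime.date(2016, 3, 18),
    "weekly",
)
```

These values are only hints; a crawler remains free to visit more or less often than the sitemap suggests.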

Content

Another important aspect of crawlability is the content itself. As a search engine, I want my end-users (people looking for something) to be as satisfied as possible. This means that when people find a result, I'd like to point them to a page where they will actually find what they are looking for. Spatial data carries little context about the place itself, so leading a searching person to a page with only a location and some abbreviations (e.g. https://geo4web.apiwise.nl/gemeente/GM0307) is probably a bad idea. Content management is needed to enrich the dataset's detail pages in order to make them attractive to end-users, and hence to search engines and other crawlers.

Supported Schema.org entities

If a crawler indexes your page, it is still uncertain which structured data it interprets. As shown in the image below, Google recognizes our Schema.org markup, but it's unclear how it is being used. There is no clear overview of how each crawler implements structured data, which makes it very hard to say anything definite about it. Our advice is to include as much Schema.org markup as possible, so it's ready for future implementations by the major search engines.
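As an illustration of what such markup can look like, the sketch below generates a Schema.org `Place` description as JSON-LD. The choice of JSON-LD (rather than microdata or RDFa), the property selection, the place name, the coordinates and the function names are all assumptions for this example and not necessarily what our pages use.

```python
import json


def place_jsonld(name, url, latitude, longitude):
    """Return a Schema.org Place description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Place",
        "name": name,
        "url": url,
        "geo": {
            "@type": "GeoCoordinates",
            "latitude": latitude,
            "longitude": longitude,
        },
    }


def as_script_tag(data):
    """Wrap JSON-LD in the <script> element that is embedded in an HTML page."""
    # Escape "</" so the payload can never close the script element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


# Hypothetical place name and coordinates on a placeholder domain.
markup = place_jsonld(
    "Example municipality",
    "https://example.org/gemeente/GM0307",
    52.15,
    5.38,
)
snippet = as_script_tag(markup)
```

The resulting `snippet` would be placed in the HTML of the detail page that describes the place.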
