Source bing webmaster #335

Merged 17 commits on Feb 26, 2024
72 changes: 72 additions & 0 deletions sources/bing_webmaster/README.md
@@ -0,0 +1,72 @@
---
title: Bing Webmaster
description: dlt source for Bing Webmaster API
keywords: [bing, bing webmasters, bing webmaster tools]
---


# Bing Webmaster

This source allows site owners to retrieve the organic search traffic to their pages via Bing.
[Bing Webmaster](https://www.bing.com/webmasters/tools/) is a free service that is part of Microsoft's Bing search engine. It allows webmasters to add their websites to the Bing index crawler and to see their site's performance in Bing searches (including chat). This source reports clicks, impressions, and their respective average ranks.

Resources that can be loaded using this verified source are:

| Name             | Description                                                                      |
| ---------------- | -------------------------------------------------------------------------------- |
| page_stats       | retrieves weekly traffic statistics for top pages belonging to a site_url        |
| page_query_stats | retrieves weekly traffic statistics per query for each pair of page and site_url |


## Initialize the pipeline

```bash
dlt init bing_webmaster duckdb
```

Here, we chose duckdb as the destination. Alternatively, you can also choose redshift, bigquery, or
any of the other [destinations](https://dlthub.com/docs/dlt-ecosystem/destinations/).
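
For example, to scaffold the same source with BigQuery as the destination instead, only the destination argument changes:

```bash
dlt init bing_webmaster bigquery
```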

## Add credentials

1. The [Bing Webmaster API](https://learn.microsoft.com/en-us/dotnet/api/microsoft.bing.webmaster.api.interfaces.iwebmasterapi) requires authentication via an API key stored in `secrets.toml`. Create an account and generate the API key by clicking the cog wheel in the [Bing Webmaster Web UI](https://www.bing.com/webmasters/home).

2. Add the obtained API key into `secrets.toml` as follows (an environment-variable alternative is sketched after this list):
```toml
[sources.bing_webmaster]
api_key = "Please set me up!" # replace with your API key
```

3. Add your sites and verify with Bing that you own the domains you want to fetch statistics for. Follow the [documentation on adding and verifying a site](https://www.bing.com/webmasters/help/add-and-verify-site-12184f8b); it describes how to import your sites from Google Search Console or add them manually.

4. Follow the instructions in the
[destinations](https://dlthub.com/docs/dlt-ecosystem/destinations/) document to add credentials
for your chosen destination.
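
💡 As an alternative to `secrets.toml`, dlt can also read credentials from environment variables. A sketch, assuming dlt's usual naming convention (config sections uppercased and joined with double underscores):

```bash
export SOURCES__BING_WEBMASTER__API_KEY="<your api key>"
```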

## Run the pipeline

1. Install the necessary dependencies by running the following command:

```bash
pip install -r requirements.txt
```

2. Substitute the example domain with your domain in the pipeline file `bing_webmaster_pipeline.py`.

3. Now run the pipeline with the command:

```bash
python3 bing_webmaster_pipeline.py
```

4. To make sure that everything loaded as expected, use the command:

```bash
dlt pipeline bing_webmaster_pipeline show
```
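
💡 To inspect the loaded data programmatically, here is a minimal sketch, assuming the default duckdb file name (`<pipeline_name>.duckdb`) and the table produced by the `page_stats` resource:

```python
import duckdb

# connect to the file written by the pipeline run above
conn = duckdb.connect("bing_webmaster_page_stats.duckdb")

# the source loads into dataset "bing_webmaster", table "bing_page_stats"
print(conn.sql("SELECT date, page, site_url FROM bing_webmaster.bing_page_stats LIMIT 5"))
```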

💡 To explore the API further, we recommend the official documentation for the two implemented resources:
- [GetPageStats](https://learn.microsoft.com/en-us/dotnet/api/microsoft.bing.webmaster.api.interfaces.iwebmasterapi.getpagestats)
- [GetPageQueryStats](https://learn.microsoft.com/en-us/dotnet/api/microsoft.bing.webmaster.api.interfaces.iwebmasterapi.getpagequerystats)
The official documentation is sparse and potentially misleading; therefore, we have added more information to the docstrings of our implementation. This additional information comes from our practical observations while working with this data source.
96 changes: 96 additions & 0 deletions sources/bing_webmaster/__init__.py
@@ -0,0 +1,96 @@
"""
A source loading history of organic search traffic from Bing Webmaster API
See documentation: https://learn.microsoft.com/en-us/dotnet/api/microsoft.bing.webmaster.api.interfaces.iwebmasterapi?view=bing-webmaster-dotnet
The API returns aggregated weekly statistics for the entire history of up to 26 weeks.
The dates are always Fridays and during tests, the data up to the latest Friday has been available on the following Monday.
"""

import time
from typing import Iterable, Iterator, List, Optional, Sequence

import dlt
from dlt.common import logger
from dlt.common.typing import DictStrAny, DictStrStr
from dlt.sources import DltResource

from .helpers import get_stats_with_retry, parse_response


@dlt.source(name="bing_webmaster")
def source(
    site_urls: Optional[List[str]] = None,
    site_url_pages: Optional[Iterable[DictStrStr]] = None,
) -> Sequence[DltResource]:
"""
A dlt source for the Bing Webmaster api.
It groups resources for the APIs which return organic search traffic statistics
Args:
site_urls: List[str]: A list of site_urls, e.g, ["dlthub.com", "dlthub.de"]. Use this if you need the weekly traffic per site_url and page
site_url_pages: Iterable[DictStrStr]: A list of pairs of site_url and page. Use this if you need the weekly traffic per site_url, page, and query
Returns:
Sequence[DltResource]: A sequence of resources that can be selected from including page_stats and page_query_stats.
"""
return (
page_stats(site_urls),
page_query_stats(site_url_pages),
)
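
# A minimal usage sketch (assuming you have verified ownership of the domain
# with Bing; "example.com" is a placeholder):
#
#   pipeline = dlt.pipeline(destination="duckdb", dataset_name="bing_webmaster")
#   data = source(site_urls=["example.com"])
#   pipeline.run(data.with_resources("page_stats"))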


@dlt.resource(
write_disposition="merge",
merge_key=("date", "page", "site_url"),
primary_key=("date", "page", "site_url"),
table_name="bing_page_stats",
)
def page_stats(
site_urls: List[str], api_key: str = dlt.secrets.value
) -> Iterator[Iterator[DictStrAny]]:
"""
Yields detailed traffic statistics for top pages belonging to a site_url
Contains the entire available history of up to 26 weeks. Thus, we recommend to use write_disposition="merge"
API documentation:
https://learn.microsoft.com/en-us/dotnet/api/microsoft.bing.webmaster.api.interfaces.iwebmasterapi.getpagestats
Args:
site_urls (List[str]): List of site_urls to retrieve statistics for.
Yields:
Iterator[Dict[str, Any]]: An iterator over list of organic traffic statistics.
"""
api_path = "GetPageStats"
for site_url in site_urls:
params = {"siteUrl": site_url, "apikey": api_key}
logger.info(f"Fetching for site_url: {site_url}")
response = get_stats_with_retry(api_path, params)
if len(response) > 0:
yield parse_response(response, site_url)


@dlt.resource(
write_disposition="merge",
merge_key=("date", "page", "site_url", "query"),
primary_key=("date", "page", "site_url", "query"),
table_name="bing_page_query_stats",
)
def page_query_stats(
site_url_pages: Iterable[DictStrStr],
api_key: str = dlt.secrets.value,
) -> Iterator[Iterator[DictStrAny]]:
"""
Yields weekly statistics and queries for each pair of page and site_url.
Contains the entire available history of up to 26 weeks. Thus, we recommend to use write_disposition="merge"
API documentation:
https://learn.microsoft.com/en-us/dotnet/api/microsoft.bing.webmaster.api.interfaces.iwebmasterapi.getpagequerystats

Args:
site_url_page (Iterable[DictStrStr]): Iterable of site_url and pages to retrieve statistics for. Can be result of a SQL query, a parsed sitemap, etc.
Yields:
Iterator[Dict[str, Any]]: An iterator over list of organic traffic statistics.
"""
api_path = "GetPageQueryStats"
for record in site_url_pages:
        time.sleep(0.5)  # avoids the rate limit we observed after dozens of requests
site_url = record.get("site_url")
page = record.get("page")
params = {"siteUrl": site_url, "page": page, "apikey": api_key}
logger.info(f"Fetching for site_url: {site_url}, page: {page}")
response = get_stats_with_retry(api_path, params)
if len(response) > 0:
yield parse_response(response, site_url, page)
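
# site_url_pages can come from any iterable, e.g. a SQL query result or a parsed
# sitemap. A hypothetical sketch:
#
#   pages = ["https://example.com/pricing", "https://example.com/app"]
#   site_url_pages = ({"site_url": "example.com", "page": p} for p in pages)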
60 changes: 60 additions & 0 deletions sources/bing_webmaster/helpers.py
@@ -0,0 +1,60 @@
"""Bing Webmaster source helpers"""

import re
from typing import Iterator, List, Optional
from urllib.parse import urljoin

from dlt.common import logger, pendulum
from dlt.common.typing import DictStrAny, DictStrStr
from dlt.sources.helpers import requests

from .settings import BASE_URL, HEADERS


def get_url_with_retry(url: str, params: DictStrStr) -> DictStrAny:
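    """GETs the url using dlt's requests helper, which retries transient failures. On HTTP 400, logs a hint that the API key may lack authorization for the requested domain."""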
try:
r = requests.get(url, headers=HEADERS, params=params)
return r.json() # type: ignore
except requests.HTTPError as e:
if e.response.status_code == 400:
logger.warning(
f"""HTTP Error {e.response.status_code}.
Is your API key authorized to fetch data about the domain
'{params.get('siteUrl')}'?"""
)
e.response.raise_for_status()
return e.response.json() # type: ignore


def get_stats_with_retry(api_path: str, params: DictStrStr) -> List[DictStrAny]:
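    """Calls the given API path and unwraps the JSON envelope: the payload arrives under the "d" key."""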
url = urljoin(BASE_URL, api_path)
response = get_url_with_retry(url, params)
return response.get("d") # type: ignore


def parse_response(
    response: List[DictStrAny], site_url: str, page: Optional[str] = None
) -> Iterator[DictStrAny]:
    """
    Adds the site_url from the request to the response.
    Otherwise, we would not know to which site_url a page and its statistics belong.
    Further, corrects the fact that what the API returns as 'Query' is actually the page.
    """
for r in response:
if page is None:
# in GetPageStats endpoint the page is under the key "Query"
r.update({"page": r.get("Query")})
del r["Query"]
else:
r.update({"page": page})
r.update({"site_url": site_url, "Date": _parse_date(r)})
del r["__type"]
yield r
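
# Illustration with a hypothetical GetPageStats record (metric fields omitted):
#
#   {"__type": "...", "Query": "https://example.com/a", "Date": "/Date(1708646400000)/"}
# becomes
#   {"page": "https://example.com/a", "site_url": "example.com", "Date": Date(2024, 2, 23)}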


def _parse_date(record: DictStrStr) -> pendulum.Date:
"""Parses Microsoft's date format into a date. The number is a unix timestamp"""
match = re.findall(r"\d+", record.get("Date")) # extract the digits
timestamp_in_seconds = int(match[0]) // 1000
d: pendulum.Date = pendulum.Date.fromtimestamp(timestamp_in_seconds)
return d
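
# Example: Microsoft serializes dates in a format like "/Date(1708646400000)/",
# i.e. milliseconds since the Unix epoch; these digits parse to 2024-02-23,
# a Friday, consistent with the weekly cadence noted in the module docstring.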
1 change: 1 addition & 0 deletions sources/bing_webmaster/requirements.txt
@@ -0,0 +1 @@
dlt>=0.3.5
4 changes: 4 additions & 0 deletions sources/bing_webmaster/settings.py
@@ -0,0 +1,4 @@
"""Bing Webmaster source settings and constants"""

BASE_URL = "https://ssl.bing.com/webmaster/api.svc/json/"
HEADERS = {"Content-Type": "application/json", "charset": "utf-8"}
55 changes: 55 additions & 0 deletions sources/bing_webmaster_pipeline.py
@@ -0,0 +1,55 @@
import dlt
from bing_webmaster import source


def load_page_stats_example() -> None:
"""
Constructs a pipeline that will load organic search traffic from Bing Webmaster
for site_url and pages
"""

# configure the pipeline: provide the destination and dataset name to which the data should go
pipeline = dlt.pipeline(
pipeline_name="bing_webmaster_page_stats",
destination="duckdb",
dataset_name="bing_webmaster",
)
    # create the data source by providing a list of site_urls.
    # Note that you first have to verify your own site urls with Bing. Thus, most likely,
    # you'll lack the permissions to request statistics for the ones provided in this example.
data = source(site_urls=["sipgate.de", "satellite.me"])

# load the "page_stats" out of all the possible resources
info = pipeline.run(data.with_resources("page_stats"))
print(info)


def load_page_query_stats_example() -> None:
"""
Constructs a pipeline that will load organic search traffic from Bing Webmaster
for site_url, pages, and query
"""

# configure the pipeline: provide the destination and dataset name to which the data should go
pipeline = dlt.pipeline(
pipeline_name="bing_webmaster_page_query_stats",
destination="duckdb",
dataset_name="bing_webmaster",
)
    # create the data source by providing a list of pairs of site_url and page.
    # Note that you first have to verify your own site urls with Bing. Thus, most likely,
    # you'll lack the permissions to request statistics for the ones provided in this example.
data = source(
site_url_pages=[
{"site_url": "sipgate.de", "page": "https://www.sipgate.de/preise"},
{"site_url": "sipgate.de", "page": "https://www.sipgate.de/app"},
]
)
# load the "page_query_stats" out of all the possible resources
info = pipeline.run(data.with_resources("page_query_stats"))
print(info)


if __name__ == "__main__":
load_page_stats_example()
load_page_query_stats_example()