Source bing webmaster #335

Merged 17 commits on Feb 26, 2024
72 changes: 72 additions & 0 deletions sources/bing_webmaster/README.md
@@ -0,0 +1,72 @@
---
title: Bing Webmaster
description: dlt source for Bing Webmaster API
keywords: [bing, bing webmasters, bing webmaster tools]
---


# Bing Webmaster

This source allows site owners to retrieve the organic search traffic to their pages via Bing.
[Bing Webmaster](https://www.bing.com/webmasters/tools/) is a free service that is part of Microsoft's Bing search engine. It allows webmasters to add their websites to the Bing index crawler and to see their site's performance in Bing searches (including chat). This source reports clicks, impressions, and their respective average ranks.

Resources that can be loaded using this verified source are:

| Name             | Description                                                                      |
| ---------------- | -------------------------------------------------------------------------------- |
| page_stats       | retrieves weekly traffic statistics for top pages belonging to a site_url        |
| page_query_stats | retrieves weekly traffic statistics per query for each pair of page and site_url |


## Initialize the pipeline

```bash
dlt init bing_webmaster duckdb
```

Here, we chose duckdb as the destination. Alternatively, you can also choose redshift, bigquery, or
any of the other [destinations](https://dlthub.com/docs/dlt-ecosystem/destinations/).
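
For example, to scaffold the same source with BigQuery as the destination instead, only the destination argument changes:

```bash
dlt init bing_webmaster bigquery
```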

## Add credentials

1. The [Bing Webmaster API](https://learn.microsoft.com/en-us/dotnet/api/microsoft.bing.webmaster.api.interfaces.iwebmasterapi) requires authentication via an API key stored in `secrets.toml`. Create an account and generate the API key by clicking the cog wheel in the [Bing Webmaster Web UI](https://www.bing.com/webmasters/home).

2. Add the obtained API key into `secrets.toml` as follows (an environment-variable alternative is sketched after this list):
```toml
[sources.bing_webmaster]
api_key = "Please set me up!" # replace with your API key
```

3. Add your sites and verify with Bing that you own the domains you want to fetch statistics for. Follow the [documentation on adding and verifying a site](https://www.bing.com/webmasters/help/add-and-verify-site-12184f8b); it describes how to import your sites from Google Search Console or add them manually.

4. Follow the instructions in the
[destinations](https://dlthub.com/docs/dlt-ecosystem/destinations/) document to add credentials
for your chosen destination.
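
💡 As an alternative to `secrets.toml`, dlt can also read credentials from environment variables. A sketch, assuming dlt's usual naming convention (config sections uppercased and joined with double underscores):

```bash
export SOURCES__BING_WEBMASTER__API_KEY="<your api key>"
```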

## Run the pipeline

1. Install the necessary dependencies by running the following command:

```bash
pip install -r requirements.txt
```

2. Substitute the example domain with your domain in the pipeline file `bing_webmaster_pipeline.py`.

3. Now run the pipeline with the command:

```bash
python3 bing_webmaster_pipeline.py
```

4. To make sure that everything loaded as expected, use the command:

```bash
dlt pipeline bing_webmaster_pipeline show
```
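
💡 To inspect the loaded data programmatically, here is a minimal sketch, assuming the default duckdb file name (`<pipeline_name>.duckdb`) and the table produced by the `page_stats` resource:

```python
import duckdb

# connect to the file written by the pipeline run above
conn = duckdb.connect("bing_webmaster_page_stats.duckdb")

# the source loads into dataset "bing_webmaster", table "bing_page_stats"
print(conn.sql("SELECT date, page, site_url FROM bing_webmaster.bing_page_stats LIMIT 5"))
```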

💡 To explore the API further, we recommend the official documentation for the two implemented resources:
- [GetPageStats](https://learn.microsoft.com/en-us/dotnet/api/microsoft.bing.webmaster.api.interfaces.iwebmasterapi.getpagestats)
- [GetPageQueryStats](https://learn.microsoft.com/en-us/dotnet/api/microsoft.bing.webmaster.api.interfaces.iwebmasterapi.getpagequerystats)
The official documentation is sparse and potentially misleading; therefore, we have added more information to the docstrings of our implementation. This additional information comes from our practical observations while working with this data source.
96 changes: 96 additions & 0 deletions sources/bing_webmaster/__init__.py
@@ -0,0 +1,96 @@
"""
A source loading history of organic search traffic from Bing Webmaster API
See documentation: https://learn.microsoft.com/en-us/dotnet/api/microsoft.bing.webmaster.api.interfaces.iwebmasterapi?view=bing-webmaster-dotnet
The API returns aggregated weekly statistics for the entire history of up to 26 weeks.
The dates are always Fridays and during tests, the data up to the latest Friday has been available on the following Monday.
"""

import time
from typing import Iterable, Iterator, List, Optional, Sequence

import dlt
from dlt.common import logger
from dlt.common.typing import DictStrAny, DictStrStr
from dlt.sources import DltResource

from .helpers import get_stats_with_retry, parse_response


@dlt.source(name="bing_webmaster")
def source(
    site_urls: Optional[List[str]] = None,
    site_url_pages: Optional[Iterable[DictStrStr]] = None,
) -> Sequence[DltResource]:
"""
A dlt source for the Bing Webmaster api.
It groups resources for the APIs which return organic search traffic statistics
Args:
site_urls: List[str]: A list of site_urls, e.g, ["dlthub.com", "dlthub.de"]. Use this if you need the weekly traffic per site_url and page
site_url_pages: Iterable[DictStrStr]: A list of pairs of site_url and page. Use this if you need the weekly traffic per site_url, page, and query
Returns:
Sequence[DltResource]: A sequence of resources that can be selected from including page_stats and page_query_stats.
"""
return (
page_stats(site_urls),
page_query_stats(site_url_pages),
)
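
# A minimal usage sketch (assuming you have verified ownership of the domain
# with Bing; "example.com" is a placeholder):
#
#   pipeline = dlt.pipeline(destination="duckdb", dataset_name="bing_webmaster")
#   data = source(site_urls=["example.com"])
#   pipeline.run(data.with_resources("page_stats"))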


@dlt.resource(
write_disposition="merge",
merge_key=("date", "page", "site_url"),
primary_key=("date", "page", "site_url"),
table_name="bing_page_stats",
)
def page_stats(
site_urls: List[str], api_key: str = dlt.secrets.value
) -> Iterator[Iterator[DictStrAny]]:
"""
Yields detailed traffic statistics for top pages belonging to a site_url
Contains the entire available history of up to 26 weeks. Thus, we recommend to use write_disposition="merge"
API documentation:
https://learn.microsoft.com/en-us/dotnet/api/microsoft.bing.webmaster.api.interfaces.iwebmasterapi.getpagestats
Args:
site_urls (List[str]): List of site_urls to retrieve statistics for.
Yields:
Iterator[Dict[str, Any]]: An iterator over list of organic traffic statistics.
"""
api_path = "GetPageStats"
for site_url in site_urls:
params = {"siteUrl": site_url, "apikey": api_key}
logger.info(f"Fetching for site_url: {site_url}")
response = get_stats_with_retry(api_path, params)
if len(response) > 0:
yield parse_response(response, site_url)


@dlt.resource(
write_disposition="merge",
merge_key=("date", "page", "site_url", "query"),
primary_key=("date", "page", "site_url", "query"),
table_name="bing_page_query_stats",
)
def page_query_stats(
site_url_pages: Iterable[DictStrStr],
api_key: str = dlt.secrets.value,
) -> Iterator[Iterator[DictStrAny]]:
"""
Yields weekly statistics and queries for each pair of page and site_url.
Contains the entire available history of up to 26 weeks. Thus, we recommend to use write_disposition="merge"
API documentation:
https://learn.microsoft.com/en-us/dotnet/api/microsoft.bing.webmaster.api.interfaces.iwebmasterapi.getpagequerystats

Args:
site_url_page (Iterable[DictStrStr]): Iterable of site_url and pages to retrieve statistics for. Can be result of a SQL query, a parsed sitemap, etc.
Yields:
Iterator[Dict[str, Any]]: An iterator over list of organic traffic statistics.
"""
api_path = "GetPageQueryStats"
for record in site_url_pages:
        time.sleep(0.5)  # avoids the rate limit we observed after dozens of requests
site_url = record.get("site_url")
page = record.get("page")
params = {"siteUrl": site_url, "page": page, "apikey": api_key}
logger.info(f"Fetching for site_url: {site_url}, page: {page}")
response = get_stats_with_retry(api_path, params)
if len(response) > 0:
yield parse_response(response, site_url, page)
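
# site_url_pages can come from any iterable, e.g. a SQL query result or a parsed
# sitemap. A hypothetical sketch:
#
#   pages = ["https://example.com/pricing", "https://example.com/app"]
#   site_url_pages = ({"site_url": "example.com", "page": p} for p in pages)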
60 changes: 60 additions & 0 deletions sources/bing_webmaster/helpers.py
@@ -0,0 +1,60 @@
"""Bing Webmaster source helpers"""

import re
from typing import Iterator, List, Optional
from urllib.parse import urljoin

from dlt.common import logger, pendulum
from dlt.common.typing import DictStrAny, DictStrStr
from dlt.sources.helpers import requests

from .settings import BASE_URL, HEADERS


def get_url_with_retry(url: str, params: DictStrStr) -> DictStrAny:
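    """GETs the url using dlt's requests helper, which retries transient failures. On HTTP 400, logs a hint that the API key may lack authorization for the requested domain."""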
try:
r = requests.get(url, headers=HEADERS, params=params)
return r.json() # type: ignore
except requests.HTTPError as e:
if e.response.status_code == 400:
logger.warning(
f"""HTTP Error {e.response.status_code}.
Is your API key authorized to fetch data about the domain
'{params.get('siteUrl')}'?"""
)
e.response.raise_for_status()
return e.response.json() # type: ignore


def get_stats_with_retry(api_path: str, params: DictStrStr) -> List[DictStrAny]:
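    """Calls the given API path and unwraps the JSON envelope: the payload arrives under the "d" key."""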
url = urljoin(BASE_URL, api_path)
response = get_url_with_retry(url, params)
return response.get("d") # type: ignore


def parse_response(
    response: List[DictStrAny], site_url: str, page: Optional[str] = None
) -> Iterator[DictStrAny]:
    """
    Adds the site_url from the request to the response.
    Otherwise, we would not know to which site_url a page and its statistics belong.
    Further, corrects the fact that what the API returns as 'Query' is actually the page.
    """
for r in response:
if page is None:
# in GetPageStats endpoint the page is under the key "Query"
r.update({"page": r.get("Query")})
del r["Query"]
else:
r.update({"page": page})
r.update({"site_url": site_url, "Date": _parse_date(r)})
del r["__type"]
yield r
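
# Illustration with a hypothetical GetPageStats record (metric fields omitted):
#
#   {"__type": "...", "Query": "https://example.com/a", "Date": "/Date(1708646400000)/"}
# becomes
#   {"page": "https://example.com/a", "site_url": "example.com", "Date": Date(2024, 2, 23)}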


def _parse_date(record: DictStrStr) -> pendulum.Date:
"""Parses Microsoft's date format into a date. The number is a unix timestamp"""
match = re.findall(r"\d+", record.get("Date")) # extract the digits
timestamp_in_seconds = int(match[0]) // 1000
d: pendulum.Date = pendulum.Date.fromtimestamp(timestamp_in_seconds)
return d
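
# Example: Microsoft serializes dates in a format like "/Date(1708646400000)/",
# i.e. milliseconds since the Unix epoch; these digits parse to 2024-02-23,
# a Friday, consistent with the weekly cadence noted in the module docstring.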
1 change: 1 addition & 0 deletions sources/bing_webmaster/requirements.txt
@@ -0,0 +1 @@
dlt>=0.3.5
4 changes: 4 additions & 0 deletions sources/bing_webmaster/settings.py
@@ -0,0 +1,4 @@
"""Bing Webmaster source settings and constants"""

BASE_URL = "https://ssl.bing.com/webmaster/api.svc/json/"
HEADERS = {"Content-Type": "application/json", "charset": "utf-8"}
55 changes: 55 additions & 0 deletions sources/bing_webmaster_pipeline.py
@@ -0,0 +1,55 @@
import dlt
from bing_webmaster import source


def load_page_stats_example() -> None:
"""
Constructs a pipeline that will load organic search traffic from Bing Webmaster
for site_url and pages
"""

# configure the pipeline: provide the destination and dataset name to which the data should go
pipeline = dlt.pipeline(
pipeline_name="bing_webmaster_page_stats",
destination="duckdb",
dataset_name="bing_webmaster",
)
    # create the data source by providing a list of site_urls.
    # Note that you first have to verify your own site urls with Bing. Thus, most likely,
    # you'll lack the permissions to request statistics for the ones provided in this example.
data = source(site_urls=["sipgate.de", "satellite.me"])

# load the "page_stats" out of all the possible resources
info = pipeline.run(data.with_resources("page_stats"))
print(info)


def load_page_query_stats_example() -> None:
"""
Constructs a pipeline that will load organic search traffic from Bing Webmaster
for site_url, pages, and query
"""

# configure the pipeline: provide the destination and dataset name to which the data should go
pipeline = dlt.pipeline(
pipeline_name="bing_webmaster_page_query_stats",
destination="duckdb",
dataset_name="bing_webmaster",
)
    # create the data source by providing a list of pairs of site_url and page.
    # Note that you first have to verify your own site urls with Bing. Thus, most likely,
    # you'll lack the permissions to request statistics for the ones provided in this example.
data = source(
site_url_pages=[
{"site_url": "sipgate.de", "page": "https://www.sipgate.de/preise"},
{"site_url": "sipgate.de", "page": "https://www.sipgate.de/app"},
]
)
# load the "page_query_stats" out of all the possible resources
info = pipeline.run(data.with_resources("page_query_stats"))
print(info)


if __name__ == "__main__":
load_page_stats_example()
load_page_query_stats_example()