4. Webscraping

Date: 2019-06-27

Architecture issue: #252

Changelog:

2019-06-27 Initial version accepted
2021-02-09 Amendment to add exception for authentication phase

Status

Accepted

Context

Webscraping is when we use code to mimic a user, log in to a website, and extract data from it for use in Home Assistant. This is usually needed because certain data sources/integrations do not offer an API.

Webscraping comes with the following downsides:

Scrapers are very fragile and break often. When the website is updated, the integration will need to be updated.
Some vendors (like USPS) have IP-banned users of such integrations.
Some integrations rely on beautifulsoup (Python-based), while others rely on PhantomJS or other headless browsers, meaning we need to include a whole browser.

Proposal

We no longer accept any new integrations that rely on webscraping.
We identify integrations that rely on webscraping, deprecate them for 2 releases, and then remove them.
It will still be possible for custom integrations to provide information via webscraping.

Exceptions

Generic integrations to parse HTML are excluded from this decision.

An exception is made for the authentication phase. An integration is allowed to extract fields from forms. To make this more robust, the integration should not scrape individual fields one by one but should instead scrape all fields of the form at once.
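
As an illustration of scraping all fields at once, the following is a minimal sketch using only the Python standard library. It is not taken from any existing integration. The helper names (`FormFieldCollector`, `extract_form_fields`, `build_login_payload`) and the `username`/`password` field names are hypothetical, and the sketch assumes the login form is the first form on the page.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs of every <input> in the first <form>."""

    def __init__(self) -> None:
        super().__init__()
        self.fields = {}
        self._in_form = False
        self._done = False  # set once the first form has been closed

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "form":
            self._in_form = True
        elif tag == "input" and self._in_form:
            attributes = dict(attrs)
            name = attributes.get("name")
            if name:
                # Inputs without a value attribute are kept as empty strings.
                self.fields[name] = attributes.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def extract_form_fields(html: str) -> dict:
    """Return every named input of the first form, whatever it is called."""
    parser = FormFieldCollector()
    parser.feed(html)
    return parser.fields


def build_login_payload(html: str, username: str, password: str) -> dict:
    """Start from all scraped fields, then overwrite only the credentials.

    "username" and "password" are assumed field names for this sketch;
    a real site may use different ones.
    """
    payload = extract_form_fields(html)
    payload.update({"username": username, "password": password})
    return payload
```

Because the payload starts from every field the form contains, hidden fields that the vendor adds or renames later (a CSRF token, for example) are passed through without the integration having to know about them.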

Consequences

Integrations that rely on webscraping will have to be maintained as custom integrations.
