From 548e8e0be49692050ea4071d5e9945816bc5aacc Mon Sep 17 00:00:00 2001
From: Kebo Liu
Date: Wed, 13 Jan 2021 00:59:27 +0800
Subject: [PATCH] [mellanox]: Backport patches to increase critical threshold for ASIC and validate transceiver temperature (#185)

Backport new patches to increase the ASIC critical threshold from 110C to
140C, and validate the transceiver critical threshold temperature:

1. 0022-mlxsw-core-Increase-critical-threshold-for-ASIC-ther.patch
   https://github.com/torvalds/linux/commit/b06ca3d5a43ca2dd806f7688a17e8e7e0619a80a
2. 0023-mlxsw-core-Add-validation-of-transceiver-temperature.patch
   https://github.com/torvalds/linux/commit/57726ebe2733891c9f59105eff028735f73d05fb

This change has been verified on all Mellanox devices based on Spectrum-1,
Spectrum-2, and Spectrum-3 ASICs.

Signed-off-by: Kebo Liu
---
 ...ase-critical-threshold-for-ASIC-ther.patch | 42 +++++++++++++++
 ...alidation-of-transceiver-temperature.patch | 54 +++++++++++++++++++
 patch/series                                  |  2 +
 3 files changed, 98 insertions(+)
 create mode 100644 patch/0022-mlxsw-core-Increase-critical-threshold-for-ASIC-ther.patch
 create mode 100644 patch/0023-mlxsw-core-Add-validation-of-transceiver-temperature.patch

diff --git a/patch/0022-mlxsw-core-Increase-critical-threshold-for-ASIC-ther.patch b/patch/0022-mlxsw-core-Increase-critical-threshold-for-ASIC-ther.patch
new file mode 100644
index 000000000000..dab910eadc6b
--- /dev/null
+++ b/patch/0022-mlxsw-core-Increase-critical-threshold-for-ASIC-ther.patch
@@ -0,0 +1,42 @@
+From f79a25e99568d19ed2cd39de4650ced66de4ab5d Mon Sep 17 00:00:00 2001
+From: Vadim Pasternak
+Date: Thu, 31 Dec 2020 19:27:02 +0200
+Subject: [PATCH mlxsw/net-next 1/1] mlxsw: core: Increase critical threshold
+ for ASIC thermal zone
+
+Increase critical threshold for ASIC thermal zone from 110C to 140C
+according to the system hardware requirements. All the supported ASICs
+(SX, Spectrum1, Spectrum2, Spectrum3) could still be operational with
+ASIC temperature below 140C.
+
+According to the system requirements software thermal protection is the
+second level of protection, while the first level of protection should
+be performed by firmware. So firmware could decide to perform system
+thermal shutdown in case the temperature is below 140C.
+
+In case firmware did not perform it and ASIC temperature
+reached 140C, the second level of thermal protection will be performed
+by software.
+
+Fixes: 41e760841d26 ("mlxsw: core: Replace thermal temperature trips with defines")
+Signed-off-by: Vadim Pasternak
+---
+ drivers/net/ethernet/mellanox/mlxsw/core_thermal.c | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
+index 141e3655e211..d575aa469517 100644
+--- a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
++++ b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
+@@ -19,7 +19,7 @@
+ #define MLXSW_THERMAL_ASIC_TEMP_NORM	75000	/* 75C */
+ #define MLXSW_THERMAL_ASIC_TEMP_HIGH	85000	/* 85C */
+ #define MLXSW_THERMAL_ASIC_TEMP_HOT	105000	/* 105C */
+-#define MLXSW_THERMAL_ASIC_TEMP_CRIT	110000	/* 110C */
++#define MLXSW_THERMAL_ASIC_TEMP_CRIT	140000	/* 140C */
+ #define MLXSW_THERMAL_MODULE_TEMP_NORM	60000	/* 60C */
+ #define MLXSW_THERMAL_MODULE_TEMP_HIGH	70000	/* 70C */
+ #define MLXSW_THERMAL_MODULE_TEMP_HOT	80000	/* 80C */
+--
+2.11.0
+
diff --git a/patch/0023-mlxsw-core-Add-validation-of-transceiver-temperature.patch b/patch/0023-mlxsw-core-Add-validation-of-transceiver-temperature.patch
new file mode 100644
index 000000000000..406aa22edd2a
--- /dev/null
+++ b/patch/0023-mlxsw-core-Add-validation-of-transceiver-temperature.patch
@@ -0,0 +1,54 @@
+From 8845c46138eca323a3fe9cd1332190f092f35888 Mon Sep 17 00:00:00 2001
+From: Vadim Pasternak
+Date: Thu, 7 Jan 2021 12:56:21 +0200
+Subject: [PATCH mlxsw/backport 2/2]
mlxsw: core: Add validation of transceiver
+ temperature thresholds
+
+Validate thresholds to avoid a single failure due to some transceiver
+unreliability. Ignore the last readouts in case warning temperature is
+above alarm temperature, since it can cause unexpected thermal
+shutdown. Stay with the previous values and refresh threshold within
+the next iteration.
+
+This is a rare scenario, but it has been observed once at a customer
+site.
+
+Fixes: 6a79507cfe94 ("mlxsw: core: Extend thermal module with per QSFP module thermal zones")
+Signed-off-by: Vadim Pasternak
+---
+ drivers/net/ethernet/mellanox/mlxsw/core_thermal.c | 11 +++++++----
+ 1 file changed, 7 insertions(+), 4 deletions(-)
+
+diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
+index 54d0e8b8d..477c3ed53 100644
+--- a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
++++ b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
+@@ -183,6 +183,12 @@ mlxsw_thermal_module_trips_update(struct device *dev, struct mlxsw_core *core,
+ 	if (err)
+ 		return err;
+ 
++	if (crit_temp > emerg_temp) {
++		dev_warn(dev, "%s : Critical threshold %d is above emergency threshold %d\n",
++			 tz->tzdev->type, crit_temp, emerg_temp);
++		return 0;
++	}
++
+ 	/* According to the system thermal requirements, the thermal zones are
+ 	 * defined with four trip points.
The critical and emergency
+ 	 * temperature thresholds, provided by QSFP module are set as "active"
+@@ -197,11 +203,8 @@ mlxsw_thermal_module_trips_update(struct device *dev, struct mlxsw_core *core,
+ 		tz->trips[MLXSW_THERMAL_TEMP_TRIP_NORM].temp = crit_temp;
+ 	tz->trips[MLXSW_THERMAL_TEMP_TRIP_HIGH].temp = crit_temp;
+ 	tz->trips[MLXSW_THERMAL_TEMP_TRIP_HOT].temp = emerg_temp;
+-	if (emerg_temp > crit_temp)
+-		tz->trips[MLXSW_THERMAL_TEMP_TRIP_CRIT].temp = emerg_temp +
++	tz->trips[MLXSW_THERMAL_TEMP_TRIP_CRIT].temp = emerg_temp +
+ 					MLXSW_THERMAL_MODULE_TEMP_SHIFT;
+-	else
+-		tz->trips[MLXSW_THERMAL_TEMP_TRIP_CRIT].temp = emerg_temp;
+ 
+ 	return 0;
+ }
+--
+2.11.0
+
diff --git a/patch/series b/patch/series
index 32f66c8fc1a8..1ce8244772fe 100755
--- a/patch/series
+++ b/patch/series
@@ -70,6 +70,8 @@ driver-ixgbe-external-phy.patch
 0019-mlxsw-i2c-Allow-flexible-setting-of-I2C-transactions.patch
 0020-mlxsw-core-Set-different-thermal-polling-time-based.patch
 0021-platform-x86-mlx-platform-Remove-PSU-EEPROM-configur.patch
+0022-mlxsw-core-Increase-critical-threshold-for-ASIC-ther.patch
+0023-mlxsw-core-Add-validation-of-transceiver-temperature.patch
 ############################################################
 #
 # Internal patches will be added below (placeholder)