[mellanox]: Backport patches to increase critical threshold for ASIC and validate transceiver temperature (#185)

Backport new patches to increase the ASIC critical threshold from 110C to 140C, and validate the transceiver critical threshold temperature:

1. 0022-mlxsw-core-Increase-critical-threshold-for-ASIC-ther.patch torvalds/linux@b06ca3d
2. 0023-mlxsw-core-Add-validation-of-transceiver-temperature.patch torvalds/linux@57726eb

This change has been verified on all Mellanox devices based on Spectrum-1, Spectrum-2, and Spectrum-3 ASICs.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
Showing 3 changed files with 98 additions and 0 deletions.
42 changes: 42 additions & 0 deletions
patch/0022-mlxsw-core-Increase-critical-threshold-for-ASIC-ther.patch
@@ -0,0 +1,42 @@
From f79a25e99568d19ed2cd39de4650ced66de4ab5d Mon Sep 17 00:00:00 2001
From: Vadim Pasternak <vadimp@nvidia.com>
Date: Thu, 31 Dec 2020 19:27:02 +0200
Subject: [PATCH mlxsw/net-next 1/1] mlxsw: core: Increase critical threshold
 for ASIC thermal zone

Increase critical threshold for ASIC thermal zone from 110C to 140C
according to the system hardware requirements. All the supported ASICs
(SX, Spectrum1, Spectrum2, Spectrum3) can still be operational with
ASIC temperature below 140C.

According to the system requirements software thermal protection is the
second level of protection, while the first level of protection should
be performed by firmware. So firmware can decide to perform system
thermal shutdown in case the temperature is below 140C. In case
firmware did not perform it and ASIC temperature reached 140C, the
second level of thermal protection will be performed by software.

Fixes: 41e760841d26 ("mlxsw: core: Replace thermal temperature trips with defines")
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlxsw/core_thermal.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
index 141e3655e211..d575aa469517 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
@@ -19,7 +19,7 @@
 #define MLXSW_THERMAL_ASIC_TEMP_NORM	75000	/* 75C */
 #define MLXSW_THERMAL_ASIC_TEMP_HIGH	85000	/* 85C */
 #define MLXSW_THERMAL_ASIC_TEMP_HOT	105000	/* 105C */
-#define MLXSW_THERMAL_ASIC_TEMP_CRIT	110000	/* 110C */
+#define MLXSW_THERMAL_ASIC_TEMP_CRIT	140000	/* 140C */
 #define MLXSW_THERMAL_MODULE_TEMP_NORM	60000	/* 60C */
 #define MLXSW_THERMAL_MODULE_TEMP_HIGH	70000	/* 70C */
 #define MLXSW_THERMAL_MODULE_TEMP_HOT	80000	/* 80C */
--
2.11.0
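The effect of the new critical threshold can be illustrated in user space. The following is a minimal sketch, not driver code: the threshold macros are copied from the patched file, while the enum, `asic_temp_classify()`, and `asic_temp_selfcheck()` are hypothetical names introduced here. It also assumes that a trip point counts as crossed once the temperature is greater than or equal to the threshold.

```c
#include <assert.h>

/* ASIC thresholds as defined after the patch, in millidegrees Celsius. */
#define MLXSW_THERMAL_ASIC_TEMP_NORM	75000	/* 75C */
#define MLXSW_THERMAL_ASIC_TEMP_HIGH	85000	/* 85C */
#define MLXSW_THERMAL_ASIC_TEMP_HOT	105000	/* 105C */
#define MLXSW_THERMAL_ASIC_TEMP_CRIT	140000	/* 140C */

/* Hypothetical levels, for illustration only (not part of the driver). */
enum asic_temp_level {
	ASIC_TEMP_LEVEL_BELOW_NORM,
	ASIC_TEMP_LEVEL_NORM,
	ASIC_TEMP_LEVEL_HIGH,
	ASIC_TEMP_LEVEL_HOT,
	ASIC_TEMP_LEVEL_CRIT,
};

/*
 * Return the highest trip level reached by temp_mc (millidegrees Celsius).
 * Assumption: a trip is crossed when the temperature is >= its threshold.
 */
enum asic_temp_level asic_temp_classify(int temp_mc)
{
	if (temp_mc >= MLXSW_THERMAL_ASIC_TEMP_CRIT)
		return ASIC_TEMP_LEVEL_CRIT;
	if (temp_mc >= MLXSW_THERMAL_ASIC_TEMP_HOT)
		return ASIC_TEMP_LEVEL_HOT;
	if (temp_mc >= MLXSW_THERMAL_ASIC_TEMP_HIGH)
		return ASIC_TEMP_LEVEL_HIGH;
	if (temp_mc >= MLXSW_THERMAL_ASIC_TEMP_NORM)
		return ASIC_TEMP_LEVEL_NORM;
	return ASIC_TEMP_LEVEL_BELOW_NORM;
}

/* Self-check: 110C was the old critical threshold and is now only "hot". */
void asic_temp_selfcheck(void)
{
	assert(asic_temp_classify(110000) == ASIC_TEMP_LEVEL_HOT);
	assert(asic_temp_classify(139999) == ASIC_TEMP_LEVEL_HOT);
	assert(asic_temp_classify(140000) == ASIC_TEMP_LEVEL_CRIT);
}
```

Under these assumptions, readings between 110C and 140C leave the shutdown decision to firmware, and the software critical trip is reached only at 140C.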
54 changes: 54 additions & 0 deletions
patch/0023-mlxsw-core-Add-validation-of-transceiver-temperature.patch
@@ -0,0 +1,54 @@
From 8845c46138eca323a3fe9cd1332190f092f35888 Mon Sep 17 00:00:00 2001
From: Vadim Pasternak <vadimp@nvidia.com>
Date: Thu, 7 Jan 2021 12:56:21 +0200
Subject: [PATCH mlxsw/backport 2/2] mlxsw: core: Add validation of transceiver
 temperature thresholds

Validate thresholds to avoid a single failure due to some transceiver
unreliability. Ignore the last readouts in case warning temperature is
above alarm temperature, since it can cause unexpected thermal
shutdown. Stay with the previous values and refresh threshold within
the next iteration.

This is a rare scenario, but it has been observed once at a customer
site.

Fixes: 6a79507cfe94 ("mlxsw: core: Extend thermal module with per QSFP module thermal zones")
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlxsw/core_thermal.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
index 54d0e8b8d..477c3ed53 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
@@ -183,6 +183,12 @@ mlxsw_thermal_module_trips_update(struct device *dev, struct mlxsw_core *core,
 	if (err)
 		return err;
 
+	if (crit_temp > emerg_temp) {
+		dev_warn(dev, "%s : Critical threshold %d is above emergency threshold %d\n",
+			 tz->tzdev->type, crit_temp, emerg_temp);
+		return 0;
+	}
+
 	/* According to the system thermal requirements, the thermal zones are
 	 * defined with four trip points. The critical and emergency
 	 * temperature thresholds, provided by QSFP module are set as "active"
@@ -197,11 +203,8 @@ mlxsw_thermal_module_trips_update(struct device *dev, struct mlxsw_core *core,
 		tz->trips[MLXSW_THERMAL_TEMP_TRIP_NORM].temp = crit_temp;
 	tz->trips[MLXSW_THERMAL_TEMP_TRIP_HIGH].temp = crit_temp;
 	tz->trips[MLXSW_THERMAL_TEMP_TRIP_HOT].temp = emerg_temp;
-	if (emerg_temp > crit_temp)
-		tz->trips[MLXSW_THERMAL_TEMP_TRIP_CRIT].temp = emerg_temp +
+	tz->trips[MLXSW_THERMAL_TEMP_TRIP_CRIT].temp = emerg_temp +
 					MLXSW_THERMAL_MODULE_TEMP_SHIFT;
-	else
-		tz->trips[MLXSW_THERMAL_TEMP_TRIP_CRIT].temp = emerg_temp;
 
 	return 0;
 }
--
2.11.0
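The validation added by this patch can be modelled in user space to show the "keep previous values" behaviour. This is a simplified sketch under stated assumptions, not the driver function: `struct module_zone`, `module_trips_update()`, and the `TRIP_*` indices are hypothetical stand-ins, `MODULE_TEMP_SHIFT` uses an assumed value because the definition of `MLXSW_THERMAL_MODULE_TEMP_SHIFT` is not part of the hunks shown, and the "normal" trip is left untouched because its derivation lies partly outside the displayed context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed value for illustration; the real macro is defined outside the hunks shown. */
#define MODULE_TEMP_SHIFT	10000	/* millidegrees Celsius */

/* Hypothetical trip indices mirroring the four trip points named in the patch. */
enum { TRIP_NORM, TRIP_HIGH, TRIP_HOT, TRIP_CRIT, TRIP_COUNT };

/* Hypothetical stand-in for the per-module thermal zone. */
struct module_zone {
	const char *type;
	int trips[TRIP_COUNT];	/* millidegrees Celsius */
};

/*
 * Update the trip points from the thresholds read from the transceiver.
 * If the readout is inconsistent (critical above emergency), warn and
 * return without touching the trips, so the previous values stay in
 * effect until the next iteration.
 */
int module_trips_update(struct module_zone *tz, int crit_temp, int emerg_temp)
{
	assert(tz != NULL);

	if (crit_temp > emerg_temp) {
		fprintf(stderr,
			"%s : Critical threshold %d is above emergency threshold %d\n",
			tz->type, crit_temp, emerg_temp);
		return 0;
	}

	/* TRIP_NORM is intentionally not modelled in this sketch. */
	tz->trips[TRIP_HIGH] = crit_temp;
	tz->trips[TRIP_HOT] = emerg_temp;
	/* After the patch the shift is always added, including when
	 * crit_temp == emerg_temp. */
	tz->trips[TRIP_CRIT] = emerg_temp + MODULE_TEMP_SHIFT;

	return 0;
}
```

Returning 0 on the inconsistent readout follows the patch: a bad sample is treated as a transient condition rather than an error, so the caller continues with the last known good thresholds.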