Commit b6e0b6be authored by Shay Drory's avatar Shay Drory Committed by Saeed Mahameed
Browse files

net/mlx5: Fix fatal error handling during device load



Currently, in case of fatal error during mlx5_load_one(), we cannot
enter error state until mlx5_load_one() is finished, what can take
several minutes until commands will get timeouts, because these commands
can't be processed due to the fatal error.
Fix it by setting dev->state as MLX5_DEVICE_STATE_INTERNAL_ERROR before
requesting the lock.

Fixes: c1d4d2e9 ("net/mlx5: Avoid calling sleeping function by the health poll thread")
Signed-off-by: default avatarShay Drory <shayd@mellanox.com>
Reviewed-by: default avatarMoshe Shemesh <moshe@mellanox.com>
Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
parent 42ea9f1b
Loading
Loading
Loading
Loading
+11 −3
Original line number Diff line number Diff line
@@ -192,15 +192,23 @@ static bool reset_fw_if_needed(struct mlx5_core_dev *dev)

void mlx5_enter_error_state(struct mlx5_core_dev *dev, bool force)
{
	bool err_detected = false;

	/* Mark the device as fatal in order to abort FW commands */
	if ((check_fatal_sensors(dev) || force) &&
	    dev->state == MLX5_DEVICE_STATE_UP) {
		dev->state = MLX5_DEVICE_STATE_INTERNAL_ERROR;
		err_detected = true;
	}
	mutex_lock(&dev->intf_state_mutex);
	if (dev->state == MLX5_DEVICE_STATE_INTERNAL_ERROR)
		goto unlock;
	if (!err_detected && dev->state == MLX5_DEVICE_STATE_INTERNAL_ERROR)
		goto unlock;/* a previous error is still being handled */
	if (dev->state == MLX5_DEVICE_STATE_UNINITIALIZED) {
		dev->state = MLX5_DEVICE_STATE_INTERNAL_ERROR;
		goto unlock;
	}

	if (check_fatal_sensors(dev) || force) {
	if (check_fatal_sensors(dev) || force) { /* protected state setting */
		dev->state = MLX5_DEVICE_STATE_INTERNAL_ERROR;
		mlx5_cmd_flush(dev);
	}