[dpdk-dev] [RFC] Introduce virtual PMD for Hyper-V/Azure platforms
Adrien Mazarguil
2017-11-24 17:21:32 UTC
Virtual machines hosted by Hyper-V/Azure platforms are fitted with
simplified virtual network devices named NetVSC that are used for fast
communication between VMs, between VMs and the hypervisor, and with the
outside.

They appear as standard system netdevices to user-land applications, the
main difference being they are implemented on top of VMBUS [1] instead of
emulated PCI devices.

While this reads like a case for a standard DPDK PMD, there is more to it.

To accelerate outside communication, NetVSC devices as they appear in a VM
can be paired with physical SR-IOV virtual function (VF) devices owned by
that same VM [2]. Both netdevices share the same MAC address in that case.
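
Since a NetVSC interface and its VF counterpart share the same MAC address, pairing them can be sketched as a simple address comparison. The snippet below is only an illustration under that assumption; struct netdev, its fields and find_paired_vf() are hypothetical names and not part of any DPDK or kernel API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of a system netdevice. */
struct netdev {
	const char *name;     /* interface name, e.g. "eth0" */
	unsigned char mac[6]; /* Ethernet address */
	int is_netvsc;        /* nonzero when backed by VMBUS/NetVSC */
};

/*
 * Return the VF paired with the given NetVSC interface, that is, the
 * first non-NetVSC device carrying an identical MAC address, or NULL
 * when no such device is currently plugged in.
 */
static const struct netdev *
find_paired_vf(const struct netdev *netvsc,
	       const struct netdev *devs, size_t n)
{
	size_t i;

	for (i = 0; i < n; ++i) {
		if (&devs[i] == netvsc || devs[i].is_netvsc)
			continue;
		if (!memcmp(devs[i].mac, netvsc->mac, sizeof(netvsc->mac)))
			return &devs[i];
	}
	return NULL;
}
```

A NULL result corresponds to the fallback case where only NetVSC carries traffic.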

When paired, egress and most of the ingress traffic flow through the VF
device, while part of it (e.g. multicasts, hypervisor control data) still
flows through NetVSC. Moreover VF devices are not retained and disappear
during VM migration; from a VM standpoint, they can be hot-plugged anytime
with NetVSC acting as a fallback.

Running DPDK applications in such a context involves driving VF devices
using their dedicated PMDs in a vendor-independent fashion (to benefit from
maximum performance without writing dedicated code) while simultaneously
listening to NetVSC and handling the related hot-plug events.

This new virtual PMD (referred to as "hyper-v" from this point on)
automatically coordinates the Hyper-V/Azure-specific management part
described above by relying on vendor-specific, fail-safe and tap PMDs to
expose a single consolidated Ethernet device usable directly by existing
applications, as summarized by the following diagram:

          .-------------.
          | DPDK ethdev |
          `------+------'
                 |
          .------+------.
          | hyper-v PMD |
          `------+------'
                 |
    .------------+------------.
    |      fail-safe PMD      |
    `--+-------------------+--'
       |                   |
       |          .........|.........
       |          :        |        :
  .----+----.     : .------+------. :
  | tap PMD |     : | $vendor PMD | :
  `----+----'     : `------+------' :--- hot-pluggable
       |          :        |        :
.------+-------.  :  .-----+-----.  :
| NetVSC-based |  :  | SR-IOV VF |  :
|  netdevice   |  :  |  device   |  :
`--------------'  :  `-----------'  :
                  :.................:
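
For comparison, the lower half of this stack (fail-safe on top of tap and a VF PMD) can probably be approximated by hand through a --vdev argument. The sketch below formats such an argument; it assumes the fail-safe PMD's dev() sub-device syntax and the tap PMD's remote parameter as I understand them from their documentation. format_failsafe_args() is a hypothetical helper, and the PCI address and interface name in the usage note are placeholders.

```c
#include <stdio.h>

/*
 * Format a fail-safe vdev argument combining a VF (by PCI address) and
 * a tap sub-device attached to a NetVSC netdevice. Returns the
 * snprintf() result: the length needed, or a negative value on error.
 */
static int
format_failsafe_args(char *buf, size_t len, unsigned int id,
		     const char *vf_pci_addr, const char *netvsc_iface)
{
	return snprintf(buf, len,
			"net_failsafe%u,dev(%s),dev(net_tap%u,remote=%s)",
			id, vf_pci_addr, id, netvsc_iface);
}
```

Called with id 0, "0002:00:02.0" and "eth0", this yields a string that could be passed after --vdev. What such a manual setup lacks is the automatic pairing and hot-plug tracking this RFC is about.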

Given this RFC targets DPDK 18.02, this approach has the least impact on
applications while work is being performed to enhance public DPDK APIs to
improve it (e.g. hot-plug notification, vdev bus scanning and so on).

Some highlights:

- Enables existing applications to run unmodified with maximum performance
on Hyper-V/Azure platforms.

- All changes should be restricted to the hyper-v PMD (possibly a few in
fail-safe PMD), no API change in DPDK.

- Modular approach with little maintenance overhead (not much code) that
will rely on existing PMDs for all the heavy lifting.

[1] http://dpdk.org/ml/archives/dev/2017-January/054165.html
[2] https://docs.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-hyper-v
--
Adrien Mazarguil
6WIND
Adrien Mazarguil
2017-12-18 16:46:19 UTC
Virtual machines hosted by Hyper-V/Azure platforms are fitted with
simplified virtual network devices named NetVSC that are used for fast
communication between VMs, between VMs and the hypervisor, and with the
outside.

They appear as standard system netdevices to user-land applications, the
main difference being they are implemented on top of VMBUS [1] instead of
emulated PCI devices.

While this reads like a case for a standard DPDK PMD, there is more to it.

To accelerate outside communication, NetVSC devices as they appear in a VM
can be paired with physical SR-IOV virtual function (VF) devices owned by
that same VM [2]. Both netdevices share the same MAC address in that case.

When paired, egress and most of the ingress traffic flow through the VF
device, while part of it (e.g. multicasts, hypervisor control data) still
flows through NetVSC. Moreover VF devices are not retained and disappear
during VM migration; from a VM standpoint, they can be hot-plugged anytime
with NetVSC acting as a fallback.

Running DPDK applications in such a context involves driving VF devices
using their dedicated PMDs in a vendor-independent fashion (to benefit from
maximum performance without writing dedicated code) while simultaneously
listening to NetVSC and handling the related hot-plug events.

This new virtual PMD (referred to as "hyperv" from this point on)
automatically coordinates the Hyper-V/Azure-specific management part
described above by relying on vendor-specific, failsafe and tap PMDs to
expose a single consolidated Ethernet device usable directly by existing
applications.

        .------------------.
        | DPDK application |
        `--------+---------'
                 |
          .------+------.
          | DPDK ethdev |
          `------+------'       Control
                 |                 |
    .------------+------------.    v    .------------.
    |      failsafe PMD       +---------+ hyperv PMD |
    `--+-------------------+--'         `------------'
       |                   |
       |          .........|.........
       |          :        |        :
  .----+----.     :   .----+----.   :
  | tap PMD |     :   | any PMD |   :
  `----+----'     :   `----+----'   : <-- Hot-pluggable
       |          :        |        :
.------+-------.  :  .-----+-----.  :
| NetVSC-based |  :  | SR-IOV VF |  :
|  netdevice   |  :  |  device   |  :
`--------------'  :  `-----------'  :
                  :.................:

Note this diagram differs from that of the original RFC [3], with hyperv no
longer acting as a data plane layer.

This initial version of the driver only works in whitelist mode. Users have
to provide the --vdev net_hyperv EAL option at least once to trigger it.

Subsequent work will add support for blacklist mode based on automatic
detection of the host environment.
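
In whitelist mode the driver is requested explicitly, optionally with a comma-separated key=value list after its name (something like --vdev net_hyperv,iface=eth0). The first patch of this series declares "iface" and "mac" as the accepted keys and hands validation to rte_kvargs_parse(). The standalone sketch below mimics that kind of validation without DPDK, purely for illustration; check_vdev_args() and valid_keys are hypothetical names, and the exact behavior of rte_kvargs may differ in corner cases.

```c
#include <stddef.h>
#include <string.h>

/* Keys declared by the first patch of the series. */
static const char *const valid_keys[] = { "iface", "mac", NULL };

/*
 * Check that every comma-separated "key=value" pair in args uses a
 * known key. Returns the number of pairs found, or -1 when a pair is
 * malformed or uses an unknown key. A NULL or empty list yields 0.
 */
static int
check_vdev_args(const char *args)
{
	int count = 0;

	while (args && *args) {
		size_t pair_len = strcspn(args, ",");
		const char *eq = memchr(args, '=', pair_len);
		size_t key_len;
		size_t i;

		if (!eq || eq == args)
			return -1;
		key_len = (size_t)(eq - args);
		for (i = 0; valid_keys[i]; ++i)
			if (strlen(valid_keys[i]) == key_len &&
			    !strncmp(valid_keys[i], args, key_len))
				break;
		if (!valid_keys[i])
			return -1;
		++count;
		args += pair_len;
		if (*args == ',')
			++args;
	}
	return count;
}
```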

[1] http://dpdk.org/ml/archives/dev/2017-January/054165.html
[2] https://docs.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-hyper-v
[3] http://dpdk.org/ml/archives/dev/2017-November/082339.html

Adrien Mazarguil (3):
net/hyperv: introduce MS Hyper-V platform driver
net/hyperv: implement core functionality
net/hyperv: add "force" parameter

MAINTAINERS | 6 +
config/common_base | 6 +
config/common_linuxapp | 1 +
doc/guides/nics/features/hyperv.ini | 12 +
doc/guides/nics/hyperv.rst | 119 +++
doc/guides/nics/index.rst | 1 +
drivers/net/Makefile | 1 +
drivers/net/hyperv/Makefile | 58 ++
drivers/net/hyperv/hyperv.c | 799 +++++++++++++++++++++
drivers/net/hyperv/rte_pmd_hyperv_version.map | 4 +
mk/rte.app.mk | 1 +
11 files changed, 1008 insertions(+)
create mode 100644 doc/guides/nics/features/hyperv.ini
create mode 100644 doc/guides/nics/hyperv.rst
create mode 100644 drivers/net/hyperv/Makefile
create mode 100644 drivers/net/hyperv/hyperv.c
create mode 100644 drivers/net/hyperv/rte_pmd_hyperv_version.map
--
2.11.0
Adrien Mazarguil
2017-12-18 16:46:21 UTC
This patch lays the groundwork for this PMD (draft documentation, copyright
notices, code base skeleton and build system hooks). While it can be
successfully compiled and invoked, it's an empty shell at this stage.

Signed-off-by: Adrien Mazarguil <***@6wind.com>
---
MAINTAINERS | 6 +
config/common_base | 6 +
config/common_linuxapp | 1 +
doc/guides/nics/features/hyperv.ini | 12 ++
doc/guides/nics/hyperv.rst | 49 ++++++++
doc/guides/nics/index.rst | 1 +
drivers/net/Makefile | 1 +
drivers/net/hyperv/Makefile | 54 +++++++++
drivers/net/hyperv/hyperv.c | 135 +++++++++++++++++++++
drivers/net/hyperv/rte_pmd_hyperv_version.map | 4 +
mk/rte.app.mk | 1 +
11 files changed, 270 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 5a63b40c2..fe686f4c5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -451,6 +451,12 @@ F: drivers/net/mrvl/
F: doc/guides/nics/mrvl.rst
F: doc/guides/nics/features/mrvl.ini

+Microsoft hyperv
+M: Adrien Mazarguil <***@6wind.com>
+F: drivers/net/hyperv/
+F: doc/guides/nics/hyperv.rst
+F: doc/guides/nics/features/hyperv.ini
+
Netcope szedata2
M: Matej Vido <***@cesnet.cz>
F: drivers/net/szedata2/
diff --git a/config/common_base b/config/common_base
index b8ee8f91c..8bc83c8c9 100644
--- a/config/common_base
+++ b/config/common_base
@@ -280,6 +280,12 @@ CONFIG_RTE_LIBRTE_NFP_DEBUG=n
CONFIG_RTE_LIBRTE_MRVL_PMD=n

#
+# Compile Microsoft Hyper-V/Azure driver
+#
+CONFIG_RTE_LIBRTE_HYPERV_PMD=n
+CONFIG_RTE_LIBRTE_HYPERV_DEBUG=n
+
+#
# Compile burst-oriented Broadcom BNXT PMD driver
#
CONFIG_RTE_LIBRTE_BNXT_PMD=y
diff --git a/config/common_linuxapp b/config/common_linuxapp
index 74c7d64ec..fac6cb172 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -47,6 +47,7 @@ CONFIG_RTE_LIBRTE_PMD_VHOST=y
CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
CONFIG_RTE_LIBRTE_PMD_TAP=y
CONFIG_RTE_LIBRTE_AVP_PMD=y
+CONFIG_RTE_LIBRTE_HYPERV_PMD=y
CONFIG_RTE_LIBRTE_NFP_PMD=y
CONFIG_RTE_LIBRTE_POWER=y
CONFIG_RTE_VIRTIO_USER=y
diff --git a/doc/guides/nics/features/hyperv.ini b/doc/guides/nics/features/hyperv.ini
new file mode 100644
index 000000000..170912c25
--- /dev/null
+++ b/doc/guides/nics/features/hyperv.ini
@@ -0,0 +1,12 @@
+;
+; Supported features of the 'hyperv' network poll mode driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+ARMv7 = Y
+ARMv8 = Y
+Power8 = Y
+x86-32 = Y
+x86-64 = Y
+Usage doc = Y
diff --git a/doc/guides/nics/hyperv.rst b/doc/guides/nics/hyperv.rst
new file mode 100644
index 000000000..28c4443d6
--- /dev/null
+++ b/doc/guides/nics/hyperv.rst
@@ -0,0 +1,49 @@
+.. BSD LICENSE
+ Copyright 2017 6WIND S.A.
+ Copyright 2017 Mellanox
+
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted provided that the following conditions
+ are met:
+
+ * Redistributions of source code must retain the above copyright
+ notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+ notice, this list of conditions and the following disclaimer in
+ the documentation and/or other materials provided with the
+ distribution.
+ * Neither the name of 6WIND S.A. nor the names of its
+ contributors may be used to endorse or promote products derived
+ from this software without specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+HYPERV poll mode driver
+=======================
+
+The HYPERV PMD (librte_pmd_hyperv) provides support for NetVSC interfaces
+and associated SR-IOV virtual function (VF) devices found in Linux virtual
+machines running on Microsoft Hyper-V_ (including Azure) platforms.
+
+.. _Hyper-V: https://docs.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-hyper-v
+
+Build options
+-------------
+
+- ``CONFIG_RTE_LIBRTE_HYPERV_PMD`` (default ``y``)
+
+ Toggle compilation of this driver.
+
+- ``CONFIG_RTE_LIBRTE_HYPERV_DEBUG`` (default ``n``)
+
+ Toggle additional debugging code.
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index 23babe933..9d66353a1 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -49,6 +49,7 @@ Network Interface Controller Drivers
ena
enic
fm10k
+ hyperv
i40e
ixgbe
intel_vf
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index ef09b4e16..5bcc37cb3 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -55,6 +55,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_LIO_PMD) += liquidio
DIRS-$(CONFIG_RTE_LIBRTE_MLX4_PMD) += mlx4
DIRS-$(CONFIG_RTE_LIBRTE_MLX5_PMD) += mlx5
DIRS-$(CONFIG_RTE_LIBRTE_MRVL_PMD) += mrvl
+DIRS-$(CONFIG_RTE_LIBRTE_HYPERV_PMD) += hyperv
DIRS-$(CONFIG_RTE_LIBRTE_NFP_PMD) += nfp
DIRS-$(CONFIG_RTE_LIBRTE_BNXT_PMD) += bnxt
DIRS-$(CONFIG_RTE_LIBRTE_PMD_NULL) += null
diff --git a/drivers/net/hyperv/Makefile b/drivers/net/hyperv/Makefile
new file mode 100644
index 000000000..82c720353
--- /dev/null
+++ b/drivers/net/hyperv/Makefile
@@ -0,0 +1,54 @@
+# BSD LICENSE
+#
+# Copyright 2017 6WIND S.A.
+# Copyright 2017 Mellanox
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in
+# the documentation and/or other materials provided with the
+# distribution.
+# * Neither the name of 6WIND S.A. nor the names of its
+# contributors may be used to endorse or promote products derived
+# from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+# Properties of the generated library.
+LIB = librte_pmd_hyperv.a
+LIBABIVER := 1
+EXPORT_MAP := rte_pmd_hyperv_version.map
+
+# Additional compilation flags.
+CFLAGS += -O3
+CFLAGS += -g
+CFLAGS += -std=c11 -pedantic -Wall -Wextra
+CFLAGS += $(WERROR_FLAGS)
+
+# Dependencies.
+LDLIBS += -lrte_bus_vdev
+LDLIBS += -lrte_eal
+LDLIBS += -lrte_ethdev
+LDLIBS += -lrte_kvargs
+
+# Source files.
+SRCS-$(CONFIG_RTE_LIBRTE_HYPERV_PMD) += hyperv.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/hyperv/hyperv.c b/drivers/net/hyperv/hyperv.c
new file mode 100644
index 000000000..2f940c76f
--- /dev/null
+++ b/drivers/net/hyperv/hyperv.c
@@ -0,0 +1,135 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright 2017 6WIND S.A.
+ * Copyright 2017 Mellanox
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of 6WIND S.A. nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stddef.h>
+#include <string.h>
+
+#include <rte_bus_vdev.h>
+#include <rte_config.h>
+#include <rte_kvargs.h>
+#include <rte_log.h>
+
+#define HYPERV_DRIVER net_hyperv
+#define HYPERV_ARG_IFACE "iface"
+#define HYPERV_ARG_MAC "mac"
+
+#ifdef RTE_LIBRTE_HYPERV_DEBUG
+
+#define PMD_DRV_LOG(level, ...) \
+	RTE_LOG(level, PMD, \
+		RTE_FMT("%s:%u: %s(): " RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
+			strrchr("/" __FILE__, '/') + 1, \
+			__LINE__, \
+			__func__, \
+			RTE_FMT_TAIL(__VA_ARGS__,)))
+
+#else /* RTE_LIBRTE_HYPERV_DEBUG */
+
+#define PMD_DRV_LOG(level, ...) \
+	RTE_LOG(level, PMD, \
+		RTE_FMT(RTE_STR(HYPERV_DRIVER) ": " \
+			RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
+		RTE_FMT_TAIL(__VA_ARGS__,)))
+
+#endif /* RTE_LIBRTE_HYPERV_DEBUG */
+
+#define DEBUG(...) PMD_DRV_LOG(DEBUG, __VA_ARGS__)
+#define INFO(...) PMD_DRV_LOG(INFO, __VA_ARGS__)
+#define WARN(...) PMD_DRV_LOG(WARNING, __VA_ARGS__)
+#define ERROR(...) PMD_DRV_LOG(ERR, __VA_ARGS__)
+
+/** Number of PMD instances relying on context list. */
+static unsigned int hyperv_ctx_inst;
+
+/**
+ * Probe NetVSC interfaces.
+ *
+ * @param dev
+ * Virtual device context for PMD instance.
+ *
+ * @return
+ * Always 0, even in case of errors.
+ */
+static int
+hyperv_vdev_probe(struct rte_vdev_device *dev)
+{
+	static const char *const hyperv_arg[] = {
+		HYPERV_ARG_IFACE,
+		HYPERV_ARG_MAC,
+		NULL,
+	};
+	const char *name = rte_vdev_device_name(dev);
+	const char *args = rte_vdev_device_args(dev);
+	struct rte_kvargs *kvargs = rte_kvargs_parse(args ? args : "",
+						     hyperv_arg);
+
+	DEBUG("invoked as \"%s\", using arguments \"%s\"", name, args);
+	if (!kvargs) {
+		ERROR("cannot parse arguments list");
+		goto error;
+	}
+error:
+	if (kvargs)
+		rte_kvargs_free(kvargs);
+	++hyperv_ctx_inst;
+	return 0;
+}
+
+/**
+ * Remove PMD instance.
+ *
+ * @param dev
+ * Virtual device context for PMD instance.
+ *
+ * @return
+ * Always 0.
+ */
+static int
+hyperv_vdev_remove(struct rte_vdev_device *dev)
+{
+	(void)dev;
+	--hyperv_ctx_inst;
+	return 0;
+}
+
+/** Virtual device descriptor. */
+static struct rte_vdev_driver hyperv_vdev = {
+	.probe = hyperv_vdev_probe,
+	.remove = hyperv_vdev_remove,
+};
+
+RTE_PMD_REGISTER_VDEV(HYPERV_DRIVER, hyperv_vdev);
+RTE_PMD_REGISTER_ALIAS(HYPERV_DRIVER, eth_hyperv);
+RTE_PMD_REGISTER_PARAM_STRING(net_hyperv,
+			      HYPERV_ARG_IFACE "=<string> "
+			      HYPERV_ARG_MAC "=<string>");
diff --git a/drivers/net/hyperv/rte_pmd_hyperv_version.map b/drivers/net/hyperv/rte_pmd_hyperv_version.map
new file mode 100644
index 000000000..179140fb8
--- /dev/null
+++ b/drivers/net/hyperv/rte_pmd_hyperv_version.map
@@ -0,0 +1,4 @@
+DPDK_18.02 {
+
+ local: *;
+};
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 6a6a7452e..b0701c49f 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -134,6 +134,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_E1000_PMD) += -lrte_pmd_e1000
_LDLIBS-$(CONFIG_RTE_LIBRTE_ENA_PMD) += -lrte_pmd_ena
_LDLIBS-$(CONFIG_RTE_LIBRTE_ENIC_PMD) += -lrte_pmd_enic
_LDLIBS-$(CONFIG_RTE_LIBRTE_FM10K_PMD) += -lrte_pmd_fm10k
+_LDLIBS-$(CONFIG_RTE_LIBRTE_HYPERV_PMD) += -lrte_pmd_hyperv
_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_FAILSAFE) += -lrte_pmd_failsafe
_LDLIBS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += -lrte_pmd_i40e
_LDLIBS-$(CONFIG_RTE_LIBRTE_IXGBE_PMD) += -lrte_pmd_ixgbe
--
2.11.0
Stephen Hemminger
2017-12-18 18:28:35 UTC
On Mon, 18 Dec 2017 17:46:21 +0100
Post by Adrien Mazarguil
+#ifdef RTE_LIBRTE_HYPERV_DEBUG
+
+#define PMD_DRV_LOG(level, ...) \
+ RTE_LOG(level, PMD, \
+ RTE_FMT("%s:%u: %s(): " RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
+ strrchr("/" __FILE__, '/') + 1, \
+ __LINE__, \
+ __func__, \
+ RTE_FMT_TAIL(__VA_ARGS__,)))
+
+#else /* RTE_LIBRTE_HYPERV_DEBUG */
+
+#define PMD_DRV_LOG(level, ...) \
+ RTE_LOG(level, PMD, \
+ RTE_FMT(RTE_STR(HYPERV_DRIVER) ": " \
+ RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
+ RTE_FMT_TAIL(__VA_ARGS__,)))
+
+#endif /* RTE_LIBRTE_HYPERV_DEBUG */
+
+#define DEBUG(...) PMD_DRV_LOG(DEBUG, __VA_ARGS__)
+#define INFO(...) PMD_DRV_LOG(INFO, __VA_ARGS__)
+#define WARN(...) PMD_DRV_LOG(WARNING, __VA_ARGS__)
+#define ERROR(...) PMD_DRV_LOG(ERR, __VA_ARGS__)
+
Please don't use DEBUG() etc. macros. It makes it easier for tools that do
global updates or scans if all drivers use the same model of PMD_DRV_LOG.
Thomas Monjalon
2017-12-18 19:54:16 UTC
Post by Stephen Hemminger
On Mon, 18 Dec 2017 17:46:21 +0100
Post by Adrien Mazarguil
+#ifdef RTE_LIBRTE_HYPERV_DEBUG
+
+#define PMD_DRV_LOG(level, ...) \
+ RTE_LOG(level, PMD, \
+ RTE_FMT("%s:%u: %s(): " RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
+ strrchr("/" __FILE__, '/') + 1, \
+ __LINE__, \
+ __func__, \
+ RTE_FMT_TAIL(__VA_ARGS__,)))
+
+#else /* RTE_LIBRTE_HYPERV_DEBUG */
+
+#define PMD_DRV_LOG(level, ...) \
+ RTE_LOG(level, PMD, \
+ RTE_FMT(RTE_STR(HYPERV_DRIVER) ": " \
+ RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
+ RTE_FMT_TAIL(__VA_ARGS__,)))
+
+#endif /* RTE_LIBRTE_HYPERV_DEBUG */
+
+#define DEBUG(...) PMD_DRV_LOG(DEBUG, __VA_ARGS__)
+#define INFO(...) PMD_DRV_LOG(INFO, __VA_ARGS__)
+#define WARN(...) PMD_DRV_LOG(WARNING, __VA_ARGS__)
+#define ERROR(...) PMD_DRV_LOG(ERR, __VA_ARGS__)
+
Please don't use DEBUG() etc macros. It makes it easier for tools that do
global updates or scans if all drivers use the same model of PMD_DRV_LOG
The new standard is to use dynamic logtype.
Stephen Hemminger
2017-12-18 21:17:51 UTC
On Mon, 18 Dec 2017 20:54:16 +0100
Post by Thomas Monjalon
Post by Stephen Hemminger
Post by Adrien Mazarguil
+#endif /* RTE_LIBRTE_HYPERV_DEBUG */
+
+#define DEBUG(...) PMD_DRV_LOG(DEBUG, __VA_ARGS__)
+#define INFO(...) PMD_DRV_LOG(INFO, __VA_ARGS__)
+#define WARN(...) PMD_DRV_LOG(WARNING, __VA_ARGS__)
+#define ERROR(...) PMD_DRV_LOG(ERR, __VA_ARGS__)
+
Please don't use DEBUG() etc macros. It makes it easier for tools that do
global updates or scans if all drivers use the same model of PMD_DRV_LOG
The new standard is to use dynamic logtype.
Agreed, please use dynamic logging, and also don't define new macros like DEBUG/INFO/WARN/ERROR.
Instead use PMD_DRV_LOG or equivalent macros.

The base rule here is that all drivers should look the same as much
as reasonably possible. This makes reviewers of other subsystems more likely
to see problems. It also allows for later changes where some developer does a global
improvement across many PMDs.

Drivers should not be snowflakes; each one is not unique.
Adrien Mazarguil
2017-12-19 10:01:17 UTC
Post by Stephen Hemminger
On Mon, 18 Dec 2017 20:54:16 +0100
Post by Thomas Monjalon
Post by Stephen Hemminger
Post by Adrien Mazarguil
+#endif /* RTE_LIBRTE_HYPERV_DEBUG */
+
+#define DEBUG(...) PMD_DRV_LOG(DEBUG, __VA_ARGS__)
+#define INFO(...) PMD_DRV_LOG(INFO, __VA_ARGS__)
+#define WARN(...) PMD_DRV_LOG(WARNING, __VA_ARGS__)
+#define ERROR(...) PMD_DRV_LOG(ERR, __VA_ARGS__)
+
Please don't use DEBUG() etc macros. It makes it easier for tools that do
global updates or scans if all drivers use the same model of PMD_DRV_LOG
The new standard is to use dynamic logtype.
Agree, please use dynamic logging, and also don't redefine new macros like DEBUG/INFO/WARN/ERROR.
Instead use PMD_DRV_LOG or equivalent macros.
Wait, the above definitions are only convenience wrappers to PMD_DRV_LOG(),
itself a wrapper to RTE_LOG(), itself a wrapper to rte_log(). Their presence
does not depend on compilation options; did I miss something?

Let me bring back some context from the original patch:

#ifdef RTE_LIBRTE_HYPERV_DEBUG

#define PMD_DRV_LOG(level, ...) \
	RTE_LOG(level, PMD, \
		RTE_FMT("%s:%u: %s(): " RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
			strrchr("/" __FILE__, '/') + 1, \
			__LINE__, \
			__func__, \
			RTE_FMT_TAIL(__VA_ARGS__,)))

#else /* RTE_LIBRTE_HYPERV_DEBUG */

#define PMD_DRV_LOG(level, ...) \
	RTE_LOG(level, PMD, \
		RTE_FMT(RTE_STR(HYPERV_DRIVER) ": " \
			RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
		RTE_FMT_TAIL(__VA_ARGS__,)))

#endif /* RTE_LIBRTE_HYPERV_DEBUG */

#define DEBUG(...) PMD_DRV_LOG(DEBUG, __VA_ARGS__)
#define INFO(...) PMD_DRV_LOG(INFO, __VA_ARGS__)
#define WARN(...) PMD_DRV_LOG(WARNING, __VA_ARGS__)
#define ERROR(...) PMD_DRV_LOG(ERR, __VA_ARGS__)

Enabling RTE_LIBRTE_HYPERV_DEBUG adds file and line information to log
output, messages are otherwise unaffected by that compilation option. Adding
this information required some sort of wrapper to avoid needless clutter.

Nothing against outputting file/line information when compiled in debug mode,
right?
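
To illustrate what the debug variant adds, here is a standalone sketch of the same file/line/function prefix without any DPDK dependency. Prepending "/" to __FILE__ guarantees strrchr() finds a slash even when the path has no directory component, so adding 1 always lands on the base name. SRC_BASENAME, DEMO_LOG() and demo_log() are hypothetical names made up for this example; they write to a caller-provided buffer instead of going through rte_log().

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Base name of the current source file, computed as in the patch. */
#define SRC_BASENAME (strrchr("/" __FILE__, '/') + 1)

/* Format "file:line: func(): message" into buf; -1 on error/truncation. */
static int
demo_log(char *buf, size_t len, const char *file, unsigned int line,
	 const char *func, const char *fmt, ...)
{
	va_list ap;
	int off = snprintf(buf, len, "%s:%u: %s(): ", file, line, func);
	int ret;

	if (off < 0 || (size_t)off >= len)
		return -1;
	va_start(ap, fmt);
	ret = vsnprintf(buf + off, len - off, fmt, ap);
	va_end(ap);
	return ret < 0 ? -1 : off + ret;
}

/* Hypothetical stand-in for the debug flavor of PMD_DRV_LOG(). */
#define DEMO_LOG(buf, len, ...) \
	demo_log(buf, len, SRC_BASENAME, __LINE__, __func__, __VA_ARGS__)
```

The caller only supplies the message; the location prefix comes for free, which is the clutter the wrapper was meant to hide.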
Post by Stephen Hemminger
The base rule here is that all drivers should look the same as much
as reasonably possible. This makes reviewers of other subsystems more likely
to see problems. It also allows for later changes where some developer does a global
improvement across many PMD's.
Drivers should not be snowflakes, each one is not unique.
Point taken. Can you confirm that replacing e.g. WARN(...) with
PMD_DRV_LOG(WARNING, ...) and friends is all that's needed?
--
Adrien Mazarguil
6WIND
Thomas Monjalon
2017-12-19 11:15:38 UTC
Post by Adrien Mazarguil
Post by Stephen Hemminger
On Mon, 18 Dec 2017 20:54:16 +0100
Post by Thomas Monjalon
Post by Stephen Hemminger
Post by Adrien Mazarguil
+#endif /* RTE_LIBRTE_HYPERV_DEBUG */
+
+#define DEBUG(...) PMD_DRV_LOG(DEBUG, __VA_ARGS__)
+#define INFO(...) PMD_DRV_LOG(INFO, __VA_ARGS__)
+#define WARN(...) PMD_DRV_LOG(WARNING, __VA_ARGS__)
+#define ERROR(...) PMD_DRV_LOG(ERR, __VA_ARGS__)
+
Please don't use DEBUG() etc macros. It makes it easier for tools that do
global updates or scans if all drivers use the same model of PMD_DRV_LOG
The new standard is to use dynamic logtype.
Agree, please use dynamic logging, and also don't redefine new macros like DEBUG/INFO/WARN/ERROR.
Instead use PMD_DRV_LOG or equivalent macros.
Wait, the above definitions are only convenience wrappers to PMD_DRV_LOG(),
itself a wrapper to RTE_LOG(), itself a wrapper to rte_log(), their presence
is not triggered according to compilation options, did I miss something?
#ifdef RTE_LIBRTE_HYPERV_DEBUG
#define PMD_DRV_LOG(level, ...) \
RTE_LOG(level, PMD, \
RTE_FMT("%s:%u: %s(): " RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
strrchr("/" __FILE__, '/') + 1, \
__LINE__, \
__func__, \
RTE_FMT_TAIL(__VA_ARGS__,)))
#else /* RTE_LIBRTE_HYPERV_DEBUG */
#define PMD_DRV_LOG(level, ...) \
RTE_LOG(level, PMD, \
RTE_FMT(RTE_STR(HYPERV_DRIVER) ": " \
RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
RTE_FMT_TAIL(__VA_ARGS__,)))
#endif /* RTE_LIBRTE_HYPERV_DEBUG */
#define DEBUG(...) PMD_DRV_LOG(DEBUG, __VA_ARGS__)
#define INFO(...) PMD_DRV_LOG(INFO, __VA_ARGS__)
#define WARN(...) PMD_DRV_LOG(WARNING, __VA_ARGS__)
#define ERROR(...) PMD_DRV_LOG(ERR, __VA_ARGS__)
Enabling RTE_LIBRTE_HYPERV_DEBUG adds file and line information to log
output, messages are otherwise unaffected by that compilation option. Adding
this information required some sort of wrapper to avoid needless clutter.
Nothing against outputting file/line information when compiled in debug mode
right?
I am not sure __FILE__, __LINE__ and __func__ are that useful.
The log message should be unique enough.
Post by Adrien Mazarguil
Post by Stephen Hemminger
The base rule here is that all drivers should look the same as much
as reasonably possible. This makes reviewers of other subsystems more likely
to see problems. It also allows for later changes where some developer does a global
improvement across many PMD's.
Drivers should not be snowflakes, each one is not unique.
Point taken, do you confirm replacing i.e. WARN(...) with
PMD_DRV_LOG(WARN, ...) and friends is all that's needed?
You need to remove the compile-time option for DEBUG,
and rely on dynamic log type, thanks to rte_log_register().
Adrien Mazarguil
2017-12-19 13:13:43 UTC
Post by Thomas Monjalon
Post by Adrien Mazarguil
Post by Stephen Hemminger
On Mon, 18 Dec 2017 20:54:16 +0100
Post by Thomas Monjalon
Post by Stephen Hemminger
Post by Adrien Mazarguil
+#endif /* RTE_LIBRTE_HYPERV_DEBUG */
+
+#define DEBUG(...) PMD_DRV_LOG(DEBUG, __VA_ARGS__)
+#define INFO(...) PMD_DRV_LOG(INFO, __VA_ARGS__)
+#define WARN(...) PMD_DRV_LOG(WARNING, __VA_ARGS__)
+#define ERROR(...) PMD_DRV_LOG(ERR, __VA_ARGS__)
+
Please don't use DEBUG() etc macros. It makes it easier for tools that do
global updates or scans if all drivers use the same model of PMD_DRV_LOG
The new standard is to use dynamic logtype.
Agree, please use dynamic logging, and also don't redefine new macros like DEBUG/INFO/WARN/ERROR.
Instead use PMD_DRV_LOG or equivalent macros.
Wait, the above definitions are only convenience wrappers to PMD_DRV_LOG(),
itself a wrapper to RTE_LOG(), itself a wrapper to rte_log(), their presence
is not triggered according to compilation options, did I miss something?
#ifdef RTE_LIBRTE_HYPERV_DEBUG
#define PMD_DRV_LOG(level, ...) \
RTE_LOG(level, PMD, \
RTE_FMT("%s:%u: %s(): " RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
strrchr("/" __FILE__, '/') + 1, \
__LINE__, \
__func__, \
RTE_FMT_TAIL(__VA_ARGS__,)))
#else /* RTE_LIBRTE_HYPERV_DEBUG */
#define PMD_DRV_LOG(level, ...) \
RTE_LOG(level, PMD, \
RTE_FMT(RTE_STR(HYPERV_DRIVER) ": " \
RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
RTE_FMT_TAIL(__VA_ARGS__,)))
#endif /* RTE_LIBRTE_HYPERV_DEBUG */
#define DEBUG(...) PMD_DRV_LOG(DEBUG, __VA_ARGS__)
#define INFO(...) PMD_DRV_LOG(INFO, __VA_ARGS__)
#define WARN(...) PMD_DRV_LOG(WARNING, __VA_ARGS__)
#define ERROR(...) PMD_DRV_LOG(ERR, __VA_ARGS__)
Enabling RTE_LIBRTE_HYPERV_DEBUG adds file and line information to log
output, messages are otherwise unaffected by that compilation option. Adding
this information required some sort of wrapper to avoid needless clutter.
Nothing against outputting file/line information when compiled in debug mode
right?
I am not sure __FILE__, __LINE__ and __func__ are so much useful.
The log message should be unique enough.
I don't share your opinion. mlx4/mlx5 PMDs output similar information when
compiled in debug mode and that proved quite useful during development and
when tracking down bugs.

Thing is, mere users are not the target audience; it's a development tool
that doesn't need to be part of distributed binaries, hence the compilation
option.
Post by Thomas Monjalon
Post by Adrien Mazarguil
Post by Stephen Hemminger
The base rule here is that all drivers should look the same as much
as reasonably possible. This makes reviewers of other subsystems more likely
to see problems. It also allows for later changes where some developer does a global
improvement across many PMD's.
Drivers should not be snowflakes, each one is not unique.
Point taken, do you confirm replacing i.e. WARN(...) with
PMD_DRV_LOG(WARN, ...) and friends is all that's needed?
You need to remove the compile-time option for DEBUG,
and rely on dynamic log type, thanks to rte_log_register().
OK, I didn't know about rte_log_register(), which may explain some of the
confusion. I'll add it in v2 then.

To summarize what needs to be done for v2:

- Call rte_log_register() during init.
- Use its return value in place of the second argument to RTE_LOG().
- Replace DEBUG/WARN/INFO/ERROR() wrappers with direct calls to
PMD_DRV_LOG() for consistency with other PMDs.
- Finally, remove debugging code/information and related compilation option
since they're useless to end users.
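
The steps above can be sketched with a small self-contained model of a dynamic log type. This is not DPDK code: demo_log_register(), demo_log_set_level() and demo_log() are hypothetical stand-ins for what rte_log_register(), rte_log_set_level() and rte_log() would do in the real driver, the "pmd.net.hyperv" name in the usage note is only a guess, and the INFO default level is an arbitrary choice for the model.

```c
#include <stdarg.h>
#include <stdio.h>

/* Levels numbered like their RTE_LOG_* counterparts. */
enum { DEMO_ERR = 4, DEMO_WARNING = 5, DEMO_INFO = 7, DEMO_DEBUG = 8 };

#define DEMO_MAX_TYPES 8

static struct {
	const char *name;
	int level; /* most verbose level currently enabled */
} demo_types[DEMO_MAX_TYPES];
static int demo_ntypes;

/* Stand-in for rte_log_register(): returns a new type ID or -1. */
static int
demo_log_register(const char *name)
{
	if (demo_ntypes == DEMO_MAX_TYPES)
		return -1;
	demo_types[demo_ntypes].name = name;
	demo_types[demo_ntypes].level = DEMO_INFO;
	return demo_ntypes++;
}

/* Stand-in for rte_log_set_level(): changes verbosity at run time. */
static int
demo_log_set_level(int type, int level)
{
	if (type < 0 || type >= demo_ntypes)
		return -1;
	demo_types[type].level = level;
	return 0;
}

/* Stand-in for rte_log(): characters written, 0 when filtered out. */
static int
demo_log(int level, int type, const char *fmt, ...)
{
	va_list ap;
	int ret;

	if (type < 0 || type >= demo_ntypes ||
	    level > demo_types[type].level)
		return 0;
	va_start(ap, fmt);
	ret = vfprintf(stderr, fmt, ap);
	va_end(ap);
	return ret;
}

/* Set once during init from the registration return value. */
static int hyperv_logtype = -1;

/* The format must be a string literal for the prefix to concatenate. */
#define PMD_DRV_LOG(level, ...) \
	demo_log(DEMO_ ## level, hyperv_logtype, "net_hyperv: " __VA_ARGS__)
```

After hyperv_logtype = demo_log_register("pmd.net.hyperv"), a PMD_DRV_LOG(DEBUG, ...) call stays silent until the level is raised with demo_log_set_level(), with no rebuild involved, which is the point of dropping the compile-time option.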
--
Adrien Mazarguil
6WIND
Adrien Mazarguil
2017-12-18 16:46:23 UTC
As described in more detail in the attached documentation (see patch
contents), this virtual device driver manages NetVSC interfaces in virtual
machines hosted by Hyper-V/Azure platforms.

This driver manages neither traffic nor Ethernet devices directly; it acts
as a thin configuration layer that automatically instantiates and controls
fail-safe PMD instances combining tap and PCI sub-devices, so that each
NetVSC interface is exposed as a single consolidated port to DPDK
applications.

Since PCI sub-devices are hot-pluggable (e.g. during VM migration),
applications automatically benefit from increased throughput when these are
present and fall back on NetVSC otherwise, without interruption, thanks to
fail-safe's hot-plug handling.

Once initialized, the sole job of the hyperv driver is to regularly scan
for PCI devices to associate with NetVSC interfaces and feed their
addresses to corresponding fail-safe instances.

Signed-off-by: Adrien Mazarguil <***@6wind.com>
---
doc/guides/nics/hyperv.rst | 65 ++++
drivers/net/hyperv/Makefile | 4 +
drivers/net/hyperv/hyperv.c | 654 ++++++++++++++++++++++++++++++++++++++-
3 files changed, 722 insertions(+), 1 deletion(-)

diff --git a/doc/guides/nics/hyperv.rst b/doc/guides/nics/hyperv.rst
index 28c4443d6..8f7a8b153 100644
--- a/doc/guides/nics/hyperv.rst
+++ b/doc/guides/nics/hyperv.rst
@@ -37,6 +37,50 @@ machines running on Microsoft Hyper-V_ (including Azure) platforms.

.. _Hyper-V: https://docs.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-hyper-v

+Implementation details
+----------------------
+
+Each instance of this driver effectively needs to drive two devices: the
+NetVSC interface proper and its SR-IOV VF (referred to as "physical" from
+this point on) counterpart sharing the same MAC address.
+
+Physical devices are part of the host system and cannot be maintained during
+VM migration. From a VM standpoint they appear as hot-plug devices that come
+and go without prior notice.
+
+When the physical device is present, egress and most of the ingress traffic
+flows through it; only multicasts and other hypervisor control data still flow
+through NetVSC. Otherwise, NetVSC acts as a fallback for all traffic.
+
+To avoid unnecessary code duplication and ensure maximum performance,
+handling of physical devices is left to their original PMDs; this virtual
+device driver (also known as *vdev*) manages other PMDs as summarized by the
+following block diagram::
+
+ .------------------.
+ | DPDK application |
+ `--------+---------'
+ |
+ .------+------.
+ | DPDK ethdev |
+ `------+------' Control
+ | |
+ .------------+------------. v .------------.
+ | failsafe PMD +---------+ hyperv PMD |
+ `--+-------------------+--' `------------'
+ | |
+ | .........|.........
+ | : | :
+ .----+----. : .----+----. :
+ | tap PMD | : | any PMD | :
+ `----+----' : `----+----' : <-- Hot-pluggable
+ | : | :
+ .------+-------. : .-----+-----. :
+ | NetVSC-based | : | SR-IOV VF | :
+ | netdevice | : | device | :
+ `--------------' : `-----------' :
+ :.................:
+
Build options
-------------

@@ -47,3 +91,24 @@ Build options
- ``CONFIG_RTE_LIBRTE_HYPERV_DEBUG`` (default ``n``)

Toggle additional debugging code.
+
+Run-time parameters
+-------------------
+
+To invoke this PMD, applications have to explicitly provide the
+``--vdev=net_hyperv`` EAL option.
+
+The following device parameters are supported:
+
+- ``iface`` [string]
+
+ Provide a specific NetVSC interface (netdevice) name to attach this PMD
+ to. Can be provided multiple times for additional instances.
+
+- ``mac`` [string]
+
+ Same as ``iface`` except a suitable NetVSC interface is located using its
+ MAC address.
+
+Not specifying either ``iface`` or ``mac`` makes this PMD attach itself to
+all NetVSC interfaces found on the system.
diff --git a/drivers/net/hyperv/Makefile b/drivers/net/hyperv/Makefile
index 82c720353..0a7d2986c 100644
--- a/drivers/net/hyperv/Makefile
+++ b/drivers/net/hyperv/Makefile
@@ -40,6 +40,9 @@ EXPORT_MAP := rte_pmd_hyperv_version.map
CFLAGS += -O3
CFLAGS += -g
CFLAGS += -std=c11 -pedantic -Wall -Wextra
+CFLAGS += -D_XOPEN_SOURCE=600
+CFLAGS += -D_BSD_SOURCE
+CFLAGS += -D_DEFAULT_SOURCE
CFLAGS += $(WERROR_FLAGS)

# Dependencies.
@@ -47,6 +50,7 @@ LDLIBS += -lrte_bus_vdev
LDLIBS += -lrte_eal
LDLIBS += -lrte_ethdev
LDLIBS += -lrte_kvargs
+LDLIBS += -lrte_net

# Source files.
SRCS-$(CONFIG_RTE_LIBRTE_HYPERV_PMD) += hyperv.c
diff --git a/drivers/net/hyperv/hyperv.c b/drivers/net/hyperv/hyperv.c
index 2f940c76f..bad224be9 100644
--- a/drivers/net/hyperv/hyperv.c
+++ b/drivers/net/hyperv/hyperv.c
@@ -31,17 +31,40 @@
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

+#include <errno.h>
+#include <fcntl.h>
+#include <linux/sockios.h>
+#include <net/if.h>
+#include <netinet/ip.h>
+#include <stdarg.h>
#include <stddef.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdio.h>
#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/queue.h>
+#include <sys/socket.h>
+#include <unistd.h>

+#include <rte_alarm.h>
+#include <rte_bus.h>
#include <rte_bus_vdev.h>
+#include <rte_common.h>
#include <rte_config.h>
+#include <rte_dev.h>
+#include <rte_errno.h>
+#include <rte_ethdev.h>
+#include <rte_ether.h>
#include <rte_kvargs.h>
#include <rte_log.h>

#define HYPERV_DRIVER net_hyperv
#define HYPERV_ARG_IFACE "iface"
#define HYPERV_ARG_MAC "mac"
+#define HYPERV_PROBE_MS 1000
+
+#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"

#ifdef RTE_LIBRTE_HYPERV_DEBUG

@@ -68,12 +91,603 @@
#define WARN(...) PMD_DRV_LOG(WARNING, __VA_ARGS__)
#define ERROR(...) PMD_DRV_LOG(ERR, __VA_ARGS__)

+/**
+ * Convert a MAC address string to binary form.
+ *
+ * Note: this function should be exposed by rte_ether.h as the reverse of
+ * ether_format_addr().
+ *
+ * Several MAC string formats are supported on input for convenience:
+ *
+ * 1. "12:34:56:78:9a:bc"
+ * 2. "12-34-56-78-9a-bc"
+ * 3. "123456789abc"
+ * 4. Upper/lowercase hexadecimal.
+ * 5. Any combination of the above, e.g. "12:34-5678-9aBC".
+ * 6. Partial addresses are allowed, with low-order bytes filled first:
+ * - "5:6:78c" translates to "00:00:05:06:07:8c",
+ * - "5678c" translates to "00:00:00:05:67:8c".
+ *
+ * Non-hexadecimal characters, unknown separators and strings specifying
+ * more than 6 bytes are not allowed.
+ *
+ * @param[out] eth_addr
+ * Pointer to conversion result buffer.
+ * @param[in] str
+ * MAC address string to convert.
+ *
+ * @return
+ * 0 on success, -EINVAL in case of unsupported format.
+ */
+static int
+ether_addr_from_str(struct ether_addr *eth_addr, const char *str)
+{
+ static const uint8_t conv[0x100] = {
+ ['0'] = 0x80, ['1'] = 0x81, ['2'] = 0x82, ['3'] = 0x83,
+ ['4'] = 0x84, ['5'] = 0x85, ['6'] = 0x86, ['7'] = 0x87,
+ ['8'] = 0x88, ['9'] = 0x89, ['a'] = 0x8a, ['b'] = 0x8b,
+ ['c'] = 0x8c, ['d'] = 0x8d, ['e'] = 0x8e, ['f'] = 0x8f,
+ ['A'] = 0x8a, ['B'] = 0x8b, ['C'] = 0x8c, ['D'] = 0x8d,
+ ['E'] = 0x8e, ['F'] = 0x8f, [':'] = 0x40, ['-'] = 0x40,
+ ['\0'] = 0x60,
+ };
+ uint64_t addr = 0;
+ uint64_t buf = 0;
+ unsigned int i = 0;
+ unsigned int n = 0;
+ uint8_t tmp;
+
+ do {
+ tmp = conv[(uint8_t)*(str++)];
+ if (!tmp)
+ return -EINVAL;
+ if (tmp & 0x40) {
+ i += (i & 1) + (!i << 1);
+ addr = (addr << (i << 2)) | buf;
+ n += i;
+ buf = 0;
+ i = 0;
+ } else {
+ buf = (buf << 4) | (tmp & 0xf);
+ ++i;
+ }
+ } while (!(tmp & 0x20));
+ if (n > 12)
+ return -EINVAL;
+ i = RTE_DIM(eth_addr->addr_bytes);
+ while (i) {
+ eth_addr->addr_bytes[--i] = addr & 0xff;
+ addr >>= 8;
+ }
+ return 0;
+}
+
+/** Context structure for a hyperv instance. */
+struct hyperv_ctx {
+ LIST_ENTRY(hyperv_ctx) entry; /**< Next entry in list. */
+ unsigned int id; /**< ID used to generate unique names. */
+ char name[64]; /**< Unique name for hyperv instance. */
+ char devname[64]; /**< Fail-safe PMD instance name. */
+ char devargs[256]; /**< Fail-safe PMD instance device arguments. */
+ char if_name[IF_NAMESIZE]; /**< NetVSC netdevice name. */
+ unsigned int if_index; /**< NetVSC netdevice index. */
+ struct ether_addr if_addr; /**< NetVSC MAC address. */
+ int pipe[2]; /**< Communication pipe with fail-safe instance. */
+ char yield[256]; /**< Current device string used with fail-safe. */
+};
+
+/** Context list is common to all PMD instances. */
+static LIST_HEAD(, hyperv_ctx) hyperv_ctx_list =
+ LIST_HEAD_INITIALIZER(hyperv_ctx_list);
+
+/** Number of entries in context list. */
+static unsigned int hyperv_ctx_count;
+
/** Number of PMD instances relying on context list. */
static unsigned int hyperv_ctx_inst;

/**
+ * Destroy a hyperv context instance.
+ *
+ * @param ctx
+ * Context to destroy.
+ */
+static void
+hyperv_ctx_destroy(struct hyperv_ctx *ctx)
+{
+ if (ctx->pipe[0] != -1)
+ close(ctx->pipe[0]);
+ if (ctx->pipe[1] != -1)
+ close(ctx->pipe[1]);
+ /* Poisoning for debugging purposes. */
+ memset(ctx, 0x22, sizeof(*ctx));
+ free(ctx);
+}
+
+/**
+ * Iterate over system network interfaces.
+ *
+ * This function runs a given callback function for each netdevice found on
+ * the system.
+ *
+ * @param func
+ * Callback function pointer. List traversal is aborted when this function
+ * returns a nonzero value.
+ * @param ...
+ * Variable parameter list passed as @p va_list to @p func.
+ *
+ * @return
+ * 0 when the entire list is traversed successfully, a negative error code
+ * in case of failure, or the nonzero value returned by @p func when list
+ * traversal is aborted.
+ */
+static int
+hyperv_foreach_iface(int (*func)(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap), ...)
+{
+ struct if_nameindex *iface = if_nameindex();
+ int s = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+ unsigned int i;
+ int ret = 0;
+
+ if (!iface) {
+ ret = -ENOBUFS;
+ ERROR("cannot retrieve system network interfaces");
+ goto error;
+ }
+ if (s == -1) {
+ ret = -errno;
+ ERROR("cannot open socket: %s", rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; iface[i].if_name; ++i) {
+ struct ifreq req;
+ struct ether_addr eth_addr;
+ va_list ap;
+
+ strncpy(req.ifr_name, iface[i].if_name, sizeof(req.ifr_name));
+ if (ioctl(s, SIOCGIFHWADDR, &req) == -1) {
+ WARN("cannot retrieve information about interface"
+ " \"%s\": %s",
+ req.ifr_name, rte_strerror(errno));
+ continue;
+ }
+ memcpy(eth_addr.addr_bytes, req.ifr_hwaddr.sa_data,
+ RTE_DIM(eth_addr.addr_bytes));
+ va_start(ap, func);
+ ret = func(&iface[i], &eth_addr, ap);
+ va_end(ap);
+ if (ret)
+ break;
+ }
+error:
+ if (s != -1)
+ close(s);
+ if (iface)
+ if_freenameindex(iface);
+ return ret;
+}
+
+/**
+ * Determine if a network interface is NetVSC.
+ *
+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ *
+ * @return
+ * A nonzero value when interface is detected as NetVSC. In case of error,
+ * rte_errno is updated and 0 returned.
+ */
+static int
+hyperv_iface_is_netvsc(const struct if_nameindex *iface)
+{
+ static const char temp[] = "/sys/class/net/%s/device/class_id";
+ char path[snprintf(NULL, 0, temp, iface->if_name) + 1];
+ FILE *f;
+ int ret;
+ int len = 0;
+
+ snprintf(path, sizeof(path), temp, iface->if_name);
+ f = fopen(path, "r");
+ if (!f) {
+ rte_errno = errno;
+ return 0;
+ }
+ ret = fscanf(f, NETVSC_CLASS_ID "%n", &len);
+ if (ret == EOF)
+ rte_errno = errno;
+ ret = len == (int)strlen(NETVSC_CLASS_ID);
+ fclose(f);
+ return ret;
+}
+
+/**
+ * Retrieve the last component of a path.
+ *
+ * This is a simplified basename() that does not modify its input buffer to
+ * handle trailing slashes.
+ *
+ * @param[in] path
+ * Path to retrieve the last component from.
+ *
+ * @return
+ * Pointer to the last component.
+ */
+static const char *
+hyperv_basename(const char *path)
+{
+ const char *tmp = path;
+
+ while (*tmp)
+ if (*(tmp++) == '/')
+ path = tmp;
+ return path;
+}
+
+/**
+ * Retrieve network interface data from sysfs symbolic link.
+ *
+ * @param[out] buf
+ * Output data buffer.
+ * @param size
+ * Output buffer size.
+ * @param[in] if_name
+ * Netdevice name.
+ * @param[in] relpath
+ * Symbolic link path relative to netdevice sysfs entry.
+ *
+ * @return
+ * 0 on success, a negative error code otherwise.
+ */
+static int
+hyperv_sysfs_readlink(char *buf, size_t size, const char *if_name,
+ const char *relpath)
+{
+ int ret;
+
+ ret = snprintf(buf, size, "/sys/class/net/%s/%s", if_name, relpath);
+ if (ret == -1 || (size_t)ret >= size - 1)
+ return -ENOBUFS;
+ ret = readlink(buf, buf, size);
+ if (ret == -1)
+ return -errno;
+ if ((size_t)ret >= size - 1)
+ return -ENOBUFS;
+ buf[ret] = '\0';
+ return 0;
+}
+
+/**
+ * Probe a network interface to associate with hyperv context.
+ *
+ * This function determines if the network device matches the properties of
+ * the NetVSC interface associated with the hyperv context and communicates
+ * its bus address to the fail-safe PMD instance if so.
+ *
+ * It is normally used with hyperv_foreach_iface().
+ *
+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ * @param[in] eth_addr
+ * MAC address associated with @p iface.
+ * @param ap
+ * Variable arguments list comprising:
+ *
+ * - struct hyperv_ctx *ctx:
+ * Context to associate network interface with.
+ *
+ * @return
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+hyperv_device_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ struct hyperv_ctx *ctx = va_arg(ap, struct hyperv_ctx *);
+ char buf[RTE_MAX(sizeof(ctx->yield), 256u)];
+ const char *addr;
+ size_t len;
+ int ret;
+
+ /* Skip non-matching or unwanted NetVSC interfaces. */
+ if (ctx->if_index == iface->if_index) {
+ if (!strcmp(ctx->if_name, iface->if_name))
+ return 0;
+ DEBUG("NetVSC interface \"%s\" (index %u) renamed \"%s\"",
+ ctx->if_name, ctx->if_index, iface->if_name);
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ return 0;
+ }
+ if (hyperv_iface_is_netvsc(iface))
+ return 0;
+ if (!is_same_ether_addr(eth_addr, &ctx->if_addr))
+ return 0;
+ /* Look for associated PCI device. */
+ ret = hyperv_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device/subsystem");
+ if (ret)
+ return 0;
+ if (strcmp(hyperv_basename(buf), "pci"))
+ return 0;
+ ret = hyperv_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device");
+ if (ret)
+ return 0;
+ addr = hyperv_basename(buf);
+ len = strlen(addr);
+ if (!len)
+ return 0;
+ /* Send PCI device argument to fail-safe PMD instance if updated. */
+ if (!strcmp(addr, ctx->yield))
+ return 1;
+ DEBUG("associating PCI device \"%s\" with NetVSC interface \"%s\""
+ " (index %u)",
+ addr, ctx->if_name, ctx->if_index);
+ memmove(buf, addr, len + 1);
+ addr = buf;
+ buf[len] = '\n';
+ ret = write(ctx->pipe[1], addr, len + 1);
+ buf[len] = '\0';
+ if (ret == -1) {
+ if (errno == EINTR || errno == EAGAIN)
+ return 1;
+ WARN("cannot associate PCI device name \"%s\" with interface"
+ " \"%s\": %s",
+ addr, ctx->if_name, rte_strerror(errno));
+ return 1;
+ }
+ if ((size_t)ret != len + 1) {
+ /*
+ * Attempt to override previous partial write, no need to
+ * recover if that fails.
+ */
+ ret = write(ctx->pipe[1], "\n", 1);
+ (void)ret;
+ return 1;
+ }
+ fsync(ctx->pipe[1]);
+ memcpy(ctx->yield, addr, len + 1);
+ return 1;
+}
+
+/**
+ * Alarm callback that regularly probes system network interfaces.
+ *
+ * This callback runs at a frequency determined by HYPERV_PROBE_MS as long
+ * as a hyperv context instance exists.
+ *
+ * @param arg
+ * Ignored.
+ */
+static void
+hyperv_alarm(void *arg)
+{
+ struct hyperv_ctx *ctx;
+ int ret;
+
+ (void)arg;
+ LIST_FOREACH(ctx, &hyperv_ctx_list, entry) {
+ ret = hyperv_foreach_iface(hyperv_device_probe, ctx);
+ if (ret)
+ break;
+ }
+ if (!hyperv_ctx_count)
+ return;
+ ret = rte_eal_alarm_set(HYPERV_PROBE_MS * 1000, hyperv_alarm, NULL);
+ if (ret < 0) {
+ ERROR("unable to reschedule alarm callback: %s",
+ rte_strerror(-ret));
+ }
+}
+
+/**
+ * Probe a NetVSC interface to generate a hyperv context from.
+ *
+ * This function instantiates hyperv contexts either for all NetVSC devices
+ * found on the system or only a subset provided as device arguments.
+ *
+ * It is normally used with hyperv_foreach_iface().
+ *
+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ * @param[in] eth_addr
+ * MAC address associated with @p iface.
+ * @param ap
+ * Variable arguments list comprising:
+ *
+ * - const char *name:
+ * Name associated with current driver instance.
+ *
+ * - struct rte_kvargs *kvargs:
+ * Device arguments provided to current driver instance.
+ *
+ * - unsigned int specified:
+ * Number of specific netdevices provided as device arguments.
+ *
+ * - unsigned int *matched:
+ * The number of specified netdevices matched by this function.
+ *
+ * @return
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+hyperv_netvsc_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ const char *name = va_arg(ap, const char *);
+ struct rte_kvargs *kvargs = va_arg(ap, struct rte_kvargs *);
+ unsigned int specified = va_arg(ap, unsigned int);
+ unsigned int *matched = va_arg(ap, unsigned int *);
+ unsigned int i;
+ struct hyperv_ctx *ctx;
+ uint16_t port_id;
+ int ret;
+
+ /* Probe all interfaces when none are specified. */
+ if (specified) {
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, HYPERV_ARG_IFACE)) {
+ if (!strcmp(pair->value, iface->if_name))
+ break;
+ } else if (!strcmp(pair->key, HYPERV_ARG_MAC)) {
+ struct ether_addr tmp;
+
+ if (ether_addr_from_str(&tmp, pair->value)) {
+ ERROR("invalid MAC address format"
+ " \"%s\"",
+ pair->value);
+ return -EINVAL;
+ }
+ if (is_same_ether_addr(eth_addr, &tmp))
+ break;
+ }
+ }
+ if (i == kvargs->count)
+ return 0;
+ ++(*matched);
+ }
+ /* Weed out interfaces already handled. */
+ LIST_FOREACH(ctx, &hyperv_ctx_list, entry)
+ if (ctx->if_index == iface->if_index)
+ break;
+ if (ctx) {
+ if (!specified)
+ return 0;
+ WARN("interface \"%s\" (index %u) is already handled, skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ if (!hyperv_iface_is_netvsc(iface)) {
+ if (!specified)
+ return 0;
+ WARN("interface \"%s\" (index %u) is not NetVSC, skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ /* Create interface context. */
+ ctx = calloc(1, sizeof(*ctx));
+ if (!ctx) {
+ ret = -errno;
+ ERROR("cannot allocate context for interface \"%s\": %s",
+ iface->if_name, rte_strerror(errno));
+ goto error;
+ }
+ ctx->id = hyperv_ctx_count;
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ ctx->if_index = iface->if_index;
+ ctx->if_addr = *eth_addr;
+ ctx->pipe[0] = -1;
+ ctx->pipe[1] = -1;
+ ctx->yield[0] = '\0';
+ if (pipe(ctx->pipe) == -1) {
+ ret = -errno;
+ ERROR("cannot allocate control pipe for interface \"%s\": %s",
+ ctx->if_name, rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; i != RTE_DIM(ctx->pipe); ++i) {
+ int flf = fcntl(ctx->pipe[i], F_GETFL);
+ int fdf = fcntl(ctx->pipe[i], F_GETFD);
+
+ if (flf != -1 &&
+ fcntl(ctx->pipe[i], F_SETFL, flf | O_NONBLOCK) != -1 &&
+ fdf != -1 &&
+ fcntl(ctx->pipe[i], F_SETFD,
+ i ? fdf | FD_CLOEXEC : fdf & ~FD_CLOEXEC) != -1)
+ continue;
+ ret = -errno;
+ ERROR("cannot toggle non-blocking or close-on-exec flags on"
+ " control file descriptor #%u (%d): %s",
+ i, ctx->pipe[i], rte_strerror(errno));
+ goto error;
+ }
+ /* Generate virtual device name and arguments. */
+ i = 0;
+ ret = snprintf(ctx->name, sizeof(ctx->name), "%s_id%u",
+ name, ctx->id);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->name) - 1)
+ ++i;
+ ret = snprintf(ctx->devname, sizeof(ctx->devname), "net_failsafe_%s",
+ ctx->name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devname) - 1)
+ ++i;
+ /*
+ * Note: bash replaces the default sh interpreter used by popen()
+ * because as seen with dash, POSIX-compliant shells do not
+ * necessarily support redirections with file descriptor numbers
+ * above 9.
+ */
+ ret = snprintf(ctx->devargs, sizeof(ctx->devargs),
+ "exec(exec bash -c "
+ "'while read -r tmp <&%u 2> /dev/null;"
+ " do dev=$tmp; done;"
+ " echo $dev"
+ "'),dev(net_tap_%s,remote=%s)",
+ ctx->pipe[0], ctx->name, ctx->if_name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devargs) - 1)
+ ++i;
+ if (i) {
+ ret = -ENOBUFS;
+ ERROR("generated virtual device name or argument list too long"
+ " for interface \"%s\"", ctx->if_name);
+ goto error;
+ }
+ /*
+ * Remove any competing rte_eth_dev entries sharing the same MAC
+ * address, fail-safe instances created by this PMD will handle them
+ * as sub-devices later.
+ */
+ RTE_ETH_FOREACH_DEV(port_id) {
+ struct rte_device *dev = rte_eth_devices[port_id].device;
+ struct rte_bus *bus = rte_bus_find_by_device(dev);
+ struct ether_addr tmp;
+
+ rte_eth_macaddr_get(port_id, &tmp);
+ if (!is_same_ether_addr(eth_addr, &tmp))
+ continue;
+ WARN("removing device \"%s\" with identical MAC address to"
+ " re-create it as a fail-safe sub-device",
+ dev->name);
+ if (!bus)
+ ret = -EINVAL;
+ else
+ ret = rte_eal_hotplug_remove(bus->name, dev->name);
+ if (ret < 0) {
+ ERROR("unable to remove device \"%s\": %s",
+ dev->name, rte_strerror(-ret));
+ goto error;
+ }
+ }
+ /* Request virtual device generation. */
+ DEBUG("generating virtual device \"%s\" with arguments \"%s\"",
+ ctx->devname, ctx->devargs);
+ ret = rte_eal_hotplug_add("vdev", ctx->devname, ctx->devargs);
+ if (ret)
+ goto error;
+ LIST_INSERT_HEAD(&hyperv_ctx_list, ctx, entry);
+ ++hyperv_ctx_count;
+ DEBUG("added NetVSC interface \"%s\" to context list", ctx->if_name);
+ return 0;
+error:
+ if (ctx)
+ hyperv_ctx_destroy(ctx);
+ return ret;
+}
+
+/**
* Probe NetVSC interfaces.
*
+ * This function probes system netdevices according to the specified device
+ * arguments and starts a periodic alarm callback to notify the resulting
+ * fail-safe PMD instances of their sub-devices' whereabouts.
+ *
* @param dev
* Virtual device context for PMD instance.
*
@@ -92,12 +706,38 @@ hyperv_vdev_probe(struct rte_vdev_device *dev)
const char *args = rte_vdev_device_args(dev);
struct rte_kvargs *kvargs = rte_kvargs_parse(args ? args : "",
hyperv_arg);
+ unsigned int specified = 0;
+ unsigned int matched = 0;
+ unsigned int i;
+ int ret;

DEBUG("invoked as \"%s\", using arguments \"%s\"", name, args);
if (!kvargs) {
ERROR("cannot parse arguments list");
goto error;
}
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, HYPERV_ARG_IFACE) ||
+ !strcmp(pair->key, HYPERV_ARG_MAC))
+ ++specified;
+ }
+ rte_eal_alarm_cancel(hyperv_alarm, NULL);
+ /* Gather interfaces. */
+ ret = hyperv_foreach_iface(hyperv_netvsc_probe, name, kvargs,
+ specified, &matched);
+ if (ret < 0)
+ goto error;
+ if (matched < specified)
+ WARN("some of the specified parameters did not match valid"
+ " network interfaces");
+ ret = rte_eal_alarm_set(HYPERV_PROBE_MS * 1000, hyperv_alarm, NULL);
+ if (ret < 0) {
+ ERROR("unable to schedule alarm callback: %s",
+ rte_strerror(-ret));
+ goto error;
+ }
error:
if (kvargs)
rte_kvargs_free(kvargs);
@@ -108,6 +748,9 @@ hyperv_vdev_probe(struct rte_vdev_device *dev)
/**
* Remove PMD instance.
*
+ * The alarm callback and underlying hyperv context instances are only
+ * destroyed after the last PMD instance is removed.
+ *
* @param dev
* Virtual device context for PMD instance.
*
@@ -118,7 +761,16 @@ static int
hyperv_vdev_remove(struct rte_vdev_device *dev)
{
(void)dev;
- --hyperv_ctx_inst;
+ if (--hyperv_ctx_inst)
+ return 0;
+ rte_eal_alarm_cancel(hyperv_alarm, NULL);
+ while (!LIST_EMPTY(&hyperv_ctx_list)) {
+ struct hyperv_ctx *ctx = LIST_FIRST(&hyperv_ctx_list);
+
+ LIST_REMOVE(ctx, entry);
+ --hyperv_ctx_count;
+ hyperv_ctx_destroy(ctx);
+ }
return 0;
}
--
2.11.0
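
The hand-off between the periodic probe and each fail-safe instance comes
down to newline-terminated PCI addresses written to a non-blocking pipe,
with the reader keeping only the last complete line (which is what the
generated bash loop does). Below is a minimal standalone sketch of that idea
in plain POSIX C. It is not the code from the patch: the helper names
set_nonblock(), send_addr(), last_line() and pipe_handoff_demo() are
hypothetical, the reader is a C stand-in for the shell loop, and the PCI
addresses are made-up examples.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: switch a descriptor to non-blocking mode. */
static int
set_nonblock(int fd)
{
	int fl = fcntl(fd, F_GETFL);

	if (fl == -1 || fcntl(fd, F_SETFL, fl | O_NONBLOCK) == -1)
		return -errno;
	return 0;
}

/* Hypothetical helper: send one newline-terminated address string. */
static int
send_addr(int fd, const char *addr)
{
	char buf[256];
	size_t len = strlen(addr);
	ssize_t ret;

	if (len + 1 > sizeof(buf))
		return -ENOBUFS;
	memcpy(buf, addr, len);
	buf[len] = '\n';
	ret = write(fd, buf, len + 1);
	if (ret == -1)
		return -errno;
	return (size_t)ret == len + 1 ? 0 : -EIO;
}

/*
 * Drain fd and copy the last complete line (without its newline) to out.
 * Lines longer than the internal buffer are silently truncated.
 * Returns 1 when a line was found, 0 when none was pending, or a negative
 * errno value on failure.
 */
static int
last_line(int fd, char *out, size_t size)
{
	char cur[256];
	size_t n = 0;
	int found = 0;
	ssize_t ret;
	char c;

	assert(size > 0);
	while ((ret = read(fd, &c, 1)) == 1) {
		if (c != '\n') {
			if (n < sizeof(cur) - 1)
				cur[n++] = c;
			continue;
		}
		if (n + 1 > size)
			return -ENOBUFS;
		memcpy(out, cur, n);
		out[n] = '\0';
		found = 1;
		n = 0;
	}
	if (ret == -1 && errno != EAGAIN && errno != EWOULDBLOCK)
		return -errno;
	return found;
}

/* Usage example: two updates are queued, only the latest one is kept. */
int
pipe_handoff_demo(void)
{
	int fds[2];
	char out[64];
	int ret = -1;

	if (pipe(fds) == -1)
		return -errno;
	if (set_nonblock(fds[0]) == 0 && set_nonblock(fds[1]) == 0 &&
	    send_addr(fds[1], "0000:00:02.0") == 0 &&
	    send_addr(fds[1], "0002:00:02.0") == 0 &&
	    last_line(fds[0], out, sizeof(out)) == 1 &&
	    strcmp(out, "0002:00:02.0") == 0)
		ret = 0;
	close(fds[0]);
	close(fds[1]);
	return ret;
}
```

Keeping only the last line means a reader that wakes up late still ends up
with the most recent address, so stale updates never need to be flushed
explicitly.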
Wiles, Keith
2017-12-18 17:04:23 UTC
Post by Adrien Mazarguil
+/**
+ * Convert a MAC address string to binary form.
+ *
+ * Note: this function should be exposed by rte_ether.h as the reverse of
+ * ether_format_addr().
+ *
+ * Several MAC string formats are supported on input for convenience:
+ *
+ * 1. "12:34:56:78:9a:bc"
+ * 2. "12-34-56-78-9a-bc"
+ * 3. "123456789abc"
+ * 4. Upper/lowercase hexadecimal.
+ * 5. Any combination of the above, e.g. "12:34-5678-9aBC".
+ * 6. Partial addresses are allowed, with low-order bytes filled first:
+ * - "5:6:78c" translates to "00:00:05:06:07:8c",
+ * - "5678c" translates to "00:00:00:05:67:8c".
+ *
+ * Non-hexadecimal characters, unknown separators and strings specifying
+ * more than 6 bytes are not allowed.
+ *
+ * @param[out] eth_addr
+ * Pointer to conversion result buffer.
+ * @param[in] str
+ * MAC address string to convert.
+ *
+ * @return
+ * 0 on success, -EINVAL in case of unsupported format.
+ */
+static int
+ether_addr_from_str(struct ether_addr *eth_addr, const char *str)
+{
+ static const uint8_t conv[0x100] = {
+ ['0'] = 0x80, ['1'] = 0x81, ['2'] = 0x82, ['3'] = 0x83,
+ ['4'] = 0x84, ['5'] = 0x85, ['6'] = 0x86, ['7'] = 0x87,
+ ['8'] = 0x88, ['9'] = 0x89, ['a'] = 0x8a, ['b'] = 0x8b,
+ ['c'] = 0x8c, ['d'] = 0x8d, ['e'] = 0x8e, ['f'] = 0x8f,
+ ['A'] = 0x8a, ['B'] = 0x8b, ['C'] = 0x8c, ['D'] = 0x8d,
+ ['E'] = 0x8e, ['F'] = 0x8f, [':'] = 0x40, ['-'] = 0x40,
+ ['\0'] = 0x60,
+ };
+ uint64_t addr = 0;
+ uint64_t buf = 0;
+ unsigned int i = 0;
+ unsigned int n = 0;
+ uint8_t tmp;
+
+ do {
+ tmp = conv[(int)*(str++)];
+ if (!tmp)
+ return -EINVAL;
+ if (tmp & 0x40) {
+ i += (i & 1) + (!i << 1);
+ addr = (addr << (i << 2)) | buf;
+ n += i;
+ buf = 0;
+ i = 0;
+ } else {
+ buf = (buf << 4) | (tmp & 0xf);
+ ++i;
+ }
+ } while (!(tmp & 0x20));
+ if (n > 12)
+ return -EINVAL;
+ i = RTE_DIM(eth_addr->addr_bytes);
+ while (i) {
+ eth_addr->addr_bytes[--i] = addr & 0xff;
+ addr >>= 8;
+ }
+ return 0;
+}
You already called this out above; why not just push this into the rte_ether.h file? I know I could use it if it were public.
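
To make the accepted input formats concrete outside of DPDK, here is a
standalone sketch of a parser with the same documented behavior (':' or '-'
separators, mixed case, partial addresses filled from the low-order end). It
is a simplified re-implementation rather than the table-driven code from the
patch: struct mac_addr, hex_value(), mac_from_str() and mac_from_str_demo()
are hypothetical names, and the digit count is bounded before any shift
takes place.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for DPDK's struct ether_addr (6 raw bytes). */
struct mac_addr {
	uint8_t bytes[6];
};

/* Return the value of a hexadecimal digit, or -1 for anything else. */
static int
hex_value(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Parse groups of hex digits separated by ':' or '-'. Each group is padded
 * to whole bytes (an empty group counts as one zero byte) and the result
 * is right-aligned in the 6-byte address.
 * Returns 0 on success, -EINVAL otherwise.
 */
static int
mac_from_str(struct mac_addr *mac, const char *str)
{
	uint64_t addr = 0;       /* address accumulated so far */
	uint64_t group = 0;      /* value of the group being read */
	unsigned int digits = 0; /* hex digits in the current group */
	unsigned int total = 0;  /* padded digit count over all groups */
	unsigned int i;

	for (;; ++str) {
		int v = hex_value(*str);

		if (v >= 0) {
			if (++digits > 12)
				return -EINVAL;
			group = (group << 4) | (unsigned int)v;
			continue;
		}
		if (*str != ':' && *str != '-' && *str != '\0')
			return -EINVAL;
		/* Round up to an even digit count; empty groups become 2. */
		digits += (digits & 1) + (digits ? 0 : 2);
		total += digits;
		if (total > 12)
			return -EINVAL;
		addr = (addr << (digits * 4)) | group;
		group = 0;
		digits = 0;
		if (*str == '\0')
			break;
	}
	for (i = sizeof(mac->bytes); i--; ) {
		mac->bytes[i] = addr & 0xff;
		addr >>= 8;
	}
	return 0;
}

/* Usage example exercising formats listed in the patch's comment. */
int
mac_from_str_demo(void)
{
	struct mac_addr mac;

	assert(mac_from_str(&mac, "12:34-5678-9aBC") == 0);
	assert(mac.bytes[0] == 0x12 && mac.bytes[5] == 0xbc);
	assert(mac_from_str(&mac, "5:6:78c") == 0);
	assert(mac.bytes[2] == 0x05 && mac.bytes[4] == 0x07);
	assert(mac_from_str(&mac, "12:34:56:78:9a:bc:de") == -EINVAL);
	return 0;
}
```

A public helper would probably want a stricter mode as well, since accepting
partial addresses is convenient for device arguments but surprising for
general use.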
Post by Adrien Mazarguil
+
+/** Context structure for a hyperv instance. */
+struct hyperv_ctx {
+ LIST_ENTRY(hyperv_ctx) entry; /**< Next entry in list. */
+ unsigned int id; /**< ID used to generate unique names. */
+ char name[64]; /**< Unique name for hyperv instance. */
+ char devname[64]; /**< Fail-safe PMD instance name. */
+ char devargs[256]; /**< Fail-safe PMD instance device arguments. */
+ char if_name[IF_NAMESIZE]; /**< NetVSC netdevice name. */
+ unsigned int if_index; /**< NetVSC netdevice index. */
+ struct ether_addr if_addr; /**< NetVSC MAC address. */
+ int pipe[2]; /**< Communication pipe with fail-safe instance. */
+ char yield[256]; /**< Current device string used with fail-safe. */
+};
+
+/** Context list is common to all PMD instances. */
+static LIST_HEAD(, hyperv_ctx) hyperv_ctx_list =
+ LIST_HEAD_INITIALIZER(hyperv_ctx_list);
+
+/** Number of entries in context list. */
+static unsigned int hyperv_ctx_count;
+
/** Number of PMD instances relying on context list. */
static unsigned int hyperv_ctx_inst;
/**
+ * Destroy a hyperv context instance.
+ *
+ * @param ctx
+ * Context to destroy.
+ */
+static void
+hyperv_ctx_destroy(struct hyperv_ctx *ctx)
+{
+ if (ctx->pipe[0] != -1)
+ close(ctx->pipe[0]);
+ if (ctx->pipe[1] != -1)
+ close(ctx->pipe[1]);
+ /* Poisoning for debugging purposes. */
+ memset(ctx, 0x22, sizeof(*ctx));
+ free(ctx);
+}
+
+/**
+ * Iterate over system network interfaces.
+ *
+ * This function runs a given callback function for each netdevice found on
+ * the system.
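+ *
+ * @param func
+ * Callback function pointer. List traversal is aborted when this function
+ * returns a nonzero value.
+ * @param ...
+ * Variable parameter list passed as @p va_list to @p func.
+ *
+ * @return
+ * 0 when the entire list is traversed successfully, a negative error code
+ * in case of failure, or the nonzero value returned by @p func when list
+ * traversal is aborted.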
+ */
+static int
+hyperv_foreach_iface(int (*func)(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap), ...)
+{
+ struct if_nameindex *iface = if_nameindex();
+ int s = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+ unsigned int i;
+ int ret = 0;
+
+ if (!iface) {
+ ret = -ENOBUFS;
+ ERROR("cannot retrieve system network interfaces");
+ goto error;
+ }
+ if (s == -1) {
+ ret = -errno;
+ ERROR("cannot open socket: %s", rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; iface[i].if_name; ++i) {
+ struct ifreq req;
+ struct ether_addr eth_addr;
+ va_list ap;
+
+ strncpy(req.ifr_name, iface[i].if_name, sizeof(req.ifr_name));
+ if (ioctl(s, SIOCGIFHWADDR, &req) == -1) {
+ WARN("cannot retrieve information about interface"
+ " \"%s\": %s",
+ req.ifr_name, rte_strerror(errno));
+ continue;
+ }
+ memcpy(eth_addr.addr_bytes, req.ifr_hwaddr.sa_data,
+ RTE_DIM(eth_addr.addr_bytes));
+ va_start(ap, func);
+ ret = func(&iface[i], &eth_addr, ap);
+ va_end(ap);
+ if (ret)
+ break;
+ }
+error:
+ if (s != -1)
+ close(s);
+ if (iface)
+ if_freenameindex(iface);
+ return ret;
+}
+
+/**
+ * Determine if a network interface is NetVSC.
+ *
+ * Pointer to netdevice description structure (name and index).
+ *
+ * A nonzero value when interface is detected as NetVSC. In case of error,
+ * rte_errno is updated and 0 returned.
+ */
+static int
+hyperv_iface_is_netvsc(const struct if_nameindex *iface)
+{
+ static const char temp[] = "/sys/class/net/%s/device/class_id";
+ char path[snprintf(NULL, 0, temp, iface->if_name) + 1];
+ FILE *f;
+ int ret;
+ int len = 0;
+
+ snprintf(path, sizeof(path), temp, iface->if_name);
+ f = fopen(path, "r");
+ if (!f) {
+ rte_errno = errno;
+ return 0;
+ }
+ ret = fscanf(f, NETVSC_CLASS_ID "%n", &len);
+ if (ret == EOF)
+ rte_errno = errno;
+ ret = len == (int)strlen(NETVSC_CLASS_ID);
+ fclose(f);
+ return ret;
+}
+
+/**
+ * Retrieve the last component of a path.
+ *
+ * This is a simplified basename() that does not modify its input buffer to
+ * handle trailing backslashes.
+ *
+ * Path to retrieve the last component from.
+ *
+ * Pointer to the last component.
+ */
+static const char *
+hyperv_basename(const char *path)
+{
+ const char *tmp = path;
+
+ while (*tmp)
+ if (*(tmp++) == '/')
+ path = tmp;
+ return path;
+}
Why not just use rindex() to find the last '/' instead of this routine? I know it is not performance critical.
Post by Adrien Mazarguil
+
+/**
+ * Retrieve network interface data from sysfs symbolic link.
+ *
+ * Output data buffer.
+ * Output buffer size.
+ * Netdevice name.
+ * Symbolic link path relative to netdevice sysfs entry.
+ *
+ * 0 on success, a negative error code otherwise.
+ */
+static int
+hyperv_sysfs_readlink(char *buf, size_t size, const char *if_name,
+ const char *relpath)
+{
+ int ret;
+
+ ret = snprintf(buf, size, "/sys/class/net/%s/%s", if_name, relpath);
+ if (ret == -1 || (size_t)ret >= size - 1)
+ return -ENOBUFS;
+ ret = readlink(buf, buf, size);
+ if (ret == -1)
+ return -errno;
+ if ((size_t)ret >= size - 1)
+ return -ENOBUFS;
+ buf[ret] = '\0';
+ return 0;
+}
+
+/**
+ * Probe a network interface to associate with hyperv context.
+ *
+ * This function determines if the network device matches the properties of
+ * the NetVSC interface associated with the hyperv context and communicates
+ * its bus address to the fail-safe PMD instance if so.
+ *
+ * It is normally used with hyperv_foreach_iface().
+ *
+ * Pointer to netdevice description structure (name and index).
+ *
+ * Context to associate network interface with.
+ *
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+hyperv_device_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ struct hyperv_ctx *ctx = va_arg(ap, struct hyperv_ctx *);
+ char buf[RTE_MAX(sizeof(ctx->yield), 256u)];
+ const char *addr;
+ size_t len;
+ int ret;
+
+ /* Skip non-matching or unwanted NetVSC interfaces. */
+ if (ctx->if_index == iface->if_index) {
+ if (!strcmp(ctx->if_name, iface->if_name))
+ return 0;
+ DEBUG("NetVSC interface \"%s\" (index %u) renamed \"%s\"",
+ ctx->if_name, ctx->if_index, iface->if_name);
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ return 0;
+ }
+ if (hyperv_iface_is_netvsc(iface))
+ return 0;
+ if (!is_same_ether_addr(eth_addr, &ctx->if_addr))
+ return 0;
+ /* Look for associated PCI device. */
+ ret = hyperv_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device/subsystem");
+ if (ret)
+ return 0;
+ if (strcmp(hyperv_basename(buf), "pci"))
+ return 0;
+ ret = hyperv_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device");
+ if (ret)
+ return 0;
+ addr = hyperv_basename(buf);
+ len = strlen(addr);
+ if (!len)
+ return 0;
+ /* Send PCI device argument to fail-safe PMD instance if updated. */
+ if (!strcmp(addr, ctx->yield))
+ return 1;
+ DEBUG("associating PCI device \"%s\" with NetVSC interface \"%s\""
+ " (index %u)",
+ addr, ctx->if_name, ctx->if_index);
+ memmove(buf, addr, len + 1);
+ addr = buf;
+ buf[len] = '\n';
+ ret = write(ctx->pipe[1], addr, len + 1);
+ buf[len] = '\0';
+ if (ret == -1) {
+ if (errno == EINTR || errno == EAGAIN)
+ return 1;
+ WARN("cannot associate PCI device name \"%s\" with interface"
+ " \"%s\": %s",
+ addr, ctx->if_name, rte_strerror(errno));
+ return 1;
+ }
+ if ((size_t)ret != len + 1) {
+ /*
+ * Attempt to override previous partial write, no need to
+ * recover if that fails.
+ */
+ ret = write(ctx->pipe[1], "\n", 1);
+ (void)ret;
+ return 1;
+ }
+ fsync(ctx->pipe[1]);
+ memcpy(ctx->yield, addr, len + 1);
+ return 1;
+}
Not to criticize style, but a few blank lines could help in readability for these files IMHO. Unless blank lines are illegal :-)
Post by Adrien Mazarguil
+
+/**
+ * Alarm callback that regularly probes system network interfaces.
+ *
+ * This callback runs at a frequency determined by HYPERV_PROBE_MS as long
+ * as an hyperv context instance exists.
+ *
+ * Ignored.
+ */
+static void
+hyperv_alarm(void *arg)
+{
+ struct hyperv_ctx *ctx;
+ int ret;
+
+ (void)arg;
+ LIST_FOREACH(ctx, &hyperv_ctx_list, entry) {
+ ret = hyperv_foreach_iface(hyperv_device_probe, ctx);
+ if (ret)
+ break;
+ }
+ if (!hyperv_ctx_count)
+ return;
+ ret = rte_eal_alarm_set(HYPERV_PROBE_MS * 1000, hyperv_alarm, NULL);
+ if (ret < 0) {
+ ERROR("unable to reschedule alarm callback: %s",
+ rte_strerror(-ret));
+ }
+}
+
+/**
+ * Probe a NetVSC interface to generate a hyperv context from.
+ *
+ * This function instantiates hyperv contexts either for all NetVSC devices
+ * found on the system or only a subset provided as device arguments.
+ *
+ * It is normally used with hyperv_foreach_iface().
+ *
+ * Pointer to netdevice description structure (name and index).
+ *
+ * Name associated with current driver instance.
+ *
+ * Device arguments provided to current driver instance.
+ *
+ * Number of specific netdevices provided as device arguments.
+ *
+ * The number of specified netdevices matched by this function.
+ *
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+hyperv_netvsc_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ const char *name = va_arg(ap, const char *);
+ struct rte_kvargs *kvargs = va_arg(ap, struct rte_kvargs *);
+ unsigned int specified = va_arg(ap, unsigned int);
+ unsigned int *matched = va_arg(ap, unsigned int *);
+ unsigned int i;
+ struct hyperv_ctx *ctx;
+ uint16_t port_id;
+ int ret;
+
+ /* Probe all interfaces when none are specified. */
+ if (specified) {
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, HYPERV_ARG_IFACE)) {
+ if (!strcmp(pair->value, iface->if_name))
+ break;
+ } else if (!strcmp(pair->key, HYPERV_ARG_MAC)) {
+ struct ether_addr tmp;
+
+ if (ether_addr_from_str(&tmp, pair->value)) {
+ ERROR("invalid MAC address format"
+ " \"%s\"",
+ pair->value);
+ return -EINVAL;
+ }
+ if (!is_same_ether_addr(eth_addr, &tmp))
+ break;
+ }
+ }
+ if (i == kvargs->count)
+ return 0;
+ ++(*matched);
+ }
+ /* Weed out interfaces already handled. */
+ LIST_FOREACH(ctx, &hyperv_ctx_list, entry)
+ if (ctx->if_index == iface->if_index)
+ break;
+ if (ctx) {
+ if (!specified)
+ return 0;
+ WARN("interface \"%s\" (index %u) is already handled, skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ if (!hyperv_iface_is_netvsc(iface)) {
+ if (!specified)
+ return 0;
+ WARN("interface \"%s\" (index %u) is not NetVSC, skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ /* Create interface context. */
+ ctx = calloc(1, sizeof(*ctx));
+ if (!ctx) {
+ ret = -errno;
+ ERROR("cannot allocate context for interface \"%s\": %s",
+ iface->if_name, rte_strerror(errno));
+ goto error;
+ }
+ ctx->id = hyperv_ctx_count;
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ ctx->if_index = iface->if_index;
+ ctx->if_addr = *eth_addr;
+ ctx->pipe[0] = -1;
+ ctx->pipe[1] = -1;
+ ctx->yield[0] = '\0';
+ if (pipe(ctx->pipe) == -1) {
+ ret = -errno;
+ ERROR("cannot allocate control pipe for interface \"%s\": %s",
+ ctx->if_name, rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; i != RTE_DIM(ctx->pipe); ++i) {
+ int flf = fcntl(ctx->pipe[i], F_GETFL);
+ int fdf = fcntl(ctx->pipe[i], F_GETFD);
+
+ if (flf != -1 &&
+ fcntl(ctx->pipe[i], F_SETFL, flf | O_NONBLOCK) != -1 &&
+ fdf != -1 &&
+ fcntl(ctx->pipe[i], F_SETFD,
+ i ? fdf | FD_CLOEXEC : fdf & ~FD_CLOEXEC) != -1)
+ continue;
+ ret = -errno;
+ ERROR("cannot toggle non-blocking or close-on-exec flags on"
+ " control file descriptor #%u (%d): %s",
+ i, ctx->pipe[i], rte_strerror(errno));
+ goto error;
+ }
+ /* Generate virtual device name and arguments. */
+ i = 0;
+ ret = snprintf(ctx->name, sizeof(ctx->name), "%s_id%u",
+ name, ctx->id);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->name) - 1)
+ ++i;
+ ret = snprintf(ctx->devname, sizeof(ctx->devname), "net_failsafe_%s",
+ ctx->name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devname) - 1)
+ ++i;
+ /*
+ * Note: bash replaces the default sh interpreter used by popen()
+ * because as seen with dash, POSIX-compliant shells do not
+ * necessarily support redirections with file descriptor numbers
+ * above 9.
+ */
+ ret = snprintf(ctx->devargs, sizeof(ctx->devargs),
+ "exec(exec bash -c "
+ "'while read -r tmp <&%u 2> /dev/null;"
+ " do dev=$tmp; done;"
+ " echo $dev"
+ "'),dev(net_tap_%s,remote=%s)",
+ ctx->pipe[0], ctx->name, ctx->if_name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devargs) - 1)
+ ++i;
+ if (i) {
+ ret = -ENOBUFS;
+ ERROR("generated virtual device name or argument list too long"
+ " for interface \"%s\"", ctx->if_name);
+ goto error;
+ }
+ /*
+ * Remove any competing rte_eth_dev entries sharing the same MAC
+ * address, fail-safe instances created by this PMD will handle them
+ * as sub-devices later.
+ */
+ RTE_ETH_FOREACH_DEV(port_id) {
+ struct rte_device *dev = rte_eth_devices[port_id].device;
+ struct rte_bus *bus = rte_bus_find_by_device(dev);
+ struct ether_addr tmp;
+
+ rte_eth_macaddr_get(port_id, &tmp);
+ if (!is_same_ether_addr(eth_addr, &tmp))
+ continue;
+ WARN("removing device \"%s\" with identical MAC address to"
+ " re-create it as a fail-safe sub-device",
+ dev->name);
+ if (!bus)
+ ret = -EINVAL;
+ else
+ ret = rte_eal_hotplug_remove(bus->name, dev->name);
+ if (ret < 0) {
+ ERROR("unable to remove device \"%s\": %s",
+ dev->name, rte_strerror(-ret));
+ goto error;
+ }
+ }
+ /* Request virtual device generation. */
+ DEBUG("generating virtual device \"%s\" with arguments \"%s\"",
+ ctx->devname, ctx->devargs);
+ ret = rte_eal_hotplug_add("vdev", ctx->devname, ctx->devargs);
+ if (ret)
+ goto error;
+ LIST_INSERT_HEAD(&hyperv_ctx_list, ctx, entry);
+ ++hyperv_ctx_count;
+ DEBUG("added NetVSC interface \"%s\" to context list", ctx->if_name);
+ return 0;
+error:
+ if (ctx)
+ hyperv_ctx_destroy(ctx);
+ return ret;
+}
+
+/**
* Probe NetVSC interfaces.
*
+ * This function probes system netdevices according to the specified device
+ * arguments and starts a periodic alarm callback to notify the resulting
+ * fail-safe PMD instances of their sub-devices whereabouts.
+ *
* Virtual device context for PMD instance.
*
@@ -92,12 +706,38 @@ hyperv_vdev_probe(struct rte_vdev_device *dev)
const char *args = rte_vdev_device_args(dev);
struct rte_kvargs *kvargs = rte_kvargs_parse(args ? args : "",
hyperv_arg);
+ unsigned int specified = 0;
+ unsigned int matched = 0;
+ unsigned int i;
+ int ret;
DEBUG("invoked as \"%s\", using arguments \"%s\"", name, args);
if (!kvargs) {
ERROR("cannot parse arguments list");
goto error;
}
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, HYPERV_ARG_IFACE) ||
+ !strcmp(pair->key, HYPERV_ARG_MAC))
+ ++specified;
+ }
+ rte_eal_alarm_cancel(hyperv_alarm, NULL);
+ /* Gather interfaces. */
+ ret = hyperv_foreach_iface(hyperv_netvsc_probe, name, kvargs,
+ specified, &matched);
+ if (ret < 0)
+ goto error;
+ if (matched < specified)
+ WARN("some of the specified parameters did not match valid"
+ " network interfaces");
+ ret = rte_eal_alarm_set(HYPERV_PROBE_MS * 1000, hyperv_alarm, NULL);
+ if (ret < 0) {
+ ERROR("unable to schedule alarm callback: %s",
+ rte_strerror(-ret));
+ goto error;
+ }
if (kvargs)
rte_kvargs_free(kvargs);
@@ -108,6 +748,9 @@ hyperv_vdev_probe(struct rte_vdev_device *dev)
/**
* Remove PMD instance.
*
+ * The alarm callback and underlying hyperv context instances are only
+ * destroyed after the last PMD instance is removed.
+ *
* Virtual device context for PMD instance.
*
@@ -118,7 +761,16 @@ static int
hyperv_vdev_remove(struct rte_vdev_device *dev)
{
(void)dev;
- --hyperv_ctx_inst;
+ if (--hyperv_ctx_inst)
+ return 0;
+ rte_eal_alarm_cancel(hyperv_alarm, NULL);
+ while (!LIST_EMPTY(&hyperv_ctx_list)) {
+ struct hyperv_ctx *ctx = LIST_FIRST(&hyperv_ctx_list);
+
+ LIST_REMOVE(ctx, entry);
+ --hyperv_ctx_count;
+ hyperv_ctx_destroy(ctx);
+ }
return 0;
}
--
2.11.0
Regards,
Keith
Adrien Mazarguil
2017-12-18 17:59:00 UTC
Post by Wiles, Keith
Post by Adrien Mazarguil
As described in more details in the attached documentation (see patch
contents), this virtual device driver manages NetVSC interfaces in virtual
machines hosted by Hyper-V/Azure platforms.
This driver does not manage traffic nor Ethernet devices directly; it acts
as a thin configuration layer that automatically instantiates and controls
fail-safe PMD instances combining tap and PCI sub-devices, so that each
NetVSC interface is exposed as a single consolidated port to DPDK
applications.
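
To make the shape of such a fail-safe instance more concrete, here is a minimal sketch of how its device arguments could be assembled for one NetVSC interface. The format string follows the one used in hyperv_netvsc_probe(); the helper name `build_failsafe_devargs` and its signature are assumptions made for this illustration and are not part of the patch. The `exec()` sub-device runs a shell command whose output names the PCI device, and `dev()` declares the tap sub-device bound to the NetVSC netdevice.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper (not in the patch): build the fail-safe device
 * arguments for one NetVSC interface.
 *
 * read_fd is the read end of the control pipe, name the unique instance
 * name and if_name the NetVSC netdevice name.
 */
static int
build_failsafe_devargs(char *out, size_t size, int read_fd,
		       const char *name, const char *if_name)
{
	int ret = snprintf(out, size,
			   "exec(exec bash -c "
			   "'while read -r tmp <&%d 2> /dev/null;"
			   " do dev=$tmp; done;"
			   " echo $dev"
			   "'),dev(net_tap_%s,remote=%s)",
			   read_fd, name, if_name);

	/* A negative or too-large return value means the output was cut. */
	if (ret < 0 || (size_t)ret >= size)
		return -ENOBUFS;
	return 0;
}
```

Checking the snprintf() return value against the buffer size is what lets the caller refuse a truncated argument list instead of handing it to the fail-safe PMD.
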
PCI sub-devices being hot-pluggable (e.g. during VM migration),
applications automatically benefit from increased throughput when present
and automatic fallback on NetVSC otherwise without interruption thanks to
fail-safe's hot-plug handling.
Once initialized, the sole job of the hyperv driver is to regularly scan
for PCI devices to associate with NetVSC interfaces and feed their
addresses to corresponding fail-safe instances.
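
The feeding step can be sketched as follows, assuming the address travels as a single newline-terminated line over the control pipe, as hyperv_device_probe() does. The helper name `send_pci_addr`, the 64-byte line buffer and the -EIO return value for short writes are assumptions of this sketch, not part of the patch.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not in the patch): send one newline-terminated PCI
 * address through the write end of a pipe.
 *
 * Returns 0 when the whole line was written, a negative errno value
 * otherwise.
 */
static int
send_pci_addr(int fd, const char *pci_addr)
{
	char line[64];
	size_t len = strlen(pci_addr);
	ssize_t ret;

	if (len + 1 > sizeof(line))
		return -ENOBUFS;
	memcpy(line, pci_addr, len);
	line[len] = '\n';
	ret = write(fd, line, len + 1);
	if (ret == -1)
		return -errno;
	/* A short write leaves a truncated line behind; report it. */
	if ((size_t)ret != len + 1)
		return -EIO;
	return 0;
}
```

On the reading side, the shell loop in the fail-safe device arguments consumes every pending line and keeps only the last one, so a newer address replaces any older one still sitting in the pipe.
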
---
doc/guides/nics/hyperv.rst | 65 ++++
drivers/net/hyperv/Makefile | 4 +
drivers/net/hyperv/hyperv.c | 654 ++++++++++++++++++++++++++++++++++++++-
3 files changed, 722 insertions(+), 1 deletion(-)
<snip>
Post by Wiles, Keith
Post by Adrien Mazarguil
diff --git a/drivers/net/hyperv/hyperv.c b/drivers/net/hyperv/hyperv.c
index 2f940c76f..bad224be9 100644
--- a/drivers/net/hyperv/hyperv.c
+++ b/drivers/net/hyperv/hyperv.c
<snip>
Post by Wiles, Keith
Post by Adrien Mazarguil
+/**
+ * Convert a MAC address string to binary form.
+ *
+ * Note: this function should be exposed by rte_ether.h as the reverse of
+ * ether_format_addr().
+ *
+ *
+ * 1. "12:34:56:78:9a:bc"
+ * 2. "12-34-56-78-9a-bc"
+ * 3. "123456789abc"
+ * 4. Upper/lowercase hexadecimal.
+ * 5. Any combination of the above, e.g. "12:34-5678-9aBC".
+ * - "5:6:78c" translates to "00:00:05:06:07:8c",
+ * - "5678c" translates to "00:00:00:05:67:8c".
+ *
+ * Non-hexadecimal characters, unknown separators and strings specifying
+ * more than 6 bytes are not allowed.
+ *
+ * Pointer to conversion result buffer.
+ * MAC address string to convert.
+ *
+ * 0 on success, -EINVAL in case of unsupported format.
+ */
+static int
+ether_addr_from_str(struct ether_addr *eth_addr, const char *str)
+{
+ static const uint8_t conv[0x100] = {
+ ['0'] = 0x80, ['1'] = 0x81, ['2'] = 0x82, ['3'] = 0x83,
+ ['4'] = 0x84, ['5'] = 0x85, ['6'] = 0x86, ['7'] = 0x87,
+ ['8'] = 0x88, ['9'] = 0x89, ['a'] = 0x8a, ['b'] = 0x8b,
+ ['c'] = 0x8c, ['d'] = 0x8d, ['e'] = 0x8e, ['f'] = 0x8f,
+ ['A'] = 0x8a, ['B'] = 0x8b, ['C'] = 0x8c, ['D'] = 0x8d,
+ ['E'] = 0x8e, ['F'] = 0x8f, [':'] = 0x40, ['-'] = 0x40,
+ ['\0'] = 0x60,
+ };
+ uint64_t addr = 0;
+ uint64_t buf = 0;
+ unsigned int i = 0;
+ unsigned int n = 0;
+ uint8_t tmp;
+
+ do {
+ tmp = conv[(int)*(str++)];
+ if (!tmp)
+ return -EINVAL;
+ if (tmp & 0x40) {
+ i += (i & 1) + (!i << 1);
+ addr = (addr << (i << 2)) | buf;
+ n += i;
+ buf = 0;
+ i = 0;
+ } else {
+ buf = (buf << 4) | (tmp & 0xf);
+ ++i;
+ }
+ } while (!(tmp & 0x20));
+ if (n > 12)
+ return -EINVAL;
+ i = RTE_DIM(eth_addr->addr_bytes);
+ while (i) {
+ eth_addr->addr_bytes[--i] = addr & 0xff;
+ addr >>= 8;
+ }
+ return 0;
+}
You already called this out above, why not just push this into rte_ether.h file. I know I could use it if it were public.
Hehe, that was to highlight how this driver didn't require any modifications
in public APIs. I planned to do just that in v2 or in a subsequent patch.

<snip>
Post by Wiles, Keith
Post by Adrien Mazarguil
+/**
+ * Retrieve the last component of a path.
+ *
+ * This is a simplified basename() that does not modify its input buffer to
+ * handle trailing backslashes.
+ *
+ * Path to retrieve the last component from.
+ *
+ * Pointer to the last component.
+ */
+static const char *
+hyperv_basename(const char *path)
+{
+ const char *tmp = path;
+
+ while (*tmp)
+ if (*(tmp++) == '/')
+ path = tmp;
+ return path;
+}
Why not just use rindex() to find the last '/' instead of this routine? I know it is not performance critical.
Right, however both rindex() and strrchr() return NULL when no '/' is
present. strchrnul() works but is GNU-specific (i.e. probably not found on
BSD), I didn't want to perform an additional check for that, so actually
given the size of that function I didn't give it a second thought. I can
modify that if needed.
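
For comparison, a strrchr()-based variant with the extra NULL check mentioned above could look like the sketch below. The name `basename_strrchr` is made up for this illustration; the behavior is meant to match hyperv_basename(), including returning an empty string for a path that ends with '/'.

```c
#include <string.h>

/*
 * Variant of hyperv_basename() built on strrchr(). The explicit NULL check
 * covers paths that contain no '/' at all.
 */
static const char *
basename_strrchr(const char *path)
{
	const char *slash = strrchr(path, '/');

	return slash ? slash + 1 : path;
}
```
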

<snip>
Post by Wiles, Keith
Post by Adrien Mazarguil
+/**
+ * Probe a network interface to associate with hyperv context.
+ *
+ * This function determines if the network device matches the properties of
+ * the NetVSC interface associated with the hyperv context and communicates
+ * its bus address to the fail-safe PMD instance if so.
+ *
+ * It is normally used with hyperv_foreach_iface().
+ *
+ * Pointer to netdevice description structure (name and index).
+ *
+ * Context to associate network interface with.
+ *
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+hyperv_device_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ struct hyperv_ctx *ctx = va_arg(ap, struct hyperv_ctx *);
+ char buf[RTE_MAX(sizeof(ctx->yield), 256u)];
+ const char *addr;
+ size_t len;
+ int ret;
+
+ /* Skip non-matching or unwanted NetVSC interfaces. */
+ if (ctx->if_index == iface->if_index) {
+ if (!strcmp(ctx->if_name, iface->if_name))
+ return 0;
+ DEBUG("NetVSC interface \"%s\" (index %u) renamed \"%s\"",
+ ctx->if_name, ctx->if_index, iface->if_name);
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ return 0;
+ }
+ if (hyperv_iface_is_netvsc(iface))
+ return 0;
+ if (!is_same_ether_addr(eth_addr, &ctx->if_addr))
+ return 0;
+ /* Look for associated PCI device. */
+ ret = hyperv_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device/subsystem");
+ if (ret)
+ return 0;
+ if (strcmp(hyperv_basename(buf), "pci"))
+ return 0;
+ ret = hyperv_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device");
+ if (ret)
+ return 0;
+ addr = hyperv_basename(buf);
+ len = strlen(addr);
+ if (!len)
+ return 0;
+ /* Send PCI device argument to fail-safe PMD instance if updated. */
+ if (!strcmp(addr, ctx->yield))
+ return 1;
+ DEBUG("associating PCI device \"%s\" with NetVSC interface \"%s\""
+ " (index %u)",
+ addr, ctx->if_name, ctx->if_index);
+ memmove(buf, addr, len + 1);
+ addr = buf;
+ buf[len] = '\n';
+ ret = write(ctx->pipe[1], addr, len + 1);
+ buf[len] = '\0';
+ if (ret == -1) {
+ if (errno == EINTR || errno == EAGAIN)
+ return 1;
+ WARN("cannot associate PCI device name \"%s\" with interface"
+ " \"%s\": %s",
+ addr, ctx->if_name, rte_strerror(errno));
+ return 1;
+ }
+ if ((size_t)ret != len + 1) {
+ /*
+ * Attempt to override previous partial write, no need to
+ * recover if that fails.
+ */
+ ret = write(ctx->pipe[1], "\n", 1);
+ (void)ret;
+ return 1;
+ }
+ fsync(ctx->pipe[1]);
+ memcpy(ctx->yield, addr, len + 1);
+ return 1;
+}
Not to criticize style, but a few blank lines could help in readability for these files IMHO. Unless blank lines are illegal :-)
It's a matter of taste, I think people tend to add random blank lines where
they think doing so clarifies things for themselves, resulting in
inconsistent coding style not much clearer for everyone after several
iterations.

As a maintainer I've grown tired of discussions related to blank lines while
reviewing patches. That's why except for a few special cases, I now enforce
exactly the bare minimum of one blank line between variable declarations and
the rest of the code inside each block.

If doing so makes a function unreadable then perhaps it needs to be split :)
I'm sure you'll understand!

Regards,
--
Adrien Mazarguil
6WIND
Wiles, Keith
2017-12-18 18:43:35 UTC
Post by Adrien Mazarguil
Post by Wiles, Keith
Not to criticize style, but a few blank lines could help in readability for these files IMHO. Unless blank lines are illegal :-)
It's a matter of taste, I think people tend to add random blank lines where
they think doing so clarifies things for themselves, resulting in
inconsistent coding style not much clearer for everyone after several
iterations.
As a maintainer I've grown tired of discussions related to blank lines while
reviewing patches. That's why except for a few special cases, I now enforce
exactly the bare minimum of one blank line between variable declarations and
the rest of the code inside each block.
If doing so makes a function unreadable then perhaps it needs to be split :)
I'm sure you'll understand!
I do not really understand the problem, as I have not seen any complaints about blank lines unless there are two or more in a row. I have never seen someone complain about a given blank line in a function, except for a missing one between the declared variables and the code in a function or block.

It is a shame you have decided to take the minimum approach to blank lines; IMO it does not make a lot of sense. I only bring it up to help others, like our customers, read your code.

We do not have a rule for this, so I cannot force anyone to add blank lines for readability; I have to live with it. :-(
Post by Adrien Mazarguil
Regards,
--
Adrien Mazarguil
6WIND
Regards,
Keith
Nelio Laranjeiro
2017-12-19 08:25:36 UTC
Hi Keith,
Post by Wiles, Keith
Post by Adrien Mazarguil
Post by Wiles, Keith
Not to criticize style, but a few blank lines could help in
readability for these files IMHO. Unless blank lines are illegal
:-)
It's a matter of taste, I think people tend to add random blank lines where
they think doing so clarifies things for themselves, resulting in
inconsistent coding style not much clearer for everyone after several
iterations.
As a maintainer I've grown tired of discussions related to blank lines while
reviewing patches. That's why except for a few special cases, I now enforce
exactly the bare minimum of one blank line between variable declarations and
the rest of the code inside each block.
If doing so makes a function unreadable then perhaps it needs to be split :)
I'm sure you'll understand!
I do not really understand the problem as I have not seen any
complaints about blank lines unless two or more in a row. I have never
seen someone complain about a given blank line in a function, unless a
missing one to split up the declared variables and code in a function
or block of code.
That is true when blank lines are few and logical, but we generally see
patches where random blank lines are added to a file without any logic,
usually just to make it easier to spot where the modifications were
made.
Post by Wiles, Keith
It is a shame you have decided to take the minimum approach to blank
lines, IMO it does not make a lot of sense. I only bring it up to help
others with reading your code like our customers.
We do not have rule for this so I can not force anyone to add blank
lines for readability, so I have to live with it. :-(
As there are no clear rules, the best approach is to keep blank lines to
the bare minimum. Otherwise, explaining the logic behind them is very
difficult, since it differs from one maintainer to another, and it would
increase the number of patches refused due to coding style issues.
Post by Wiles, Keith
Post by Adrien Mazarguil
Regards,
--
Adrien Mazarguil
6WIND
Regards,
Keith
Regards,
--
Nélio Laranjeiro
6WIND
Stephen Hemminger
2017-12-18 18:26:29 UTC
On Mon, 18 Dec 2017 17:46:23 +0100
Post by Adrien Mazarguil
+static int
+ether_addr_from_str(struct ether_addr *eth_addr, const char *str)
+{
+ static const uint8_t conv[0x100] = {
+ ['0'] = 0x80, ['1'] = 0x81, ['2'] = 0x82, ['3'] = 0x83,
+ ['4'] = 0x84, ['5'] = 0x85, ['6'] = 0x86, ['7'] = 0x87,
+ ['8'] = 0x88, ['9'] = 0x89, ['a'] = 0x8a, ['b'] = 0x8b,
+ ['c'] = 0x8c, ['d'] = 0x8d, ['e'] = 0x8e, ['f'] = 0x8f,
+ ['A'] = 0x8a, ['B'] = 0x8b, ['C'] = 0x8c, ['D'] = 0x8d,
+ ['E'] = 0x8e, ['F'] = 0x8f, [':'] = 0x40, ['-'] = 0x40,
+ ['\0'] = 0x60,
+ };
+ uint64_t addr = 0;
+ uint64_t buf = 0;
+ unsigned int i = 0;
+ unsigned int n = 0;
+ uint8_t tmp;
+
+ do {
+ tmp = conv[(int)*(str++)];
+ if (!tmp)
+ return -EINVAL;
+ if (tmp & 0x40) {
+ i += (i & 1) + (!i << 1);
+ addr = (addr << (i << 2)) | buf;
+ n += i;
+ buf = 0;
+ i = 0;
+ } else {
+ buf = (buf << 4) | (tmp & 0xf);
+ ++i;
+ }
+ } while (!(tmp & 0x20));
+ if (n > 12)
+ return -EINVAL;
+ i = RTE_DIM(eth_addr->addr_bytes);
+ while (i) {
+ eth_addr->addr_bytes[--i] = addr & 0xff;
+ addr >>= 8;
+ }
+ return 0;
+}
+
Why not ether_ntoa?
Adrien Mazarguil
2017-12-18 20:21:39 UTC
Post by Stephen Hemminger
On Mon, 18 Dec 2017 17:46:23 +0100
Post by Adrien Mazarguil
+static int
+ether_addr_from_str(struct ether_addr *eth_addr, const char *str)
+{
+ static const uint8_t conv[0x100] = {
+ ['0'] = 0x80, ['1'] = 0x81, ['2'] = 0x82, ['3'] = 0x83,
+ ['4'] = 0x84, ['5'] = 0x85, ['6'] = 0x86, ['7'] = 0x87,
+ ['8'] = 0x88, ['9'] = 0x89, ['a'] = 0x8a, ['b'] = 0x8b,
+ ['c'] = 0x8c, ['d'] = 0x8d, ['e'] = 0x8e, ['f'] = 0x8f,
+ ['A'] = 0x8a, ['B'] = 0x8b, ['C'] = 0x8c, ['D'] = 0x8d,
+ ['E'] = 0x8e, ['F'] = 0x8f, [':'] = 0x40, ['-'] = 0x40,
+ ['\0'] = 0x60,
+ };
+ uint64_t addr = 0;
+ uint64_t buf = 0;
+ unsigned int i = 0;
+ unsigned int n = 0;
+ uint8_t tmp;
+
+ do {
+ tmp = conv[(int)*(str++)];
+ if (!tmp)
+ return -EINVAL;
+ if (tmp & 0x40) {
+ i += (i & 1) + (!i << 1);
+ addr = (addr << (i << 2)) | buf;
+ n += i;
+ buf = 0;
+ i = 0;
+ } else {
+ buf = (buf << 4) | (tmp & 0xf);
+ ++i;
+ }
+ } while (!(tmp & 0x20));
+ if (n > 12)
+ return -EINVAL;
+ i = RTE_DIM(eth_addr->addr_bytes);
+ while (i) {
+ eth_addr->addr_bytes[--i] = addr & 0xff;
+ addr >>= 8;
+ }
+ return 0;
+}
+
Why not ether_ntoa?
Good question. For the following reasons:

- I forgot about the existence of ether_ntoa() and didn't look it up, seeing
struct ether_addr is (re-)defined by rte_ether.h. Including netinet/ether.h
together with that file results in various conflicts that trigger a
compilation error. This problem should be addressed first.

- ether_ntoa() returns a static buffer and is not reentrant. ether_ntoa_r()
is reentrant, but as it is a GNU extension, I'm not sure it exists on other
OSes. Even if this driver currently targets Linux, that is likely not the
case for other DPDK code relying on rte_ether.h.

- I had ether_addr_from_str()'s code already ready and lying around for a
future update in testpmd's flow command parser. No other MAC-48 conversion
function I know of is as flexible as this version. The ability to omit ":"
and enter partial addresses is a big plus IMO.

I think both can coexist on their own merits. Since rte_ether.h needs to be
fixed either way, how about I move this function in a separate commit and
address the conflict with netinet/ether.h while there?
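
To show the accepted formats in action, here is a standalone sketch of the parser. It follows the logic of ether_addr_from_str() from the patch, with a few changes that are assumptions of this sketch: `struct mac_addr` stands in for DPDK's struct ether_addr, the input character is cast to unsigned char before indexing the table, and a guard rejects any single group longer than 12 digits so that shifts stay below 64 bits.

```c
#include <errno.h>
#include <stdint.h>

/* Stand-in for DPDK's struct ether_addr so the sketch builds on its own. */
struct mac_addr {
	uint8_t addr_bytes[6];
};

static int
mac_addr_from_str(struct mac_addr *mac, const char *str)
{
	/* 0x80: hex digit, 0x40: separator, 0x20: end of string. */
	static const uint8_t conv[0x100] = {
		['0'] = 0x80, ['1'] = 0x81, ['2'] = 0x82, ['3'] = 0x83,
		['4'] = 0x84, ['5'] = 0x85, ['6'] = 0x86, ['7'] = 0x87,
		['8'] = 0x88, ['9'] = 0x89, ['a'] = 0x8a, ['b'] = 0x8b,
		['c'] = 0x8c, ['d'] = 0x8d, ['e'] = 0x8e, ['f'] = 0x8f,
		['A'] = 0x8a, ['B'] = 0x8b, ['C'] = 0x8c, ['D'] = 0x8d,
		['E'] = 0x8e, ['F'] = 0x8f, [':'] = 0x40, ['-'] = 0x40,
		['\0'] = 0x60,
	};
	uint64_t addr = 0;
	uint64_t buf = 0;
	unsigned int i = 0; /* Hex digits seen in the current group. */
	unsigned int n = 0; /* Hex digits accounted for so far. */
	uint8_t tmp;

	do {
		tmp = conv[(unsigned char)*(str++)];
		if (!tmp)
			return -EINVAL;
		if (tmp & 0x40) {
			/*
			 * Round the group up to whole bytes; an empty
			 * group counts as one zero byte.
			 */
			i += (i & 1) + (i ? 0 : 2);
			addr = (addr << (i << 2)) | buf;
			n += i;
			buf = 0;
			i = 0;
		} else {
			/* Guard added for this sketch only. */
			if (i == 12)
				return -EINVAL;
			buf = (buf << 4) | (tmp & 0xf);
			++i;
		}
	} while (!(tmp & 0x20));
	if (n > 12)
		return -EINVAL;
	i = sizeof(mac->addr_bytes);
	while (i) {
		mac->addr_bytes[--i] = addr & 0xff;
		addr >>= 8;
	}
	return 0;
}
```

With the inputs listed in the function's documentation, "12:34-5678-9aBC" yields 12:34:56:78:9a:bc, "5:6:78c" yields 00:00:05:06:07:8c and "5678c" yields 00:00:00:05:67:8c, while strings describing more than 6 bytes are rejected with -EINVAL.
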
--
Adrien Mazarguil
6WIND
Thomas Monjalon
2017-12-18 21:03:55 UTC
Post by Adrien Mazarguil
Post by Stephen Hemminger
On Mon, 18 Dec 2017 17:46:23 +0100
Post by Adrien Mazarguil
+static int
+ether_addr_from_str(struct ether_addr *eth_addr, const char *str)
+{
+ static const uint8_t conv[0x100] = {
+ ['0'] = 0x80, ['1'] = 0x81, ['2'] = 0x82, ['3'] = 0x83,
+ ['4'] = 0x84, ['5'] = 0x85, ['6'] = 0x86, ['7'] = 0x87,
+ ['8'] = 0x88, ['9'] = 0x89, ['a'] = 0x8a, ['b'] = 0x8b,
+ ['c'] = 0x8c, ['d'] = 0x8d, ['e'] = 0x8e, ['f'] = 0x8f,
+ ['A'] = 0x8a, ['B'] = 0x8b, ['C'] = 0x8c, ['D'] = 0x8d,
+ ['E'] = 0x8e, ['F'] = 0x8f, [':'] = 0x40, ['-'] = 0x40,
+ ['\0'] = 0x60,
+ };
+ uint64_t addr = 0;
+ uint64_t buf = 0;
+ unsigned int i = 0;
+ unsigned int n = 0;
+ uint8_t tmp;
+
+ do {
+ tmp = conv[(int)*(str++)];
+ if (!tmp)
+ return -EINVAL;
+ if (tmp & 0x40) {
+ i += (i & 1) + (!i << 1);
+ addr = (addr << (i << 2)) | buf;
+ n += i;
+ buf = 0;
+ i = 0;
+ } else {
+ buf = (buf << 4) | (tmp & 0xf);
+ ++i;
+ }
+ } while (!(tmp & 0x20));
+ if (n > 12)
+ return -EINVAL;
+ i = RTE_DIM(eth_addr->addr_bytes);
+ while (i) {
+ eth_addr->addr_bytes[--i] = addr & 0xff;
+ addr >>= 8;
+ }
+ return 0;
+}
+
Why not ether_ntoa?
- I forgot about the existence of ether_ntoa() and didn't look it up seeing
struct ether_addr is (re-)defined by rte_ether.h. What happens when one
includes netinet/ether.h together with that file results in various
conflicts that trigger a compilation error. This problem should be
addressed first.
- ether_ntoa() returns a static buffer and is not reentrant, ether_ntoa_r()
is but as a GNU extension, I'm not sure it exists on other OSes. Even if
this driver is currently targeted at Linux, this is likely not the case
for other DPDK code relying on rte_ether.h.
- I had ether_addr_from_str()'s code already ready and lying around for a
future update in testpmd's flow command parser. No other MAC-48 conversion
function I know of is as flexible as this version. The ability to omit ":"
and entering partial addresses is a big plus IMO.
I think both can coexist on their own merits. Since rte_ether.h needs to be
fixed either way, how about I move this function in a separate commit and
address the conflict with netinet/ether.h while there?
Looks to be a good plan.
Stephen Hemminger
2017-12-18 21:19:57 UTC
On Mon, 18 Dec 2017 22:03:55 +0100
Post by Thomas Monjalon
Post by Adrien Mazarguil
- I forgot about the existence of ether_ntoa() and didn't look it up seeing
struct ether_addr is (re-)defined by rte_ether.h. Including netinet/ether.h
together with that file results in various conflicts that trigger a
compilation error. This problem should be
addressed first.
- ether_ntoa() returns a static buffer and is not reentrant. ether_ntoa_r()
is, but as a GNU extension, I'm not sure it exists on other OSes. Even if
this driver is currently targeted at Linux, this is likely not the case
for other DPDK code relying on rte_ether.h.
- I had ether_addr_from_str()'s code already ready and lying around for a
future update in testpmd's flow command parser. No other MAC-48 conversion
function I know of is as flexible as this version. The ability to omit ":"
and enter partial addresses is a big plus IMO.
I think both can coexist on their own merits. Since rte_ether.h needs to be
fixed either way, how about I move this function in a separate commit and
address the conflict with netinet/ether.h while there?
Looks to be a good plan.
Agree, rte_ether is where it should go. Please put functions for parsing there.
The name and logic conflict between netinet/ether.h and rte is both a blessing
and a curse. Although the definitions of ether_addr overlap, they are equivalent.
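For reference, the reentrancy concern raised above about ether_ntoa() can be sidestepped by formatting into a caller-provided buffer. The following is only a minimal sketch: mac_to_str() is a hypothetical name, not an existing rte_ether.h or libc function, and it takes raw bytes rather than struct ether_addr to stay clear of the header conflict being discussed.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of rte_ether.h): format a 6-byte MAC
 * address into a caller-provided buffer, so no static storage is shared
 * between callers as it is with ether_ntoa().
 *
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int
mac_to_str(char *buf, size_t size, const uint8_t bytes[6])
{
	int ret = snprintf(buf, size, "%02x:%02x:%02x:%02x:%02x:%02x",
			   bytes[0], bytes[1], bytes[2],
			   bytes[3], bytes[4], bytes[5]);

	/* snprintf() returns the length it needed, excluding the NUL. */
	return (ret < 0 || (size_t)ret >= size) ? -1 : 0;
}
```

A buffer of 18 bytes is enough for the 17-character result plus its terminator.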
Stephen Hemminger
2017-12-18 18:34:12 UTC
Permalink
On Mon, 18 Dec 2017 17:46:23 +0100
Post by Adrien Mazarguil
/**
+ * Destroy a hyperv context instance.
+ *
+ * Context to destroy.
+ */
+static void
+hyperv_ctx_destroy(struct hyperv_ctx *ctx)
+{
+ if (ctx->pipe[0] != -1)
+ close(ctx->pipe[0]);
+ if (ctx->pipe[1] != -1)
+ close(ctx->pipe[1]);
+ /* Poisoning for debugging purposes. */
+ memset(ctx, 0x22, sizeof(*ctx));
Don't leave debug code in submitted drivers
Post by Adrien Mazarguil
+ free(ctx);
+}
+
+/**
+ * Iterate over system network interfaces.
+ *
+ * This function runs a given callback function for each netdevice found on
+ * the system.
+ *
+ * Callback function pointer. List traversal is aborted when this function
+ * returns a nonzero value.
+ *
+ * 0 when the entire list is traversed successfully, a negative error code
+ * traversal is aborted.
+ */
+static int
+hyperv_foreach_iface(int (*func)(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap), ...)
+{
+ struct if_nameindex *iface = if_nameindex();
+ int s = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+ unsigned int i;
+ int ret = 0;
+
+ if (!iface) {
+ ret = -ENOBUFS;
+ ERROR("cannot retrieve system network interfaces");
+ goto error;
+ }
+ if (s == -1) {
+ ret = -errno;
+ ERROR("cannot open socket: %s", rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; iface[i].if_name; ++i) {
+ struct ifreq req;
+ struct ether_addr eth_addr;
+ va_list ap;
+
+ strncpy(req.ifr_name, iface[i].if_name, sizeof(req.ifr_name));
+ if (ioctl(s, SIOCGIFHWADDR, &req) == -1) {
+ WARN("cannot retrieve information about interface"
+ " \"%s\": %s",
+ req.ifr_name, rte_strerror(errno));
+ continue;
+ }
+ memcpy(eth_addr.addr_bytes, req.ifr_hwaddr.sa_data,
+ RTE_DIM(eth_addr.addr_bytes));
+ va_start(ap, func);
+ ret = func(&iface[i], &eth_addr, ap);
+ va_end(ap);
+ if (ret)
+ break;
+ }
+error:
+ if (s != -1)
+ close(s);
+ if (iface)
+ if_freenameindex(iface);
+ return ret;
+}
+
+/**
+ * Determine if a network interface is NetVSC.
+ *
+ * Pointer to netdevice description structure (name and index).
+ *
+ * A nonzero value when interface is detected as NetVSC. In case of error,
+ * rte_errno is updated and 0 returned.
+ */
+static int
+hyperv_iface_is_netvsc(const struct if_nameindex *iface)
+{
+ static const char temp[] = "/sys/class/net/%s/device/class_id";
+ char path[snprintf(NULL, 0, temp, iface->if_name) + 1];
Doing this snprintf is gross. Either use PATH_MAX or asprintf
Post by Adrien Mazarguil
+ FILE *f;
+ int ret;
+ int len = 0;
+
+ snprintf(path, sizeof(path), temp, iface->if_name);
+ f = fopen(path, "r");
+ if (!f) {
+ rte_errno = errno;
+ return 0;
+ }
+ ret = fscanf(f, NETVSC_CLASS_ID "%n", &len);
+ if (ret == EOF)
+ rte_errno = errno;
+ ret = len == (int)strlen(NETVSC_CLASS_ID);
+ fclose(f);
+ return ret;
+}
+
+/**
+ * Retrieve the last component of a path.
+ *
+ * This is a simplified basename() that does not modify its input buffer to
+ * handle trailing backslashes.
+ *
+ * Path to retrieve the last component from.
+ *
+ * Pointer to the last component.
+ */
+static const char *
+hyperv_basename(const char *path)
+{
+ const char *tmp = path;
+
+ while (*tmp)
+ if (*(tmp++) == '/')
Too many ()
Post by Adrien Mazarguil
+ path = tmp;
+ return path;
+}
+
+/**
+ * Retrieve network interface data from sysfs symbolic link.
+ *
+ * Output data buffer.
+ * Output buffer size.
+ * Netdevice name.
+ * Symbolic link path relative to netdevice sysfs entry.
+ *
+ * 0 on success, a negative error code otherwise.
+ */
+static int
+hyperv_sysfs_readlink(char *buf, size_t size, const char *if_name,
+ const char *relpath)
+{
+ int ret;
+
+ ret = snprintf(buf, size, "/sys/class/net/%s/%s", if_name, relpath);
+ if (ret == -1 || (size_t)ret >= size - 1)
+ return -ENOBUFS;
+ ret = readlink(buf, buf, size);
+ if (ret == -1)
+ return -errno;
+ if ((size_t)ret >= size - 1)
+ return -ENOBUFS;
+ buf[ret] = '\0';
+ return 0;
+}
+
+/**
+ * Probe a network interface to associate with hyperv context.
+ *
+ * This function determines if the network device matches the properties of
+ * the NetVSC interface associated with the hyperv context and communicates
+ * its bus address to the fail-safe PMD instance if so.
+ *
+ * It is normally used with hyperv_foreach_iface().
+ *
+ * Pointer to netdevice description structure (name and index).
+ *
+ * Context to associate network interface with.
+ *
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+hyperv_device_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ struct hyperv_ctx *ctx = va_arg(ap, struct hyperv_ctx *);
+ char buf[RTE_MAX(sizeof(ctx->yield), 256u)];
+ const char *addr;
+ size_t len;
+ int ret;
+
+ /* Skip non-matching or unwanted NetVSC interfaces. */
+ if (ctx->if_index == iface->if_index) {
+ if (!strcmp(ctx->if_name, iface->if_name))
+ return 0;
+ DEBUG("NetVSC interface \"%s\" (index %u) renamed \"%s\"",
+ ctx->if_name, ctx->if_index, iface->if_name);
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ return 0;
+ }
+ if (hyperv_iface_is_netvsc(iface))
+ return 0;
+ if (!is_same_ether_addr(eth_addr, &ctx->if_addr))
+ return 0;
+ /* Look for associated PCI device. */
+ ret = hyperv_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device/subsystem");
+ if (ret)
+ return 0;
+ if (strcmp(hyperv_basename(buf), "pci"))
+ return 0;
+ ret = hyperv_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device");
+ if (ret)
+ return 0;
+ addr = hyperv_basename(buf);
+ len = strlen(addr);
+ if (!len)
+ return 0;
+ /* Send PCI device argument to fail-safe PMD instance if updated. */
+ if (!strcmp(addr, ctx->yield))
+ return 1;
+ DEBUG("associating PCI device \"%s\" with NetVSC interface \"%s\""
+ " (index %u)",
+ addr, ctx->if_name, ctx->if_index);
+ memmove(buf, addr, len + 1);
+ addr = buf;
+ buf[len] = '\n';
+ ret = write(ctx->pipe[1], addr, len + 1);
+ buf[len] = '\0';
+ if (ret == -1) {
+ if (errno == EINTR || errno == EAGAIN)
+ return 1;
+ WARN("cannot associate PCI device name \"%s\" with interface"
+ " \"%s\": %s",
+ addr, ctx->if_name, rte_strerror(errno));
+ return 1;
+ }
+ if ((size_t)ret != len + 1) {
+ /*
+ * Attempt to override previous partial write, no need to
+ * recover if that fails.
+ */
+ ret = write(ctx->pipe[1], "\n", 1);
+ (void)ret;
+ return 1;
+ }
+ fsync(ctx->pipe[1]);
+ memcpy(ctx->yield, addr, len + 1);
+ return 1;
+}
+
+/**
+ * Alarm callback that regularly probes system network interfaces.
+ *
+ * This callback runs at a frequency determined by HYPERV_PROBE_MS as long
+ * as an hyperv context instance exists.
+ *
+ * Ignored.
+ */
+static void
+hyperv_alarm(void *arg)
+{
+ struct hyperv_ctx *ctx;
+ int ret;
+
+ (void)arg;
I assume you are trying to suppress unused warnings.
The DPDK method of doing this is __rte_unused.
Post by Adrien Mazarguil
+ LIST_FOREACH(ctx, &hyperv_ctx_list, entry) {
+ ret = hyperv_foreach_iface(hyperv_device_probe, ctx);
+ if (ret)
+ break;
+ }
+ if (!hyperv_ctx_count)
+ return;
+ ret = rte_eal_alarm_set(HYPERV_PROBE_MS * 1000, hyperv_alarm, NULL);
+ if (ret < 0) {
+ ERROR("unable to reschedule alarm callback: %s",
+ rte_strerror(-ret));
+ }
+}
+
+/**
+ * Probe a NetVSC interface to generate a hyperv context from.
+ *
+ * This function instantiates hyperv contexts either for all NetVSC devices
+ * found on the system or only a subset provided as device arguments.
+ *
+ * It is normally used with hyperv_foreach_iface().
+ *
+ * Pointer to netdevice description structure (name and index).
+ *
+ * Name associated with current driver instance.
+ *
+ * Device arguments provided to current driver instance.
+ *
+ * Number of specific netdevices provided as device arguments.
+ *
+ * The number of specified netdevices matched by this function.
+ *
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+hyperv_netvsc_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ const char *name = va_arg(ap, const char *);
+ struct rte_kvargs *kvargs = va_arg(ap, struct rte_kvargs *);
+ unsigned int specified = va_arg(ap, unsigned int);
+ unsigned int *matched = va_arg(ap, unsigned int *);
+ unsigned int i;
+ struct hyperv_ctx *ctx;
+ uint16_t port_id;
+ int ret;
+
+ /* Probe all interfaces when none are specified. */
+ if (specified) {
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, HYPERV_ARG_IFACE)) {
+ if (!strcmp(pair->value, iface->if_name))
+ break;
+ } else if (!strcmp(pair->key, HYPERV_ARG_MAC)) {
+ struct ether_addr tmp;
+
+ if (ether_addr_from_str(&tmp, pair->value)) {
+ ERROR("invalid MAC address format"
+ " \"%s\"",
+ pair->value);
+ return -EINVAL;
+ }
+ if (!is_same_ether_addr(eth_addr, &tmp))
+ break;
+ }
+ }
+ if (i == kvargs->count)
+ return 0;
+ ++(*matched);
+ }
+ /* Weed out interfaces already handled. */
+ LIST_FOREACH(ctx, &hyperv_ctx_list, entry)
+ if (ctx->if_index == iface->if_index)
+ break;
+ if (ctx) {
+ if (!specified)
+ return 0;
+ WARN("interface \"%s\" (index %u) is already handled, skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ if (!hyperv_iface_is_netvsc(iface)) {
+ if (!specified)
+ return 0;
+ WARN("interface \"%s\" (index %u) is not NetVSC, skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ /* Create interface context. */
+ ctx = calloc(1, sizeof(*ctx));
+ if (!ctx) {
+ ret = -errno;
+ ERROR("cannot allocate context for interface \"%s\": %s",
+ iface->if_name, rte_strerror(errno));
+ goto error;
+ }
+ ctx->id = hyperv_ctx_count;
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ ctx->if_index = iface->if_index;
+ ctx->if_addr = *eth_addr;
+ ctx->pipe[0] = -1;
+ ctx->pipe[1] = -1;
+ ctx->yield[0] = '\0';
+ if (pipe(ctx->pipe) == -1) {
+ ret = -errno;
+ ERROR("cannot allocate control pipe for interface \"%s\": %s",
+ ctx->if_name, rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; i != RTE_DIM(ctx->pipe); ++i) {
+ int flf = fcntl(ctx->pipe[i], F_GETFL);
+ int fdf = fcntl(ctx->pipe[i], F_GETFD);
+
+ if (flf != -1 &&
+ fcntl(ctx->pipe[i], F_SETFL, flf | O_NONBLOCK) != -1 &&
+ fdf != -1 &&
+ fcntl(ctx->pipe[i], F_SETFD,
+ i ? fdf | FD_CLOEXEC : fdf & ~FD_CLOEXEC) != -1)
+ continue;
+ ret = -errno;
+ ERROR("cannot toggle non-blocking or close-on-exec flags on"
+ " control file descriptor #%u (%d): %s",
+ i, ctx->pipe[i], rte_strerror(errno));
+ goto error;
+ }
+ /* Generate virtual device name and arguments. */
+ i = 0;
+ ret = snprintf(ctx->name, sizeof(ctx->name), "%s_id%u",
+ name, ctx->id);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->name) - 1)
+ ++i;
+ ret = snprintf(ctx->devname, sizeof(ctx->devname), "net_failsafe_%s",
+ ctx->name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devname) - 1)
+ ++i;
+ /*
+ * Note: bash replaces the default sh interpreter used by popen()
+ * because as seen with dash, POSIX-compliant shells do not
+ * necessarily support redirections with file descriptor numbers
+ * above 9.
+ */
+ ret = snprintf(ctx->devargs, sizeof(ctx->devargs),
+ "exec(exec bash -c "
+ "'while read -r tmp <&%u 2> /dev/null;"
+ " do dev=$tmp; done;"
+ " echo $dev"
+ "'),dev(net_tap_%s,remote=%s)",
+ ctx->pipe[0], ctx->name, ctx->if_name);
Write real code. Shelling out to bash is messy, error prone and a potential
security issue.
Post by Adrien Mazarguil
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devargs) - 1)
+ ++i;
+ if (i) {
+ ret = -ENOBUFS;
+ ERROR("generated virtual device name or argument list too long"
+ " for interface \"%s\"", ctx->if_name);
+ goto error;
+ }
+ /*
+ * Remove any competing rte_eth_dev entries sharing the same MAC
+ * address, fail-safe instances created by this PMD will handle them
+ * as sub-devices later.
+ */
+ RTE_ETH_FOREACH_DEV(port_id) {
+ struct rte_device *dev = rte_eth_devices[port_id].device;
+ struct rte_bus *bus = rte_bus_find_by_device(dev);
+ struct ether_addr tmp;
+
+ rte_eth_macaddr_get(port_id, &tmp);
+ if (!is_same_ether_addr(eth_addr, &tmp))
+ continue;
+ WARN("removing device \"%s\" with identical MAC address to"
+ " re-create it as a fail-safe sub-device",
+ dev->name);
+ if (!bus)
+ ret = -EINVAL;
+ else
+ ret = rte_eal_hotplug_remove(bus->name, dev->name);
+ if (ret < 0) {
+ ERROR("unable to remove device \"%s\": %s",
+ dev->name, rte_strerror(-ret));
+ goto error;
+ }
+ }
+ /* Request virtual device generation. */
+ DEBUG("generating virtual device \"%s\" with arguments \"%s\"",
+ ctx->devname, ctx->devargs);
+ ret = rte_eal_hotplug_add("vdev", ctx->devname, ctx->devargs);
+ if (ret)
+ goto error;
+ LIST_INSERT_HEAD(&hyperv_ctx_list, ctx, entry);
+ ++hyperv_ctx_count;
+ DEBUG("added NetVSC interface \"%s\" to context list", ctx->if_name);
+ return 0;
+error:
+ if (ctx)
+ hyperv_ctx_destroy(ctx);
+ return ret;
+}
+
+/**
* Probe NetVSC interfaces.
*
+ * This function probes system netdevices according to the specified device
+ * arguments and starts a periodic alarm callback to notify the resulting
+ * fail-safe PMD instances of their sub-devices whereabouts.
+ *
* Virtual device context for PMD instance.
*
@@ -92,12 +706,38 @@ hyperv_vdev_probe(struct rte_vdev_device *dev)
const char *args = rte_vdev_device_args(dev);
struct rte_kvargs *kvargs = rte_kvargs_parse(args ? args : "",
hyperv_arg);
+ unsigned int specified = 0;
+ unsigned int matched = 0;
+ unsigned int i;
+ int ret;
DEBUG("invoked as \"%s\", using arguments \"%s\"", name, args);
if (!kvargs) {
ERROR("cannot parse arguments list");
goto error;
}
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, HYPERV_ARG_IFACE) ||
+ !strcmp(pair->key, HYPERV_ARG_MAC))
+ ++specified;
+ }
+ rte_eal_alarm_cancel(hyperv_alarm, NULL);
+ /* Gather interfaces. */
+ ret = hyperv_foreach_iface(hyperv_netvsc_probe, name, kvargs,
+ specified, &matched);
+ if (ret < 0)
+ goto error;
+ if (matched < specified)
+ WARN("some of the specified parameters did not match valid"
+ " network interfaces");
+ ret = rte_eal_alarm_set(HYPERV_PROBE_MS * 1000, hyperv_alarm, NULL);
+ if (ret < 0) {
+ ERROR("unable to schedule alarm callback: %s",
+ rte_strerror(-ret));
+ goto error;
+ }
if (kvargs)
rte_kvargs_free(kvargs);
@@ -108,6 +748,9 @@ hyperv_vdev_probe(struct rte_vdev_device *dev)
/**
* Remove PMD instance.
*
+ * The alarm callback and underlying hyperv context instances are only
+ * destroyed after the last PMD instance is removed.
+ *
* Virtual device context for PMD instance.
*
@@ -118,7 +761,16 @@ static int
hyperv_vdev_remove(struct rte_vdev_device *dev)
{
(void)dev;
- --hyperv_ctx_inst;
+ if (--hyperv_ctx_inst)
+ return 0;
+ rte_eal_alarm_cancel(hyperv_alarm, NULL);
+ while (!LIST_EMPTY(&hyperv_ctx_list)) {
+ struct hyperv_ctx *ctx = LIST_FIRST(&hyperv_ctx_list);
+
+ LIST_REMOVE(ctx, entry);
+ --hyperv_ctx_count;
+ hyperv_ctx_destroy(ctx);
+ }
return 0;
}
Adrien Mazarguil
2017-12-18 20:23:41 UTC
Permalink
Post by Stephen Hemminger
On Mon, 18 Dec 2017 17:46:23 +0100
Post by Adrien Mazarguil
/**
+ * Destroy a hyperv context instance.
+ *
+ * Context to destroy.
+ */
+static void
+hyperv_ctx_destroy(struct hyperv_ctx *ctx)
+{
+ if (ctx->pipe[0] != -1)
+ close(ctx->pipe[0]);
+ if (ctx->pipe[1] != -1)
+ close(ctx->pipe[1]);
+ /* Poisoning for debugging purposes. */
+ memset(ctx, 0x22, sizeof(*ctx));
Don't leave debug code in submitted drivers
Granted, this line should be behind #ifdef RTE_LIBRTE_HYPERV_DEBUG.

Surely you don't mean *no* debugging code at all? This memset() allows an
application to crash early in case its control path parallelizes things it
shouldn't.
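As an illustration of the compromise mentioned here, the poisoning can be kept but compiled in only for debug builds. This is a sketch under assumptions: struct ctx_sketch and ctx_sketch_destroy() are hypothetical stand-ins for the driver's own context type and destructor, and only the RTE_LIBRTE_HYPERV_DEBUG macro name comes from this thread.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical reduced context; the real one carries more fields. */
struct ctx_sketch {
	int pipe[2];
};

static void
ctx_sketch_destroy(struct ctx_sketch *ctx)
{
	if (ctx->pipe[0] != -1)
		close(ctx->pipe[0]);
	if (ctx->pipe[1] != -1)
		close(ctx->pipe[1]);
#ifdef RTE_LIBRTE_HYPERV_DEBUG
	/* Debug builds only: poison so that stale users crash early. */
	memset(ctx, 0x22, sizeof(*ctx));
#endif
	free(ctx);
}
```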
Post by Stephen Hemminger
Post by Adrien Mazarguil
+ free(ctx);
+}
+
+/**
+ * Iterate over system network interfaces.
+ *
+ * This function runs a given callback function for each netdevice found on
+ * the system.
+ *
+ * Callback function pointer. List traversal is aborted when this function
+ * returns a nonzero value.
+ *
+ * 0 when the entire list is traversed successfully, a negative error code
+ * traversal is aborted.
+ */
+static int
+hyperv_foreach_iface(int (*func)(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap), ...)
+{
+ struct if_nameindex *iface = if_nameindex();
+ int s = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+ unsigned int i;
+ int ret = 0;
+
+ if (!iface) {
+ ret = -ENOBUFS;
+ ERROR("cannot retrieve system network interfaces");
+ goto error;
+ }
+ if (s == -1) {
+ ret = -errno;
+ ERROR("cannot open socket: %s", rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; iface[i].if_name; ++i) {
+ struct ifreq req;
+ struct ether_addr eth_addr;
+ va_list ap;
+
+ strncpy(req.ifr_name, iface[i].if_name, sizeof(req.ifr_name));
+ if (ioctl(s, SIOCGIFHWADDR, &req) == -1) {
+ WARN("cannot retrieve information about interface"
+ " \"%s\": %s",
+ req.ifr_name, rte_strerror(errno));
+ continue;
+ }
+ memcpy(eth_addr.addr_bytes, req.ifr_hwaddr.sa_data,
+ RTE_DIM(eth_addr.addr_bytes));
+ va_start(ap, func);
+ ret = func(&iface[i], &eth_addr, ap);
+ va_end(ap);
+ if (ret)
+ break;
+ }
+error:
+ if (s != -1)
+ close(s);
+ if (iface)
+ if_freenameindex(iface);
+ return ret;
+}
+
+/**
+ * Determine if a network interface is NetVSC.
+ *
+ * Pointer to netdevice description structure (name and index).
+ *
+ * A nonzero value when interface is detected as NetVSC. In case of error,
+ * rte_errno is updated and 0 returned.
+ */
+static int
+hyperv_iface_is_netvsc(const struct if_nameindex *iface)
+{
+ static const char temp[] = "/sys/class/net/%s/device/class_id";
+ char path[snprintf(NULL, 0, temp, iface->if_name) + 1];
Doing this snprintf is gross. Either use PATH_MAX or asprintf
I don't think allocating more stack space than necessary or on the heap with
a possible allocation failure to deal with is any better, sorry.

Prove this snprintf() call can fail and you'll have a point.
Post by Stephen Hemminger
Post by Adrien Mazarguil
+ FILE *f;
+ int ret;
+ int len = 0;
+
+ snprintf(path, sizeof(path), temp, iface->if_name);
+ f = fopen(path, "r");
+ if (!f) {
+ rte_errno = errno;
+ return 0;
+ }
+ ret = fscanf(f, NETVSC_CLASS_ID "%n", &len);
+ if (ret == EOF)
+ rte_errno = errno;
+ ret = len == (int)strlen(NETVSC_CLASS_ID);
+ fclose(f);
+ return ret;
+}
+
+/**
+ * Retrieve the last component of a path.
+ *
+ * This is a simplified basename() that does not modify its input buffer to
+ * handle trailing backslashes.
+ *
+ * Path to retrieve the last component from.
+ *
+ * Pointer to the last component.
+ */
+static const char *
+hyperv_basename(const char *path)
+{
+ const char *tmp = path;
+
+ while (*tmp)
+ if (*(tmp++) == '/')
Too many ()
Will remove it. I'm considering using strrchr() in the caller and removing
this function entirely, following Keith's comment.
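A rough idea of what the strrchr() variant could look like, for comparison with the loop above. The helper name last_component() is hypothetical; in the caller this would likely be inlined.

```c
#include <string.h>

/*
 * Hypothetical replacement for hyperv_basename(): return a pointer to the
 * text following the last '/' in path, or path itself if there is none.
 * Like the original loop, a trailing '/' yields an empty string.
 */
static const char *
last_component(const char *path)
{
	const char *slash = strrchr(path, '/');

	return slash ? slash + 1 : path;
}
```

For a sysfs link target such as "../../devices/pci0000:00/0002:00:02.0" (an example value, not taken from a real system), this returns a pointer to "0002:00:02.0".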
Post by Stephen Hemminger
Post by Adrien Mazarguil
+ path = tmp;
+ return path;
+}
+
+/**
+ * Retrieve network interface data from sysfs symbolic link.
+ *
+ * Output data buffer.
+ * Output buffer size.
+ * Netdevice name.
+ * Symbolic link path relative to netdevice sysfs entry.
+ *
+ * 0 on success, a negative error code otherwise.
+ */
+static int
+hyperv_sysfs_readlink(char *buf, size_t size, const char *if_name,
+ const char *relpath)
+{
+ int ret;
+
+ ret = snprintf(buf, size, "/sys/class/net/%s/%s", if_name, relpath);
+ if (ret == -1 || (size_t)ret >= size - 1)
+ return -ENOBUFS;
+ ret = readlink(buf, buf, size);
+ if (ret == -1)
+ return -errno;
+ if ((size_t)ret >= size - 1)
+ return -ENOBUFS;
+ buf[ret] = '\0';
+ return 0;
+}
+
+/**
+ * Probe a network interface to associate with hyperv context.
+ *
+ * This function determines if the network device matches the properties of
+ * the NetVSC interface associated with the hyperv context and communicates
+ * its bus address to the fail-safe PMD instance if so.
+ *
+ * It is normally used with hyperv_foreach_iface().
+ *
+ * Pointer to netdevice description structure (name and index).
+ *
+ * Context to associate network interface with.
+ *
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+hyperv_device_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ struct hyperv_ctx *ctx = va_arg(ap, struct hyperv_ctx *);
+ char buf[RTE_MAX(sizeof(ctx->yield), 256u)];
+ const char *addr;
+ size_t len;
+ int ret;
+
+ /* Skip non-matching or unwanted NetVSC interfaces. */
+ if (ctx->if_index == iface->if_index) {
+ if (!strcmp(ctx->if_name, iface->if_name))
+ return 0;
+ DEBUG("NetVSC interface \"%s\" (index %u) renamed \"%s\"",
+ ctx->if_name, ctx->if_index, iface->if_name);
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ return 0;
+ }
+ if (hyperv_iface_is_netvsc(iface))
+ return 0;
+ if (!is_same_ether_addr(eth_addr, &ctx->if_addr))
+ return 0;
+ /* Look for associated PCI device. */
+ ret = hyperv_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device/subsystem");
+ if (ret)
+ return 0;
+ if (strcmp(hyperv_basename(buf), "pci"))
+ return 0;
+ ret = hyperv_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device");
+ if (ret)
+ return 0;
+ addr = hyperv_basename(buf);
+ len = strlen(addr);
+ if (!len)
+ return 0;
+ /* Send PCI device argument to fail-safe PMD instance if updated. */
+ if (!strcmp(addr, ctx->yield))
+ return 1;
+ DEBUG("associating PCI device \"%s\" with NetVSC interface \"%s\""
+ " (index %u)",
+ addr, ctx->if_name, ctx->if_index);
+ memmove(buf, addr, len + 1);
+ addr = buf;
+ buf[len] = '\n';
+ ret = write(ctx->pipe[1], addr, len + 1);
+ buf[len] = '\0';
+ if (ret == -1) {
+ if (errno == EINTR || errno == EAGAIN)
+ return 1;
+ WARN("cannot associate PCI device name \"%s\" with interface"
+ " \"%s\": %s",
+ addr, ctx->if_name, rte_strerror(errno));
+ return 1;
+ }
+ if ((size_t)ret != len + 1) {
+ /*
+ * Attempt to override previous partial write, no need to
+ * recover if that fails.
+ */
+ ret = write(ctx->pipe[1], "\n", 1);
+ (void)ret;
+ return 1;
+ }
+ fsync(ctx->pipe[1]);
+ memcpy(ctx->yield, addr, len + 1);
+ return 1;
+}
+
+/**
+ * Alarm callback that regularly probes system network interfaces.
+ *
+ * This callback runs at a frequency determined by HYPERV_PROBE_MS as long
+ * as an hyperv context instance exists.
+ *
+ * Ignored.
+ */
+static void
+hyperv_alarm(void *arg)
+{
+ struct hyperv_ctx *ctx;
+ int ret;
+
+ (void)arg;
I assume you are trying to suppress unused warnings.
The DPDK method of doing this is __rte_unused.
This syntax is the standard method for suppressing such warnings, whereas
__rte_unused relies on a GNU syntax extension for that. I usually tend
to favor standard forms when they exist.

Given that DPDK coding rules don't say anything about this, I don't mind
updating it if you really insist.
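The two forms under discussion, side by side. In this sketch, sketch_unused is a hypothetical stand-in defined locally so the example compiles without DPDK headers; DPDK's __rte_unused expands to the same GNU attribute. The callbacks return int only to make the example easy to exercise.

```c
#include <stddef.h>

/* Hypothetical local equivalent of DPDK's __rte_unused (GNU extension). */
#define sketch_unused __attribute__((__unused__))

/* Standard C form: evaluate the parameter and discard the result. */
static int
alarm_standard(void *arg)
{
	(void)arg;
	return 0;
}

/* Attribute form: mark the parameter as intentionally unused. */
static int
alarm_attribute(void *arg sketch_unused)
{
	return 0;
}
```

Both silence -Wunused-parameter with GCC and Clang; only the first is portable to compilers that lack the attribute syntax.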
Post by Stephen Hemminger
Post by Adrien Mazarguil
+ LIST_FOREACH(ctx, &hyperv_ctx_list, entry) {
+ ret = hyperv_foreach_iface(hyperv_device_probe, ctx);
+ if (ret)
+ break;
+ }
+ if (!hyperv_ctx_count)
+ return;
+ ret = rte_eal_alarm_set(HYPERV_PROBE_MS * 1000, hyperv_alarm, NULL);
+ if (ret < 0) {
+ ERROR("unable to reschedule alarm callback: %s",
+ rte_strerror(-ret));
+ }
+}
+
+/**
+ * Probe a NetVSC interface to generate a hyperv context from.
+ *
+ * This function instantiates hyperv contexts either for all NetVSC devices
+ * found on the system or only a subset provided as device arguments.
+ *
+ * It is normally used with hyperv_foreach_iface().
+ *
+ * Pointer to netdevice description structure (name and index).
+ *
+ * Name associated with current driver instance.
+ *
+ * Device arguments provided to current driver instance.
+ *
+ * Number of specific netdevices provided as device arguments.
+ *
+ * The number of specified netdevices matched by this function.
+ *
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+hyperv_netvsc_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ const char *name = va_arg(ap, const char *);
+ struct rte_kvargs *kvargs = va_arg(ap, struct rte_kvargs *);
+ unsigned int specified = va_arg(ap, unsigned int);
+ unsigned int *matched = va_arg(ap, unsigned int *);
+ unsigned int i;
+ struct hyperv_ctx *ctx;
+ uint16_t port_id;
+ int ret;
+
+ /* Probe all interfaces when none are specified. */
+ if (specified) {
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, HYPERV_ARG_IFACE)) {
+ if (!strcmp(pair->value, iface->if_name))
+ break;
+ } else if (!strcmp(pair->key, HYPERV_ARG_MAC)) {
+ struct ether_addr tmp;
+
+ if (ether_addr_from_str(&tmp, pair->value)) {
+ ERROR("invalid MAC address format"
+ " \"%s\"",
+ pair->value);
+ return -EINVAL;
+ }
+ if (!is_same_ether_addr(eth_addr, &tmp))
+ break;
+ }
+ }
+ if (i == kvargs->count)
+ return 0;
+ ++(*matched);
+ }
+ /* Weed out interfaces already handled. */
+ LIST_FOREACH(ctx, &hyperv_ctx_list, entry)
+ if (ctx->if_index == iface->if_index)
+ break;
+ if (ctx) {
+ if (!specified)
+ return 0;
+ WARN("interface \"%s\" (index %u) is already handled, skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ if (!hyperv_iface_is_netvsc(iface)) {
+ if (!specified)
+ return 0;
+ WARN("interface \"%s\" (index %u) is not NetVSC, skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ /* Create interface context. */
+ ctx = calloc(1, sizeof(*ctx));
+ if (!ctx) {
+ ret = -errno;
+ ERROR("cannot allocate context for interface \"%s\": %s",
+ iface->if_name, rte_strerror(errno));
+ goto error;
+ }
+ ctx->id = hyperv_ctx_count;
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ ctx->if_index = iface->if_index;
+ ctx->if_addr = *eth_addr;
+ ctx->pipe[0] = -1;
+ ctx->pipe[1] = -1;
+ ctx->yield[0] = '\0';
+ if (pipe(ctx->pipe) == -1) {
+ ret = -errno;
+ ERROR("cannot allocate control pipe for interface \"%s\": %s",
+ ctx->if_name, rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; i != RTE_DIM(ctx->pipe); ++i) {
+ int flf = fcntl(ctx->pipe[i], F_GETFL);
+ int fdf = fcntl(ctx->pipe[i], F_GETFD);
+
+ if (flf != -1 &&
+ fcntl(ctx->pipe[i], F_SETFL, flf | O_NONBLOCK) != -1 &&
+ fdf != -1 &&
+ fcntl(ctx->pipe[i], F_SETFD,
+ i ? fdf | FD_CLOEXEC : fdf & ~FD_CLOEXEC) != -1)
+ continue;
+ ret = -errno;
+ ERROR("cannot toggle non-blocking or close-on-exec flags on"
+ " control file descriptor #%u (%d): %s",
+ i, ctx->pipe[i], rte_strerror(errno));
+ goto error;
+ }
+ /* Generate virtual device name and arguments. */
+ i = 0;
+ ret = snprintf(ctx->name, sizeof(ctx->name), "%s_id%u",
+ name, ctx->id);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->name) - 1)
+ ++i;
+ ret = snprintf(ctx->devname, sizeof(ctx->devname), "net_failsafe_%s",
+ ctx->name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devname) - 1)
+ ++i;
+ /*
+ * Note: bash replaces the default sh interpreter used by popen()
+ * because as seen with dash, POSIX-compliant shells do not
+ * necessarily support redirections with file descriptor numbers
+ * above 9.
+ */
+ ret = snprintf(ctx->devargs, sizeof(ctx->devargs),
+ "exec(exec bash -c "
+ "'while read -r tmp <&%u 2> /dev/null;"
+ " do dev=$tmp; done;"
+ " echo $dev"
+ "'),dev(net_tap_%s,remote=%s)",
+ ctx->pipe[0], ctx->name, ctx->if_name);
Write real code. Shelling out to bash is messy, error prone and a potential
security issue.
Right, this code only conveys the basic idea. I forgot to mention it in the
cover letter: I plan a subsequent commit in fail-safe PMD to add file
descriptors as a possible control means in addition to its exec() parameter.
Post by Stephen Hemminger
Post by Adrien Mazarguil
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devargs) - 1)
+ ++i;
+ if (i) {
+ ret = -ENOBUFS;
+ ERROR("generated virtual device name or argument list too long"
+ " for interface \"%s\"", ctx->if_name);
+ goto error;
+ }
+ /*
+ * Remove any competing rte_eth_dev entries sharing the same MAC
+ * address, fail-safe instances created by this PMD will handle them
+ * as sub-devices later.
+ */
+ RTE_ETH_FOREACH_DEV(port_id) {
+ struct rte_device *dev = rte_eth_devices[port_id].device;
+ struct rte_bus *bus = rte_bus_find_by_device(dev);
+ struct ether_addr tmp;
+
+ rte_eth_macaddr_get(port_id, &tmp);
+ if (!is_same_ether_addr(eth_addr, &tmp))
+ continue;
+ WARN("removing device \"%s\" with identical MAC address to"
+ " re-create it as a fail-safe sub-device",
+ dev->name);
+ if (!bus)
+ ret = -EINVAL;
+ else
+ ret = rte_eal_hotplug_remove(bus->name, dev->name);
+ if (ret < 0) {
+ ERROR("unable to remove device \"%s\": %s",
+ dev->name, rte_strerror(-ret));
+ goto error;
+ }
+ }
+ /* Request virtual device generation. */
+ DEBUG("generating virtual device \"%s\" with arguments \"%s\"",
+ ctx->devname, ctx->devargs);
+ ret = rte_eal_hotplug_add("vdev", ctx->devname, ctx->devargs);
+ if (ret)
+ goto error;
+ LIST_INSERT_HEAD(&hyperv_ctx_list, ctx, entry);
+ ++hyperv_ctx_count;
+ DEBUG("added NetVSC interface \"%s\" to context list", ctx->if_name);
+ return 0;
+error:
+ if (ctx)
+ hyperv_ctx_destroy(ctx);
+ return ret;
+}
+
+/**
* Probe NetVSC interfaces.
*
+ * This function probes system netdevices according to the specified device
+ * arguments and starts a periodic alarm callback to notify the resulting
+ * fail-safe PMD instances of their sub-devices whereabouts.
+ *
* Virtual device context for PMD instance.
*
@@ -92,12 +706,38 @@ hyperv_vdev_probe(struct rte_vdev_device *dev)
const char *args = rte_vdev_device_args(dev);
struct rte_kvargs *kvargs = rte_kvargs_parse(args ? args : "",
hyperv_arg);
+ unsigned int specified = 0;
+ unsigned int matched = 0;
+ unsigned int i;
+ int ret;
DEBUG("invoked as \"%s\", using arguments \"%s\"", name, args);
if (!kvargs) {
ERROR("cannot parse arguments list");
goto error;
}
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, HYPERV_ARG_IFACE) ||
+ !strcmp(pair->key, HYPERV_ARG_MAC))
+ ++specified;
+ }
+ rte_eal_alarm_cancel(hyperv_alarm, NULL);
+ /* Gather interfaces. */
+ ret = hyperv_foreach_iface(hyperv_netvsc_probe, name, kvargs,
+ specified, &matched);
+ if (ret < 0)
+ goto error;
+ if (matched < specified)
+ WARN("some of the specified parameters did not match valid"
+ " network interfaces");
+ ret = rte_eal_alarm_set(HYPERV_PROBE_MS * 1000, hyperv_alarm, NULL);
+ if (ret < 0) {
+ ERROR("unable to schedule alarm callback: %s",
+ rte_strerror(-ret));
+ goto error;
+ }
if (kvargs)
rte_kvargs_free(kvargs);
@@ -108,6 +748,9 @@ hyperv_vdev_probe(struct rte_vdev_device *dev)
/**
* Remove PMD instance.
*
+ * The alarm callback and underlying hyperv context instances are only
+ * destroyed after the last PMD instance is removed.
+ *
* Virtual device context for PMD instance.
*
@@ -118,7 +761,16 @@ static int
hyperv_vdev_remove(struct rte_vdev_device *dev)
{
(void)dev;
- --hyperv_ctx_inst;
+ if (--hyperv_ctx_inst)
+ return 0;
+ rte_eal_alarm_cancel(hyperv_alarm, NULL);
+ while (!LIST_EMPTY(&hyperv_ctx_list)) {
+ struct hyperv_ctx *ctx = LIST_FIRST(&hyperv_ctx_list);
+
+ LIST_REMOVE(ctx, entry);
+ --hyperv_ctx_count;
+ hyperv_ctx_destroy(ctx);
+ }
return 0;
}
In any case, thanks for the quick review!
--
Adrien Mazarguil
6WIND
Bruce Richardson
2017-12-19 09:53:27 UTC
Permalink
Post by Adrien Mazarguil
Post by Stephen Hemminger
On Mon, 18 Dec 2017 17:46:23 +0100
<snip>
Post by Adrien Mazarguil
Post by Stephen Hemminger
Post by Adrien Mazarguil
+static int
+hyperv_iface_is_netvsc(const struct if_nameindex *iface)
+{
+ static const char temp[] = "/sys/class/net/%s/device/class_id";
+ char path[snprintf(NULL, 0, temp, iface->if_name) + 1];
Doing this snprintf is gross. Either use PATH_MAX or asprintf
I don't think allocating more stack space than necessary or on the heap with
a possible allocation failure to deal with is any better, sorry.
Prove this snprintf() call can fail and you'll have a point.
While I get your point, I'd tend to go with Stephen's view on this that
it's looking a bit "gross". What's the problem with allocating a bit
more stack space for it?

/Bruce
Adrien Mazarguil
2017-12-19 10:15:38 UTC
Post by Adrien Mazarguil
Post by Adrien Mazarguil
Post by Stephen Hemminger
On Mon, 18 Dec 2017 17:46:23 +0100
<snip>
Post by Adrien Mazarguil
Post by Stephen Hemminger
Post by Adrien Mazarguil
+static int
+hyperv_iface_is_netvsc(const struct if_nameindex *iface)
+{
+ static const char temp[] = "/sys/class/net/%s/device/class_id";
+ char path[snprintf(NULL, 0, temp, iface->if_name) + 1];
Doing this snprintf is gross. Either use PATH_MAX or asprintf
I don't think allocating more stack space than necessary or on the heap with
a possible allocation failure to deal with is any better, sorry.
Prove this snprintf() call can fail and you'll have a point.
While I get your point, I'd tend to go with Stephen's view on this that
it's looking a bit "gross". What's the problem with allocating a bit
more stack space for it?
Well, apart from making a stand, none really. Too "unusual" perhaps, but I
don't think "gross" is a valid argument to reject a perfectly valid piece of
code that doesn't rely on obscure knowledge nor weird side effects.

I'll update this in v2 to make it look more acceptable in any case.
--
Adrien Mazarguil
6WIND
Stephen Hemminger
2017-12-19 15:31:30 UTC
On Tue, 19 Dec 2017 11:15:38 +0100
Post by Adrien Mazarguil
Post by Adrien Mazarguil
Post by Adrien Mazarguil
Post by Stephen Hemminger
On Mon, 18 Dec 2017 17:46:23 +0100
<snip>
Post by Adrien Mazarguil
Post by Stephen Hemminger
Post by Adrien Mazarguil
+static int
+hyperv_iface_is_netvsc(const struct if_nameindex *iface)
+{
+ static const char temp[] = "/sys/class/net/%s/device/class_id";
+ char path[snprintf(NULL, 0, temp, iface->if_name) + 1];
Doing this snprintf is gross. Either use PATH_MAX or asprintf
I don't think allocating more stack space than necessary or on the heap with
a possible allocation failure to deal with is any better, sorry.
Prove this snprintf() call can fail and you'll have a point.
While I get your point, I'd tend to go with Stephen's view on this that
it's looking a bit "gross". What's the problem with allocating a bit
more stack space for it?
Well, apart from making a stand, none really. Too "unusual" perhaps, but I
don't think "gross" is a valid argument to reject a perfectly valid piece of
code that doesn't rely on obscure knowledge nor weird side effects.
I'll update this in v2 to make it look more acceptable in any case.
In this particular case, you can easily show that the maximum length of
the string would be less than the format plus maximum length of interface
name.

Why not:
char path[sizeof(temp) + IFNAMSIZ];
which keeps the flexibility but also can be evaluated at compile time.


Upleveling. You need to understand that open source software is a collaborative
effort. And like doing improvisational theatre, the best answer to any
feedback is yes unless there is a technical reason otherwise.
Stephen Hemminger
2017-12-18 23:59:46 UTC
On Mon, 18 Dec 2017 17:46:23 +0100
Post by Adrien Mazarguil
+static int
+ether_addr_from_str(struct ether_addr *eth_addr, const char *str)
+{
+ static const uint8_t conv[0x100] = {
+ ['0'] = 0x80, ['1'] = 0x81, ['2'] = 0x82, ['3'] = 0x83,
+ ['4'] = 0x84, ['5'] = 0x85, ['6'] = 0x86, ['7'] = 0x87,
+ ['8'] = 0x88, ['9'] = 0x89, ['a'] = 0x8a, ['b'] = 0x8b,
+ ['c'] = 0x8c, ['d'] = 0x8d, ['e'] = 0x8e, ['f'] = 0x8f,
+ ['A'] = 0x8a, ['B'] = 0x8b, ['C'] = 0x8c, ['D'] = 0x8d,
+ ['E'] = 0x8e, ['F'] = 0x8f, [':'] = 0x40, ['-'] = 0x40,
+ ['\0'] = 0x60,
+ };
+ uint64_t addr = 0;
+ uint64_t buf = 0;
+ unsigned int i = 0;
+ unsigned int n = 0;
+ uint8_t tmp;
+
+ do {
+ tmp = conv[(int)*(str++)];
Cast to int will cause out of bounds reference on non-ascii strings.
The parser will get confused by:
001:aa:bb:cc:dd:ee:ff or invalid strings.

Why not use sscanf which would be safer in this case.


/**
* Parse 48bits Ethernet address in pattern xx:xx:xx:xx:xx:xx.
*
* @param eth_addr
* A pointer to a ether_addr structure.
* @param str
* A pointer to string contains the formatted MAC address.
* @return
* 0 if the address is valid
* -EINVAL if address is not formatted properly
*/
static inline int
ether_parse_addr(struct ether_addr *eth_addr, const char *str)
{
int n;

n = sscanf(str,
"%hhx:%hhx:%hhx:%hhx:%hhx:%hhx",
&eth_addr->addr_bytes[0],
&eth_addr->addr_bytes[1],
&eth_addr->addr_bytes[2],
&eth_addr->addr_bytes[3],
&eth_addr->addr_bytes[4],
&eth_addr->addr_bytes[5]);
return (n == ETHER_ADDR_LEN) ? 0 : -EINVAL;
}
Post by Adrien Mazarguil
+ if (!tmp)
+ return -EINVAL;
+ if (tmp & 0x40) {
+ i += (i & 1) + (!i << 1);
+ addr = (addr << (i << 2)) | buf;
+ n += i;
+ buf = 0;
+ i = 0;
+ } else {
+ buf = (buf << 4) | (tmp & 0xf);
+ ++i;
+ }
+ } while (!(tmp & 0x20));
+ if (n > 12)
+ return -EINVAL;
+ i = RTE_DIM(eth_addr->addr_bytes);
+ while (i) {
+ eth_addr->addr_bytes[--i] = addr & 0xff;
+ addr >>= 8;
+ }
+ return 0;
+}
+
Adrien Mazarguil
2017-12-19 10:01:55 UTC
Post by Stephen Hemminger
On Mon, 18 Dec 2017 17:46:23 +0100
Post by Adrien Mazarguil
+static int
+ether_addr_from_str(struct ether_addr *eth_addr, const char *str)
+{
+ static const uint8_t conv[0x100] = {
+ ['0'] = 0x80, ['1'] = 0x81, ['2'] = 0x82, ['3'] = 0x83,
+ ['4'] = 0x84, ['5'] = 0x85, ['6'] = 0x86, ['7'] = 0x87,
+ ['8'] = 0x88, ['9'] = 0x89, ['a'] = 0x8a, ['b'] = 0x8b,
+ ['c'] = 0x8c, ['d'] = 0x8d, ['e'] = 0x8e, ['f'] = 0x8f,
+ ['A'] = 0x8a, ['B'] = 0x8b, ['C'] = 0x8c, ['D'] = 0x8d,
+ ['E'] = 0x8e, ['F'] = 0x8f, [':'] = 0x40, ['-'] = 0x40,
+ ['\0'] = 0x60,
+ };
+ uint64_t addr = 0;
+ uint64_t buf = 0;
+ unsigned int i = 0;
+ unsigned int n = 0;
+ uint8_t tmp;
+
+ do {
+ tmp = conv[(int)*(str++)];
Cast to int will cause out of bounds reference on non-ascii strings.
The parser will get confused by:
001:aa:bb:cc:dd:ee:ff or invalid strings.
Nice catch! I added the (int) cast to shut up a GCC complaint about using
char as index type. The proper fix taking care of integer conversion and
array bounds safety check should read:

tmp = conv[*str++ & 0xffu];
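
To illustrate why the mask matters, here is a small sketch with a reduced lookup table. On platforms where plain `char` is signed, a byte such as 0xe9 becomes a negative index if it is only cast to `int`; masking with `0xffu` maps it into the 0..255 range first. The helper `conv_lookup()` and the shortened table are hypothetical and only mirror the shape of the code under review.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reduced table: a nonzero entry marks a recognised character. */
static const uint8_t conv[0x100] = {
	['0'] = 0x80, ['a'] = 0x8a, [':'] = 0x40, ['\0'] = 0x60,
};

/* Look up the first character of str with an index kept within 0..255. */
static uint8_t
conv_lookup(const char *str)
{
	unsigned int idx = *str & 0xffu;

	assert(idx < sizeof(conv));
	return conv[idx];
}
```

With this lookup, a non-ASCII byte lands on a zero entry and is rejected like any other unknown character instead of reading outside the table.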
Post by Stephen Hemminger
Why not use sscanf which would be safer in this case.
Right, this is indeed the obvious implementation. However, not only is the
fixed MAC-48 format inconvenient for user input (somewhat like forcing
users to enter fully expanded IPv6 addresses every time), sscanf() also
ignores leading white space and successfully parses weird expressions
like " -42: 0x66: 0af: 0: 44:-6", which I think is a problem.
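
The leniency described above can be checked with a short sketch. It relies only on standard `sscanf()` behaviour: `%hhx` skips leading white space and accepts an optional sign and an optional `0x` prefix. The helper name `mac_fields()` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Returns how many fields sscanf() converts from a candidate MAC string. */
static int
mac_fields(const char *str)
{
	unsigned char b[6];

	assert(str != NULL);
	return sscanf(str, "%hhx:%hhx:%hhx:%hhx:%hhx:%hhx",
		      &b[0], &b[1], &b[2], &b[3], &b[4], &b[5]);
}
```

Passing the odd string quoted above should report six converted fields, the same count as a well-formed address, so checking the return value alone cannot tell the two apart.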
Post by Stephen Hemminger
/**
* Parse 48bits Ethernet address in pattern xx:xx:xx:xx:xx:xx.
*
* @param eth_addr
* A pointer to a ether_addr structure.
* @param str
* A pointer to string contains the formatted MAC address.
* @return
* 0 if the address is valid
* -EINVAL if address is not formatted properly
*/
static inline int
ether_parse_addr(struct ether_addr *eth_addr, const char *str)
{
int n;
n = sscanf(str,
"%hhx:%hhx:%hhx:%hhx:%hhx:%hhx",
&eth_addr->addr_bytes[0],
&eth_addr->addr_bytes[1],
&eth_addr->addr_bytes[2],
&eth_addr->addr_bytes[3],
&eth_addr->addr_bytes[4],
&eth_addr->addr_bytes[5]);
return (n == ETHER_ADDR_LEN) ? 0 : -EINVAL;
}
Post by Adrien Mazarguil
+ if (!tmp)
+ return -EINVAL;
+ if (tmp & 0x40) {
+ i += (i & 1) + (!i << 1);
+ addr = (addr << (i << 2)) | buf;
+ n += i;
+ buf = 0;
+ i = 0;
+ } else {
+ buf = (buf << 4) | (tmp & 0xf);
+ ++i;
+ }
+ } while (!(tmp & 0x20));
+ if (n > 12)
+ return -EINVAL;
+ i = RTE_DIM(eth_addr->addr_bytes);
+ while (i) {
+ eth_addr->addr_bytes[--i] = addr & 0xff;
+ addr >>= 8;
+ }
+ return 0;
+}
+
--
Adrien Mazarguil
6WIND
Stephen Hemminger
2017-12-19 15:37:07 UTC
On Tue, 19 Dec 2017 11:01:55 +0100
Post by Adrien Mazarguil
Post by Stephen Hemminger
Why not use sscanf which would be safer in this case.
Right, this is indeed the obvious implementation. However, not only is the
fixed MAC-48 format inconvenient for user input (somewhat like forcing
users to enter fully expanded IPv6 addresses every time), sscanf() also
ignores leading white space and successfully parses weird expressions
like " -42: 0x66: 0af: 0: 44:-6", which I think is a problem.
There is a standard for ethernet representation, that is all you need to
accept. The only simplifications are optional leading zeros 02 vs 2
and upper and lower case a-f.

Don't overthink this. The FreeBSD version of ether_aton_r is:

struct ether_addr *
ether_aton_r(const char *a, struct ether_addr *e)
{
int i;
unsigned int o0, o1, o2, o3, o4, o5;

i = sscanf(a, "%x:%x:%x:%x:%x:%x", &o0, &o1, &o2, &o3, &o4, &o5);
if (i != 6)
return (NULL);
e->octet[0]=o0;
e->octet[1]=o1;
e->octet[2]=o2;
e->octet[3]=o3;
e->octet[4]=o4;
e->octet[5]=o5;
return (e);
}
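
For comparison, a parser that accepts only the representation described above (six colon-separated groups of one or two hex digits, either case) can also be written without sscanf(). This is a hypothetical sketch, not code from the patch or from FreeBSD; `mac_parse_strict()` is an illustrative name.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/* Hypothetical strict parser: returns 0 on success, -1 on malformed input. */
static int
mac_parse_strict(uint8_t out[6], const char *str)
{
	int i;

	assert(out != 0 && str != 0);
	for (i = 0; i < 6; ++i) {
		unsigned int val = 0;
		int digits = 0;

		/* Consume at most two hex digits for this group. */
		while (digits < 2 && isxdigit((unsigned char)*str)) {
			int c = tolower((unsigned char)*str++);

			val = (val << 4) |
			      (unsigned int)(c <= '9' ? c - '0' : c - 'a' + 10);
			++digits;
		}
		if (!digits)
			return -1;
		/* Groups are separated by ':'; the last one must end the string. */
		if (*str++ != (i < 5 ? ':' : '\0'))
			return -1;
		out[i] = (uint8_t)val;
	}
	return 0;
}
```

Unlike the sscanf() variants, this sketch rejects leading white space, signs, `0x` prefixes, three-digit groups and trailing characters.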
Ferruh Yigit
2017-12-19 01:54:45 UTC
Post by Adrien Mazarguil
As described in more details in the attached documentation (see patch
contents), this virtual device driver manages NetVSC interfaces in virtual
machines hosted by Hyper-V/Azure platforms.
This driver does not manage traffic nor Ethernet devices directly; it acts
as a thin configuration layer that automatically instantiates and controls
fail-safe PMD instances combining tap and PCI sub-devices, so that each
NetVSC interface is exposed as a single consolidated port to DPDK
applications.
PCI sub-devices being hot-pluggable (e.g. during VM migration),
applications automatically benefit from increased throughput when present
and automatic fallback on NetVSC otherwise without interruption thanks to
fail-safe's hot-plug handling.
Once initialized, the sole job of the hyperv driver is to regularly scan
for PCI devices to associate with NetVSC interfaces and feed their
addresses to corresponding fail-safe instances.
<...>
Post by Adrien Mazarguil
+ RTE_ETH_FOREACH_DEV(port_id) {
<..>
Post by Adrien Mazarguil
+ ret = rte_eal_hotplug_remove(bus->name, dev->name);
<..>
Post by Adrien Mazarguil
+ ret = rte_eal_hotplug_add("vdev", ctx->devname, ctx->devargs);
Overall, why is this logic implemented as a network PMD?
Yes, technically you can implement *anything* as a PMD :), but should we?

This code does EAL-level work (scans the bus, adds/removes devices) for the
control path, and it is not a generic solution either (it is specific to
NetVSC and failsafe).

Only the device argument part of a PMD seems to be used; the rest is unrelated
to being a PMD. It scans for NetVSC changes in the background and reflects them
into the failsafe PMD...

Why is this implemented as a PMD and not as another entity, like a bus driver
perhaps? Or indeed, why is this in DPDK instead of in the application?

<...>
Adrien Mazarguil
2017-12-19 15:06:05 UTC
Post by Ferruh Yigit
Post by Adrien Mazarguil
As described in more details in the attached documentation (see patch
contents), this virtual device driver manages NetVSC interfaces in virtual
machines hosted by Hyper-V/Azure platforms.
This driver does not manage traffic nor Ethernet devices directly; it acts
as a thin configuration layer that automatically instantiates and controls
fail-safe PMD instances combining tap and PCI sub-devices, so that each
NetVSC interface is exposed as a single consolidated port to DPDK
applications.
PCI sub-devices being hot-pluggable (e.g. during VM migration),
applications automatically benefit from increased throughput when present
and automatic fallback on NetVSC otherwise without interruption thanks to
fail-safe's hot-plug handling.
Once initialized, the sole job of the hyperv driver is to regularly scan
for PCI devices to associate with NetVSC interfaces and feed their
addresses to corresponding fail-safe instances.
<...>
Post by Adrien Mazarguil
+ RTE_ETH_FOREACH_DEV(port_id) {
<..>
Post by Adrien Mazarguil
+ ret = rte_eal_hotplug_remove(bus->name, dev->name);
<..>
Post by Adrien Mazarguil
+ ret = rte_eal_hotplug_add("vdev", ctx->devname, ctx->devargs);
Overall why this logic implemented as network PMD?
Yes technically you can implement *anything* as PMD :), but should we?
This code does eal level work (scans bus, add/remove devices), and for control
path, and not a generic solution either (specific to netvsc and failsafe).
Only device argument part of a PMD seems used, rest is unrelated to being a PMD.
Scans netvsc changes in background and reflects them into failsafe PMD...
Why this is implemented as PMD, not another entity, like bus driver perhaps?
Or indeed why this in DPDK instead of being in application?
I'll address that last question first: the point of this driver is enabling
existing applications to run within a Hyper-V environment unmodified,
because they'd otherwise need to manage two driver instances correctly on
their own in addition to hot-plug events during VM migration.

Some kind of driver generating a front end to what otherwise appears as two
distinct ethdev to applications is therefore necessary.

Currently without it, users have to manually configure failsafe properly for
each NetVSC interface on their system. Besides the inconvenience, it's not
even a possibility with DPDK applications that don't rely on EAL
command-line arguments.

As such it's more correctly defined as a "platform" driver rather than a
true PMD. It leaves VF device handling to their respective PMDs while
automatically managing the platform-specific part itself. There's no simpler
alternative when running in blacklist mode (i.e. not specifying any device
parameters on the command line).

Regarding its presence in drivers/net rather than drivers/bus, the end
result from an application standpoint is that each instance exposes a single
ethdev, even if not its own (failsafe's). Busses don't do that. It also
allows passing arguments to individual devices through --vdev if needed.

You're right about putting device detection at the bus level though, and I
think there's work in progress to do just that, this driver will be updated
to benefit from it once applied. In the meantime, the code as submitted
works fine with the current DPDK code base and addresses an existing use
case for which there is no solution at this point.
--
Adrien Mazarguil
6WIND
Ferruh Yigit
2017-12-19 20:44:35 UTC
Post by Adrien Mazarguil
Post by Ferruh Yigit
Post by Adrien Mazarguil
As described in more details in the attached documentation (see patch
contents), this virtual device driver manages NetVSC interfaces in virtual
machines hosted by Hyper-V/Azure platforms.
This driver does not manage traffic nor Ethernet devices directly; it acts
as a thin configuration layer that automatically instantiates and controls
fail-safe PMD instances combining tap and PCI sub-devices, so that each
NetVSC interface is exposed as a single consolidated port to DPDK
applications.
PCI sub-devices being hot-pluggable (e.g. during VM migration),
applications automatically benefit from increased throughput when present
and automatic fallback on NetVSC otherwise without interruption thanks to
fail-safe's hot-plug handling.
Once initialized, the sole job of the hyperv driver is to regularly scan
for PCI devices to associate with NetVSC interfaces and feed their
addresses to corresponding fail-safe instances.
<...>
Post by Adrien Mazarguil
+ RTE_ETH_FOREACH_DEV(port_id) {
<..>
Post by Adrien Mazarguil
+ ret = rte_eal_hotplug_remove(bus->name, dev->name);
<..>
Post by Adrien Mazarguil
+ ret = rte_eal_hotplug_add("vdev", ctx->devname, ctx->devargs);
Overall why this logic implemented as network PMD?
Yes technically you can implement *anything* as PMD :), but should we?
This code does eal level work (scans bus, add/remove devices), and for control
path, and not a generic solution either (specific to netvsc and failsafe).
Only device argument part of a PMD seems used, rest is unrelated to being a PMD.
Scans netvsc changes in background and reflects them into failsafe PMD...
Why this is implemented as PMD, not another entity, like bus driver perhaps?
Or indeed why this in DPDK instead of being in application?
I'll address that last question first: the point of this driver is enabling
existing applications to run within a Hyper-V environment unmodified,
because they'd otherwise need to manage two driver instances correctly on
their own in addition to hot-plug events during VM migration.
Some kind of driver generating a front end to what otherwise appears as two
distinct ethdev to applications is therefore necessary.
Currently without it, users have to manually configure failsafe properly for
each NetVSC interface on their system. Besides the inconvenience, it's not
even a possibility with DPDK applications that don't rely on EAL
command-line arguments.
As such it's more correctly defined as a "platform" driver rather than a
true PMD. It leaves VF device handling to their respective PMDs while
automatically managing the platform-specific part itself. There's no simpler
alternative when running in blacklist mode (i.e. not specifying any device
parameters on the command line).
Regarding its presence in drivers/net rather than drivers/bus, the end
result from an application standpoint is that each instance exposes a single
ethdev, even if not its own (failsafe's). Busses don't do that. It also
allows passing arguments to individual devices through --vdev if needed.
You're right about putting device detection at the bus level though, and I
think there's work in progress to do just that, this driver will be updated
to benefit from it once applied. In the meantime, the code as submitted
works fine with the current DPDK code base and addresses an existing use
case for which there is no solution at this point.
This may be working, but it looks like a hack to me.

If we need a platform driver, why not work on it properly? If we need to improve
EAL hotplug, this is a good motivation to improve it.

And if this logic needs to be in the application, let it be. Your argument is to
not change existing applications, but this logic may lead to implementing many
unrelated things as PMDs to avoid changing applications. Where is the line here?

What is the work in progress, the exact list, that will replace this solution? If
this hackish solution will prevent that real work, I am against it.
Is there a way to ensure this will be a temporary solution and that the real work
will happen?
Thomas Monjalon
2017-12-20 14:13:28 UTC
Post by Ferruh Yigit
Post by Adrien Mazarguil
Post by Ferruh Yigit
Post by Adrien Mazarguil
As described in more details in the attached documentation (see patch
contents), this virtual device driver manages NetVSC interfaces in virtual
machines hosted by Hyper-V/Azure platforms.
This driver does not manage traffic nor Ethernet devices directly; it acts
as a thin configuration layer that automatically instantiates and controls
fail-safe PMD instances combining tap and PCI sub-devices, so that each
NetVSC interface is exposed as a single consolidated port to DPDK
applications.
PCI sub-devices being hot-pluggable (e.g. during VM migration),
applications automatically benefit from increased throughput when present
and automatic fallback on NetVSC otherwise without interruption thanks to
fail-safe's hot-plug handling.
Once initialized, the sole job of the hyperv driver is to regularly scan
for PCI devices to associate with NetVSC interfaces and feed their
addresses to corresponding fail-safe instances.
<...>
Post by Adrien Mazarguil
+ RTE_ETH_FOREACH_DEV(port_id) {
<..>
Post by Adrien Mazarguil
+ ret = rte_eal_hotplug_remove(bus->name, dev->name);
<..>
Post by Adrien Mazarguil
+ ret = rte_eal_hotplug_add("vdev", ctx->devname, ctx->devargs);
Overall why this logic implemented as network PMD?
Yes technically you can implement *anything* as PMD :), but should we?
This code does eal level work (scans bus, add/remove devices), and for control
path, and not a generic solution either (specific to netvsc and failsafe).
Only device argument part of a PMD seems used, rest is unrelated to being a PMD.
Scans netvsc changes in background and reflects them into failsafe PMD...
Why this is implemented as PMD, not another entity, like bus driver perhaps?
Or indeed why this in DPDK instead of being in application?
I'll address that last question first: the point of this driver is enabling
existing applications to run within a Hyper-V environment unmodified,
because they'd otherwise need to manage two driver instances correctly on
their own in addition to hot-plug events during VM migration.
Some kind of driver generating a front end to what otherwise appears as two
distinct ethdev to applications is therefore necessary.
Currently without it, users have to manually configure failsafe properly for
each NetVSC interface on their system. Besides the inconvenience, it's not
even a possibility with DPDK applications that don't rely on EAL
command-line arguments.
As such it's more correctly defined as a "platform" driver rather than a
true PMD. It leaves VF device handling to their respective PMDs while
automatically managing the platform-specific part itself. There's no simpler
alternative when running in blacklist mode (i.e. not specifying any device
parameters on the command line).
Regarding its presence in drivers/net rather than drivers/bus, the end
result from an application standpoint is that each instance exposes a single
ethdev, even if not its own (failsafe's). Busses don't do that. It also
allows passing arguments to individual devices through --vdev if needed.
You're right about putting device detection at the bus level though, and I
think there's work in progress to do just that, this driver will be updated
to benefit from it once applied. In the meantime, the code as submitted
works fine with the current DPDK code base and addresses an existing use
case for which there is no solution at this point.
This may be working but this looks like a hack to me.
If we need a platform driver why not properly work on it. If we need to improve
eal hotplug, this is a good motivation to improve it.
I agree this code looks to be a platform driver.
It is the first one of this kind.
Usually, things are managed either in a device driver, a bus driver,
or in EAL.
I also agree that hotplug should be managed in EAL and bus drivers.
Post by Ferruh Yigit
And if this logic needs to be in application let it be, your argument is to not
change the existing application but this logic may lead implementing many
unrelated things as PMD to not change application, what is the line here.
The line is hardware management.
The application should not have to implement device-specific or
platform-specific code.
The same application should be able to work on any platform.
Post by Ferruh Yigit
What is the work in progress, exact list, that will replace this solution? If
this hackish solution will prevent that real work, I am against this solution.
Is there a way to ensure this will be a temporary solution and that real work
will happen?
I think we should explicitly mark this code as temporary, or use the
EXPERIMENTAL tag. It should motivate us to implement what is needed
to completely remove this code later.

About the work in progress:
- Once hotplug is fully supported in EAL and bus drivers,
the scan part of this platform driver should be removed.
- Once ethdev probe notifications are integrated, they may
also clean up part of this code.
- We may also think about how the future port ownership can improve
the behaviour of this driver.
- NetVSC is currently supported by the TAP PMD, but it may be
replaced by a new NetVSC PMD (a VMBUS driver has already been sent).
- We should also continue the work on the configuration file.
Such user configuration may help for platform behaviours.

In conclusion, there are a lot of improvements in progress,
and I am really happy to see Hyper-V supported in DPDK.
I think this driver must be only a step towards first-class support,
like KVM/Qemu/vhost/virtio.
As there is no API implied here, I am OK to progress step by step.
Adrien Mazarguil
2017-12-21 16:19:15 UTC
Disclaimer: I agree with Thomas's suggestions in his reply [1] to your
message, I'm replying below as well to provide more details of my own and
clarify the motivations behind this approach a bit more.
Post by Ferruh Yigit
Post by Adrien Mazarguil
Post by Ferruh Yigit
Post by Adrien Mazarguil
As described in more details in the attached documentation (see patch
contents), this virtual device driver manages NetVSC interfaces in virtual
machines hosted by Hyper-V/Azure platforms.
This driver does not manage traffic nor Ethernet devices directly; it acts
as a thin configuration layer that automatically instantiates and controls
fail-safe PMD instances combining tap and PCI sub-devices, so that each
NetVSC interface is exposed as a single consolidated port to DPDK
applications.
PCI sub-devices being hot-pluggable (e.g. during VM migration),
applications automatically benefit from increased throughput when present
and automatic fallback on NetVSC otherwise without interruption thanks to
fail-safe's hot-plug handling.
Once initialized, the sole job of the hyperv driver is to regularly scan
for PCI devices to associate with NetVSC interfaces and feed their
addresses to corresponding fail-safe instances.
<...>
Post by Adrien Mazarguil
+ RTE_ETH_FOREACH_DEV(port_id) {
<..>
Post by Adrien Mazarguil
+ ret = rte_eal_hotplug_remove(bus->name, dev->name);
<..>
Post by Adrien Mazarguil
+ ret = rte_eal_hotplug_add("vdev", ctx->devname, ctx->devargs);
Overall why this logic implemented as network PMD?
Yes technically you can implement *anything* as PMD :), but should we?
This code does eal level work (scans bus, add/remove devices), and for control
path, and not a generic solution either (specific to netvsc and failsafe).
Only device argument part of a PMD seems used, rest is unrelated to being a PMD.
Scans netvsc changes in background and reflects them into failsafe PMD...
Why this is implemented as PMD, not another entity, like bus driver perhaps?
Or indeed why this in DPDK instead of being in application?
I'll address that last question first: the point of this driver is enabling
existing applications to run within a Hyper-V environment unmodified,
because they'd otherwise need to manage two driver instances correctly on
their own in addition to hot-plug events during VM migration.
Some kind of driver generating a front end to what otherwise appears as two
distinct ethdev to applications is therefore necessary.
Currently without it, users have to manually configure failsafe properly for
each NetVSC interface on their system. Besides the inconvenience, it's not
even a possibility with DPDK applications that don't rely on EAL
command-line arguments.
As such it's more correctly defined as a "platform" driver rather than a
true PMD. It leaves VF device handling to their respective PMDs while
automatically managing the platform-specific part itself. There's no simpler
alternative when running in blacklist mode (i.e. not specifying any device
parameters on the command line).
Regarding its presence in drivers/net rather than drivers/bus, the end
result from an application standpoint is that each instance exposes a single
ethdev, even if not its own (failsafe's). Busses don't do that. It also
allows passing arguments to individual devices through --vdev if needed.
You're right about putting device detection at the bus level though, and I
think there's work in progress to do just that, this driver will be updated
to benefit from it once applied. In the meantime, the code as submitted
works fine with the current DPDK code base and addresses an existing use
case for which there is no solution at this point.
This may be working but this looks like a hack to me.
If we need a platform driver why not properly work on it. If we need to improve
eal hotplug, this is a good motivation to improve it.
Hotplug surely can be improved but I don't think that alone will be enough
for what this driver does. Here's how things are sequenced as currently
implemented:

1. DPDK application starts.

2. EAL scans for PCI devices, ethdev ports are created for relevant ones.

3. hyperv vdev scans the system for appropriate NetVSC netdevices,
instantiates failsafe PMD accordingly to create ethdev ports for each of
them.

At this stage, rte_eal_hotplug_remove() is also called on physical
devices found in 2. that will be given to failsafe (see 4.), since
they're not supposed to be seen or owned by the application (keep in mind
this happens on Hyper-V platforms only).

4. From this point on, application can use the remaining ports normally.

5. A PCI device gets plugged in, kernel recognizes it and creates a
netdevice for it.

6. hyperv's timer callback detects the new netdevice, if its properties
match NetVSC's then it proceeds to tell failsafe its location.

7. failsafe probes the given address on the appropriate bus to instantiate
another hidden ethdev out of it and primarily uses that device for TX
until it gets unplugged. Meanwhile, RX is still performed on both
underlying devices.

Let's now assume hot-plug is perfectly implemented in DPDK along with
Gaetan's netdevice bus [2] (or equivalent) with hotplug properties as well:

1. DPDK application starts.

2. EAL scans for PCI devices, ethdev ports are created for relevant ones.

3. EAL scans for net_bus devices, ethdev ports are created for relevant
ones.

4. The piece of code formerly known as the hyperv driver looks at detected
net_bus devices, finds relevant ones with NetVSC properties and promptly
kicks them out through rte_eal_hotplug_remove() (or equivalent) so that
the application doesn't get a chance to "see" them.

It then instantiates fail-safe PMD like before, with fail-safe
re-discovering devices as its own.

5. From this point on, application can use the remaining ports normally.

6. A PCI device gets plugged in, kernel recognizes it and creates a
netdevice for it.

7. EAL's net_bus hotplug handler kicks in, automatically creates a new
ethdev port out of it (note: device properties such as MAC addresses are
not known before the associated PMD is initialized and an ethdev
created).

8. The piece of code formerly known as the hyperv driver that happens to
also be listening for hotplug events sees that new ethdev port; if its
properties match NetVSC's then it proceeds to hide it before telling
failsafe its location.

9. failsafe probes the given address on the appropriate bus to instantiate
another hidden ethdev out of it and primarily uses that device for TX
until it gets unplugged. Meanwhile, RX is still performed on both
underlying devices.

Hotplug basically removes the timer callback and some of the probing code.
I agree it's perfectly fine to update this PMD once hotplug is implemented
that way. Now what about the rest?

Without a driver there's no way to orchestrate all the above. A separate
layer between applications and PMDs is necessary for that; the handover of
ethdev ports to failsafe is mandatory.
Post by Ferruh Yigit
And if this logic needs to be in application let it be, your argument is to not
change the existing application but this logic may lead implementing many
unrelated things as PMD to not change application, what is the line here.
Well, for this particular case I don't think many applications want to
retrieve multicast and some other traffic out of one ethdev and the rest
from another only when the latter is present. This complexity must be
handled by the framework, not by applications, which ideally are not
supposed to know much about the environment they're running in.

For this reason, even a specific API is out of the question.
Post by Ferruh Yigit
What is the work in progress, exact list, that will replace this solution? If
this hackish solution will prevent that real work, I am against this solution.
Is there a way to ensure this will be a temporary solution and that real work
will happen?
I think Thomas answers this question [1], I'll just add that the current
approach was developed and submitted in a way that doesn't have any impact
on public APIs precisely to avoid conflicts with other work on EAL in the
meantime.

If the hotplug subsystem evolves, this driver will catch up, particularly
since it's small and shouldn't be too complex to adapt. I volunteer for that
work once APIs are ready in any case; failing that, the experimental tag
(I'll add it for v2) means its pure and simple removal.

I'd like your opinion on the current approach to determine the next steps:

- Do you agree that hotplug and platform-related functionality are two
separate problems, and that the approach used to implement the former
doesn't address the latter?

- What do you think about implementing the latter in DPDK as a kind of
platform driver, so that applications don't need to be modified?

- If you had to choose between drivers/bus and drivers/net for it, which
would it be? (Keep in mind that the ability to provide per-device options
would be great.)

[1] http://dpdk.org/ml/archives/dev/2017-December/084558.html
[2] http://dpdk.org/ml/archives/dev/2017-June/067546.html
--
Adrien Mazarguil
6WIND
Adrien Mazarguil
2017-12-18 16:46:25 UTC
This parameter allows specifying any non-NetVSC interface to use with tap
sub-devices for development purposes.
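
The decision this parameter introduces can be summarized by the simplified
sketch below. It is not the driver code: struct kv_pair is a hypothetical
stand-in for a parsed device argument, and accept_iface() leaves out the
name and MAC matching performed by the real probe function.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical key/value pair standing in for one parsed device argument. */
struct kv_pair {
	const char *key;
	const char *value;
};

/*
 * Scan the arguments once: "force" is normalized to 0 or 1, while each
 * "iface" or "mac" entry counts as one explicitly specified netdevice.
 */
static void
scan_args(const struct kv_pair *pairs, unsigned int count,
	  int *force, unsigned int *specified)
{
	unsigned int i;

	assert(force != NULL && specified != NULL);
	*force = 0;
	*specified = 0;
	for (i = 0; i != count; ++i) {
		if (!strcmp(pairs[i].key, "force"))
			*force = !!atoi(pairs[i].value);
		else if (!strcmp(pairs[i].key, "iface") ||
			 !strcmp(pairs[i].key, "mac"))
			++*specified;
	}
}

/*
 * NetVSC interfaces are always accepted. Others are accepted only when
 * interfaces were explicitly specified and "force" is nonzero.
 */
static int
accept_iface(int is_netvsc, unsigned int specified, int force)
{
	if (is_netvsc)
		return 1;
	return specified && force;
}
```

With iface=eth1,force=1 a non-NetVSC interface is accepted; with force=1
alone it is still skipped since nothing was specified.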

Signed-off-by: Adrien Mazarguil <***@6wind.com>
---
doc/guides/nics/hyperv.rst | 5 +++++
drivers/net/hyperv/hyperv.c | 26 +++++++++++++++++++-------
2 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/doc/guides/nics/hyperv.rst b/doc/guides/nics/hyperv.rst
index 8f7a8b153..9b5220919 100644
--- a/doc/guides/nics/hyperv.rst
+++ b/doc/guides/nics/hyperv.rst
@@ -110,5 +110,10 @@ The following device parameters are supported:
Same as ``iface`` except a suitable NetVSC interface is located using its
MAC address.

+- ``force`` [int]
+
+ If nonzero, forces the use of specified interfaces even if not detected as
+ NetVSC.
+
Not specifying either ``iface`` or ``mac`` makes this PMD attach itself to
all NetVSC interfaces found on the system.
diff --git a/drivers/net/hyperv/hyperv.c b/drivers/net/hyperv/hyperv.c
index bad224be9..d9d9bbcd5 100644
--- a/drivers/net/hyperv/hyperv.c
+++ b/drivers/net/hyperv/hyperv.c
@@ -62,6 +62,7 @@
#define HYPERV_DRIVER net_hyperv
#define HYPERV_ARG_IFACE "iface"
#define HYPERV_ARG_MAC "mac"
+#define HYPERV_ARG_FORCE "force"
#define HYPERV_PROBE_MS 1000

#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"
@@ -504,6 +505,9 @@ hyperv_alarm(void *arg)
* - struct rte_kvargs *kvargs:
* Device arguments provided to current driver instance.
*
+ * - int force:
+ * Accept specified interface even if not detected as NetVSC.
+ *
* - unsigned int specified:
* Number of specific netdevices provided as device arguments.
*
@@ -521,6 +525,7 @@ hyperv_netvsc_probe(const struct if_nameindex *iface,
{
const char *name = va_arg(ap, const char *);
struct rte_kvargs *kvargs = va_arg(ap, struct rte_kvargs *);
+ int force = va_arg(ap, int);
unsigned int specified = va_arg(ap, unsigned int);
unsigned int *matched = va_arg(ap, unsigned int *);
unsigned int i;
@@ -567,9 +572,11 @@ hyperv_netvsc_probe(const struct if_nameindex *iface,
if (!hyperv_iface_is_netvsc(iface)) {
if (!specified)
return 0;
- WARN("interface \"%s\" (index %u) is not NetVSC, skipping",
- iface->if_name, iface->if_index);
- return 0;
+ WARN("interface \"%s\" (index %u) is not NetVSC, %s",
+ iface->if_name, iface->if_index,
+ force ? "using anyway (forced)" : "skipping");
+ if (!force)
+ return 0;
}
/* Create interface context. */
ctx = calloc(1, sizeof(*ctx));
@@ -700,6 +707,7 @@ hyperv_vdev_probe(struct rte_vdev_device *dev)
static const char *const hyperv_arg[] = {
HYPERV_ARG_IFACE,
HYPERV_ARG_MAC,
+ HYPERV_ARG_FORCE,
NULL,
};
const char *name = rte_vdev_device_name(dev);
@@ -708,6 +716,7 @@ hyperv_vdev_probe(struct rte_vdev_device *dev)
hyperv_arg);
unsigned int specified = 0;
unsigned int matched = 0;
+ int force = 0;
unsigned int i;
int ret;

@@ -719,13 +728,15 @@ hyperv_vdev_probe(struct rte_vdev_device *dev)
for (i = 0; i != kvargs->count; ++i) {
const struct rte_kvargs_pair *pair = &kvargs->pairs[i];

- if (!strcmp(pair->key, HYPERV_ARG_IFACE) ||
- !strcmp(pair->key, HYPERV_ARG_MAC))
+ if (!strcmp(pair->key, HYPERV_ARG_FORCE))
+ force = !!atoi(pair->value);
+ else if (!strcmp(pair->key, HYPERV_ARG_IFACE) ||
+ !strcmp(pair->key, HYPERV_ARG_MAC))
++specified;
}
rte_eal_alarm_cancel(hyperv_alarm, NULL);
/* Gather interfaces. */
- ret = hyperv_foreach_iface(hyperv_netvsc_probe, name, kvargs,
+ ret = hyperv_foreach_iface(hyperv_netvsc_probe, name, kvargs, force,
specified, &matched);
if (ret < 0)
goto error;
@@ -784,4 +795,5 @@ RTE_PMD_REGISTER_VDEV(HYPERV_DRIVER, hyperv_vdev);
RTE_PMD_REGISTER_ALIAS(HYPERV_DRIVER, eth_hyperv);
RTE_PMD_REGISTER_PARAM_STRING(net_hyperv,
HYPERV_ARG_IFACE "=<string> "
- HYPERV_ARG_MAC "=<string>");
+ HYPERV_ARG_MAC "=<string> "
+ HYPERV_ARG_FORCE "=<int>");
--
2.11.0
Stephen Hemminger
2017-12-18 18:23:04 UTC
On Mon, 18 Dec 2017 17:46:19 +0100
Post by Adrien Mazarguil
Virtual machines hosted by Hyper-V/Azure platforms are fitted with
simplified virtual network devices named NetVSC that are used for fast
communication between VM to VM, VM to hypervisor, and the outside.
They appear as standard system netdevices to user-land applications, the
main difference being they are implemented on top of VMBUS [1] instead of
emulated PCI devices.
While this reads like a case for a standard DPDK PMD, there is more to it.
To accelerate outside communication, NetVSC devices as they appear in a VM
can be paired with physical SR-IOV virtual function (VF) devices owned by
that same VM [2]. Both netdevices share the same MAC address in that case.
When paired, egress and most of the ingress traffic flow through the VF
device, while part of it (e.g. multicasts, hypervisor control data) still
flows through NetVSC. Moreover VF devices are not retained and disappear
during VM migration; from a VM standpoint, they can be hot-plugged anytime
with NetVSC acting as a fallback.
Running DPDK applications in such a context involves driving VF devices
using their dedicated PMDs in a vendor-independent fashion (to benefit from
maximum performance without writing dedicated code) while simultaneously
listening to NetVSC and handling the related hot-plug events.
This new virtual PMD (referred to as "hyperv" from this point on)
automatically coordinates the Hyper-V/Azure-specific management part
described above by relying on vendor-specific, failsafe and tap PMDs to
expose a single consolidated Ethernet device usable directly by existing
applications.
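          .------------------.
          | DPDK application |
          `--------+---------'
                   |
            .------+------.
            | DPDK ethdev |
            `------+------'       Control
                   |                 |
      .------------+------------.    v    .------------.
      |       failsafe PMD      +---------+ hyperv PMD |
      `--+-------------------+--'         `------------'
         |                   |
         |          .........|.........
         |          :        |        :
    .----+----.     :   .----+----.   :
    | tap PMD |     :   | any PMD |   :
    `----+----'     :   `----+----'   : <-- Hot-pluggable
         |          :        |        :
  .------+-------.  :  .-----+-----.  :
  | NetVSC-based |  :  | SR-IOV VF |  :
  |   netdevice  |  :  |   device  |  :
  `--------------'  :  `-----------'  :
                    :.................: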
Note this diagram differs from that of the original RFC [3], with hyperv no
longer acting as a data plane layer.
This initial version of the driver only works in whitelist mode. Users have
to provide the --vdev net_hyperv EAL option at least once to trigger it.
Subsequent work will add support for blacklist mode based on automatic
detection of the host environment.
[1] http://dpdk.org/ml/archives/dev/2017-January/054165.html
[2] https://docs.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-hyper-v
[3] http://dpdk.org/ml/archives/dev/2017-November/082339.html
net/hyperv: introduce MS Hyper-V platform driver
net/hyperv: implement core functionality
net/hyperv: add "force" parameter
MAINTAINERS | 6 +
config/common_base | 6 +
config/common_linuxapp | 1 +
doc/guides/nics/features/hyperv.ini | 12 +
doc/guides/nics/hyperv.rst | 119 +++
doc/guides/nics/index.rst | 1 +
drivers/net/Makefile | 1 +
drivers/net/hyperv/Makefile | 58 ++
drivers/net/hyperv/hyperv.c | 799 +++++++++++++++++++++
drivers/net/hyperv/rte_pmd_hyperv_version.map | 4 +
mk/rte.app.mk | 1 +
11 files changed, 1008 insertions(+)
create mode 100644 doc/guides/nics/features/hyperv.ini
create mode 100644 doc/guides/nics/hyperv.rst
create mode 100644 drivers/net/hyperv/Makefile
create mode 100644 drivers/net/hyperv/hyperv.c
create mode 100644 drivers/net/hyperv/rte_pmd_hyperv_version.map
Please don't call this drivers/net/hyperv/
that name conflicts with the real netvsc PMD that I am working on.

Maybe vdev-netvsc?
Thomas Monjalon
2017-12-18 20:13:12 UTC
Post by Stephen Hemminger
Please don't call this drivers/net/hyperv/
that name conflicts with the real netvsc PMD that I am working on.
Maybe vdev-netvsc?
I expect your PMD to be in drivers/net/netvsc/
Why is it conflicting with drivers/net/hyperv/ ?
Stephen Hemminger
2017-12-19 00:40:01 UTC
Post by Thomas Monjalon
Post by Stephen Hemminger
Please don't call this drivers/net/hyperv/
that name conflicts with the real netvsc PMD that I am working on.
Maybe vdev-netvsc?
I expect your PMD to be in drivers/net/netvsc/
Why is it conflicting with drivers/net/hyperv/ ?
The naming is a bit confusing, and I am willing to change it since it is not
upstream yet.
The code mostly uses the BSD driver, which doesn't call itself netvsc.
Instead, the BSD driver uses hn_ as a prefix for most visible data and
functions.
I have been trying to name it netvsc to avoid confusion with the kernel driver.

Like any name it is completely irrelevant to functionality.
Adrien Mazarguil
2017-12-18 20:21:23 UTC
Post by Stephen Hemminger
Please don't call this drivers/net/hyperv/
that name conflicts with the real netvsc PMD that I am working on.
Maybe vdev-netvsc?
No problem with that, if vdev-netvsc is good for you, I can update it in v2
if needed.

I'm just curious, I was under the impression both drivers would remain kind
of complementary pending various API updates, in which case wouldn't it make
sense to use "netvsc" as the better name for the NetVSC PMD? ("hyperv" being
more a use case than a true PMD)

Otherwise I also don't mind overwriting the current "hyperv" PMD code base
with yours as soon as it's ready, this will most likely make it redundant
anyway.
--
Adrien Mazarguil
6WIND
Adrien Mazarguil
2017-12-22 18:01:28 UTC
Virtual machines hosted by Hyper-V/Azure platforms are fitted with
simplified virtual network devices named NetVSC that are used for fast
communication between VM to VM, VM to hypervisor, and the outside.

They appear as standard system netdevices to user-land applications, the
main difference being they are implemented on top of VMBUS [1] instead of
emulated PCI devices.

While this reads like a case for a standard DPDK PMD, there is more to it.

To accelerate outside communication, NetVSC devices as they appear in a VM
can be paired with physical SR-IOV virtual function (VF) devices owned by
that same VM [2]. Both netdevices share the same MAC address in that case.

When paired, egress and most of the ingress traffic flow through the VF
device, while part of it (e.g. multicasts, hypervisor control data) still
flows through NetVSC. Moreover VF devices are not retained and disappear
during VM migration; from a VM standpoint, they can be hot-plugged anytime
with NetVSC acting as a fallback.

Running DPDK applications in such a context involves driving VF devices
using their dedicated PMDs in a vendor-independent fashion (to benefit from
maximum performance without writing dedicated code) while simultaneously
listening to NetVSC and handling the related hot-plug events.

This new virtual PMD (referred to as "vdev_netvsc" from this point on)
automatically coordinates the Hyper-V/Azure-specific management part
described above by relying on vendor-specific, failsafe and tap PMDs to
expose a single consolidated Ethernet device usable directly by existing
applications.

          .------------------.
          | DPDK application |
          `--------+---------'
                   |
            .------+------.
            | DPDK ethdev |
            `------+------'       Control
                   |                 |
      .------------+------------.    v    .-----------------.
      |       failsafe PMD      +---------+ vdev_netvsc PMD |
      `--+-------------------+--'         `-----------------'
         |                   |
         |          .........|.........
         |          :        |        :
    .----+----.     :   .----+----.   :
    | tap PMD |     :   | any PMD |   :
    `----+----'     :   `----+----'   : <-- Hot-pluggable
         |          :        |        :
  .------+-------.  :  .-----+-----.  :
  | NetVSC-based |  :  | SR-IOV VF |  :
  |   netdevice  |  :  |   device  |  :
  `--------------'  :  `-----------'  :
                    :.................:

Note this diagram differs from that of the original RFC [3], with
vdev_netvsc no longer acting as a data plane layer.

This initial version of the driver only works in whitelist mode. Users have
to provide the --vdev net_vdev_netvsc EAL option at least once to trigger
it.

Subsequent work will add support for blacklist mode based on automatic
detection of the host environment.

[1] http://dpdk.org/ml/archives/dev/2017-January/054165.html
[2] https://docs.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-hyper-v
[3] http://dpdk.org/ml/archives/dev/2017-November/082339.html

v2 changes:

- Renamed driver from "hyperv" to "vdev_netvsc". This change covers
documentation and symbols prefix.
- Driver is now tagged EXPERIMENTAL.
- Replaced ether_addr_from_str() with a basic sscanf() call.
- Removed debugging code (memset() poisoning).
- Fixed hyperv_iface_is_netvsc()'s buffer allocation according to comments.
- Removed hyperv_basename().
- Discarded unused variables through __rte_unused.
- Added separate but necessary free() bugfix for failsafe PMD.
- Added file descriptor input support to failsafe PMD.
- Replaced temporary bash execution; failsafe now reads device definitions
directly through a pipe without an intermediate bash one-liner.
- Expanded DEBUG/INFO/WARN/ERROR() macros as PMD_DRV_LOG().
- Added dynamic log type (pmd.vdev_netvsc).
- Modified initialization code to probe devices immediately during startup.
- Fixed several snprintf() return value checks ("ret >= sizeof(foo)" is more
appropriate than "ret >= sizeof(foo) - 1").
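
To illustrate the last item: snprintf() returns the length the complete
string would have had, excluding the terminating NUL byte, so truncation
occurred exactly when that value is greater than or equal to the buffer
size. The helper below is a minimal sketch with a hypothetical name
(format_name()), not code taken from the driver.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format "<prefix><index>" into buf.
 * Returns 0 on success, -1 on output error or truncation.
 */
static int
format_name(char *buf, size_t size, const char *prefix, unsigned int index)
{
	int ret;

	assert(buf != NULL && prefix != NULL);
	ret = snprintf(buf, size, "%s%u", prefix, index);
	/* ret == size - 1 means the string fits exactly, NUL included. */
	if (ret < 0 || (size_t)ret >= size)
		return -1;
	return 0;
}
```

Comparing against "size - 1" instead would reject a string that fills the
buffer exactly, even though nothing was truncated.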

Adrien Mazarguil (5):
net/failsafe: fix invalid free
net/failsafe: add "fd" parameter
net/vdev_netvsc: introduce Hyper-V platform driver
net/vdev_netvsc: implement core functionality
net/vdev_netvsc: add "force" parameter

MAINTAINERS | 6 +
config/common_base | 5 +
config/common_linuxapp | 1 +
doc/guides/nics/fail_safe.rst | 9 +
doc/guides/nics/features/vdev_netvsc.ini | 12 +
doc/guides/nics/index.rst | 1 +
doc/guides/nics/vdev_netvsc.rst | 116 +++
drivers/net/Makefile | 1 +
drivers/net/failsafe/failsafe_args.c | 88 ++-
drivers/net/failsafe/failsafe_private.h | 3 +
drivers/net/vdev_netvsc/Makefile | 58 ++
.../vdev_netvsc/rte_pmd_vdev_netvsc_version.map | 4 +
drivers/net/vdev_netvsc/vdev_netvsc.c | 722 +++++++++++++++++++
mk/rte.app.mk | 1 +
14 files changed, 1025 insertions(+), 2 deletions(-)
create mode 100644 doc/guides/nics/features/vdev_netvsc.ini
create mode 100644 doc/guides/nics/vdev_netvsc.rst
create mode 100644 drivers/net/vdev_netvsc/Makefile
create mode 100644 drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map
create mode 100644 drivers/net/vdev_netvsc/vdev_netvsc.c
--
2.11.0
Adrien Mazarguil
2017-12-22 18:01:30 UTC
rte_free() is not supposed to work with pointers returned by calloc().

Fixes: a0194d828100 ("net/failsafe: add flexible device definition")
Cc: ***@dpdk.org
Cc: Gaetan Rivet <***@6wind.com>

Signed-off-by: Adrien Mazarguil <***@6wind.com>
---
drivers/net/failsafe/failsafe_args.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/failsafe/failsafe_args.c b/drivers/net/failsafe/failsafe_args.c
index cfc83e365..ec63ac972 100644
--- a/drivers/net/failsafe/failsafe_args.c
+++ b/drivers/net/failsafe/failsafe_args.c
@@ -407,7 +407,7 @@ failsafe_args_free(struct rte_eth_dev *dev)
uint8_t i;

FOREACH_SUBDEV(sdev, i, dev) {
- rte_free(sdev->cmdline);
+ free(sdev->cmdline);
sdev->cmdline = NULL;
free(sdev->devargs.args);
sdev->devargs.args = NULL;
--
2.11.0
Adrien Mazarguil
2017-12-22 18:01:33 UTC
This parameter enables applications to provide device definitions through
an arbitrary file descriptor number.

Signed-off-by: Adrien Mazarguil <***@6wind.com>
Cc: Gaetan Rivet <***@6wind.com>
---
doc/guides/nics/fail_safe.rst | 9 +++
drivers/net/failsafe/failsafe_args.c | 86 +++++++++++++++++++++++++++-
drivers/net/failsafe/failsafe_private.h | 3 +
3 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/doc/guides/nics/fail_safe.rst b/doc/guides/nics/fail_safe.rst
index c4e3d2e8d..5b1b47e56 100644
--- a/doc/guides/nics/fail_safe.rst
+++ b/doc/guides/nics/fail_safe.rst
@@ -106,6 +106,15 @@ Fail-safe command line parameters
All commas within the ``shell command`` are replaced by spaces before
executing the command. This helps using scripts to specify devices.

+- **fd(<file descriptor number>)** parameter
+
+ This parameter reads a device definition from an arbitrary file descriptor
+ number in ``<iface>`` format as described above.
+
+ The file descriptor is read in non-blocking mode and is never closed in
+ order to take only the last line into account (unlike ``exec()``) at every
+ probe attempt.
+
- **mac** parameter [MAC address]

This parameter allows the user to set a default MAC address to the fail-safe
diff --git a/drivers/net/failsafe/failsafe_args.c b/drivers/net/failsafe/failsafe_args.c
index ec63ac972..7a8605174 100644
--- a/drivers/net/failsafe/failsafe_args.c
+++ b/drivers/net/failsafe/failsafe_args.c
@@ -31,7 +31,11 @@
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
#include <string.h>
+#include <unistd.h>
#include <errno.h>

#include <rte_debug.h>
@@ -161,6 +165,73 @@ fs_execute_cmd(struct sub_device *sdev, char *cmdline)
}

static int
+fs_read_fd(struct sub_device *sdev, char *fd_str)
+{
+ FILE *fp = NULL;
+ int fd = -1;
+ /* store possible newline as well */
+ char output[DEVARGS_MAXLEN + 1];
+ int err = -ENODEV;
+ int ret;
+
+ RTE_ASSERT(fd_str != NULL || sdev->fd_str != NULL);
+ if (sdev->fd_str == NULL) {
+ sdev->fd_str = strdup(fd_str);
+ if (sdev->fd_str == NULL) {
+ ERROR("Command line allocation failed");
+ return -ENOMEM;
+ }
+ }
+ errno = 0;
+ fd = strtol(fd_str, &fd_str, 0);
+ if (errno || *fd_str || fd < 0) {
+ ERROR("Parsing FD number failed");
+ goto error;
+ }
+ /* Fiddle with copy of file descriptor */
+ fd = dup(fd);
+ if (fd == -1)
+ goto error;
+ ret = fcntl(fd, F_GETFL);
+ if (ret == -1)
+ goto error;
+ ret = fcntl(fd, F_SETFL, ret | O_NONBLOCK);
+ if (ret == -1)
+ goto error;
+ fp = fdopen(fd, "r");
+ if (!fp)
+ goto error;
+ fd = -1;
+ /* Only take the last line into account */
+ ret = 0;
+ while (fgets(output, sizeof(output), fp))
+ ++ret;
+ if (feof(fp)) {
+ if (!ret)
+ goto error;
+ } else if (ferror(fp)) {
+ if (errno != EAGAIN || !ret)
+ goto error;
+ } else if (!ret) {
+ goto error;
+ }
+ /* Line must end with a newline character */
+ fs_sanitize_cmdline(output);
+ if (output[0] == '\0')
+ goto error;
+ ret = fs_parse_device(sdev, output);
+ if (ret)
+ ERROR("Parsing device '%s' failed", output);
+ err = ret;
+error:
+ if (fp)
+ fclose(fp);
+ if (fd != -1)
+ close(fd);
+ return err;
+}
+
+static int
fs_parse_device_param(struct rte_eth_dev *dev, const char *param,
uint8_t head)
{
@@ -202,6 +273,14 @@ fs_parse_device_param(struct rte_eth_dev *dev, const char *param,
}
if (ret)
goto free_args;
+ } else if (strncmp(param, "fd", 2) == 0) {
+ ret = fs_read_fd(sdev, args);
+ if (ret == -ENODEV) {
+ DEBUG("Reading device info from FD failed");
+ ret = 0;
+ }
+ if (ret)
+ goto free_args;
} else {
ERROR("Unrecognized device type: %.*s", (int)b, param);
return -EINVAL;
@@ -409,6 +488,8 @@ failsafe_args_free(struct rte_eth_dev *dev)
FOREACH_SUBDEV(sdev, i, dev) {
free(sdev->cmdline);
sdev->cmdline = NULL;
+ free(sdev->fd_str);
+ sdev->fd_str = NULL;
free(sdev->devargs.args);
sdev->devargs.args = NULL;
}
@@ -424,7 +505,8 @@ fs_count_device(struct rte_eth_dev *dev, const char *param,
param[b] != '\0')
b++;
if (strncmp(param, "dev", b) != 0 &&
- strncmp(param, "exec", b) != 0) {
+ strncmp(param, "exec", b) != 0 &&
+ strncmp(param, "fd", b) != 0) {
ERROR("Unrecognized device type: %.*s", (int)b, param);
return -EINVAL;
}
@@ -463,6 +545,8 @@ failsafe_args_parse_subs(struct rte_eth_dev *dev)
continue;
if (sdev->cmdline)
ret = fs_execute_cmd(sdev, sdev->cmdline);
+ else if (sdev->fd_str)
+ ret = fs_read_fd(sdev, sdev->fd_str);
else
ret = fs_parse_sub_device(sdev);
if (ret == 0)
diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
index d81cc3ca6..a0d36751f 100644
--- a/drivers/net/failsafe/failsafe_private.h
+++ b/drivers/net/failsafe/failsafe_private.h
@@ -48,6 +48,7 @@
#define PMD_FAILSAFE_PARAM_STRING \
"dev(<ifc>)," \
"exec(<shell command>)," \
+ "fd(<fd number>)," \
"mac=mac_addr," \
"hotplug_poll=u64" \
""
@@ -111,6 +112,8 @@ struct sub_device {
struct fs_stats stats_snapshot;
/* Some device are defined as a command line */
char *cmdline;
+ /* Others are retrieved through a file descriptor */
+ char *fd_str;
/* fail-safe device backreference */
struct rte_eth_dev *fs_dev;
/* flag calling for recollection */
--
2.11.0
Adrien Mazarguil
2017-12-22 18:01:35 UTC
This patch lays the groundwork for this driver (draft documentation,
copyright notices, code base skeleton and build system hooks). While it can
be successfully compiled and invoked, it's an empty shell at this stage.

Signed-off-by: Adrien Mazarguil <***@6wind.com>
---
MAINTAINERS | 6 +
config/common_base | 5 +
config/common_linuxapp | 1 +
doc/guides/nics/features/vdev_netvsc.ini | 12 ++
doc/guides/nics/index.rst | 1 +
doc/guides/nics/vdev_netvsc.rst | 46 +++++++
drivers/net/Makefile | 1 +
drivers/net/vdev_netvsc/Makefile | 54 ++++++++
.../vdev_netvsc/rte_pmd_vdev_netvsc_version.map | 4 +
drivers/net/vdev_netvsc/vdev_netvsc.c | 132 +++++++++++++++++++
mk/rte.app.mk | 1 +
11 files changed, 263 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 5a63b40c2..2b61c93aa 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -451,6 +451,12 @@ F: drivers/net/mrvl/
F: doc/guides/nics/mrvl.rst
F: doc/guides/nics/features/mrvl.ini

+Microsoft vdev_netvsc - EXPERIMENTAL
+M: Adrien Mazarguil <***@6wind.com>
+F: drivers/net/vdev_netvsc/
+F: doc/guides/nics/vdev_netvsc.rst
+F: doc/guides/nics/features/vdev_netvsc.ini
+
Netcope szedata2
M: Matej Vido <***@cesnet.cz>
F: drivers/net/szedata2/
diff --git a/config/common_base b/config/common_base
index b8ee8f91c..ef904dfd5 100644
--- a/config/common_base
+++ b/config/common_base
@@ -280,6 +280,11 @@ CONFIG_RTE_LIBRTE_NFP_DEBUG=n
CONFIG_RTE_LIBRTE_MRVL_PMD=n

#
+# Compile virtual device driver for NetVSC on Hyper-V/Azure
+#
+CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD=n
+
+#
# Compile burst-oriented Broadcom BNXT PMD driver
#
CONFIG_RTE_LIBRTE_BNXT_PMD=y
diff --git a/config/common_linuxapp b/config/common_linuxapp
index 74c7d64ec..e04326224 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -47,6 +47,7 @@ CONFIG_RTE_LIBRTE_PMD_VHOST=y
CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
CONFIG_RTE_LIBRTE_PMD_TAP=y
CONFIG_RTE_LIBRTE_AVP_PMD=y
+CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD=y
CONFIG_RTE_LIBRTE_NFP_PMD=y
CONFIG_RTE_LIBRTE_POWER=y
CONFIG_RTE_VIRTIO_USER=y
diff --git a/doc/guides/nics/features/vdev_netvsc.ini b/doc/guides/nics/features/vdev_netvsc.ini
new file mode 100644
index 000000000..cfc5cb93e
--- /dev/null
+++ b/doc/guides/nics/features/vdev_netvsc.ini
@@ -0,0 +1,12 @@
+;
+; Supported features of the 'vdev_netvsc' network poll mode driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+ARMv7 = Y
+ARMv8 = Y
+Power8 = Y
+x86-32 = Y
+x86-64 = Y
+Usage doc = Y
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index 23babe933..566604671 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -64,6 +64,7 @@ Network Interface Controller Drivers
szedata2
tap
thunderx
+ vdev_netvsc
virtio
vhost
vmxnet3
diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
new file mode 100644
index 000000000..be31b6597
--- /dev/null
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -0,0 +1,46 @@
+.. BSD LICENSE
+ Copyright 2017 6WIND S.A.
+ Copyright 2017 Mellanox
+
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted provided that the following conditions
+ are met:
+
+ * Redistributions of source code must retain the above copyright
+ notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+ notice, this list of conditions and the following disclaimer in
+ the documentation and/or other materials provided with the
+ distribution.
+ * Neither the name of 6WIND S.A. nor the names of its
+ contributors may be used to endorse or promote products derived
+ from this software without specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+VDEV_NETVSC poll mode driver
+============================
+
+The VDEV_NETVSC PMD (librte_pmd_vdev_netvsc) provides support for NetVSC
+interfaces and associated SR-IOV virtual function (VF) devices found in
+Linux virtual machines running on Microsoft Hyper-V_ (including Azure)
+platforms.
+
+.. _Hyper-V: https://docs.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-hyper-v
+
+Build options
+-------------
+
+- ``CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD`` (default ``y``)
+
+ Toggle compilation of this driver.
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index ef09b4e16..dc41ed11e 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -66,6 +66,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_SFC_EFX_PMD) += sfc
DIRS-$(CONFIG_RTE_LIBRTE_PMD_SZEDATA2) += szedata2
DIRS-$(CONFIG_RTE_LIBRTE_PMD_TAP) += tap
DIRS-$(CONFIG_RTE_LIBRTE_THUNDERX_NICVF_PMD) += thunderx
+DIRS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += vdev_netvsc
DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio
DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += vmxnet3

diff --git a/drivers/net/vdev_netvsc/Makefile b/drivers/net/vdev_netvsc/Makefile
new file mode 100644
index 000000000..e53050fe1
--- /dev/null
+++ b/drivers/net/vdev_netvsc/Makefile
@@ -0,0 +1,54 @@
+# BSD LICENSE
+#
+# Copyright 2017 6WIND S.A.
+# Copyright 2017 Mellanox
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in
+# the documentation and/or other materials provided with the
+# distribution.
+# * Neither the name of 6WIND S.A. nor the names of its
+# contributors may be used to endorse or promote products derived
+# from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+# Properties of the generated library.
+LIB = librte_pmd_vdev_netvsc.a
+LIBABIVER := 1
+EXPORT_MAP := rte_pmd_vdev_netvsc_version.map
+
+# Additional compilation flags.
+CFLAGS += -O3
+CFLAGS += -g
+CFLAGS += -std=c11 -pedantic -Wall -Wextra
+CFLAGS += $(WERROR_FLAGS)
+
+# Dependencies.
+LDLIBS += -lrte_bus_vdev
+LDLIBS += -lrte_eal
+LDLIBS += -lrte_ethdev
+LDLIBS += -lrte_kvargs
+
+# Source files.
+SRCS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += vdev_netvsc.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map b/drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map
new file mode 100644
index 000000000..179140fb8
--- /dev/null
+++ b/drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map
@@ -0,0 +1,4 @@
+DPDK_18.02 {
+
+ local: *;
+};
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
new file mode 100644
index 000000000..3b73482da
--- /dev/null
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -0,0 +1,132 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright 2017 6WIND S.A.
+ * Copyright 2017 Mellanox
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of 6WIND S.A. nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stddef.h>
+
+#include <rte_bus_vdev.h>
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_kvargs.h>
+#include <rte_log.h>
+
+#define VDEV_NETVSC_DRIVER net_vdev_netvsc
+#define VDEV_NETVSC_ARG_IFACE "iface"
+#define VDEV_NETVSC_ARG_MAC "mac"
+
+#define PMD_DRV_LOG(level, ...) \
+ rte_log(RTE_LOG_ ## level, \
+ vdev_netvsc_logtype, \
+ RTE_FMT(RTE_STR(VDEV_NETVSC_DRIVER) ": " \
+ RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
+ RTE_FMT_TAIL(__VA_ARGS__,)))
+
+/** Driver-specific log messages type. */
+static int vdev_netvsc_logtype;
+
+/** Number of PMD instances relying on context list. */
+static unsigned int vdev_netvsc_ctx_inst;
+
+/**
+ * Probe NetVSC interfaces.
+ *
+ * @param dev
+ * Virtual device context for PMD instance.
+ *
+ * @return
+ * Always 0, even in case of errors.
+ */
+static int
+vdev_netvsc_vdev_probe(struct rte_vdev_device *dev)
+{
+ static const char *const vdev_netvsc_arg[] = {
+ VDEV_NETVSC_ARG_IFACE,
+ VDEV_NETVSC_ARG_MAC,
+ NULL,
+ };
+ const char *name = rte_vdev_device_name(dev);
+ const char *args = rte_vdev_device_args(dev);
+ struct rte_kvargs *kvargs = rte_kvargs_parse(args ? args : "",
+ vdev_netvsc_arg);
+
+ PMD_DRV_LOG(DEBUG,
+ "invoked as \"%s\", using arguments \"%s\"",
+ name, args);
+ if (!kvargs) {
+ PMD_DRV_LOG(ERR, "cannot parse arguments list");
+ goto error;
+ }
+error:
+ if (kvargs)
+ rte_kvargs_free(kvargs);
+ ++vdev_netvsc_ctx_inst;
+ return 0;
+}
+
+/**
+ * Remove PMD instance.
+ *
+ * @param dev
+ * Virtual device context for PMD instance.
+ *
+ * @return
+ * Always 0.
+ */
+static int
+vdev_netvsc_vdev_remove(__rte_unused struct rte_vdev_device *dev)
+{
+ --vdev_netvsc_ctx_inst;
+ return 0;
+}
+
+/** Virtual device descriptor. */
+static struct rte_vdev_driver vdev_netvsc_vdev = {
+ .probe = vdev_netvsc_vdev_probe,
+ .remove = vdev_netvsc_vdev_remove,
+};
+
+RTE_PMD_REGISTER_VDEV(VDEV_NETVSC_DRIVER, vdev_netvsc_vdev);
+RTE_PMD_REGISTER_ALIAS(VDEV_NETVSC_DRIVER, eth_vdev_netvsc);
+RTE_PMD_REGISTER_PARAM_STRING(net_vdev_netvsc,
+ VDEV_NETVSC_ARG_IFACE "=<string> "
+ VDEV_NETVSC_ARG_MAC "=<string>");
+
+/** Initialize driver log type. */
+static void
+vdev_netvsc_init_log(void)
+{
+ vdev_netvsc_logtype = rte_log_register("pmd.vdev_netvsc");
+ if (vdev_netvsc_logtype >= 0)
+ rte_log_set_level(vdev_netvsc_logtype, RTE_LOG_NOTICE);
+}
+
+RTE_INIT(vdev_netvsc_init_log);
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 6a6a7452e..3ae521228 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -156,6 +156,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_SFC_EFX_PMD) += -lrte_pmd_sfc_efx
_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_SZEDATA2) += -lrte_pmd_szedata2 -lsze2
_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_TAP) += -lrte_pmd_tap
_LDLIBS-$(CONFIG_RTE_LIBRTE_THUNDERX_NICVF_PMD) += -lrte_pmd_thunderx_nicvf
+_LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += -lrte_pmd_vdev_netvsc
_LDLIBS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += -lrte_pmd_virtio
ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += -lrte_pmd_vhost
--
2.11.0
Adrien Mazarguil
2017-12-22 18:01:37 UTC
As described in more detail in the attached documentation (see patch
contents), this virtual device driver manages NetVSC interfaces in virtual
machines hosted by Hyper-V/Azure platforms.

This driver does not manage traffic nor Ethernet devices directly; it acts
as a thin configuration layer that automatically instantiates and controls
fail-safe PMD instances combining tap and PCI sub-devices, so that each
NetVSC interface is exposed as a single consolidated port to DPDK
applications.
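
The way each fail-safe instance is described can be sketched as follows. This
is a minimal illustration, not the driver's code: the helper name
build_failsafe_devargs() and its signature are hypothetical, while the format
string is the one the patch uses to combine the control pipe descriptor with a
tap sub-device bound to the NetVSC netdevice.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Build the device arguments handed to a fail-safe instance for one
 * NetVSC interface (hypothetical helper, for illustration only).
 *
 * pipe_rd_fd is the read end of the control pipe, ctx_name the unique
 * instance name and if_name the NetVSC netdevice name.
 *
 * Returns 0 on success, -1 when the result does not fit in buf.
 */
static int
build_failsafe_devargs(char *buf, size_t size, int pipe_rd_fd,
		       const char *ctx_name, const char *if_name)
{
	int ret = snprintf(buf, size, "fd(%d),dev(net_tap_%s,remote=%s)",
			   pipe_rd_fd, ctx_name, if_name);

	/* snprintf() reports the untruncated length; reject truncation. */
	if (ret < 0 || (size_t)ret >= size)
		return -1;
	return 0;
}
```

Rejecting truncated strings matters here because a shortened argument list
would likely still parse but could designate the wrong sub-device.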

Since PCI sub-devices are hot-pluggable (e.g. during VM migration),
applications automatically benefit from increased throughput when they are
present and fall back on NetVSC otherwise, without interruption, thanks to
fail-safe's hot-plug handling.

Once initialized, the sole job of the vdev_netvsc driver is to regularly
scan for PCI devices to associate with NetVSC interfaces and feed their
addresses to corresponding fail-safe instances.
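
As a rough sketch of that scan step, assuming the PCI address is the last
component of the netdevice's sysfs "device" link target and that it is sent
to fail-safe as a newline-terminated string (both as done in the patch), the
address extraction could look like this. The helper names link_basename() and
format_pci_msg() are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the last component of a sysfs link target; for the "device"
 * link of a PCI netdevice this is its bus address.
 */
static const char *
link_basename(const char *target)
{
	const char *slash = strrchr(target, '/');

	return slash ? slash + 1 : target;
}

/*
 * Format the newline-terminated message carrying the PCI address.
 *
 * Returns the message length, or -1 when the address is empty or does
 * not fit in buf.
 */
static int
format_pci_msg(char *buf, size_t size, const char *target)
{
	const char *addr = link_basename(target);
	int ret;

	if (*addr == '\0')
		return -1;
	ret = snprintf(buf, size, "%s\n", addr);
	if (ret < 0 || (size_t)ret >= size)
		return -1;
	return ret;
}
```

The resulting buffer is what would be written to the write end of the control
pipe; the link target used in any call is whatever readlink() returned for
the interface being examined.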

Signed-off-by: Adrien Mazarguil <***@6wind.com>
---
doc/guides/nics/vdev_netvsc.rst | 65 ++++
drivers/net/vdev_netvsc/Makefile | 4 +
drivers/net/vdev_netvsc/vdev_netvsc.c | 581 ++++++++++++++++++++++++++++-
3 files changed, 649 insertions(+), 1 deletion(-)

diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
index be31b6597..73a63e552 100644
--- a/doc/guides/nics/vdev_netvsc.rst
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -38,9 +38,74 @@ platforms.

.. _Hyper-V: https://docs.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-hyper-v

+Implementation details
+----------------------
+
+Each instance of this driver effectively needs to drive two devices: the
+NetVSC interface proper and its SR-IOV VF (referred to as "physical" from
+this point on) counterpart sharing the same MAC address.
+
+Physical devices are part of the host system and cannot be maintained during
+VM migration. From a VM standpoint they appear as hot-plug devices that come
+and go without prior notice.
+
+When the physical device is present, egress and most of the ingress traffic
+flows through it; only multicasts and other hypervisor control data still flow
+through NetVSC. Otherwise, NetVSC acts as a fallback for all traffic.
+
+To avoid unnecessary code duplication and ensure maximum performance,
+handling of physical devices is left to their original PMDs; this virtual
+device driver (also known as *vdev*) manages other PMDs as summarized by the
+following block diagram::
+
+         .------------------.
+         | DPDK application |
+         `--------+---------'
+                  |
+           .------+------.
+           | DPDK ethdev |
+           `------+------'       Control
+                  |                 |
+     .------------+------------.    v    .-----------------.
+     |       failsafe PMD      +---------+ vdev_netvsc PMD |
+     `--+-------------------+--'         `-----------------'
+        |                   |
+        |          .........|.........
+        |          :        |        :
+   .----+----.     :   .----+----.   :
+   | tap PMD |     :   | any PMD |   :
+   `----+----'     :   `----+----'   : <-- Hot-pluggable
+        |          :        |        :
+ .------+-------.  :  .-----+-----.  :
+ | NetVSC-based |  :  | SR-IOV VF |  :
+ |   netdevice  |  :  |   device  |  :
+ `--------------'  :  `-----------'  :
+                   :.................:
+
Build options
-------------

- ``CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD`` (default ``y``)

Toggle compilation of this driver.
+
+Run-time parameters
+-------------------
+
+To invoke this PMD, applications have to explicitly provide the
+``--vdev=net_vdev_netvsc`` EAL option.
+
+The following device parameters are supported:
+
+- ``iface`` [string]
+
+ Provide a specific NetVSC interface (netdevice) name to attach this PMD
+ to. Can be provided multiple times for additional instances.
+
+- ``mac`` [string]
+
+ Same as ``iface`` except a suitable NetVSC interface is located using its
+ MAC address.
+
+Not specifying either ``iface`` or ``mac`` makes this PMD attach itself to
+all NetVSC interfaces found on the system.
diff --git a/drivers/net/vdev_netvsc/Makefile b/drivers/net/vdev_netvsc/Makefile
index e53050fe1..3b3fe1c56 100644
--- a/drivers/net/vdev_netvsc/Makefile
+++ b/drivers/net/vdev_netvsc/Makefile
@@ -40,6 +40,9 @@ EXPORT_MAP := rte_pmd_vdev_netvsc_version.map
CFLAGS += -O3
CFLAGS += -g
CFLAGS += -std=c11 -pedantic -Wall -Wextra
+CFLAGS += -D_XOPEN_SOURCE=600
+CFLAGS += -D_BSD_SOURCE
+CFLAGS += -D_DEFAULT_SOURCE
CFLAGS += $(WERROR_FLAGS)

# Dependencies.
@@ -47,6 +50,7 @@ LDLIBS += -lrte_bus_vdev
LDLIBS += -lrte_eal
LDLIBS += -lrte_ethdev
LDLIBS += -lrte_kvargs
+LDLIBS += -lrte_net

# Source files.
SRCS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += vdev_netvsc.c
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
index 3b73482da..738196e75 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -31,17 +31,41 @@
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <linux/sockios.h>
+#include <net/if.h>
+#include <netinet/ip.h>
+#include <stdarg.h>
#include <stddef.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/queue.h>
+#include <sys/socket.h>
+#include <unistd.h>

+#include <rte_alarm.h>
+#include <rte_bus.h>
#include <rte_bus_vdev.h>
#include <rte_common.h>
#include <rte_config.h>
+#include <rte_dev.h>
+#include <rte_errno.h>
+#include <rte_ethdev.h>
+#include <rte_ether.h>
#include <rte_kvargs.h>
#include <rte_log.h>

#define VDEV_NETVSC_DRIVER net_vdev_netvsc
#define VDEV_NETVSC_ARG_IFACE "iface"
#define VDEV_NETVSC_ARG_MAC "mac"
+#define VDEV_NETVSC_PROBE_MS 1000
+
+#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"

#define PMD_DRV_LOG(level, ...) \
rte_log(RTE_LOG_ ## level, \
@@ -53,12 +77,527 @@
/** Driver-specific log messages type. */
static int vdev_netvsc_logtype;

+/** Context structure for a vdev_netvsc instance. */
+struct vdev_netvsc_ctx {
+ LIST_ENTRY(vdev_netvsc_ctx) entry; /**< Next entry in list. */
+ unsigned int id; /**< ID used to generate unique names. */
+ char name[64]; /**< Unique name for vdev_netvsc instance. */
+ char devname[64]; /**< Fail-safe PMD instance name. */
+ char devargs[256]; /**< Fail-safe PMD instance device arguments. */
+ char if_name[IF_NAMESIZE]; /**< NetVSC netdevice name. */
+ unsigned int if_index; /**< NetVSC netdevice index. */
+ struct ether_addr if_addr; /**< NetVSC MAC address. */
+ int pipe[2]; /**< Communication pipe with fail-safe instance. */
+ char yield[256]; /**< Current device string used with fail-safe. */
+};
+
+/** Context list is common to all PMD instances. */
+static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
+ LIST_HEAD_INITIALIZER(vdev_netvsc_ctx_list);
+
+/** Number of entries in context list. */
+static unsigned int vdev_netvsc_ctx_count;
+
/** Number of PMD instances relying on context list. */
static unsigned int vdev_netvsc_ctx_inst;

/**
+ * Destroy a vdev_netvsc context instance.
+ *
+ * @param ctx
+ * Context to destroy.
+ */
+static void
+vdev_netvsc_ctx_destroy(struct vdev_netvsc_ctx *ctx)
+{
+ if (ctx->pipe[0] != -1)
+ close(ctx->pipe[0]);
+ if (ctx->pipe[1] != -1)
+ close(ctx->pipe[1]);
+ free(ctx);
+}
+
+/**
+ * Iterate over system network interfaces.
+ *
+ * This function runs a given callback function for each netdevice found on
+ * the system.
+ *
+ * @param func
+ * Callback function pointer. List traversal is aborted when this function
+ * returns a nonzero value.
+ * @param ...
+ * Variable parameter list passed as @p va_list to @p func.
+ *
+ * @return
+ * 0 when the entire list is traversed successfully, a negative error code
+ * in case of failure, or the nonzero value returned by @p func when list
+ * traversal is aborted.
+ */
+static int
+vdev_netvsc_foreach_iface(int (*func)(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap), ...)
+{
+ struct if_nameindex *iface = if_nameindex();
+ int s = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+ unsigned int i;
+ int ret = 0;
+
+ if (!iface) {
+ ret = -ENOBUFS;
+ PMD_DRV_LOG(ERR, "cannot retrieve system network interfaces");
+ goto error;
+ }
+ if (s == -1) {
+ ret = -errno;
+ PMD_DRV_LOG(ERR, "cannot open socket: %s", rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; iface[i].if_name; ++i) {
+ struct ifreq req;
+ struct ether_addr eth_addr;
+ va_list ap;
+
+ strncpy(req.ifr_name, iface[i].if_name, sizeof(req.ifr_name));
+ if (ioctl(s, SIOCGIFHWADDR, &req) == -1) {
+ PMD_DRV_LOG(WARNING,
+ "cannot retrieve information about"
+ " interface \"%s\": %s",
+ req.ifr_name, rte_strerror(errno));
+ continue;
+ }
+ memcpy(eth_addr.addr_bytes, req.ifr_hwaddr.sa_data,
+ RTE_DIM(eth_addr.addr_bytes));
+ va_start(ap, func);
+ ret = func(&iface[i], &eth_addr, ap);
+ va_end(ap);
+ if (ret)
+ break;
+ }
+error:
+ if (s != -1)
+ close(s);
+ if (iface)
+ if_freenameindex(iface);
+ return ret;
+}
+
+/**
+ * Determine if a network interface is NetVSC.
+ *
+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ *
+ * @return
+ * A nonzero value when interface is detected as NetVSC. In case of error,
+ * rte_errno is updated and 0 returned.
+ */
+static int
+vdev_netvsc_iface_is_netvsc(const struct if_nameindex *iface)
+{
+ static const char temp[] = "/sys/class/net/%s/device/class_id";
+ char path[sizeof(temp) + IF_NAMESIZE];
+ FILE *f;
+ int ret;
+ int len = 0;
+
+ ret = snprintf(path, sizeof(path), temp, iface->if_name);
+ if (ret == -1 || (size_t)ret >= sizeof(path)) {
+ rte_errno = ENOBUFS;
+ return 0;
+ }
+ f = fopen(path, "r");
+ if (!f) {
+ rte_errno = errno;
+ return 0;
+ }
+ ret = fscanf(f, NETVSC_CLASS_ID "%n", &len);
+ if (ret == EOF)
+ rte_errno = errno;
+ ret = len == (int)strlen(NETVSC_CLASS_ID);
+ fclose(f);
+ return ret;
+}
+
+/**
+ * Retrieve network interface data from sysfs symbolic link.
+ *
+ * @param[out] buf
+ * Output data buffer.
+ * @param size
+ * Output buffer size.
+ * @param[in] if_name
+ * Netdevice name.
+ * @param[in] relpath
+ * Symbolic link path relative to netdevice sysfs entry.
+ *
+ * @return
+ * 0 on success, a negative error code otherwise.
+ */
+static int
+vdev_netvsc_sysfs_readlink(char *buf, size_t size, const char *if_name,
+ const char *relpath)
+{
+ int ret;
+
+ ret = snprintf(buf, size, "/sys/class/net/%s/%s", if_name, relpath);
+ if (ret == -1 || (size_t)ret >= size)
+ return -ENOBUFS;
+ ret = readlink(buf, buf, size);
+ if (ret == -1)
+ return -errno;
+ if ((size_t)ret >= size - 1)
+ return -ENOBUFS;
+ buf[ret] = '\0';
+ return 0;
+}
+
+/**
+ * Probe a network interface to associate with vdev_netvsc context.
+ *
+ * This function determines if the network device matches the properties of
+ * the NetVSC interface associated with the vdev_netvsc context and
+ * communicates its bus address to the fail-safe PMD instance if so.
+ *
+ * It is normally used with vdev_netvsc_foreach_iface().
+ *
+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ * @param[in] eth_addr
+ * MAC address associated with @p iface.
+ * @param ap
+ * Variable arguments list comprising:
+ *
+ * - struct vdev_netvsc_ctx *ctx:
+ * Context to associate network interface with.
+ *
+ * @return
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+vdev_netvsc_device_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ struct vdev_netvsc_ctx *ctx = va_arg(ap, struct vdev_netvsc_ctx *);
+ char buf[RTE_MAX(sizeof(ctx->yield), 256u)];
+ const char *addr;
+ size_t len;
+ int ret;
+
+ /* Skip non-matching or unwanted NetVSC interfaces. */
+ if (ctx->if_index == iface->if_index) {
+ if (!strcmp(ctx->if_name, iface->if_name))
+ return 0;
+ PMD_DRV_LOG(DEBUG,
+ "NetVSC interface \"%s\" (index %u) renamed \"%s\"",
+ ctx->if_name, ctx->if_index, iface->if_name);
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ return 0;
+ }
+ if (vdev_netvsc_iface_is_netvsc(iface))
+ return 0;
+ if (!is_same_ether_addr(eth_addr, &ctx->if_addr))
+ return 0;
+ /* Look for associated PCI device. */
+ ret = vdev_netvsc_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device/subsystem");
+ if (ret)
+ return 0;
+ addr = strrchr(buf, '/');
+ addr = addr ? addr + 1 : buf;
+ if (strcmp(addr, "pci"))
+ return 0;
+ ret = vdev_netvsc_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device");
+ if (ret)
+ return 0;
+ addr = strrchr(buf, '/');
+ addr = addr ? addr + 1 : buf;
+ len = strlen(addr);
+ if (!len)
+ return 0;
+ /* Send PCI device argument to fail-safe PMD instance. */
+ if (strcmp(addr, ctx->yield))
+ PMD_DRV_LOG(DEBUG,
+ "associating PCI device \"%s\" with NetVSC"
+ " interface \"%s\" (index %u)",
+ addr, ctx->if_name, ctx->if_index);
+ memmove(buf, addr, len + 1);
+ addr = buf;
+ buf[len] = '\n';
+ ret = write(ctx->pipe[1], addr, len + 1);
+ buf[len] = '\0';
+ if (ret == -1) {
+ if (errno == EINTR || errno == EAGAIN)
+ return 1;
+ PMD_DRV_LOG(WARNING,
+ "cannot associate PCI device name \"%s\" with"
+ " interface \"%s\": %s",
+ addr, ctx->if_name, rte_strerror(errno));
+ return 1;
+ }
+ if ((size_t)ret != len + 1) {
+ /*
+ * Attempt to override previous partial write, no need to
+ * recover if that fails.
+ */
+ ret = write(ctx->pipe[1], "\n", 1);
+ (void)ret;
+ return 1;
+ }
+ fsync(ctx->pipe[1]);
+ memcpy(ctx->yield, addr, len + 1);
+ return 1;
+}
+
+/**
+ * Alarm callback that regularly probes system network interfaces.
+ *
+ * This callback runs at a frequency determined by VDEV_NETVSC_PROBE_MS as
+ * long as a vdev_netvsc context instance exists.
+ *
+ * @param arg
+ * Ignored.
+ */
+static void
+vdev_netvsc_alarm(__rte_unused void *arg)
+{
+ struct vdev_netvsc_ctx *ctx;
+ int ret;
+
+ LIST_FOREACH(ctx, &vdev_netvsc_ctx_list, entry) {
+ ret = vdev_netvsc_foreach_iface(vdev_netvsc_device_probe, ctx);
+ if (ret)
+ break;
+ }
+ if (!vdev_netvsc_ctx_count)
+ return;
+ ret = rte_eal_alarm_set(VDEV_NETVSC_PROBE_MS * 1000,
+ vdev_netvsc_alarm, NULL);
+ if (ret < 0) {
+ PMD_DRV_LOG(ERR, "unable to reschedule alarm callback: %s",
+ rte_strerror(-ret));
+ }
+}
+
+/**
+ * Probe a NetVSC interface to generate a vdev_netvsc context from.
+ *
+ * This function instantiates vdev_netvsc contexts either for all NetVSC
+ * devices found on the system or only a subset provided as device
+ * arguments.
+ *
+ * It is normally used with vdev_netvsc_foreach_iface().
+ *
+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ * @param[in] eth_addr
+ * MAC address associated with @p iface.
+ * @param ap
+ * Variable arguments list comprising:
+ *
+ * - const char *name:
+ * Name associated with current driver instance.
+ *
+ * - struct rte_kvargs *kvargs:
+ * Device arguments provided to current driver instance.
+ *
+ * - unsigned int specified:
+ * Number of specific netdevices provided as device arguments.
+ *
+ * - unsigned int *matched:
+ * The number of specified netdevices matched by this function.
+ *
+ * @return
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+vdev_netvsc_netvsc_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ const char *name = va_arg(ap, const char *);
+ struct rte_kvargs *kvargs = va_arg(ap, struct rte_kvargs *);
+ unsigned int specified = va_arg(ap, unsigned int);
+ unsigned int *matched = va_arg(ap, unsigned int *);
+ unsigned int i;
+ struct vdev_netvsc_ctx *ctx;
+ uint16_t port_id;
+ int ret;
+
+ /* Probe all interfaces when none are specified. */
+ if (specified) {
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE)) {
+ if (!strcmp(pair->value, iface->if_name))
+ break;
+ } else if (!strcmp(pair->key, VDEV_NETVSC_ARG_MAC)) {
+ struct ether_addr tmp;
+
+ if (sscanf(pair->value,
+ "%" SCNx8 ":%" SCNx8 ":%" SCNx8 ":"
+ "%" SCNx8 ":%" SCNx8 ":%" SCNx8,
+ &tmp.addr_bytes[0],
+ &tmp.addr_bytes[1],
+ &tmp.addr_bytes[2],
+ &tmp.addr_bytes[3],
+ &tmp.addr_bytes[4],
+ &tmp.addr_bytes[5]) != 6) {
+ PMD_DRV_LOG(ERR,
+ "invalid MAC address format"
+ " \"%s\"",
+ pair->value);
+ return -EINVAL;
+ }
+ if (is_same_ether_addr(eth_addr, &tmp))
+ break;
+ }
+ }
+ if (i == kvargs->count)
+ return 0;
+ ++(*matched);
+ }
+ /* Weed out interfaces already handled. */
+ LIST_FOREACH(ctx, &vdev_netvsc_ctx_list, entry)
+ if (ctx->if_index == iface->if_index)
+ break;
+ if (ctx) {
+ if (!specified)
+ return 0;
+ PMD_DRV_LOG(WARNING,
+ "interface \"%s\" (index %u) is already handled,"
+ " skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ if (!vdev_netvsc_iface_is_netvsc(iface)) {
+ if (!specified)
+ return 0;
+ PMD_DRV_LOG(WARNING,
+ "interface \"%s\" (index %u) is not NetVSC,"
+ " skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ /* Create interface context. */
+ ctx = calloc(1, sizeof(*ctx));
+ if (!ctx) {
+ ret = -errno;
+ PMD_DRV_LOG(ERR,
+ "cannot allocate context for interface \"%s\": %s",
+ iface->if_name, rte_strerror(errno));
+ goto error;
+ }
+ ctx->id = vdev_netvsc_ctx_count;
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ ctx->if_index = iface->if_index;
+ ctx->if_addr = *eth_addr;
+ ctx->pipe[0] = -1;
+ ctx->pipe[1] = -1;
+ ctx->yield[0] = '\0';
+ if (pipe(ctx->pipe) == -1) {
+ ret = -errno;
+ PMD_DRV_LOG(ERR,
+ "cannot allocate control pipe for interface"
+ " \"%s\": %s",
+ ctx->if_name, rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; i != RTE_DIM(ctx->pipe); ++i) {
+ int flf = fcntl(ctx->pipe[i], F_GETFL);
+
+ if (flf != -1 &&
+ fcntl(ctx->pipe[i], F_SETFL, flf | O_NONBLOCK) != -1)
+ continue;
+ ret = -errno;
+ PMD_DRV_LOG(ERR,
+ "cannot toggle non-blocking flag on control file"
+ " descriptor #%u (%d): %s",
+ i, ctx->pipe[i], rte_strerror(errno));
+ goto error;
+ }
+ /* Generate virtual device name and arguments. */
+ i = 0;
+ ret = snprintf(ctx->name, sizeof(ctx->name), "%s_id%u",
+ name, ctx->id);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->name))
+ ++i;
+ ret = snprintf(ctx->devname, sizeof(ctx->devname), "net_failsafe_%s",
+ ctx->name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devname))
+ ++i;
+ ret = snprintf(ctx->devargs, sizeof(ctx->devargs),
+ "fd(%d),dev(net_tap_%s,remote=%s)",
+ ctx->pipe[0], ctx->name, ctx->if_name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devargs))
+ ++i;
+ if (i) {
+ ret = -ENOBUFS;
+ PMD_DRV_LOG(ERR,
+ "generated virtual device name or argument list"
+ " too long for interface \"%s\"",
+ ctx->if_name);
+ goto error;
+ }
+ /*
+ * Remove any competing rte_eth_dev entries sharing the same MAC
+ * address, fail-safe instances created by this PMD will handle them
+ * as sub-devices later.
+ */
+ RTE_ETH_FOREACH_DEV(port_id) {
+ struct rte_device *dev = rte_eth_devices[port_id].device;
+ struct rte_bus *bus = rte_bus_find_by_device(dev);
+ struct ether_addr tmp;
+
+ rte_eth_macaddr_get(port_id, &tmp);
+ if (!is_same_ether_addr(eth_addr, &tmp))
+ continue;
+ PMD_DRV_LOG(WARNING,
+ "removing device \"%s\" with identical MAC address"
+ " to re-create it as a fail-safe sub-device",
+ dev->name);
+ if (!bus)
+ ret = -EINVAL;
+ else
+ ret = rte_eal_hotplug_remove(bus->name, dev->name);
+ if (ret < 0) {
+ PMD_DRV_LOG(ERR, "unable to remove device \"%s\": %s",
+ dev->name, rte_strerror(-ret));
+ goto error;
+ }
+ }
+ /* Request virtual device generation. */
+ PMD_DRV_LOG(DEBUG,
+ "generating virtual device \"%s\" with arguments \"%s\"",
+ ctx->devname, ctx->devargs);
+ ret = rte_eal_hotplug_add("vdev", ctx->devname, ctx->devargs);
+ if (ret)
+ goto error;
+ LIST_INSERT_HEAD(&vdev_netvsc_ctx_list, ctx, entry);
+ ++vdev_netvsc_ctx_count;
+ PMD_DRV_LOG(DEBUG,
+ "added NetVSC interface \"%s\" to context list",
+ ctx->if_name);
+ return 0;
+error:
+ if (ctx)
+ vdev_netvsc_ctx_destroy(ctx);
+ return ret;
+}
+
+/**
* Probe NetVSC interfaces.
*
+ * This function probes system netdevices according to the specified device
+ * arguments and starts a periodic alarm callback to notify the resulting
+ * fail-safe PMD instances of their sub-devices whereabouts.
+ *
* @param dev
* Virtual device context for PMD instance.
*
@@ -77,6 +616,10 @@ vdev_netvsc_vdev_probe(struct rte_vdev_device *dev)
const char *args = rte_vdev_device_args(dev);
struct rte_kvargs *kvargs = rte_kvargs_parse(args ? args : "",
vdev_netvsc_arg);
+ unsigned int specified = 0;
+ unsigned int matched = 0;
+ unsigned int i;
+ int ret;

PMD_DRV_LOG(DEBUG,
"invoked as \"%s\", using arguments \"%s\"",
@@ -85,6 +628,30 @@ vdev_netvsc_vdev_probe(struct rte_vdev_device *dev)
PMD_DRV_LOG(ERR, "cannot parse arguments list");
goto error;
}
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE) ||
+ !strcmp(pair->key, VDEV_NETVSC_ARG_MAC))
+ ++specified;
+ }
+ rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
+ /* Gather interfaces. */
+ ret = vdev_netvsc_foreach_iface(vdev_netvsc_netvsc_probe, name, kvargs,
+ specified, &matched);
+ if (ret < 0)
+ goto error;
+ if (matched < specified)
+ PMD_DRV_LOG(WARNING,
+ "some of the specified parameters did not match"
+ " recognized network interfaces");
+ /* Probe interfaces immediately. */
+ vdev_netvsc_alarm(NULL);
+ if (ret < 0) {
+ PMD_DRV_LOG(ERR, "unable to schedule alarm callback: %s",
+ rte_strerror(-ret));
+ goto error;
+ }
error:
if (kvargs)
rte_kvargs_free(kvargs);
@@ -95,6 +662,9 @@ vdev_netvsc_vdev_probe(struct rte_vdev_device *dev)
/**
* Remove PMD instance.
*
+ * The alarm callback and underlying vdev_netvsc context instances are only
+ * destroyed after the last PMD instance is removed.
+ *
* @param dev
* Virtual device context for PMD instance.
*
@@ -104,7 +674,16 @@ vdev_netvsc_vdev_probe(struct rte_vdev_device *dev)
static int
vdev_netvsc_vdev_remove(__rte_unused struct rte_vdev_device *dev)
{
- --vdev_netvsc_ctx_inst;
+ if (--vdev_netvsc_ctx_inst)
+ return 0;
+ rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
+ while (!LIST_EMPTY(&vdev_netvsc_ctx_list)) {
+ struct vdev_netvsc_ctx *ctx = LIST_FIRST(&vdev_netvsc_ctx_list);
+
+ LIST_REMOVE(ctx, entry);
+ --vdev_netvsc_ctx_count;
+ vdev_netvsc_ctx_destroy(ctx);
+ }
return 0;
}
--
2.11.0
Adrien Mazarguil
2017-12-22 18:01:39 UTC
This parameter allows specifying any non-NetVSC interface to use with tap
sub-devices for development purposes.
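
The resulting acceptance rule can be summarized by the sketch below. It is an
illustration under stated assumptions rather than the driver's code: the
function netvsc_filter() and the enum are hypothetical, and the logic mirrors
the patch in that "force" only affects interfaces explicitly requested
through "iface" or "mac".

```c
#include <stdlib.h>

/* Outcome for a candidate interface (hypothetical names). */
enum iface_action { IFACE_SKIP, IFACE_USE };

/*
 * Decide whether a candidate interface gets a context.
 *
 * is_netvsc: nonzero when the interface was detected as NetVSC.
 * specified: number of iface/mac arguments given to the driver.
 * force_arg: raw value of the "force" argument, NULL when absent.
 */
static enum iface_action
netvsc_filter(int is_netvsc, unsigned int specified, const char *force_arg)
{
	/* Any nonzero integer enables forcing, as with !!atoi() in the patch. */
	int force = force_arg ? !!atoi(force_arg) : 0;

	if (is_netvsc)
		return IFACE_USE;
	/* Without explicit arguments, non-NetVSC interfaces are ignored. */
	if (!specified)
		return IFACE_SKIP;
	return force ? IFACE_USE : IFACE_SKIP;
}
```

In other words, "force=1" alone does not make the driver pick up arbitrary
interfaces; it has to be combined with "iface" or "mac".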

Signed-off-by: Adrien Mazarguil <***@6wind.com>
---
doc/guides/nics/vdev_netvsc.rst | 5 +++++
drivers/net/vdev_netvsc/vdev_netvsc.c | 27 +++++++++++++++++++--------
2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
index 73a63e552..a0417b5ef 100644
--- a/doc/guides/nics/vdev_netvsc.rst
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -107,5 +107,10 @@ The following device parameters are supported:
Same as ``iface`` except a suitable NetVSC interface is located using its
MAC address.

+- ``force`` [int]
+
+ If nonzero, forces the use of specified interfaces even if not detected as
+ NetVSC.
+
Not specifying either ``iface`` or ``mac`` makes this PMD attach itself to
all NetVSC interfaces found on the system.
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
index 738196e75..5e426adc0 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -63,6 +63,7 @@
#define VDEV_NETVSC_DRIVER net_vdev_netvsc
#define VDEV_NETVSC_ARG_IFACE "iface"
#define VDEV_NETVSC_ARG_MAC "mac"
+#define VDEV_NETVSC_ARG_FORCE "force"
#define VDEV_NETVSC_PROBE_MS 1000

#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"
@@ -405,6 +406,9 @@ vdev_netvsc_alarm(__rte_unused void *arg)
* - struct rte_kvargs *kvargs:
* Device arguments provided to current driver instance.
*
+ * - int force:
+ * Accept specified interface even if not detected as NetVSC.
+ *
* - unsigned int specified:
* Number of specific netdevices provided as device arguments.
*
@@ -422,6 +426,7 @@ vdev_netvsc_netvsc_probe(const struct if_nameindex *iface,
{
const char *name = va_arg(ap, const char *);
struct rte_kvargs *kvargs = va_arg(ap, struct rte_kvargs *);
+ int force = va_arg(ap, int);
unsigned int specified = va_arg(ap, unsigned int);
unsigned int *matched = va_arg(ap, unsigned int *);
unsigned int i;
@@ -480,10 +485,11 @@ vdev_netvsc_netvsc_probe(const struct if_nameindex *iface,
if (!specified)
return 0;
PMD_DRV_LOG(WARNING,
- "interface \"%s\" (index %u) is not NetVSC,"
- " skipping",
- iface->if_name, iface->if_index);
- return 0;
+ "interface \"%s\" (index %u) is not NetVSC, %s",
+ iface->if_name, iface->if_index,
+ force ? "using anyway (forced)" : "skipping");
+ if (!force)
+ return 0;
}
/* Create interface context. */
ctx = calloc(1, sizeof(*ctx));
@@ -610,6 +616,7 @@ vdev_netvsc_vdev_probe(struct rte_vdev_device *dev)
static const char *const vdev_netvsc_arg[] = {
VDEV_NETVSC_ARG_IFACE,
VDEV_NETVSC_ARG_MAC,
+ VDEV_NETVSC_ARG_FORCE,
NULL,
};
const char *name = rte_vdev_device_name(dev);
@@ -618,6 +625,7 @@ vdev_netvsc_vdev_probe(struct rte_vdev_device *dev)
vdev_netvsc_arg);
unsigned int specified = 0;
unsigned int matched = 0;
+ int force = 0;
unsigned int i;
int ret;

@@ -631,14 +639,16 @@ vdev_netvsc_vdev_probe(struct rte_vdev_device *dev)
for (i = 0; i != kvargs->count; ++i) {
const struct rte_kvargs_pair *pair = &kvargs->pairs[i];

- if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE) ||
- !strcmp(pair->key, VDEV_NETVSC_ARG_MAC))
+ if (!strcmp(pair->key, VDEV_NETVSC_ARG_FORCE))
+ force = !!atoi(pair->value);
+ else if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE) ||
+ !strcmp(pair->key, VDEV_NETVSC_ARG_MAC))
++specified;
}
rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
/* Gather interfaces. */
ret = vdev_netvsc_foreach_iface(vdev_netvsc_netvsc_probe, name, kvargs,
- specified, &matched);
+ force, specified, &matched);
if (ret < 0)
goto error;
if (matched < specified)
@@ -697,7 +707,8 @@ RTE_PMD_REGISTER_VDEV(VDEV_NETVSC_DRIVER, vdev_netvsc_vdev);
RTE_PMD_REGISTER_ALIAS(VDEV_NETVSC_DRIVER, eth_vdev_netvsc);
RTE_PMD_REGISTER_PARAM_STRING(net_vdev_netvsc,
VDEV_NETVSC_ARG_IFACE "=<string> "
- VDEV_NETVSC_ARG_MAC "=<string>");
+ VDEV_NETVSC_ARG_MAC "=<string> "
+ VDEV_NETVSC_ARG_FORCE "=<int>");

/** Initialize driver log type. */
static void
--
2.11.0
Stephen Hemminger
2017-12-23 02:06:58 UTC
Why does this need to be a PMD?
Maybe we need some platform infrastructure?
My definition of a PMD is that it can send and receive packets.
Post by Adrien Mazarguil
Virtual machines hosted by Hyper-V/Azure platforms are fitted with
simplified virtual network devices named NetVSC that are used for fast
communication between VM to VM, VM to hypervisor, and the outside.
They appear as standard system netdevices to user-land applications, the
main difference being they are implemented on top of VMBUS [1] instead of
emulated PCI devices.
While this reads like a case for a standard DPDK PMD, there is more to it.
To accelerate outside communication, NetVSC devices as they appear in a VM
can be paired with physical SR-IOV virtual function (VF) devices owned by
that same VM [2]. Both netdevices share the same MAC address in that case.
When paired, egress and most of the ingress traffic flow through the VF
device, while part of it (e.g. multicasts, hypervisor control data) still
flows through NetVSC. Moreover VF devices are not retained and disappear
during VM migration; from a VM standpoint, they can be hot-plugged anytime
with NetVSC acting as a fallback.
Running DPDK applications in such a context involves driving VF devices
using their dedicated PMDs in a vendor-independent fashion (to benefit from
maximum performance without writing dedicated code) while simultaneously
listening to NetVSC and handling the related hot-plug events.
This new virtual PMD (referred to as "vdev_netvsc" from this point on)
automatically coordinates the Hyper-V/Azure-specific management part
described above by relying on vendor-specific, failsafe and tap PMDs to
expose a single consolidated Ethernet device usable directly by existing
applications.
.------------------.
| DPDK application |
`--------+---------'
|
.------+------.
| DPDK ethdev |
`------+------' Control
| |
.------------+------------. v .-----------------.
| failsafe PMD +---------+ vdev_netvsc PMD |
`--+-------------------+--' `-----------------'
| |
| .........|.........
`----+----' : `----+----' : <-- Hot-pluggable
Note this diagram differs from that of the original RFC [3], with
vdev_netvsc no longer acting as a data plane layer.
This initial version of the driver only works in whitelist mode. Users have
to provide the --vdev net_vdev_netvsc EAL option at least once to trigger
it.
Subsequent work will add support for blacklist mode based on automatic
detection of the host environment.
[1] http://dpdk.org/ml/archives/dev/2017-January/054165.html
[2] https://docs.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-hyper-v
[3] http://dpdk.org/ml/archives/dev/2017-November/082339.html
- Renamed driver from "hyperv" to "vdev_netvsc". This change covers
documentation and symbols prefix.
- Driver is now tagged EXPERIMENTAL.
- Replaced ether_addr_from_str() with a basic sscanf() call.
- Removed debugging code (memset() poisoning).
- Fixed hyperv_iface_is_netvsc()'s buffer allocation according to comments.
- Removed hyperv_basename().
- Discarded unused variables through __rte_unused.
- Added separate but necessary free() bugfix for failsafe PMD.
- Added file descriptor input support to failsafe PMD.
- Replaced temporary bash execution; failsafe now reads device definitions
directly through a pipe without an intermediate bash one-liner.
- Expanded DEBUG/INFO/WARN/ERROR() macros as PMD_DRV_LOG().
- Added dynamic log type (pmd.vdev_netvsc).
- Modified initialization code to probe devices immediately during startup.
- Fixed several snprintf() return value checks ("ret >= sizeof(foo)" is more
appropriate than "ret >= sizeof(foo) - 1").
net/failsafe: fix invalid free
net/failsafe: add "fd" parameter
net/vdev_netvsc: introduce Hyper-V platform driver
net/vdev_netvsc: implement core functionality
net/vdev_netvsc: add "force" parameter
MAINTAINERS | 6 +
config/common_base | 5 +
config/common_linuxapp | 1 +
doc/guides/nics/fail_safe.rst | 9 +
doc/guides/nics/features/vdev_netvsc.ini | 12 +
doc/guides/nics/index.rst | 1 +
doc/guides/nics/vdev_netvsc.rst | 116 +++
drivers/net/Makefile | 1 +
drivers/net/failsafe/failsafe_args.c | 88 ++-
drivers/net/failsafe/failsafe_private.h | 3 +
drivers/net/vdev_netvsc/Makefile | 58 ++
.../vdev_netvsc/rte_pmd_vdev_netvsc_version.map | 4 +
drivers/net/vdev_netvsc/vdev_netvsc.c | 722 +++++++++++++++++++
mk/rte.app.mk | 1 +
14 files changed, 1025 insertions(+), 2 deletions(-)
create mode 100644 doc/guides/nics/features/vdev_netvsc.ini
create mode 100644 doc/guides/nics/vdev_netvsc.rst
create mode 100644 drivers/net/vdev_netvsc/Makefile
create mode 100644 drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map
create mode 100644 drivers/net/vdev_netvsc/vdev_netvsc.c
--
2.11.0
Thomas Monjalon
2017-12-23 14:28:06 UTC
Post by Stephen Hemminger
Why does this need to be a PMD?
It needs to be a driver on top of buses.
Post by Stephen Hemminger
Maybe we need some platform infrastructure?
What would be such infrastructure? A new driver type?
Something like drivers/platform/?

I am not sure it is required for this driver, given that it is
most probably only a temporary driver waiting for the NetVSC PMD
and full hotplug support in DPDK internals.

I think we should create such new infrastructure only when we are
sure it is needed permanently for some drivers.
Post by Stephen Hemminger
My definition of PMD is it can send and receive
It is the definition of an ethdev driver, yes.
Matan Azrad
2018-01-09 14:47:25 UTC
Virtual machines hosted by Hyper-V/Azure platforms are fitted with simplified virtual network devices named NetVSC that are used for fast communication between VM to VM, VM to hypervisor, and the outside.

They appear as standard system netdevices to user-land applications, the main difference being they are implemented on top of VMBUS instead of emulated PCI devices.

While this reads like a case for a standard DPDK PMD, there is more to it.

To accelerate outside communication, NetVSC devices as they appear in a VM can be paired with physical SR-IOV virtual function (VF) devices owned by that same VM. Both netdevices share the same MAC address in that case.

When paired, egress and most of the ingress traffic flow through the VF device, while part of it (e.g. multicasts, hypervisor control data) still flows through NetVSC. Moreover VF devices are not retained and disappear during VM migration; from a VM standpoint, they can be hot-plugged anytime with NetVSC acting as a fallback.

Running DPDK applications in such a context involves driving VF devices using their dedicated PMDs in a vendor-independent fashion (to benefit from maximum performance without writing dedicated code) while simultaneously listening to NetVSC and handling the related hot-plug events.

This new virtual driver (referred to as "vdev_netvsc" from this point on) automatically coordinates the Hyper-V/Azure-specific management part described above by relying on vendor-specific, failsafe and tap PMDs to expose a single consolidated Ethernet device usable directly by existing applications.

.------------------.
| DPDK application |
`--------+---------'
|
.------+------.
| DPDK ethdev |
`------+------' Control
| |
.------------+------------. v .--------------------.
| failsafe PMD +---------+ vdev_netvsc driver |
`--+-------------------+--' `--------------------'
| |
| .........|.........
| : | :
.----+----. : .----+----. :
| tap PMD | : | any PMD | :
`----+----' : `----+----' : <-- Hot-pluggable
| : | :
.------+-------. : .-----+-----. :
| NetVSC-based | : | SR-IOV VF | :
| netdevice | : | device | :
`--------------' : `-----------' :
:.................:



v2 changes(Adrien):

- Renamed driver from "hyperv" to "vdev_netvsc". This change covers
documentation and symbols prefix.
- Driver is now tagged EXPERIMENTAL.
- Replaced ether_addr_from_str() with a basic sscanf() call.
- Removed debugging code (memset() poisoning).
- Fixed hyperv_iface_is_netvsc()'s buffer allocation according to comments.
- Removed hyperv_basename().
- Discarded unused variables through __rte_unused.
- Added separate but necessary free() bugfix for failsafe PMD.
- Added file descriptor input support to failsafe PMD.
- Replaced temporary bash execution; failsafe now reads device definitions
directly through a pipe without an intermediate bash one-liner.
- Expanded DEBUG/INFO/WARN/ERROR() macros as PMD_DRV_LOG().
- Added dynamic log type (pmd.vdev_netvsc).
- Modified initialization code to probe devices immediately during startup.
- Fixed several snprintf() return value checks ("ret >= sizeof(foo)" is more
appropriate than "ret >= sizeof(foo) - 1").
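To illustrate the last item: snprintf() returns the number of characters that would have been written, excluding the terminating NUL, so output is truncated exactly when "ret >= sizeof(foo)". A minimal sketch follows; the helper name format_name() and the "net_%s" format are made up for illustration and do not come from the driver.

```c
#include <stdio.h>

/*
 * Hypothetical helper: format "net_<iface>" into buf.
 * Returns 0 on success, -1 on encoding error or truncation.
 */
static int
format_name(char *buf, size_t size, const char *iface)
{
	int ret = snprintf(buf, size, "net_%s", iface);

	/*
	 * ret excludes the terminating NUL, hence ret == size - 1 still
	 * fits entirely; only ret >= size means the output was truncated.
	 */
	if (ret < 0 || (size_t)ret >= size)
		return -1;
	return 0;
}
```

With an 8-byte buffer, "abc" produces the 7-character string "net_abc" and fits, whereas the stricter "ret >= sizeof(foo) - 1" check would have rejected it.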

v3 changes(Matan):
- Fixed clang compilation in V2.
- Removed hotplug remove code from the new driver.
- Added support for capturing already-probed sub-devices in fail-safe.
- Added automatic probing for HyperV VM systems.
- Added option to ignore the automatic probing.
- Skipped probing of routed NetVSC devices.
- Adjusted documentation and semantics.
- Replaced maintainer.


Adrien Mazarguil (2):
net/failsafe: fix invalid free
net/failsafe: add "fd" parameter

Matan Azrad (6):
net/failsafe: support probed sub-devices getting
net/vdev_netvsc: introduce Hyper-V platform driver
net/vdev_netvsc: implement core functionality
net/vdev_netvsc: skip routed netvsc probing
net/vdev_netvsc: add "force" parameter
net/vdev_netvsc: add automatic probing

MAINTAINERS | 6 +
config/common_base | 5 +
config/common_linuxapp | 1 +
doc/guides/nics/fail_safe.rst | 14 +
doc/guides/nics/features/vdev_netvsc.ini | 12 +
doc/guides/nics/index.rst | 1 +
doc/guides/nics/vdev_netvsc.rst | 100 +++
drivers/net/Makefile | 1 +
drivers/net/failsafe/failsafe_args.c | 88 ++-
drivers/net/failsafe/failsafe_eal.c | 60 +-
drivers/net/failsafe/failsafe_private.h | 3 +
drivers/net/vdev_netvsc/Makefile | 31 +
.../vdev_netvsc/rte_pmd_vdev_netvsc_version.map | 4 +
drivers/net/vdev_netvsc/vdev_netvsc.c | 746 +++++++++++++++++++++
mk/rte.app.mk | 1 +
15 files changed, 1051 insertions(+), 22 deletions(-)
create mode 100644 doc/guides/nics/features/vdev_netvsc.ini
create mode 100644 doc/guides/nics/vdev_netvsc.rst
create mode 100644 drivers/net/vdev_netvsc/Makefile
create mode 100644 drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map
create mode 100644 drivers/net/vdev_netvsc/vdev_netvsc.c
--
1.8.3.1
Matan Azrad
2018-01-09 14:47:26 UTC
From: Adrien Mazarguil <***@6wind.com>

rte_free() is not supposed to work with pointers returned by calloc().

Fixes: a0194d828100 ("net/failsafe: add flexible device definition")
Cc: ***@dpdk.org
Cc: Gaetan Rivet <***@6wind.com>

Signed-off-by: Adrien Mazarguil <***@6wind.com>
---
drivers/net/failsafe/failsafe_args.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/failsafe/failsafe_args.c b/drivers/net/failsafe/failsafe_args.c
index cfc83e3..ec63ac9 100644
--- a/drivers/net/failsafe/failsafe_args.c
+++ b/drivers/net/failsafe/failsafe_args.c
@@ -407,7 +407,7 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
uint8_t i;

FOREACH_SUBDEV(sdev, i, dev) {
- rte_free(sdev->cmdline);
+ free(sdev->cmdline);
sdev->cmdline = NULL;
free(sdev->devargs.args);
sdev->devargs.args = NULL;
--
1.8.3.1
Gaëtan Rivet
2018-01-16 10:24:17 UTC
Hi Matan,
Post by Adrien Mazarguil
rte_free() is not supposed to work with pointers returned by calloc().
Fixes: a0194d828100 ("net/failsafe: add flexible device definition")
---
drivers/net/failsafe/failsafe_args.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/failsafe/failsafe_args.c b/drivers/net/failsafe/failsafe_args.c
index cfc83e3..ec63ac9 100644
--- a/drivers/net/failsafe/failsafe_args.c
+++ b/drivers/net/failsafe/failsafe_args.c
@@ -407,7 +407,7 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
uint8_t i;
FOREACH_SUBDEV(sdev, i, dev) {
- rte_free(sdev->cmdline);
+ free(sdev->cmdline);
sdev->cmdline = NULL;
free(sdev->devargs.args);
sdev->devargs.args = NULL;
--
1.8.3.1
--
Gaëtan Rivet
6WIND
Matan Azrad
2018-01-09 14:47:27 UTC
From: Adrien Mazarguil <***@6wind.com>

This parameter enables applications to provide device definitions through
an arbitrary file descriptor number.

Signed-off-by: Adrien Mazarguil <***@6wind.com>
Cc: Gaetan Rivet <***@6wind.com>
---
doc/guides/nics/fail_safe.rst | 9 ++++
drivers/net/failsafe/failsafe_args.c | 86 ++++++++++++++++++++++++++++++++-
drivers/net/failsafe/failsafe_private.h | 3 ++
3 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/doc/guides/nics/fail_safe.rst b/doc/guides/nics/fail_safe.rst
index c4e3d2e..5b1b47e 100644
--- a/doc/guides/nics/fail_safe.rst
+++ b/doc/guides/nics/fail_safe.rst
@@ -106,6 +106,15 @@ Fail-safe command line parameters
All commas within the ``shell command`` are replaced by spaces before
executing the command. This helps using scripts to specify devices.

+- **fd(<file descriptor number>)** parameter
+
+ This parameter reads a device definition from an arbitrary file descriptor
+ number in ``<iface>`` format as described above.
+
+ The file descriptor is read in non-blocking mode and is never closed in
+ order to take only the last line into account (unlike ``exec()``) at every
+ probe attempt.
+
- **mac** parameter [MAC address]

This parameter allows the user to set a default MAC address to the fail-safe
diff --git a/drivers/net/failsafe/failsafe_args.c b/drivers/net/failsafe/failsafe_args.c
index ec63ac9..7a86051 100644
--- a/drivers/net/failsafe/failsafe_args.c
+++ b/drivers/net/failsafe/failsafe_args.c
@@ -31,7 +31,11 @@
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
#include <string.h>
+#include <unistd.h>
#include <errno.h>

#include <rte_debug.h>
@@ -161,6 +165,73 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
}

static int
+fs_read_fd(struct sub_device *sdev, char *fd_str)
+{
+ FILE *fp = NULL;
+ int fd = -1;
+ /* store possible newline as well */
+ char output[DEVARGS_MAXLEN + 1];
+ int err = -ENODEV;
+ int ret;
+
+ RTE_ASSERT(fd_str != NULL || sdev->fd_str != NULL);
+ if (sdev->fd_str == NULL) {
+ sdev->fd_str = strdup(fd_str);
+ if (sdev->fd_str == NULL) {
+ ERROR("Command line allocation failed");
+ return -ENOMEM;
+ }
+ }
+ errno = 0;
+ fd = strtol(fd_str, &fd_str, 0);
+ if (errno || *fd_str || fd < 0) {
+ ERROR("Parsing FD number failed");
+ goto error;
+ }
+ /* Fiddle with copy of file descriptor */
+ fd = dup(fd);
+ if (fd == -1)
+ goto error;
+ ret = fcntl(fd, F_GETFL);
+ if (ret == -1)
+ goto error;
+ ret = fcntl(fd, F_SETFL, fd | O_NONBLOCK);
+ if (ret == -1)
+ goto error;
+ fp = fdopen(fd, "r");
+ if (!fp)
+ goto error;
+ fd = -1;
+ /* Only take the last line into account */
+ ret = 0;
+ while (fgets(output, sizeof(output), fp))
+ ++ret;
+ if (feof(fp)) {
+ if (!ret)
+ goto error;
+ } else if (ferror(fp)) {
+ if (errno != EAGAIN || !ret)
+ goto error;
+ } else if (!ret) {
+ goto error;
+ }
+ /* Line must end with a newline character */
+ fs_sanitize_cmdline(output);
+ if (output[0] == '\0')
+ goto error;
+ ret = fs_parse_device(sdev, output);
+ if (ret)
+ ERROR("Parsing device '%s' failed", output);
+ err = ret;
+error:
+ if (fp)
+ fclose(fp);
+ if (fd != -1)
+ close(fd);
+ return err;
+}
+
+static int
fs_parse_device_param(struct rte_eth_dev *dev, const char *param,
uint8_t head)
{
@@ -202,6 +273,14 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
}
if (ret)
goto free_args;
+ } else if (strncmp(param, "fd", 2) == 0) {
+ ret = fs_read_fd(sdev, args);
+ if (ret == -ENODEV) {
+ DEBUG("Reading device info from FD failed");
+ ret = 0;
+ }
+ if (ret)
+ goto free_args;
} else {
ERROR("Unrecognized device type: %.*s", (int)b, param);
return -EINVAL;
@@ -409,6 +488,8 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
FOREACH_SUBDEV(sdev, i, dev) {
free(sdev->cmdline);
sdev->cmdline = NULL;
+ free(sdev->fd_str);
+ sdev->fd_str = NULL;
free(sdev->devargs.args);
sdev->devargs.args = NULL;
}
@@ -424,7 +505,8 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
param[b] != '\0')
b++;
if (strncmp(param, "dev", b) != 0 &&
- strncmp(param, "exec", b) != 0) {
+ strncmp(param, "exec", b) != 0 &&
+ strncmp(param, "fd", b) != 0) {
ERROR("Unrecognized device type: %.*s", (int)b, param);
return -EINVAL;
}
@@ -463,6 +545,8 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
continue;
if (sdev->cmdline)
ret = fs_execute_cmd(sdev, sdev->cmdline);
+ else if (sdev->fd_str)
+ ret = fs_read_fd(sdev, sdev->fd_str);
else
ret = fs_parse_sub_device(sdev);
if (ret == 0)
diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
index d81cc3c..a0d3675 100644
--- a/drivers/net/failsafe/failsafe_private.h
+++ b/drivers/net/failsafe/failsafe_private.h
@@ -48,6 +48,7 @@
#define PMD_FAILSAFE_PARAM_STRING \
"dev(<ifc>)," \
"exec(<shell command>)," \
+ "fd(<fd number>)," \
"mac=mac_addr," \
"hotplug_poll=u64" \
""
@@ -111,6 +112,8 @@ struct sub_device {
struct fs_stats stats_snapshot;
/* Some device are defined as a command line */
char *cmdline;
+ /* Others are retrieved through a file descriptor */
+ char *fd_str;
/* fail-safe device backreference */
struct rte_eth_dev *fs_dev;
/* flag calling for recollection */
--
1.8.3.1
Gaëtan Rivet
2018-01-16 10:54:43 UTC
Hi Matan, Adrien,
Post by Adrien Mazarguil
This parameter enables applications to provide device definitions through
an arbitrary file descriptor number.
Ok on the principle,

<snip>
Post by Adrien Mazarguil
@@ -161,6 +165,73 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
}
static int
+fs_read_fd(struct sub_device *sdev, char *fd_str)
+{
+ FILE *fp = NULL;
+ int fd = -1;
+ /* store possible newline as well */
+ char output[DEVARGS_MAXLEN + 1];
+ int err = -ENODEV;
+ int ret;
ret is used as flag holder, line counter and then for error reporting.
err should be the only variable used for reading errors from functions
and reporting them.

It would be clearer to use descriptive names, such as "oflags" and "nl"
or "lcount". I don't really care about one additional variable in this
function, for the sake of expressiveness.
Post by Adrien Mazarguil
+
+ RTE_ASSERT(fd_str != NULL || sdev->fd_str != NULL);
+ if (sdev->fd_str == NULL) {
+ sdev->fd_str = strdup(fd_str);
+ if (sdev->fd_str == NULL) {
+ ERROR("Command line allocation failed");
+ return -ENOMEM;
+ }
+ }
+ errno = 0;
+ fd = strtol(fd_str, &fd_str, 0);
+ if (errno || *fd_str || fd < 0) {
+ ERROR("Parsing FD number failed");
+ goto error;
+ }
+ /* Fiddle with copy of file descriptor */
+ fd = dup(fd);
+ if (fd == -1)
+ goto error;
+ ret = fcntl(fd, F_GETFL);
oflags = fcntl(...);
Post by Adrien Mazarguil
+ if (ret == -1)
+ goto error;
+ ret = fcntl(fd, F_SETFL, fd | O_NONBLOCK);
err = fcntl(fd, F_SETFL, oflags | O_NONBLOCK);
Using (fd | O_NONBLOCK) is probably a mistake.
Post by Adrien Mazarguil
+ if (ret == -1)
+ goto error;
+ fp = fdopen(fd, "r");
+ if (!fp)
+ goto error;
+ fd = -1;
+ /* Only take the last line into account */
+ ret = 0;
+ while (fgets(output, sizeof(output), fp))
+ ++ret;
lcount = 0;
while (fgets(output, sizeof(output), fp))
++lcount;
Post by Adrien Mazarguil
+ if (feof(fp)) {
+ if (!ret)
+ goto error;
+ } else if (ferror(fp)) {
+ if (errno != EAGAIN || !ret)
+ goto error;
+ } else if (!ret) {
+ goto error;
+ }
These branches seem needlessly complicated:

if (lcount == 0)
goto error;
else if (ferror(fp) && errno != EAGAIN)
goto error;
Post by Adrien Mazarguil
+ /* Line must end with a newline character */
+ fs_sanitize_cmdline(output);
+ if (output[0] == '\0')
+ goto error;
+ ret = fs_parse_device(sdev, output);
+ if (ret)
+ ERROR("Parsing device '%s' failed", output);
+ err = ret;
There is no need to use ret instead of err here:

err = fs_parse_device(sdev, output);
if (err)
ERROR("Parsing device '%s' failed", output);

This allows removing the "ret" variable completely.
Post by Adrien Mazarguil
+ if (fp)
+ fclose(fp);
+ if (fd != -1)
+ close(fd);
+ return err;
+}
+
+static int
fs_parse_device_param(struct rte_eth_dev *dev, const char *param,
uint8_t head)
{
@@ -202,6 +273,14 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
}
if (ret)
goto free_args;
+ } else if (strncmp(param, "fd", 2) == 0) {
How about strncmp(param, "fd(", 3) == 0 here?
I think I made a mistake for dev and exec device types, no reason at
this point to reiterate for fd as well.
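To make the difference concrete: comparing only the first two characters also accepts parameters such as "fdx(3)", while including the opening parenthesis does not. A small sketch, assuming a hypothetical helper param_is_type() that is not part of the fail-safe code:

```c
#include <string.h>

/*
 * Hypothetical helper: return nonzero if param starts with the device
 * type name immediately followed by an opening parenthesis.
 */
static int
param_is_type(const char *param, const char *type)
{
	size_t len = strlen(type);

	/*
	 * strncmp() stops at the first NUL, so param[len] is only read
	 * once the first len characters are known to match.
	 */
	return strncmp(param, type, len) == 0 && param[len] == '(';
}
```

For example, param_is_type("fd(3)", "fd") is true and param_is_type("fdx(3)", "fd") is false, whereas strncmp("fdx(3)", "fd", 2) reports a match.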
Post by Adrien Mazarguil
+ ret = fs_read_fd(sdev, args);
+ if (ret == -ENODEV) {
+ DEBUG("Reading device info from FD failed");
+ ret = 0;
+ }
+ if (ret)
+ goto free_args;
} else {
ERROR("Unrecognized device type: %.*s", (int)b, param);
return -EINVAL;
@@ -409,6 +488,8 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
FOREACH_SUBDEV(sdev, i, dev) {
free(sdev->cmdline);
sdev->cmdline = NULL;
+ free(sdev->fd_str);
+ sdev->fd_str = NULL;
free(sdev->devargs.args);
sdev->devargs.args = NULL;
}
@@ -424,7 +505,8 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
param[b] != '\0')
b++;
if (strncmp(param, "dev", b) != 0 &&
- strncmp(param, "exec", b) != 0) {
+ strncmp(param, "exec", b) != 0 &&
+ strncmp(param, "fd", b) != 0) {
If the strncmp above is modified, this one should be as well for
consistency.
--
Gaëtan Rivet
6WIND
Gaëtan Rivet
2018-01-16 11:19:08 UTC
Hi again,

I made a mistake in my review, see below.
Post by Gaëtan Rivet
Hi Matan, Adrien,
Post by Adrien Mazarguil
This parameter enables applications to provide device definitions through
an arbitrary file descriptor number.
Ok on the principle,
<snip>
Post by Adrien Mazarguil
@@ -161,6 +165,73 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
}
static int
+fs_read_fd(struct sub_device *sdev, char *fd_str)
+{
+ FILE *fp = NULL;
+ int fd = -1;
+ /* store possible newline as well */
+ char output[DEVARGS_MAXLEN + 1];
+ int err = -ENODEV;
+ int ret;
ret is used as flag holder, line counter and then for error reporting.
err should be the only variable used for reading errors from functions
and reporting them.
It would be clearer to use descriptive names, such as "oflags" and "nl"
or "lcount". I don't really care about one additional variable in this
function, for the sake of expressiveness.
Post by Adrien Mazarguil
+
+ RTE_ASSERT(fd_str != NULL || sdev->fd_str != NULL);
+ if (sdev->fd_str == NULL) {
+ sdev->fd_str = strdup(fd_str);
+ if (sdev->fd_str == NULL) {
+ ERROR("Command line allocation failed");
+ return -ENOMEM;
+ }
+ }
+ errno = 0;
+ fd = strtol(fd_str, &fd_str, 0);
+ if (errno || *fd_str || fd < 0) {
+ ERROR("Parsing FD number failed");
+ goto error;
+ }
+ /* Fiddle with copy of file descriptor */
+ fd = dup(fd);
+ if (fd == -1)
+ goto error;
+ ret = fcntl(fd, F_GETFL);
oflags = fcntl(...);
Post by Adrien Mazarguil
+ if (ret == -1)
+ goto error;
+ ret = fcntl(fd, F_SETFL, fd | O_NONBLOCK);
err = fcntl(fd, F_SETFL, oflags | O_NONBLOCK);
Using (fd | O_NONBLOCK) is probably a mistake.
This is sneaky. err is -ENODEV and would change to -1 on error, losing
some meaning.
Post by Gaëtan Rivet
Post by Adrien Mazarguil
+ if (ret == -1)
+ goto error;
+ fp = fdopen(fd, "r");
+ if (!fp)
+ goto error;
+ fd = -1;
+ /* Only take the last line into account */
+ ret = 0;
+ while (fgets(output, sizeof(output), fp))
+ ++ret;
lcount = 0;
while (fgets(output, sizeof(output), fp))
++lcount;
Post by Adrien Mazarguil
+ if (feof(fp)) {
+ if (!ret)
+ goto error;
+ } else if (ferror(fp)) {
+ if (errno != EAGAIN || !ret)
+ goto error;
+ } else if (!ret) {
+ goto error;
+ }
if (lcount == 0)
goto error;
else if (ferror(fp) && errno != EAGAIN)
goto error;
Here err would have been set to 0 previously with the fcntl call,
meaning that jumping to error would return 0 as well.

I know Adrien wanted to avoid the usual ugly

if (error) {
err = -ENODEV;
goto error;
}

But this kind of sneakiness is not easy to parse and maintain. If
someone adds a new path of error later, this kind of subtlety *will* be
lost.

So between ugliness and maintainability, I choose maintainability (being
the maintainer, of course).
--
Gaëtan Rivet
6WIND
Matan Azrad
2018-01-16 16:17:45 UTC
Hi Gaetan

OK for all, will change it.

From: Gaëtan Rivet, Tuesday, January 16, 2018 1:19 PM
Post by Gaëtan Rivet
Hi again,
made a mistake in reviewing, see below.
Post by Gaëtan Rivet
Hi Matan, Adrien,
Post by Adrien Mazarguil
This parameter enables applications to provide device definitions
through an arbitrary file descriptor number.
Ok on the principle,
<snip>
Post by Adrien Mazarguil
@@ -161,6 +165,73 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
}
static int
+fs_read_fd(struct sub_device *sdev, char *fd_str)
+{
+ FILE *fp = NULL;
+ int fd = -1;
+ /* store possible newline as well */
+ char output[DEVARGS_MAXLEN + 1];
+ int err = -ENODEV;
+ int ret;
ret is used as flag holder, line counter and then for error reporting.
err should be the only variable used for reading errors from functions
and reporting them.
It would be clearer to use descriptive names, such as "oflags" and "nl"
or "lcount". I don't really care about one additional variable in this
function, for the sake of expressiveness.
Post by Adrien Mazarguil
+
+ RTE_ASSERT(fd_str != NULL || sdev->fd_str != NULL);
+ if (sdev->fd_str == NULL) {
+ sdev->fd_str = strdup(fd_str);
+ if (sdev->fd_str == NULL) {
+ ERROR("Command line allocation failed");
+ return -ENOMEM;
+ }
+ }
+ errno = 0;
+ fd = strtol(fd_str, &fd_str, 0);
+ if (errno || *fd_str || fd < 0) {
+ ERROR("Parsing FD number failed");
+ goto error;
+ }
+ /* Fiddle with copy of file descriptor */
+ fd = dup(fd);
+ if (fd == -1)
+ goto error;
+ ret = fcntl(fd, F_GETFL);
oflags = fcntl(...);
Post by Adrien Mazarguil
+ if (ret == -1)
+ goto error;
+ ret = fcntl(fd, F_SETFL, fd | O_NONBLOCK);
err = fcntl(fd, F_SETFL, oflags | O_NONBLOCK);
Using (fd | O_NONBLOCK) is probably a mistake.
This is sneaky. err is -ENODEV and would change to -1 on error, losing some
meaning.
Post by Gaëtan Rivet
Post by Adrien Mazarguil
+ if (ret == -1)
+ goto error;
+ fp = fdopen(fd, "r");
+ if (!fp)
+ goto error;
+ fd = -1;
+ /* Only take the last line into account */
+ ret = 0;
+ while (fgets(output, sizeof(output), fp))
+ ++ret;
lcount = 0;
while (fgets(output, sizeof(output), fp))
++lcount;
Post by Adrien Mazarguil
+ if (feof(fp)) {
+ if (!ret)
+ goto error;
+ } else if (ferror(fp)) {
+ if (errno != EAGAIN || !ret)
+ goto error;
+ } else if (!ret) {
+ goto error;
+ }
if (lcount == 0)
goto error;
else if (ferror(fp) && errno != EAGAIN)
goto error;
Here err would have been set to 0 previously with the fcntl call, meaning that
jumping to error would return 0 as well.
I know Adrien wanted to avoid the usual ugly
if (error) {
err = -ENODEV;
goto error;
}
But this kind of sneakiness is not easy to parse and maintain. If someone
adds a new path of error later, this kind of subtlety *will* be lost.
So between ugliness and maintainability, I choose maintainability (being the
maintainer, of course).
--
Gaëtan Rivet
6WIND
Matan Azrad
2018-01-09 14:47:28 UTC
Previous fail-safe code did not support sub-devices that had already been
probed and failed when it tried to probe them again.

Skip fail-safe sub-device probing when the device has already been probed.

Signed-off-by: Matan Azrad <***@mellanox.com>
Cc: Gaetan Rivet <***@6wind.com>
---
doc/guides/nics/fail_safe.rst | 5 ++++
drivers/net/failsafe/failsafe_eal.c | 60 ++++++++++++++++++++++++-------------
2 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/doc/guides/nics/fail_safe.rst b/doc/guides/nics/fail_safe.rst
index 5b1b47e..b89e53b 100644
--- a/doc/guides/nics/fail_safe.rst
+++ b/doc/guides/nics/fail_safe.rst
@@ -115,6 +115,11 @@ Fail-safe command line parameters
order to take only the last line into account (unlike ``exec()``) at every
probe attempt.

+.. note::
+
+ In case of whitelist sub-device probed by EAL, fail-safe PMD will take the device
+ as is, which means that EAL device options are taken in this case.
+
- **mac** parameter [MAC address]

This parameter allows the user to set a default MAC address to the fail-safe
diff --git a/drivers/net/failsafe/failsafe_eal.c b/drivers/net/failsafe/failsafe_eal.c
index 19d26f5..7bc7453 100644
--- a/drivers/net/failsafe/failsafe_eal.c
+++ b/drivers/net/failsafe/failsafe_eal.c
@@ -36,39 +36,59 @@
#include "failsafe_private.h"

static int
+fs_get_port_by_device_name(const char *name, uint16_t *port_id)
+{
+ uint16_t pid;
+ size_t len;
+
+ if (name == NULL) {
+ DEBUG("Null pointer is specified\n");
+ return -EINVAL;
+ }
+ len = strlen(name);
+ RTE_ETH_FOREACH_DEV(pid) {
+ if (!strncmp(name, rte_eth_devices[pid].device->name, len)) {
+ *port_id = pid;
+ return 0;
+ }
+ }
+ return -ENODEV;
+}
+
+static int
fs_bus_init(struct rte_eth_dev *dev)
{
struct sub_device *sdev;
struct rte_devargs *da;
uint8_t i;
- uint16_t j;
+ uint16_t pid;
int ret;

FOREACH_SUBDEV(sdev, i, dev) {
if (sdev->state != DEV_PARSED)
continue;
da = &sdev->devargs;
- ret = rte_eal_hotplug_add(da->bus->name,
- da->name,
- da->args);
- if (ret) {
- ERROR("sub_device %d probe failed %s%s%s", i,
- rte_errno ? "(" : "",
- rte_errno ? strerror(rte_errno) : "",
- rte_errno ? ")" : "");
- continue;
- }
- RTE_ETH_FOREACH_DEV(j) {
- if (strcmp(rte_eth_devices[j].device->name,
- da->name) == 0) {
- ETH(sdev) = &rte_eth_devices[j];
- break;
+ if (fs_get_port_by_device_name(da->name, &pid) != 0) {
+ ret = rte_eal_hotplug_add(da->bus->name,
+ da->name,
+ da->args);
+ if (ret) {
+ ERROR("sub_device %d probe failed %s%s%s", i,
+ rte_errno ? "(" : "",
+ rte_errno ? strerror(rte_errno) : "",
+ rte_errno ? ")" : "");
+ continue;
}
+ if (fs_get_port_by_device_name(da->name, &pid) != 0) {
+ ERROR("sub_device %d init went wrong", i);
+ return -ENODEV;
+ }
+ } else {
+ /* Take control of device probed by EAL options. */
+ DEBUG("Taking control of a probed sub device"
+ " %d named %s", i, da->name);
}
- if (ETH(sdev) == NULL) {
- ERROR("sub_device %d init went wrong", i);
- return -ENODEV;
- }
+ ETH(sdev) = &rte_eth_devices[pid];
SUB_ID(sdev) = i;
sdev->fs_dev = dev;
sdev->dev = ETH(sdev)->device;
--
1.8.3.1
Gaëtan Rivet
2018-01-16 11:09:20 UTC
Hi Matan,

I'm not fond of the commit title, how about:

[PATCH v3 3/8] net/failsafe: add probed etherdev capture

?
Post by Matan Azrad
Previous fail-safe code didn't support getting probed sub-devices and
failed when it tried to probe them.
Skip fail-safe sub-device probing when it already was probed.
---
doc/guides/nics/fail_safe.rst | 5 ++++
drivers/net/failsafe/failsafe_eal.c | 60 ++++++++++++++++++++++++-------------
2 files changed, 45 insertions(+), 20 deletions(-)
diff --git a/doc/guides/nics/fail_safe.rst b/doc/guides/nics/fail_safe.rst
index 5b1b47e..b89e53b 100644
--- a/doc/guides/nics/fail_safe.rst
+++ b/doc/guides/nics/fail_safe.rst
@@ -115,6 +115,11 @@ Fail-safe command line parameters
order to take only the last line into account (unlike ``exec()``) at every
probe attempt.
+
+ In case of whitelist sub-device probed by EAL, fail-safe PMD will take the device
+ as is, which means that EAL device options are taken in this case.
+
- **mac** parameter [MAC address]
This parameter allows the user to set a default MAC address to the fail-safe
diff --git a/drivers/net/failsafe/failsafe_eal.c b/drivers/net/failsafe/failsafe_eal.c
index 19d26f5..7bc7453 100644
--- a/drivers/net/failsafe/failsafe_eal.c
+++ b/drivers/net/failsafe/failsafe_eal.c
@@ -36,39 +36,59 @@
#include "failsafe_private.h"
static int
+fs_get_port_by_device_name(const char *name, uint16_t *port_id)
The naming convention for the failsafe driver is

namespace_object_sub-object_action()

With an ordering of objects by their scope (std, rte, failsafe, file).
Also, "get" as an action is not descriptive enough.

static int
fs_ethdev_capture(const char *name, uint16_t *port_id);
Post by Matan Azrad
+{
+ uint16_t pid;
+ size_t len;
+
+ if (name == NULL) {
+ DEBUG("Null pointer is specified\n");
+ return -EINVAL;
+ }
+ len = strlen(name);
+ RTE_ETH_FOREACH_DEV(pid) {
+ if (!strncmp(name, rte_eth_devices[pid].device->name, len)) {
+ *port_id = pid;
+ return 0;
+ }
+ }
+ return -ENODEV;
+}
+
+static int
fs_bus_init(struct rte_eth_dev *dev)
{
struct sub_device *sdev;
struct rte_devargs *da;
uint8_t i;
- uint16_t j;
+ uint16_t pid;
int ret;
FOREACH_SUBDEV(sdev, i, dev) {
if (sdev->state != DEV_PARSED)
continue;
da = &sdev->devargs;
- ret = rte_eal_hotplug_add(da->bus->name,
- da->name,
- da->args);
- if (ret) {
- ERROR("sub_device %d probe failed %s%s%s", i,
- rte_errno ? "(" : "",
- rte_errno ? strerror(rte_errno) : "",
- rte_errno ? ")" : "");
- continue;
- }
- RTE_ETH_FOREACH_DEV(j) {
- if (strcmp(rte_eth_devices[j].device->name,
- da->name) == 0) {
- ETH(sdev) = &rte_eth_devices[j];
- break;
+ if (fs_get_port_by_device_name(da->name, &pid) != 0) {
+ ret = rte_eal_hotplug_add(da->bus->name,
+ da->name,
+ da->args);
+ if (ret) {
+ ERROR("sub_device %d probe failed %s%s%s", i,
+ rte_errno ? "(" : "",
+ rte_errno ? strerror(rte_errno) : "",
+ rte_errno ? ")" : "");
+ continue;
}
+ if (fs_get_port_by_device_name(da->name, &pid) != 0) {
+ ERROR("sub_device %d init went wrong", i);
+ return -ENODEV;
+ }
+ } else {
+ /* Take control of device probed by EAL options. */
+ DEBUG("Taking control of a probed sub device"
+ " %d named %s", i, da->name);
In this case, the devargs of the probed device must be copied within the
sub-device definition and removed from the EAL using the proper
rte_devargs API.

Note that there is no rte_devargs copy function. You can use
rte_eal_devargs_parse instead, "parsing" the original devargs again into
the sub-device one. This is necessary to comply with internal rte_devargs
requirements (da->args being malloc-ed at the moment, but this may evolve).

The rte_eal_devargs_parse function is not easy to use right now:
you will have to build a devargs string (using snprintf) and submit it.
I proposed a change for it this release that would have simplified your
implementation, but it will not make it into 18.02.
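The string-building step could look roughly like the sketch below. It is only an illustration: fs_devargs_to_str() is a hypothetical helper, and the "name[,args]" layout is an assumption about what the devargs parser accepts. The actual call to rte_eal_devargs_parse() is left out so that the snippet stands alone.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: rebuild a devargs string from an already-parsed
 * device name and argument list, so it can be handed to the devargs
 * parser again. The "name[,args]" layout is an assumption.
 * Returns 0 on success, -1 if the result does not fit in buf.
 */
static int
fs_devargs_to_str(const char *name, const char *args, char *buf, size_t len)
{
	int ret;

	/* Programming errors, not runtime conditions. */
	assert(name != NULL && buf != NULL && len > 0);
	if (args != NULL && args[0] != '\0')
		ret = snprintf(buf, len, "%s,%s", name, args);
	else
		ret = snprintf(buf, len, "%s", name);
	/* snprintf() returns the length it wanted to write. */
	if (ret < 0 || (size_t)ret >= len)
		return -1;
	return 0;
}
```

The resulting buffer would then be passed to the parser with the sub-device devargs as destination, before the EAL copy is removed.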
--
Gaëtan Rivet
6WIND
Matan Azrad
2018-01-16 12:27:57 UTC
Hi Gaetan

From: Gaëtan Rivet, Tuesday, January 16, 2018 1:09 PM
Post by Gaëtan Rivet
Hi Matan,
[PATCH v3 3/8] net/failsafe: add probed etherdev capture
?
OK, no problem.
Post by Gaëtan Rivet
Post by Matan Azrad
Previous fail-safe code didn't support getting probed sub-devices and
failed when it tried to probe them.
Skip fail-safe sub-device probing when it already was probed.
---
doc/guides/nics/fail_safe.rst | 5 ++++
drivers/net/failsafe/failsafe_eal.c | 60
++++++++++++++++++++++++-------------
2 files changed, 45 insertions(+), 20 deletions(-)
diff --git a/doc/guides/nics/fail_safe.rst
b/doc/guides/nics/fail_safe.rst index 5b1b47e..b89e53b 100644
--- a/doc/guides/nics/fail_safe.rst
+++ b/doc/guides/nics/fail_safe.rst
@@ -115,6 +115,11 @@ Fail-safe command line parameters
order to take only the last line into account (unlike ``exec()``) at every
probe attempt.
+
+ In case of whitelist sub-device probed by EAL, fail-safe PMD will take the
device
Post by Matan Azrad
+ as is, which means that EAL device options are taken in this case.
+
- **mac** parameter [MAC address]
This parameter allows the user to set a default MAC address to the
fail-safe diff --git a/drivers/net/failsafe/failsafe_eal.c
b/drivers/net/failsafe/failsafe_eal.c
index 19d26f5..7bc7453 100644
--- a/drivers/net/failsafe/failsafe_eal.c
+++ b/drivers/net/failsafe/failsafe_eal.c
@@ -36,39 +36,59 @@
#include "failsafe_private.h"
static int
+fs_get_port_by_device_name(const char *name, uint16_t *port_id)
The naming convention for the failsafe driver is
namespace_object_sub-object_action()
OK.
Post by Gaëtan Rivet
With an ordering of objects by their scope (std, rte, failsafe, file).
Also, "get" as an action is not descriptive enough.
Isn't "get by device name" descriptive?
Post by Gaëtan Rivet
static int
fs_ethdev_capture(const char *name, uint16_t *port_id);
You are missing the main reason why we need this function instead of using rte_eth_dev_get_port_by_name.
We need this function because we want to find the device by its device name and not by its ethdev name.
What about fs_port_capture_by_device_name?

Maybe comparing it to device->devargs->name is better, what do you think?
Post by Gaëtan Rivet
Post by Matan Azrad
+{
+ uint16_t pid;
+ size_t len;
+
+ if (name == NULL) {
+ DEBUG("Null pointer is specified\n");
+ return -EINVAL;
+ }
+ len = strlen(name);
+ RTE_ETH_FOREACH_DEV(pid) {
+ if (!strncmp(name, rte_eth_devices[pid].device->name,
len)) {
Post by Matan Azrad
+ *port_id = pid;
+ return 0;
+ }
+ }
+ return -ENODEV;
+}
+
+static int
fs_bus_init(struct rte_eth_dev *dev)
{
struct sub_device *sdev;
struct rte_devargs *da;
uint8_t i;
- uint16_t j;
+ uint16_t pid;
int ret;
FOREACH_SUBDEV(sdev, i, dev) {
if (sdev->state != DEV_PARSED)
continue;
da = &sdev->devargs;
- ret = rte_eal_hotplug_add(da->bus->name,
- da->name,
- da->args);
- if (ret) {
- ERROR("sub_device %d probe failed %s%s%s", i,
- rte_errno ? "(" : "",
- rte_errno ? strerror(rte_errno) : "",
- rte_errno ? ")" : "");
- continue;
- }
- RTE_ETH_FOREACH_DEV(j) {
- if (strcmp(rte_eth_devices[j].device->name,
- da->name) == 0) {
- ETH(sdev) = &rte_eth_devices[j];
- break;
+ if (fs_get_port_by_device_name(da->name, &pid) != 0) {
+ ret = rte_eal_hotplug_add(da->bus->name,
+ da->name,
+ da->args);
+ if (ret) {
+ ERROR("sub_device %d probe failed
%s%s%s", i,
Post by Matan Azrad
+ rte_errno ? "(" : "",
+ rte_errno ? strerror(rte_errno) : "",
+ rte_errno ? ")" : "");
+ continue;
}
+ if (fs_get_port_by_device_name(da->name, &pid)
!= 0) {
Post by Matan Azrad
+ ERROR("sub_device %d init went wrong", i);
+ return -ENODEV;
+ }
+ } else {
+ /* Take control of device probed by EAL options. */
+ DEBUG("Taking control of a probed sub device"
+ " %d named %s", i, da->name);
In this case, the devargs of the probed device must be copied within the sub-
device definition and removed from the EAL using the proper rte_devargs
API.
Note that there is no rte_devargs copy function. You can use
rte_devargs_parse instead, "parsing" again the original devargs into the sub-
device one. It is necessary for complying with internal rte_devargs
requirements (da->args being malloc-ed, at the moment, but may evolve).
The rte_eal_devargs_parse function is not easy enough to use right now,
you will have to build a devargs string (using snprintf) and submit it.
I proposed a change this release for it but it will not make it for 18.02, that
would have simplified your implementation.
Got you. You are right, we need to remove the created devargs at the fail-safe parse level.
What do you think about checking for it at the parse level and avoiding the creation of the new devargs?
And also doing the copy at the parse level (the same method we use at the probe level)?
Post by Gaëtan Rivet
--
Gaëtan Rivet
6WIND
Gaëtan Rivet
2018-01-16 14:40:50 UTC
Post by Matan Azrad
Hi Gaetan
From: Gaëtan Rivet, Tuesday, January 16, 2018 1:09 PM
Post by Gaëtan Rivet
Hi Matan,
[PATCH v3 3/8] net/failsafe: add probed etherdev capture
?
OK, no problem.
Post by Gaëtan Rivet
Post by Matan Azrad
Previous fail-safe code didn't support getting probed sub-devices and
failed when it tried to probe them.
Skip fail-safe sub-device probing when it already was probed.
---
doc/guides/nics/fail_safe.rst | 5 ++++
drivers/net/failsafe/failsafe_eal.c | 60
++++++++++++++++++++++++-------------
2 files changed, 45 insertions(+), 20 deletions(-)
diff --git a/doc/guides/nics/fail_safe.rst
b/doc/guides/nics/fail_safe.rst index 5b1b47e..b89e53b 100644
--- a/doc/guides/nics/fail_safe.rst
+++ b/doc/guides/nics/fail_safe.rst
@@ -115,6 +115,11 @@ Fail-safe command line parameters
order to take only the last line into account (unlike ``exec()``) at every
probe attempt.
+
+ In case of whitelist sub-device probed by EAL, fail-safe PMD will take the
device
Post by Matan Azrad
+ as is, which means that EAL device options are taken in this case.
+
- **mac** parameter [MAC address]
This parameter allows the user to set a default MAC address to the
fail-safe diff --git a/drivers/net/failsafe/failsafe_eal.c
b/drivers/net/failsafe/failsafe_eal.c
index 19d26f5..7bc7453 100644
--- a/drivers/net/failsafe/failsafe_eal.c
+++ b/drivers/net/failsafe/failsafe_eal.c
@@ -36,39 +36,59 @@
#include "failsafe_private.h"
static int
+fs_get_port_by_device_name(const char *name, uint16_t *port_id)
The naming convention for the failsafe driver is
namespace_object_sub-object_action()
OK.
Post by Gaëtan Rivet
With an ordering of objects by their scope (std, rte, failsafe, file).
Also, "get" as an action is not descriptive enough.
Isn't "get by device name" descriptive?
The endgame is capturing a device that we know we are interested in.
The device name being used for matching is an implementation detail,
which should be abstracted by using a sub-function.

Putting this in the name defeats the purpose of using another function.
Post by Matan Azrad
Post by Gaëtan Rivet
static int
fs_ethdev_capture(const char *name, uint16_t *port_id);
You miss here the main reason why we need this function instead of using rte_eth_dev_get_port_by_name.
The reason we need this function is because we want to find the device by the device name and not ethdev name.
What's about fs_port_capture_by_device_name?
You are getting a port_id that is only valid for the rte_eth_devices
array, by using the ethdev iterator. You are only looking for an ethdev.

So it doesn't really matter whether you are using the ethdev name or the
device name, in the end you are capturing an ethdev
--> fs_ethdev_capture seems good to me.

Now, I guess you will say that the user would need to know that they
have to provide a device name that would be written in device->name. The
issue here is that you have a leaky abstraction for your function,
forcing this kind of consideration on your function user.

So I'd go further and ask you to change the `const char *name` to a
`const struct rte_devargs *da` in the parameters.
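As a rough sketch of what such a lookup could look like, here is a version written against mock types. Everything prefixed with fs_ below (struct fs_devargs, struct fs_ethdev, the fs_ethdevs[] table) is a simplified stand-in invented for illustration, not the DPDK definitions; the real function would iterate with RTE_ETH_FOREACH_DEV over rte_eth_devices[].

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the DPDK structures; not the real definitions. */
struct fs_devargs {
	const char *name; /* device name as given by the user */
};

struct fs_ethdev {
	const char *device_name; /* plays the role of device->name */
};

/* Mock port table standing in for rte_eth_devices[]. */
static const struct fs_ethdev fs_ethdevs[] = {
	{ "net_tap10" },
	{ "net_tap1" },
	{ "0000:00:02.0" },
};

#define FS_NB_ETHDEVS (sizeof(fs_ethdevs) / sizeof(fs_ethdevs[0]))

/* Find the port whose device name is exactly the one in the devargs. */
static int
fs_ethdev_find(const struct fs_devargs *da, uint16_t *port_id)
{
	uint16_t pid;

	assert(port_id != NULL);
	if (da == NULL || da->name == NULL)
		return -EINVAL;
	for (pid = 0; pid < FS_NB_ETHDEVS; pid++) {
		/* Full comparison: "net_tap1" must not match "net_tap10". */
		if (strcmp(da->name, fs_ethdevs[pid].device_name) == 0) {
			*port_id = pid;
			return 0;
		}
	}
	return -ENODEV;
}
```

The caller only hands over its devargs and no longer needs to know which field is used for matching. The sketch also compares whole strings, since strncmp() bounded by strlen(name) would accept any device whose name merely starts with the requested one.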
Post by Matan Azrad
Maybe comparing it to device->devargs->name is better, What do you think?
You are touching on a pretty contentious subject here :) .

Identifying devices is not currently a well-defined function in DPDK.
Some devices (actually, only one model: ConnectX-3) will have several
ports using the same PCI slot. But even ignoring this glaring problem...

As it is, the device->name for PCI will match the name given as a
devargs, so functionally this should not change anything.

Furthermore, you will have devices probed without any devargs. The
fail-safe would thus be unable to capture non-blacklisted devices when
the PCI bus is in blacklist mode.

These non-blacklisted devices will actually have a full PCI name (DomBDF
format), so a simple match with the one passed in your fail-safe devargs
will fail, e.g.:

# A physical port exists at 0000:00:02.0
testpmd --vdev="net_failsafe,dev(00:02.0)" -- -i

Would fail to capture the device 0000:00:02.0, as this is the name that
the PCI bus would give to this device, in the absence of a user-given
name.
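To illustrate the mismatch, here is a small standalone sketch that compares two PCI names after parsing them instead of comparing the raw strings. The fs_pci_addr structure and both helpers are hypothetical and only mimic what a bus-level parser would do; they are not the EAL API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a PCI address; not the DPDK rte_pci_addr. */
struct fs_pci_addr {
	unsigned int domain;
	unsigned int bus;
	unsigned int devid;
	unsigned int function;
};

/* Parse "DDDD:BB:DD.F" or the short "BB:DD.F" form (domain 0 implied). */
static int
fs_pci_addr_parse(const char *s, struct fs_pci_addr *a)
{
	int n = 0;

	assert(s != NULL && a != NULL);
	memset(a, 0, sizeof(*a));
	if (sscanf(s, "%x:%x:%x.%x%n", &a->domain, &a->bus, &a->devid,
		   &a->function, &n) == 4 && s[n] == '\0')
		return 0;
	/* The first attempt may have written some fields; start over. */
	memset(a, 0, sizeof(*a));
	n = 0;
	if (sscanf(s, "%x:%x.%x%n", &a->bus, &a->devid, &a->function,
		   &n) == 3 && s[n] == '\0')
		return 0;
	return -1;
}

/* Return 1 when both strings designate the same PCI address, else 0. */
static int
fs_pci_name_match(const char *n1, const char *n2)
{
	struct fs_pci_addr a1, a2;

	if (fs_pci_addr_parse(n1, &a1) != 0 || fs_pci_addr_parse(n2, &a2) != 0)
		return 0;
	return a1.domain == a2.domain && a1.bus == a2.bus &&
	       a1.devid == a2.devid && a1.function == a2.function;
}
```

With these helpers, "00:02.0" and "0000:00:02.0" are reported as the same device, whereas strcmp() on the two strings reports a difference.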

In 18.05, or 18.08 there should be an EAL function that would be able to
identify a device given a specific ID string (very close to an
rte_devargs). Currently, this API does not exist.

You can hack your way around this for the moment, IF you really, really
want: parse your devargs, get the bus, use the bus->parse() function to
get a binary device representation, and compare byte by byte the
binary representations given by your devargs and by the device->name.

But this is a hack, and a pretty ugly one at that: you have
no way of knowing the size taken by this binary representation, so you
can restrict yourself to the vdev and PCI bus for the moment and take
the larger of an rte_vdev_driver pointer and an rte_pci_addr....

{
	/* da1 and da2 point to caller-provided struct rte_devargs. */
	union {
		struct rte_vdev_driver *drv;
		struct rte_pci_addr pci_addr;
	} bindev1, bindev2;

	memset(&bindev1, 0, sizeof(bindev1));
	memset(&bindev2, 0, sizeof(bindev2));
	rte_eal_devargs_parse(device->name, da1);
	rte_eal_devargs_parse(your_devstr, da2);
	RTE_ASSERT(da1->bus == rte_bus_find_by_name("pci") ||
		   da1->bus == rte_bus_find_by_name("vdev"));
	RTE_ASSERT(da2->bus == rte_bus_find_by_name("pci") ||
		   da2->bus == rte_bus_find_by_name("vdev"));
	/* Each name is parsed by its own bus. */
	da1->bus->parse(da1->name, &bindev1);
	da2->bus->parse(da2->name, &bindev2);
	if (memcmp(&bindev1, &bindev2, sizeof(bindev1)) == 0) {
		/* found the device */
	} else {
		/* not found */
	}
}

So, really, really ugly. Anyway.

<snip>
Post by Matan Azrad
Post by Gaëtan Rivet
Post by Matan Azrad
+ /* Take control of device probed by EAL options. */
+ DEBUG("Taking control of a probed sub device"
+ " %d named %s", i, da->name);
In this case, the devargs of the probed device must be copied within the sub-
device definition and removed from the EAL using the proper rte_devargs
API.
Note that there is no rte_devargs copy function. You can use
rte_devargs_parse instead, "parsing" again the original devargs into the sub-
device one. It is necessary for complying with internal rte_devargs
requirements (da->args being malloc-ed, at the moment, but may evolve).
The rte_eal_devargs_parse function is not easy enough to use right now,
you will have to build a devargs string (using snprintf) and submit it.
I proposed a change this release for it but it will not make it for 18.02, that
would have simplified your implementation.
Got you. You right we need to remove the created devargs in fail-safe parse level.
What do you think about checking it in the parse level and avoid the new devargs creation?
Also to do the copy in parse level(same method as we are doing in probe level)?
Not sure I follow here, but the new rte_devargs is part of the
sub-device (it is not a pointer, but allocated alongside the
sub_device).

So keep everything here, it is the right place to deal with these
things.
--
Gaëtan Rivet
6WIND
Matan Azrad
2018-01-16 16:15:36 UTC
Hi Gaetan

From: Gaëtan Rivet, Tuesday, January 16, 2018 4:41 PM
Post by Gaëtan Rivet
Post by Matan Azrad
Hi Gaetan
From: Gaëtan Rivet, Tuesday, January 16, 2018 1:09 PM
Post by Gaëtan Rivet
Hi Matan,
[PATCH v3 3/8] net/failsafe: add probed etherdev capture
?
OK, no problem.
Post by Gaëtan Rivet
Post by Matan Azrad
Previous fail-safe code didn't support getting probed sub-devices
and failed when it tried to probe them.
Skip fail-safe sub-device probing when it already was probed.
---
doc/guides/nics/fail_safe.rst | 5 ++++
drivers/net/failsafe/failsafe_eal.c | 60
++++++++++++++++++++++++-------------
2 files changed, 45 insertions(+), 20 deletions(-)
diff --git a/doc/guides/nics/fail_safe.rst
b/doc/guides/nics/fail_safe.rst index 5b1b47e..b89e53b 100644
--- a/doc/guides/nics/fail_safe.rst
+++ b/doc/guides/nics/fail_safe.rst
@@ -115,6 +115,11 @@ Fail-safe command line parameters
order to take only the last line into account (unlike ``exec()``) at every
probe attempt.
+
+ In case of whitelist sub-device probed by EAL, fail-safe PMD
+ will take the
device
Post by Matan Azrad
+ as is, which means that EAL device options are taken in this case.
+
- **mac** parameter [MAC address]
This parameter allows the user to set a default MAC address to
the fail-safe diff --git a/drivers/net/failsafe/failsafe_eal.c
b/drivers/net/failsafe/failsafe_eal.c
index 19d26f5..7bc7453 100644
--- a/drivers/net/failsafe/failsafe_eal.c
+++ b/drivers/net/failsafe/failsafe_eal.c
@@ -36,39 +36,59 @@
#include "failsafe_private.h"
static int
+fs_get_port_by_device_name(const char *name, uint16_t *port_id)
The naming convention for the failsafe driver is
namespace_object_sub-object_action()
OK.
Post by Gaëtan Rivet
With an ordering of objects by their scope (std, rte, failsafe, file).
Also, "get" as an action is not descriptive enough.
Isn't "get by device name" descriptive?
The endgame is capturing a device that we know we are interested in.
The device name being used for matching is an implementation detail, which
should be abstracted by using a sub-function.
Putting this in the name defeat the reason for using another function.
Post by Matan Azrad
Post by Gaëtan Rivet
static int
fs_ethdev_capture(const char *name, uint16_t *port_id);
You miss here the main reason why we need this function instead of using
rte_eth_dev_get_port_by_name.
Post by Matan Azrad
The reason we need this function is because we want to find the device by
the device name and not ethdev name.
Post by Matan Azrad
What's about fs_port_capture_by_device_name?
You are getting a port_id that is only valid for the rte_eth_devices array, by
using the ethdev iterator. You are only looking for an ethdev.
So it doesn't really matter whether you are using the ethdev name or the
device name, in the end you are capturing an ethdev
--> fs_ethdev_capture seems good for me.
I don't think so: this function doesn't take (capture) the device, it just gets its ethdev port id using the device name.
The function which actually captures the device is fs_bus_init.
So maybe even the "capture" name looks problematic here.
The main idea of this function is just to get the port_id.
Post by Gaëtan Rivet
Now, I guess you will say that the user would need to know that they have to
provide a device name that would be written in device->name. The issue
here is that you have a leaky abstraction for your function, forcing this kind of
consideration on your function user.
So I'd go further and will ask you to change the `const char *name` to a `const
rte_devargs *da` in the parameters.
Post by Matan Azrad
Maybe comparing it to device->devargs->name is better, What do you
think?
You are touching at a pretty contentious subject here :) .
Identifying devices is not currently a well-defined function in DPDK.
Some ports (actually, only one model: ConnectX-3) will have several ports
using the same PCI slot. But even ignoring this glaring problem...
As it is, the device->name for PCI will match the name given as a devargs, so
functionally this should not change anything.
Furthermore, you will have devices probed without any devargs. The fail-
safe would thus be unable to capture non-blacklisted devices when the PCI
bus is in blacklist mode.
These not-blacklisted devices actually will have a full-PCI name (DomBDF
format), so a simple match with the one passed in your fail-safe devargs will fail, e.g.:
# A physical port exists at 0000:00:02.0
testpmd --vdev="net_failsafe,dev(00:02.0)" -- -i
Would fail to capture the device 0000:00:02.0, as this is the name that the PCI
bus would give to this device, in the absence of a user-given name.
In 18.05, or 18.08 there should be an EAL function that would be able to
identify a device given a specific ID string (very close to an rte_devargs).
Currently, this API does not exist.
You can hack your way around this for the moment, IF you really, really
want: parse your devargs, get the bus, use the bus->parse() function to get a
binary device representation, and compare bytes per bytes the binary
representation given by your devargs and by the device->name.
But this is a hack, and a pretty ugly one at that: you have no way of knowing
the size taken by this binary representation, so you can restrict yourself to
the vdev and PCI bus for the moment and take the larger of an
rte_vdev_driver pointer and an rte_pci_addr....
{
union {
struct rte_vdev_driver *drv;
struct rte_pci_addr pci_addr;
} bindev1, bindev2;
memset(&bindev1, 0, sizeof(bindev1));
memset(&bindev2, 0, sizeof(bindev2));
rte_eal_devargs_parse(device->name, da1);
rte_eal_devargs_parse(your_devstr, da2);
RTE_ASSERT(da1->bus == rte_bus_find_by_name("pci") ||
da1->bus == rte_bus_find_by_name("vdev"));
RTE_ASSERT(da2->bus == rte_bus_find_by_name("pci") ||
da2->bus == rte_bus_find_by_name("vdev"));
da1->bus->parse(da1->name, &bindev1);
da2->bus->parse(da2->name, &bindev2);
if (memcmp(&bindev1, &bindev2, sizeof(bindev1)) == 0) {
/* found the device */
} else {
/* not found */
}
}
So, really, really ugly. Anyway.
Yes, ugly :) Thanks for this update!
Will keep the comparison by device->name.
Post by Gaëtan Rivet
<snip>
Post by Matan Azrad
Post by Gaëtan Rivet
Post by Matan Azrad
+ /* Take control of device probed by EAL options. */
+ DEBUG("Taking control of a probed sub device"
+ " %d named %s", i, da->name);
In this case, the devargs of the probed device must be copied within
the sub- device definition and removed from the EAL using the proper
rte_devargs API.
Note that there is no rte_devargs copy function. You can use
rte_devargs_parse instead, "parsing" again the original devargs into
the sub- device one. It is necessary for complying with internal
rte_devargs requirements (da->args being malloc-ed, at the moment,
but may evolve).
Post by Matan Azrad
Post by Gaëtan Rivet
The rte_eal_devargs_parse function is not easy enough to use right
now, you will have to build a devargs string (using snprintf) and submit it.
I proposed a change this release for it but it will not make it for
18.02, that would have simplified your implementation.
Got you. You right we need to remove the created devargs in fail-safe
parse level.
Post by Matan Azrad
What do you think about checking it in the parse level and avoid the new
devargs creation?
Post by Matan Azrad
Also to do the copy in parse level(same method as we are doing in probe
level)?
Not sure I follow here, but the new rte_devargs is part of the sub-device (it is
not a pointer, but allocated alongside the sub_device).
So keep everything here, it is the right place to deal with these things.
But it would avoid the double parsing and also keep the same method:
If the device is already parsed - copy its devargs and continue.
If the device is already probed - copy the device pointer and continue.

I think this is the right way to handle it, no?
Why deal with parse-level work at probe level? Just keep all the parse work at parse level and the probe work at probe level.

Thanks, Matan.
Post by Gaëtan Rivet
--
Gaëtan Rivet
6WIND
Gaëtan Rivet
2018-01-16 16:54:09 UTC
Post by Matan Azrad
Hi Gaetan
From: Gaëtan Rivet, Tuesday, January 16, 2018 4:41 PM
Post by Gaëtan Rivet
Post by Matan Azrad
Hi Gaetan
From: Gaëtan Rivet, Tuesday, January 16, 2018 1:09 PM
Post by Gaëtan Rivet
Hi Matan,
[PATCH v3 3/8] net/failsafe: add probed etherdev capture
?
OK, no problem.
Post by Gaëtan Rivet
Post by Matan Azrad
Previous fail-safe code didn't support getting probed sub-devices
and failed when it tried to probe them.
Skip fail-safe sub-device probing when it already was probed.
---
doc/guides/nics/fail_safe.rst | 5 ++++
drivers/net/failsafe/failsafe_eal.c | 60
++++++++++++++++++++++++-------------
2 files changed, 45 insertions(+), 20 deletions(-)
diff --git a/doc/guides/nics/fail_safe.rst
b/doc/guides/nics/fail_safe.rst index 5b1b47e..b89e53b 100644
--- a/doc/guides/nics/fail_safe.rst
+++ b/doc/guides/nics/fail_safe.rst
@@ -115,6 +115,11 @@ Fail-safe command line parameters
order to take only the last line into account (unlike ``exec()``) at every
probe attempt.
+
+ In case of whitelist sub-device probed by EAL, fail-safe PMD
+ will take the
device
Post by Matan Azrad
+ as is, which means that EAL device options are taken in this case.
+
- **mac** parameter [MAC address]
This parameter allows the user to set a default MAC address to
the fail-safe diff --git a/drivers/net/failsafe/failsafe_eal.c
b/drivers/net/failsafe/failsafe_eal.c
index 19d26f5..7bc7453 100644
--- a/drivers/net/failsafe/failsafe_eal.c
+++ b/drivers/net/failsafe/failsafe_eal.c
@@ -36,39 +36,59 @@
#include "failsafe_private.h"
static int
+fs_get_port_by_device_name(const char *name, uint16_t *port_id)
The naming convention for the failsafe driver is
namespace_object_sub-object_action()
OK.
Post by Gaëtan Rivet
With an ordering of objects by their scope (std, rte, failsafe, file).
Also, "get" as an action is not descriptive enough.
Isn't "get by device name" descriptive?
The endgame is capturing a device that we know we are interested in.
The device name being used for matching is an implementation detail, which
should be abstracted by using a sub-function.
Putting this in the name defeat the reason for using another function.
Post by Matan Azrad
Post by Gaëtan Rivet
static int
fs_ethdev_capture(const char *name, uint16_t *port_id);
You miss here the main reason why we need this function instead of using
rte_eth_dev_get_port_by_name.
Post by Matan Azrad
The reason we need this function is because we want to find the device by
the device name and not ethdev name.
Post by Matan Azrad
What's about fs_port_capture_by_device_name?
You are getting a port_id that is only valid for the rte_eth_devices array, by
using the ethdev iterator. You are only looking for an ethdev.
So it doesn't really matter whether you are using the ethdev name or the
device name, in the end you are capturing an ethdev
--> fs_ethdev_capture seems good for me.
I don't think so, this function doesn't take(capture) the device, just gets its ethdev port id using the device name.
The function which actually captures the device is the fs_bus_init.
So maybe even the "capture" name looks problematic here.
The main idea of this function is just to get the port_id.
Right :) . Call it fs_ethdev_portid_get() or fs_ethdev_find() then.
Post by Matan Azrad
Post by Gaëtan Rivet
Now, I guess you will say that the user would need to know that they have to
provide a device name that would be written in device->name. The issue
here is that you have a leaky abstraction for your function, forcing this kind of
consideration on your function user.
So I'd go further and will ask you to change the `const char *name` to a `const
rte_devargs *da` in the parameters.
Post by Matan Azrad
Maybe comparing it to device->devargs->name is better, What do you
think?
You are touching at a pretty contentious subject here :) .
Identifying devices is not currently a well-defined function in DPDK.
Some ports (actually, only one model: ConnectX-3) will have several ports
using the same PCI slot. But even ignoring this glaring problem...
As it is, the device->name for PCI will match the name given as a devargs, so
functionally this should not change anything.
Furthermore, you will have devices probed without any devargs. The fail-
safe would thus be unable to capture non-blacklisted devices when the PCI
bus is in blacklist mode.
These not-blacklisted devices actually will have a full-PCI name (DomBDF
format), so a simple match with the one passed in your fail-safe devargs will fail, e.g.:
# A physical port exists at 0000:00:02.0
testpmd --vdev="net_failsafe,dev(00:02.0)" -- -i
Would fail to capture the device 0000:00:02.0, as this is the name that the PCI
bus would give to this device, in the absence of a user-given name.
In 18.05, or 18.08 there should be an EAL function that would be able to
identify a device given a specific ID string (very close to an rte_devargs).
Currently, this API does not exist.
You can hack your way around this for the moment, IF you really, really
want: parse your devargs, get the bus, use the bus->parse() function to get a
binary device representation, and compare bytes per bytes the binary
representation given by your devargs and by the device->name.
But this is a hack, and a pretty ugly one at that: you have no way of knowing
the size taken by this binary representation, so you can restrict yourself to
the vdev and PCI bus for the moment and take the larger of an
rte_vdev_driver pointer and an rte_pci_addr....
{
union {
struct rte_vdev_driver *drv;
struct rte_pci_addr pci_addr;
} bindev1, bindev2;
memset(&bindev1, 0, sizeof(bindev1));
memset(&bindev2, 0, sizeof(bindev2));
rte_eal_devargs_parse(device->name, da1);
rte_eal_devargs_parse(your_devstr, da2);
RTE_ASSERT(da1->bus == rte_bus_find_by_name("pci") ||
da1->bus == rte_bus_find_by_name("vdev"));
RTE_ASSERT(da2->bus == rte_bus_find_by_name("pci") ||
da2->bus == rte_bus_find_by_name("vdev"));
da1->bus->parse(da1->name, &bindev1);
da2->bus->parse(da2->name, &bindev2);
if (memcmp(&bindev1, &bindev2, sizeof(bindev1)) == 0) {
/* found the device */
} else {
/* not found */
}
}
So, really, really ugly. Anyway.
Yes, ugly :) Thanks for this update!
Will keep the comparison by device->name.
Well, as explained above, the comparison by device->name only works with
whitelisted devices.

So either implement something broken right now that you will need to
update in 18.05, or implement it properly in 18.05 from the get-go.
Post by Matan Azrad
Post by Gaëtan Rivet
<snip>
Post by Matan Azrad
Post by Gaëtan Rivet
Post by Matan Azrad
+ /* Take control of device probed by EAL options. */
+ DEBUG("Taking control of a probed sub device"
+ " %d named %s", i, da->name);
In this case, the devargs of the probed device must be copied within
the sub- device definition and removed from the EAL using the proper
rte_devargs API.
Note that there is no rte_devargs copy function. You can use
rte_devargs_parse instead, "parsing" again the original devargs into
the sub- device one. It is necessary for complying with internal
rte_devargs requirements (da->args being malloc-ed, at the moment,
but may evolve).
Post by Matan Azrad
Post by Gaëtan Rivet
The rte_eal_devargs_parse function is not easy enough to use right
now, you will have to build a devargs string (using snprintf) and submit it.
I proposed a change this release for it but it will not make it for
18.02, that would have simplified your implementation.
Got you. You right we need to remove the created devargs in fail-safe
parse level.
Post by Matan Azrad
What do you think about checking it in the parse level and avoid the new
devargs creation?
Post by Matan Azrad
Also to do the copy in parse level(same method as we are doing in probe
level)?
Not sure I follow here, but the new rte_devargs is part of the sub-device (it is
not a pointer, but allocated alongside the sub_device).
So keep everything here, it is the right place to deal with these things.
If the device already parsed - copy its devargs and continue.
If the device already probed - copy the device pointer and continue.
I think this is the right dealing, no?
Why to deal with parse level in probe level? Just keep all the parse work to parse level and the probe work to probe level.
After re-reading, I think we misunderstood each other.
You cannot remove the rte_devargs created during parsing: it is
allocated alongside the sub_device structure.

You must only remove the rte_devargs allocated by the EAL (using
rte_eal_devargs_remove()).

Before removing it, you must copy its content in the local sub_device
rte_devargs structure. I only proposed a way to do this copy that would
not deal with rte_devargs internals, as it is bound to evolve rather
soon.

Otherwise, no, I do not want to complicate the parsing operations: they
are already too complicated and too critical. Better to keep it all
here.
--
Gaëtan Rivet
6WIND
Matan Azrad
2018-01-16 17:20:27 UTC
Hi Gaetan

From: Gaëtan Rivet, Tuesday, January 16, 2018 6:54 PM
Post by Matan Azrad
Post by Matan Azrad
Hi Gaetan
From: Gaëtan Rivet, Tuesday, January 16, 2018 4:41 PM
Post by Gaëtan Rivet
Post by Matan Azrad
Hi Gaetan
From: Gaëtan Rivet, Tuesday, January 16, 2018 1:09 PM
Post by Gaëtan Rivet
Hi Matan,
[PATCH v3 3/8] net/failsafe: add probed etherdev capture
?
OK, no problem.
Post by Gaëtan Rivet
Post by Matan Azrad
Previous fail-safe code didn't support getting probed
sub-devices and failed when it tried to probe them.
Skip fail-safe sub-device probing when it already was probed.
---
doc/guides/nics/fail_safe.rst | 5 ++++
drivers/net/failsafe/failsafe_eal.c | 60 ++++++++++++++++++++++++-------------
2 files changed, 45 insertions(+), 20 deletions(-)
diff --git a/doc/guides/nics/fail_safe.rst b/doc/guides/nics/fail_safe.rst
index 5b1b47e..b89e53b 100644
--- a/doc/guides/nics/fail_safe.rst
+++ b/doc/guides/nics/fail_safe.rst
@@ -115,6 +115,11 @@ Fail-safe command line parameters
order to take only the last line into account (unlike ``exec()``) at every
probe attempt.
+
+ In case of whitelist sub-device probed by EAL, fail-safe PMD will take the device
+ as is, which means that EAL device options are taken in this case.
+
- **mac** parameter [MAC address]
This parameter allows the user to set a default MAC address to the fail-safe
diff --git a/drivers/net/failsafe/failsafe_eal.c b/drivers/net/failsafe/failsafe_eal.c
index 19d26f5..7bc7453 100644
--- a/drivers/net/failsafe/failsafe_eal.c
+++ b/drivers/net/failsafe/failsafe_eal.c
@@ -36,39 +36,59 @@
#include "failsafe_private.h"
static int
+fs_get_port_by_device_name(const char *name, uint16_t *port_id)
The naming convention for the failsafe driver is
namespace_object_sub-object_action()
OK.
Post by Gaëtan Rivet
With an ordering of objects by their scope (std, rte, failsafe, file).
Also, "get" as an action is not descriptive enough.
Isn't "get by device name" descriptive?
The endgame is capturing a device that we know we are interested in.
The device name being used for matching is an implementation detail,
which should be abstracted by using a sub-function.
Putting this in the name defeats the purpose of using another function.
Post by Matan Azrad
Post by Gaëtan Rivet
static int
fs_ethdev_capture(const char *name, uint16_t *port_id);
You miss here the main reason why we need this function instead of using
rte_eth_dev_get_port_by_name.
Post by Matan Azrad
The reason we need this function is because we want to find the device by
the device name and not ethdev name.
Post by Matan Azrad
What's about fs_port_capture_by_device_name?
You are getting a port_id that is only valid for the rte_eth_devices
array, by using the ethdev iterator. You are only looking for an ethdev.
So it doesn't really matter whether you are using the ethdev name or
the device name, in the end you are capturing an ethdev
--> fs_ethdev_capture seems good for me.
I don't think so, this function doesn't take(capture) the device, just gets its
ethdev port id using the device name.
Post by Matan Azrad
The function which actually captures the device is the fs_bus_init.
So maybe even the "capture" name looks problematic here.
The main idea of this function is just to get the port_id.
Right :) . Call it fs_ethdev_portid_get() or fs_ethdev_find() then.
Sure, agree with the first one.
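
For readers following the naming discussion, here is a minimal sketch of what a
lookup named fs_ethdev_portid_get() could look like: it walks the ports and
compares the name of the underlying device, not the ethdev name. The struct
device and struct eth_dev types, and the extra ports/nb_ports parameters, are
hypothetical stand-ins invented for this example; the real function would rely
on the ethdev iterator and on rte_eth_devices instead.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for rte_device and rte_eth_dev (not DPDK types). */
struct device {
	const char *name; /* bus-level device name, e.g. a PCI address */
};

struct eth_dev {
	const char *ethdev_name; /* ethdev-level name, may differ */
	const struct device *device; /* NULL when the port slot is unused */
};

/*
 * Find the port whose underlying device is named `name`.
 * Returns 0 and sets *port_id on success, -ENODEV otherwise.
 */
static int
fs_ethdev_portid_get(const struct eth_dev *ports, uint16_t nb_ports,
		     const char *name, uint16_t *port_id)
{
	uint16_t pid;

	for (pid = 0; pid < nb_ports; pid++) {
		if (ports[pid].device == NULL)
			continue; /* skip unused slots */
		if (strcmp(ports[pid].device->name, name) == 0) {
			*port_id = pid;
			return 0;
		}
	}
	return -ENODEV;
}
```

Note that the function only reports a port ID; it does not take ownership of
the device, which matches the objection raised against "capture" in the name.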
Post by Matan Azrad
Post by Matan Azrad
Post by Gaëtan Rivet
Now, I guess you will say that the user would need to know that they
have to provide a device name that would be written in device->name.
The issue here is that you have a leaky abstraction for your
function, forcing this kind of consideration on your function user.
So I'd go further and will ask you to change the `const char *name`
to a `const rte_devargs *da` in the parameters.
Post by Matan Azrad
Maybe comparing it to device->devargs->name is better, What do you
think?
You are touching at a pretty contentious subject here :) .
Identifying devices is not currently a well-defined function in DPDK.
Some ports (actually, only one model: ConnectX-3) will have several
ports using the same PCI slot. But even ignoring this glaring problem...
As it is, the device->name for PCI will match the name given as a
devargs, so functionally this should not change anything.
Furthermore, you will have devices probed without any devargs. The
fail-safe would thus be unable to capture non-blacklisted devices
when the PCI bus is in blacklist mode.
These not-blacklisted devices actually will have a full-PCI name
(DomBDF format), so a simple match with the one passed in your
# A physical port exists at 0000:00:02.0
testpmd --vdev="net_failsafe,dev(00:02.0)" -- -i
Would fail to capture the device 0000:00:02.0, as this is the name
that the PCI bus would give to this device, in the absence of a user-given
name.
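
The mismatch between "00:02.0" and "0000:00:02.0" can be illustrated with a
small sketch that compares parsed PCI addresses instead of raw strings. The
struct pci_addr type and the pci_addr_parse() and pci_name_match() helpers are
hypothetical and written for this example only; they are not EAL functions, and
they assume the short form implies domain 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical PCI address holder with the DomBDF fields only. */
struct pci_addr {
	unsigned int domain;
	unsigned int bus;
	unsigned int devid;
	unsigned int function;
};

/* Parse "DDDD:BB:DD.F" or the short "BB:DD.F" form (domain defaults to 0). */
static int
pci_addr_parse(const char *str, struct pci_addr *addr)
{
	char tail; /* catches trailing garbage after the function number */

	memset(addr, 0, sizeof(*addr));
	if (sscanf(str, "%x:%x:%x.%x%c", &addr->domain, &addr->bus,
		   &addr->devid, &addr->function, &tail) == 4)
		return 0;
	/* The failed attempt may have written some fields; reset the domain. */
	addr->domain = 0;
	if (sscanf(str, "%x:%x.%x%c", &addr->bus, &addr->devid,
		   &addr->function, &tail) == 3)
		return 0;
	return -1;
}

/* Return true when both strings designate the same PCI address. */
static bool
pci_name_match(const char *a, const char *b)
{
	struct pci_addr pa, pb;

	if (pci_addr_parse(a, &pa) != 0 || pci_addr_parse(b, &pb) != 0)
		return false;
	return pa.domain == pb.domain && pa.bus == pb.bus &&
	       pa.devid == pb.devid && pa.function == pb.function;
}
```

With these helpers a plain strcmp() of the two spellings reports a difference,
while pci_name_match() treats them as the same device.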
Post by Matan Azrad
Post by Gaëtan Rivet
In 18.05, or 18.08 there should be an EAL function that would be
able to identify a device given a specific ID string (very close to an
rte_devargs).
Currently, this API does not exist.
You can hack your way around this for the moment, IF you really, really
want: parse your devargs, get the bus, use the bus->parse() function
to get a binary device representation, and compare byte by byte
the binary representation given by your devargs and by the
device->name.
But this is a hack, and a pretty ugly one at that: you have no way
of knowing the size taken by this binary representation, so you can
restrict yourself to the vdev and PCI bus for the moment and take
the larger of an rte_vdev_driver pointer and an rte_pci_addr....
{
union {
rte_vdev_driver *drv;
struct rte_pci_addr pci_addr;
} bindev1, bindev2;
memset(&bindev1, 0, sizeof(bindev1));
memset(&bindev2, 0, sizeof(bindev2));
rte_eal_devargs_parse(device->name, da1);
rte_eal_devargs_parse(your_devstr, da2);
RTE_ASSERT(da1->bus == rte_bus_find_by_name("pci") ||
da1->bus == rte_bus_find_by_name("vdev"));
RTE_ASSERT(da2->bus == rte_bus_find_by_name("pci") ||
da2->bus == rte_bus_find_by_name("vdev"));
da1->bus->parse(da1->name, &bindev1);
da2->bus->parse(da2->name, &bindev2);
if (memcmp(&bindev1, &bindev2, sizeof(bindev1)) == 0) {
/* found the device */
} else {
/* not found */
}
}
So, really, really ugly. Anyway.
Yes, ugly :) Thanks for this update!
Will keep the comparison by device->name.
Well as explained, above, the comparison by device->name only works with
whitelisted devices.
So either implement something broken right now that you will need to
update in 18.05, or implement it properly in 18.05 from the get go.
For the current needs it is enough.
We can also say that it is the user's responsibility to pass to failsafe the same names and same args as are passed to the EAL (or the default EAL names).
I think I emphasized it in documentation.
Post by Matan Azrad
Post by Matan Azrad
Post by Gaëtan Rivet
<snip>
Post by Matan Azrad
Post by Gaëtan Rivet
Post by Matan Azrad
+ /* Take control of device probed by EAL options. */
+ DEBUG("Taking control of a probed sub device"
+ " %d named %s", i, da->name);
In this case, the devargs of the probed device must be copied
within the sub-device definition and removed from the EAL using
the proper rte_devargs API.
Note that there is no rte_devargs copy function. You can use
rte_devargs_parse instead, "parsing" again the original devargs
into the sub-device one. It is necessary for complying with
internal rte_devargs requirements (da->args being malloc-ed, at
the moment, but may evolve).
The rte_eal_devargs_parse function is not easy enough to use
right now, you will have to build a devargs string (using snprintf) and
submit it.
I proposed a change this release for it but it will not make it
for 18.02, that would have simplified your implementation.
Got you. You're right, we need to remove the created devargs at the fail-safe
parse level.
Post by Matan Azrad
What do you think about checking it in the parse level and avoid the new
devargs creation?
Post by Matan Azrad
Also to do the copy in parse level(same method as we are doing in probe
level)?
Not sure I follow here, but the new rte_devargs is part of the
sub-device (it is not a pointer, but allocated alongside the sub_device).
So keep everything here, it is the right place to deal with these things.
If the device already parsed - copy its devargs and continue.
If the device already probed - copy the device pointer and continue.
I think this is the right dealing, no?
Why to deal with parse level in probe level? Just keep all the parse work to
parse level and the probe work to probe level.
After re-reading, I think we misunderstood each other.
You cannot remove the rte_devargs created during parsing: it is allocated
alongside the sub_device structure.
You must only remove the rte_devargs allocated by the EAL (using
rte_eal_devargs_remove()).
Sure.
Post by Matan Azrad
Before removing it, you must copy its content in the local sub_device
rte_devargs structure. I only proposed a way to do this copy that would not
deal with rte_devargs internals, as it is bound to evolve rather soon.
Yes.
Post by Matan Azrad
Otherwise, no, I do not want to complicate the parsing operations, they are
already too complicated and too critical. Better to keep it all here.
I think fs_parse_device function is not complicated and it is the natural place for devargs games.
For me this is the right place for the copy & remove devargs.
Are you insisting to put all in fs_bus_init?
Post by Matan Azrad
--
Gaëtan Rivet
6WIND
Gaëtan Rivet
2018-01-16 22:31:04 UTC
Hi Matan,
Post by Matan Azrad
Hi Gaetan
<snip>
Post by Matan Azrad
Post by Gaëtan Rivet
Post by Matan Azrad
Post by Gaëtan Rivet
In 18.05, or 18.08 there should be an EAL function that would be
able to identify a device given a specific ID string (very close to an
rte_devargs).
Currently, this API does not exist.
You can hack your way around this for the moment, IF you really, really
want: parse your devargs, get the bus, use the bus->parse() function
to get a binary device representation, and compare byte by byte
the binary representation given by your devargs and by the
device->name.
But this is a hack, and a pretty ugly one at that: you have no way
of knowing the size taken by this binary representation, so you can
restrict yourself to the vdev and PCI bus for the moment and take
the larger of an rte_vdev_driver pointer and an rte_pci_addr....
{
union {
rte_vdev_driver *drv;
struct rte_pci_addr pci_addr;
} bindev1, bindev2;
memset(&bindev1, 0, sizeof(bindev1));
memset(&bindev2, 0, sizeof(bindev2));
rte_eal_devargs_parse(device->name, da1);
rte_eal_devargs_parse(your_devstr, da2);
RTE_ASSERT(da1->bus == rte_bus_find_by_name("pci") ||
da1->bus == rte_bus_find_by_name("vdev"));
RTE_ASSERT(da2->bus == rte_bus_find_by_name("pci") ||
da2->bus == rte_bus_find_by_name("vdev"));
da1->bus->parse(da1->name, &bindev1);
da2->bus->parse(da2->name, &bindev2);
if (memcmp(&bindev1, &bindev2, sizeof(bindev1)) == 0) {
/* found the device */
} else {
/* not found */
}
}
So, really, really ugly. Anyway.
Yes, ugly :) Thanks for this update!
Will keep the comparison by device->name.
Well as explained, above, the comparison by device->name only works with
whitelisted devices.
So either implement something broken right now that you will need to
update in 18.05, or implement it properly in 18.05 from the get go.
For the current needs it is enough.
We can also say that it is the user's responsibility to pass to failsafe the same names and same args as are passed to the EAL (or the default EAL names).
I think I emphasized it in documentation.
Okay, as you wish. Just be aware of this limitation.

I think this functionality is good and useful, but it needs to be made clean.
The proper function should be available soon, then this implementation should
be cleaned up.
Post by Matan Azrad
Post by Gaëtan Rivet
Post by Matan Azrad
Post by Gaëtan Rivet
<snip>
Post by Matan Azrad
Post by Gaëtan Rivet
Post by Matan Azrad
+ /* Take control of device probed by EAL options. */
+ DEBUG("Taking control of a probed sub device"
+ " %d named %s", i, da->name);
In this case, the devargs of the probed device must be copied
within the sub-device definition and removed from the EAL using
the proper rte_devargs API.
Note that there is no rte_devargs copy function. You can use
rte_devargs_parse instead, "parsing" again the original devargs
into the sub-device one. It is necessary for complying with
internal rte_devargs requirements (da->args being malloc-ed, at
the moment, but may evolve).
The rte_eal_devargs_parse function is not easy enough to use
right now, you will have to build a devargs string (using snprintf) and
submit it.
I proposed a change this release for it but it will not make it
for 18.02, that would have simplified your implementation.
Got you. You're right, we need to remove the created devargs at the fail-safe
parse level.
Post by Matan Azrad
What do you think about checking it in the parse level and avoid the new
devargs creation?
Post by Matan Azrad
Also to do the copy in parse level(same method as we are doing in probe
level)?
Not sure I follow here, but the new rte_devargs is part of the
sub-device (it is not a pointer, but allocated alongside the sub_device).
So keep everything here, it is the right place to deal with these things.
If the device already parsed - copy its devargs and continue.
If the device already probed - copy the device pointer and continue.
I think this is the right dealing, no?
Why to deal with parse level in probe level? Just keep all the parse work to
parse level and the probe work to probe level.
After re-reading, I think we misunderstood each other.
You cannot remove the rte_devargs created during parsing: it is allocated
alongside the sub_device structure.
You must only remove the rte_devargs allocated by the EAL (using
rte_eal_devargs_remove()).
Sure.
Post by Gaëtan Rivet
Before removing it, you must copy its content in the local sub_device
rte_devargs structure. I only proposed a way to do this copy that would not
deal with rte_devargs internals, as it is bound to evolve rather soon.
Yes.
Post by Gaëtan Rivet
Otherwise, no, I do not want to complicate the parsing operations, they are
already too complicated and too critical. Better to keep it all here.
I think fs_parse_device function is not complicated and it is the natural place for devargs games.
For me this is the right place for the copy & remove devargs.
Are you insisting to put all in fs_bus_init?
You would have to put fs_ethdev_portid_find in failsafe_args, which is
mixing layers. Sorry but yes, please keep all these changes in this
file.

Thanks,
--
Gaëtan Rivet
6WIND
Matan Azrad
2018-01-17 08:40:00 UTC
Hi Gaetan

From: Gaëtan Rivet, Wednesday, January 17, 2018 12:31 AM
Post by Gaëtan Rivet
Hi Matan,
Post by Matan Azrad
Hi Gaetan
<snip>
Post by Matan Azrad
Post by Gaëtan Rivet
Post by Matan Azrad
Post by Gaëtan Rivet
In 18.05, or 18.08 there should be an EAL function that would be
able to identify a device given a specific ID string (very close to an
rte_devargs).
Currently, this API does not exist.
You can hack your way around this for the moment, IF you really, really
want: parse your devargs, get the bus, use the bus->parse()
function to get a binary device representation, and compare
byte by byte the binary representation given by your devargs
and by the device->name.
But this is a hack, and a pretty ugly one at that: you have no
way of knowing the size taken by this binary representation, so
you can restrict yourself to the vdev and PCI bus for the moment
and take the larger of an rte_vdev_driver pointer and an
rte_pci_addr....
Post by Matan Azrad
Post by Gaëtan Rivet
Post by Matan Azrad
Post by Gaëtan Rivet
{
union {
rte_vdev_driver *drv;
struct rte_pci_addr pci_addr;
} bindev1, bindev2;
memset(&bindev1, 0, sizeof(bindev1));
memset(&bindev2, 0, sizeof(bindev2));
rte_eal_devargs_parse(device->name, da1);
rte_eal_devargs_parse(your_devstr, da2);
RTE_ASSERT(da1->bus == rte_bus_find_by_name("pci") ||
da1->bus == rte_bus_find_by_name("vdev"));
RTE_ASSERT(da2->bus == rte_bus_find_by_name("pci") ||
da2->bus == rte_bus_find_by_name("vdev"));
da1->bus->parse(da1->name, &bindev1);
da2->bus->parse(da2->name, &bindev2);
if (memcmp(&bindev1, &bindev2, sizeof(bindev1)) == 0) {
/* found the device */
} else {
/* not found */
}
}
So, really, really ugly. Anyway.
Yes, ugly :) Thanks for this update!
Will keep the comparison by device->name.
Well as explained, above, the comparison by device->name only works
with whitelisted devices.
So either implement something broken right now that you will need to
update in 18.05, or implement it properly in 18.05 from the get go.
For the current needs it is enough.
We can also say that it is the user's responsibility to pass to failsafe the same
names and same args as are passed to the EAL (or the default EAL names).
Post by Matan Azrad
I think I emphasized it in documentation.
Okay, as you wish. Just be aware of this limitation.
I think this functionality is good and useful, but it needs to be made clean.
The proper function should be available soon, then this implementation
should be cleaned up.
Sure.
Post by Gaëtan Rivet
Post by Matan Azrad
Post by Gaëtan Rivet
Post by Matan Azrad
Post by Gaëtan Rivet
<snip>
Post by Matan Azrad
Post by Gaëtan Rivet
Post by Matan Azrad
+ /* Take control of device probed by EAL options. */
+ DEBUG("Taking control of a probed sub device"
+ " %d named %s", i, da->name);
In this case, the devargs of the probed device must be
copied within the sub-device definition and removed from
the EAL using the proper rte_devargs API.
Note that there is no rte_devargs copy function. You can use
rte_devargs_parse instead, "parsing" again the original
devargs into the sub-device one. It is necessary for
complying with internal rte_devargs requirements (da->args
being malloc-ed, at the moment, but may evolve).
The rte_eal_devargs_parse function is not easy enough to use
right now, you will have to build a devargs string (using
snprintf) and submit it.
I proposed a change this release for it but it will not make
it for 18.02, that would have simplified your implementation.
Got you. You're right, we need to remove the created devargs at the fail-safe
parse level.
Post by Matan Azrad
What do you think about checking it in the parse level and
avoid the new
devargs creation?
Post by Matan Azrad
Also to do the copy in parse level(same method as we are doing in probe
level)?
Not sure I follow here, but the new rte_devargs is part of the
sub-device (it is not a pointer, but allocated alongside the
sub_device).
Post by Matan Azrad
Post by Gaëtan Rivet
Post by Matan Azrad
Post by Gaëtan Rivet
So keep everything here, it is the right place to deal with these things.
If the device already parsed - copy its devargs and continue.
If the device already probed - copy the device pointer and continue.
I think this is the right dealing, no?
Why to deal with parse level in probe level? Just keep all the parse work to
parse level and the probe work to probe level.
After re-reading, I think we misunderstood each other.
You cannot remove the rte_devargs created during parsing: it is
allocated alongside the sub_device structure.
You must only remove the rte_devargs allocated by the EAL (using
rte_eal_devargs_remove()).
Sure.
Post by Gaëtan Rivet
Before removing it, you must copy its content in the local
sub_device rte_devargs structure. I only proposed a way to do this
copy that would not deal with rte_devargs internals, as it is bound to
evolve rather soon.
Post by Matan Azrad
Yes.
Post by Gaëtan Rivet
Otherwise, no, I do not want to complicate the parsing operations,
they are already too complicated and too critical. Better to keep it all
here.
Post by Matan Azrad
I think fs_parse_device function is not complicated and it is the natural
place for devargs games.
Post by Matan Azrad
For me this is the right place for the copy & remove devargs.
Are you insisting to put all in fs_bus_init?
You would have to put fs_ethdev_portid_find in failsafe_args, which is
mixing layers. Sorry but yes, please keep all these changes in this file.
OK, Thanks man!
Post by Gaëtan Rivet
Thanks,
--
Gaëtan Rivet
6WIND
Matan Azrad
2018-01-09 14:47:29 UTC
This patch lays the groundwork for this driver (draft documentation,
copyright notices, code base skeleton and build system hooks). While it can
be successfully compiled and invoked, it's an empty shell at this stage.

Signed-off-by: Adrien Mazarguil <***@6wind.com>
Signed-off-by: Matan Azrad <***@mellanox.com>
---
MAINTAINERS | 6 ++
config/common_base | 5 ++
config/common_linuxapp | 1 +
doc/guides/nics/features/vdev_netvsc.ini | 12 +++
doc/guides/nics/index.rst | 1 +
doc/guides/nics/vdev_netvsc.rst | 20 +++++
drivers/net/Makefile | 1 +
drivers/net/vdev_netvsc/Makefile | 27 ++++++
.../vdev_netvsc/rte_pmd_vdev_netvsc_version.map | 4 +
drivers/net/vdev_netvsc/vdev_netvsc.c | 99 ++++++++++++++++++++++
mk/rte.app.mk | 1 +
11 files changed, 177 insertions(+)
create mode 100644 doc/guides/nics/features/vdev_netvsc.ini
create mode 100644 doc/guides/nics/vdev_netvsc.rst
create mode 100644 drivers/net/vdev_netvsc/Makefile
create mode 100644 drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map
create mode 100644 drivers/net/vdev_netvsc/vdev_netvsc.c

diff --git a/MAINTAINERS b/MAINTAINERS
index f0baeb4..07be8cb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -451,6 +451,12 @@ F: drivers/net/mrvl/
F: doc/guides/nics/mrvl.rst
F: doc/guides/nics/features/mrvl.ini

+Microsoft vdev_netvsc - EXPERIMENTAL
+M: Matan Azrad <***@mellanox.com>
+F: drivers/net/vdev_netvsc/
+F: doc/guides/nics/vdev_netvsc.rst
+F: doc/guides/nics/features/vdev_netvsc.ini
+
Netcope szedata2
M: Matej Vido <***@cesnet.cz>
F: drivers/net/szedata2/
diff --git a/config/common_base b/config/common_base
index e74febe..1c6629e 100644
--- a/config/common_base
+++ b/config/common_base
@@ -281,6 +281,11 @@ CONFIG_RTE_LIBRTE_NFP_DEBUG=n
CONFIG_RTE_LIBRTE_MRVL_PMD=n

#
+# Compile virtual device driver for NetVSC on Hyper-V/Azure
+#
+CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD=n
+
+#
# Compile burst-oriented Broadcom BNXT PMD driver
#
CONFIG_RTE_LIBRTE_BNXT_PMD=y
diff --git a/config/common_linuxapp b/config/common_linuxapp
index 74c7d64..e043262 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -47,6 +47,7 @@ CONFIG_RTE_LIBRTE_PMD_VHOST=y
CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
CONFIG_RTE_LIBRTE_PMD_TAP=y
CONFIG_RTE_LIBRTE_AVP_PMD=y
+CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD=y
CONFIG_RTE_LIBRTE_NFP_PMD=y
CONFIG_RTE_LIBRTE_POWER=y
CONFIG_RTE_VIRTIO_USER=y
diff --git a/doc/guides/nics/features/vdev_netvsc.ini b/doc/guides/nics/features/vdev_netvsc.ini
new file mode 100644
index 0000000..cfc5cb9
--- /dev/null
+++ b/doc/guides/nics/features/vdev_netvsc.ini
@@ -0,0 +1,12 @@
+;
+; Supported features of the 'vdev_netvsc' network poll mode driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+ARMv7 = Y
+ARMv8 = Y
+Power8 = Y
+x86-32 = Y
+x86-64 = Y
+Usage doc = Y
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index 23babe9..5666046 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -64,6 +64,7 @@ Network Interface Controller Drivers
szedata2
tap
thunderx
+ vdev_netvsc
virtio
vhost
vmxnet3
diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
new file mode 100644
index 0000000..a952908
--- /dev/null
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: BSD-3-Clause
+ Copyright 2017 6WIND S.A.
+ Copyright 2017 Mellanox Technologies, Ltd.
+
+VDEV_NETVSC driver
+==================
+
+The VDEV_NETVSC driver (librte_pmd_vdev_netvsc) provides support for NetVSC
+interfaces and associated SR-IOV virtual function (VF) devices found in
+Linux virtual machines running on Microsoft Hyper-V_ (including Azure)
+platforms.
+
+.. _Hyper-V: https://docs.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-hyper-v
+
+Build options
+-------------
+
+- ``CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD`` (default ``y``)
+
+ Toggle compilation of this driver.
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index ef09b4e..dc41ed1 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -66,6 +66,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_SFC_EFX_PMD) += sfc
DIRS-$(CONFIG_RTE_LIBRTE_PMD_SZEDATA2) += szedata2
DIRS-$(CONFIG_RTE_LIBRTE_PMD_TAP) += tap
DIRS-$(CONFIG_RTE_LIBRTE_THUNDERX_NICVF_PMD) += thunderx
+DIRS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += vdev_netvsc
DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio
DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += vmxnet3

diff --git a/drivers/net/vdev_netvsc/Makefile b/drivers/net/vdev_netvsc/Makefile
new file mode 100644
index 0000000..2fb059d
--- /dev/null
+++ b/drivers/net/vdev_netvsc/Makefile
@@ -0,0 +1,27 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright 2017 6WIND S.A.
+# Copyright 2017 Mellanox Technologies, Ltd.
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+# Properties of the generated library.
+LIB = librte_pmd_vdev_netvsc.a
+LIBABIVER := 1
+EXPORT_MAP := rte_pmd_vdev_netvsc_version.map
+
+# Additional compilation flags.
+CFLAGS += -O3
+CFLAGS += -g
+CFLAGS += -std=c11 -pedantic -Wall -Wextra
+CFLAGS += $(WERROR_FLAGS)
+
+# Dependencies.
+LDLIBS += -lrte_bus_vdev
+LDLIBS += -lrte_eal
+LDLIBS += -lrte_ethdev
+LDLIBS += -lrte_kvargs
+
+# Source files.
+SRCS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += vdev_netvsc.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map b/drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map
new file mode 100644
index 0000000..179140f
--- /dev/null
+++ b/drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map
@@ -0,0 +1,4 @@
+DPDK_18.02 {
+
+ local: *;
+};
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
new file mode 100644
index 0000000..e895b32
--- /dev/null
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -0,0 +1,99 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2017 6WIND S.A.
+ * Copyright 2017 Mellanox Technologies, Ltd.
+ */
+
+#include <stddef.h>
+
+#include <rte_bus_vdev.h>
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_kvargs.h>
+#include <rte_log.h>
+
+#define VDEV_NETVSC_DRIVER net_vdev_netvsc
+#define VDEV_NETVSC_ARG_IFACE "iface"
+#define VDEV_NETVSC_ARG_MAC "mac"
+
+#define DRV_LOG(level, ...) \
+ rte_log(RTE_LOG_ ## level, \
+ vdev_netvsc_logtype, \
+ RTE_FMT(RTE_STR(VDEV_NETVSC_DRIVER) ": " \
+ RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
+ RTE_FMT_TAIL(__VA_ARGS__,)))
+
+/** Driver-specific log messages type. */
+static int vdev_netvsc_logtype;
+
+/** Number of driver instances relying on context list. */
+static unsigned int vdev_netvsc_ctx_inst;
+
+/**
+ * Probe NetVSC interfaces.
+ *
+ * @param dev
+ * Virtual device context for driver instance.
+ *
+ * @return
+ * Always 0, even in case of errors.
+ */
+static int
+vdev_netvsc_vdev_probe(struct rte_vdev_device *dev)
+{
+ static const char *const vdev_netvsc_arg[] = {
+ VDEV_NETVSC_ARG_IFACE,
+ VDEV_NETVSC_ARG_MAC,
+ NULL,
+ };
+ const char *name = rte_vdev_device_name(dev);
+ const char *args = rte_vdev_device_args(dev);
+ struct rte_kvargs *kvargs = rte_kvargs_parse(args ? args : "",
+ vdev_netvsc_arg);
+
+ DRV_LOG(DEBUG, "invoked as \"%s\", using arguments \"%s\"", name, args);
+ if (!kvargs) {
+ DRV_LOG(ERR, "cannot parse arguments list");
+ goto error;
+ }
+error:
+ if (kvargs)
+ rte_kvargs_free(kvargs);
+ ++vdev_netvsc_ctx_inst;
+ return 0;
+}
+
+/**
+ * Remove driver instance.
+ *
+ * @param dev
+ * Virtual device context for driver instance.
+ *
+ * @return
+ * Always 0.
+ */
+static int
+vdev_netvsc_vdev_remove(__rte_unused struct rte_vdev_device *dev)
+{
+ --vdev_netvsc_ctx_inst;
+ return 0;
+}
+
+/** Virtual device descriptor. */
+static struct rte_vdev_driver vdev_netvsc_vdev = {
+ .probe = vdev_netvsc_vdev_probe,
+ .remove = vdev_netvsc_vdev_remove,
+};
+
+RTE_PMD_REGISTER_VDEV(VDEV_NETVSC_DRIVER, vdev_netvsc_vdev);
+RTE_PMD_REGISTER_ALIAS(VDEV_NETVSC_DRIVER, eth_vdev_netvsc);
+RTE_PMD_REGISTER_PARAM_STRING(net_vdev_netvsc,
+ VDEV_NETVSC_ARG_IFACE "=<string> "
+ VDEV_NETVSC_ARG_MAC "=<string>");
+
+/** Initialize driver log type. */
+RTE_INIT(vdev_netvsc_init_log)
+{
+ vdev_netvsc_logtype = rte_log_register("pmd.vdev_netvsc");
+ if (vdev_netvsc_logtype >= 0)
+ rte_log_set_level(vdev_netvsc_logtype, RTE_LOG_NOTICE);
+}
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 6a6a745..3ae5212 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -156,6 +156,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_SFC_EFX_PMD) += -lrte_pmd_sfc_efx
_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_SZEDATA2) += -lrte_pmd_szedata2 -lsze2
_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_TAP) += -lrte_pmd_tap
_LDLIBS-$(CONFIG_RTE_LIBRTE_THUNDERX_NICVF_PMD) += -lrte_pmd_thunderx_nicvf
+_LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += -lrte_pmd_vdev_netvsc
_LDLIBS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += -lrte_pmd_virtio
ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += -lrte_pmd_vhost
--
1.8.3.1
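
In the skeleton above, the probe function hands rte_kvargs_parse() a
NULL-terminated list of accepted keys ("iface" and "mac"). The idea of checking
a "key=value,key=value" string against such a list can be sketched as follows.
This is a simplified stand-in written for illustration, with hypothetical
helper names (key_is_valid, args_check); it does not reproduce the rte_kvargs
implementation or its handling of values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Keys accepted by the driver, as listed in the patch. */
static const char *const valid_keys[] = { "iface", "mac", NULL };

/* Return true when `key` (of length `len`) is in the NULL-terminated list. */
static bool
key_is_valid(const char *key, size_t len, const char *const keys[])
{
	size_t i;

	for (i = 0; keys[i] != NULL; i++)
		if (strlen(keys[i]) == len && strncmp(keys[i], key, len) == 0)
			return true;
	return false;
}

/*
 * Check a "key=value,key=value" string against the accepted keys.
 * Returns the number of pairs, or -1 on an unknown or malformed pair.
 */
static int
args_check(const char *args, const char *const keys[])
{
	int count = 0;

	while (*args != '\0') {
		size_t pair_len = strcspn(args, ",");
		const char *eq = memchr(args, '=', pair_len);

		if (eq == NULL ||
		    !key_is_valid(args, (size_t)(eq - args), keys))
			return -1;
		count++;
		args += pair_len;
		if (*args == ',')
			args++; /* step over the separator */
	}
	return count;
}
```

An empty argument string is accepted and yields zero pairs, which corresponds
to the case where the driver is invoked without "iface" or "mac".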
Matan Azrad
2018-01-09 14:47:30 UTC
As described in more detail in the attached documentation (see patch
contents), this virtual device driver manages NetVSC interfaces in virtual
machines hosted by Hyper-V/Azure platforms.

This driver manages neither traffic nor Ethernet devices directly; it acts
as a thin configuration layer that automatically instantiates and controls
fail-safe PMD instances combining tap and PCI sub-devices, so that each
NetVSC interface is exposed as a single consolidated port to DPDK
applications.

PCI sub-devices being hot-pluggable (e.g. during VM migration),
applications automatically benefit from increased throughput when present
and automatic fallback on NetVSC otherwise without interruption thanks to
fail-safe's hot-plug handling.

Once initialized, the sole job of the vdev_netvsc driver is to regularly
scan for PCI devices to associate with NetVSC interfaces and feed their
addresses to corresponding fail-safe instances.

Signed-off-by: Adrien Mazarguil <***@6wind.com>
Signed-off-by: Matan Azrad <***@mellanox.com>
---
doc/guides/nics/vdev_netvsc.rst | 70 +++++
drivers/net/vdev_netvsc/Makefile | 4 +
drivers/net/vdev_netvsc/vdev_netvsc.c | 544 +++++++++++++++++++++++++++++++++-
3 files changed, 617 insertions(+), 1 deletion(-)

diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
index a952908..fde1fb8 100644
--- a/doc/guides/nics/vdev_netvsc.rst
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -12,9 +12,79 @@ platforms.

.. _Hyper-V: https://docs.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-hyper-v

+Implementation details
+----------------------
+
+Each instance of this driver effectively needs to drive two devices: the
+NetVSC interface proper and its SR-IOV VF (referred to as "physical" from
+this point on) counterpart sharing the same MAC address.
+
+Physical devices are part of the host system and cannot be maintained during
+VM migration. From a VM standpoint they appear as hot-plug devices that come
+and go without prior notice.
+
+When the physical device is present, egress and most of the ingress traffic
+flows through it; only multicasts and other hypervisor control still flow
+through NetVSC. Otherwise, NetVSC acts as a fallback for all traffic.
+
+To avoid unnecessary code duplication and ensure maximum performance,
+handling of physical devices is left to their original PMDs; this virtual
+device driver (also known as *vdev*) manages other PMDs as summarized by the
+following block diagram::
+
+ .------------------.
+ | DPDK application |
+ `--------+---------'
+ |
+ .------+------.
+ | DPDK ethdev |
+ `------+------' Control
+ | |
+ .------------+------------. v .--------------------.
+ | failsafe PMD +---------+ vdev_netvsc driver |
+ `--+-------------------+--' `--------------------'
+ | |
+ | .........|.........
+ | : | :
+ .----+----. : .----+----. :
+ | tap PMD | : | any PMD | :
+ `----+----' : `----+----' : <-- Hot-pluggable
+ | : | :
+ .------+-------. : .-----+-----. :
+ | NetVSC-based | : | SR-IOV VF | :
+ | netdevice | : | device | :
+ `--------------' : `-----------' :
+ :.................:
+
+
+This driver implementation may be temporary and should be improved or removed
+either when hot-plug is fully supported in EAL and bus drivers or when a new
+NetVSC driver is integrated.
+
Build options
-------------

- ``CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD`` (default ``y``)

Toggle compilation of this driver.
+
+Run-time parameters
+-------------------
+
+To invoke this driver, applications have to explicitly provide the
+``--vdev=net_vdev_netvsc`` EAL option.
+
+The following device parameters are supported:
+
+- ``iface`` [string]
+
+ Provide a specific NetVSC interface (netdevice) name to attach this driver
+ to. Can be provided multiple times for additional instances.
+
+- ``mac`` [string]
+
+ Same as ``iface`` except a suitable NetVSC interface is located using its
+ MAC address.
+
+Not specifying either ``iface`` or ``mac`` makes this driver attach itself to
+all NetVSC interfaces found on the system.
diff --git a/drivers/net/vdev_netvsc/Makefile b/drivers/net/vdev_netvsc/Makefile
index 2fb059d..f2b2ac5 100644
--- a/drivers/net/vdev_netvsc/Makefile
+++ b/drivers/net/vdev_netvsc/Makefile
@@ -13,6 +13,9 @@ EXPORT_MAP := rte_pmd_vdev_netvsc_version.map
CFLAGS += -O3
CFLAGS += -g
CFLAGS += -std=c11 -pedantic -Wall -Wextra
+CFLAGS += -D_XOPEN_SOURCE=600
+CFLAGS += -D_BSD_SOURCE
+CFLAGS += -D_DEFAULT_SOURCE
CFLAGS += $(WERROR_FLAGS)

# Dependencies.
@@ -20,6 +23,7 @@ LDLIBS += -lrte_bus_vdev
LDLIBS += -lrte_eal
LDLIBS += -lrte_ethdev
LDLIBS += -lrte_kvargs
+LDLIBS += -lrte_net

# Source files.
SRCS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += vdev_netvsc.c
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
index e895b32..3d8895b 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -3,17 +3,41 @@
* Copyright 2017 Mellanox Technologies, Ltd.
*/

+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <linux/sockios.h>
+#include <net/if.h>
+#include <netinet/ip.h>
+#include <stdarg.h>
#include <stddef.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/queue.h>
+#include <sys/socket.h>
+#include <unistd.h>

+#include <rte_alarm.h>
+#include <rte_bus.h>
#include <rte_bus_vdev.h>
#include <rte_common.h>
#include <rte_config.h>
+#include <rte_dev.h>
+#include <rte_errno.h>
+#include <rte_ethdev.h>
+#include <rte_ether.h>
#include <rte_kvargs.h>
#include <rte_log.h>

#define VDEV_NETVSC_DRIVER net_vdev_netvsc
#define VDEV_NETVSC_ARG_IFACE "iface"
#define VDEV_NETVSC_ARG_MAC "mac"
+#define VDEV_NETVSC_PROBE_MS 1000
+
+#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"

#define DRV_LOG(level, ...) \
rte_log(RTE_LOG_ ## level, \
@@ -25,12 +49,490 @@
/** Driver-specific log messages type. */
static int vdev_netvsc_logtype;

+/** Context structure for a vdev_netvsc instance. */
+struct vdev_netvsc_ctx {
+ LIST_ENTRY(vdev_netvsc_ctx) entry; /**< Next entry in list. */
+ unsigned int id; /**< ID used to generate unique names. */
+ char name[64]; /**< Unique name for vdev_netvsc instance. */
+ char devname[64]; /**< Fail-safe PMD instance name. */
+ char devargs[256]; /**< Fail-safe PMD instance device arguments. */
+ char if_name[IF_NAMESIZE]; /**< NetVSC netdevice name. */
+ unsigned int if_index; /**< NetVSC netdevice index. */
+ struct ether_addr if_addr; /**< NetVSC MAC address. */
+ int pipe[2]; /**< Communication pipe with fail-safe instance. */
+ char yield[256]; /**< Current device string used with fail-safe. */
+};
+
+/** Context list is common to all driver instances. */
+static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
+ LIST_HEAD_INITIALIZER(vdev_netvsc_ctx_list);
+
+/** Number of entries in context list. */
+static unsigned int vdev_netvsc_ctx_count;
+
/** Number of driver instances relying on context list. */
static unsigned int vdev_netvsc_ctx_inst;

/**
+ * Destroy a vdev_netvsc context instance.
+ *
+ * @param ctx
+ * Context to destroy.
+ */
+static void
+vdev_netvsc_ctx_destroy(struct vdev_netvsc_ctx *ctx)
+{
+ if (ctx->pipe[0] != -1)
+ close(ctx->pipe[0]);
+ if (ctx->pipe[1] != -1)
+ close(ctx->pipe[1]);
+ free(ctx);
+}
+
+/**
+ * Iterate over system network interfaces.
+ *
+ * This function runs a given callback function for each netdevice found on
+ * the system.
+ *
+ * @param func
+ * Callback function pointer. List traversal is aborted when this function
+ * returns a nonzero value.
+ * @param ...
+ * Variable parameter list passed as @p va_list to @p func.
+ *
+ * @return
+ * 0 when the entire list is traversed successfully, a negative error code
+ * in case of failure, or the nonzero value returned by @p func when list
+ * traversal is aborted.
+ */
+static int
+vdev_netvsc_foreach_iface(int (*func)(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap), ...)
+{
+ struct if_nameindex *iface = if_nameindex();
+ int s = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+ unsigned int i;
+ int ret = 0;
+
+ if (!iface) {
+ ret = -ENOBUFS;
+ DRV_LOG(ERR, "cannot retrieve system network interfaces");
+ goto error;
+ }
+ if (s == -1) {
+ ret = -errno;
+ DRV_LOG(ERR, "cannot open socket: %s", rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; iface[i].if_name; ++i) {
+ struct ifreq req;
+ struct ether_addr eth_addr;
+ va_list ap;
+
+ strncpy(req.ifr_name, iface[i].if_name, sizeof(req.ifr_name));
+ if (ioctl(s, SIOCGIFHWADDR, &req) == -1) {
+ DRV_LOG(WARNING, "cannot retrieve information about"
+ " interface \"%s\": %s",
+ req.ifr_name, rte_strerror(errno));
+ continue;
+ }
+ memcpy(eth_addr.addr_bytes, req.ifr_hwaddr.sa_data,
+ RTE_DIM(eth_addr.addr_bytes));
+ va_start(ap, func);
+ ret = func(&iface[i], &eth_addr, ap);
+ va_end(ap);
+ if (ret)
+ break;
+ }
+error:
+ if (s != -1)
+ close(s);
+ if (iface)
+ if_freenameindex(iface);
+ return ret;
+}
+
+/**
+ * Determine if a network interface is NetVSC.
+ *
+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ *
+ * @return
+ * A nonzero value when interface is detected as NetVSC. In case of error,
+ * rte_errno is updated and 0 returned.
+ */
+static int
+vdev_netvsc_iface_is_netvsc(const struct if_nameindex *iface)
+{
+ static const char temp[] = "/sys/class/net/%s/device/class_id";
+ char path[sizeof(temp) + IF_NAMESIZE];
+ FILE *f;
+ int ret;
+ int len = 0;
+
+ ret = snprintf(path, sizeof(path), temp, iface->if_name);
+ if (ret == -1 || (size_t)ret >= sizeof(path)) {
+ rte_errno = ENOBUFS;
+ return 0;
+ }
+ f = fopen(path, "r");
+ if (!f) {
+ rte_errno = errno;
+ return 0;
+ }
+ ret = fscanf(f, NETVSC_CLASS_ID "%n", &len);
+ if (ret == EOF)
+ rte_errno = errno;
+ ret = len == (int)strlen(NETVSC_CLASS_ID);
+ fclose(f);
+ return ret;
+}
+
+/**
+ * Retrieve network interface data from sysfs symbolic link.
+ *
+ * @param[out] buf
+ * Output data buffer.
+ * @param size
+ * Output buffer size.
+ * @param[in] if_name
+ * Netdevice name.
+ * @param[in] relpath
+ * Symbolic link path relative to netdevice sysfs entry.
+ *
+ * @return
+ * 0 on success, a negative error code otherwise.
+ */
+static int
+vdev_netvsc_sysfs_readlink(char *buf, size_t size, const char *if_name,
+ const char *relpath)
+{
+ int ret;
+
+ ret = snprintf(buf, size, "/sys/class/net/%s/%s", if_name, relpath);
+ if (ret == -1 || (size_t)ret >= size)
+ return -ENOBUFS;
+ ret = readlink(buf, buf, size);
+ if (ret == -1)
+ return -errno;
+ if ((size_t)ret >= size - 1)
+ return -ENOBUFS;
+ buf[ret] = '\0';
+ return 0;
+}
+
+/**
+ * Probe a network interface to associate with vdev_netvsc context.
+ *
+ * This function determines if the network device matches the properties of
+ * the NetVSC interface associated with the vdev_netvsc context and
+ * communicates its bus address to the fail-safe PMD instance if so.
+ *
+ * It is normally used with vdev_netvsc_foreach_iface().
+ *
+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ * @param[in] eth_addr
+ * MAC address associated with @p iface.
+ * @param ap
+ * Variable arguments list comprising:
+ *
+ * - struct vdev_netvsc_ctx *ctx:
+ * Context to associate network interface with.
+ *
+ * @return
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+vdev_netvsc_device_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ struct vdev_netvsc_ctx *ctx = va_arg(ap, struct vdev_netvsc_ctx *);
+ char buf[RTE_MAX(sizeof(ctx->yield), 256u)];
+ const char *addr;
+ size_t len;
+ int ret;
+
+ /* Skip non-matching or unwanted NetVSC interfaces. */
+ if (ctx->if_index == iface->if_index) {
+ if (!strcmp(ctx->if_name, iface->if_name))
+ return 0;
+ DRV_LOG(DEBUG,
+ "NetVSC interface \"%s\" (index %u) renamed \"%s\"",
+ ctx->if_name, ctx->if_index, iface->if_name);
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ return 0;
+ }
+ if (vdev_netvsc_iface_is_netvsc(iface))
+ return 0;
+ if (!is_same_ether_addr(eth_addr, &ctx->if_addr))
+ return 0;
+ /* Look for associated PCI device. */
+ ret = vdev_netvsc_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device/subsystem");
+ if (ret)
+ return 0;
+ addr = strrchr(buf, '/');
+ addr = addr ? addr + 1 : buf;
+ if (strcmp(addr, "pci"))
+ return 0;
+ ret = vdev_netvsc_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device");
+ if (ret)
+ return 0;
+ addr = strrchr(buf, '/');
+ addr = addr ? addr + 1 : buf;
+ len = strlen(addr);
+ if (!len)
+ return 0;
+ /* Send PCI device argument to fail-safe PMD instance. */
+ if (strcmp(addr, ctx->yield))
+ DRV_LOG(DEBUG, "associating PCI device \"%s\" with NetVSC"
+ " interface \"%s\" (index %u)", addr, ctx->if_name,
+ ctx->if_index);
+ memmove(buf, addr, len + 1);
+ addr = buf;
+ buf[len] = '\n';
+ ret = write(ctx->pipe[1], addr, len + 1);
+ buf[len] = '\0';
+ if (ret == -1) {
+ if (errno == EINTR || errno == EAGAIN)
+ return 1;
+ DRV_LOG(WARNING, "cannot associate PCI device name \"%s\" with"
+ " interface \"%s\": %s", addr, ctx->if_name,
+ rte_strerror(errno));
+ return 1;
+ }
+ if ((size_t)ret != len + 1) {
+ /*
+ * Attempt to override previous partial write, no need to
+ * recover if that fails.
+ */
+ ret = write(ctx->pipe[1], "\n", 1);
+ (void)ret;
+ return 1;
+ }
+ fsync(ctx->pipe[1]);
+ memcpy(ctx->yield, addr, len + 1);
+ return 1;
+}
+
+/**
+ * Alarm callback that regularly probes system network interfaces.
+ *
+ * This callback runs at a frequency determined by VDEV_NETVSC_PROBE_MS as
+ * long as a vdev_netvsc context instance exists.
+ *
+ * @param arg
+ * Ignored.
+ */
+static void
+vdev_netvsc_alarm(__rte_unused void *arg)
+{
+ struct vdev_netvsc_ctx *ctx;
+ int ret;
+
+ LIST_FOREACH(ctx, &vdev_netvsc_ctx_list, entry) {
+ ret = vdev_netvsc_foreach_iface(vdev_netvsc_device_probe, ctx);
+ if (ret)
+ break;
+ }
+ if (!vdev_netvsc_ctx_count)
+ return;
+ ret = rte_eal_alarm_set(VDEV_NETVSC_PROBE_MS * 1000,
+ vdev_netvsc_alarm, NULL);
+ if (ret < 0) {
+ DRV_LOG(ERR, "unable to reschedule alarm callback: %s",
+ rte_strerror(-ret));
+ }
+}
+
+/**
+ * Probe a NetVSC interface to generate a vdev_netvsc context from.
+ *
+ * This function instantiates vdev_netvsc contexts either for all NetVSC
+ * devices found on the system or only a subset provided as device
+ * arguments.
+ *
+ * It is normally used with vdev_netvsc_foreach_iface().
+ *
+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ * @param[in] eth_addr
+ * MAC address associated with @p iface.
+ * @param ap
+ * Variable arguments list comprising:
+ *
+ * - const char *name:
+ * Name associated with current driver instance.
+ *
+ * - struct rte_kvargs *kvargs:
+ * Device arguments provided to current driver instance.
+ *
+ * - unsigned int specified:
+ * Number of specific netdevices provided as device arguments.
+ *
+ * - unsigned int *matched:
+ * The number of specified netdevices matched by this function.
+ *
+ * @return
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+vdev_netvsc_netvsc_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ const char *name = va_arg(ap, const char *);
+ struct rte_kvargs *kvargs = va_arg(ap, struct rte_kvargs *);
+ unsigned int specified = va_arg(ap, unsigned int);
+ unsigned int *matched = va_arg(ap, unsigned int *);
+ unsigned int i;
+ struct vdev_netvsc_ctx *ctx;
+ int ret;
+
+ /* Probe all interfaces when none are specified. */
+ if (specified) {
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE)) {
+ if (!strcmp(pair->value, iface->if_name))
+ break;
+ } else if (!strcmp(pair->key, VDEV_NETVSC_ARG_MAC)) {
+ struct ether_addr tmp;
+
+ if (sscanf(pair->value,
+ "%" SCNx8 ":%" SCNx8 ":%" SCNx8 ":"
+ "%" SCNx8 ":%" SCNx8 ":%" SCNx8,
+ &tmp.addr_bytes[0],
+ &tmp.addr_bytes[1],
+ &tmp.addr_bytes[2],
+ &tmp.addr_bytes[3],
+ &tmp.addr_bytes[4],
+ &tmp.addr_bytes[5]) != 6) {
+ DRV_LOG(ERR,
+ "invalid MAC address format"
+ " \"%s\"",
+ pair->value);
+ return -EINVAL;
+ }
+ if (is_same_ether_addr(eth_addr, &tmp))
+ break;
+ }
+ }
+ if (i == kvargs->count)
+ return 0;
+ ++(*matched);
+ }
+ /* Weed out interfaces already handled. */
+ LIST_FOREACH(ctx, &vdev_netvsc_ctx_list, entry)
+ if (ctx->if_index == iface->if_index)
+ break;
+ if (ctx) {
+ if (!specified)
+ return 0;
+ DRV_LOG(WARNING,
+ "interface \"%s\" (index %u) is already handled,"
+ " skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ if (!vdev_netvsc_iface_is_netvsc(iface)) {
+ if (!specified)
+ return 0;
+ DRV_LOG(WARNING,
+ "interface \"%s\" (index %u) is not NetVSC,"
+ " skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ /* Create interface context. */
+ ctx = calloc(1, sizeof(*ctx));
+ if (!ctx) {
+ ret = -errno;
+ DRV_LOG(ERR, "cannot allocate context for interface \"%s\": %s",
+ iface->if_name, rte_strerror(errno));
+ goto error;
+ }
+ ctx->id = vdev_netvsc_ctx_count;
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ ctx->if_index = iface->if_index;
+ ctx->if_addr = *eth_addr;
+ ctx->pipe[0] = -1;
+ ctx->pipe[1] = -1;
+ ctx->yield[0] = '\0';
+ if (pipe(ctx->pipe) == -1) {
+ ret = -errno;
+ DRV_LOG(ERR,
+ "cannot allocate control pipe for interface \"%s\": %s",
+ ctx->if_name, rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; i != RTE_DIM(ctx->pipe); ++i) {
+ int flf = fcntl(ctx->pipe[i], F_GETFL);
+
+ if (flf != -1 &&
+ fcntl(ctx->pipe[i], F_SETFL, flf | O_NONBLOCK) != -1)
+ continue;
+ ret = -errno;
+ DRV_LOG(ERR, "cannot toggle non-blocking flag on control file"
+ " descriptor #%u (%d): %s", i, ctx->pipe[i],
+ rte_strerror(errno));
+ goto error;
+ }
+ /* Generate virtual device name and arguments. */
+ i = 0;
+ ret = snprintf(ctx->name, sizeof(ctx->name), "%s_id%u",
+ name, ctx->id);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->name))
+ ++i;
+ ret = snprintf(ctx->devname, sizeof(ctx->devname), "net_failsafe_%s",
+ ctx->name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devname))
+ ++i;
+ ret = snprintf(ctx->devargs, sizeof(ctx->devargs),
+ "fd(%d),dev(net_tap_%s,remote=%s)",
+ ctx->pipe[0], ctx->name, ctx->if_name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devargs))
+ ++i;
+ if (i) {
+ ret = -ENOBUFS;
+ DRV_LOG(ERR, "generated virtual device name or argument list"
+ " too long for interface \"%s\"", ctx->if_name);
+ goto error;
+ }
+ /* Request virtual device generation. */
+ DRV_LOG(DEBUG, "generating virtual device \"%s\" with arguments \"%s\"",
+ ctx->devname, ctx->devargs);
+ vdev_netvsc_foreach_iface(vdev_netvsc_device_probe, ctx);
+ ret = rte_eal_hotplug_add("vdev", ctx->devname, ctx->devargs);
+ if (ret)
+ goto error;
+ LIST_INSERT_HEAD(&vdev_netvsc_ctx_list, ctx, entry);
+ ++vdev_netvsc_ctx_count;
+ DRV_LOG(DEBUG, "added NetVSC interface \"%s\" to context list",
+ ctx->if_name);
+ return 0;
+error:
+ if (ctx)
+ vdev_netvsc_ctx_destroy(ctx);
+ return ret;
+}
+
+/**
* Probe NetVSC interfaces.
*
+ * This function probes system netdevices according to the specified device
+ * arguments and starts a periodic alarm callback to notify the resulting
+ * fail-safe PMD instances of their sub-devices' whereabouts.
+ *
* @param dev
* Virtual device context for driver instance.
*
@@ -49,12 +551,40 @@
const char *args = rte_vdev_device_args(dev);
struct rte_kvargs *kvargs = rte_kvargs_parse(args ? args : "",
vdev_netvsc_arg);
+ unsigned int specified = 0;
+ unsigned int matched = 0;
+ unsigned int i;
+ int ret;

DRV_LOG(DEBUG, "invoked as \"%s\", using arguments \"%s\"", name, args);
if (!kvargs) {
DRV_LOG(ERR, "cannot parse arguments list");
goto error;
}
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE) ||
+ !strcmp(pair->key, VDEV_NETVSC_ARG_MAC))
+ ++specified;
+ }
+ rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
+ /* Gather interfaces. */
+ ret = vdev_netvsc_foreach_iface(vdev_netvsc_netvsc_probe, name, kvargs,
+ specified, &matched);
+ if (ret < 0)
+ goto error;
+ if (matched < specified)
+ DRV_LOG(WARNING,
+ "some of the specified parameters did not match"
+ " recognized network interfaces");
+ ret = rte_eal_alarm_set(VDEV_NETVSC_PROBE_MS * 1000,
+ vdev_netvsc_alarm, NULL);
+ if (ret < 0) {
+ DRV_LOG(ERR, "unable to schedule alarm callback: %s",
+ rte_strerror(-ret));
+ goto error;
+ }
error:
if (kvargs)
rte_kvargs_free(kvargs);
@@ -65,6 +595,9 @@
/**
* Remove driver instance.
*
+ * The alarm callback and underlying vdev_netvsc context instances are only
+ * destroyed after the last PMD instance is removed.
+ *
* @param dev
* Virtual device context for driver instance.
*
@@ -74,7 +607,16 @@
static int
vdev_netvsc_vdev_remove(__rte_unused struct rte_vdev_device *dev)
{
- --vdev_netvsc_ctx_inst;
+ if (--vdev_netvsc_ctx_inst)
+ return 0;
+ rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
+ while (!LIST_EMPTY(&vdev_netvsc_ctx_list)) {
+ struct vdev_netvsc_ctx *ctx = LIST_FIRST(&vdev_netvsc_ctx_list);
+
+ LIST_REMOVE(ctx, entry);
+ --vdev_netvsc_ctx_count;
+ vdev_netvsc_ctx_destroy(ctx);
+ }
return 0;
}
--
1.8.3.1
Stephen Hemminger
2018-01-09 18:49:16 UTC
On Tue, 9 Jan 2018 14:47:30 +0000
Post by Adrien Mazarguil
As described in more details in the attached documentation (see patch
contents), this virtual device driver manages NetVSC interfaces in virtual
machines hosted by Hyper-V/Azure platforms.
This driver does not manage traffic nor Ethernet devices directly; it acts
as a thin configuration layer that automatically instantiates and controls
fail-safe PMD instances combining tap and PCI sub-devices, so that each
NetVSC interface is exposed as a single consolidated port to DPDK
applications.
PCI sub-devices being hot-pluggable (e.g. during VM migration),
applications automatically benefit from increased throughput when present
and automatic fallback on NetVSC otherwise without interruption thanks to
fail-safe's hot-plug handling.
Once initialized, the sole job of the vdev_netvsc driver is to regularly
scan for PCI devices to associate with NetVSC interfaces and feed their
addresses to corresponding fail-safe instances.
There is also the issue of how rescind is handled, but that may be more complex
than you want to deal with now. The host may rescind PCI devices for reasons
other than migration. For example, if the host needs to do a live upgrade of the
PF device driver (or firmware), it will rescind the VF device from all guests
and restore it after the upgrade.
Post by Adrien Mazarguil
diff --git a/drivers/net/vdev_netvsc/Makefile b/drivers/net/vdev_netvsc/Makefile
index 2fb059d..f2b2ac5 100644
--- a/drivers/net/vdev_netvsc/Makefile
+++ b/drivers/net/vdev_netvsc/Makefile
@@ -13,6 +13,9 @@ EXPORT_MAP := rte_pmd_vdev_netvsc_version.map
CFLAGS += -O3
CFLAGS += -g
CFLAGS += -std=c11 -pedantic -Wall -Wextra
+CFLAGS += -D_XOPEN_SOURCE=600
+CFLAGS += -D_BSD_SOURCE
+CFLAGS += -D_DEFAULT_SOURCE
These are kind of a nuisance; can't it just use the same CFLAGS as other code?
Post by Adrien Mazarguil
# Source files.
SRCS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += vdev_netvsc.c
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
index e895b32..3d8895b 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -3,17 +3,41 @@
* Copyright 2017 Mellanox Technologies, Ltd.
#define VDEV_NETVSC_DRIVER net_vdev_netvsc
#define VDEV_NETVSC_ARG_IFACE "iface"
#define VDEV_NETVSC_ARG_MAC "mac"
+#define VDEV_NETVSC_PROBE_MS 1000
+
+#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"
#define DRV_LOG(level, ...) \
rte_log(RTE_LOG_ ## level, \
@@ -25,12 +49,490 @@
/** Driver-specific log messages type. */
static int vdev_netvsc_logtype;
+/** Context structure for a vdev_netvsc instance. */
+struct vdev_netvsc_ctx {
+ LIST_ENTRY(vdev_netvsc_ctx) entry; /**< Next entry in list. */
+ unsigned int id; /**< ID used to generate unique names. */
+ char name[64]; /**< Unique name for vdev_netvsc instance. */
+ char devname[64]; /**< Fail-safe PMD instance name. */
+ char devargs[256]; /**< Fail-safe PMD instance device arguments. */
+ char if_name[IF_NAMESIZE]; /**< NetVSC netdevice name. */
+ unsigned int if_index; /**< NetVSC netdevice index. */
+ struct ether_addr if_addr; /**< NetVSC MAC address. */
+ int pipe[2]; /**< Communication pipe with fail-safe instance. */
+ char yield[256]; /**< Current device string used with fail-safe. */
+};
Please align comments.
Post by Adrien Mazarguil
+/** Context list is common to all driver instances. */
+static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
+ LIST_HEAD_INITIALIZER(vdev_netvsc_ctx_list);
+
+/** Number of entries in context list. */
+static unsigned int vdev_netvsc_ctx_count;
+
/** Number of driver instances relying on context list. */
static unsigned int vdev_netvsc_ctx_inst;
/**
+ * Destroy a vdev_netvsc context instance.
+ *
+ * @param ctx
+ * Context to destroy.
+ */
+static void
+vdev_netvsc_ctx_destroy(struct vdev_netvsc_ctx *ctx)
+{
+ if (ctx->pipe[0] != -1)
+ close(ctx->pipe[0]);
+ if (ctx->pipe[1] != -1)
+ close(ctx->pipe[1]);
+ free(ctx);
+}
+
+/**
+ * Iterate over system network interfaces.
+ *
+ * This function runs a given callback function for each netdevice found on
+ * the system.
+ *
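+ * @param func
+ * Callback function pointer. List traversal is aborted when this function
+ * returns a nonzero value.
+ * @param ...
+ * Variable parameter list passed as @p va_list to @p func.
+ *
+ * @return
+ * 0 when the entire list is traversed successfully, a negative error code
+ * in case of failure, or the nonzero value returned by @p func when list
+ * traversal is aborted.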
+ */
+static int
+vdev_netvsc_foreach_iface(int (*func)(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap), ...)
+{
+ struct if_nameindex *iface = if_nameindex();
+ int s = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+ unsigned int i;
+ int ret = 0;
+
+ if (!iface) {
+ ret = -ENOBUFS;
+ DRV_LOG(ERR, "cannot retrieve system network interfaces");
+ goto error;
+ }
+ if (s == -1) {
+ ret = -errno;
+ DRV_LOG(ERR, "cannot open socket: %s", rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; iface[i].if_name; ++i) {
+ struct ifreq req;
+ struct ether_addr eth_addr;
+ va_list ap;
+
+ strncpy(req.ifr_name, iface[i].if_name, sizeof(req.ifr_name));
+ if (ioctl(s, SIOCGIFHWADDR, &req) == -1) {
+ DRV_LOG(WARNING, "cannot retrieve information about"
+ " interface \"%s\": %s",
+ req.ifr_name, rte_strerror(errno));
+ continue;
+ }
Skip non-Ethernet interfaces, i.e. those whose address length != 6.
Post by Adrien Mazarguil
+ memcpy(eth_addr.addr_bytes, req.ifr_hwaddr.sa_data,
+ RTE_DIM(eth_addr.addr_bytes));
+ va_start(ap, func);
+ ret = func(&iface[i], &eth_addr, ap);
+ va_end(ap);
+ if (ret)
+ break;
+ }
+error:
+ if (s != -1)
+ close(s);
+ if (iface)
+ if_freenameindex(iface);
+ return ret;
+}
+
+/**
+ * Determine if a network interface is NetVSC.
+ *
+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ *
+ * @return
+ * A nonzero value when interface is detected as NetVSC. In case of error,
+ * rte_errno is updated and 0 returned.
+ */
+static int
+vdev_netvsc_iface_is_netvsc(const struct if_nameindex *iface)
+{
+ static const char temp[] = "/sys/class/net/%s/device/class_id";
+ char path[sizeof(temp) + IF_NAMESIZE];
+ FILE *f;
+ int ret;
+ int len = 0;
+
+ ret = snprintf(path, sizeof(path), temp, iface->if_name);
+ if (ret == -1 || (size_t)ret >= sizeof(path)) {
+ rte_errno = ENOBUFS;
+ return 0;
+ }
+ f = fopen(path, "r");
+ if (!f) {
+ rte_errno = errno;
+ return 0;
+ }
+ ret = fscanf(f, NETVSC_CLASS_ID "%n", &len);
This is a different way to compare UUIDs; maybe use fgets() and uuid_compare()?
Post by Adrien Mazarguil
+ if (ret == EOF)
+ rte_errno = errno;
+ ret = len == (int)strlen(NETVSC_CLASS_ID);
+ fclose(f);
+ return ret;
+}
+
+/**
+ * Retrieve network interface data from sysfs symbolic link.
+ *
+ * @param[out] buf
+ * Output data buffer.
+ * @param size
+ * Output buffer size.
+ * @param[in] if_name
+ * Netdevice name.
+ * @param[in] relpath
+ * Symbolic link path relative to netdevice sysfs entry.
+ *
+ * @return
+ * 0 on success, a negative error code otherwise.
+ */
+static int
+vdev_netvsc_sysfs_readlink(char *buf, size_t size, const char *if_name,
+ const char *relpath)
+{
+ int ret;
+
+ ret = snprintf(buf, size, "/sys/class/net/%s/%s", if_name, relpath);
+ if (ret == -1 || (size_t)ret >= size)
+ return -ENOBUFS;
+ ret = readlink(buf, buf, size);
+ if (ret == -1)
+ return -errno;
+ if ((size_t)ret >= size - 1)
+ return -ENOBUFS;
+ buf[ret] = '\0';
+ return 0;
+}
You might find it easier to look at the directory
/sys/bus/vmbus/drivers/hv_netvsc/
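A sketch of what scanning that directory could look like. It assumes each bound device entry contains a "net" subdirectory holding one entry per netdevice, which is the usual sysfs layout but is not confirmed anywhere in this thread. The helper name netvsc_foreach_bound() and its callback signature are hypothetical:

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: for every device bound under drv_dir (for instance
 * "/sys/bus/vmbus/drivers/hv_netvsc"), call cb() with the name of each
 * netdevice found in its "net" subdirectory. cb may be NULL. Returns the
 * number of netdevices found, or -1 when drv_dir cannot be opened.
 */
static int
netvsc_foreach_bound(const char *drv_dir,
		     void (*cb)(const char *dev, const char *if_name,
				void *arg),
		     void *arg)
{
	DIR *drv = opendir(drv_dir);
	struct dirent *dev;
	int count = 0;

	if (!drv)
		return -1;
	while ((dev = readdir(drv))) {
		char path[512];
		struct dirent *nd;
		DIR *net;
		int ret;

		if (dev->d_name[0] == '.')
			continue;
		ret = snprintf(path, sizeof(path), "%s/%s/net",
			       drv_dir, dev->d_name);
		if (ret < 0 || (size_t)ret >= sizeof(path))
			continue;
		/* Attribute files such as bind/unbind have no net/ child. */
		net = opendir(path);
		if (!net)
			continue;
		while ((nd = readdir(net))) {
			if (nd->d_name[0] == '.')
				continue;
			if (cb)
				cb(dev->d_name, nd->d_name, arg);
			++count;
		}
		closedir(net);
	}
	closedir(drv);
	return count;
}
```

This would replace both the class_id check and the per-interface readlink() calls for the NetVSC side; locating the matching VF would still need the MAC address comparison.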
Post by Adrien Mazarguil
+
+/**
+ * Probe a network interface to associate with vdev_netvsc context.
+ *
+ * This function determines if the network device matches the properties of
+ * the NetVSC interface associated with the vdev_netvsc context and
+ * communicates its bus address to the fail-safe PMD instance if so.
+ *
+ * It is normally used with vdev_netvsc_foreach_iface().
+ *
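+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ * @param[in] eth_addr
+ * MAC address associated with @p iface.
+ * @param ap
+ * Variable arguments list comprising:
+ *
+ * - struct vdev_netvsc_ctx *ctx:
+ * Context to associate network interface with.
+ *
+ * @return
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.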
+ */
+static int
+vdev_netvsc_device_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ struct vdev_netvsc_ctx *ctx = va_arg(ap, struct vdev_netvsc_ctx *);
+ char buf[RTE_MAX(sizeof(ctx->yield), 256u)];
+ const char *addr;
+ size_t len;
+ int ret;
+
+ /* Skip non-matching or unwanted NetVSC interfaces. */
+ if (ctx->if_index == iface->if_index) {
+ if (!strcmp(ctx->if_name, iface->if_name))
+ return 0;
+ DRV_LOG(DEBUG,
+ "NetVSC interface \"%s\" (index %u) renamed \"%s\"",
+ ctx->if_name, ctx->if_index, iface->if_name);
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ return 0;
+ }
+ if (vdev_netvsc_iface_is_netvsc(iface))
+ return 0;
+ if (!is_same_ether_addr(eth_addr, &ctx->if_addr))
+ return 0;
+ /* Look for associated PCI device. */
+ ret = vdev_netvsc_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device/subsystem");
+ if (ret)
+ return 0;
+ addr = strrchr(buf, '/');
+ addr = addr ? addr + 1 : buf;
+ if (strcmp(addr, "pci"))
+ return 0;
+ ret = vdev_netvsc_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device");
+ if (ret)
+ return 0;
+ addr = strrchr(buf, '/');
+ addr = addr ? addr + 1 : buf;
+ len = strlen(addr);
+ if (!len)
+ return 0;
+ /* Send PCI device argument to fail-safe PMD instance. */
+ if (strcmp(addr, ctx->yield))
+ DRV_LOG(DEBUG, "associating PCI device \"%s\" with NetVSC"
+ " interface \"%s\" (index %u)", addr, ctx->if_name,
+ ctx->if_index);
+ memmove(buf, addr, len + 1);
+ addr = buf;
+ buf[len] = '\n';
+ ret = write(ctx->pipe[1], addr, len + 1);
+ buf[len] = '\0';
+ if (ret == -1) {
+ if (errno == EINTR || errno == EAGAIN)
+ return 1;
+ DRV_LOG(WARNING, "cannot associate PCI device name \"%s\" with"
+ " interface \"%s\": %s", addr, ctx->if_name,
+ rte_strerror(errno));
+ return 1;
+ }
+ if ((size_t)ret != len + 1) {
+ /*
+ * Attempt to override previous partial write, no need to
+ * recover if that fails.
+ */
+ ret = write(ctx->pipe[1], "\n", 1);
+ (void)ret;
+ return 1;
+ }
+ fsync(ctx->pipe[1]);
+ memcpy(ctx->yield, addr, len + 1);
+ return 1;
+}
+
+/**
+ * Alarm callback that regularly probes system network interfaces.
+ *
+ * This callback runs at a frequency determined by VDEV_NETVSC_PROBE_MS as
+ * long as a vdev_netvsc context instance exists.
+ *
+ * @param arg
+ * Ignored.
+ */
+static void
+vdev_netvsc_alarm(__rte_unused void *arg)
+{
+ struct vdev_netvsc_ctx *ctx;
+ int ret;
+
+ LIST_FOREACH(ctx, &vdev_netvsc_ctx_list, entry) {
+ ret = vdev_netvsc_foreach_iface(vdev_netvsc_device_probe, ctx);
+ if (ret)
+ break;
+ }
+ if (!vdev_netvsc_ctx_count)
+ return;
+ ret = rte_eal_alarm_set(VDEV_NETVSC_PROBE_MS * 1000,
+ vdev_netvsc_alarm, NULL);
+ if (ret < 0) {
+ DRV_LOG(ERR, "unable to reschedule alarm callback: %s",
+ rte_strerror(-ret));
+ }
+}
Why not use netlink uevent?
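For reference, a minimal sketch of the uevent alternative: a NETLINK_KOBJECT_UEVENT socket bound to the kernel multicast group, plus a lookup over the NUL-separated "KEY=value" strings of a received message. The helper names uevent_open() and uevent_get() are hypothetical, and wiring the descriptor into the EAL interrupt thread is left out:

```c
#include <linux/netlink.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a socket subscribed to kernel uevents; returns a fd or -1. */
static int
uevent_open(void)
{
	struct sockaddr_nl addr;
	int fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT);

	if (fd == -1)
		return -1;
	memset(&addr, 0, sizeof(addr));
	addr.nl_family = AF_NETLINK;
	addr.nl_groups = 1; /* Kernel uevent multicast group. */
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Look up key in a uevent message made of NUL-separated "KEY=value"
 * strings. msg must end with a NUL byte. Returns a pointer to the value,
 * or NULL when the key is absent.
 */
static const char *
uevent_get(const char *msg, size_t len, const char *key)
{
	size_t klen = strlen(key);
	size_t off = 0;

	while (off < len) {
		const char *s = msg + off;
		size_t slen = strnlen(s, len - off);

		if (slen > klen && !strncmp(s, key, klen) && s[klen] == '=')
			return s + klen + 1;
		off += slen + 1;
	}
	return NULL;
}
```

A caller would recv() into a buffer, NUL-terminate it, and rerun the device probe only when SUBSYSTEM is "net" and ACTION is "add" or "remove", instead of rescanning every VDEV_NETVSC_PROBE_MS milliseconds.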
Post by Adrien Mazarguil
+/**
+ * Probe a NetVSC interface to generate a vdev_netvsc context from.
+ *
+ * This function instantiates vdev_netvsc contexts either for all NetVSC
+ * devices found on the system or only a subset provided as device
+ * arguments.
+ *
+ * It is normally used with vdev_netvsc_foreach_iface().
+ *
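+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ * @param[in] eth_addr
+ * MAC address associated with @p iface.
+ * @param ap
+ * Variable arguments list comprising:
+ *
+ * - const char *name:
+ * Name associated with current driver instance.
+ *
+ * - struct rte_kvargs *kvargs:
+ * Device arguments provided to current driver instance.
+ *
+ * - unsigned int specified:
+ * Number of specific netdevices provided as device arguments.
+ *
+ * - unsigned int *matched:
+ * The number of specified netdevices matched by this function.
+ *
+ * @return
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.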
+ */
+static int
+vdev_netvsc_netvsc_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ const char *name = va_arg(ap, const char *);
+ struct rte_kvargs *kvargs = va_arg(ap, struct rte_kvargs *);
+ unsigned int specified = va_arg(ap, unsigned int);
+ unsigned int *matched = va_arg(ap, unsigned int *);
+ unsigned int i;
+ struct vdev_netvsc_ctx *ctx;
+ int ret;
+
+ /* Probe all interfaces when none are specified. */
+ if (specified) {
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE)) {
+ if (!strcmp(pair->value, iface->if_name))
+ break;
+ } else if (!strcmp(pair->key, VDEV_NETVSC_ARG_MAC)) {
+ struct ether_addr tmp;
+
+ if (sscanf(pair->value,
+ "%" SCNx8 ":%" SCNx8 ":%" SCNx8 ":"
+ "%" SCNx8 ":%" SCNx8 ":%" SCNx8,
+ &tmp.addr_bytes[0],
+ &tmp.addr_bytes[1],
+ &tmp.addr_bytes[2],
+ &tmp.addr_bytes[3],
+ &tmp.addr_bytes[4],
+ &tmp.addr_bytes[5]) != 6) {
+ DRV_LOG(ERR,
+ "invalid MAC address format"
+ " \"%s\"",
+ pair->value);
+ return -EINVAL;
+ }
+ if (is_same_ether_addr(eth_addr, &tmp))
+ break;
+ }
+ }
+ if (i == kvargs->count)
+ return 0;
+ ++(*matched);
+ }
+ /* Weed out interfaces already handled. */
+ LIST_FOREACH(ctx, &vdev_netvsc_ctx_list, entry)
+ if (ctx->if_index == iface->if_index)
+ break;
+ if (ctx) {
+ if (!specified)
+ return 0;
+ DRV_LOG(WARNING,
+ "interface \"%s\" (index %u) is already handled,"
+ " skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ if (!vdev_netvsc_iface_is_netvsc(iface)) {
+ if (!specified)
+ return 0;
+ DRV_LOG(WARNING,
+ "interface \"%s\" (index %u) is not NetVSC,"
+ " skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ /* Create interface context. */
+ ctx = calloc(1, sizeof(*ctx));
+ if (!ctx) {
+ ret = -errno;
+ DRV_LOG(ERR, "cannot allocate context for interface \"%s\": %s",
+ iface->if_name, rte_strerror(errno));
+ goto error;
+ }
+ ctx->id = vdev_netvsc_ctx_count;
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ ctx->if_index = iface->if_index;
+ ctx->if_addr = *eth_addr;
+ ctx->pipe[0] = -1;
+ ctx->pipe[1] = -1;
+ ctx->yield[0] = '\0';
+ if (pipe(ctx->pipe) == -1) {
+ ret = -errno;
+ DRV_LOG(ERR,
+ "cannot allocate control pipe for interface \"%s\": %s",
+ ctx->if_name, rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; i != RTE_DIM(ctx->pipe); ++i) {
+ int flf = fcntl(ctx->pipe[i], F_GETFL);
+
+ if (flf != -1 &&
+ fcntl(ctx->pipe[i], F_SETFL, flf | O_NONBLOCK) != -1)
+ continue;
+ ret = -errno;
+ DRV_LOG(ERR, "cannot toggle non-blocking flag on control file"
+ " descriptor #%u (%d): %s", i, ctx->pipe[i],
+ rte_strerror(errno));
+ goto error;
+ }
+ /* Generate virtual device name and arguments. */
+ i = 0;
+ ret = snprintf(ctx->name, sizeof(ctx->name), "%s_id%u",
+ name, ctx->id);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->name))
+ ++i;
+ ret = snprintf(ctx->devname, sizeof(ctx->devname), "net_failsafe_%s",
+ ctx->name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devname))
+ ++i;
+ ret = snprintf(ctx->devargs, sizeof(ctx->devargs),
+ "fd(%d),dev(net_tap_%s,remote=%s)",
+ ctx->pipe[0], ctx->name, ctx->if_name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devargs))
+ ++i;
+ if (i) {
+ ret = -ENOBUFS;
+ DRV_LOG(ERR, "generated virtual device name or argument list"
+ " too long for interface \"%s\"", ctx->if_name);
+ goto error;
+ }
+ /* Request virtual device generation. */
+ DRV_LOG(DEBUG, "generating virtual device \"%s\" with arguments \"%s\"",
+ ctx->devname, ctx->devargs);
+ vdev_netvsc_foreach_iface(vdev_netvsc_device_probe, ctx);
+ ret = rte_eal_hotplug_add("vdev", ctx->devname, ctx->devargs);
+ if (ret)
+ goto error;
+ LIST_INSERT_HEAD(&vdev_netvsc_ctx_list, ctx, entry);
+ ++vdev_netvsc_ctx_count;
+ DRV_LOG(DEBUG, "added NetVSC interface \"%s\" to context list",
+ ctx->if_name);
+ return 0;
+error:
+ if (ctx)
+ vdev_netvsc_ctx_destroy(ctx);
+ return ret;
+}
+
+/**
* Probe NetVSC interfaces.
*
+ * This function probes system netdevices according to the specified device
+ * arguments and starts a periodic alarm callback to notify the resulting
+ * fail-safe PMD instances of their sub-devices whereabouts.
+ *
* Virtual device context for driver instance.
*
@@ -49,12 +551,40 @@
const char *args = rte_vdev_device_args(dev);
struct rte_kvargs *kvargs = rte_kvargs_parse(args ? args : "",
vdev_netvsc_arg);
+ unsigned int specified = 0;
+ unsigned int matched = 0;
+ unsigned int i;
+ int ret;
DRV_LOG(DEBUG, "invoked as \"%s\", using arguments \"%s\"", name, args);
if (!kvargs) {
DRV_LOG(ERR, "cannot parse arguments list");
goto error;
}
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE) ||
+ !strcmp(pair->key, VDEV_NETVSC_ARG_MAC))
+ ++specified;
+ }
+ rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
+ /* Gather interfaces. */
+ ret = vdev_netvsc_foreach_iface(vdev_netvsc_netvsc_probe, name, kvargs,
+ specified, &matched);
+ if (ret < 0)
+ goto error;
+ if (matched < specified)
+ DRV_LOG(WARNING,
+ "some of the specified parameters did not match"
+ " recognized network interfaces");
+ ret = rte_eal_alarm_set(VDEV_NETVSC_PROBE_MS * 1000,
+ vdev_netvsc_alarm, NULL);
+ if (ret < 0) {
+ DRV_LOG(ERR, "unable to schedule alarm callback: %s",
+ rte_strerror(-ret));
+ goto error;
+ }
if (kvargs)
rte_kvargs_free(kvargs);
@@ -65,6 +595,9 @@
/**
* Remove driver instance.
*
+ * The alarm callback and underlying vdev_netvsc context instances are only
+ * destroyed after the last PMD instance is removed.
+ *
* Virtual device context for driver instance.
*
@@ -74,7 +607,16 @@
static int
vdev_netvsc_vdev_remove(__rte_unused struct rte_vdev_device *dev)
{
- --vdev_netvsc_ctx_inst;
+ if (--vdev_netvsc_ctx_inst)
+ return 0;
+ rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
+ while (!LIST_EMPTY(&vdev_netvsc_ctx_list)) {
+ struct vdev_netvsc_ctx *ctx = LIST_FIRST(&vdev_netvsc_ctx_list);
+
+ LIST_REMOVE(ctx, entry);
+ --vdev_netvsc_ctx_count;
+ vdev_netvsc_ctx_destroy(ctx);
+ }
return 0;
}
Matan Azrad
2018-01-10 15:02:59 UTC
Permalink
Hi Stephen

Thank you for this quick review, please see some comments.

From: Stephen Hemminger, Tuesday, January 9, 2018 8:49 PM
Post by Stephen Hemminger
On Tue, 9 Jan 2018 14:47:30 +0000
Post by Adrien Mazarguil
As described in more details in the attached documentation (see patch
contents), this virtual device driver manages NetVSC interfaces in
virtual machines hosted by Hyper-V/Azure platforms.
This driver does not manage traffic nor Ethernet devices directly; it
acts as a thin configuration layer that automatically instantiates and
controls fail-safe PMD instances combining tap and PCI sub-devices, so
that each NetVSC interface is exposed as a single consolidated port to
DPDK applications.
PCI sub-devices being hot-pluggable (e.g. during VM migration),
applications automatically benefit from increased throughput when
present and automatic fallback on NetVSC otherwise without
interruption thanks to fail-safe's hot-plug handling.
Once initialized, the sole job of the vdev_netvsc driver is to
regularly scan for PCI devices to associate with NetVSC interfaces and
feed their addresses to corresponding fail-safe instances.
There is also the issue of how rescind is handled, but that may be more
complex than you want to deal with now. Host may rescind PCI devices for
other reasons than migration. For example, if host needs to do live upgrade
of PF device driver on host (or firmware); then it will rescind VF device from
all guests and then restore it after upgrade.
Post by Adrien Mazarguil
diff --git a/drivers/net/vdev_netvsc/Makefile
b/drivers/net/vdev_netvsc/Makefile
index 2fb059d..f2b2ac5 100644
--- a/drivers/net/vdev_netvsc/Makefile
+++ b/drivers/net/vdev_netvsc/Makefile
@@ -13,6 +13,9 @@ EXPORT_MAP := rte_pmd_vdev_netvsc_version.map
CFLAGS += -O3
CFLAGS += -g
CFLAGS += -std=c11 -pedantic -Wall -Wextra
+CFLAGS += -D_XOPEN_SOURCE=600
+CFLAGS += -D_BSD_SOURCE
+CFLAGS += -D_DEFAULT_SOURCE
These are kind of a nuisance, can't it just use same CFLAGS as other code?
Will check.
Post by Stephen Hemminger
Post by Adrien Mazarguil
# Source files.
SRCS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += vdev_netvsc.c
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
index e895b32..3d8895b 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -3,17 +3,41 @@
* Copyright 2017 Mellanox Technologies, Ltd.
#define VDEV_NETVSC_DRIVER net_vdev_netvsc
#define VDEV_NETVSC_ARG_IFACE "iface"
#define VDEV_NETVSC_ARG_MAC "mac"
+#define VDEV_NETVSC_PROBE_MS 1000
+
+#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"
#define DRV_LOG(level, ...) \
rte_log(RTE_LOG_ ## level, \
@@ -25,12 +49,490 @@
/** Driver-specific log messages type. */
static int vdev_netvsc_logtype;
+/** Context structure for a vdev_netvsc instance. */
+struct vdev_netvsc_ctx {
+ LIST_ENTRY(vdev_netvsc_ctx) entry; /**< Next entry in list. */
+ unsigned int id; /**< ID used to generate unique names. */
+ char name[64]; /**< Unique name for vdev_netvsc instance. */
+ char devname[64]; /**< Fail-safe PMD instance name. */
+ char devargs[256]; /**< Fail-safe PMD instance device arguments. */
+ char if_name[IF_NAMESIZE]; /**< NetVSC netdevice name. */
+ unsigned int if_index; /**< NetVSC netdevice index. */
+ struct ether_addr if_addr; /**< NetVSC MAC address. */
+ int pipe[2]; /**< Communication pipe with fail-safe instance. */
+ char yield[256]; /**< Current device string used with fail-safe. */
+};
Please align comments.
Sure.
Post by Stephen Hemminger
Post by Adrien Mazarguil
+/** Context list is common to all driver instances. */
+static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
+ LIST_HEAD_INITIALIZER(vdev_netvsc_ctx_list);
+
+/** Number of entries in context list. */
+static unsigned int vdev_netvsc_ctx_count;
+
/** Number of driver instances relying on context list. */
static unsigned int vdev_netvsc_ctx_inst;
/**
+ * Destroy a vdev_netvsc context instance.
+ *
+ * Context to destroy.
+ */
+static void
+vdev_netvsc_ctx_destroy(struct vdev_netvsc_ctx *ctx)
+{
+ if (ctx->pipe[0] != -1)
+ close(ctx->pipe[0]);
+ if (ctx->pipe[1] != -1)
+ close(ctx->pipe[1]);
+ free(ctx);
+}
+
+/**
+ * Iterate over system network interfaces.
+ *
+ * This function runs a given callback function for each netdevice found on
+ * the system.
+ *
+ * Callback function pointer. List traversal is aborted when this function
+ * returns a nonzero value.
+ *
+ * 0 when the entire list is traversed successfully, a negative error code
+ * traversal is aborted.
+ */
+static int
+vdev_netvsc_foreach_iface(int (*func)(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap), ...)
+{
+ struct if_nameindex *iface = if_nameindex();
+ int s = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+ unsigned int i;
+ int ret = 0;
+
+ if (!iface) {
+ ret = -ENOBUFS;
+ DRV_LOG(ERR, "cannot retrieve system network interfaces");
+ goto error;
+ }
+ if (s == -1) {
+ ret = -errno;
+ DRV_LOG(ERR, "cannot open socket: %s", rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; iface[i].if_name; ++i) {
+ struct ifreq req;
+ struct ether_addr eth_addr;
+ va_list ap;
+
+ strncpy(req.ifr_name, iface[i].if_name, sizeof(req.ifr_name));
+ if (ioctl(s, SIOCGIFHWADDR, &req) == -1) {
+ DRV_LOG(WARNING, "cannot retrieve information about"
+ " interface \"%s\": %s",
+ req.ifr_name, rte_strerror(errno));
+ continue;
+ }
Skip non-ethernet interfaces where addr length != 6
Will check.
Post by Stephen Hemminger
Post by Adrien Mazarguil
+ memcpy(eth_addr.addr_bytes, req.ifr_hwaddr.sa_data,
+ RTE_DIM(eth_addr.addr_bytes));
+ va_start(ap, func);
+ ret = func(&iface[i], &eth_addr, ap);
+ va_end(ap);
+ if (ret)
+ break;
+ }
+ if (s != -1)
+ close(s);
+ if (iface)
+ if_freenameindex(iface);
+ return ret;
+}
+
+/**
+ * Determine if a network interface is NetVSC.
+ *
+ * Pointer to netdevice description structure (name and index).
+ *
+ * A nonzero value when interface is detected as NetVSC. In case of error,
+ * rte_errno is updated and 0 returned.
+ */
+static int
+vdev_netvsc_iface_is_netvsc(const struct if_nameindex *iface)
+{
+ static const char temp[] = "/sys/class/net/%s/device/class_id";
+ char path[sizeof(temp) + IF_NAMESIZE];
+ FILE *f;
+ int ret;
+ int len = 0;
+
+ ret = snprintf(path, sizeof(path), temp, iface->if_name);
+ if (ret == -1 || (size_t)ret >= sizeof(path)) {
+ rte_errno = ENOBUFS;
+ return 0;
+ }
+ f = fopen(path, "r");
+ if (!f) {
+ rte_errno = errno;
+ return 0;
+ }
+ ret = fscanf(f, NETVSC_CLASS_ID "%n", &len);
This is different way to compare uuid, maybe use fgets() and uuid_compare?
Different and nice.
I don't see a reason to replace it.
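For readers unfamiliar with the idiom being discussed, here is a minimal standalone sketch of how a literal format string followed by %n performs the comparison. It uses sscanf() on a string instead of fscanf() on the sysfs file so that it can be tried anywhere; class_id_matches() is a hypothetical name:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"

/*
 * Hypothetical helper: return 1 when text starts with the NetVSC class ID.
 * Literal characters in a scanf format must match the input exactly, and
 * %n is only reached (and len only written) once all of them matched.
 */
static int
class_id_matches(const char *text)
{
	int len = 0;
	int ret;

	assert(text != NULL);
	ret = sscanf(text, NETVSC_CLASS_ID "%n", &len);
	(void)ret; /* No conversions are counted; only len matters. */
	return len == (int)strlen(NETVSC_CLASS_ID);
}
```

Note that this is a prefix match: trailing data such as the newline found in sysfs files is ignored.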
Post by Stephen Hemminger
Post by Adrien Mazarguil
+ if (ret == EOF)
+ rte_errno = errno;
+ ret = len == (int)strlen(NETVSC_CLASS_ID);
+ fclose(f);
+ return ret;
+}
+
+/**
+ * Retrieve network interface data from sysfs symbolic link.
+ *
+ * Output data buffer.
+ * Output buffer size.
+ * Netdevice name.
+ * Symbolic link path relative to netdevice sysfs entry.
+ *
+ * 0 on success, a negative error code otherwise.
+ */
+static int
+vdev_netvsc_sysfs_readlink(char *buf, size_t size, const char *if_name,
+ const char *relpath)
+{
+ int ret;
+
+ ret = snprintf(buf, size, "/sys/class/net/%s/%s", if_name, relpath);
+ if (ret == -1 || (size_t)ret >= size)
+ return -ENOBUFS;
+ ret = readlink(buf, buf, size);
+ if (ret == -1)
+ return -errno;
+ if ((size_t)ret >= size - 1)
+ return -ENOBUFS;
+ buf[ret] = '\0';
+ return 0;
+}
You might find it easier to look at directory.
/sys/bus/vmbus/drivers/hv_netvsc/
This driver also allows running on a regular netdevice instead of NetVSC (as described in the doc) for debugging purposes, even on a non-Hyper-V machine,
so relying on that directory does not make sense here.
Post by Stephen Hemminger
Post by Adrien Mazarguil
+
+/**
+ * Probe a network interface to associate with vdev_netvsc context.
+ *
+ * This function determines if the network device matches the properties of
+ * the NetVSC interface associated with the vdev_netvsc context and
+ * communicates its bus address to the fail-safe PMD instance if so.
+ *
+ * It is normally used with vdev_netvsc_foreach_iface().
+ *
+ * Pointer to netdevice description structure (name and index).
+ *
+ * Context to associate network interface with.
+ *
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+vdev_netvsc_device_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ struct vdev_netvsc_ctx *ctx = va_arg(ap, struct vdev_netvsc_ctx *);
+ char buf[RTE_MAX(sizeof(ctx->yield), 256u)];
+ const char *addr;
+ size_t len;
+ int ret;
+
+ /* Skip non-matching or unwanted NetVSC interfaces. */
+ if (ctx->if_index == iface->if_index) {
+ if (!strcmp(ctx->if_name, iface->if_name))
+ return 0;
+ DRV_LOG(DEBUG,
+ "NetVSC interface \"%s\" (index %u) renamed \"%s\"",
+ ctx->if_name, ctx->if_index, iface->if_name);
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ return 0;
+ }
+ if (vdev_netvsc_iface_is_netvsc(iface))
+ return 0;
+ if (!is_same_ether_addr(eth_addr, &ctx->if_addr))
+ return 0;
+ /* Look for associated PCI device. */
+ ret = vdev_netvsc_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device/subsystem");
+ if (ret)
+ return 0;
+ addr = strrchr(buf, '/');
+ addr = addr ? addr + 1 : buf;
+ if (strcmp(addr, "pci"))
+ return 0;
+ ret = vdev_netvsc_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device");
+ if (ret)
+ return 0;
+ addr = strrchr(buf, '/');
+ addr = addr ? addr + 1 : buf;
+ len = strlen(addr);
+ if (!len)
+ return 0;
+ /* Send PCI device argument to fail-safe PMD instance. */
+ if (strcmp(addr, ctx->yield))
+ DRV_LOG(DEBUG, "associating PCI device \"%s\" with NetVSC"
+ " interface \"%s\" (index %u)", addr, ctx->if_name,
+ ctx->if_index);
+ memmove(buf, addr, len + 1);
+ addr = buf;
+ buf[len] = '\n';
+ ret = write(ctx->pipe[1], addr, len + 1);
+ buf[len] = '\0';
+ if (ret == -1) {
+ if (errno == EINTR || errno == EAGAIN)
+ return 1;
+ DRV_LOG(WARNING, "cannot associate PCI device name \"%s\" with"
+ " interface \"%s\": %s", addr, ctx->if_name,
+ rte_strerror(errno));
+ return 1;
+ }
+ if ((size_t)ret != len + 1) {
+ /*
+ * Attempt to override previous partial write, no need to
+ * recover if that fails.
+ */
+ ret = write(ctx->pipe[1], "\n", 1);
+ (void)ret;
+ return 1;
+ }
+ fsync(ctx->pipe[1]);
+ memcpy(ctx->yield, addr, len + 1);
+ return 1;
+}
+
+/**
+ * Alarm callback that regularly probes system network interfaces.
+ *
+ * This callback runs at a frequency determined by VDEV_NETVSC_PROBE_MS as
+ * long as a vdev_netvsc context instance exists.
+ *
+ * Ignored.
+ */
+static void
+vdev_netvsc_alarm(__rte_unused void *arg)
+{
+ struct vdev_netvsc_ctx *ctx;
+ int ret;
+
+ LIST_FOREACH(ctx, &vdev_netvsc_ctx_list, entry) {
+ ret = vdev_netvsc_foreach_iface(vdev_netvsc_device_probe, ctx);
+ if (ret)
+ break;
+ }
+ if (!vdev_netvsc_ctx_count)
+ return;
+ ret = rte_eal_alarm_set(VDEV_NETVSC_PROBE_MS * 1000,
+ vdev_netvsc_alarm, NULL);
+ if (ret < 0) {
+ DRV_LOG(ERR, "unable to reschedule alarm callback: %s",
+ rte_strerror(-ret));
+ }
+}
Why not use netlink uevent?
As described in the doc, we can improve the hot-plug mechanism (here and in fail-safe) once the EAL hot-plug work is done.
So maybe in the next release we will change it to use uevents through EAL hot-plug.
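For context on what the uevent alternative would involve, below is a rough sketch of a kernel uevent listener. It is not taken from the patch or from EAL; uevent_open() and uevent_get() are hypothetical helpers, and the sketch assumes the usual kernel message layout of NUL-separated "KEY=VALUE" fields following an "action@devpath" header:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h> /* struct sockaddr_nl, NETLINK_KOBJECT_UEVENT */

/* Hypothetical helper: open a socket subscribed to kernel uevents. */
int
uevent_open(void)
{
	struct sockaddr_nl addr;
	int fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT);

	if (fd == -1)
		return -1;
	memset(&addr, 0, sizeof(addr));
	addr.nl_family = AF_NETLINK;
	addr.nl_groups = 1; /* Kernel uevent multicast group. */
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Hypothetical helper: look up the value of key in a received uevent.
 * Only NUL-terminated fields are considered; NULL is returned when the
 * key is absent.
 */
const char *
uevent_get(const char *buf, size_t len, const char *key)
{
	size_t klen;
	size_t off = 0;

	assert(buf != NULL && key != NULL);
	klen = strlen(key);
	while (off < len) {
		const char *field = buf + off;
		const char *end = memchr(field, '\0', len - off);
		size_t flen;

		if (end == NULL)
			break; /* Truncated trailing field, ignore it. */
		flen = (size_t)(end - field);
		if (flen > klen && !strncmp(field, key, klen) &&
		    field[klen] == '=')
			return field + klen + 1;
		off += flen + 1;
	}
	return NULL;
}
```

A caller would recv() on the descriptor returned by uevent_open() and, for messages whose SUBSYSTEM is "net", use the ACTION and INTERFACE values to trigger the same scan the alarm callback performs periodically.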
Post by Stephen Hemminger
Post by Adrien Mazarguil
+/**
+ * Probe a NetVSC interface to generate a vdev_netvsc context from.
+ *
+ * This function instantiates vdev_netvsc contexts either for all NetVSC
+ * devices found on the system or only a subset provided as device
+ * arguments.
+ *
+ * It is normally used with vdev_netvsc_foreach_iface().
+ *
+ * Pointer to netdevice description structure (name and index).
+ *
+ * Name associated with current driver instance.
+ *
+ * Device arguments provided to current driver instance.
+ *
+ * Number of specific netdevices provided as device arguments.
+ *
+ * The number of specified netdevices matched by this function.
+ *
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+vdev_netvsc_netvsc_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ const char *name = va_arg(ap, const char *);
+ struct rte_kvargs *kvargs = va_arg(ap, struct rte_kvargs *);
+ unsigned int specified = va_arg(ap, unsigned int);
+ unsigned int *matched = va_arg(ap, unsigned int *);
+ unsigned int i;
+ struct vdev_netvsc_ctx *ctx;
+ int ret;
+
+ /* Probe all interfaces when none are specified. */
+ if (specified) {
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE)) {
+ if (!strcmp(pair->value, iface->if_name))
+ break;
+ } else if (!strcmp(pair->key, VDEV_NETVSC_ARG_MAC)) {
+ struct ether_addr tmp;
+
+ if (sscanf(pair->value,
+ "%" SCNx8 ":%" SCNx8 ":%" SCNx8 ":"
+ "%" SCNx8 ":%" SCNx8 ":%" SCNx8,
+ &tmp.addr_bytes[0],
+ &tmp.addr_bytes[1],
+ &tmp.addr_bytes[2],
+ &tmp.addr_bytes[3],
+ &tmp.addr_bytes[4],
+ &tmp.addr_bytes[5]) != 6) {
+ DRV_LOG(ERR,
+ "invalid MAC address format"
+ " \"%s\"",
+ pair->value);
+ return -EINVAL;
+ }
+ if (is_same_ether_addr(eth_addr, &tmp))
+ break;
+ }
+ }
+ if (i == kvargs->count)
+ return 0;
+ ++(*matched);
+ }
+ /* Weed out interfaces already handled. */
+ LIST_FOREACH(ctx, &vdev_netvsc_ctx_list, entry)
+ if (ctx->if_index == iface->if_index)
+ break;
+ if (ctx) {
+ if (!specified)
+ return 0;
+ DRV_LOG(WARNING,
+ "interface \"%s\" (index %u) is already handled,"
+ " skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ if (!vdev_netvsc_iface_is_netvsc(iface)) {
+ if (!specified)
+ return 0;
+ DRV_LOG(WARNING,
+ "interface \"%s\" (index %u) is not NetVSC,"
+ " skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ /* Create interface context. */
+ ctx = calloc(1, sizeof(*ctx));
+ if (!ctx) {
+ ret = -errno;
+ DRV_LOG(ERR, "cannot allocate context for interface \"%s\": %s",
+ iface->if_name, rte_strerror(errno));
+ goto error;
+ }
+ ctx->id = vdev_netvsc_ctx_count;
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ ctx->if_index = iface->if_index;
+ ctx->if_addr = *eth_addr;
+ ctx->pipe[0] = -1;
+ ctx->pipe[1] = -1;
+ ctx->yield[0] = '\0';
+ if (pipe(ctx->pipe) == -1) {
+ ret = -errno;
+ DRV_LOG(ERR,
+ "cannot allocate control pipe for interface \"%s\": %s",
+ ctx->if_name, rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; i != RTE_DIM(ctx->pipe); ++i) {
+ int flf = fcntl(ctx->pipe[i], F_GETFL);
+
+ if (flf != -1 &&
+ fcntl(ctx->pipe[i], F_SETFL, flf | O_NONBLOCK) != -1)
+ continue;
+ ret = -errno;
+ DRV_LOG(ERR, "cannot toggle non-blocking flag on control file"
+ " descriptor #%u (%d): %s", i, ctx->pipe[i],
+ rte_strerror(errno));
+ goto error;
+ }
+ /* Generate virtual device name and arguments. */
+ i = 0;
+ ret = snprintf(ctx->name, sizeof(ctx->name), "%s_id%u",
+ name, ctx->id);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->name))
+ ++i;
+ ret = snprintf(ctx->devname, sizeof(ctx->devname), "net_failsafe_%s",
+ ctx->name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devname))
+ ++i;
+ ret = snprintf(ctx->devargs, sizeof(ctx->devargs),
+ "fd(%d),dev(net_tap_%s,remote=%s)",
+ ctx->pipe[0], ctx->name, ctx->if_name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devargs))
+ ++i;
+ if (i) {
+ ret = -ENOBUFS;
+ DRV_LOG(ERR, "generated virtual device name or argument list"
+ " too long for interface \"%s\"", ctx->if_name);
+ goto error;
+ }
+ /* Request virtual device generation. */
+ DRV_LOG(DEBUG, "generating virtual device \"%s\" with arguments \"%s\"",
+ ctx->devname, ctx->devargs);
+ vdev_netvsc_foreach_iface(vdev_netvsc_device_probe, ctx);
+ ret = rte_eal_hotplug_add("vdev", ctx->devname, ctx->devargs);
+ if (ret)
+ goto error;
+ LIST_INSERT_HEAD(&vdev_netvsc_ctx_list, ctx, entry);
+ ++vdev_netvsc_ctx_count;
+ DRV_LOG(DEBUG, "added NetVSC interface \"%s\" to context list",
+ ctx->if_name);
+ return 0;
+error:
+ if (ctx)
+ vdev_netvsc_ctx_destroy(ctx);
+ return ret;
+}
+
+/**
* Probe NetVSC interfaces.
*
+ * This function probes system netdevices according to the specified device
+ * arguments and starts a periodic alarm callback to notify the resulting
+ * fail-safe PMD instances of their sub-devices whereabouts.
+ *
* Virtual device context for driver instance.
*
@@ -49,12 +551,40 @@
const char *args = rte_vdev_device_args(dev);
struct rte_kvargs *kvargs = rte_kvargs_parse(args ? args : "",
vdev_netvsc_arg);
+ unsigned int specified = 0;
+ unsigned int matched = 0;
+ unsigned int i;
+ int ret;
DRV_LOG(DEBUG, "invoked as \"%s\", using arguments \"%s\"", name, args);
if (!kvargs) {
DRV_LOG(ERR, "cannot parse arguments list");
goto error;
}
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE) ||
+ !strcmp(pair->key, VDEV_NETVSC_ARG_MAC))
+ ++specified;
+ }
+ rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
+ /* Gather interfaces. */
+ ret = vdev_netvsc_foreach_iface(vdev_netvsc_netvsc_probe, name, kvargs,
+ specified, &matched);
+ if (ret < 0)
+ goto error;
+ if (matched < specified)
+ DRV_LOG(WARNING,
+ "some of the specified parameters did not match"
+ " recognized network interfaces");
+ ret = rte_eal_alarm_set(VDEV_NETVSC_PROBE_MS * 1000,
+ vdev_netvsc_alarm, NULL);
+ if (ret < 0) {
+ DRV_LOG(ERR, "unable to schedule alarm callback: %s",
+ rte_strerror(-ret));
+ goto error;
+ }
if (kvargs)
rte_kvargs_free(kvargs);
@@ -65,6 +595,9 @@
/**
* Remove driver instance.
*
+ * The alarm callback and underlying vdev_netvsc context instances are only
+ * destroyed after the last PMD instance is removed.
+ *
* Virtual device context for driver instance.
*
@@ -74,7 +607,16 @@
static int
vdev_netvsc_vdev_remove(__rte_unused struct rte_vdev_device *dev)
{
Post by Adrien Mazarguil
- --vdev_netvsc_ctx_inst;
+ if (--vdev_netvsc_ctx_inst)
+ return 0;
+ rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
+ while (!LIST_EMPTY(&vdev_netvsc_ctx_list)) {
+ struct vdev_netvsc_ctx *ctx = LIST_FIRST(&vdev_netvsc_ctx_list);
+
+ LIST_REMOVE(ctx, entry);
+ --vdev_netvsc_ctx_count;
+ vdev_netvsc_ctx_destroy(ctx);
+ }
return 0;
}
Thomas Monjalon
2018-01-17 16:51:45 UTC
Permalink
Post by Matan Azrad
From: Stephen Hemminger, Tuesday, January 9, 2018 8:49 PM
Post by Stephen Hemminger
On Tue, 9 Jan 2018 14:47:30 +0000
Post by Adrien Mazarguil
+ ret = rte_eal_alarm_set(VDEV_NETVSC_PROBE_MS * 1000,
+ vdev_netvsc_alarm, NULL);
+ if (ret < 0) {
+ DRV_LOG(ERR, "unable to reschedule alarm callback: %s",
+ rte_strerror(-ret));
+ }
+}
Why not use netlink uevent?
As described in doc, we can improve the hotplug mechanism(here and in fail-safe) after EAL hotplug work will be done.
So, maybe in next release we will change it to use uevent by EAL hotplug.
I don't see any progress here for one week.
Yes it is a temporary solution waiting for hotplug event callback in EAL.
Hopefully it will be possible to do such improvements in 18.05.

Am I missing something else?
Or can it be applied to next-net?
Matan Azrad
2018-01-09 14:47:31 UTC
Permalink
NetVSC netdevices which are already routed should not be probed because
they are used for management purposes by Hyper-V.

Prevent routed NetVSC devices from being probed.

Signed-off-by: Raslan Darawsheh <***@mellanox.com>
Signed-off-by: Matan Azrad <***@mellanox.com>
---
doc/guides/nics/vdev_netvsc.rst | 2 +-
drivers/net/vdev_netvsc/vdev_netvsc.c | 46 +++++++++++++++++++++++++++++++++++
2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
index fde1fb8..f779862 100644
--- a/doc/guides/nics/vdev_netvsc.rst
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -87,4 +87,4 @@ The following device parameters are supported:
MAC address.

Not specifying either ``iface`` or ``mac`` makes this driver attach itself to
-all NetVSC interfaces found on the system.
+all unrouted NetVSC interfaces found on the system.
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
index 3d8895b..4295b92 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -38,6 +38,7 @@
#define VDEV_NETVSC_PROBE_MS 1000

#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"
+#define NETVSC_MAX_ROUTE_LINE_SIZE 300

#define DRV_LOG(level, ...) \
rte_log(RTE_LOG_ ## level, \
@@ -192,6 +193,44 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
}

/**
+ * Determine if a network interface has a route.
+ *
+ * @param[in] name
+ * Network device name.
+ *
+ * @return
+ * A nonzero value when interface has a route. In case of error,
+ * rte_errno is updated and 0 returned.
+ */
+static int
+vdev_netvsc_has_route(const char *name)
+{
+ FILE *fp;
+ int ret = 0;
+ char route[NETVSC_MAX_ROUTE_LINE_SIZE];
+ char *netdev;
+
+ fp = fopen("/proc/net/route", "r");
+ if (!fp) {
+ rte_errno = errno;
+ return 0;
+ }
+ while (fgets(route, NETVSC_MAX_ROUTE_LINE_SIZE, fp) != NULL) {
+ netdev = strtok(route, "\t");
+ if (strcmp(netdev, name) == 0) {
+ ret = 1;
+ break;
+ }
+ /* Move file pointer to the next line. */
+ while (strchr(route, '\n') == NULL &&
+ fgets(route, NETVSC_MAX_ROUTE_LINE_SIZE, fp) != NULL)
+ ;
+ }
+ fclose(fp);
+ return ret;
+}
+
+/**
* Retrieve network interface data from sysfs symbolic link.
*
* @param[out] buf
@@ -453,6 +492,13 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
iface->if_name, iface->if_index);
return 0;
}
+ /* Routed NetVSC should not be probed. */
+ if (vdev_netvsc_has_route(iface->if_name)) {
+ DRV_LOG(WARNING, "NetVSC interface \"%s\" (index %u) is routed",
+ iface->if_name, iface->if_index);
+ if (!specified)
+ return 0;
+ }
/* Create interface context. */
ctx = calloc(1, sizeof(*ctx));
if (!ctx) {
--
1.8.3.1
Stephen Hemminger
2018-01-09 18:51:28 UTC
Permalink
On Tue, 9 Jan 2018 14:47:31 +0000
Post by Matan Azrad
NetVSC netdevices which are already routed should not be probed because
they are used for management purposes by the HyperV.
prevent routed netvsc devices probing.
---
doc/guides/nics/vdev_netvsc.rst | 2 +-
drivers/net/vdev_netvsc/vdev_netvsc.c | 46 +++++++++++++++++++++++++++++++++++
2 files changed, 47 insertions(+), 1 deletion(-)
diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
index fde1fb8..f779862 100644
--- a/doc/guides/nics/vdev_netvsc.rst
+++ b/doc/guides/nics/vdev_netvsc.rst
MAC address.
Not specifying either ``iface`` or ``mac`` makes this driver attach itself to
-all NetVSC interfaces found on the system.
+all unrouted NetVSC interfaces found on the system.
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
index 3d8895b..4295b92 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -38,6 +38,7 @@
#define VDEV_NETVSC_PROBE_MS 1000
#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"
+#define NETVSC_MAX_ROUTE_LINE_SIZE 300
#define DRV_LOG(level, ...) \
rte_log(RTE_LOG_ ## level, \
@@ -192,6 +193,44 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
}
/**
+ * Determine if a network interface has a route.
+ *
+ * Network device name.
+ *
+ * A nonzero value when interface has an route. In case of error,
+ * rte_errno is updated and 0 returned.
+ */
+static int
+vdev_netvsc_has_route(const char *name)
+{
+ FILE *fp;
+ int ret = 0;
+ char route[NETVSC_MAX_ROUTE_LINE_SIZE];
+ char *netdev;
+
+ fp = fopen("/proc/net/route", "r");
+ if (!fp) {
+ rte_errno = errno;
+ return 0;
+ }
+ while (fgets(route, NETVSC_MAX_ROUTE_LINE_SIZE, fp) != NULL) {
+ netdev = strtok(route, "\t");
+ if (strcmp(netdev, name) == 0) {
+ ret = 1;
+ break;
+ }
+ /* Move file pointer to the next line. */
+ while (strchr(route, '\n') == NULL &&
+ fgets(route, NETVSC_MAX_ROUTE_LINE_SIZE, fp) != NULL)
+ ;
+ }
+ fclose(fp);
+ return ret;
+}
In many ways /proc/net/route is a legacy interface.
And a system may have 1M routes.

Maybe there is faster way to do this with netlink by looking to see if
there is an address associated with the interface.
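Separately from the performance concern, the loop as posted appears to skip entries: strtok() overwrites the first tab with a NUL byte, so the later strchr(route, '\n') only scans the first token, finds no newline, and the inner fgets() consumes the following line without comparing it. A possible restructuring, offered as a sketch only (route_table_has_iface() is a hypothetical name, and it takes an already opened stream so it can be exercised without /proc), records whether the line is complete before comparing and leaves the buffer untouched:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 when a line of the route table read from
 * fp starts with "<name>\t", 0 otherwise.
 */
static int
route_table_has_iface(FILE *fp, const char *name)
{
	char line[300];
	size_t nlen;

	assert(fp != NULL && name != NULL);
	nlen = strlen(name);
	while (fgets(line, sizeof(line), fp) != NULL) {
		/* Check for the newline before inspecting the first field. */
		int complete = strchr(line, '\n') != NULL;

		if (!strncmp(line, name, nlen) && line[nlen] == '\t')
			return 1;
		/* Discard the remainder of a line longer than the buffer. */
		while (!complete && fgets(line, sizeof(line), fp) != NULL)
			complete = strchr(line, '\n') != NULL;
	}
	return 0;
}
```

Comparing with strncmp() plus a check for the tab delimiter avoids modifying the buffer, which is what defeats the newline test in the original loop.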
Matan Azrad
2018-01-10 15:07:14 UTC
Permalink
Hi Stephen

From: Stephen Hemminger, Tuesday, January 9, 2018 8:51 PM
Subject: Re: [PATCH v3 6/8] net/vdev_netvsc: skip routed netvsc probing
On Tue, 9 Jan 2018 14:47:31 +0000
Post by Matan Azrad
NetVSC netdevices which are already routed should not be probed
because they are used for management purposes by the HyperV.
prevent routed netvsc devices probing.
---
doc/guides/nics/vdev_netvsc.rst | 2 +-
drivers/net/vdev_netvsc/vdev_netvsc.c | 46
+++++++++++++++++++++++++++++++++++
2 files changed, 47 insertions(+), 1 deletion(-)
diff --git a/doc/guides/nics/vdev_netvsc.rst
b/doc/guides/nics/vdev_netvsc.rst index fde1fb8..f779862 100644
--- a/doc/guides/nics/vdev_netvsc.rst
+++ b/doc/guides/nics/vdev_netvsc.rst
MAC address.
Not specifying either ``iface`` or ``mac`` makes this driver attach
itself to -all NetVSC interfaces found on the system.
+all unrouted NetVSC interfaces found on the system.
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c
b/drivers/net/vdev_netvsc/vdev_netvsc.c
index 3d8895b..4295b92 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -38,6 +38,7 @@
#define VDEV_NETVSC_PROBE_MS 1000
#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"
+#define NETVSC_MAX_ROUTE_LINE_SIZE 300
#define DRV_LOG(level, ...) \
rte_log(RTE_LOG_ ## level, \
@@ -192,6 +193,44 @@ static LIST_HEAD(, vdev_netvsc_ctx)
vdev_netvsc_ctx_list = }
/**
+ * Determine if a network interface has a route.
+ *
+ * Network device name.
+ *
+ * A nonzero value when interface has an route. In case of error,
+ * rte_errno is updated and 0 returned.
+ */
+static int
+vdev_netvsc_has_route(const char *name) {
+ FILE *fp;
+ int ret = 0;
+ char route[NETVSC_MAX_ROUTE_LINE_SIZE];
+ char *netdev;
+
+ fp = fopen("/proc/net/route", "r");
+ if (!fp) {
+ rte_errno = errno;
+ return 0;
+ }
+ while (fgets(route, NETVSC_MAX_ROUTE_LINE_SIZE, fp) != NULL) {
+ netdev = strtok(route, "\t");
+ if (strcmp(netdev, name) == 0) {
+ ret = 1;
+ break;
+ }
+ /* Move file pointer to the next line. */
+ while (strchr(route, '\n') == NULL &&
+ fgets(route, NETVSC_MAX_ROUTE_LINE_SIZE, fp) != NULL)
+ ;
+ }
+ fclose(fp);
+ return ret;
+}
In many ways /proc/net/route is a legacy interface.
And a system may have 1M routes.
Maybe there is faster way to do this with netlink by looking to see if there is
an address associated with the interface.
Actually this is the control path, so we don't care much about performance.
But I am open to other ideas here. Do you have a suggestion?

Thanks!
Stephen Hemminger
2018-01-10 16:43:30 UTC
Permalink
On Wed, 10 Jan 2018 15:07:14 +0000
Post by Matan Azrad
Hi Stephan
From: Stephen Hemminger, Tuesday, January 9, 2018 8:51 PM
Subject: Re: [PATCH v3 6/8] net/vdev_netvsc: skip routed netvsc probing
On Tue, 9 Jan 2018 14:47:31 +0000
Post by Matan Azrad
NetVSC netdevices which are already routed should not be probed
because they are used for management purposes by the HyperV.
prevent routed netvsc devices probing.
---
doc/guides/nics/vdev_netvsc.rst | 2 +-
drivers/net/vdev_netvsc/vdev_netvsc.c | 46
+++++++++++++++++++++++++++++++++++
2 files changed, 47 insertions(+), 1 deletion(-)
diff --git a/doc/guides/nics/vdev_netvsc.rst
b/doc/guides/nics/vdev_netvsc.rst index fde1fb8..f779862 100644
--- a/doc/guides/nics/vdev_netvsc.rst
+++ b/doc/guides/nics/vdev_netvsc.rst
MAC address.
Not specifying either ``iface`` or ``mac`` makes this driver attach itself to
-all NetVSC interfaces found on the system.
+all unrouted NetVSC interfaces found on the system.
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c
b/drivers/net/vdev_netvsc/vdev_netvsc.c
index 3d8895b..4295b92 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -38,6 +38,7 @@
#define VDEV_NETVSC_PROBE_MS 1000
#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"
+#define NETVSC_MAX_ROUTE_LINE_SIZE 300
#define DRV_LOG(level, ...) \
rte_log(RTE_LOG_ ## level, \
@@ -192,6 +193,44 @@ static LIST_HEAD(, vdev_netvsc_ctx)
vdev_netvsc_ctx_list = }
/**
+ * Determine if a network interface has a route.
+ *
+ * Network device name.
+ *
+ * A nonzero value when interface has a route. In case of error,
+ * rte_errno is updated and 0 returned.
+ */
+static int
+vdev_netvsc_has_route(const char *name) {
+ FILE *fp;
+ int ret = 0;
+ char route[NETVSC_MAX_ROUTE_LINE_SIZE];
+ char *netdev;
+
+ fp = fopen("/proc/net/route", "r");
+ if (!fp) {
+ rte_errno = errno;
+ return 0;
+ }
+ while (fgets(route, NETVSC_MAX_ROUTE_LINE_SIZE, fp) != NULL) {
+ netdev = strtok(route, "\t");
+ if (strcmp(netdev, name) == 0) {
+ ret = 1;
+ break;
+ }
+ /* Move file pointer to the next line. */
+ while (strchr(route, '\n') == NULL &&
+ fgets(route, NETVSC_MAX_ROUTE_LINE_SIZE, fp) != NULL)
+ ;
+ }
+ fclose(fp);
+ return ret;
+}
In many ways /proc/net/route is a legacy interface.
And a system may have 1M routes.
Maybe there is a faster way to do this with netlink by looking to see if there is
an address associated with the interface.
Actually, this is the control path, so we don't care very much about performance.
But I am open to other ideas here. Do you have a suggestion?
Thanks!
Use netlink (or ioctl) to get interface address.
If interface has an IPv4 or IPv6 (not link local), then skip it.
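A rough sketch of the address check suggested above, for illustration only. It uses getifaddrs() from libc rather than raw netlink or ioctl, which is a substitution on my side; the helper name iface_has_address() is hypothetical and not part of the patch.

```c
#include <assert.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper, not part of the patch: return 1 when interface
 * "name" carries an IPv4 address or a non-link-local IPv6 address,
 * 0 when it does not, and -1 when getifaddrs() fails (errno is set).
 */
static int
iface_has_address(const char *name)
{
	struct ifaddrs *list;
	struct ifaddrs *ifa;
	int ret = 0;

	assert(name != NULL);
	if (getifaddrs(&list) == -1)
		return -1;
	for (ifa = list; ifa != NULL; ifa = ifa->ifa_next) {
		const struct sockaddr *sa = ifa->ifa_addr;

		/* ifa_addr may be NULL for entries without an address. */
		if (sa == NULL || strcmp(ifa->ifa_name, name) != 0)
			continue;
		if (sa->sa_family == AF_INET) {
			ret = 1;
			break;
		}
		if (sa->sa_family == AF_INET6) {
			const struct sockaddr_in6 *sin6 =
				(const struct sockaddr_in6 *)sa;

			/* Link-local IPv6 addresses do not count. */
			if (!IN6_IS_ADDR_LINKLOCAL(&sin6->sin6_addr)) {
				ret = 1;
				break;
			}
		}
	}
	freeifaddrs(list);
	return ret;
}
```

With something like this, the probe callback could skip an interface when the helper returns 1, instead of scanning the routing table.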
Matan Azrad
2018-01-11 09:00:10 UTC
Hi Stephen

From: Stephen Hemminger, Wednesday, January 10, 2018 6:44 PM
Post by Stephen Hemminger
On Wed, 10 Jan 2018 15:07:14 +0000
Post by Matan Azrad
Hi Stephen
From: Stephen Hemminger, Tuesday, January 9, 2018 8:51 PM
Subject: Re: [PATCH v3 6/8] net/vdev_netvsc: skip routed netvsc probing
On Tue, 9 Jan 2018 14:47:31 +0000
Post by Matan Azrad
NetVSC netdevices which are already routed should not be probed
because they are used for management purposes by Hyper-V.
Prevent probing of routed NetVSC devices.
---
doc/guides/nics/vdev_netvsc.rst | 2 +-
drivers/net/vdev_netvsc/vdev_netvsc.c | 46
+++++++++++++++++++++++++++++++++++
2 files changed, 47 insertions(+), 1 deletion(-)
diff --git a/doc/guides/nics/vdev_netvsc.rst
b/doc/guides/nics/vdev_netvsc.rst index fde1fb8..f779862 100644
--- a/doc/guides/nics/vdev_netvsc.rst
+++ b/doc/guides/nics/vdev_netvsc.rst
MAC address.
Not specifying either ``iface`` or ``mac`` makes this driver attach itself to
-all NetVSC interfaces found on the system.
+all unrouted NetVSC interfaces found on the system.
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c
b/drivers/net/vdev_netvsc/vdev_netvsc.c
index 3d8895b..4295b92 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -38,6 +38,7 @@
#define VDEV_NETVSC_PROBE_MS 1000
#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"
+#define NETVSC_MAX_ROUTE_LINE_SIZE 300
#define DRV_LOG(level, ...) \
rte_log(RTE_LOG_ ## level, \
@@ -192,6 +193,44 @@ static LIST_HEAD(, vdev_netvsc_ctx)
vdev_netvsc_ctx_list = }
/**
+ * Determine if a network interface has a route.
+ *
+ * Network device name.
+ *
+ * A nonzero value when interface has a route. In case of error,
+ * rte_errno is updated and 0 returned.
+ */
+static int
+vdev_netvsc_has_route(const char *name) {
+ FILE *fp;
+ int ret = 0;
+ char route[NETVSC_MAX_ROUTE_LINE_SIZE];
+ char *netdev;
+
+ fp = fopen("/proc/net/route", "r");
+ if (!fp) {
+ rte_errno = errno;
+ return 0;
+ }
+ while (fgets(route, NETVSC_MAX_ROUTE_LINE_SIZE, fp) != NULL) {
+ netdev = strtok(route, "\t");
+ if (strcmp(netdev, name) == 0) {
+ ret = 1;
+ break;
+ }
+ /* Move file pointer to the next line. */
+ while (strchr(route, '\n') == NULL &&
+ fgets(route, NETVSC_MAX_ROUTE_LINE_SIZE, fp) != NULL)
+ ;
+ }
+ fclose(fp);
+ return ret;
+}
In many ways /proc/net/route is a legacy interface.
And a system may have 1M routes.
Maybe there is a faster way to do this with netlink by looking to see
if there is an address associated with the interface.
Actually, this is the control path, so we don't care very much about performance.
But I am open to other ideas here. Do you have a suggestion?
Thanks!
Use netlink (or ioctl) to get interface address.
If interface has an IPv4 or IPv6 (not link local), then skip it.
After a little investigation, I found that getting IPv6 addresses by ioctl is problematic.
And using netlink for it really isn't worth the effort.
So I suggest keeping this code as simple as it is, in spite of the potentially high latency of this function; after all, it is a control path.
Thomas Monjalon
2018-01-17 16:59:59 UTC
Post by Matan Azrad
From: Stephen Hemminger, Wednesday, January 10, 2018 6:44 PM
Post by Stephen Hemminger
Post by Matan Azrad
From: Stephen Hemminger, Tuesday, January 9, 2018 8:51 PM
Post by Stephen Hemminger
On Tue, 9 Jan 2018 14:47:31 +0000
Post by Adrien Mazarguil
+static int
+vdev_netvsc_has_route(const char *name) {
+ FILE *fp;
+ int ret = 0;
+ char route[NETVSC_MAX_ROUTE_LINE_SIZE];
+ char *netdev;
+
+ fp = fopen("/proc/net/route", "r");
+ if (!fp) {
+ rte_errno = errno;
+ return 0;
+ }
+ while (fgets(route, NETVSC_MAX_ROUTE_LINE_SIZE, fp) != NULL) {
+ netdev = strtok(route, "\t");
+ if (strcmp(netdev, name) == 0) {
+ ret = 1;
+ break;
+ }
+ /* Move file pointer to the next line. */
+ while (strchr(route, '\n') == NULL &&
+ fgets(route, NETVSC_MAX_ROUTE_LINE_SIZE, fp) != NULL)
+ ;
+ }
+ fclose(fp);
+ return ret;
+}
In many ways /proc/net/route is a legacy interface.
And a system may have 1M routes.
Maybe there is a faster way to do this with netlink by looking to see
if there is an address associated with the interface.
Actually, this is the control path, so we don't care very much about performance.
But I am open to other ideas here. Do you have a suggestion?
Thanks!
Use netlink (or ioctl) to get interface address.
If interface has an IPv4 or IPv6 (not link local), then skip it.
After a little investigation, I found that getting IPv6 addresses by ioctl is problematic.
And using netlink for it really isn't worth the effort.
So I suggest keeping this code as simple as it is, in spite of the potentially high latency of this function; after all, it is a control path.
No more comments?
So are we OK with this solution for now?

If we see a real performance issue, I guess it can be fixed later.
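For reference, a minimal sketch of the line-by-line scan discussed in this thread, written against a generic FILE stream so it can be tried without /proc/net/route. One detail it handles explicitly: strtok() overwrites the first tab with a NUL byte, so the newline check is done before tokenizing. The function name stream_has_iface() and the ROUTE_LINE_SIZE macro are hypothetical; this is not the code from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define ROUTE_LINE_SIZE 300 /* Same value as NETVSC_MAX_ROUTE_LINE_SIZE. */

/*
 * Hypothetical variant of the scan: return 1 when the first
 * tab-separated field of any line read from "fp" equals "name",
 * 0 otherwise.
 */
static int
stream_has_iface(FILE *fp, const char *name)
{
	char line[ROUTE_LINE_SIZE];

	assert(fp != NULL && name != NULL);
	while (fgets(line, sizeof(line), fp) != NULL) {
		/* Check for the newline before strtok() truncates the line. */
		int complete = strchr(line, '\n') != NULL;
		const char *netdev = strtok(line, "\t");

		if (netdev != NULL && strcmp(netdev, name) == 0)
			return 1;
		/* Discard the remainder of a line longer than the buffer. */
		while (!complete && fgets(line, sizeof(line), fp) != NULL)
			complete = strchr(line, '\n') != NULL;
	}
	return 0;
}
```

A caller would open "/proc/net/route" with fopen(), pass the stream to this function and close it afterwards, keeping the error handling of the original function.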
Matan Azrad
2018-01-09 14:47:32 UTC
This parameter allows specifying any non-NetVSC interface or routed
NetVSC interfaces to use with tap sub-devices for development purposes.

Signed-off-by: Adrien Mazarguil <***@6wind.com>
Signed-off-by: Matan Azrad <***@mellanox.com>
---
doc/guides/nics/vdev_netvsc.rst | 5 +++++
drivers/net/vdev_netvsc/vdev_netvsc.c | 30 +++++++++++++++++++-----------
2 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
index f779862..3c26990 100644
--- a/doc/guides/nics/vdev_netvsc.rst
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -86,5 +86,10 @@ The following device parameters are supported:
Same as ``iface`` except a suitable NetVSC interface is located using its
MAC address.

+- ``force`` [int]
+
+ If nonzero, forces the use of specified interfaces even if not detected as
+ NetVSC or detected as routed NETVSC.
+
Not specifying either ``iface`` or ``mac`` makes this driver attach itself to
all unrouted NetVSC interfaces found on the system.
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
index 4295b92..301f9b6 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -35,6 +35,7 @@
#define VDEV_NETVSC_DRIVER net_vdev_netvsc
#define VDEV_NETVSC_ARG_IFACE "iface"
#define VDEV_NETVSC_ARG_MAC "mac"
+#define VDEV_NETVSC_ARG_FORCE "force"
#define VDEV_NETVSC_PROBE_MS 1000

#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"
@@ -413,6 +414,9 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
* - struct rte_kvargs *kvargs:
* Device arguments provided to current driver instance.
*
+ * - int force:
+ * Accept specified interface even if not detected as NetVSC.
+ *
* - unsigned int specified:
* Number of specific netdevices provided as device arguments.
*
@@ -430,6 +434,7 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
{
const char *name = va_arg(ap, const char *);
struct rte_kvargs *kvargs = va_arg(ap, struct rte_kvargs *);
+ int force = va_arg(ap, int);
unsigned int specified = va_arg(ap, unsigned int);
unsigned int *matched = va_arg(ap, unsigned int *);
unsigned int i;
@@ -484,20 +489,18 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
return 0;
}
if (!vdev_netvsc_iface_is_netvsc(iface)) {
- if (!specified)
+ if (!specified || !force)
return 0;
DRV_LOG(WARNING,
- "interface \"%s\" (index %u) is not NetVSC,"
- " skipping",
+ "using non-NetVSC interface \"%s\" (index %u)",
iface->if_name, iface->if_index);
- return 0;
}
/* Routed NetVSC should not be probed. */
if (vdev_netvsc_has_route(iface->if_name)) {
- DRV_LOG(WARNING, "NetVSC interface \"%s\" (index %u) is routed",
- iface->if_name, iface->if_index);
- if (!specified)
+ if (!specified || !force)
return 0;
+ DRV_LOG(WARNING, "using routed NetVSC interface \"%s\""
+ " (index %u)", iface->if_name, iface->if_index);
}
/* Create interface context. */
ctx = calloc(1, sizeof(*ctx));
@@ -591,6 +594,7 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
static const char *const vdev_netvsc_arg[] = {
VDEV_NETVSC_ARG_IFACE,
VDEV_NETVSC_ARG_MAC,
+ VDEV_NETVSC_ARG_FORCE,
NULL,
};
const char *name = rte_vdev_device_name(dev);
@@ -599,6 +603,7 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
vdev_netvsc_arg);
unsigned int specified = 0;
unsigned int matched = 0;
+ int force = 0;
unsigned int i;
int ret;

@@ -610,14 +615,16 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
for (i = 0; i != kvargs->count; ++i) {
const struct rte_kvargs_pair *pair = &kvargs->pairs[i];

- if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE) ||
- !strcmp(pair->key, VDEV_NETVSC_ARG_MAC))
+ if (!strcmp(pair->key, VDEV_NETVSC_ARG_FORCE))
+ force = !!atoi(pair->value);
+ else if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE) ||
+ !strcmp(pair->key, VDEV_NETVSC_ARG_MAC))
++specified;
}
rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
/* Gather interfaces. */
ret = vdev_netvsc_foreach_iface(vdev_netvsc_netvsc_probe, name, kvargs,
- specified, &matched);
+ force, specified, &matched);
if (ret < 0)
goto error;
if (matched < specified)
@@ -676,7 +683,8 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
RTE_PMD_REGISTER_ALIAS(VDEV_NETVSC_DRIVER, eth_vdev_netvsc);
RTE_PMD_REGISTER_PARAM_STRING(net_vdev_netvsc,
VDEV_NETVSC_ARG_IFACE "=<string> "
- VDEV_NETVSC_ARG_MAC "=<string>");
+ VDEV_NETVSC_ARG_MAC "=<string> "
+ VDEV_NETVSC_ARG_FORCE "=<int>");

/** Initialize driver log type. */
RTE_INIT(vdev_netvsc_init_log)
--
1.8.3.1
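The effect of "force" on the two checks touched by this patch can be condensed into a small truth table. The sketch below is only an illustration under the assumption that nothing else in the probe callback influences the decision; should_probe() and its self-check are hypothetical names, not driver code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical condensation of the two checks in the probe callback:
 * decide whether an interface should get a context.
 */
static bool
should_probe(bool is_netvsc, bool routed, unsigned int specified, bool force)
{
	/* Unsuitable interfaces pass only if named explicitly and forced. */
	if (!is_netvsc || routed)
		return specified != 0 && force;
	return true;
}

/* Walk through the interesting combinations. */
static void
should_probe_selfcheck(void)
{
	/* Unrouted NetVSC is always accepted. */
	assert(should_probe(true, false, 0, false));
	/* Routed NetVSC is skipped unless specified and forced. */
	assert(!should_probe(true, true, 1, false));
	assert(should_probe(true, true, 1, true));
	/* force alone does not help when no interface was specified. */
	assert(!should_probe(false, false, 0, true));
}
```

On the command line this would correspond to something like --vdev=net_vdev_netvsc,iface=eth1,force=1, where eth1 is a placeholder interface name.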
Matan Azrad
2018-01-09 14:47:33 UTC
Using DPDK in Hyper-V VM systems requires the vdev_netvsc driver to pair
the NetVSC netdevice with the PCI device sharing its MAC address through
the fail-safe PMD.

Add vdev_netvsc custom scan in vdev bus to allow automatic probing in
Hyper-V VM systems unless it was already specified by command line.

Add "ignore" parameter to disable this auto-detection.

Signed-off-by: Matan Azrad <***@mellanox.com>
---
doc/guides/nics/vdev_netvsc.rst | 9 ++++--
drivers/net/vdev_netvsc/vdev_netvsc.c | 55 +++++++++++++++++++++++++++++++++--
2 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
index 3c26990..55d130a 100644
--- a/doc/guides/nics/vdev_netvsc.rst
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -71,8 +71,8 @@ Build options
Run-time parameters
-------------------

-To invoke this driver, applications have to explicitly provide the
-``--vdev=net_vdev_netvsc`` EAL option.
+This driver is invoked automatically in Hyper-V VM systems unless the user
+invoked it by command line using ``--vdev=net_vdev_netvsc`` EAL option.

The following device parameters are supported:

@@ -91,5 +91,10 @@ The following device parameters are supported:
If nonzero, forces the use of specified interfaces even if not detected as
NetVSC or detected as routed NETVSC.

+- ``ignore`` [int]
+
+ If nonzero, ignores the driver running (actually used to disable the
+ auto-detection in Hyper-V VM).
+
Not specifying either ``iface`` or ``mac`` makes this driver attach itself to
all unrouted NetVSC interfaces found on the system.
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
index 301f9b6..0897c3d 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -29,13 +29,16 @@
#include <rte_errno.h>
#include <rte_ethdev.h>
#include <rte_ether.h>
+#include <rte_hypervisor.h>
#include <rte_kvargs.h>
#include <rte_log.h>

#define VDEV_NETVSC_DRIVER net_vdev_netvsc
+#define VDEV_NETVSC_DRIVER_NAME RTE_STR(VDEV_NETVSC_DRIVER)
#define VDEV_NETVSC_ARG_IFACE "iface"
#define VDEV_NETVSC_ARG_MAC "mac"
#define VDEV_NETVSC_ARG_FORCE "force"
+#define VDEV_NETVSC_ARG_IGNORE "ignore"
#define VDEV_NETVSC_PROBE_MS 1000

#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"
@@ -44,7 +47,7 @@
#define DRV_LOG(level, ...) \
rte_log(RTE_LOG_ ## level, \
vdev_netvsc_logtype, \
- RTE_FMT(RTE_STR(VDEV_NETVSC_DRIVER) ": " \
+ RTE_FMT(VDEV_NETVSC_DRIVER_NAME ": " \
RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
RTE_FMT_TAIL(__VA_ARGS__,)))

@@ -595,6 +598,7 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
VDEV_NETVSC_ARG_IFACE,
VDEV_NETVSC_ARG_MAC,
VDEV_NETVSC_ARG_FORCE,
+ VDEV_NETVSC_ARG_IGNORE,
NULL,
};
const char *name = rte_vdev_device_name(dev);
@@ -604,6 +608,7 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
unsigned int specified = 0;
unsigned int matched = 0;
int force = 0;
+ int ignore = 0;
unsigned int i;
int ret;

@@ -617,10 +622,17 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =

if (!strcmp(pair->key, VDEV_NETVSC_ARG_FORCE))
force = !!atoi(pair->value);
+ else if (!strcmp(pair->key, VDEV_NETVSC_ARG_IGNORE))
+ ignore = !!atoi(pair->value);
else if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE) ||
!strcmp(pair->key, VDEV_NETVSC_ARG_MAC))
++specified;
}
+ if (ignore) {
+ if (kvargs)
+ rte_kvargs_free(kvargs);
+ return 0;
+ }
rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
/* Gather interfaces. */
ret = vdev_netvsc_foreach_iface(vdev_netvsc_netvsc_probe, name, kvargs,
@@ -684,7 +696,8 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
RTE_PMD_REGISTER_PARAM_STRING(net_vdev_netvsc,
VDEV_NETVSC_ARG_IFACE "=<string> "
VDEV_NETVSC_ARG_MAC "=<string> "
- VDEV_NETVSC_ARG_FORCE "=<int>");
+ VDEV_NETVSC_ARG_FORCE "=<int> "
+ VDEV_NETVSC_ARG_IGNORE "=<int>");

/** Initialize driver log type. */
RTE_INIT(vdev_netvsc_init_log)
@@ -693,3 +706,41 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
if (vdev_netvsc_logtype >= 0)
rte_log_set_level(vdev_netvsc_logtype, RTE_LOG_NOTICE);
}
+
+/** Compare function for vdev find device operation. */
+static int
+vdev_netvsc_cmp_rte_device(const struct rte_device *dev1,
+ __rte_unused const void *_dev2)
+{
+ return strcmp(dev1->devargs->name, VDEV_NETVSC_DRIVER_NAME);
+}
+
+/**
+ * A callback called by vdev bus scan function to ensure this driver probing
+ * automatically in Hyper-V VM system unless it already exists in the
+ * devargs list.
+ */
+static void
+vdev_netvsc_scan_callback(__rte_unused void *arg)
+{
+ struct rte_vdev_device *dev;
+ struct rte_devargs *devargs;
+ struct rte_bus *vbus = rte_bus_find_by_name("vdev");
+
+ TAILQ_FOREACH(devargs, &devargs_list, next)
+ if (!strcmp(devargs->name, VDEV_NETVSC_DRIVER_NAME))
+ return;
+ dev = (struct rte_vdev_device *)vbus->find_device(NULL,
+ vdev_netvsc_cmp_rte_device, VDEV_NETVSC_DRIVER_NAME);
+ if (dev)
+ return;
+ if (rte_eal_devargs_add(RTE_DEVTYPE_VIRTUAL, VDEV_NETVSC_DRIVER_NAME))
+ DRV_LOG(ERR, "unable to add netvsc devargs.");
+}
+
+/** Initialize the custom scan. */
+RTE_INIT(vdev_netvsc_custom_scan_add)
+{
+ if (rte_hypervisor_get() == RTE_HYPERVISOR_HYPERV)
+ rte_vdev_add_custom_scan(vdev_netvsc_scan_callback, NULL);
+}
--
1.8.3.1
Matan Azrad
2018-01-18 08:43:38 UTC
Virtual machines hosted by Hyper-V/Azure platforms are fitted with simplified virtual network devices named NetVSC that are used for fast communication between VM to VM, VM to hypervisor, and the outside.

They appear as standard system netdevices to user-land applications, the main difference being they are implemented on top of VMBUS instead of emulated PCI devices.

While this reads like a case for a standard DPDK PMD, there is more to it.

To accelerate outside communication, NetVSC devices as they appear in a VM can be paired with physical SR-IOV virtual function (VF) devices owned by that same VM. Both netdevices share the same MAC address in that case.

When paired, egress and most of the ingress traffic flow through the VF device, while part of it (e.g. multicasts, hypervisor control data) still flows through NetVSC. Moreover VF devices are not retained and disappear during VM migration; from a VM standpoint, they can be hot-plugged anytime with NetVSC acting as a fallback.

Running DPDK applications in such a context involves driving VF devices using their dedicated PMDs in a vendor-independent fashion (to benefit from maximum performance without writing dedicated code) while simultaneously listening to NetVSC and handling the related hot-plug events.

This new virtual driver (referred to as "vdev_netvsc" from this point on) automatically coordinates the Hyper-V/Azure-specific management part described above by relying on vendor-specific, failsafe and tap PMDs to expose a single consolidated Ethernet device usable directly by existing applications.

.------------------.
| DPDK application |
`--------+---------'
|
.------+------.
| DPDK ethdev |
`------+------' Control
| |
.------------+------------. v .--------------------.
| failsafe PMD +---------+ vdev_netvsc driver |
`--+-------------------+--' `--------------------'
| |
| .........|.........
| : | :
.----+----. : .----+----. :
| tap PMD | : | any PMD | :
`----+----' : `----+----' : <-- Hot-pluggable
| : | :
.------+-------. : .-----+-----. :
| NetVSC-based | : | SR-IOV VF | :
| netdevice | : | device | :
`--------------' : `-----------' :
:.................:



v2 changes(Adrien):

- Renamed driver from "hyperv" to "vdev_netvsc". This change covers
documentation and symbols prefix.
- Driver is now tagged EXPERIMENTAL.
- Replaced ether_addr_from_str() with a basic sscanf() call.
- Removed debugging code (memset() poisoning).
- Fixed hyperv_iface_is_netvsc()'s buffer allocation according to comments.
- Removed hyperv_basename().
- Discarded unused variables through __rte_unused.
- Added separate but necessary free() bugfix for failsafe PMD.
- Added file descriptor input support to failsafe PMD.
- Replaced temporary bash execution; failsafe now reads device definitions
directly through a pipe without an intermediate bash one-liner.
- Expanded DEBUG/INFO/WARN/ERROR() macros as PMD_DRV_LOG().
- Added dynamic log type (pmd.vdev_netvsc).
- Modified initialization code to probe devices immediately during startup.
- Fixed several snprintf() return value checks ("ret >= sizeof(foo)" is more
appropriate than "ret >= sizeof(foo) - 1").

v3 changes(Matan):
- Fixed clang compilation in V2.
- Removed hotplug remove code from the new driver.
- Supported probed sub-devices getting in fail-safe.
- Added automatic probing for HyperV VM systems.
- Added option to ignore the automatic probing.
- Skipped probing of routed NetVSC devices.
- Adjusted documentation and semantics.
- Replaced maintainer.

v4 changes(Matan):
- Align descriptions of context struct(Stephen suggestion).
- Skip non-ethernet devices in netdev loop(Stephen suggestion).
- Use different variable names in "add fd parameter"(Gaetan suggestion).
- Change name of get port id function in "add automatic probing"(Gaetan suggestion).
- Update internal fail-safe devargs in case of probed device(Gaetan suggestion).
- Use a different commit title instead of "support probed sub-devices getting"(Gaetan suggestion).
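
Regarding the snprintf() return value checks mentioned in the v2 changes, the following sketch shows why "ret >= sizeof(foo)" is the truncation condition. The helper build_sysfs_path() and the path layout it formats are assumptions made for illustration, not code taken from the series.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: format "/sys/class/net/<iface>/<leaf>" into buf.
 * Return 0 on success, -1 on encoding error or truncation.
 */
static int
build_sysfs_path(char *buf, size_t size, const char *iface, const char *leaf)
{
	int ret;

	assert(buf != NULL && iface != NULL && leaf != NULL);
	ret = snprintf(buf, size, "/sys/class/net/%s/%s", iface, leaf);
	if (ret < 0)
		return -1;
	/*
	 * snprintf() returns the length the full string would have had,
	 * excluding the terminating NUL, so ret == size - 1 still fits
	 * and ret >= size means the output was truncated.
	 */
	if ((size_t)ret >= size)
		return -1;
	return 0;
}
```

A check against "sizeof(foo) - 1" would reject a string that fills the buffer exactly, which is still a valid result.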


Adrien Mazarguil (1):
net/failsafe: fix invalid free

Matan Azrad (7):
net/failsafe: add "fd" parameter
net/failsafe: add probed etherdev capture
net/vdev_netvsc: introduce Hyper-V platform driver
net/vdev_netvsc: implement core functionality
net/vdev_netvsc: skip routed netvsc probing
net/vdev_netvsc: add "force" parameter
net/vdev_netvsc: add automatic probing

MAINTAINERS | 6 +
config/common_base | 5 +
config/common_linuxapp | 1 +
doc/guides/nics/fail_safe.rst | 14 +
doc/guides/nics/features/vdev_netvsc.ini | 12 +
doc/guides/nics/index.rst | 1 +
doc/guides/nics/vdev_netvsc.rst | 100 +++
drivers/net/Makefile | 1 +
drivers/net/failsafe/failsafe_args.c | 84 ++-
drivers/net/failsafe/failsafe_eal.c | 78 ++-
drivers/net/failsafe/failsafe_private.h | 5 +
drivers/net/vdev_netvsc/Makefile | 31 +
.../vdev_netvsc/rte_pmd_vdev_netvsc_version.map | 4 +
drivers/net/vdev_netvsc/vdev_netvsc.c | 752 +++++++++++++++++++++
mk/rte.app.mk | 1 +
15 files changed, 1071 insertions(+), 24 deletions(-)
create mode 100644 doc/guides/nics/features/vdev_netvsc.ini
create mode 100644 doc/guides/nics/vdev_netvsc.rst
create mode 100644 drivers/net/vdev_netvsc/Makefile
create mode 100644 drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map
create mode 100644 drivers/net/vdev_netvsc/vdev_netvsc.c
--
1.8.3.1
Matan Azrad
2018-01-18 08:43:39 UTC
From: Adrien Mazarguil <***@6wind.com>

rte_free() is not supposed to work with pointers returned by calloc().

Fixes: a0194d828100 ("net/failsafe: add flexible device definition")
Cc: ***@dpdk.org
Cc: Gaetan Rivet <***@6wind.com>

Signed-off-by: Adrien Mazarguil <***@6wind.com>
Acked-by: Gaetan Rivet <***@6wind.com>
---
drivers/net/failsafe/failsafe_args.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/failsafe/failsafe_args.c b/drivers/net/failsafe/failsafe_args.c
index cfc83e3..ec63ac9 100644
--- a/drivers/net/failsafe/failsafe_args.c
+++ b/drivers/net/failsafe/failsafe_args.c
@@ -407,7 +407,7 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
uint8_t i;

FOREACH_SUBDEV(sdev, i, dev) {
- rte_free(sdev->cmdline);
+ free(sdev->cmdline);
sdev->cmdline = NULL;
free(sdev->devargs.args);
sdev->devargs.args = NULL;
--
1.8.3.1
Matan Azrad
2018-01-18 08:43:40 UTC
This parameter enables applications to provide device definitions through
an arbitrary file descriptor number.

Signed-off-by: Adrien Mazarguil <***@6wind.com>
Signed-off-by: Matan Azrad <***@mellanox.com>
Cc: Gaetan Rivet <***@6wind.com>
---
doc/guides/nics/fail_safe.rst | 9 ++++
drivers/net/failsafe/failsafe_args.c | 80 ++++++++++++++++++++++++++++++++-
drivers/net/failsafe/failsafe_private.h | 3 ++
3 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/doc/guides/nics/fail_safe.rst b/doc/guides/nics/fail_safe.rst
index c4e3d2e..5b1b47e 100644
--- a/doc/guides/nics/fail_safe.rst
+++ b/doc/guides/nics/fail_safe.rst
@@ -106,6 +106,15 @@ Fail-safe command line parameters
All commas within the ``shell command`` are replaced by spaces before
executing the command. This helps using scripts to specify devices.

+- **fd(<file descriptor number>)** parameter
+
+ This parameter reads a device definition from an arbitrary file descriptor
+ number in ``<iface>`` format as described above.
+
+ The file descriptor is read in non-blocking mode and is never closed in
+ order to take only the last line into account (unlike ``exec()``) at every
+ probe attempt.
+
- **mac** parameter [MAC address]

This parameter allows the user to set a default MAC address to the fail-safe
diff --git a/drivers/net/failsafe/failsafe_args.c b/drivers/net/failsafe/failsafe_args.c
index ec63ac9..db5235b 100644
--- a/drivers/net/failsafe/failsafe_args.c
+++ b/drivers/net/failsafe/failsafe_args.c
@@ -31,7 +31,11 @@
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
#include <string.h>
+#include <unistd.h>
#include <errno.h>

#include <rte_debug.h>
@@ -161,6 +165,67 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
}

static int
+fs_read_fd(struct sub_device *sdev, char *fd_str)
+{
+ FILE *fp = NULL;
+ int fd = -1;
+ /* store possible newline as well */
+ char output[DEVARGS_MAXLEN + 1];
+ int err = -ENODEV;
+ int oflags;
+ int lcount;
+
+ RTE_ASSERT(fd_str != NULL || sdev->fd_str != NULL);
+ if (sdev->fd_str == NULL) {
+ sdev->fd_str = strdup(fd_str);
+ if (sdev->fd_str == NULL) {
+ ERROR("Command line allocation failed");
+ return -ENOMEM;
+ }
+ }
+ errno = 0;
+ fd = strtol(fd_str, &fd_str, 0);
+ if (errno || *fd_str || fd < 0) {
+ ERROR("Parsing FD number failed");
+ goto error;
+ }
+ /* Fiddle with copy of file descriptor */
+ fd = dup(fd);
+ if (fd == -1)
+ goto error;
+ oflags = fcntl(fd, F_GETFL);
+ if (oflags == -1)
+ goto error;
+ if (fcntl(fd, F_SETFL, fd | O_NONBLOCK) == -1)
+ goto error;
+ fp = fdopen(fd, "r");
+ if (!fp)
+ goto error;
+ fd = -1;
+ /* Only take the last line into account */
+ lcount = 0;
+ while (fgets(output, sizeof(output), fp))
+ ++lcount;
+ if (lcount == 0)
+ goto error;
+ else if (ferror(fp) && errno != EAGAIN)
+ goto error;
+ /* Line must end with a newline character */
+ fs_sanitize_cmdline(output);
+ if (output[0] == '\0')
+ goto error;
+ err = fs_parse_device(sdev, output);
+ if (err)
+ ERROR("Parsing device '%s' failed", output);
+error:
+ if (fp)
+ fclose(fp);
+ if (fd != -1)
+ close(fd);
+ return err;
+}
+
+static int
fs_parse_device_param(struct rte_eth_dev *dev, const char *param,
uint8_t head)
{
@@ -202,6 +267,14 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
}
if (ret)
goto free_args;
+ } else if (strncmp(param, "fd(", 3) == 0) {
+ ret = fs_read_fd(sdev, args);
+ if (ret == -ENODEV) {
+ DEBUG("Reading device info from FD failed");
+ ret = 0;
+ }
+ if (ret)
+ goto free_args;
} else {
ERROR("Unrecognized device type: %.*s", (int)b, param);
return -EINVAL;
@@ -409,6 +482,8 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
FOREACH_SUBDEV(sdev, i, dev) {
free(sdev->cmdline);
sdev->cmdline = NULL;
+ free(sdev->fd_str);
+ sdev->fd_str = NULL;
free(sdev->devargs.args);
sdev->devargs.args = NULL;
}
@@ -424,7 +499,8 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
param[b] != '\0')
b++;
if (strncmp(param, "dev", b) != 0 &&
- strncmp(param, "exec", b) != 0) {
+ strncmp(param, "exec", b) != 0 &&
+ strncmp(param, "fd(", b) != 0) {
ERROR("Unrecognized device type: %.*s", (int)b, param);
return -EINVAL;
}
@@ -463,6 +539,8 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
continue;
if (sdev->cmdline)
ret = fs_execute_cmd(sdev, sdev->cmdline);
+ else if (sdev->fd_str)
+ ret = fs_read_fd(sdev, sdev->fd_str);
else
ret = fs_parse_sub_device(sdev);
if (ret == 0)
diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
index 54b5b91..5e04ffe 100644
--- a/drivers/net/failsafe/failsafe_private.h
+++ b/drivers/net/failsafe/failsafe_private.h
@@ -48,6 +48,7 @@
#define PMD_FAILSAFE_PARAM_STRING \
"dev(<ifc>)," \
"exec(<shell command>)," \
+ "fd(<fd number>)," \
"mac=mac_addr," \
"hotplug_poll=u64" \
""
@@ -112,6 +113,8 @@ struct sub_device {
struct fs_stats stats_snapshot;
/* Some device are defined as a command line */
char *cmdline;
+ /* Others are retrieved through a file descriptor */
+ char *fd_str;
/* fail-safe device backreference */
struct rte_eth_dev *fs_dev;
/* flag calling for recollection */
--
1.8.3.1
Gaëtan Rivet
2018-01-18 08:51:48 UTC
Hi Matan,

You forgot to fix the fcntl call, see below,
Post by Adrien Mazarguil
This parameter enables applications to provide device definitions through
an arbitrary file descriptor number.
---
doc/guides/nics/fail_safe.rst | 9 ++++
drivers/net/failsafe/failsafe_args.c | 80 ++++++++++++++++++++++++++++++++-
drivers/net/failsafe/failsafe_private.h | 3 ++
3 files changed, 91 insertions(+), 1 deletion(-)
diff --git a/doc/guides/nics/fail_safe.rst b/doc/guides/nics/fail_safe.rst
index c4e3d2e..5b1b47e 100644
--- a/doc/guides/nics/fail_safe.rst
+++ b/doc/guides/nics/fail_safe.rst
@@ -106,6 +106,15 @@ Fail-safe command line parameters
All commas within the ``shell command`` are replaced by spaces before
executing the command. This helps using scripts to specify devices.
+- **fd(<file descriptor number>)** parameter
+
+ This parameter reads a device definition from an arbitrary file descriptor
+ number in ``<iface>`` format as described above.
+
+ The file descriptor is read in non-blocking mode and is never closed in
+ order to take only the last line into account (unlike ``exec()``) at every
+ probe attempt.
+
- **mac** parameter [MAC address]
This parameter allows the user to set a default MAC address to the fail-safe
diff --git a/drivers/net/failsafe/failsafe_args.c b/drivers/net/failsafe/failsafe_args.c
index ec63ac9..db5235b 100644
--- a/drivers/net/failsafe/failsafe_args.c
+++ b/drivers/net/failsafe/failsafe_args.c
@@ -31,7 +31,11 @@
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
#include <string.h>
+#include <unistd.h>
#include <errno.h>
#include <rte_debug.h>
@@ -161,6 +165,67 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
}
static int
+fs_read_fd(struct sub_device *sdev, char *fd_str)
+{
+ FILE *fp = NULL;
+ int fd = -1;
+ /* store possible newline as well */
+ char output[DEVARGS_MAXLEN + 1];
+ int err = -ENODEV;
+ int oflags;
+ int lcount;
+
+ RTE_ASSERT(fd_str != NULL || sdev->fd_str != NULL);
+ if (sdev->fd_str == NULL) {
+ sdev->fd_str = strdup(fd_str);
+ if (sdev->fd_str == NULL) {
+ ERROR("Command line allocation failed");
+ return -ENOMEM;
+ }
+ }
+ errno = 0;
+ fd = strtol(fd_str, &fd_str, 0);
+ if (errno || *fd_str || fd < 0) {
+ ERROR("Parsing FD number failed");
+ goto error;
+ }
+ /* Fiddle with copy of file descriptor */
+ fd = dup(fd);
+ if (fd == -1)
+ goto error;
+ oflags = fcntl(fd, F_GETFL);
+ if (oflags == -1)
+ goto error;
+ if (fcntl(fd, F_SETFL, fd | O_NONBLOCK) == -1)
fcntl(fd, F_SETFL, oflags | O_NONBLOCK); here
Post by Adrien Mazarguil
+ goto error;
+ fp = fdopen(fd, "r");
+ if (!fp)
While you're at it, here please use

if (fp == NULL)

instead.

Regards,
--
Gaëtan Rivet
6WIND
Matan Azrad
2018-01-18 08:43:41 UTC
Previous fail-safe code didn't support capturing already probed sub-devices
and failed when it tried to probe them.

Skip fail-safe sub-device probing when the sub-device was already probed.

Signed-off-by: Matan Azrad <***@mellanox.com>
Cc: Gaetan Rivet <***@6wind.com>
---
doc/guides/nics/fail_safe.rst | 5 +++
drivers/net/failsafe/failsafe_args.c | 2 -
drivers/net/failsafe/failsafe_eal.c | 78 ++++++++++++++++++++++++---------
drivers/net/failsafe/failsafe_private.h | 2 +
4 files changed, 65 insertions(+), 22 deletions(-)

diff --git a/doc/guides/nics/fail_safe.rst b/doc/guides/nics/fail_safe.rst
index 5b1b47e..b89e53b 100644
--- a/doc/guides/nics/fail_safe.rst
+++ b/doc/guides/nics/fail_safe.rst
@@ -115,6 +115,11 @@ Fail-safe command line parameters
order to take only the last line into account (unlike ``exec()``) at every
probe attempt.

+.. note::
+
+ In case of whitelist sub-device probed by EAL, fail-safe PMD will take the device
+ as is, which means that EAL device options are taken in this case.
+
- **mac** parameter [MAC address]

This parameter allows the user to set a default MAC address to the fail-safe
diff --git a/drivers/net/failsafe/failsafe_args.c b/drivers/net/failsafe/failsafe_args.c
index db5235b..daf5ed0 100644
--- a/drivers/net/failsafe/failsafe_args.c
+++ b/drivers/net/failsafe/failsafe_args.c
@@ -45,8 +45,6 @@

#include "failsafe_private.h"

-#define DEVARGS_MAXLEN 4096
-
/* Callback used when a new device is found in devargs */
typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
uint8_t head);
diff --git a/drivers/net/failsafe/failsafe_eal.c b/drivers/net/failsafe/failsafe_eal.c
index 19d26f5..33a5adf 100644
--- a/drivers/net/failsafe/failsafe_eal.c
+++ b/drivers/net/failsafe/failsafe_eal.c
@@ -36,39 +36,77 @@
#include "failsafe_private.h"

static int
+fs_ethdev_portid_get(const char *name, uint16_t *port_id)
+{
+ uint16_t pid;
+ size_t len;
+
+ if (name == NULL) {
+ DEBUG("Null pointer is specified\n");
+ return -EINVAL;
+ }
+ len = strlen(name);
+ RTE_ETH_FOREACH_DEV(pid) {
+ if (!strncmp(name, rte_eth_devices[pid].device->name, len)) {
+ *port_id = pid;
+ return 0;
+ }
+ }
+ return -ENODEV;
+}
+
+static int
fs_bus_init(struct rte_eth_dev *dev)
{
struct sub_device *sdev;
struct rte_devargs *da;
uint8_t i;
- uint16_t j;
+ uint16_t pid;
int ret;

FOREACH_SUBDEV(sdev, i, dev) {
if (sdev->state != DEV_PARSED)
continue;
da = &sdev->devargs;
- ret = rte_eal_hotplug_add(da->bus->name,
- da->name,
- da->args);
- if (ret) {
- ERROR("sub_device %d probe failed %s%s%s", i,
- rte_errno ? "(" : "",
- rte_errno ? strerror(rte_errno) : "",
- rte_errno ? ")" : "");
- continue;
- }
- RTE_ETH_FOREACH_DEV(j) {
- if (strcmp(rte_eth_devices[j].device->name,
- da->name) == 0) {
- ETH(sdev) = &rte_eth_devices[j];
- break;
+ if (fs_ethdev_portid_get(da->name, &pid) != 0) {
+ ret = rte_eal_hotplug_add(da->bus->name,
+ da->name,
+ da->args);
+ if (ret) {
+ ERROR("sub_device %d probe failed %s%s%s", i,
+ rte_errno ? "(" : "",
+ rte_errno ? strerror(rte_errno) : "",
+ rte_errno ? ")" : "");
+ continue;
}
+ if (fs_ethdev_portid_get(da->name, &pid) != 0) {
+ ERROR("sub_device %d init went wrong", i);
+ return -ENODEV;
+ }
+ } else {
+ char devstr[DEVARGS_MAXLEN] = "";
+ struct rte_devargs *probed_da =
+ rte_eth_devices[pid].device->devargs;
+
+ /* Take control of device probed by EAL options. */
+ free(da->args);
+ memset(da, 0, sizeof(*da));
+ if (probed_da != NULL)
+ snprintf(devstr, sizeof(devstr), "%s,%s",
+ probed_da->name, probed_da->args);
+ else
+ snprintf(devstr, sizeof(devstr), "%s",
+ rte_eth_devices[pid].device->name);
+ ret = rte_eal_devargs_parse(devstr, da);
+ if (ret) {
+ ERROR("Probed devargs parsing failed with code"
+ " %d", ret);
+ return ret;
+ }
+ INFO("Taking control of a probed sub device"
+ " %d named %s", i, da->name);
}
- if (ETH(sdev) == NULL) {
- ERROR("sub_device %d init went wrong", i);
- return -ENODEV;
- }
+ ETH(sdev) = &rte_eth_devices[pid];
SUB_ID(sdev) = i;
sdev->fs_dev = dev;
sdev->dev = ETH(sdev)->device;
diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
index 5e04ffe..9fcf72e 100644
--- a/drivers/net/failsafe/failsafe_private.h
+++ b/drivers/net/failsafe/failsafe_private.h
@@ -58,6 +58,8 @@
#define FAILSAFE_MAX_ETHPORTS 2
#define FAILSAFE_MAX_ETHADDR 128

+#define DEVARGS_MAXLEN 4096
+
/* TYPES */

struct rxq {
--
1.8.3.1
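
A side note on the name lookup in fs_ethdev_portid_get() above: it compares only the first strlen(name) characters, whereas the loop it replaces used strcmp(). The sketch below contrasts the two on a toy table. The table and its device names are made up for illustration; whether such name collisions occur with real devices is not established in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the ethdev table; these names are hypothetical. */
static const char *const port_names[] = { "net_ring01", "net_ring0" };

#define NB_PORTS (sizeof(port_names) / sizeof(port_names[0]))

/* Prefix comparison: strncmp() limited to strlen(name) characters. */
int
lookup_prefix(const char *name)
{
	size_t len = strlen(name);
	size_t i;

	for (i = 0; i != NB_PORTS; ++i)
		if (!strncmp(name, port_names[i], len))
			return (int)i;
	return -1;
}

/* Exact comparison: strcmp() over the whole string. */
int
lookup_exact(const char *name)
{
	size_t i;

	for (i = 0; i != NB_PORTS; ++i)
		if (!strcmp(name, port_names[i]))
			return (int)i;
	return -1;
}
```

With this table, looking up "net_ring0" by prefix stops at entry 0 ("net_ring01"), while the exact comparison returns entry 1.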
Gaëtan Rivet
2018-01-18 09:10:56 UTC
Hi Matan,
Post by Matan Azrad
The previous fail-safe code did not support capturing sub-devices already
probed by EAL and failed when it tried to probe them again.
Skip sub-device probing when the sub-device has already been probed.
What happens when

app --vdev "net_failsafe0,dev(net_failsafe0)" -- -i

? I guess infinite recursion.
Post by Matan Azrad
---
doc/guides/nics/fail_safe.rst | 5 +++
drivers/net/failsafe/failsafe_args.c | 2 -
drivers/net/failsafe/failsafe_eal.c | 78 ++++++++++++++++++++++++---------
drivers/net/failsafe/failsafe_private.h | 2 +
4 files changed, 65 insertions(+), 22 deletions(-)
diff --git a/doc/guides/nics/fail_safe.rst b/doc/guides/nics/fail_safe.rst
index 5b1b47e..b89e53b 100644
--- a/doc/guides/nics/fail_safe.rst
+++ b/doc/guides/nics/fail_safe.rst
@@ -115,6 +115,11 @@ Fail-safe command line parameters
order to take only the last line into account (unlike ``exec()``) at every
probe attempt.
+
+ In case of whitelist sub-device probed by EAL, fail-safe PMD will take the device
+ as is, which means that EAL device options are taken in this case.
+
This note should be right under the "dev()" parameter help I think.

If the self-capture is possible and you fix it, you should as well add a line
here about the limitation, concerning the PCI blacklist mode and the
expected PCI id format?

Something like:

--- 8< ---

When trying to use a PCI device automatically probed in blacklist mode,
the syntax for the fail-safe must be with the full PCI id:
Domain:Bus:Device.Function. See the usage example section.

.. ^^^^^^^^^^^^^ Here, an ReST reference
.. Would be nice, I don't recall
.. the exact syntax.
.. In the `Usage example` section:

#. Start testpmd, automatically probing the device 84:00.0 and using it with
the fail-safe

.. code-block:: console

$RTE_TARGET/build/app/testpmd -c 0xff -n 4 \
--vdev 'net_failsafe0,dev(0000:84:00.0),dev(net_ring0)' \
-- -i

--- >8 ---

Ensure that this is working before using this command, I haven't tested it.
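
For reference, the difference between the short "84:00.0" form and the full Domain:Bus:Device.Function form can be sketched as below. pci_full_id() is a hypothetical helper written for this message only; nothing here claims the fail-safe PMD or EAL performs such a normalization.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Expand "BB:DD.F" or "DDDD:BB:DD.F" into the full "DDDD:BB:DD.F" form.
 * Returns 0 on success, -1 when the input matches neither form or the
 * output buffer is too small. Hypothetical helper, for illustration.
 */
int
pci_full_id(const char *in, char *out, size_t size)
{
	unsigned int dom, bus, dev, fn;
	int n = 0;
	int ret;

	if (sscanf(in, "%x:%x:%x.%x%n", &dom, &bus, &dev, &fn, &n) != 4) {
		/* No domain given: retry with the short form, domain 0. */
		dom = 0;
		n = 0;
		if (sscanf(in, "%x:%x.%x%n", &bus, &dev, &fn, &n) != 3)
			return -1;
	}
	/* Reject trailing characters and out-of-range fields. */
	if (in[n] != '\0' || dom > 0xffff || bus > 0xff ||
	    dev > 0x1f || fn > 7)
		return -1;
	ret = snprintf(out, size, "%04x:%02x:%02x.%x", dom, bus, dev, fn);
	if (ret < 0 || (size_t)ret >= size)
		return -1;
	return 0;
}
```

Both "84:00.0" and "0000:84:00.0" yield "0000:84:00.0", while a vdev name such as "net_ring0" is rejected.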

Regards,
--
Gaëtan Rivet
6WIND
Matan Azrad
2018-01-18 09:33:58 UTC
Hi Gaetan

From: Gaëtan Rivet, Thursday, January 18, 2018 11:11 AM
Subject: Re: [PATCH v4 3/8] net/failsafe: add probed etherdev capture
Hi Matan,
Post by Matan Azrad
The previous fail-safe code did not support capturing sub-devices already
probed by EAL and failed when it tried to probe them again.
Skip sub-device probing when the sub-device has already been probed.
What happens when
app --vdev "net_failsafe0,dev(net_failsafe0)" -- -i
? I guess infinite recursion.
:) interesting

./x86_64-native-linuxapp-gcc/build/app/test-pmd/testpmd -n 4 --vdev="net_failsafe0,dev(net_failsafe0)" --vdev="net_vdev_netvsc,ignore=1" -- --burst=118 --mbcache=512 --portmask 0xf -i --nb-cores=11 --rxq=2 --txq=2 --txd=1024 --rxd=1024
EAL: Detected 12 lcore(s)
EAL: No free hugepages reported in hugepages-1048576kB
EAL: Debug dataplane logs available - lower performance
EAL: Probing VFIO support...
EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
EAL: PCI device 0002:00:02.0 on NUMA socket 0
EAL: probe driver: 15b3:1004 net_mlx4
PMD: net_mlx4: PCI information matches, using device "mlx4_0" (VF: true)
PMD: net_mlx4: 1 port(s) detected
PMD: net_mlx4: port 1 MAC address is 00:15:5d:44:4b:24
PMD: net_failsafe: Initializing Fail-safe PMD for net_failsafe0
PMD: net_failsafe: Creating fail-safe device on NUMA socket 0
PMD: net_failsafe: Taking control of a probed sub device 0 named net_failsafe0
PMD: net_failsafe: MAC address is 00:00:00:00:00:00
Interactive-mode selected
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=327680, size=2176, socket=0
Configuring Port 0 (socket 0)
Port 0: 00:15:5D:44:4B:24
Checking link statuses...
Done
testpmd>

Failsafe0 took control of itself (since it is already probed we don't probe it again).
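
If this self-capture should be rejected instead, a name comparison before the sub-device lookup would be one way to do it. This is only a sketch: fs_sub_is_self() and both of its parameters are hypothetical, and it catches direct self-reference only, not loops going through another fail-safe instance.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return nonzero when a sub-device name designates the fail-safe
 * instance itself. own_name is the name of the fail-safe being
 * initialized, sub_name the name taken from a dev() argument.
 * Hypothetical helper, not part of the patch.
 */
int
fs_sub_is_self(const char *own_name, const char *sub_name)
{
	return own_name != NULL && sub_name != NULL &&
	       strcmp(own_name, sub_name) == 0;
}
```

For the command line above, fs_sub_is_self("net_failsafe0", "net_failsafe0") is nonzero, so the sub-device could be refused with an error at parse time.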
Post by Matan Azrad
---
doc/guides/nics/fail_safe.rst | 5 +++
drivers/net/failsafe/failsafe_args.c | 2 -
drivers/net/failsafe/failsafe_eal.c | 78 ++++++++++++++++++++++++---------
drivers/net/failsafe/failsafe_private.h | 2 +
4 files changed, 65 insertions(+), 22 deletions(-)
diff --git a/doc/guides/nics/fail_safe.rst b/doc/guides/nics/fail_safe.rst
index 5b1b47e..b89e53b 100644
--- a/doc/guides/nics/fail_safe.rst
+++ b/doc/guides/nics/fail_safe.rst
@@ -115,6 +115,11 @@ Fail-safe command line parameters
order to take only the last line into account (unlike ``exec()``) at every
probe attempt.
+
+ In case of whitelist sub-device probed by EAL, fail-safe PMD will take the device
+ as is, which means that EAL device options are taken in this case.
+
This note should be right under the "dev()" parameter help I think.
OK.
If the self-capture is possible and you fix it, you should as well add a line here
about the limitation, concerning the PCI blacklist mode and the expected PCI
id format?
--- 8< ---
When trying to use a PCI device automatically probed in blacklist mode,
the syntax for the fail-safe must be with the full PCI id:
Domain:Bus:Device.Function. See the usage example section.
.. ^^^^^^^^^^^^^ Here, an ReST reference
.. Would be nice, I don't recall
.. the exact syntax.
#. Start testpmd, automatically probing the device 84:00.0 and using it with
the fail-safe
.. code-block:: console
$RTE_TARGET/build/app/testpmd -c 0xff -n 4 \
--vdev 'net_failsafe0,dev(0000:84:00.0),dev(net_ring0)' \
-- -i
--- >8 ---
Ok.
Ensure that this is working before using this command, I haven't tested it.
Sure.
Regards,
--
Gaëtan Rivet
6WIND
Matan Azrad
2018-01-18 08:43:42 UTC
This patch lays the groundwork for this driver (draft documentation,
copyright notices, code base skeleton and build system hooks). While it can
be successfully compiled and invoked, it's an empty shell at this stage.

Signed-off-by: Adrien Mazarguil <***@6wind.com>
Signed-off-by: Matan Azrad <***@mellanox.com>
---
MAINTAINERS | 6 ++
config/common_base | 5 ++
config/common_linuxapp | 1 +
doc/guides/nics/features/vdev_netvsc.ini | 12 +++
doc/guides/nics/index.rst | 1 +
doc/guides/nics/vdev_netvsc.rst | 20 +++++
drivers/net/Makefile | 1 +
drivers/net/vdev_netvsc/Makefile | 27 ++++++
.../vdev_netvsc/rte_pmd_vdev_netvsc_version.map | 4 +
drivers/net/vdev_netvsc/vdev_netvsc.c | 99 ++++++++++++++++++++++
mk/rte.app.mk | 1 +
11 files changed, 177 insertions(+)
create mode 100644 doc/guides/nics/features/vdev_netvsc.ini
create mode 100644 doc/guides/nics/vdev_netvsc.rst
create mode 100644 drivers/net/vdev_netvsc/Makefile
create mode 100644 drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map
create mode 100644 drivers/net/vdev_netvsc/vdev_netvsc.c

diff --git a/MAINTAINERS b/MAINTAINERS
index af8de4f..97efbb9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -462,6 +462,12 @@ F: drivers/net/mrvl/
F: doc/guides/nics/mrvl.rst
F: doc/guides/nics/features/mrvl.ini

+Microsoft vdev-netvsc - EXPERIMENTAL
+M: Matan Azrad <***@mellanox.com>
+F: drivers/net/vdev_netvsc/
+F: doc/guides/nics/vdev_netvsc.rst
+F: doc/guides/nics/features/vdev_netvsc.ini
+
Netcope szedata2
M: Matej Vido <***@cesnet.cz>
F: drivers/net/szedata2/
diff --git a/config/common_base b/config/common_base
index 90508a8..664ff21 100644
--- a/config/common_base
+++ b/config/common_base
@@ -279,6 +279,11 @@ CONFIG_RTE_LIBRTE_NFP_DEBUG_RX=n
CONFIG_RTE_LIBRTE_MRVL_PMD=n

#
+# Compile virtual device driver for NetVSC on Hyper-V/Azure
+#
+CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD=n
+
+#
# Compile burst-oriented Broadcom BNXT PMD driver
#
CONFIG_RTE_LIBRTE_BNXT_PMD=y
diff --git a/config/common_linuxapp b/config/common_linuxapp
index 74c7d64..e043262 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -47,6 +47,7 @@ CONFIG_RTE_LIBRTE_PMD_VHOST=y
CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
CONFIG_RTE_LIBRTE_PMD_TAP=y
CONFIG_RTE_LIBRTE_AVP_PMD=y
+CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD=y
CONFIG_RTE_LIBRTE_NFP_PMD=y
CONFIG_RTE_LIBRTE_POWER=y
CONFIG_RTE_VIRTIO_USER=y
diff --git a/doc/guides/nics/features/vdev_netvsc.ini b/doc/guides/nics/features/vdev_netvsc.ini
new file mode 100644
index 0000000..cfc5cb9
--- /dev/null
+++ b/doc/guides/nics/features/vdev_netvsc.ini
@@ -0,0 +1,12 @@
+;
+; Supported features of the 'vdev_netvsc' network poll mode driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+ARMv7 = Y
+ARMv8 = Y
+Power8 = Y
+x86-32 = Y
+x86-64 = Y
+Usage doc = Y
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index 23babe9..5666046 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -64,6 +64,7 @@ Network Interface Controller Drivers
szedata2
tap
thunderx
+ vdev_netvsc
virtio
vhost
vmxnet3
diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
new file mode 100644
index 0000000..a952908
--- /dev/null
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: BSD-3-Clause
+ Copyright 2017 6WIND S.A.
+ Copyright 2017 Mellanox Technologies, Ltd.
+
+VDEV_NETVSC driver
+==================
+
+The VDEV_NETVSC driver (librte_pmd_vdev_netvsc) provides support for NetVSC
+interfaces and associated SR-IOV virtual function (VF) devices found in
+Linux virtual machines running on Microsoft Hyper-V_ (including Azure)
+platforms.
+
+.. _Hyper-V: https://docs.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-hyper-v
+
+Build options
+-------------
+
+- ``CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD`` (default ``y``)
+
+ Toggle compilation of this driver.
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index c2fd7f5..e112732 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -39,6 +39,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_SFC_EFX_PMD) += sfc
DIRS-$(CONFIG_RTE_LIBRTE_PMD_SZEDATA2) += szedata2
DIRS-$(CONFIG_RTE_LIBRTE_PMD_TAP) += tap
DIRS-$(CONFIG_RTE_LIBRTE_THUNDERX_NICVF_PMD) += thunderx
+DIRS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += vdev_netvsc
DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio
DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += vmxnet3

diff --git a/drivers/net/vdev_netvsc/Makefile b/drivers/net/vdev_netvsc/Makefile
new file mode 100644
index 0000000..2fb059d
--- /dev/null
+++ b/drivers/net/vdev_netvsc/Makefile
@@ -0,0 +1,27 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright 2017 6WIND S.A.
+# Copyright 2017 Mellanox Technologies, Ltd.
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+# Properties of the generated library.
+LIB = librte_pmd_vdev_netvsc.a
+LIBABIVER := 1
+EXPORT_MAP := rte_pmd_vdev_netvsc_version.map
+
+# Additional compilation flags.
+CFLAGS += -O3
+CFLAGS += -g
+CFLAGS += -std=c11 -pedantic -Wall -Wextra
+CFLAGS += $(WERROR_FLAGS)
+
+# Dependencies.
+LDLIBS += -lrte_bus_vdev
+LDLIBS += -lrte_eal
+LDLIBS += -lrte_ethdev
+LDLIBS += -lrte_kvargs
+
+# Source files.
+SRCS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += vdev_netvsc.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map b/drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map
new file mode 100644
index 0000000..179140f
--- /dev/null
+++ b/drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map
@@ -0,0 +1,4 @@
+DPDK_18.02 {
+
+ local: *;
+};
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
new file mode 100644
index 0000000..e895b32
--- /dev/null
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -0,0 +1,99 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2017 6WIND S.A.
+ * Copyright 2017 Mellanox Technologies, Ltd.
+ */
+
+#include <stddef.h>
+
+#include <rte_bus_vdev.h>
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_kvargs.h>
+#include <rte_log.h>
+
+#define VDEV_NETVSC_DRIVER net_vdev_netvsc
+#define VDEV_NETVSC_ARG_IFACE "iface"
+#define VDEV_NETVSC_ARG_MAC "mac"
+
+#define DRV_LOG(level, ...) \
+ rte_log(RTE_LOG_ ## level, \
+ vdev_netvsc_logtype, \
+ RTE_FMT(RTE_STR(VDEV_NETVSC_DRIVER) ": " \
+ RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
+ RTE_FMT_TAIL(__VA_ARGS__,)))
+
+/** Driver-specific log messages type. */
+static int vdev_netvsc_logtype;
+
+/** Number of driver instances relying on context list. */
+static unsigned int vdev_netvsc_ctx_inst;
+
+/**
+ * Probe NetVSC interfaces.
+ *
+ * @param dev
+ * Virtual device context for driver instance.
+ *
+ * @return
+ * Always 0, even in case of errors.
+ */
+static int
+vdev_netvsc_vdev_probe(struct rte_vdev_device *dev)
+{
+ static const char *const vdev_netvsc_arg[] = {
+ VDEV_NETVSC_ARG_IFACE,
+ VDEV_NETVSC_ARG_MAC,
+ NULL,
+ };
+ const char *name = rte_vdev_device_name(dev);
+ const char *args = rte_vdev_device_args(dev);
+ struct rte_kvargs *kvargs = rte_kvargs_parse(args ? args : "",
+ vdev_netvsc_arg);
+
+ DRV_LOG(DEBUG, "invoked as \"%s\", using arguments \"%s\"", name, args);
+ if (!kvargs) {
+ DRV_LOG(ERR, "cannot parse arguments list");
+ goto error;
+ }
+error:
+ if (kvargs)
+ rte_kvargs_free(kvargs);
+ ++vdev_netvsc_ctx_inst;
+ return 0;
+}
+
+/**
+ * Remove driver instance.
+ *
+ * @param dev
+ * Virtual device context for driver instance.
+ *
+ * @return
+ * Always 0.
+ */
+static int
+vdev_netvsc_vdev_remove(__rte_unused struct rte_vdev_device *dev)
+{
+ --vdev_netvsc_ctx_inst;
+ return 0;
+}
+
+/** Virtual device descriptor. */
+static struct rte_vdev_driver vdev_netvsc_vdev = {
+ .probe = vdev_netvsc_vdev_probe,
+ .remove = vdev_netvsc_vdev_remove,
+};
+
+RTE_PMD_REGISTER_VDEV(VDEV_NETVSC_DRIVER, vdev_netvsc_vdev);
+RTE_PMD_REGISTER_ALIAS(VDEV_NETVSC_DRIVER, eth_vdev_netvsc);
+RTE_PMD_REGISTER_PARAM_STRING(net_vdev_netvsc,
+ VDEV_NETVSC_ARG_IFACE "=<string> "
+ VDEV_NETVSC_ARG_MAC "=<string>");
+
+/** Initialize driver log type. */
+RTE_INIT(vdev_netvsc_init_log)
+{
+ vdev_netvsc_logtype = rte_log_register("pmd.vdev_netvsc");
+ if (vdev_netvsc_logtype >= 0)
+ rte_log_set_level(vdev_netvsc_logtype, RTE_LOG_NOTICE);
+}
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 78f23c5..2f8af49 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -157,6 +157,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_SFC_EFX_PMD) += -lrte_pmd_sfc_efx
_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_SZEDATA2) += -lrte_pmd_szedata2 -lsze2
_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_TAP) += -lrte_pmd_tap
_LDLIBS-$(CONFIG_RTE_LIBRTE_THUNDERX_NICVF_PMD) += -lrte_pmd_thunderx_nicvf
+_LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += -lrte_pmd_vdev_netvsc
_LDLIBS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += -lrte_pmd_virtio
ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += -lrte_pmd_vhost
--
1.8.3.1
Matan Azrad
2018-01-18 08:43:43 UTC
As described in more detail in the attached documentation (see patch
contents), this virtual device driver manages NetVSC interfaces in virtual
machines hosted by Hyper-V/Azure platforms.

This driver does not manage traffic or Ethernet devices directly; it acts
as a thin configuration layer that automatically instantiates and controls
fail-safe PMD instances combining tap and PCI sub-devices, so that each
NetVSC interface is exposed as a single consolidated port to DPDK
applications.

Since PCI sub-devices are hot-pluggable (e.g. during VM migration),
applications automatically benefit from increased throughput when such a
device is present and fall back on NetVSC otherwise, without interruption,
thanks to fail-safe's hot-plug handling.

Once initialized, the sole job of the vdev_netvsc driver is to regularly
scan for PCI devices to associate with NetVSC interfaces and feed their
addresses to corresponding fail-safe instances.
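
As a rough illustration of that last step, the address hand-off boils down to writing one newline-terminated line to the write end of a pipe. The sketch below is a simplification of what vdev_netvsc_device_probe() does in the patch: send_addr_line() is a hypothetical name, the 64-byte buffer is an arbitrary choice, and recovery from partial writes is left out.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/*
 * Write addr followed by '\n' to fd with a single write() call.
 * Returns 0 when the whole line was written, -1 otherwise.
 * Hypothetical helper, simplified from the patch.
 */
int
send_addr_line(int fd, const char *addr)
{
	char line[64];
	size_t len = strlen(addr);
	ssize_t ret;

	/* Leave room for the trailing newline. */
	if (len + 1 > sizeof(line))
		return -1;
	memcpy(line, addr, len);
	line[len] = '\n';
	ret = write(fd, line, len + 1);
	return ret == (ssize_t)(len + 1) ? 0 : -1;
}
```

The reader on the other end of the pipe then sees one PCI address per line, which fits the fail-safe documentation's statement that only the last line is taken into account at every probe attempt.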

Signed-off-by: Adrien Mazarguil <***@6wind.com>
Signed-off-by: Matan Azrad <***@mellanox.com>
---
doc/guides/nics/vdev_netvsc.rst | 70 +++++
drivers/net/vdev_netvsc/Makefile | 4 +
drivers/net/vdev_netvsc/vdev_netvsc.c | 550 +++++++++++++++++++++++++++++++++-
3 files changed, 623 insertions(+), 1 deletion(-)

diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
index a952908..fde1fb8 100644
--- a/doc/guides/nics/vdev_netvsc.rst
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -12,9 +12,79 @@ platforms.

.. _Hyper-V: https://docs.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-hyper-v

+Implementation details
+----------------------
+
+Each instance of this driver effectively needs to drive two devices: the
+NetVSC interface proper and its SR-IOV VF (referred to as "physical" from
+this point on) counterpart sharing the same MAC address.
+
+Physical devices are part of the host system and cannot be maintained during
+VM migration. From a VM standpoint they appear as hot-plug devices that come
+and go without prior notice.
+
+When the physical device is present, egress and most of the ingress traffic
+flows through it; only multicasts and other hypervisor control data still flow
+through NetVSC. Otherwise, NetVSC acts as a fallback for all traffic.
+
+To avoid unnecessary code duplication and ensure maximum performance,
+handling of physical devices is left to their original PMDs; this virtual
+device driver (also known as *vdev*) manages other PMDs as summarized by the
+following block diagram::
+
+ .------------------.
+ | DPDK application |
+ `--------+---------'
+ |
+ .------+------.
+ | DPDK ethdev |
+ `------+------' Control
+ | |
+ .------------+------------. v .--------------------.
+ | failsafe PMD +---------+ vdev_netvsc driver |
+ `--+-------------------+--' `--------------------'
+ | |
+ | .........|.........
+ | : | :
+ .----+----. : .----+----. :
+ | tap PMD | : | any PMD | :
+ `----+----' : `----+----' : <-- Hot-pluggable
+ | : | :
+ .------+-------. : .-----+-----. :
+ | NetVSC-based | : | SR-IOV VF | :
+ | netdevice | : | device | :
+ `--------------' : `-----------' :
+ :.................:
+
+
+This driver implementation may be temporary and should be improved or removed
+either once hot-plug is fully supported in EAL and bus drivers or once
+a new NetVSC driver is integrated.
+
Build options
-------------

- ``CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD`` (default ``y``)

Toggle compilation of this driver.
+
+Run-time parameters
+-------------------
+
+To invoke this driver, applications have to explicitly provide the
+``--vdev=net_vdev_netvsc`` EAL option.
+
+The following device parameters are supported:
+
+- ``iface`` [string]
+
+ Provide a specific NetVSC interface (netdevice) name to attach this driver
+ to. Can be provided multiple times for additional instances.
+
+- ``mac`` [string]
+
+ Same as ``iface`` except a suitable NetVSC interface is located using its
+ MAC address.
+
+Not specifying either ``iface`` or ``mac`` makes this driver attach itself to
+all NetVSC interfaces found on the system.
diff --git a/drivers/net/vdev_netvsc/Makefile b/drivers/net/vdev_netvsc/Makefile
index 2fb059d..f2b2ac5 100644
--- a/drivers/net/vdev_netvsc/Makefile
+++ b/drivers/net/vdev_netvsc/Makefile
@@ -13,6 +13,9 @@ EXPORT_MAP := rte_pmd_vdev_netvsc_version.map
CFLAGS += -O3
CFLAGS += -g
CFLAGS += -std=c11 -pedantic -Wall -Wextra
+CFLAGS += -D_XOPEN_SOURCE=600
+CFLAGS += -D_BSD_SOURCE
+CFLAGS += -D_DEFAULT_SOURCE
CFLAGS += $(WERROR_FLAGS)

# Dependencies.
@@ -20,6 +23,7 @@ LDLIBS += -lrte_bus_vdev
LDLIBS += -lrte_eal
LDLIBS += -lrte_ethdev
LDLIBS += -lrte_kvargs
+LDLIBS += -lrte_net

# Source files.
SRCS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += vdev_netvsc.c
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
index e895b32..21c3265 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -3,17 +3,42 @@
* Copyright 2017 Mellanox Technologies, Ltd.
*/

+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <linux/sockios.h>
+#include <net/if.h>
+#include <net/if_arp.h>
+#include <netinet/ip.h>
+#include <stdarg.h>
#include <stddef.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/queue.h>
+#include <sys/socket.h>
+#include <unistd.h>

+#include <rte_alarm.h>
+#include <rte_bus.h>
#include <rte_bus_vdev.h>
#include <rte_common.h>
#include <rte_config.h>
+#include <rte_dev.h>
+#include <rte_errno.h>
+#include <rte_ethdev.h>
+#include <rte_ether.h>
#include <rte_kvargs.h>
#include <rte_log.h>

#define VDEV_NETVSC_DRIVER net_vdev_netvsc
#define VDEV_NETVSC_ARG_IFACE "iface"
#define VDEV_NETVSC_ARG_MAC "mac"
+#define VDEV_NETVSC_PROBE_MS 1000
+
+#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"

#define DRV_LOG(level, ...) \
rte_log(RTE_LOG_ ## level, \
@@ -25,12 +50,495 @@
/** Driver-specific log messages type. */
static int vdev_netvsc_logtype;

+/** Context structure for a vdev_netvsc instance. */
+struct vdev_netvsc_ctx {
+ LIST_ENTRY(vdev_netvsc_ctx) entry; /**< Next entry in list. */
+ unsigned int id; /**< Unique ID. */
+ char name[64]; /**< Unique name. */
+ char devname[64]; /**< Fail-safe instance name. */
+ char devargs[256]; /**< Fail-safe device arguments. */
+ char if_name[IF_NAMESIZE]; /**< NetVSC netdevice name. */
+ unsigned int if_index; /**< NetVSC netdevice index. */
+ struct ether_addr if_addr; /**< NetVSC MAC address. */
+ int pipe[2]; /**< Fail-safe communication pipe. */
+ char yield[256]; /**< PCI sub-device arguments. */
+};
+
+/** Context list is common to all driver instances. */
+static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
+ LIST_HEAD_INITIALIZER(vdev_netvsc_ctx_list);
+
+/** Number of entries in context list. */
+static unsigned int vdev_netvsc_ctx_count;
+
/** Number of driver instances relying on context list. */
static unsigned int vdev_netvsc_ctx_inst;

/**
+ * Destroy a vdev_netvsc context instance.
+ *
+ * @param ctx
+ * Context to destroy.
+ */
+static void
+vdev_netvsc_ctx_destroy(struct vdev_netvsc_ctx *ctx)
+{
+ if (ctx->pipe[0] != -1)
+ close(ctx->pipe[0]);
+ if (ctx->pipe[1] != -1)
+ close(ctx->pipe[1]);
+ free(ctx);
+}
+
+/**
+ * Iterate over system network interfaces.
+ *
+ * This function runs a given callback function for each netdevice found on
+ * the system.
+ *
+ * @param func
+ * Callback function pointer. List traversal is aborted when this function
+ * returns a nonzero value.
+ * @param ...
+ * Variable parameter list passed as @p va_list to @p func.
+ *
+ * @return
+ * 0 when the entire list is traversed successfully, a negative error code
+ * in case of failure, or the nonzero value returned by @p func when list
+ * traversal is aborted.
+ */
+static int
+vdev_netvsc_foreach_iface(int (*func)(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap), ...)
+{
+ struct if_nameindex *iface = if_nameindex();
+ int s = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+ unsigned int i;
+ int ret = 0;
+
+ if (!iface) {
+ ret = -ENOBUFS;
+ DRV_LOG(ERR, "cannot retrieve system network interfaces");
+ goto error;
+ }
+ if (s == -1) {
+ ret = -errno;
+ DRV_LOG(ERR, "cannot open socket: %s", rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; iface[i].if_name; ++i) {
+ struct ifreq req;
+ struct ether_addr eth_addr;
+ va_list ap;
+
+ strncpy(req.ifr_name, iface[i].if_name, sizeof(req.ifr_name));
+ if (ioctl(s, SIOCGIFHWADDR, &req) == -1) {
+ DRV_LOG(WARNING, "cannot retrieve information about"
+ " interface \"%s\": %s",
+ req.ifr_name, rte_strerror(errno));
+ continue;
+ }
+ if (req.ifr_hwaddr.sa_family != ARPHRD_ETHER) {
+ DRV_LOG(DEBUG, "interface %s is a non-Ethernet device",
+ req.ifr_name);
+ continue;
+ }
+ memcpy(eth_addr.addr_bytes, req.ifr_hwaddr.sa_data,
+ RTE_DIM(eth_addr.addr_bytes));
+ va_start(ap, func);
+ ret = func(&iface[i], &eth_addr, ap);
+ va_end(ap);
+ if (ret)
+ break;
+ }
+error:
+ if (s != -1)
+ close(s);
+ if (iface)
+ if_freenameindex(iface);
+ return ret;
+}
+
+/**
+ * Determine if a network interface is NetVSC.
+ *
+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ *
+ * @return
+ * A nonzero value when interface is detected as NetVSC. In case of error,
+ * rte_errno is updated and 0 returned.
+ */
+static int
+vdev_netvsc_iface_is_netvsc(const struct if_nameindex *iface)
+{
+ static const char temp[] = "/sys/class/net/%s/device/class_id";
+ char path[sizeof(temp) + IF_NAMESIZE];
+ FILE *f;
+ int ret;
+ int len = 0;
+
+ ret = snprintf(path, sizeof(path), temp, iface->if_name);
+ if (ret == -1 || (size_t)ret >= sizeof(path)) {
+ rte_errno = ENOBUFS;
+ return 0;
+ }
+ f = fopen(path, "r");
+ if (!f) {
+ rte_errno = errno;
+ return 0;
+ }
+ ret = fscanf(f, NETVSC_CLASS_ID "%n", &len);
+ if (ret == EOF)
+ rte_errno = errno;
+ ret = len == (int)strlen(NETVSC_CLASS_ID);
+ fclose(f);
+ return ret;
+}
+
+/**
+ * Retrieve network interface data from sysfs symbolic link.
+ *
+ * @param[out] buf
+ * Output data buffer.
+ * @param size
+ * Output buffer size.
+ * @param[in] if_name
+ * Netdevice name.
+ * @param[in] relpath
+ * Symbolic link path relative to netdevice sysfs entry.
+ *
+ * @return
+ * 0 on success, a negative error code otherwise.
+ */
+static int
+vdev_netvsc_sysfs_readlink(char *buf, size_t size, const char *if_name,
+ const char *relpath)
+{
+ int ret;
+
+ ret = snprintf(buf, size, "/sys/class/net/%s/%s", if_name, relpath);
+ if (ret == -1 || (size_t)ret >= size)
+ return -ENOBUFS;
+ ret = readlink(buf, buf, size);
+ if (ret == -1)
+ return -errno;
+ if ((size_t)ret >= size - 1)
+ return -ENOBUFS;
+ buf[ret] = '\0';
+ return 0;
+}
+
+/**
+ * Probe a network interface to associate with vdev_netvsc context.
+ *
+ * This function determines if the network device matches the properties of
+ * the NetVSC interface associated with the vdev_netvsc context and
+ * communicates its bus address to the fail-safe PMD instance if so.
+ *
+ * It is normally used with vdev_netvsc_foreach_iface().
+ *
+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ * @param[in] eth_addr
+ * MAC address associated with @p iface.
+ * @param ap
+ * Variable arguments list comprising:
+ *
+ * - struct vdev_netvsc_ctx *ctx:
+ * Context to associate network interface with.
+ *
+ * @return
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+vdev_netvsc_device_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ struct vdev_netvsc_ctx *ctx = va_arg(ap, struct vdev_netvsc_ctx *);
+ char buf[RTE_MAX(sizeof(ctx->yield), 256u)];
+ const char *addr;
+ size_t len;
+ int ret;
+
+ /* Skip non-matching or unwanted NetVSC interfaces. */
+ if (ctx->if_index == iface->if_index) {
+ if (!strcmp(ctx->if_name, iface->if_name))
+ return 0;
+ DRV_LOG(DEBUG,
+ "NetVSC interface \"%s\" (index %u) renamed \"%s\"",
+ ctx->if_name, ctx->if_index, iface->if_name);
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ return 0;
+ }
+ if (vdev_netvsc_iface_is_netvsc(iface))
+ return 0;
+ if (!is_same_ether_addr(eth_addr, &ctx->if_addr))
+ return 0;
+ /* Look for associated PCI device. */
+ ret = vdev_netvsc_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device/subsystem");
+ if (ret)
+ return 0;
+ addr = strrchr(buf, '/');
+ addr = addr ? addr + 1 : buf;
+ if (strcmp(addr, "pci"))
+ return 0;
+ ret = vdev_netvsc_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device");
+ if (ret)
+ return 0;
+ addr = strrchr(buf, '/');
+ addr = addr ? addr + 1 : buf;
+ len = strlen(addr);
+ if (!len)
+ return 0;
+ /* Send PCI device argument to fail-safe PMD instance. */
+ if (strcmp(addr, ctx->yield))
+ DRV_LOG(DEBUG, "associating PCI device \"%s\" with NetVSC"
+ " interface \"%s\" (index %u)", addr, ctx->if_name,
+ ctx->if_index);
+ memmove(buf, addr, len + 1);
+ addr = buf;
+ buf[len] = '\n';
+ ret = write(ctx->pipe[1], addr, len + 1);
+ buf[len] = '\0';
+ if (ret == -1) {
+ if (errno == EINTR || errno == EAGAIN)
+ return 1;
+ DRV_LOG(WARNING, "cannot associate PCI device name \"%s\" with"
+ " interface \"%s\": %s", addr, ctx->if_name,
+ rte_strerror(errno));
+ return 1;
+ }
+ if ((size_t)ret != len + 1) {
+ /*
+ * Attempt to override previous partial write, no need to
+ * recover if that fails.
+ */
+ ret = write(ctx->pipe[1], "\n", 1);
+ (void)ret;
+ return 1;
+ }
+ fsync(ctx->pipe[1]);
+ memcpy(ctx->yield, addr, len + 1);
+ return 1;
+}
+
+/**
+ * Alarm callback that regularly probes system network interfaces.
+ *
+ * This callback runs at a frequency determined by VDEV_NETVSC_PROBE_MS as
+ * long as a vdev_netvsc context instance exists.
+ *
+ * @param arg
+ * Ignored.
+ */
+static void
+vdev_netvsc_alarm(__rte_unused void *arg)
+{
+ struct vdev_netvsc_ctx *ctx;
+ int ret;
+
+ LIST_FOREACH(ctx, &vdev_netvsc_ctx_list, entry) {
+ ret = vdev_netvsc_foreach_iface(vdev_netvsc_device_probe, ctx);
+ if (ret)
+ break;
+ }
+ if (!vdev_netvsc_ctx_count)
+ return;
+ ret = rte_eal_alarm_set(VDEV_NETVSC_PROBE_MS * 1000,
+ vdev_netvsc_alarm, NULL);
+ if (ret < 0) {
+ DRV_LOG(ERR, "unable to reschedule alarm callback: %s",
+ rte_strerror(-ret));
+ }
+}
+
+/**
+ * Probe a NetVSC interface to generate a vdev_netvsc context from.
+ *
+ * This function instantiates vdev_netvsc contexts either for all NetVSC
+ * devices found on the system or only a subset provided as device
+ * arguments.
+ *
+ * It is normally used with vdev_netvsc_foreach_iface().
+ *
+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ * @param[in] eth_addr
+ * MAC address associated with @p iface.
+ * @param ap
+ * Variable arguments list comprising:
+ *
+ * - const char *name:
+ * Name associated with current driver instance.
+ *
+ * - struct rte_kvargs *kvargs:
+ * Device arguments provided to current driver instance.
+ *
+ * - unsigned int specified:
+ * Number of specific netdevices provided as device arguments.
+ *
+ * - unsigned int *matched:
+ * The number of specified netdevices matched by this function.
+ *
+ * @return
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+vdev_netvsc_netvsc_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ const char *name = va_arg(ap, const char *);
+ struct rte_kvargs *kvargs = va_arg(ap, struct rte_kvargs *);
+ unsigned int specified = va_arg(ap, unsigned int);
+ unsigned int *matched = va_arg(ap, unsigned int *);
+ unsigned int i;
+ struct vdev_netvsc_ctx *ctx;
+ int ret;
+
+ /* Probe all interfaces when none are specified. */
+ if (specified) {
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE)) {
+ if (!strcmp(pair->value, iface->if_name))
+ break;
+ } else if (!strcmp(pair->key, VDEV_NETVSC_ARG_MAC)) {
+ struct ether_addr tmp;
+
+ if (sscanf(pair->value,
+ "%" SCNx8 ":%" SCNx8 ":%" SCNx8 ":"
+ "%" SCNx8 ":%" SCNx8 ":%" SCNx8,
+ &tmp.addr_bytes[0],
+ &tmp.addr_bytes[1],
+ &tmp.addr_bytes[2],
+ &tmp.addr_bytes[3],
+ &tmp.addr_bytes[4],
+ &tmp.addr_bytes[5]) != 6) {
+ DRV_LOG(ERR,
+ "invalid MAC address format"
+ " \"%s\"",
+ pair->value);
+ return -EINVAL;
+ }
+ if (is_same_ether_addr(eth_addr, &tmp))
+ break;
+ }
+ }
+ if (i == kvargs->count)
+ return 0;
+ ++(*matched);
+ }
+ /* Weed out interfaces already handled. */
+ LIST_FOREACH(ctx, &vdev_netvsc_ctx_list, entry)
+ if (ctx->if_index == iface->if_index)
+ break;
+ if (ctx) {
+ if (!specified)
+ return 0;
+ DRV_LOG(WARNING,
+ "interface \"%s\" (index %u) is already handled,"
+ " skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ if (!vdev_netvsc_iface_is_netvsc(iface)) {
+ if (!specified)
+ return 0;
+ DRV_LOG(WARNING,
+ "interface \"%s\" (index %u) is not NetVSC,"
+ " skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ /* Create interface context. */
+ ctx = calloc(1, sizeof(*ctx));
+ if (!ctx) {
+ ret = -errno;
+ DRV_LOG(ERR, "cannot allocate context for interface \"%s\": %s",
+ iface->if_name, rte_strerror(errno));
+ goto error;
+ }
+ ctx->id = vdev_netvsc_ctx_count;
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ ctx->if_index = iface->if_index;
+ ctx->if_addr = *eth_addr;
+ ctx->pipe[0] = -1;
+ ctx->pipe[1] = -1;
+ ctx->yield[0] = '\0';
+ if (pipe(ctx->pipe) == -1) {
+ ret = -errno;
+ DRV_LOG(ERR,
+ "cannot allocate control pipe for interface \"%s\": %s",
+ ctx->if_name, rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; i != RTE_DIM(ctx->pipe); ++i) {
+ int flf = fcntl(ctx->pipe[i], F_GETFL);
+
+ if (flf != -1 &&
+ fcntl(ctx->pipe[i], F_SETFL, flf | O_NONBLOCK) != -1)
+ continue;
+ ret = -errno;
+ DRV_LOG(ERR, "cannot toggle non-blocking flag on control file"
+ " descriptor #%u (%d): %s", i, ctx->pipe[i],
+ rte_strerror(errno));
+ goto error;
+ }
+ /* Generate virtual device name and arguments. */
+ i = 0;
+ ret = snprintf(ctx->name, sizeof(ctx->name), "%s_id%u",
+ name, ctx->id);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->name))
+ ++i;
+ ret = snprintf(ctx->devname, sizeof(ctx->devname), "net_failsafe_%s",
+ ctx->name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devname))
+ ++i;
+ ret = snprintf(ctx->devargs, sizeof(ctx->devargs),
+ "fd(%d),dev(net_tap_%s,remote=%s)",
+ ctx->pipe[0], ctx->name, ctx->if_name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devargs))
+ ++i;
+ if (i) {
+ ret = -ENOBUFS;
+ DRV_LOG(ERR, "generated virtual device name or argument list"
+ " too long for interface \"%s\"", ctx->if_name);
+ goto error;
+ }
+ /* Request virtual device generation. */
+ DRV_LOG(DEBUG, "generating virtual device \"%s\" with arguments \"%s\"",
+ ctx->devname, ctx->devargs);
+ vdev_netvsc_foreach_iface(vdev_netvsc_device_probe, ctx);
+ ret = rte_eal_hotplug_add("vdev", ctx->devname, ctx->devargs);
+ if (ret)
+ goto error;
+ LIST_INSERT_HEAD(&vdev_netvsc_ctx_list, ctx, entry);
+ ++vdev_netvsc_ctx_count;
+ DRV_LOG(DEBUG, "added NetVSC interface \"%s\" to context list",
+ ctx->if_name);
+ return 0;
+error:
+ if (ctx)
+ vdev_netvsc_ctx_destroy(ctx);
+ return ret;
+}
+
+/**
* Probe NetVSC interfaces.
*
+ * This function probes system netdevices according to the specified device
+ * arguments and starts a periodic alarm callback to notify the resulting
+ * fail-safe PMD instances of their sub-devices whereabouts.
+ *
* @param dev
* Virtual device context for driver instance.
*
@@ -49,12 +557,40 @@
const char *args = rte_vdev_device_args(dev);
struct rte_kvargs *kvargs = rte_kvargs_parse(args ? args : "",
vdev_netvsc_arg);
+ unsigned int specified = 0;
+ unsigned int matched = 0;
+ unsigned int i;
+ int ret;

DRV_LOG(DEBUG, "invoked as \"%s\", using arguments \"%s\"", name, args);
if (!kvargs) {
DRV_LOG(ERR, "cannot parse arguments list");
goto error;
}
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE) ||
+ !strcmp(pair->key, VDEV_NETVSC_ARG_MAC))
+ ++specified;
+ }
+ rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
+ /* Gather interfaces. */
+ ret = vdev_netvsc_foreach_iface(vdev_netvsc_netvsc_probe, name, kvargs,
+ specified, &matched);
+ if (ret < 0)
+ goto error;
+ if (matched < specified)
+ DRV_LOG(WARNING,
+ "some of the specified parameters did not match"
+ " recognized network interfaces");
+ ret = rte_eal_alarm_set(VDEV_NETVSC_PROBE_MS * 1000,
+ vdev_netvsc_alarm, NULL);
+ if (ret < 0) {
+ DRV_LOG(ERR, "unable to schedule alarm callback: %s",
+ rte_strerror(-ret));
+ goto error;
+ }
error:
if (kvargs)
rte_kvargs_free(kvargs);
@@ -65,6 +601,9 @@
/**
* Remove driver instance.
*
+ * The alarm callback and underlying vdev_netvsc context instances are only
+ * destroyed after the last PMD instance is removed.
+ *
* @param dev
* Virtual device context for driver instance.
*
@@ -74,7 +613,16 @@
static int
vdev_netvsc_vdev_remove(__rte_unused struct rte_vdev_device *dev)
{
- --vdev_netvsc_ctx_inst;
+ if (--vdev_netvsc_ctx_inst)
+ return 0;
+ rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
+ while (!LIST_EMPTY(&vdev_netvsc_ctx_list)) {
+ struct vdev_netvsc_ctx *ctx = LIST_FIRST(&vdev_netvsc_ctx_list);
+
+ LIST_REMOVE(ctx, entry);
+ --vdev_netvsc_ctx_count;
+ vdev_netvsc_ctx_destroy(ctx);
+ }
return 0;
}
--
1.8.3.1
Matan Azrad
2018-01-18 08:43:44 UTC
NetVSC netdevices that already have a route should not be probed because
Hyper-V uses them for management purposes.

Prevent probing of routed NetVSC devices.
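
For illustration only, the routed-interface test amounts to looking for the
interface name in the first tab-separated column of /proc/net/route. The
sketch below is not the code from this patch: it takes an already opened
stream instead of opening the file itself, and the name
route_table_mentions() and the 256-byte line buffer are assumptions made up
for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 when a line of the route table starts with "name" followed by
 * a tab, 0 otherwise. Lines longer than the buffer are consumed in several
 * fgets() calls; only the first chunk of each line is compared.
 */
static int
route_table_mentions(FILE *fp, const char *name)
{
	char line[256];
	size_t len;
	int at_line_start = 1;

	assert(fp != NULL && name != NULL);
	len = strlen(name);
	while (fgets(line, sizeof(line), fp) != NULL) {
		if (at_line_start &&
		    strncmp(line, name, len) == 0 && line[len] == '\t')
			return 1;
		/* The next chunk starts a new line only after a newline. */
		at_line_start = strchr(line, '\n') != NULL;
	}
	return 0;
}
```

Comparing the name together with the tab that follows it keeps "eth" from
matching an "eth0" entry. The header line of the table is not skipped, so an
interface literally named "Iface" would be reported as routed.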

Signed-off-by: Raslan Darawsheh <***@mellanox.com>
Signed-off-by: Matan Azrad <***@mellanox.com>
---
doc/guides/nics/vdev_netvsc.rst | 2 +-
drivers/net/vdev_netvsc/vdev_netvsc.c | 46 +++++++++++++++++++++++++++++++++++
2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
index fde1fb8..f779862 100644
--- a/doc/guides/nics/vdev_netvsc.rst
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -87,4 +87,4 @@ The following device parameters are supported:
MAC address.

Not specifying either ``iface`` or ``mac`` makes this driver attach itself to
-all NetVSC interfaces found on the system.
+all unrouted NetVSC interfaces found on the system.
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
index 21c3265..0055d0b 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -39,6 +39,7 @@
#define VDEV_NETVSC_PROBE_MS 1000

#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"
+#define NETVSC_MAX_ROUTE_LINE_SIZE 300

#define DRV_LOG(level, ...) \
rte_log(RTE_LOG_ ## level, \
@@ -198,6 +199,44 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
}

/**
+ * Determine if a network interface has a route.
+ *
+ * @param[in] name
+ * Network device name.
+ *
+ * @return
+ * A nonzero value when interface has a route. In case of error,
+ * rte_errno is updated and 0 returned.
+ */
+static int
+vdev_netvsc_has_route(const char *name)
+{
+ FILE *fp;
+ int ret = 0;
+ char route[NETVSC_MAX_ROUTE_LINE_SIZE];
+ char *netdev;
+
+ fp = fopen("/proc/net/route", "r");
+ if (!fp) {
+ rte_errno = errno;
+ return 0;
+ }
+ while (fgets(route, NETVSC_MAX_ROUTE_LINE_SIZE, fp) != NULL) {
+ netdev = strtok(route, "\t");
+ if (strcmp(netdev, name) == 0) {
+ ret = 1;
+ break;
+ }
+ /* Move file pointer to the next line. */
+ while (strchr(route, '\n') == NULL &&
+ fgets(route, NETVSC_MAX_ROUTE_LINE_SIZE, fp) != NULL)
+ ;
+ }
+ fclose(fp);
+ return ret;
+}
+
+/**
* Retrieve network interface data from sysfs symbolic link.
*
* @param[out] buf
@@ -459,6 +498,13 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
iface->if_name, iface->if_index);
return 0;
}
+ /* Routed NetVSC should not be probed. */
+ if (vdev_netvsc_has_route(iface->if_name)) {
+ DRV_LOG(WARNING, "NetVSC interface \"%s\" (index %u) is routed",
+ iface->if_name, iface->if_index);
+ if (!specified)
+ return 0;
+ }
/* Create interface context. */
ctx = calloc(1, sizeof(*ctx));
if (!ctx) {
--
1.8.3.1
Matan Azrad
2018-01-18 08:43:46 UTC
Using DPDK in Hyper-V VM systems requires the vdev_netvsc driver to pair
each NetVSC netdevice with the PCI device sharing its MAC address through
the fail-safe PMD.

Add a vdev_netvsc custom scan to the vdev bus to allow automatic probing in
Hyper-V VM systems unless the driver was already specified on the command
line.

Add an "ignore" parameter to disable this auto-detection.
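
As a rough sketch of how the integer parameters are interpreted, the loop
below mirrors the classification done in the probe function: "force" and
"ignore" are reduced to 0 or 1 with !!atoi(), while "iface" and "mac" only
bump a counter. It is written without DPDK headers, so struct kv_pair is a
hypothetical stand-in for struct rte_kvargs_pair, and struct netvsc_opts and
classify_args() are names invented for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct rte_kvargs_pair. */
struct kv_pair {
	const char *key;
	const char *value;
};

/* Hypothetical container for the values the probe function extracts. */
struct netvsc_opts {
	int force;
	int ignore;
	unsigned int specified;
};

static struct netvsc_opts
classify_args(const struct kv_pair *pairs, size_t count)
{
	struct netvsc_opts opts = { 0, 0, 0 };
	size_t i;

	assert(pairs != NULL || count == 0);
	for (i = 0; i != count; ++i) {
		if (!strcmp(pairs[i].key, "force"))
			opts.force = !!atoi(pairs[i].value);
		else if (!strcmp(pairs[i].key, "ignore"))
			opts.ignore = !!atoi(pairs[i].value);
		else if (!strcmp(pairs[i].key, "iface") ||
			 !strcmp(pairs[i].key, "mac"))
			++opts.specified;
	}
	return opts;
}
```

Since atoi() yields 0 for non-numeric text, a value such as "yes" leaves the
option disabled, and any nonzero number enables it.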

Signed-off-by: Matan Azrad <***@mellanox.com>
---
doc/guides/nics/vdev_netvsc.rst | 9 ++++--
drivers/net/vdev_netvsc/vdev_netvsc.c | 55 +++++++++++++++++++++++++++++++++--
2 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
index 3c26990..55d130a 100644
--- a/doc/guides/nics/vdev_netvsc.rst
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -71,8 +71,8 @@ Build options
Run-time parameters
-------------------

-To invoke this driver, applications have to explicitly provide the
-``--vdev=net_vdev_netvsc`` EAL option.
+This driver is invoked automatically in Hyper-V VM systems unless the user
+already invoked it on the command line with the ``--vdev=net_vdev_netvsc`` EAL option.

The following device parameters are supported:

@@ -91,5 +91,10 @@ The following device parameters are supported:
If nonzero, forces the use of specified interfaces even if not detected as
NetVSC or detected as routed NETVSC.

+- ``ignore`` [int]
+
+ If nonzero, the driver is not run (in practice used to disable
+ auto-detection in Hyper-V VMs).
+
Not specifying either ``iface`` or ``mac`` makes this driver attach itself to
all unrouted NetVSC interfaces found on the system.
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
index 2d03033..a8a1a7f 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -30,13 +30,16 @@
#include <rte_errno.h>
#include <rte_ethdev.h>
#include <rte_ether.h>
+#include <rte_hypervisor.h>
#include <rte_kvargs.h>
#include <rte_log.h>

#define VDEV_NETVSC_DRIVER net_vdev_netvsc
+#define VDEV_NETVSC_DRIVER_NAME RTE_STR(VDEV_NETVSC_DRIVER)
#define VDEV_NETVSC_ARG_IFACE "iface"
#define VDEV_NETVSC_ARG_MAC "mac"
#define VDEV_NETVSC_ARG_FORCE "force"
+#define VDEV_NETVSC_ARG_IGNORE "ignore"
#define VDEV_NETVSC_PROBE_MS 1000

#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"
@@ -45,7 +48,7 @@
#define DRV_LOG(level, ...) \
rte_log(RTE_LOG_ ## level, \
vdev_netvsc_logtype, \
- RTE_FMT(RTE_STR(VDEV_NETVSC_DRIVER) ": " \
+ RTE_FMT(VDEV_NETVSC_DRIVER_NAME ": " \
RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
RTE_FMT_TAIL(__VA_ARGS__,)))

@@ -601,6 +604,7 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
VDEV_NETVSC_ARG_IFACE,
VDEV_NETVSC_ARG_MAC,
VDEV_NETVSC_ARG_FORCE,
+ VDEV_NETVSC_ARG_IGNORE,
NULL,
};
const char *name = rte_vdev_device_name(dev);
@@ -610,6 +614,7 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
unsigned int specified = 0;
unsigned int matched = 0;
int force = 0;
+ int ignore = 0;
unsigned int i;
int ret;

@@ -623,10 +628,17 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =

if (!strcmp(pair->key, VDEV_NETVSC_ARG_FORCE))
force = !!atoi(pair->value);
+ else if (!strcmp(pair->key, VDEV_NETVSC_ARG_IGNORE))
+ ignore = !!atoi(pair->value);
else if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE) ||
!strcmp(pair->key, VDEV_NETVSC_ARG_MAC))
++specified;
}
+ if (ignore) {
+ if (kvargs)
+ rte_kvargs_free(kvargs);
+ return 0;
+ }
rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
/* Gather interfaces. */
ret = vdev_netvsc_foreach_iface(vdev_netvsc_netvsc_probe, name, kvargs,
@@ -690,7 +702,8 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
RTE_PMD_REGISTER_PARAM_STRING(net_vdev_netvsc,
VDEV_NETVSC_ARG_IFACE "=<string> "
VDEV_NETVSC_ARG_MAC "=<string> "
- VDEV_NETVSC_ARG_FORCE "=<int>");
+ VDEV_NETVSC_ARG_FORCE "=<int> "
+ VDEV_NETVSC_ARG_IGNORE "=<int>");

/** Initialize driver log type. */
RTE_INIT(vdev_netvsc_init_log)
@@ -699,3 +712,41 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
if (vdev_netvsc_logtype >= 0)
rte_log_set_level(vdev_netvsc_logtype, RTE_LOG_NOTICE);
}
+
+/** Compare function for vdev find device operation. */
+static int
+vdev_netvsc_cmp_rte_device(const struct rte_device *dev1,
+ __rte_unused const void *_dev2)
+{
+ return strcmp(dev1->devargs->name, VDEV_NETVSC_DRIVER_NAME);
+}
+
+/**
+ * Callback invoked by the vdev bus scan function to ensure this driver is
+ * probed automatically in Hyper-V VM systems unless it already exists in the
+ * devargs list.
+ */
+static void
+vdev_netvsc_scan_callback(__rte_unused void *arg)
+{
+ struct rte_vdev_device *dev;
+ struct rte_devargs *devargs;
+ struct rte_bus *vbus = rte_bus_find_by_name("vdev");
+
+ TAILQ_FOREACH(devargs, &devargs_list, next)
+ if (!strcmp(devargs->name, VDEV_NETVSC_DRIVER_NAME))
+ return;
+ dev = (struct rte_vdev_device *)vbus->find_device(NULL,
+ vdev_netvsc_cmp_rte_device, VDEV_NETVSC_DRIVER_NAME);
+ if (dev)
+ return;
+ if (rte_eal_devargs_add(RTE_DEVTYPE_VIRTUAL, VDEV_NETVSC_DRIVER_NAME))
+ DRV_LOG(ERR, "unable to add netvsc devargs.");
+}
+
+/** Initialize the custom scan. */
+RTE_INIT(vdev_netvsc_custom_scan_add)
+{
+ if (rte_hypervisor_get() == RTE_HYPERVISOR_HYPERV)
+ rte_vdev_add_custom_scan(vdev_netvsc_scan_callback, NULL);
+}
--
1.8.3.1
Matan Azrad
2018-01-18 08:43:45 UTC
This parameter allows specifying any non-NetVSC interface or routed
NetVSC interfaces to use with tap sub-devices for development purposes.

Signed-off-by: Adrien Mazarguil <***@6wind.com>
Signed-off-by: Matan Azrad <***@mellanox.com>
---
doc/guides/nics/vdev_netvsc.rst | 5 +++++
drivers/net/vdev_netvsc/vdev_netvsc.c | 30 +++++++++++++++++++-----------
2 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
index f779862..3c26990 100644
--- a/doc/guides/nics/vdev_netvsc.rst
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -86,5 +86,10 @@ The following device parameters are supported:
Same as ``iface`` except a suitable NetVSC interface is located using its
MAC address.

+- ``force`` [int]
+
+ If nonzero, forces the use of specified interfaces even if not detected as
+ NetVSC or detected as routed NETVSC.
+
Not specifying either ``iface`` or ``mac`` makes this driver attach itself to
all unrouted NetVSC interfaces found on the system.
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
index 0055d0b..2d03033 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -36,6 +36,7 @@
#define VDEV_NETVSC_DRIVER net_vdev_netvsc
#define VDEV_NETVSC_ARG_IFACE "iface"
#define VDEV_NETVSC_ARG_MAC "mac"
+#define VDEV_NETVSC_ARG_FORCE "force"
#define VDEV_NETVSC_PROBE_MS 1000

#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"
@@ -419,6 +420,9 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
* - struct rte_kvargs *kvargs:
* Device arguments provided to current driver instance.
*
+ * - int force:
+ * Accept specified interface even if not detected as NetVSC.
+ *
* - unsigned int specified:
* Number of specific netdevices provided as device arguments.
*
@@ -436,6 +440,7 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
{
const char *name = va_arg(ap, const char *);
struct rte_kvargs *kvargs = va_arg(ap, struct rte_kvargs *);
+ int force = va_arg(ap, int);
unsigned int specified = va_arg(ap, unsigned int);
unsigned int *matched = va_arg(ap, unsigned int *);
unsigned int i;
@@ -490,20 +495,18 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
return 0;
}
if (!vdev_netvsc_iface_is_netvsc(iface)) {
- if (!specified)
+ if (!specified || !force)
return 0;
DRV_LOG(WARNING,
- "interface \"%s\" (index %u) is not NetVSC,"
- " skipping",
+ "using non-NetVSC interface \"%s\" (index %u)",
iface->if_name, iface->if_index);
- return 0;
}
/* Routed NetVSC should not be probed. */
if (vdev_netvsc_has_route(iface->if_name)) {
- DRV_LOG(WARNING, "NetVSC interface \"%s\" (index %u) is routed",
- iface->if_name, iface->if_index);
- if (!specified)
+ if (!specified || !force)
return 0;
+ DRV_LOG(WARNING, "using routed NetVSC interface \"%s\""
+ " (index %u)", iface->if_name, iface->if_index);
}
/* Create interface context. */
ctx = calloc(1, sizeof(*ctx));
@@ -597,6 +600,7 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
static const char *const vdev_netvsc_arg[] = {
VDEV_NETVSC_ARG_IFACE,
VDEV_NETVSC_ARG_MAC,
+ VDEV_NETVSC_ARG_FORCE,
NULL,
};
const char *name = rte_vdev_device_name(dev);
@@ -605,6 +609,7 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
vdev_netvsc_arg);
unsigned int specified = 0;
unsigned int matched = 0;
+ int force = 0;
unsigned int i;
int ret;

@@ -616,14 +621,16 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
for (i = 0; i != kvargs->count; ++i) {
const struct rte_kvargs_pair *pair = &kvargs->pairs[i];

- if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE) ||
- !strcmp(pair->key, VDEV_NETVSC_ARG_MAC))
+ if (!strcmp(pair->key, VDEV_NETVSC_ARG_FORCE))
+ force = !!atoi(pair->value);
+ else if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE) ||
+ !strcmp(pair->key, VDEV_NETVSC_ARG_MAC))
++specified;
}
rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
/* Gather interfaces. */
ret = vdev_netvsc_foreach_iface(vdev_netvsc_netvsc_probe, name, kvargs,
- specified, &matched);
+ force, specified, &matched);
if (ret < 0)
goto error;
if (matched < specified)
@@ -682,7 +689,8 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
RTE_PMD_REGISTER_ALIAS(VDEV_NETVSC_DRIVER, eth_vdev_netvsc);
RTE_PMD_REGISTER_PARAM_STRING(net_vdev_netvsc,
VDEV_NETVSC_ARG_IFACE "=<string> "
- VDEV_NETVSC_ARG_MAC "=<string>");
+ VDEV_NETVSC_ARG_MAC "=<string> "
+ VDEV_NETVSC_ARG_FORCE "=<int>");

/** Initialize driver log type. */
RTE_INIT(vdev_netvsc_init_log)
--
1.8.3.1
Matan Azrad
2018-01-18 10:01:42 UTC
From: Adrien Mazarguil <***@6wind.com>

rte_free() is not supposed to work with pointers returned by calloc().

Fixes: a0194d828100 ("net/failsafe: add flexible device definition")
Cc: ***@dpdk.org
Cc: Gaetan Rivet <***@6wind.com>

Signed-off-by: Adrien Mazarguil <***@6wind.com>
Acked-by: Gaetan Rivet <***@6wind.com>
---
drivers/net/failsafe/failsafe_args.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/failsafe/failsafe_args.c b/drivers/net/failsafe/failsafe_args.c
index cfc83e3..ec63ac9 100644
--- a/drivers/net/failsafe/failsafe_args.c
+++ b/drivers/net/failsafe/failsafe_args.c
@@ -407,7 +407,7 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
uint8_t i;

FOREACH_SUBDEV(sdev, i, dev) {
- rte_free(sdev->cmdline);
+ free(sdev->cmdline);
sdev->cmdline = NULL;
free(sdev->devargs.args);
sdev->devargs.args = NULL;
--
1.8.3.1
Matan Azrad
2018-01-18 10:01:43 UTC
This parameter enables applications to provide device definitions through
an arbitrary file descriptor number.
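
To make the intended semantics concrete, here is a minimal stand-alone sketch
of "read everything currently available from a non-blocking descriptor and
keep only the last line". It is not the fail-safe implementation:
read_last_line(), its return convention and the 256-byte line limit are
assumptions chosen for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Copy the last complete chunk readable from fd into out (newline
 * stripped). Returns 0 on success, -1 when nothing could be read.
 * The caller's descriptor is left open; only a duplicate is closed.
 */
static int
read_last_line(int fd, char *out, size_t size)
{
	char line[256];
	FILE *fp;
	int flags;
	int count = 0;

	assert(out != NULL && size > 0);
	fd = dup(fd);
	if (fd == -1)
		return -1;
	/* O_NONBLOCK is shared with the original descriptor. */
	flags = fcntl(fd, F_GETFL);
	if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
		close(fd);
		return -1;
	}
	fp = fdopen(fd, "r");
	if (fp == NULL) {
		close(fd);
		return -1;
	}
	/* Each iteration overwrites out, so only the last line survives. */
	while (fgets(line, sizeof(line), fp) != NULL) {
		line[strcspn(line, "\n")] = '\0';
		snprintf(out, size, "%s", line);
		++count;
	}
	fclose(fp);
	return count ? 0 : -1;
}
```

Because the original descriptor stays open, a writer can keep appending lines
and each later call only sees what was written since the previous one.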

Signed-off-by: Adrien Mazarguil <***@6wind.com>
Signed-off-by: Matan Azrad <***@mellanox.com>
Acked-by: Gaetan Rivet <***@6wind.com>
---
doc/guides/nics/fail_safe.rst | 9 ++++
drivers/net/failsafe/failsafe_args.c | 80 ++++++++++++++++++++++++++++++++-
drivers/net/failsafe/failsafe_private.h | 3 ++
3 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/doc/guides/nics/fail_safe.rst b/doc/guides/nics/fail_safe.rst
index c4e3d2e..5b1b47e 100644
--- a/doc/guides/nics/fail_safe.rst
+++ b/doc/guides/nics/fail_safe.rst
@@ -106,6 +106,15 @@ Fail-safe command line parameters
All commas within the ``shell command`` are replaced by spaces before
executing the command. This helps using scripts to specify devices.

+- **fd(<file descriptor number>)** parameter
+
+ This parameter reads a device definition from an arbitrary file descriptor
+ number in ``<iface>`` format as described above.
+
+ The file descriptor is read in non-blocking mode and is never closed in
+ order to take only the last line into account (unlike ``exec()``) at every
+ probe attempt.
+
- **mac** parameter [MAC address]

This parameter allows the user to set a default MAC address to the fail-safe
diff --git a/drivers/net/failsafe/failsafe_args.c b/drivers/net/failsafe/failsafe_args.c
index ec63ac9..c711da4 100644
--- a/drivers/net/failsafe/failsafe_args.c
+++ b/drivers/net/failsafe/failsafe_args.c
@@ -31,7 +31,11 @@
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
#include <string.h>
+#include <unistd.h>
#include <errno.h>

#include <rte_debug.h>
@@ -161,6 +165,67 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
}

static int
+fs_read_fd(struct sub_device *sdev, char *fd_str)
+{
+ FILE *fp = NULL;
+ int fd = -1;
+ /* store possible newline as well */
+ char output[DEVARGS_MAXLEN + 1];
+ int err = -ENODEV;
+ int oflags;
+ int lcount;
+
+ RTE_ASSERT(fd_str != NULL || sdev->fd_str != NULL);
+ if (sdev->fd_str == NULL) {
+ sdev->fd_str = strdup(fd_str);
+ if (sdev->fd_str == NULL) {
+ ERROR("Command line allocation failed");
+ return -ENOMEM;
+ }
+ }
+ errno = 0;
+ fd = strtol(fd_str, &fd_str, 0);
+ if (errno || *fd_str || fd < 0) {
+ ERROR("Parsing FD number failed");
+ goto error;
+ }
+ /* Fiddle with copy of file descriptor */
+ fd = dup(fd);
+ if (fd == -1)
+ goto error;
+ oflags = fcntl(fd, F_GETFL);
+ if (oflags == -1)
+ goto error;
+ if (fcntl(fd, F_SETFL, oflags | O_NONBLOCK) == -1)
+ goto error;
+ fp = fdopen(fd, "r");
+ if (fp == NULL)
+ goto error;
+ fd = -1;
+ /* Only take the last line into account */
+ lcount = 0;
+ while (fgets(output, sizeof(output), fp))
+ ++lcount;
+ if (lcount == 0)
+ goto error;
+ else if (ferror(fp) && errno != EAGAIN)
+ goto error;
+ /* Line must end with a newline character */
+ fs_sanitize_cmdline(output);
+ if (output[0] == '\0')
+ goto error;
+ err = fs_parse_device(sdev, output);
+ if (err)
+ ERROR("Parsing device '%s' failed", output);
+error:
+ if (fp)
+ fclose(fp);
+ if (fd != -1)
+ close(fd);
+ return err;
+}
+
+static int
fs_parse_device_param(struct rte_eth_dev *dev, const char *param,
uint8_t head)
{
@@ -202,6 +267,14 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
}
if (ret)
goto free_args;
+ } else if (strncmp(param, "fd(", 3) == 0) {
+ ret = fs_read_fd(sdev, args);
+ if (ret == -ENODEV) {
+ DEBUG("Reading device info from FD failed");
+ ret = 0;
+ }
+ if (ret)
+ goto free_args;
} else {
ERROR("Unrecognized device type: %.*s", (int)b, param);
return -EINVAL;
@@ -409,6 +482,8 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
FOREACH_SUBDEV(sdev, i, dev) {
free(sdev->cmdline);
sdev->cmdline = NULL;
+ free(sdev->fd_str);
+ sdev->fd_str = NULL;
free(sdev->devargs.args);
sdev->devargs.args = NULL;
}
@@ -424,7 +499,8 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
param[b] != '\0')
b++;
if (strncmp(param, "dev", b) != 0 &&
- strncmp(param, "exec", b) != 0) {
+ strncmp(param, "exec", b) != 0 &&
+ strncmp(param, "fd(", b) != 0) {
ERROR("Unrecognized device type: %.*s", (int)b, param);
return -EINVAL;
}
@@ -463,6 +539,8 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
continue;
if (sdev->cmdline)
ret = fs_execute_cmd(sdev, sdev->cmdline);
+ else if (sdev->fd_str)
+ ret = fs_read_fd(sdev, sdev->fd_str);
else
ret = fs_parse_sub_device(sdev);
if (ret == 0)
diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
index 54b5b91..5e04ffe 100644
--- a/drivers/net/failsafe/failsafe_private.h
+++ b/drivers/net/failsafe/failsafe_private.h
@@ -48,6 +48,7 @@
#define PMD_FAILSAFE_PARAM_STRING \
"dev(<ifc>)," \
"exec(<shell command>)," \
+ "fd(<fd number>)," \
"mac=mac_addr," \
"hotplug_poll=u64" \
""
@@ -112,6 +113,8 @@ struct sub_device {
struct fs_stats stats_snapshot;
/* Some device are defined as a command line */
char *cmdline;
+ /* Others are retrieved through a file descriptor */
+ char *fd_str;
/* fail-safe device backreference */
struct rte_eth_dev *fs_dev;
/* flag calling for recollection */
--
1.8.3.1
Matan Azrad
2018-01-18 10:01:44 UTC
Previous fail-safe code didn't support probed sub-devices capture and
failed when it tried to probe them.

Skip fail-safe sub-device probing when it already was probed.

Signed-off-by: Matan Azrad <***@mellanox.com>
Cc: Gaetan Rivet <***@6wind.com>
---
doc/guides/nics/fail_safe.rst | 17 +++++++
drivers/net/failsafe/failsafe_args.c | 2 -
drivers/net/failsafe/failsafe_eal.c | 78 ++++++++++++++++++++++++---------
drivers/net/failsafe/failsafe_private.h | 2 +
4 files changed, 77 insertions(+), 22 deletions(-)

diff --git a/doc/guides/nics/fail_safe.rst b/doc/guides/nics/fail_safe.rst
index 5b1b47e..3f72b59 100644
--- a/doc/guides/nics/fail_safe.rst
+++ b/doc/guides/nics/fail_safe.rst
@@ -93,6 +93,14 @@ Fail-safe command line parameters
additional sub-device parameters if need be. They will be passed on to the
sub-device.

+.. note::
+
+ In case of whitelist sub-device probed by EAL, fail-safe PMD will take the device
+ as is, which means that EAL device options are taken in this case.
+ When trying to use a PCI device automatically probed in blacklist mode,
+ the syntax for the fail-safe must be with the full PCI id:
+ Domain:Bus:Device.Function. See the usage example section.
+
- **exec(<shell command>)** parameter

This parameter allows the user to provide a command to the fail-safe PMD to
@@ -169,6 +177,15 @@ This section shows some example of using **testpmd** with a fail-safe PMD.
$RTE_TARGET/build/app/testpmd -c 0xff -n 4 --no-pci \
--vdev='net_failsafe0,exec(echo 84:00.0)' -- -i

+#. Start testpmd, automatically probing the device 84:00.0 and using it with
+ the fail-safe.
+
+ .. code-block:: console
+
+ $RTE_TARGET/build/app/testpmd -c 0xff -n 4 \
+ --vdev 'net_failsafe0,dev(0000:84:00.0),dev(net_ring0)' -- -i
+
+
Using the Fail-safe PMD from an application
-------------------------------------------

diff --git a/drivers/net/failsafe/failsafe_args.c b/drivers/net/failsafe/failsafe_args.c
index c711da4..583bf05 100644
--- a/drivers/net/failsafe/failsafe_args.c
+++ b/drivers/net/failsafe/failsafe_args.c
@@ -45,8 +45,6 @@

#include "failsafe_private.h"

-#define DEVARGS_MAXLEN 4096
-
/* Callback used when a new device is found in devargs */
typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
uint8_t head);
diff --git a/drivers/net/failsafe/failsafe_eal.c b/drivers/net/failsafe/failsafe_eal.c
index 19d26f5..33a5adf 100644
--- a/drivers/net/failsafe/failsafe_eal.c
+++ b/drivers/net/failsafe/failsafe_eal.c
@@ -36,39 +36,77 @@
#include "failsafe_private.h"

static int
+fs_ethdev_portid_get(const char *name, uint16_t *port_id)
+{
+ uint16_t pid;
+ size_t len;
+
+ if (name == NULL) {
+ DEBUG("Null pointer is specified\n");
+ return -EINVAL;
+ }
+ len = strlen(name);
+ RTE_ETH_FOREACH_DEV(pid) {
+ if (!strncmp(name, rte_eth_devices[pid].device->name, len)) {
+ *port_id = pid;
+ return 0;
+ }
+ }
+ return -ENODEV;
+}
+
+static int
fs_bus_init(struct rte_eth_dev *dev)
{
struct sub_device *sdev;
struct rte_devargs *da;
uint8_t i;
- uint16_t j;
+ uint16_t pid;
int ret;

FOREACH_SUBDEV(sdev, i, dev) {
if (sdev->state != DEV_PARSED)
continue;
da = &sdev->devargs;
- ret = rte_eal_hotplug_add(da->bus->name,
- da->name,
- da->args);
- if (ret) {
- ERROR("sub_device %d probe failed %s%s%s", i,
- rte_errno ? "(" : "",
- rte_errno ? strerror(rte_errno) : "",
- rte_errno ? ")" : "");
- continue;
- }
- RTE_ETH_FOREACH_DEV(j) {
- if (strcmp(rte_eth_devices[j].device->name,
- da->name) == 0) {
- ETH(sdev) = &rte_eth_devices[j];
- break;
+ if (fs_ethdev_portid_get(da->name, &pid) != 0) {
+ ret = rte_eal_hotplug_add(da->bus->name,
+ da->name,
+ da->args);
+ if (ret) {
+ ERROR("sub_device %d probe failed %s%s%s", i,
+ rte_errno ? "(" : "",
+ rte_errno ? strerror(rte_errno) : "",
+ rte_errno ? ")" : "");
+ continue;
}
+ if (fs_ethdev_portid_get(da->name, &pid) != 0) {
+ ERROR("sub_device %d init went wrong", i);
+ return -ENODEV;
+ }
+ } else {
+ char devstr[DEVARGS_MAXLEN] = "";
+ struct rte_devargs *probed_da =
+ rte_eth_devices[pid].device->devargs;
+
+ /* Take control of device probed by EAL options. */
+ free(da->args);
+ memset(da, 0, sizeof(*da));
+ if (probed_da != NULL)
+ snprintf(devstr, sizeof(devstr), "%s,%s",
+ probed_da->name, probed_da->args);
+ else
+ snprintf(devstr, sizeof(devstr), "%s",
+ rte_eth_devices[pid].device->name);
+ ret = rte_eal_devargs_parse(devstr, da);
+ if (ret) {
+ ERROR("Probed devargs parsing failed with code"
+ " %d", ret);
+ return ret;
+ }
+ INFO("Taking control of a probed sub device"
+ " %d named %s", i, da->name);
}
- if (ETH(sdev) == NULL) {
- ERROR("sub_device %d init went wrong", i);
- return -ENODEV;
- }
+ ETH(sdev) = &rte_eth_devices[pid];
SUB_ID(sdev) = i;
sdev->fs_dev = dev;
sdev->dev = ETH(sdev)->device;
diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
index 5e04ffe..9fcf72e 100644
--- a/drivers/net/failsafe/failsafe_private.h
+++ b/drivers/net/failsafe/failsafe_private.h
@@ -58,6 +58,8 @@
#define FAILSAFE_MAX_ETHPORTS 2
#define FAILSAFE_MAX_ETHADDR 128

+#define DEVARGS_MAXLEN 4096
+
/* TYPES */

struct rxq {
--
1.8.3.1
Gaëtan Rivet
2018-01-18 10:08:26 UTC
Post by Matan Azrad
Previous fail-safe code didn't support probed sub-devices capture and
failed when it tried to probe them.
Skip fail-safe sub-device probing when it already was probed.

Okay, ignoring the recursive probing. It could be dangerous, with the
ownership evolutions and unforeseen side-effects, but device matching
will be reworked next release, so this new functionality will be fixed
anyway at this point.

Acked-by: Gaetan Rivet <***@6wind.com>
--
Gaëtan Rivet
6WIND
Matan Azrad
2018-01-18 10:01:46 UTC
As described in more detail in the attached documentation (see patch
contents), this virtual device driver manages NetVSC interfaces in virtual
machines hosted by Hyper-V/Azure platforms.

This driver does not manage traffic nor Ethernet devices directly; it acts
as a thin configuration layer that automatically instantiates and controls
fail-safe PMD instances combining tap and PCI sub-devices, so that each
NetVSC interface is exposed as a single consolidated port to DPDK
applications.

PCI sub-devices being hot-pluggable (e.g. during VM migration),
applications automatically benefit from increased throughput when present
and automatic fallback on NetVSC otherwise without interruption thanks to
fail-safe's hot-plug handling.

Once initialized, the sole job of the vdev_netvsc driver is to regularly
scan for PCI devices to associate with NetVSC interfaces and feed their
addresses to corresponding fail-safe instances.
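
The address lookup behind that scan can be pictured with a small standalone
sketch (an illustration only, not the code from this patch): once the target
of the /sys/class/net/<iface>/device symbolic link has been read, the PCI
address is its last path component. The helper name pci_addr_from_link and
the sample link targets are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the last path component of a sysfs "device"
 * link target (e.g. "../../../0000:00:02.0") into out.
 * Returns 0 on success, -1 if the component is empty or does not fit.
 */
static int
pci_addr_from_link(const char *target, char *out, size_t size)
{
	const char *addr;
	size_t len;

	assert(target != NULL && out != NULL);
	/* Keep only what follows the last '/', or the whole string. */
	addr = strrchr(target, '/');
	addr = addr ? addr + 1 : target;
	len = strlen(addr);
	if (len == 0 || len >= size)
		return -1;
	memcpy(out, addr, len + 1); /* Includes the terminating NUL. */
	return 0;
}
```

A caller would pass the buffer filled by readlink() and forward the resulting
string to the fail-safe instance; an empty or oversized component is simply
skipped until the next scan.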

Signed-off-by: Adrien Mazarguil <***@6wind.com>
Signed-off-by: Matan Azrad <***@mellanox.com>
---
doc/guides/nics/vdev_netvsc.rst | 70 +++++
drivers/net/vdev_netvsc/Makefile | 4 +
drivers/net/vdev_netvsc/vdev_netvsc.c | 550 +++++++++++++++++++++++++++++++++-
3 files changed, 623 insertions(+), 1 deletion(-)

diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
index a952908..fde1fb8 100644
--- a/doc/guides/nics/vdev_netvsc.rst
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -12,9 +12,79 @@ platforms.

.. _Hyper-V: https://docs.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-hyper-v

+Implementation details
+----------------------
+
+Each instance of this driver effectively needs to drive two devices: the
+NetVSC interface proper and its SR-IOV VF (referred to as "physical" from
+this point on) counterpart sharing the same MAC address.
+
+Physical devices are part of the host system and cannot be maintained during
+VM migration. From a VM standpoint they appear as hot-plug devices that come
+and go without prior notice.
+
+When the physical device is present, egress and most of the ingress traffic
+flows through it; only multicasts and other hypervisor control still flow
+through NetVSC. Otherwise, NetVSC acts as a fallback for all traffic.
+
+To avoid unnecessary code duplication and ensure maximum performance,
+handling of physical devices is left to their original PMDs; this virtual
+device driver (also known as *vdev*) manages other PMDs as summarized by the
+following block diagram::
+
+ .------------------.
+ | DPDK application |
+ `--------+---------'
+ |
+ .------+------.
+ | DPDK ethdev |
+ `------+------' Control
+ | |
+ .------------+------------. v .--------------------.
+ | failsafe PMD +---------+ vdev_netvsc driver |
+ `--+-------------------+--' `--------------------'
+ | |
+ | .........|.........
+ | : | :
+ .----+----. : .----+----. :
+ | tap PMD | : | any PMD | :
+ `----+----' : `----+----' : <-- Hot-pluggable
+ | : | :
+ .------+-------. : .-----+-----. :
+ | NetVSC-based | : | SR-IOV VF | :
+ | netdevice | : | device | :
+ `--------------' : `-----------' :
+ :.................:
+
+
+This driver implementation may be temporary and should be improved or removed
+either when hot-plug is fully supported in EAL and bus drivers or when
+a new NetVSC driver is integrated.
+
Build options
-------------

- ``CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD`` (default ``y``)

Toggle compilation of this driver.
+
+Run-time parameters
+-------------------
+
+To invoke this driver, applications have to explicitly provide the
+``--vdev=net_vdev_netvsc`` EAL option.
+
+The following device parameters are supported:
+
+- ``iface`` [string]
+
+ Provide a specific NetVSC interface (netdevice) name to attach this driver
+ to. Can be provided multiple times for additional instances.
+
+- ``mac`` [string]
+
+ Same as ``iface`` except a suitable NetVSC interface is located using its
+ MAC address.
+
+Not specifying either ``iface`` or ``mac`` makes this driver attach itself to
+all NetVSC interfaces found on the system.
diff --git a/drivers/net/vdev_netvsc/Makefile b/drivers/net/vdev_netvsc/Makefile
index 2fb059d..f2b2ac5 100644
--- a/drivers/net/vdev_netvsc/Makefile
+++ b/drivers/net/vdev_netvsc/Makefile
@@ -13,6 +13,9 @@ EXPORT_MAP := rte_pmd_vdev_netvsc_version.map
CFLAGS += -O3
CFLAGS += -g
CFLAGS += -std=c11 -pedantic -Wall -Wextra
+CFLAGS += -D_XOPEN_SOURCE=600
+CFLAGS += -D_BSD_SOURCE
+CFLAGS += -D_DEFAULT_SOURCE
CFLAGS += $(WERROR_FLAGS)

# Dependencies.
@@ -20,6 +23,7 @@ LDLIBS += -lrte_bus_vdev
LDLIBS += -lrte_eal
LDLIBS += -lrte_ethdev
LDLIBS += -lrte_kvargs
+LDLIBS += -lrte_net

# Source files.
SRCS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += vdev_netvsc.c
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
index e895b32..21c3265 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -3,17 +3,42 @@
* Copyright 2017 Mellanox Technologies, Ltd.
*/

+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <linux/sockios.h>
+#include <net/if.h>
+#include <net/if_arp.h>
+#include <netinet/ip.h>
+#include <stdarg.h>
#include <stddef.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/queue.h>
+#include <sys/socket.h>
+#include <unistd.h>

+#include <rte_alarm.h>
+#include <rte_bus.h>
#include <rte_bus_vdev.h>
#include <rte_common.h>
#include <rte_config.h>
+#include <rte_dev.h>
+#include <rte_errno.h>
+#include <rte_ethdev.h>
+#include <rte_ether.h>
#include <rte_kvargs.h>
#include <rte_log.h>

#define VDEV_NETVSC_DRIVER net_vdev_netvsc
#define VDEV_NETVSC_ARG_IFACE "iface"
#define VDEV_NETVSC_ARG_MAC "mac"
+#define VDEV_NETVSC_PROBE_MS 1000
+
+#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"

#define DRV_LOG(level, ...) \
rte_log(RTE_LOG_ ## level, \
@@ -25,12 +50,495 @@
/** Driver-specific log messages type. */
static int vdev_netvsc_logtype;

+/** Context structure for a vdev_netvsc instance. */
+struct vdev_netvsc_ctx {
+ LIST_ENTRY(vdev_netvsc_ctx) entry; /**< Next entry in list. */
+ unsigned int id; /**< Unique ID. */
+ char name[64]; /**< Unique name. */
+ char devname[64]; /**< Fail-safe instance name. */
+ char devargs[256]; /**< Fail-safe device arguments. */
+ char if_name[IF_NAMESIZE]; /**< NetVSC netdevice name. */
+ unsigned int if_index; /**< NetVSC netdevice index. */
+ struct ether_addr if_addr; /**< NetVSC MAC address. */
+ int pipe[2]; /**< Fail-safe communication pipe. */
+ char yield[256]; /**< PCI sub-device arguments. */
+};
+
+/** Context list is common to all driver instances. */
+static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
+ LIST_HEAD_INITIALIZER(vdev_netvsc_ctx_list);
+
+/** Number of entries in context list. */
+static unsigned int vdev_netvsc_ctx_count;
+
/** Number of driver instances relying on context list. */
static unsigned int vdev_netvsc_ctx_inst;

/**
+ * Destroy a vdev_netvsc context instance.
+ *
+ * @param ctx
+ * Context to destroy.
+ */
+static void
+vdev_netvsc_ctx_destroy(struct vdev_netvsc_ctx *ctx)
+{
+ if (ctx->pipe[0] != -1)
+ close(ctx->pipe[0]);
+ if (ctx->pipe[1] != -1)
+ close(ctx->pipe[1]);
+ free(ctx);
+}
+
+/**
+ * Iterate over system network interfaces.
+ *
+ * This function runs a given callback function for each netdevice found on
+ * the system.
+ *
+ * @param func
+ * Callback function pointer. List traversal is aborted when this function
+ * returns a nonzero value.
+ * @param ...
+ * Variable parameter list passed as @p va_list to @p func.
+ *
+ * @return
+ * 0 when the entire list is traversed successfully, a negative error code
+ * in case of failure, or the nonzero value returned by @p func when list
+ * traversal is aborted.
+ */
+static int
+vdev_netvsc_foreach_iface(int (*func)(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap), ...)
+{
+ struct if_nameindex *iface = if_nameindex();
+ int s = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+ unsigned int i;
+ int ret = 0;
+
+ if (!iface) {
+ ret = -ENOBUFS;
+ DRV_LOG(ERR, "cannot retrieve system network interfaces");
+ goto error;
+ }
+ if (s == -1) {
+ ret = -errno;
+ DRV_LOG(ERR, "cannot open socket: %s", rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; iface[i].if_name; ++i) {
+ struct ifreq req;
+ struct ether_addr eth_addr;
+ va_list ap;
+
+ strncpy(req.ifr_name, iface[i].if_name, sizeof(req.ifr_name));
+ if (ioctl(s, SIOCGIFHWADDR, &req) == -1) {
+ DRV_LOG(WARNING, "cannot retrieve information about"
+ " interface \"%s\": %s",
+ req.ifr_name, rte_strerror(errno));
+ continue;
+ }
+ if (req.ifr_hwaddr.sa_family != ARPHRD_ETHER) {
+ DRV_LOG(DEBUG, "interface %s is non-ethernet device",
+ req.ifr_name);
+ continue;
+ }
+ memcpy(eth_addr.addr_bytes, req.ifr_hwaddr.sa_data,
+ RTE_DIM(eth_addr.addr_bytes));
+ va_start(ap, func);
+ ret = func(&iface[i], &eth_addr, ap);
+ va_end(ap);
+ if (ret)
+ break;
+ }
+error:
+ if (s != -1)
+ close(s);
+ if (iface)
+ if_freenameindex(iface);
+ return ret;
+}
+
+/**
+ * Determine if a network interface is NetVSC.
+ *
+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ *
+ * @return
+ * A nonzero value when interface is detected as NetVSC. In case of error,
+ * rte_errno is updated and 0 returned.
+ */
+static int
+vdev_netvsc_iface_is_netvsc(const struct if_nameindex *iface)
+{
+ static const char temp[] = "/sys/class/net/%s/device/class_id";
+ char path[sizeof(temp) + IF_NAMESIZE];
+ FILE *f;
+ int ret;
+ int len = 0;
+
+ ret = snprintf(path, sizeof(path), temp, iface->if_name);
+ if (ret == -1 || (size_t)ret >= sizeof(path)) {
+ rte_errno = ENOBUFS;
+ return 0;
+ }
+ f = fopen(path, "r");
+ if (!f) {
+ rte_errno = errno;
+ return 0;
+ }
+ ret = fscanf(f, NETVSC_CLASS_ID "%n", &len);
+ if (ret == EOF)
+ rte_errno = errno;
+ ret = len == (int)strlen(NETVSC_CLASS_ID);
+ fclose(f);
+ return ret;
+}
+
+/**
+ * Retrieve network interface data from sysfs symbolic link.
+ *
+ * @param[out] buf
+ * Output data buffer.
+ * @param size
+ * Output buffer size.
+ * @param[in] if_name
+ * Netdevice name.
+ * @param[in] relpath
+ * Symbolic link path relative to netdevice sysfs entry.
+ *
+ * @return
+ * 0 on success, a negative error code otherwise.
+ */
+static int
+vdev_netvsc_sysfs_readlink(char *buf, size_t size, const char *if_name,
+ const char *relpath)
+{
+ int ret;
+
+ ret = snprintf(buf, size, "/sys/class/net/%s/%s", if_name, relpath);
+ if (ret == -1 || (size_t)ret >= size)
+ return -ENOBUFS;
+ ret = readlink(buf, buf, size);
+ if (ret == -1)
+ return -errno;
+ if ((size_t)ret >= size - 1)
+ return -ENOBUFS;
+ buf[ret] = '\0';
+ return 0;
+}
+
+/**
+ * Probe a network interface to associate with vdev_netvsc context.
+ *
+ * This function determines if the network device matches the properties of
+ * the NetVSC interface associated with the vdev_netvsc context and
+ * communicates its bus address to the fail-safe PMD instance if so.
+ *
+ * It is normally used with vdev_netvsc_foreach_iface().
+ *
+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ * @param[in] eth_addr
+ * MAC address associated with @p iface.
+ * @param ap
+ * Variable arguments list comprising:
+ *
+ * - struct vdev_netvsc_ctx *ctx:
+ * Context to associate network interface with.
+ *
+ * @return
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+vdev_netvsc_device_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ struct vdev_netvsc_ctx *ctx = va_arg(ap, struct vdev_netvsc_ctx *);
+ char buf[RTE_MAX(sizeof(ctx->yield), 256u)];
+ const char *addr;
+ size_t len;
+ int ret;
+
+ /* Skip non-matching or unwanted NetVSC interfaces. */
+ if (ctx->if_index == iface->if_index) {
+ if (!strcmp(ctx->if_name, iface->if_name))
+ return 0;
+ DRV_LOG(DEBUG,
+ "NetVSC interface \"%s\" (index %u) renamed \"%s\"",
+ ctx->if_name, ctx->if_index, iface->if_name);
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ return 0;
+ }
+ if (vdev_netvsc_iface_is_netvsc(iface))
+ return 0;
+ if (!is_same_ether_addr(eth_addr, &ctx->if_addr))
+ return 0;
+ /* Look for associated PCI device. */
+ ret = vdev_netvsc_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device/subsystem");
+ if (ret)
+ return 0;
+ addr = strrchr(buf, '/');
+ addr = addr ? addr + 1 : buf;
+ if (strcmp(addr, "pci"))
+ return 0;
+ ret = vdev_netvsc_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device");
+ if (ret)
+ return 0;
+ addr = strrchr(buf, '/');
+ addr = addr ? addr + 1 : buf;
+ len = strlen(addr);
+ if (!len)
+ return 0;
+ /* Send PCI device argument to fail-safe PMD instance. */
+ if (strcmp(addr, ctx->yield))
+ DRV_LOG(DEBUG, "associating PCI device \"%s\" with NetVSC"
+ " interface \"%s\" (index %u)", addr, ctx->if_name,
+ ctx->if_index);
+ memmove(buf, addr, len + 1);
+ addr = buf;
+ buf[len] = '\n';
+ ret = write(ctx->pipe[1], addr, len + 1);
+ buf[len] = '\0';
+ if (ret == -1) {
+ if (errno == EINTR || errno == EAGAIN)
+ return 1;
+ DRV_LOG(WARNING, "cannot associate PCI device name \"%s\" with"
+ " interface \"%s\": %s", addr, ctx->if_name,
+ rte_strerror(errno));
+ return 1;
+ }
+ if ((size_t)ret != len + 1) {
+ /*
+ * Attempt to override previous partial write, no need to
+ * recover if that fails.
+ */
+ ret = write(ctx->pipe[1], "\n", 1);
+ (void)ret;
+ return 1;
+ }
+ fsync(ctx->pipe[1]);
+ memcpy(ctx->yield, addr, len + 1);
+ return 1;
+}
+
+/**
+ * Alarm callback that regularly probes system network interfaces.
+ *
+ * This callback runs at a frequency determined by VDEV_NETVSC_PROBE_MS as
+ * long as a vdev_netvsc context instance exists.
+ *
+ * @param arg
+ * Ignored.
+ */
+static void
+vdev_netvsc_alarm(__rte_unused void *arg)
+{
+ struct vdev_netvsc_ctx *ctx;
+ int ret;
+
+ LIST_FOREACH(ctx, &vdev_netvsc_ctx_list, entry) {
+ ret = vdev_netvsc_foreach_iface(vdev_netvsc_device_probe, ctx);
+ if (ret)
+ break;
+ }
+ if (!vdev_netvsc_ctx_count)
+ return;
+ ret = rte_eal_alarm_set(VDEV_NETVSC_PROBE_MS * 1000,
+ vdev_netvsc_alarm, NULL);
+ if (ret < 0) {
+ DRV_LOG(ERR, "unable to reschedule alarm callback: %s",
+ rte_strerror(-ret));
+ }
+}
+
+/**
+ * Probe a NetVSC interface to generate a vdev_netvsc context from.
+ *
+ * This function instantiates vdev_netvsc contexts either for all NetVSC
+ * devices found on the system or only a subset provided as device
+ * arguments.
+ *
+ * It is normally used with vdev_netvsc_foreach_iface().
+ *
+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ * @param[in] eth_addr
+ * MAC address associated with @p iface.
+ * @param ap
+ * Variable arguments list comprising:
+ *
+ * - const char *name:
+ * Name associated with current driver instance.
+ *
+ * - struct rte_kvargs *kvargs:
+ * Device arguments provided to current driver instance.
+ *
+ * - unsigned int specified:
+ * Number of specific netdevices provided as device arguments.
+ *
+ * - unsigned int *matched:
+ * The number of specified netdevices matched by this function.
+ *
+ * @return
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+vdev_netvsc_netvsc_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ const char *name = va_arg(ap, const char *);
+ struct rte_kvargs *kvargs = va_arg(ap, struct rte_kvargs *);
+ unsigned int specified = va_arg(ap, unsigned int);
+ unsigned int *matched = va_arg(ap, unsigned int *);
+ unsigned int i;
+ struct vdev_netvsc_ctx *ctx;
+ int ret;
+
+ /* Probe all interfaces when none are specified. */
+ if (specified) {
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE)) {
+ if (!strcmp(pair->value, iface->if_name))
+ break;
+ } else if (!strcmp(pair->key, VDEV_NETVSC_ARG_MAC)) {
+ struct ether_addr tmp;
+
+ if (sscanf(pair->value,
+ "%" SCNx8 ":%" SCNx8 ":%" SCNx8 ":"
+ "%" SCNx8 ":%" SCNx8 ":%" SCNx8,
+ &tmp.addr_bytes[0],
+ &tmp.addr_bytes[1],
+ &tmp.addr_bytes[2],
+ &tmp.addr_bytes[3],
+ &tmp.addr_bytes[4],
+ &tmp.addr_bytes[5]) != 6) {
+ DRV_LOG(ERR,
+ "invalid MAC address format"
+ " \"%s\"",
+ pair->value);
+ return -EINVAL;
+ }
+ if (is_same_ether_addr(eth_addr, &tmp))
+ break;
+ }
+ }
+ if (i == kvargs->count)
+ return 0;
+ ++(*matched);
+ }
+ /* Weed out interfaces already handled. */
+ LIST_FOREACH(ctx, &vdev_netvsc_ctx_list, entry)
+ if (ctx->if_index == iface->if_index)
+ break;
+ if (ctx) {
+ if (!specified)
+ return 0;
+ DRV_LOG(WARNING,
+ "interface \"%s\" (index %u) is already handled,"
+ " skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ if (!vdev_netvsc_iface_is_netvsc(iface)) {
+ if (!specified)
+ return 0;
+ DRV_LOG(WARNING,
+ "interface \"%s\" (index %u) is not NetVSC,"
+ " skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ /* Create interface context. */
+ ctx = calloc(1, sizeof(*ctx));
+ if (!ctx) {
+ ret = -errno;
+ DRV_LOG(ERR, "cannot allocate context for interface \"%s\": %s",
+ iface->if_name, rte_strerror(errno));
+ goto error;
+ }
+ ctx->id = vdev_netvsc_ctx_count;
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ ctx->if_index = iface->if_index;
+ ctx->if_addr = *eth_addr;
+ ctx->pipe[0] = -1;
+ ctx->pipe[1] = -1;
+ ctx->yield[0] = '\0';
+ if (pipe(ctx->pipe) == -1) {
+ ret = -errno;
+ DRV_LOG(ERR,
+ "cannot allocate control pipe for interface \"%s\": %s",
+ ctx->if_name, rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; i != RTE_DIM(ctx->pipe); ++i) {
+ int flf = fcntl(ctx->pipe[i], F_GETFL);
+
+ if (flf != -1 &&
+ fcntl(ctx->pipe[i], F_SETFL, flf | O_NONBLOCK) != -1)
+ continue;
+ ret = -errno;
+ DRV_LOG(ERR, "cannot toggle non-blocking flag on control file"
+ " descriptor #%u (%d): %s", i, ctx->pipe[i],
+ rte_strerror(errno));
+ goto error;
+ }
+ /* Generate virtual device name and arguments. */
+ i = 0;
+ ret = snprintf(ctx->name, sizeof(ctx->name), "%s_id%u",
+ name, ctx->id);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->name))
+ ++i;
+ ret = snprintf(ctx->devname, sizeof(ctx->devname), "net_failsafe_%s",
+ ctx->name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devname))
+ ++i;
+ ret = snprintf(ctx->devargs, sizeof(ctx->devargs),
+ "fd(%d),dev(net_tap_%s,remote=%s)",
+ ctx->pipe[0], ctx->name, ctx->if_name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devargs))
+ ++i;
+ if (i) {
+ ret = -ENOBUFS;
+ DRV_LOG(ERR, "generated virtual device name or argument list"
+ " too long for interface \"%s\"", ctx->if_name);
+ goto error;
+ }
+ /* Request virtual device generation. */
+ DRV_LOG(DEBUG, "generating virtual device \"%s\" with arguments \"%s\"",
+ ctx->devname, ctx->devargs);
+ vdev_netvsc_foreach_iface(vdev_netvsc_device_probe, ctx);
+ ret = rte_eal_hotplug_add("vdev", ctx->devname, ctx->devargs);
+ if (ret)
+ goto error;
+ LIST_INSERT_HEAD(&vdev_netvsc_ctx_list, ctx, entry);
+ ++vdev_netvsc_ctx_count;
+ DRV_LOG(DEBUG, "added NetVSC interface \"%s\" to context list",
+ ctx->if_name);
+ return 0;
+error:
+ if (ctx)
+ vdev_netvsc_ctx_destroy(ctx);
+ return ret;
+}
+
+/**
* Probe NetVSC interfaces.
*
+ * This function probes system netdevices according to the specified device
+ * arguments and starts a periodic alarm callback to notify the resulting
+ * fail-safe PMD instances of their sub-devices' whereabouts.
+ *
* @param dev
* Virtual device context for driver instance.
*
@@ -49,12 +557,40 @@
const char *args = rte_vdev_device_args(dev);
struct rte_kvargs *kvargs = rte_kvargs_parse(args ? args : "",
vdev_netvsc_arg);
+ unsigned int specified = 0;
+ unsigned int matched = 0;
+ unsigned int i;
+ int ret;

DRV_LOG(DEBUG, "invoked as \"%s\", using arguments \"%s\"", name, args);
if (!kvargs) {
DRV_LOG(ERR, "cannot parse arguments list");
goto error;
}
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE) ||
+ !strcmp(pair->key, VDEV_NETVSC_ARG_MAC))
+ ++specified;
+ }
+ rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
+ /* Gather interfaces. */
+ ret = vdev_netvsc_foreach_iface(vdev_netvsc_netvsc_probe, name, kvargs,
+ specified, &matched);
+ if (ret < 0)
+ goto error;
+ if (matched < specified)
+ DRV_LOG(WARNING,
+ "some of the specified parameters did not match"
+ " recognized network interfaces");
+ ret = rte_eal_alarm_set(VDEV_NETVSC_PROBE_MS * 1000,
+ vdev_netvsc_alarm, NULL);
+ if (ret < 0) {
+ DRV_LOG(ERR, "unable to schedule alarm callback: %s",
+ rte_strerror(-ret));
+ goto error;
+ }
error:
if (kvargs)
rte_kvargs_free(kvargs);
@@ -65,6 +601,9 @@
/**
* Remove driver instance.
*
+ * The alarm callback and underlying vdev_netvsc context instances are only
+ * destroyed after the last PMD instance is removed.
+ *
* @param dev
* Virtual device context for driver instance.
*
@@ -74,7 +613,16 @@
static int
vdev_netvsc_vdev_remove(__rte_unused struct rte_vdev_device *dev)
{
- --vdev_netvsc_ctx_inst;
+ if (--vdev_netvsc_ctx_inst)
+ return 0;
+ rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
+ while (!LIST_EMPTY(&vdev_netvsc_ctx_list)) {
+ struct vdev_netvsc_ctx *ctx = LIST_FIRST(&vdev_netvsc_ctx_list);
+
+ LIST_REMOVE(ctx, entry);
+ --vdev_netvsc_ctx_count;
+ vdev_netvsc_ctx_destroy(ctx);
+ }
return 0;
}
--
1.8.3.1
Matan Azrad
2018-01-18 10:01:47 UTC
Permalink
NetVSC netdevices which are already routed should not be probed because
they are used for management purposes by Hyper-V.

Prevent probing of routed NetVSC devices.
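
As a rough illustration of the check involved (a sketch under assumptions,
not the function added by this patch): each line of /proc/net/route starts
with the interface name followed by a tab, so an interface counts as routed
when the first field of any line equals its name. The helper name
route_line_matches is hypothetical, and unlike a strtok()-based parser it
leaves the line unmodified.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: report whether a single /proc/net/route line
 * belongs to interface "name", i.e. whether its first tab-separated
 * field equals "name". Returns 1 on match, 0 otherwise.
 */
static int
route_line_matches(const char *line, const char *name)
{
	size_t len;

	assert(line != NULL && name != NULL);
	/* Length of the first field, up to a tab or end of line. */
	len = strcspn(line, "\t\n");
	return len == strlen(name) && strncmp(line, name, len) == 0;
}
```

Comparing the full field length avoids treating "eth10" as a match for
"eth1", which a plain prefix comparison would do.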

Signed-off-by: Raslan Darawsheh <***@mellanox.com>
Signed-off-by: Matan Azrad <***@mellanox.com>
---
doc/guides/nics/vdev_netvsc.rst | 2 +-
drivers/net/vdev_netvsc/vdev_netvsc.c | 46 +++++++++++++++++++++++++++++++++++
2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
index fde1fb8..f779862 100644
--- a/doc/guides/nics/vdev_netvsc.rst
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -87,4 +87,4 @@ The following device parameters are supported:
MAC address.

Not specifying either ``iface`` or ``mac`` makes this driver attach itself to
-all NetVSC interfaces found on the system.
+all unrouted NetVSC interfaces found on the system.
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
index 21c3265..0055d0b 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -39,6 +39,7 @@
#define VDEV_NETVSC_PROBE_MS 1000

#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"
+#define NETVSC_MAX_ROUTE_LINE_SIZE 300

#define DRV_LOG(level, ...) \
rte_log(RTE_LOG_ ## level, \
@@ -198,6 +199,44 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
}

/**
+ * Determine if a network interface has a route.
+ *
+ * @param[in] name
+ * Network device name.
+ *
+ * @return
+ * A nonzero value when interface has a route. In case of error,
+ * rte_errno is updated and 0 returned.
+ */
+static int
+vdev_netvsc_has_route(const char *name)
+{
+ FILE *fp;
+ int ret = 0;
+ char route[NETVSC_MAX_ROUTE_LINE_SIZE];
+ char *netdev;
+
+ fp = fopen("/proc/net/route", "r");
+ if (!fp) {
+ rte_errno = errno;
+ return 0;
+ }
+ while (fgets(route, NETVSC_MAX_ROUTE_LINE_SIZE, fp) != NULL) {
+ netdev = strtok(route, "\t");
+ if (strcmp(netdev, name) == 0) {
+ ret = 1;
+ break;
+ }
+ /* Move file pointer to the next line. */
+ while (strchr(route, '\n') == NULL &&
+ fgets(route, NETVSC_MAX_ROUTE_LINE_SIZE, fp) != NULL)
+ ;
+ }
+ fclose(fp);
+ return ret;
+}
+
+/**
* Retrieve network interface data from sysfs symbolic link.
*
* @param[out] buf
@@ -459,6 +498,13 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
iface->if_name, iface->if_index);
return 0;
}
+ /* Routed NetVSC should not be probed. */
+ if (vdev_netvsc_has_route(iface->if_name)) {
+ DRV_LOG(WARNING, "NetVSC interface \"%s\" (index %u) is routed",
+ iface->if_name, iface->if_index);
+ if (!specified)
+ return 0;
+ }
/* Create interface context. */
ctx = calloc(1, sizeof(*ctx));
if (!ctx) {
--
1.8.3.1
Matan Azrad
2018-01-18 10:01:48 UTC
Permalink
This parameter allows specifying any non-NetVSC interface or routed
NetVSC interfaces to use with tap sub-devices for development purposes.
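
The resulting selection rule can be summarized by a small predicate. This is
a sketch of the behaviour as read from the diff below, with a hypothetical
function name (iface_is_accepted), and it only covers interfaces that already
passed the iface/mac filter: unrouted NetVSC interfaces are accepted, anything
else needs both an explicit iface/mac argument and a nonzero force.

```c
#include <stdbool.h>

/*
 * Hypothetical predicate: decide whether an interface gets a context.
 * "specified" is the number of iface/mac arguments given, "force" the
 * normalized value of the force parameter.
 */
static bool
iface_is_accepted(bool is_netvsc, bool is_routed, unsigned int specified,
		  bool force)
{
	/* force only has an effect on explicitly specified interfaces. */
	bool forced = specified != 0 && force;

	if (!is_netvsc && !forced)
		return false;
	if (is_routed && !forced)
		return false;
	return true;
}
```

Note that force alone, without iface or mac, changes nothing: routed or
non-NetVSC interfaces are still skipped when probing all interfaces.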

Signed-off-by: Adrien Mazarguil <***@6wind.com>
Signed-off-by: Matan Azrad <***@mellanox.com>
---
doc/guides/nics/vdev_netvsc.rst | 5 +++++
drivers/net/vdev_netvsc/vdev_netvsc.c | 30 +++++++++++++++++++-----------
2 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
index f779862..3c26990 100644
--- a/doc/guides/nics/vdev_netvsc.rst
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -86,5 +86,10 @@ The following device parameters are supported:
Same as ``iface`` except a suitable NetVSC interface is located using its
MAC address.

+- ``force`` [int]
+
+ If nonzero, forces the use of specified interfaces even if not detected as
+ NetVSC or detected as routed NetVSC.
+
Not specifying either ``iface`` or ``mac`` makes this driver attach itself to
all unrouted NetVSC interfaces found on the system.
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
index 0055d0b..2d03033 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -36,6 +36,7 @@
#define VDEV_NETVSC_DRIVER net_vdev_netvsc
#define VDEV_NETVSC_ARG_IFACE "iface"
#define VDEV_NETVSC_ARG_MAC "mac"
+#define VDEV_NETVSC_ARG_FORCE "force"
#define VDEV_NETVSC_PROBE_MS 1000

#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"
@@ -419,6 +420,9 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
* - struct rte_kvargs *kvargs:
* Device arguments provided to current driver instance.
*
+ * - int force:
+ * Accept specified interface even if not detected as NetVSC.
+ *
* - unsigned int specified:
* Number of specific netdevices provided as device arguments.
*
@@ -436,6 +440,7 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
{
const char *name = va_arg(ap, const char *);
struct rte_kvargs *kvargs = va_arg(ap, struct rte_kvargs *);
+ int force = va_arg(ap, int);
unsigned int specified = va_arg(ap, unsigned int);
unsigned int *matched = va_arg(ap, unsigned int *);
unsigned int i;
@@ -490,20 +495,18 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
return 0;
}
if (!vdev_netvsc_iface_is_netvsc(iface)) {
- if (!specified)
+ if (!specified || !force)
return 0;
DRV_LOG(WARNING,
- "interface \"%s\" (index %u) is not NetVSC,"
- " skipping",
+ "using non-NetVSC interface \"%s\" (index %u)",
iface->if_name, iface->if_index);
- return 0;
}
/* Routed NetVSC should not be probed. */
if (vdev_netvsc_has_route(iface->if_name)) {
- DRV_LOG(WARNING, "NetVSC interface \"%s\" (index %u) is routed",
- iface->if_name, iface->if_index);
- if (!specified)
+ if (!specified || !force)
return 0;
+ DRV_LOG(WARNING, "using routed NetVSC interface \"%s\""
+ " (index %u)", iface->if_name, iface->if_index);
}
/* Create interface context. */
ctx = calloc(1, sizeof(*ctx));
@@ -597,6 +600,7 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
static const char *const vdev_netvsc_arg[] = {
VDEV_NETVSC_ARG_IFACE,
VDEV_NETVSC_ARG_MAC,
+ VDEV_NETVSC_ARG_FORCE,
NULL,
};
const char *name = rte_vdev_device_name(dev);
@@ -605,6 +609,7 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
vdev_netvsc_arg);
unsigned int specified = 0;
unsigned int matched = 0;
+ int force = 0;
unsigned int i;
int ret;

@@ -616,14 +621,16 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
for (i = 0; i != kvargs->count; ++i) {
const struct rte_kvargs_pair *pair = &kvargs->pairs[i];

- if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE) ||
- !strcmp(pair->key, VDEV_NETVSC_ARG_MAC))
+ if (!strcmp(pair->key, VDEV_NETVSC_ARG_FORCE))
+ force = !!atoi(pair->value);
+ else if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE) ||
+ !strcmp(pair->key, VDEV_NETVSC_ARG_MAC))
++specified;
}
rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
/* Gather interfaces. */
ret = vdev_netvsc_foreach_iface(vdev_netvsc_netvsc_probe, name, kvargs,
- specified, &matched);
+ force, specified, &matched);
if (ret < 0)
goto error;
if (matched < specified)
@@ -682,7 +689,8 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
RTE_PMD_REGISTER_ALIAS(VDEV_NETVSC_DRIVER, eth_vdev_netvsc);
RTE_PMD_REGISTER_PARAM_STRING(net_vdev_netvsc,
VDEV_NETVSC_ARG_IFACE "=<string> "
- VDEV_NETVSC_ARG_MAC "=<string>");
+ VDEV_NETVSC_ARG_MAC "=<string> "
+ VDEV_NETVSC_ARG_FORCE "=<int>");

/** Initialize driver log type. */
RTE_INIT(vdev_netvsc_init_log)
--
1.8.3.1
Matan Azrad
2018-01-18 10:01:49 UTC
Permalink
Using DPDK in Hyper-V VM systems requires the vdev_netvsc driver to pair
each NetVSC netdevice with the PCI device sharing its MAC address through
the fail-safe PMD.

Add vdev_netvsc custom scan in vdev bus to allow automatic probing in
Hyper-V VM systems unless it was already specified by command line.

Add "ignore" parameter to disable this auto-detection.
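
The "unless it was already specified" rule amounts to an add-if-absent
operation on the list of device arguments. The sketch below models it on a
plain array instead of the EAL devargs list; the function name
devargs_add_unless_present and the array representation are assumptions for
illustration and do not correspond to any DPDK API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the devargs list: append "driver" to names[]
 * unless an entry with that exact name is already present.
 * Returns 1 if added, 0 if already present, -1 if the array is full.
 */
static int
devargs_add_unless_present(const char **names, size_t *count, size_t cap,
			   const char *driver)
{
	size_t i;

	assert(names != NULL && count != NULL && driver != NULL);
	for (i = 0; i != *count; ++i)
		if (strcmp(names[i], driver) == 0)
			return 0; /* User already requested it. */
	if (*count == cap)
		return -1;
	names[(*count)++] = driver;
	return 1;
}
```

With this shape, a user-provided --vdev=net_vdev_netvsc entry (including one
carrying ignore=1) always takes precedence over the automatic one.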

Signed-off-by: Matan Azrad <***@mellanox.com>
---
doc/guides/nics/vdev_netvsc.rst | 9 ++++--
drivers/net/vdev_netvsc/vdev_netvsc.c | 55 +++++++++++++++++++++++++++++++++--
2 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
index 3c26990..55d130a 100644
--- a/doc/guides/nics/vdev_netvsc.rst
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -71,8 +71,8 @@ Build options
Run-time parameters
-------------------

-To invoke this driver, applications have to explicitly provide the
-``--vdev=net_vdev_netvsc`` EAL option.
+This driver is invoked automatically in Hyper-V VM systems unless the user
+invokes it explicitly with the ``--vdev=net_vdev_netvsc`` EAL option.

The following device parameters are supported:

@@ -91,5 +91,10 @@ The following device parameters are supported:
If nonzero, forces the use of specified interfaces even if not detected as
NetVSC or detected as routed NetVSC.

+- ``ignore`` [int]
+
+ If nonzero, this driver instance is ignored (used to disable the
+ auto-detection in Hyper-V VMs).
+
Not specifying either ``iface`` or ``mac`` makes this driver attach itself to
all unrouted NetVSC interfaces found on the system.
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
index 2d03033..a8a1a7f 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -30,13 +30,16 @@
#include <rte_errno.h>
#include <rte_ethdev.h>
#include <rte_ether.h>
+#include <rte_hypervisor.h>
#include <rte_kvargs.h>
#include <rte_log.h>

#define VDEV_NETVSC_DRIVER net_vdev_netvsc
+#define VDEV_NETVSC_DRIVER_NAME RTE_STR(VDEV_NETVSC_DRIVER)
#define VDEV_NETVSC_ARG_IFACE "iface"
#define VDEV_NETVSC_ARG_MAC "mac"
#define VDEV_NETVSC_ARG_FORCE "force"
+#define VDEV_NETVSC_ARG_IGNORE "ignore"
#define VDEV_NETVSC_PROBE_MS 1000

#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"
@@ -45,7 +48,7 @@
#define DRV_LOG(level, ...) \
rte_log(RTE_LOG_ ## level, \
vdev_netvsc_logtype, \
- RTE_FMT(RTE_STR(VDEV_NETVSC_DRIVER) ": " \
+ RTE_FMT(VDEV_NETVSC_DRIVER_NAME ": " \
RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
RTE_FMT_TAIL(__VA_ARGS__,)))

@@ -601,6 +604,7 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
VDEV_NETVSC_ARG_IFACE,
VDEV_NETVSC_ARG_MAC,
VDEV_NETVSC_ARG_FORCE,
+ VDEV_NETVSC_ARG_IGNORE,
NULL,
};
const char *name = rte_vdev_device_name(dev);
@@ -610,6 +614,7 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
unsigned int specified = 0;
unsigned int matched = 0;
int force = 0;
+ int ignore = 0;
unsigned int i;
int ret;

@@ -623,10 +628,17 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =

if (!strcmp(pair->key, VDEV_NETVSC_ARG_FORCE))
force = !!atoi(pair->value);
+ else if (!strcmp(pair->key, VDEV_NETVSC_ARG_IGNORE))
+ ignore = !!atoi(pair->value);
else if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE) ||
!strcmp(pair->key, VDEV_NETVSC_ARG_MAC))
++specified;
}
+ if (ignore) {
+ if (kvargs)
+ rte_kvargs_free(kvargs);
+ return 0;
+ }
rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
/* Gather interfaces. */
ret = vdev_netvsc_foreach_iface(vdev_netvsc_netvsc_probe, name, kvargs,
@@ -690,7 +702,8 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
RTE_PMD_REGISTER_PARAM_STRING(net_vdev_netvsc,
VDEV_NETVSC_ARG_IFACE "=<string> "
VDEV_NETVSC_ARG_MAC "=<string> "
- VDEV_NETVSC_ARG_FORCE "=<int>");
+ VDEV_NETVSC_ARG_FORCE "=<int> "
+ VDEV_NETVSC_ARG_IGNORE "=<int>");

/** Initialize driver log type. */
RTE_INIT(vdev_netvsc_init_log)
@@ -699,3 +712,41 @@ static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
if (vdev_netvsc_logtype >= 0)
rte_log_set_level(vdev_netvsc_logtype, RTE_LOG_NOTICE);
}
+
+/** Compare function for vdev find device operation. */
+static int
+vdev_netvsc_cmp_rte_device(const struct rte_device *dev1,
+ __rte_unused const void *_dev2)
+{
+ return strcmp(dev1->devargs->name, VDEV_NETVSC_DRIVER_NAME);
+}
+
+/**
+ * A callback called by vdev bus scan function to ensure this driver probing
+ * automatically in Hyper-V VM system unless it already exists in the
+ * devargs list.
+ */
+static void
+vdev_netvsc_scan_callback(__rte_unused void *arg)
+{
+ struct rte_vdev_device *dev;
+ struct rte_devargs *devargs;
+ struct rte_bus *vbus = rte_bus_find_by_name("vdev");
+
+ TAILQ_FOREACH(devargs, &devargs_list, next)
+ if (!strcmp(devargs->name, VDEV_NETVSC_DRIVER_NAME))
+ return;
+ dev = (struct rte_vdev_device *)vbus->find_device(NULL,
+ vdev_netvsc_cmp_rte_device, VDEV_NETVSC_DRIVER_NAME);
+ if (dev)
+ return;
+ if (rte_eal_devargs_add(RTE_DEVTYPE_VIRTUAL, VDEV_NETVSC_DRIVER_NAME))
+ DRV_LOG(ERR, "unable to add netvsc devargs.");
+}
+
+/** Initialize the custom scan. */
+RTE_INIT(vdev_netvsc_custom_scan_add)
+{
+ if (rte_hypervisor_get() == RTE_HYPERVISOR_HYPERV)
+ rte_vdev_add_custom_scan(vdev_netvsc_scan_callback, NULL);
+}
--
1.8.3.1
Matan Azrad
2018-01-18 10:01:45 UTC
This patch lays the groundwork for this driver (draft documentation,
copyright notices, code base skeleton and build system hooks). While it can
be successfully compiled and invoked, it's an empty shell at this stage.

Signed-off-by: Adrien Mazarguil <***@6wind.com>
Signed-off-by: Matan Azrad <***@mellanox.com>
---
MAINTAINERS | 6 ++
config/common_base | 5 ++
config/common_linuxapp | 1 +
doc/guides/nics/features/vdev_netvsc.ini | 12 +++
doc/guides/nics/index.rst | 1 +
doc/guides/nics/vdev_netvsc.rst | 20 +++++
drivers/net/Makefile | 1 +
drivers/net/vdev_netvsc/Makefile | 27 ++++++
.../vdev_netvsc/rte_pmd_vdev_netvsc_version.map | 4 +
drivers/net/vdev_netvsc/vdev_netvsc.c | 99 ++++++++++++++++++++++
mk/rte.app.mk | 1 +
11 files changed, 177 insertions(+)
create mode 100644 doc/guides/nics/features/vdev_netvsc.ini
create mode 100644 doc/guides/nics/vdev_netvsc.rst
create mode 100644 drivers/net/vdev_netvsc/Makefile
create mode 100644 drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map
create mode 100644 drivers/net/vdev_netvsc/vdev_netvsc.c

diff --git a/MAINTAINERS b/MAINTAINERS
index af8de4f..97efbb9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -462,6 +462,12 @@ F: drivers/net/mrvl/
F: doc/guides/nics/mrvl.rst
F: doc/guides/nics/features/mrvl.ini

+Microsoft vdev-netvsc - EXPERIMENTAL
+M: Matan Azrad <***@mellanox.com>
+F: drivers/net/vdev_netvsc/
+F: doc/guides/nics/vdev_netvsc.rst
+F: doc/guides/nics/features/vdev_netvsc.ini
+
Netcope szedata2
M: Matej Vido <***@cesnet.cz>
F: drivers/net/szedata2/
diff --git a/config/common_base b/config/common_base
index 90508a8..664ff21 100644
--- a/config/common_base
+++ b/config/common_base
@@ -279,6 +279,11 @@ CONFIG_RTE_LIBRTE_NFP_DEBUG_RX=n
CONFIG_RTE_LIBRTE_MRVL_PMD=n

#
+# Compile virtual device driver for NetVSC on Hyper-V/Azure
+#
+CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD=n
+
+#
# Compile burst-oriented Broadcom BNXT PMD driver
#
CONFIG_RTE_LIBRTE_BNXT_PMD=y
diff --git a/config/common_linuxapp b/config/common_linuxapp
index 74c7d64..e043262 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -47,6 +47,7 @@ CONFIG_RTE_LIBRTE_PMD_VHOST=y
CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
CONFIG_RTE_LIBRTE_PMD_TAP=y
CONFIG_RTE_LIBRTE_AVP_PMD=y
+CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD=y
CONFIG_RTE_LIBRTE_NFP_PMD=y
CONFIG_RTE_LIBRTE_POWER=y
CONFIG_RTE_VIRTIO_USER=y
diff --git a/doc/guides/nics/features/vdev_netvsc.ini b/doc/guides/nics/features/vdev_netvsc.ini
new file mode 100644
index 0000000..cfc5cb9
--- /dev/null
+++ b/doc/guides/nics/features/vdev_netvsc.ini
@@ -0,0 +1,12 @@
+;
+; Supported features of the 'vdev_netvsc' network poll mode driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+ARMv7 = Y
+ARMv8 = Y
+Power8 = Y
+x86-32 = Y
+x86-64 = Y
+Usage doc = Y
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index 23babe9..5666046 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -64,6 +64,7 @@ Network Interface Controller Drivers
szedata2
tap
thunderx
+ vdev_netvsc
virtio
vhost
vmxnet3
diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
new file mode 100644
index 0000000..a952908
--- /dev/null
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: BSD-3-Clause
+ Copyright 2017 6WIND S.A.
+ Copyright 2017 Mellanox Technologies, Ltd.
+
+VDEV_NETVSC driver
+==================
+
+The VDEV_NETVSC driver (librte_pmd_vdev_netvsc) provides support for NetVSC
+interfaces and associated SR-IOV virtual function (VF) devices found in
+Linux virtual machines running on Microsoft Hyper-V_ (including Azure)
+platforms.
+
+.. _Hyper-V: https://docs.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-hyper-v
+
+Build options
+-------------
+
+- ``CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD`` (default ``y``)
+
+ Toggle compilation of this driver.
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index c2fd7f5..e112732 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -39,6 +39,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_SFC_EFX_PMD) += sfc
DIRS-$(CONFIG_RTE_LIBRTE_PMD_SZEDATA2) += szedata2
DIRS-$(CONFIG_RTE_LIBRTE_PMD_TAP) += tap
DIRS-$(CONFIG_RTE_LIBRTE_THUNDERX_NICVF_PMD) += thunderx
+DIRS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += vdev_netvsc
DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio
DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += vmxnet3

diff --git a/drivers/net/vdev_netvsc/Makefile b/drivers/net/vdev_netvsc/Makefile
new file mode 100644
index 0000000..2fb059d
--- /dev/null
+++ b/drivers/net/vdev_netvsc/Makefile
@@ -0,0 +1,27 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright 2017 6WIND S.A.
+# Copyright 2017 Mellanox Technologies, Ltd.
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+# Properties of the generated library.
+LIB = librte_pmd_vdev_netvsc.a
+LIBABIVER := 1
+EXPORT_MAP := rte_pmd_vdev_netvsc_version.map
+
+# Additional compilation flags.
+CFLAGS += -O3
+CFLAGS += -g
+CFLAGS += -std=c11 -pedantic -Wall -Wextra
+CFLAGS += $(WERROR_FLAGS)
+
+# Dependencies.
+LDLIBS += -lrte_bus_vdev
+LDLIBS += -lrte_eal
+LDLIBS += -lrte_ethdev
+LDLIBS += -lrte_kvargs
+
+# Source files.
+SRCS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += vdev_netvsc.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map b/drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map
new file mode 100644
index 0000000..179140f
--- /dev/null
+++ b/drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map
@@ -0,0 +1,4 @@
+DPDK_18.02 {
+
+ local: *;
+};
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
new file mode 100644
index 0000000..e895b32
--- /dev/null
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -0,0 +1,99 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2017 6WIND S.A.
+ * Copyright 2017 Mellanox Technologies, Ltd.
+ */
+
+#include <stddef.h>
+
+#include <rte_bus_vdev.h>
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_kvargs.h>
+#include <rte_log.h>
+
+#define VDEV_NETVSC_DRIVER net_vdev_netvsc
+#define VDEV_NETVSC_ARG_IFACE "iface"
+#define VDEV_NETVSC_ARG_MAC "mac"
+
+#define DRV_LOG(level, ...) \
+ rte_log(RTE_LOG_ ## level, \
+ vdev_netvsc_logtype, \
+ RTE_FMT(RTE_STR(VDEV_NETVSC_DRIVER) ": " \
+ RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
+ RTE_FMT_TAIL(__VA_ARGS__,)))
+
+/** Driver-specific log messages type. */
+static int vdev_netvsc_logtype;
+
+/** Number of driver instances relying on context list. */
+static unsigned int vdev_netvsc_ctx_inst;
+
+/**
+ * Probe NetVSC interfaces.
+ *
+ * @param dev
+ * Virtual device context for driver instance.
+ *
+ * @return
+ * Always 0, even in case of errors.
+ */
+static int
+vdev_netvsc_vdev_probe(struct rte_vdev_device *dev)
+{
+ static const char *const vdev_netvsc_arg[] = {
+ VDEV_NETVSC_ARG_IFACE,
+ VDEV_NETVSC_ARG_MAC,
+ NULL,
+ };
+ const char *name = rte_vdev_device_name(dev);
+ const char *args = rte_vdev_device_args(dev);
+ struct rte_kvargs *kvargs = rte_kvargs_parse(args ? args : "",
+ vdev_netvsc_arg);
+
+ DRV_LOG(DEBUG, "invoked as \"%s\", using arguments \"%s\"", name, args);
+ if (!kvargs) {
+ DRV_LOG(ERR, "cannot parse arguments list");
+ goto error;
+ }
+error:
+ if (kvargs)
+ rte_kvargs_free(kvargs);
+ ++vdev_netvsc_ctx_inst;
+ return 0;
+}
+
+/**
+ * Remove driver instance.
+ *
+ * @param dev
+ * Virtual device context for driver instance.
+ *
+ * @return
+ * Always 0.
+ */
+static int
+vdev_netvsc_vdev_remove(__rte_unused struct rte_vdev_device *dev)
+{
+ --vdev_netvsc_ctx_inst;
+ return 0;
+}
+
+/** Virtual device descriptor. */
+static struct rte_vdev_driver vdev_netvsc_vdev = {
+ .probe = vdev_netvsc_vdev_probe,
+ .remove = vdev_netvsc_vdev_remove,
+};
+
+RTE_PMD_REGISTER_VDEV(VDEV_NETVSC_DRIVER, vdev_netvsc_vdev);
+RTE_PMD_REGISTER_ALIAS(VDEV_NETVSC_DRIVER, eth_vdev_netvsc);
+RTE_PMD_REGISTER_PARAM_STRING(net_vdev_netvsc,
+ VDEV_NETVSC_ARG_IFACE "=<string> "
+ VDEV_NETVSC_ARG_MAC "=<string>");
+
+/** Initialize driver log type. */
+RTE_INIT(vdev_netvsc_init_log)
+{
+ vdev_netvsc_logtype = rte_log_register("pmd.vdev_netvsc");
+ if (vdev_netvsc_logtype >= 0)
+ rte_log_set_level(vdev_netvsc_logtype, RTE_LOG_NOTICE);
+}
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 78f23c5..2f8af49 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -157,6 +157,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_SFC_EFX_PMD) += -lrte_pmd_sfc_efx
_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_SZEDATA2) += -lrte_pmd_szedata2 -lsze2
_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_TAP) += -lrte_pmd_tap
_LDLIBS-$(CONFIG_RTE_LIBRTE_THUNDERX_NICVF_PMD) += -lrte_pmd_thunderx_nicvf
+_LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += -lrte_pmd_vdev_netvsc
_LDLIBS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += -lrte_pmd_virtio
ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += -lrte_pmd_vhost
--
1.8.3.1
Matan Azrad
2018-01-18 13:51:38 UTC
Virtual machines hosted by Hyper-V/Azure platforms are fitted with simplified virtual network devices named NetVSC that are used for fast communication between VM to VM, VM to hypervisor, and the outside.

They appear as standard system netdevices to user-land applications, the main difference being they are implemented on top of VMBUS instead of emulated PCI devices.

While this reads like a case for a standard DPDK PMD, there is more to it.

To accelerate outside communication, NetVSC devices as they appear in a VM can be paired with physical SR-IOV virtual function (VF) devices owned by that same VM. Both netdevices share the same MAC address in that case.

When paired, egress and most of the ingress traffic flow through the VF device, while part of it (e.g. multicasts, hypervisor control data) still flows through NetVSC. Moreover VF devices are not retained and disappear during VM migration; from a VM standpoint, they can be hot-plugged anytime with NetVSC acting as a fallback.

Running DPDK applications in such a context involves driving VF devices using their dedicated PMDs in a vendor-independent fashion (to benefit from maximum performance without writing dedicated code) while simultaneously listening to NetVSC and handling the related hot-plug events.

This new virtual driver (referred to as "vdev_netvsc" from this point on) automatically coordinates the Hyper-V/Azure-specific management part described above by relying on vendor-specific, failsafe and tap PMDs to expose a single consolidated Ethernet device usable directly by existing applications.

         .------------------.
         | DPDK application |
         `--------+---------'
                  |
           .------+------.
           | DPDK ethdev |
           `------+------'       Control
                  |                 |
     .------------+------------.    v    .--------------------.
     |       failsafe PMD      +---------+ vdev_netvsc driver |
     `--+-------------------+--'         `--------------------'
        |                   |
        |          .........|.........
        |          :        |        :
   .----+----.     :   .----+----.   :
   | tap PMD |     :   | any PMD |   :
   `----+----'     :   `----+----'   : <-- Hot-pluggable
        |          :        |        :
 .------+-------.  :  .-----+-----.  :
 | NetVSC-based |  :  | SR-IOV VF |  :
 |   netdevice  |  :  |   device  |  :
 `--------------'  :  `-----------'  :
                   :.................:
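
Since a NetVSC netdevice and its VF share the same MAC address, the pairing step can be pictured roughly as follows. This is only a minimal sketch and not the driver's code: struct iface, its fields and find_paired_vf() are hypothetical names made up for illustration, and a real implementation would gather this information from the system's netdevices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of a system netdevice. */
struct iface {
	const char *name; /* e.g. "eth0" */
	uint8_t mac[6];   /* hardware address */
	int is_netvsc;    /* nonzero when implemented on top of VMBUS */
};

/*
 * Return the first non-NetVSC interface sharing the MAC address of the
 * given NetVSC interface, or NULL when no VF is currently plugged in.
 */
static const struct iface *
find_paired_vf(const struct iface *netvsc, const struct iface *list, size_t n)
{
	size_t i;

	assert(netvsc != NULL && netvsc->is_netvsc);
	for (i = 0; i != n; ++i) {
		if (list[i].is_netvsc)
			continue;
		if (!memcmp(list[i].mac, netvsc->mac, sizeof(netvsc->mac)))
			return &list[i];
	}
	return NULL;
}
```

A NULL result corresponds to the fallback case described above, where only the NetVSC path (through tap) carries traffic until a VF is hot-plugged.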



v2 changes(Adrien):

- Renamed driver from "hyperv" to "vdev_netvsc". This change covers
documentation and symbols prefix.
- Driver is now tagged EXPERIMENTAL.
- Replaced ether_addr_from_str() with a basic sscanf() call.
- Removed debugging code (memset() poisoning).
- Fixed hyperv_iface_is_netvsc()'s buffer allocation according to comments.
- Removed hyperv_basename().
- Discarded unused variables through __rte_unused.
- Added separate but necessary free() bugfix for failsafe PMD.
- Added file descriptor input support to failsafe PMD.
- Replaced temporary bash execution; failsafe now reads device definitions
directly through a pipe without an intermediate bash one-liner.
- Expanded DEBUG/INFO/WARN/ERROR() macros as PMD_DRV_LOG().
- Added dynamic log type (pmd.vdev_netvsc).
- Modified initialization code to probe devices immediately during startup.
- Fixed several snprintf() return value checks ("ret >= sizeof(foo)" is more
appropriate than "ret >= sizeof(foo) - 1").

v3 changes(Matan):
- Fixed clang compilation in V2.
- Removed hotplug remove code from the new driver.
- Added support for capturing already probed sub-devices in fail-safe.
- Added automatic probing for Hyper-V VM systems.
- Added option to ignore the automatic probing.
- Skipped probing of routed NetVSC devices.
- Adjusted documentation and semantics.
- Replaced maintainer.

v4 changes(Matan):
- Align descriptions of context struct (Stephen's suggestion).
- Skip non-ethernet devices in netdev loop (Stephen's suggestion).
- Use different variable names in "add fd parameter" (Gaetan's suggestion).
- Change name of get port id function in "add automatic probing" (Gaetan's suggestion).
- Update internal fail-safe devargs in case of probed device (Gaetan's suggestion).
- Use a different commit title instead of "support probed sub-devices getting" (Gaetan's suggestion).

v5 changes(Matan):
- Improve fail-safe documentation as Gaetan suggested.
- Fix fcntl parameter.

v6 changes:
- fp!=NULL => fp==NULL in "add fd parameter".

Adrien Mazarguil (1):
net/failsafe: fix invalid free

Matan Azrad (7):
net/failsafe: add "fd" parameter
net/failsafe: add probed etherdev capture
net/vdev_netvsc: introduce Hyper-V platform driver
net/vdev_netvsc: implement core functionality
net/vdev_netvsc: skip routed netvsc probing
net/vdev_netvsc: add "force" parameter
net/vdev_netvsc: add automatic probing

MAINTAINERS | 6 +
config/common_base | 5 +
config/common_linuxapp | 1 +
doc/guides/nics/fail_safe.rst | 26 +
doc/guides/nics/features/vdev_netvsc.ini | 12 +
doc/guides/nics/index.rst | 1 +
doc/guides/nics/vdev_netvsc.rst | 100 +++
drivers/net/Makefile | 1 +
drivers/net/failsafe/failsafe_args.c | 84 ++-
drivers/net/failsafe/failsafe_eal.c | 78 ++-
drivers/net/failsafe/failsafe_private.h | 5 +
drivers/net/vdev_netvsc/Makefile | 31 +
.../vdev_netvsc/rte_pmd_vdev_netvsc_version.map | 4 +
drivers/net/vdev_netvsc/vdev_netvsc.c | 752 +++++++++++++++++++++
mk/rte.app.mk | 1 +
15 files changed, 1083 insertions(+), 24 deletions(-)
create mode 100644 doc/guides/nics/features/vdev_netvsc.ini
create mode 100644 doc/guides/nics/vdev_netvsc.rst
create mode 100644 drivers/net/vdev_netvsc/Makefile
create mode 100644 drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map
create mode 100644 drivers/net/vdev_netvsc/vdev_netvsc.c
--
1.8.3.1
Matan Azrad
2018-01-18 13:51:39 UTC
From: Adrien Mazarguil <***@6wind.com>

rte_free() is not supposed to work with pointers returned by calloc().

Fixes: a0194d828100 ("net/failsafe: add flexible device definition")
Cc: ***@dpdk.org
Cc: Gaetan Rivet <***@6wind.com>

Signed-off-by: Adrien Mazarguil <***@6wind.com>
Acked-by: Gaetan Rivet <***@6wind.com>
---
drivers/net/failsafe/failsafe_args.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/failsafe/failsafe_args.c b/drivers/net/failsafe/failsafe_args.c
index cfc83e3..ec63ac9 100644
--- a/drivers/net/failsafe/failsafe_args.c
+++ b/drivers/net/failsafe/failsafe_args.c
@@ -407,7 +407,7 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
uint8_t i;

FOREACH_SUBDEV(sdev, i, dev) {
- rte_free(sdev->cmdline);
+ free(sdev->cmdline);
sdev->cmdline = NULL;
free(sdev->devargs.args);
sdev->devargs.args = NULL;
--
1.8.3.1
Matan Azrad
2018-01-18 13:51:40 UTC
This parameter enables applications to provide device definitions through
an arbitrary file descriptor number.

Signed-off-by: Adrien Mazarguil <***@6wind.com>
Signed-off-by: Matan Azrad <***@mellanox.com>
Acked-by: Gaetan Rivet <***@6wind.com>
---
doc/guides/nics/fail_safe.rst | 9 ++++
drivers/net/failsafe/failsafe_args.c | 80 ++++++++++++++++++++++++++++++++-
drivers/net/failsafe/failsafe_private.h | 3 ++
3 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/doc/guides/nics/fail_safe.rst b/doc/guides/nics/fail_safe.rst
index c4e3d2e..5b1b47e 100644
--- a/doc/guides/nics/fail_safe.rst
+++ b/doc/guides/nics/fail_safe.rst
@@ -106,6 +106,15 @@ Fail-safe command line parameters
All commas within the ``shell command`` are replaced by spaces before
executing the command. This helps using scripts to specify devices.

+- **fd(<file descriptor number>)** parameter
+
+ This parameter reads a device definition from an arbitrary file descriptor
+ number in ``<iface>`` format as described above.
+
+ The file descriptor is read in non-blocking mode and is never closed in
+ order to take only the last line into account (unlike ``exec()``) at every
+ probe attempt.
+
- **mac** parameter [MAC address]

This parameter allows the user to set a default MAC address to the fail-safe
diff --git a/drivers/net/failsafe/failsafe_args.c b/drivers/net/failsafe/failsafe_args.c
index ec63ac9..a1fb3fa 100644
--- a/drivers/net/failsafe/failsafe_args.c
+++ b/drivers/net/failsafe/failsafe_args.c
@@ -31,7 +31,11 @@
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
#include <string.h>
+#include <unistd.h>
#include <errno.h>

#include <rte_debug.h>
@@ -161,6 +165,67 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
}

static int
+fs_read_fd(struct sub_device *sdev, char *fd_str)
+{
+ FILE *fp = NULL;
+ int fd = -1;
+ /* store possible newline as well */
+ char output[DEVARGS_MAXLEN + 1];
+ int err = -ENODEV;
+ int oflags;
+ int lcount;
+
+ RTE_ASSERT(fd_str != NULL || sdev->fd_str != NULL);
+ if (sdev->fd_str == NULL) {
+ sdev->fd_str = strdup(fd_str);
+ if (sdev->fd_str == NULL) {
+ ERROR("Command line allocation failed");
+ return -ENOMEM;
+ }
+ }
+ errno = 0;
+ fd = strtol(fd_str, &fd_str, 0);
+ if (errno || *fd_str || fd < 0) {
+ ERROR("Parsing FD number failed");
+ goto error;
+ }
+ /* Fiddle with copy of file descriptor */
+ fd = dup(fd);
+ if (fd == -1)
+ goto error;
+ oflags = fcntl(fd, F_GETFL);
+ if (oflags == -1)
+ goto error;
+ if (fcntl(fd, F_SETFL, oflags | O_NONBLOCK) == -1)
+ goto error;
+ fp = fdopen(fd, "r");
+ if (fp == NULL)
+ goto error;
+ fd = -1;
+ /* Only take the last line into account */
+ lcount = 0;
+ while (fgets(output, sizeof(output), fp))
+ ++lcount;
+ if (lcount == 0)
+ goto error;
+ else if (ferror(fp) && errno != EAGAIN)
+ goto error;
+ /* Line must end with a newline character */
+ fs_sanitize_cmdline(output);
+ if (output[0] == '\0')
+ goto error;
+ err = fs_parse_device(sdev, output);
+ if (err)
+ ERROR("Parsing device '%s' failed", output);
+error:
+ if (fp)
+ fclose(fp);
+ if (fd != -1)
+ close(fd);
+ return err;
+}
+
+static int
fs_parse_device_param(struct rte_eth_dev *dev, const char *param,
uint8_t head)
{
@@ -202,6 +267,14 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
}
if (ret)
goto free_args;
+ } else if (strncmp(param, "fd(", 3) == 0) {
+ ret = fs_read_fd(sdev, args);
+ if (ret == -ENODEV) {
+ DEBUG("Reading device info from FD failed");
+ ret = 0;
+ }
+ if (ret)
+ goto free_args;
} else {
ERROR("Unrecognized device type: %.*s", (int)b, param);
return -EINVAL;
@@ -409,6 +482,8 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
FOREACH_SUBDEV(sdev, i, dev) {
free(sdev->cmdline);
sdev->cmdline = NULL;
+ free(sdev->fd_str);
+ sdev->fd_str = NULL;
free(sdev->devargs.args);
sdev->devargs.args = NULL;
}
@@ -424,7 +499,8 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
param[b] != '\0')
b++;
if (strncmp(param, "dev", b) != 0 &&
- strncmp(param, "exec", b) != 0) {
+ strncmp(param, "exec", b) != 0 &&
+ strncmp(param, "fd(", b) != 0) {
ERROR("Unrecognized device type: %.*s", (int)b, param);
return -EINVAL;
}
@@ -463,6 +539,8 @@ typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
continue;
if (sdev->cmdline)
ret = fs_execute_cmd(sdev, sdev->cmdline);
+ else if (sdev->fd_str)
+ ret = fs_read_fd(sdev, sdev->fd_str);
else
ret = fs_parse_sub_device(sdev);
if (ret == 0)
diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
index 54b5b91..5e04ffe 100644
--- a/drivers/net/failsafe/failsafe_private.h
+++ b/drivers/net/failsafe/failsafe_private.h
@@ -48,6 +48,7 @@
#define PMD_FAILSAFE_PARAM_STRING \
"dev(<ifc>)," \
"exec(<shell command>)," \
+ "fd(<fd number>)," \
"mac=mac_addr," \
"hotplug_poll=u64" \
""
@@ -112,6 +113,8 @@ struct sub_device {
struct fs_stats stats_snapshot;
/* Some device are defined as a command line */
char *cmdline;
+ /* Others are retrieved through a file descriptor */
+ char *fd_str;
/* fail-safe device backreference */
struct rte_eth_dev *fs_dev;
/* flag calling for recollection */
--
1.8.3.1
Matan Azrad
2018-01-18 13:51:41 UTC
The previous fail-safe code did not support capturing sub-devices that
were already probed and failed when it tried to probe them again.

Skip probing a fail-safe sub-device when it has already been probed.

Signed-off-by: Matan Azrad <***@mellanox.com>
Cc: Gaetan Rivet <***@6wind.com>
---
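The decision described above (capture a sub-device that is already probed, hot-plug it otherwise) can be sketched without DPDK as follows. All names here (struct port, port_id_get(), sub_device_action()) are hypothetical stand-ins; only the strncmp()-based comparison mirrors what this patch does in fs_ethdev_portid_get().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the table of already probed ethdev ports. */
struct port {
	const char *name;
};

enum action { CAPTURE_PROBED, HOTPLUG_ADD };

/*
 * Look up name among probed ports. Only strlen(name) characters are
 * compared, as in the patch, so "net_tap" also matches "net_tap0".
 * Returns 0 and stores the index in *id, or -ENODEV when not found.
 */
static int
port_id_get(const struct port *ports, size_t n, const char *name, size_t *id)
{
	size_t len;
	size_t i;

	assert(id != NULL);
	if (name == NULL)
		return -EINVAL;
	len = strlen(name);
	for (i = 0; i != n; ++i) {
		if (!strncmp(name, ports[i].name, len)) {
			*id = i;
			return 0;
		}
	}
	return -ENODEV;
}

/* Capture the port when it already exists, otherwise request a hot-plug. */
static enum action
sub_device_action(const struct port *ports, size_t n, const char *name)
{
	size_t id;

	return port_id_get(ports, n, name, &id) == 0 ?
	       CAPTURE_PROBED : HOTPLUG_ADD;
}
```

The prefix match is worth keeping in mind when several probed devices have names starting with the same string, since the first one found is returned.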
doc/guides/nics/fail_safe.rst | 17 +++++++
drivers/net/failsafe/failsafe_args.c | 2 -
drivers/net/failsafe/failsafe_eal.c | 78 ++++++++++++++++++++++++---------
drivers/net/failsafe/failsafe_private.h | 2 +
4 files changed, 77 insertions(+), 22 deletions(-)

diff --git a/doc/guides/nics/fail_safe.rst b/doc/guides/nics/fail_safe.rst
index 5b1b47e..3f72b59 100644
--- a/doc/guides/nics/fail_safe.rst
+++ b/doc/guides/nics/fail_safe.rst
@@ -93,6 +93,14 @@ Fail-safe command line parameters
additional sub-device parameters if need be. They will be passed on to the
sub-device.

+.. note::
+
+ When a whitelisted sub-device has already been probed by EAL, the fail-safe
+ PMD takes the device as is, so the EAL device options apply in this case.
+ When using a PCI device automatically probed in blacklist mode, the
+ fail-safe syntax must use the full PCI id:
+ Domain:Bus:Device.Function. See the usage example section.
+
- **exec(<shell command>)** parameter

This parameter allows the user to provide a command to the fail-safe PMD to
@@ -169,6 +177,15 @@ This section shows some example of using **testpmd** with a fail-safe PMD.
$RTE_TARGET/build/app/testpmd -c 0xff -n 4 --no-pci \
--vdev='net_failsafe0,exec(echo 84:00.0)' -- -i

+#. Start testpmd, automatically probing the device 84:00.0 and using it with
+ the fail-safe.
+
+ .. code-block:: console
+
+ $RTE_TARGET/build/app/testpmd -c 0xff -n 4 \
+ --vdev 'net_failsafe0,dev(0000:84:00.0),dev(net_ring0)' -- -i
+
+
Using the Fail-safe PMD from an application
-------------------------------------------

diff --git a/drivers/net/failsafe/failsafe_args.c b/drivers/net/failsafe/failsafe_args.c
index a1fb3fa..b049b75 100644
--- a/drivers/net/failsafe/failsafe_args.c
+++ b/drivers/net/failsafe/failsafe_args.c
@@ -45,8 +45,6 @@

#include "failsafe_private.h"

-#define DEVARGS_MAXLEN 4096
-
/* Callback used when a new device is found in devargs */
typedef int (parse_cb)(struct rte_eth_dev *dev, const char *params,
uint8_t head);
diff --git a/drivers/net/failsafe/failsafe_eal.c b/drivers/net/failsafe/failsafe_eal.c
index 19d26f5..33a5adf 100644
--- a/drivers/net/failsafe/failsafe_eal.c
+++ b/drivers/net/failsafe/failsafe_eal.c
@@ -36,39 +36,77 @@
#include "failsafe_private.h"

static int
+fs_ethdev_portid_get(const char *name, uint16_t *port_id)
+{
+ uint16_t pid;
+ size_t len;
+
+ if (name == NULL) {
+ DEBUG("Null pointer is specified\n");
+ return -EINVAL;
+ }
+ len = strlen(name);
+ RTE_ETH_FOREACH_DEV(pid) {
+ if (!strncmp(name, rte_eth_devices[pid].device->name, len)) {
+ *port_id = pid;
+ return 0;
+ }
+ }
+ return -ENODEV;
+}
+
+static int
fs_bus_init(struct rte_eth_dev *dev)
{
struct sub_device *sdev;
struct rte_devargs *da;
uint8_t i;
- uint16_t j;
+ uint16_t pid;
int ret;

FOREACH_SUBDEV(sdev, i, dev) {
if (sdev->state != DEV_PARSED)
continue;
da = &sdev->devargs;
- ret = rte_eal_hotplug_add(da->bus->name,
- da->name,
- da->args);
- if (ret) {
- ERROR("sub_device %d probe failed %s%s%s", i,
- rte_errno ? "(" : "",
- rte_errno ? strerror(rte_errno) : "",
- rte_errno ? ")" : "");
- continue;
- }
- RTE_ETH_FOREACH_DEV(j) {
- if (strcmp(rte_eth_devices[j].device->name,
- da->name) == 0) {
- ETH(sdev) = &rte_eth_devices[j];
- break;
+ if (fs_ethdev_portid_get(da->name, &pid) != 0) {
+ ret = rte_eal_hotplug_add(da->bus->name,
+ da->name,
+ da->args);
+ if (ret) {
+ ERROR("sub_device %d probe failed %s%s%s", i,
+ rte_errno ? "(" : "",
+ rte_errno ? strerror(rte_errno) : "",
+ rte_errno ? ")" : "");
+ continue;
}
+ if (fs_ethdev_portid_get(da->name, &pid) != 0) {
+ ERROR("sub_device %d init went wrong", i);
+ return -ENODEV;
+ }
+ } else {
+ char devstr[DEVARGS_MAXLEN] = "";
+ struct rte_devargs *probed_da =
+ rte_eth_devices[pid].device->devargs;
+
+ /* Take control of device probed by EAL options. */
+ free(da->args);
+ memset(da, 0, sizeof(*da));
+ if (probed_da != NULL)
+ snprintf(devstr, sizeof(devstr), "%s,%s",
+ probed_da->name, probed_da->args);
+ else
+ snprintf(devstr, sizeof(devstr), "%s",
+ rte_eth_devices[pid].device->name);
+ ret = rte_eal_devargs_parse(devstr, da);
+ if (ret) {
+ ERROR("Probed devargs parsing failed with code"
+ " %d", ret);
+ return ret;
+ }
+ INFO("Taking control of a probed sub device"
+ " %d named %s", i, da->name);
}
- if (ETH(sdev) == NULL) {
- ERROR("sub_device %d init went wrong", i);
- return -ENODEV;
- }
+ ETH(sdev) = &rte_eth_devices[pid];
SUB_ID(sdev) = i;
sdev->fs_dev = dev;
sdev->dev = ETH(sdev)->device;
diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
index 5e04ffe..9fcf72e 100644
--- a/drivers/net/failsafe/failsafe_private.h
+++ b/drivers/net/failsafe/failsafe_private.h
@@ -58,6 +58,8 @@
#define FAILSAFE_MAX_ETHPORTS 2
#define FAILSAFE_MAX_ETHADDR 128

+#define DEVARGS_MAXLEN 4096
+
/* TYPES */

struct rxq {
--
1.8.3.1
Matan Azrad
2018-01-18 13:51:42 UTC
This patch lays the groundwork for this driver (draft documentation,
copyright notices, code base skeleton and build system hooks). While it can
be successfully compiled and invoked, it's an empty shell at this stage.

Signed-off-by: Adrien Mazarguil <***@6wind.com>
Signed-off-by: Matan Azrad <***@mellanox.com>
---
MAINTAINERS | 6 ++
config/common_base | 5 ++
config/common_linuxapp | 1 +
doc/guides/nics/features/vdev_netvsc.ini | 12 +++
doc/guides/nics/index.rst | 1 +
doc/guides/nics/vdev_netvsc.rst | 20 +++++
drivers/net/Makefile | 1 +
drivers/net/vdev_netvsc/Makefile | 27 ++++++
.../vdev_netvsc/rte_pmd_vdev_netvsc_version.map | 4 +
drivers/net/vdev_netvsc/vdev_netvsc.c | 99 ++++++++++++++++++++++
mk/rte.app.mk | 1 +
11 files changed, 177 insertions(+)
create mode 100644 doc/guides/nics/features/vdev_netvsc.ini
create mode 100644 doc/guides/nics/vdev_netvsc.rst
create mode 100644 drivers/net/vdev_netvsc/Makefile
create mode 100644 drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map
create mode 100644 drivers/net/vdev_netvsc/vdev_netvsc.c

diff --git a/MAINTAINERS b/MAINTAINERS
index af8de4f..97efbb9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -462,6 +462,12 @@ F: drivers/net/mrvl/
F: doc/guides/nics/mrvl.rst
F: doc/guides/nics/features/mrvl.ini

+Microsoft vdev-netvsc - EXPERIMENTAL
+M: Matan Azrad <***@mellanox.com>
+F: drivers/net/vdev_netvsc/
+F: doc/guides/nics/vdev_netvsc.rst
+F: doc/guides/nics/features/vdev_netvsc.ini
+
Netcope szedata2
M: Matej Vido <***@cesnet.cz>
F: drivers/net/szedata2/
diff --git a/config/common_base b/config/common_base
index 90508a8..664ff21 100644
--- a/config/common_base
+++ b/config/common_base
@@ -279,6 +279,11 @@ CONFIG_RTE_LIBRTE_NFP_DEBUG_RX=n
CONFIG_RTE_LIBRTE_MRVL_PMD=n

#
+# Compile virtual device driver for NetVSC on Hyper-V/Azure
+#
+CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD=n
+
+#
# Compile burst-oriented Broadcom BNXT PMD driver
#
CONFIG_RTE_LIBRTE_BNXT_PMD=y
diff --git a/config/common_linuxapp b/config/common_linuxapp
index 74c7d64..e043262 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -47,6 +47,7 @@ CONFIG_RTE_LIBRTE_PMD_VHOST=y
CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
CONFIG_RTE_LIBRTE_PMD_TAP=y
CONFIG_RTE_LIBRTE_AVP_PMD=y
+CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD=y
CONFIG_RTE_LIBRTE_NFP_PMD=y
CONFIG_RTE_LIBRTE_POWER=y
CONFIG_RTE_VIRTIO_USER=y
diff --git a/doc/guides/nics/features/vdev_netvsc.ini b/doc/guides/nics/features/vdev_netvsc.ini
new file mode 100644
index 0000000..cfc5cb9
--- /dev/null
+++ b/doc/guides/nics/features/vdev_netvsc.ini
@@ -0,0 +1,12 @@
+;
+; Supported features of the 'vdev_netvsc' network poll mode driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+ARMv7 = Y
+ARMv8 = Y
+Power8 = Y
+x86-32 = Y
+x86-64 = Y
+Usage doc = Y
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index 23babe9..5666046 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -64,6 +64,7 @@ Network Interface Controller Drivers
szedata2
tap
thunderx
+ vdev_netvsc
virtio
vhost
vmxnet3
diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
new file mode 100644
index 0000000..a952908
--- /dev/null
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: BSD-3-Clause
+ Copyright 2017 6WIND S.A.
+ Copyright 2017 Mellanox Technologies, Ltd.
+
+VDEV_NETVSC driver
+==================
+
+The VDEV_NETVSC driver (librte_pmd_vdev_netvsc) provides support for NetVSC
+interfaces and associated SR-IOV virtual function (VF) devices found in
+Linux virtual machines running on Microsoft Hyper-V_ (including Azure)
+platforms.
+
+.. _Hyper-V: https://docs.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-hyper-v
+
+Build options
+-------------
+
+- ``CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD`` (default ``y``)
+
+ Toggle compilation of this driver.
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index c2fd7f5..e112732 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -39,6 +39,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_SFC_EFX_PMD) += sfc
DIRS-$(CONFIG_RTE_LIBRTE_PMD_SZEDATA2) += szedata2
DIRS-$(CONFIG_RTE_LIBRTE_PMD_TAP) += tap
DIRS-$(CONFIG_RTE_LIBRTE_THUNDERX_NICVF_PMD) += thunderx
+DIRS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += vdev_netvsc
DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio
DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += vmxnet3

diff --git a/drivers/net/vdev_netvsc/Makefile b/drivers/net/vdev_netvsc/Makefile
new file mode 100644
index 0000000..2fb059d
--- /dev/null
+++ b/drivers/net/vdev_netvsc/Makefile
@@ -0,0 +1,27 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright 2017 6WIND S.A.
+# Copyright 2017 Mellanox Technologies, Ltd.
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+# Properties of the generated library.
+LIB = librte_pmd_vdev_netvsc.a
+LIBABIVER := 1
+EXPORT_MAP := rte_pmd_vdev_netvsc_version.map
+
+# Additional compilation flags.
+CFLAGS += -O3
+CFLAGS += -g
+CFLAGS += -std=c11 -pedantic -Wall -Wextra
+CFLAGS += $(WERROR_FLAGS)
+
+# Dependencies.
+LDLIBS += -lrte_bus_vdev
+LDLIBS += -lrte_eal
+LDLIBS += -lrte_ethdev
+LDLIBS += -lrte_kvargs
+
+# Source files.
+SRCS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += vdev_netvsc.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map b/drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map
new file mode 100644
index 0000000..179140f
--- /dev/null
+++ b/drivers/net/vdev_netvsc/rte_pmd_vdev_netvsc_version.map
@@ -0,0 +1,4 @@
+DPDK_18.02 {
+
+ local: *;
+};
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
new file mode 100644
index 0000000..e895b32
--- /dev/null
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -0,0 +1,99 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2017 6WIND S.A.
+ * Copyright 2017 Mellanox Technologies, Ltd.
+ */
+
+#include <stddef.h>
+
+#include <rte_bus_vdev.h>
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_kvargs.h>
+#include <rte_log.h>
+
+#define VDEV_NETVSC_DRIVER net_vdev_netvsc
+#define VDEV_NETVSC_ARG_IFACE "iface"
+#define VDEV_NETVSC_ARG_MAC "mac"
+
+#define DRV_LOG(level, ...) \
+ rte_log(RTE_LOG_ ## level, \
+ vdev_netvsc_logtype, \
+ RTE_FMT(RTE_STR(VDEV_NETVSC_DRIVER) ": " \
+ RTE_FMT_HEAD(__VA_ARGS__,) "\n", \
+ RTE_FMT_TAIL(__VA_ARGS__,)))
+
+/** Driver-specific log messages type. */
+static int vdev_netvsc_logtype;
+
+/** Number of driver instances relying on context list. */
+static unsigned int vdev_netvsc_ctx_inst;
+
+/**
+ * Probe NetVSC interfaces.
+ *
+ * @param dev
+ * Virtual device context for driver instance.
+ *
+ * @return
+ * Always 0, even in case of errors.
+ */
+static int
+vdev_netvsc_vdev_probe(struct rte_vdev_device *dev)
+{
+ static const char *const vdev_netvsc_arg[] = {
+ VDEV_NETVSC_ARG_IFACE,
+ VDEV_NETVSC_ARG_MAC,
+ NULL,
+ };
+ const char *name = rte_vdev_device_name(dev);
+ const char *args = rte_vdev_device_args(dev);
+ struct rte_kvargs *kvargs = rte_kvargs_parse(args ? args : "",
+ vdev_netvsc_arg);
+
+ DRV_LOG(DEBUG, "invoked as \"%s\", using arguments \"%s\"", name, args);
+ if (!kvargs) {
+ DRV_LOG(ERR, "cannot parse arguments list");
+ goto error;
+ }
+error:
+ if (kvargs)
+ rte_kvargs_free(kvargs);
+ ++vdev_netvsc_ctx_inst;
+ return 0;
+}
+
+/**
+ * Remove driver instance.
+ *
+ * @param dev
+ * Virtual device context for driver instance.
+ *
+ * @return
+ * Always 0.
+ */
+static int
+vdev_netvsc_vdev_remove(__rte_unused struct rte_vdev_device *dev)
+{
+ --vdev_netvsc_ctx_inst;
+ return 0;
+}
+
+/** Virtual device descriptor. */
+static struct rte_vdev_driver vdev_netvsc_vdev = {
+ .probe = vdev_netvsc_vdev_probe,
+ .remove = vdev_netvsc_vdev_remove,
+};
+
+RTE_PMD_REGISTER_VDEV(VDEV_NETVSC_DRIVER, vdev_netvsc_vdev);
+RTE_PMD_REGISTER_ALIAS(VDEV_NETVSC_DRIVER, eth_vdev_netvsc);
+RTE_PMD_REGISTER_PARAM_STRING(net_vdev_netvsc,
+ VDEV_NETVSC_ARG_IFACE "=<string> "
+ VDEV_NETVSC_ARG_MAC "=<string>");
+
+/** Initialize driver log type. */
+RTE_INIT(vdev_netvsc_init_log)
+{
+ vdev_netvsc_logtype = rte_log_register("pmd.vdev_netvsc");
+ if (vdev_netvsc_logtype >= 0)
+ rte_log_set_level(vdev_netvsc_logtype, RTE_LOG_NOTICE);
+}
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 78f23c5..2f8af49 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -157,6 +157,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_SFC_EFX_PMD) += -lrte_pmd_sfc_efx
_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_SZEDATA2) += -lrte_pmd_szedata2 -lsze2
_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_TAP) += -lrte_pmd_tap
_LDLIBS-$(CONFIG_RTE_LIBRTE_THUNDERX_NICVF_PMD) += -lrte_pmd_thunderx_nicvf
+_LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += -lrte_pmd_vdev_netvsc
_LDLIBS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += -lrte_pmd_virtio
ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += -lrte_pmd_vhost
--
1.8.3.1
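
Note on the argument handling in this skeleton: the probe callback hands the device arguments to rte_kvargs_parse() together with the list of accepted keys ("iface" and "mac"). For readers without a DPDK tree at hand, the following is a minimal stand-alone sketch of that kind of validation. It does not use or reproduce the rte_kvargs API; count_valid_pairs() is a hypothetical helper written for illustration only, and it assumes the simple "key=value,key=value" syntax.

```c
#include <stddef.h>
#include <string.h>

/* Keys accepted by the driver, mirroring vdev_netvsc_arg[] in the patch. */
static const char *const valid_keys[] = { "iface", "mac", NULL };

/*
 * Hypothetical helper (not a DPDK function): walk a "key=value,key=value"
 * string and count its pairs. Returns -1 on a malformed pair or when a key
 * is not part of valid_keys[].
 */
static int
count_valid_pairs(const char *args)
{
	int count = 0;

	while (*args != '\0') {
		size_t pair_len = strcspn(args, ",");
		const char *eq = memchr(args, '=', pair_len);
		size_t key_len;
		size_t i;

		if (eq == NULL || eq == args)
			return -1; /* Missing '=' or empty key. */
		key_len = (size_t)(eq - args);
		for (i = 0; valid_keys[i] != NULL; ++i)
			if (strlen(valid_keys[i]) == key_len &&
			    !strncmp(valid_keys[i], args, key_len))
				break;
		if (valid_keys[i] == NULL)
			return -1; /* Key not in the accepted list. */
		++count;
		args += pair_len;
		if (*args == ',')
			++args; /* Skip the pair separator. */
	}
	return count;
}
```

With this sketch, "iface=eth1,mac=00:11:22:33:44:55" yields two pairs, an empty string yields none, and an unknown key such as "foo=bar" is rejected.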
Matan Azrad
2018-01-18 13:51:43 UTC
As described in more detail in the attached documentation (see patch
contents), this virtual device driver manages NetVSC interfaces in virtual
machines hosted by Hyper-V/Azure platforms.

This driver does not manage traffic nor Ethernet devices directly; it acts
as a thin configuration layer that automatically instantiates and controls
fail-safe PMD instances combining tap and PCI sub-devices, so that each
NetVSC interface is exposed as a single consolidated port to DPDK
applications.

Since PCI sub-devices are hot-pluggable (e.g. during VM migration),
applications automatically benefit from increased throughput when they are
present and fall back on NetVSC otherwise, without interruption, thanks to
fail-safe's hot-plug handling.

Once initialized, the sole job of the vdev_netvsc driver is to regularly
scan for PCI devices to associate with NetVSC interfaces and feed their
addresses to corresponding fail-safe instances.

Signed-off-by: Adrien Mazarguil <***@6wind.com>
Signed-off-by: Matan Azrad <***@mellanox.com>
---
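
Note on how addresses are fed to fail-safe: in this patch each context owns a non-blocking pipe whose read end is passed to the fail-safe instance through its fd() argument, and the PCI address is written to the other end as a single newline-terminated line. The stand-alone sketch below illustrates only that writer side with plain POSIX calls. set_nonblock() and feed_address() are hypothetical names, not functions from the driver, and any address passed in is whatever the caller chooses.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Switch a descriptor to non-blocking mode; 0 on success, -errno otherwise. */
static int
set_nonblock(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
		return -errno;
	return 0;
}

/*
 * Hypothetical helper: send one newline-terminated address to the write end
 * of a pipe. Returns 0 on success, -errno when write() fails, -ENOBUFS when
 * the address does not fit, and -EIO on a partial write.
 */
static int
feed_address(int wfd, const char *addr)
{
	char line[256];
	int len = snprintf(line, sizeof(line), "%s\n", addr);
	ssize_t ret;

	if (len < 0 || (size_t)len >= sizeof(line))
		return -ENOBUFS;
	ret = write(wfd, line, (size_t)len);
	if (ret == -1)
		return -errno;
	return ret == (ssize_t)len ? 0 : -EIO;
}
```

Because both ends are non-blocking, a reader polling an empty pipe gets EAGAIN instead of stalling, and a writer facing a full pipe fails immediately; the driver treats that case as "try again on the next alarm".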
doc/guides/nics/vdev_netvsc.rst | 70 +++++
drivers/net/vdev_netvsc/Makefile | 4 +
drivers/net/vdev_netvsc/vdev_netvsc.c | 550 +++++++++++++++++++++++++++++++++-
3 files changed, 623 insertions(+), 1 deletion(-)

diff --git a/doc/guides/nics/vdev_netvsc.rst b/doc/guides/nics/vdev_netvsc.rst
index a952908..fde1fb8 100644
--- a/doc/guides/nics/vdev_netvsc.rst
+++ b/doc/guides/nics/vdev_netvsc.rst
@@ -12,9 +12,79 @@ platforms.

.. _Hyper-V: https://docs.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-hyper-v

+Implementation details
+----------------------
+
+Each instance of this driver effectively needs to drive two devices: the
+NetVSC interface proper and its SR-IOV VF (referred to as "physical" from
+this point on) counterpart sharing the same MAC address.
+
+Physical devices are part of the host system and cannot be maintained during
+VM migration. From a VM standpoint they appear as hot-plug devices that come
+and go without prior notice.
+
+When the physical device is present, egress and most of the ingress traffic
+flows through it; only multicasts and hypervisor control data still flow
+through NetVSC. Otherwise, NetVSC acts as a fallback for all traffic.
+
+To avoid unnecessary code duplication and ensure maximum performance,
+handling of physical devices is left to their original PMDs; this virtual
+device driver (also known as *vdev*) manages other PMDs as summarized by the
+following block diagram::
+
+ .------------------.
+ | DPDK application |
+ `--------+---------'
+ |
+ .------+------.
+ | DPDK ethdev |
+ `------+------' Control
+ | |
+ .------------+------------. v .--------------------.
+ | failsafe PMD +---------+ vdev_netvsc driver |
+ `--+-------------------+--' `--------------------'
+ | |
+ | .........|.........
+ | : | :
+ .----+----. : .----+----. :
+ | tap PMD | : | any PMD | :
+ `----+----' : `----+----' : <-- Hot-pluggable
+ | : | :
+ .------+-------. : .-----+-----. :
+ | NetVSC-based | : | SR-IOV VF | :
+ | netdevice | : | device | :
+ `--------------' : `-----------' :
+ :.................:
+
+
+This driver implementation may be temporary and should be improved or removed
+either once hot-plug is fully supported in EAL and bus drivers or once a new
+NetVSC driver is integrated.
+
Build options
-------------

- ``CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD`` (default ``y``)

Toggle compilation of this driver.
+
+Run-time parameters
+-------------------
+
+To invoke this driver, applications have to explicitly provide the
+``--vdev=net_vdev_netvsc`` EAL option.
+
+The following device parameters are supported:
+
+- ``iface`` [string]
+
+ Provide a specific NetVSC interface (netdevice) name to attach this driver
+ to. Can be provided multiple times for additional instances.
+
+- ``mac`` [string]
+
+ Same as ``iface`` except a suitable NetVSC interface is located using its
+ MAC address.
+
+Not specifying either ``iface`` or ``mac`` makes this driver attach itself to
+all NetVSC interfaces found on the system.
diff --git a/drivers/net/vdev_netvsc/Makefile b/drivers/net/vdev_netvsc/Makefile
index 2fb059d..f2b2ac5 100644
--- a/drivers/net/vdev_netvsc/Makefile
+++ b/drivers/net/vdev_netvsc/Makefile
@@ -13,6 +13,9 @@ EXPORT_MAP := rte_pmd_vdev_netvsc_version.map
CFLAGS += -O3
CFLAGS += -g
CFLAGS += -std=c11 -pedantic -Wall -Wextra
+CFLAGS += -D_XOPEN_SOURCE=600
+CFLAGS += -D_BSD_SOURCE
+CFLAGS += -D_DEFAULT_SOURCE
CFLAGS += $(WERROR_FLAGS)

# Dependencies.
@@ -20,6 +23,7 @@ LDLIBS += -lrte_bus_vdev
LDLIBS += -lrte_eal
LDLIBS += -lrte_ethdev
LDLIBS += -lrte_kvargs
+LDLIBS += -lrte_net

# Source files.
SRCS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += vdev_netvsc.c
diff --git a/drivers/net/vdev_netvsc/vdev_netvsc.c b/drivers/net/vdev_netvsc/vdev_netvsc.c
index e895b32..21c3265 100644
--- a/drivers/net/vdev_netvsc/vdev_netvsc.c
+++ b/drivers/net/vdev_netvsc/vdev_netvsc.c
@@ -3,17 +3,42 @@
* Copyright 2017 Mellanox Technologies, Ltd.
*/

+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <linux/sockios.h>
+#include <net/if.h>
+#include <net/if_arp.h>
+#include <netinet/ip.h>
+#include <stdarg.h>
#include <stddef.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/queue.h>
+#include <sys/socket.h>
+#include <unistd.h>

+#include <rte_alarm.h>
+#include <rte_bus.h>
#include <rte_bus_vdev.h>
#include <rte_common.h>
#include <rte_config.h>
+#include <rte_dev.h>
+#include <rte_errno.h>
+#include <rte_ethdev.h>
+#include <rte_ether.h>
#include <rte_kvargs.h>
#include <rte_log.h>

#define VDEV_NETVSC_DRIVER net_vdev_netvsc
#define VDEV_NETVSC_ARG_IFACE "iface"
#define VDEV_NETVSC_ARG_MAC "mac"
+#define VDEV_NETVSC_PROBE_MS 1000
+
+#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"

#define DRV_LOG(level, ...) \
rte_log(RTE_LOG_ ## level, \
@@ -25,12 +50,495 @@
/** Driver-specific log messages type. */
static int vdev_netvsc_logtype;

+/** Context structure for a vdev_netvsc instance. */
+struct vdev_netvsc_ctx {
+ LIST_ENTRY(vdev_netvsc_ctx) entry; /**< Next entry in list. */
+ unsigned int id; /**< Unique ID. */
+ char name[64]; /**< Unique name. */
+ char devname[64]; /**< Fail-safe instance name. */
+ char devargs[256]; /**< Fail-safe device arguments. */
+ char if_name[IF_NAMESIZE]; /**< NetVSC netdevice name. */
+ unsigned int if_index; /**< NetVSC netdevice index. */
+ struct ether_addr if_addr; /**< NetVSC MAC address. */
+ int pipe[2]; /**< Fail-safe communication pipe. */
+ char yield[256]; /**< PCI sub-device arguments. */
+};
+
+/** Context list is common to all driver instances. */
+static LIST_HEAD(, vdev_netvsc_ctx) vdev_netvsc_ctx_list =
+ LIST_HEAD_INITIALIZER(vdev_netvsc_ctx_list);
+
+/** Number of entries in context list. */
+static unsigned int vdev_netvsc_ctx_count;
+
/** Number of driver instances relying on context list. */
static unsigned int vdev_netvsc_ctx_inst;

/**
+ * Destroy a vdev_netvsc context instance.
+ *
+ * @param ctx
+ * Context to destroy.
+ */
+static void
+vdev_netvsc_ctx_destroy(struct vdev_netvsc_ctx *ctx)
+{
+ if (ctx->pipe[0] != -1)
+ close(ctx->pipe[0]);
+ if (ctx->pipe[1] != -1)
+ close(ctx->pipe[1]);
+ free(ctx);
+}
+
+/**
+ * Iterate over system network interfaces.
+ *
+ * This function runs a given callback function for each netdevice found on
+ * the system.
+ *
+ * @param func
+ * Callback function pointer. List traversal is aborted when this function
+ * returns a nonzero value.
+ * @param ...
+ * Variable parameter list passed as @p va_list to @p func.
+ *
+ * @return
+ * 0 when the entire list is traversed successfully, a negative error code
+ * in case of failure, or the nonzero value returned by @p func when list
+ * traversal is aborted.
+ */
+static int
+vdev_netvsc_foreach_iface(int (*func)(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap), ...)
+{
+ struct if_nameindex *iface = if_nameindex();
+ int s = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+ unsigned int i;
+ int ret = 0;
+
+ if (!iface) {
+ ret = -ENOBUFS;
+ DRV_LOG(ERR, "cannot retrieve system network interfaces");
+ goto error;
+ }
+ if (s == -1) {
+ ret = -errno;
+ DRV_LOG(ERR, "cannot open socket: %s", rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; iface[i].if_name; ++i) {
+ struct ifreq req;
+ struct ether_addr eth_addr;
+ va_list ap;
+
+ strncpy(req.ifr_name, iface[i].if_name, sizeof(req.ifr_name));
+ if (ioctl(s, SIOCGIFHWADDR, &req) == -1) {
+ DRV_LOG(WARNING, "cannot retrieve information about"
+ " interface \"%s\": %s",
+ req.ifr_name, rte_strerror(errno));
+ continue;
+ }
+ if (req.ifr_hwaddr.sa_family != ARPHRD_ETHER) {
+ DRV_LOG(DEBUG, "interface %s is non-ethernet device",
+ req.ifr_name);
+ continue;
+ }
+ memcpy(eth_addr.addr_bytes, req.ifr_hwaddr.sa_data,
+ RTE_DIM(eth_addr.addr_bytes));
+ va_start(ap, func);
+ ret = func(&iface[i], &eth_addr, ap);
+ va_end(ap);
+ if (ret)
+ break;
+ }
+error:
+ if (s != -1)
+ close(s);
+ if (iface)
+ if_freenameindex(iface);
+ return ret;
+}
+
+/**
+ * Determine if a network interface is NetVSC.
+ *
+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ *
+ * @return
+ * A nonzero value when interface is detected as NetVSC. In case of error,
+ * rte_errno is updated and 0 returned.
+ */
+static int
+vdev_netvsc_iface_is_netvsc(const struct if_nameindex *iface)
+{
+ static const char temp[] = "/sys/class/net/%s/device/class_id";
+ char path[sizeof(temp) + IF_NAMESIZE];
+ FILE *f;
+ int ret;
+ int len = 0;
+
+ ret = snprintf(path, sizeof(path), temp, iface->if_name);
+ if (ret == -1 || (size_t)ret >= sizeof(path)) {
+ rte_errno = ENOBUFS;
+ return 0;
+ }
+ f = fopen(path, "r");
+ if (!f) {
+ rte_errno = errno;
+ return 0;
+ }
+ ret = fscanf(f, NETVSC_CLASS_ID "%n", &len);
+ if (ret == EOF)
+ rte_errno = errno;
+ ret = len == (int)strlen(NETVSC_CLASS_ID);
+ fclose(f);
+ return ret;
+}
+
+/**
+ * Retrieve network interface data from sysfs symbolic link.
+ *
+ * @param[out] buf
+ * Output data buffer.
+ * @param size
+ * Output buffer size.
+ * @param[in] if_name
+ * Netdevice name.
+ * @param[in] relpath
+ * Symbolic link path relative to netdevice sysfs entry.
+ *
+ * @return
+ * 0 on success, a negative error code otherwise.
+ */
+static int
+vdev_netvsc_sysfs_readlink(char *buf, size_t size, const char *if_name,
+ const char *relpath)
+{
+ int ret;
+
+ ret = snprintf(buf, size, "/sys/class/net/%s/%s", if_name, relpath);
+ if (ret == -1 || (size_t)ret >= size)
+ return -ENOBUFS;
+ ret = readlink(buf, buf, size);
+ if (ret == -1)
+ return -errno;
+ if ((size_t)ret >= size - 1)
+ return -ENOBUFS;
+ buf[ret] = '\0';
+ return 0;
+}
+
+/**
+ * Probe a network interface to associate with vdev_netvsc context.
+ *
+ * This function determines if the network device matches the properties of
+ * the NetVSC interface associated with the vdev_netvsc context and
+ * communicates its bus address to the fail-safe PMD instance if so.
+ *
+ * It is normally used with vdev_netvsc_foreach_iface().
+ *
+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ * @param[in] eth_addr
+ * MAC address associated with @p iface.
+ * @param ap
+ * Variable arguments list comprising:
+ *
+ * - struct vdev_netvsc_ctx *ctx:
+ * Context to associate network interface with.
+ *
+ * @return
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+vdev_netvsc_device_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ struct vdev_netvsc_ctx *ctx = va_arg(ap, struct vdev_netvsc_ctx *);
+ char buf[RTE_MAX(sizeof(ctx->yield), 256u)];
+ const char *addr;
+ size_t len;
+ int ret;
+
+ /* Skip non-matching or unwanted NetVSC interfaces. */
+ if (ctx->if_index == iface->if_index) {
+ if (!strcmp(ctx->if_name, iface->if_name))
+ return 0;
+ DRV_LOG(DEBUG,
+ "NetVSC interface \"%s\" (index %u) renamed \"%s\"",
+ ctx->if_name, ctx->if_index, iface->if_name);
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ return 0;
+ }
+ if (vdev_netvsc_iface_is_netvsc(iface))
+ return 0;
+ if (!is_same_ether_addr(eth_addr, &ctx->if_addr))
+ return 0;
+ /* Look for associated PCI device. */
+ ret = vdev_netvsc_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device/subsystem");
+ if (ret)
+ return 0;
+ addr = strrchr(buf, '/');
+ addr = addr ? addr + 1 : buf;
+ if (strcmp(addr, "pci"))
+ return 0;
+ ret = vdev_netvsc_sysfs_readlink(buf, sizeof(buf), iface->if_name,
+ "device");
+ if (ret)
+ return 0;
+ addr = strrchr(buf, '/');
+ addr = addr ? addr + 1 : buf;
+ len = strlen(addr);
+ if (!len)
+ return 0;
+ /* Send PCI device argument to fail-safe PMD instance. */
+ if (strcmp(addr, ctx->yield))
+ DRV_LOG(DEBUG, "associating PCI device \"%s\" with NetVSC"
+ " interface \"%s\" (index %u)", addr, ctx->if_name,
+ ctx->if_index);
+ memmove(buf, addr, len + 1);
+ addr = buf;
+ buf[len] = '\n';
+ ret = write(ctx->pipe[1], addr, len + 1);
+ buf[len] = '\0';
+ if (ret == -1) {
+ if (errno == EINTR || errno == EAGAIN)
+ return 1;
+ DRV_LOG(WARNING, "cannot associate PCI device name \"%s\" with"
+ " interface \"%s\": %s", addr, ctx->if_name,
+ rte_strerror(errno));
+ return 1;
+ }
+ if ((size_t)ret != len + 1) {
+ /*
+ * Attempt to override previous partial write, no need to
+ * recover if that fails.
+ */
+ ret = write(ctx->pipe[1], "\n", 1);
+ (void)ret;
+ return 1;
+ }
+ fsync(ctx->pipe[1]);
+ memcpy(ctx->yield, addr, len + 1);
+ return 1;
+}
+
+/**
+ * Alarm callback that regularly probes system network interfaces.
+ *
+ * This callback runs at a frequency determined by VDEV_NETVSC_PROBE_MS as
+ * long as a vdev_netvsc context instance exists.
+ *
+ * @param arg
+ * Ignored.
+ */
+static void
+vdev_netvsc_alarm(__rte_unused void *arg)
+{
+ struct vdev_netvsc_ctx *ctx;
+ int ret;
+
+ LIST_FOREACH(ctx, &vdev_netvsc_ctx_list, entry) {
+ ret = vdev_netvsc_foreach_iface(vdev_netvsc_device_probe, ctx);
+ if (ret)
+ break;
+ }
+ if (!vdev_netvsc_ctx_count)
+ return;
+ ret = rte_eal_alarm_set(VDEV_NETVSC_PROBE_MS * 1000,
+ vdev_netvsc_alarm, NULL);
+ if (ret < 0) {
+ DRV_LOG(ERR, "unable to reschedule alarm callback: %s",
+ rte_strerror(-ret));
+ }
+}
+
+/**
+ * Probe a NetVSC interface to generate a vdev_netvsc context from.
+ *
+ * This function instantiates vdev_netvsc contexts either for all NetVSC
+ * devices found on the system or only a subset provided as device
+ * arguments.
+ *
+ * It is normally used with vdev_netvsc_foreach_iface().
+ *
+ * @param[in] iface
+ * Pointer to netdevice description structure (name and index).
+ * @param[in] eth_addr
+ * MAC address associated with @p iface.
+ * @param ap
+ * Variable arguments list comprising:
+ *
+ * - const char *name:
+ * Name associated with current driver instance.
+ *
+ * - struct rte_kvargs *kvargs:
+ * Device arguments provided to current driver instance.
+ *
+ * - unsigned int specified:
+ * Number of specific netdevices provided as device arguments.
+ *
+ * - unsigned int *matched:
+ * The number of specified netdevices matched by this function.
+ *
+ * @return
+ * A nonzero value when interface matches, 0 otherwise or in case of
+ * error.
+ */
+static int
+vdev_netvsc_netvsc_probe(const struct if_nameindex *iface,
+ const struct ether_addr *eth_addr,
+ va_list ap)
+{
+ const char *name = va_arg(ap, const char *);
+ struct rte_kvargs *kvargs = va_arg(ap, struct rte_kvargs *);
+ unsigned int specified = va_arg(ap, unsigned int);
+ unsigned int *matched = va_arg(ap, unsigned int *);
+ unsigned int i;
+ struct vdev_netvsc_ctx *ctx;
+ int ret;
+
+ /* Probe all interfaces when none are specified. */
+ if (specified) {
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE)) {
+ if (!strcmp(pair->value, iface->if_name))
+ break;
+ } else if (!strcmp(pair->key, VDEV_NETVSC_ARG_MAC)) {
+ struct ether_addr tmp;
+
+ if (sscanf(pair->value,
+ "%" SCNx8 ":%" SCNx8 ":%" SCNx8 ":"
+ "%" SCNx8 ":%" SCNx8 ":%" SCNx8,
+ &tmp.addr_bytes[0],
+ &tmp.addr_bytes[1],
+ &tmp.addr_bytes[2],
+ &tmp.addr_bytes[3],
+ &tmp.addr_bytes[4],
+ &tmp.addr_bytes[5]) != 6) {
+ DRV_LOG(ERR,
+ "invalid MAC address format"
+ " \"%s\"",
+ pair->value);
+ return -EINVAL;
+ }
+ if (is_same_ether_addr(eth_addr, &tmp))
+ break;
+ }
+ }
+ if (i == kvargs->count)
+ return 0;
+ ++(*matched);
+ }
+ /* Weed out interfaces already handled. */
+ LIST_FOREACH(ctx, &vdev_netvsc_ctx_list, entry)
+ if (ctx->if_index == iface->if_index)
+ break;
+ if (ctx) {
+ if (!specified)
+ return 0;
+ DRV_LOG(WARNING,
+ "interface \"%s\" (index %u) is already handled,"
+ " skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ if (!vdev_netvsc_iface_is_netvsc(iface)) {
+ if (!specified)
+ return 0;
+ DRV_LOG(WARNING,
+ "interface \"%s\" (index %u) is not NetVSC,"
+ " skipping",
+ iface->if_name, iface->if_index);
+ return 0;
+ }
+ /* Create interface context. */
+ ctx = calloc(1, sizeof(*ctx));
+ if (!ctx) {
+ ret = -errno;
+ DRV_LOG(ERR, "cannot allocate context for interface \"%s\": %s",
+ iface->if_name, rte_strerror(errno));
+ goto error;
+ }
+ ctx->id = vdev_netvsc_ctx_count;
+ strncpy(ctx->if_name, iface->if_name, sizeof(ctx->if_name));
+ ctx->if_index = iface->if_index;
+ ctx->if_addr = *eth_addr;
+ ctx->pipe[0] = -1;
+ ctx->pipe[1] = -1;
+ ctx->yield[0] = '\0';
+ if (pipe(ctx->pipe) == -1) {
+ ret = -errno;
+ DRV_LOG(ERR,
+ "cannot allocate control pipe for interface \"%s\": %s",
+ ctx->if_name, rte_strerror(errno));
+ goto error;
+ }
+ for (i = 0; i != RTE_DIM(ctx->pipe); ++i) {
+ int flf = fcntl(ctx->pipe[i], F_GETFL);
+
+ if (flf != -1 &&
+ fcntl(ctx->pipe[i], F_SETFL, flf | O_NONBLOCK) != -1)
+ continue;
+ ret = -errno;
+ DRV_LOG(ERR, "cannot toggle non-blocking flag on control file"
+ " descriptor #%u (%d): %s", i, ctx->pipe[i],
+ rte_strerror(errno));
+ goto error;
+ }
+ /* Generate virtual device name and arguments. */
+ i = 0;
+ ret = snprintf(ctx->name, sizeof(ctx->name), "%s_id%u",
+ name, ctx->id);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->name))
+ ++i;
+ ret = snprintf(ctx->devname, sizeof(ctx->devname), "net_failsafe_%s",
+ ctx->name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devname))
+ ++i;
+ ret = snprintf(ctx->devargs, sizeof(ctx->devargs),
+ "fd(%d),dev(net_tap_%s,remote=%s)",
+ ctx->pipe[0], ctx->name, ctx->if_name);
+ if (ret == -1 || (size_t)ret >= sizeof(ctx->devargs))
+ ++i;
+ if (i) {
+ ret = -ENOBUFS;
+ DRV_LOG(ERR, "generated virtual device name or argument list"
+ " too long for interface \"%s\"", ctx->if_name);
+ goto error;
+ }
+ /* Request virtual device generation. */
+ DRV_LOG(DEBUG, "generating virtual device \"%s\" with arguments \"%s\"",
+ ctx->devname, ctx->devargs);
+ vdev_netvsc_foreach_iface(vdev_netvsc_device_probe, ctx);
+ ret = rte_eal_hotplug_add("vdev", ctx->devname, ctx->devargs);
+ if (ret)
+ goto error;
+ LIST_INSERT_HEAD(&vdev_netvsc_ctx_list, ctx, entry);
+ ++vdev_netvsc_ctx_count;
+ DRV_LOG(DEBUG, "added NetVSC interface \"%s\" to context list",
+ ctx->if_name);
+ return 0;
+error:
+ if (ctx)
+ vdev_netvsc_ctx_destroy(ctx);
+ return ret;
+}
+
+/**
* Probe NetVSC interfaces.
*
+ * This function probes system netdevices according to the specified device
+ * arguments and starts a periodic alarm callback to notify the resulting
+ * fail-safe PMD instances of their sub-devices' whereabouts.
+ *
* @param dev
* Virtual device context for driver instance.
*
@@ -49,12 +557,40 @@
const char *args = rte_vdev_device_args(dev);
struct rte_kvargs *kvargs = rte_kvargs_parse(args ? args : "",
vdev_netvsc_arg);
+ unsigned int specified = 0;
+ unsigned int matched = 0;
+ unsigned int i;
+ int ret;

DRV_LOG(DEBUG, "invoked as \"%s\", using arguments \"%s\"", name, args);
if (!kvargs) {
DRV_LOG(ERR, "cannot parse arguments list");
goto error;
}
+ for (i = 0; i != kvargs->count; ++i) {
+ const struct rte_kvargs_pair *pair = &kvargs->pairs[i];
+
+ if (!strcmp(pair->key, VDEV_NETVSC_ARG_IFACE) ||
+ !strcmp(pair->key, VDEV_NETVSC_ARG_MAC))
+ ++specified;
+ }
+ rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
+ /* Gather interfaces. */
+ ret = vdev_netvsc_foreach_iface(vdev_netvsc_netvsc_probe, name, kvargs,
+ specified, &matched);
+ if (ret < 0)
+ goto error;
+ if (matched < specified)
+ DRV_LOG(WARNING,
+ "some of the specified parameters did not match"
+ " recognized network interfaces");
+ ret = rte_eal_alarm_set(VDEV_NETVSC_PROBE_MS * 1000,
+ vdev_netvsc_alarm, NULL);
+ if (ret < 0) {
+ DRV_LOG(ERR, "unable to schedule alarm callback: %s",
+ rte_strerror(-ret));
+ goto error;
+ }
error:
if (kvargs)
rte_kvargs_free(kvargs);
@@ -65,6 +601,9 @@
/**
* Remove driver instance.
*
+ * The alarm callback and underlying vdev_netvsc context instances are only
+ * destroyed after the last PMD instance is removed.
+ *
* @param dev
* Virtual device context for driver instance.
*
@@ -74,7 +613,16 @@
static int
vdev_netvsc_vdev_remove(__rte_unused struct rte_vdev_device *dev)
{
- --vdev_netvsc_ctx_inst;
+ if (--vdev_netvsc_ctx_inst)
+ return 0;
+ rte_eal_alarm_cancel(vdev_netvsc_alarm, NULL);
+ while (!LIST_EMPTY(&vdev_netvsc_ctx_list)) {
+ struct vdev_netvsc_ctx *ctx = LIST_FIRST(&vdev_netvsc_ctx_list);
+
+ LIST_REMOVE(ctx, entry);
+ --vdev_netvsc_ctx_count;
+ vdev_netvsc_ctx_destroy(ctx);
+ }
return 0;
}
--
1.8.3.1
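
Note on the NetVSC detection in this patch: vdev_netvsc_iface_is_netvsc() reads the sysfs class_id file with a scanf format made of the expected GUID followed by "%n", so the stored length is only set when every literal character matched. The stand-alone sketch below shows the same idiom on an in-memory string through sscanf() instead of a sysfs file. text_is_netvsc_class() is a hypothetical helper for illustration; the class ID is the one defined in the patch.

```c
#include <stdio.h>
#include <string.h>

#define NETVSC_CLASS_ID "{f8615163-df3e-46c5-913f-f2d2f965ed0e}"

/*
 * Hypothetical helper: return 1 when text starts with the NetVSC class ID,
 * 0 otherwise. "%n" is only reached, and len only written, once all literal
 * characters of the format have matched the input.
 */
static int
text_is_netvsc_class(const char *text)
{
	int len = 0;

	(void)sscanf(text, NETVSC_CLASS_ID "%n", &len);
	return len == (int)strlen(NETVSC_CLASS_ID);
}
```

A mismatching, truncated or empty input leaves len at zero, so no separate error path is needed for the comparison itself.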