Discussion:
[net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
John Fastabend
2014-10-06 00:06:31 UTC
This patch adds net_device ops to split off a set of driver queues
from the driver and map them into user space via mmap. This allows
the queues to be directly manipulated from user space. For raw
packet interfaces this removes any kernel network stack overhead.

Typically an af_packet interface registers a packet_type handler
that filters traffic to the socket and does other things such as
fanning traffic out to multiple sockets. In this case the networking
stack is bypassed, so that code is never run; the hardware must
instead push the correct traffic to the queues obtained from the
ndo callback ndo_split_queue_pairs().

Fortunately there is already a flow classification interface which
is part of the ethtool command set, ETHTOOL_SRXCLSRLINS. It is
currently supported by multiple drivers including sfc, mlx4, niu,
ixgbe, and i40e. Supporting some way to steer traffic to a queue
is the _only_ hardware requirement to support the interface, plus
the driver needs to implement the correct ndo ops. A follow-on
patch adds support for ixgbe, but we expect at least the subset of
drivers already implementing ETHTOOL_SRXCLSRLINS to add support later.
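
As an illustration, a rule steering a UDP flow to a split-off queue
can be inserted with ethtool's ntuple interface, which drives
ETHTOOL_SRXCLSRLINS under the hood (the port and queue index below
are made-up example values):

  # steer UDP packets with destination port 9000 to hw queue 34
  ethtool -N eth2 flow-type udp4 dst-port 9000 action 34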

The interface is driven over an af_packet socket, which we believe
is the most natural interface to use because af_packet is already
used for raw packet interfaces, which is what we are providing here.
The high level flow for this interface looks like:

bind(fd, &sockaddr, sizeof(sockaddr));

/* Get the device type and info */
getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
&optlen);

/* With device info we can look up descriptor format */

/* Get the layout of ring space offset, page_sz, cnt */
getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
&info, &optlen);

/* request some queues from the driver */
setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
&qpairs_info, sizeof(qpairs_info));

/* if we let the driver pick our queues, learn which
* queues we were given
*/
getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
&qpairs_info, sizeof(qpairs_info));

/* And mmap queue pairs to user space */
mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);

/* Now we have some user space queues to read/write to */

There is one critical difference when running with these interfaces
vs running without them. In the normal case the af_packet module
uses a standard descriptor format exported by the af_packet user
space headers. In this model because we are working directly with
driver queues the descriptor format maps to the descriptor format
used by the device. User space applications can learn device
information from the socket option PACKET_DEV_DESC_INFO which
should provide enough details to extrapolate the descriptor formats.
Although this adds some complexity to user space it removes the
requirement to copy descriptor fields around.
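
To make this concrete, here is a minimal sketch of what user space RX
polling could look like. The rx_desc layout and the rx_ring_base,
buffers[], RING_SIZE and process_packet() names are hypothetical
stand-ins; real code must use the exact layout of the device
identified by PACKET_DEV_DESC_INFO (e.g. ixgbe's union
ixgbe_adv_rx_desc, where bit 0 of the write-back status word is DD,
descriptor done):

  /* hypothetical, simplified descriptor; real layouts are per device */
  struct rx_desc {
          __u64 addr;     /* packet buffer DMA address */
          __u32 status;   /* bit 0 = DD (descriptor done) */
          __u32 length;   /* received length in bytes */
  };

  volatile struct rx_desc *ring = rx_ring_base;   /* from mmap() */
  unsigned int head = 0;

  for (;;) {
          volatile struct rx_desc *d = &ring[head];

          if (!(d->status & 0x1))
                  continue;               /* busy poll, no syscall */

          process_packet(buffers[head], d->length);
          d->status = 0;                  /* hand descriptor back */
          head = (head + 1) & (RING_SIZE - 1);
  }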

The formats are usually provided by the device vendor documentation.
If folks want I can provide a follow-up patch to provide the formats
in a .h file in ./include/uapi/linux/ for ease of use. I have access
to formats for the ixgbe and mlx drivers; other driver owners would
need to provide their formats.

We tested this interface using traffic generators and doing basic
L2 forwarding tests on ixgbe devices. Our tests use a set of patches
to DPDK to enable an interface using this socket interface. With
this interface we can xmit/receive @ line rate from a test user space
application on a single core.

Additionally we have a set of DPDK patches to enable DPDK with this
interface. DPDK can be downloaded @ dpdk.org, although, as I hope is
clear from above, DPDK is just our particular test environment; we
expect other libraries could be built on this interface.

Signed-off-by: Danny Zhou <***@intel.com>
Signed-off-by: John Fastabend <***@intel.com>
---
include/linux/netdevice.h | 63 ++++++++++++++
include/uapi/linux/if_packet.h | 42 +++++++++
net/packet/af_packet.c | 181 ++++++++++++++++++++++++++++++++++++++++
net/packet/internal.h | 1
4 files changed, 287 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9f5d293..dae96dc2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -51,6 +51,8 @@
#include <linux/neighbour.h>
#include <uapi/linux/netdevice.h>

+#include <linux/if_packet.h>
+
struct netpoll_info;
struct device;
struct phy_device;
@@ -997,6 +999,47 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
* Callback to use for xmit over the accelerated station. This
* is used in place of ndo_start_xmit on accelerated net
* devices.
+ *
+ * int (*ndo_split_queue_pairs) (struct net_device *dev,
+ * unsigned int qpairs_start_from,
+ * unsigned int qpairs_num,
+ * struct sock *sk)
+ * Called to request a set of queues from the driver to be
+ * handed to the callee for management. After this returns the
+ * driver will not use the queues. The call should return zero
+ * on success otherwise an appropriate error code can be used.
+ * If qpairs_start_from is PACKET_QPAIRS_START_ANY the driver
+ * can start at any available slot.
+ *
+ * int (*ndo_get_queue_pairs)(struct net_device *dev,
+ * unsigned int *qpairs_start_from,
+ * unsigned int *qpairs_num,
+ * struct sock *sk);
+ * Called to get the queues assigned by the driver to this sock
+ * when ndo_split_queue_pairs does not specify a start_from and
+ * qpairs_num field. Returns zero on success.
+ *
+ * int (*ndo_return_queue_pairs) (struct net_device *dev,
+ * struct sock *sk)
+ * Called to return a set of queues identified by sock to
+ * the driver. The socket must have previously requested
+ * the queues via ndo_split_queue_pairs for this action to
+ * be performed.
+ *
+ * int (*ndo_get_device_qpair_map_region_info) (struct net_device *dev,
+ * struct tpacket_dev_qpair_map_region_info *info)
+ * Called to return mapping of queue memory region
+ *
+ * int (*ndo_get_device_desc_info) (struct net_device *dev,
+ * struct tpacket_dev_info *dev_info)
+ * Called to get device specific information. This should
+ * uniquely identify the hardware so that descriptor formats
+ * can be learned by the stack/user space.
+ *
+ * int (*ndo_direct_qpair_page_map) (struct vm_area_struct *vma,
+ * struct net_device *dev)
+ * Called to map queue pair range from split_queue_pairs into
+ * mmap region.
*/
struct net_device_ops {
int (*ndo_init)(struct net_device *dev);
@@ -1146,6 +1189,26 @@ struct net_device_ops {
struct net_device *dev,
void *priv);
int (*ndo_get_lock_subclass)(struct net_device *dev);
+ int (*ndo_split_queue_pairs)(struct net_device *dev,
+ unsigned int qpairs_start_from,
+ unsigned int qpairs_num,
+ struct sock *sk);
+ int (*ndo_get_queue_pairs)(struct net_device *dev,
+ unsigned int *qpairs_start_from,
+ unsigned int *qpairs_num,
+ struct sock *sk);
+ int (*ndo_return_queue_pairs)(
+ struct net_device *dev,
+ struct sock *sk);
+ int (*ndo_get_device_qpair_map_region_info)
+ (struct net_device *dev,
+ struct tpacket_dev_qpair_map_region_info *info);
+ int (*ndo_get_device_desc_info)
+ (struct net_device *dev,
+ struct tpacket_dev_info *dev_info);
+ int (*ndo_direct_qpair_page_map)
+ (struct vm_area_struct *vma,
+ struct net_device *dev);
};

/**
diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index da2d668..fa94b65 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -54,6 +54,10 @@ struct sockaddr_ll {
#define PACKET_FANOUT 18
#define PACKET_TX_HAS_OFF 19
#define PACKET_QDISC_BYPASS 20
+#define PACKET_RXTX_QPAIRS_SPLIT 21
+#define PACKET_RXTX_QPAIRS_RETURN 22
+#define PACKET_DEV_QPAIR_MAP_REGION_INFO 23
+#define PACKET_DEV_DESC_INFO 24

#define PACKET_FANOUT_HASH 0
#define PACKET_FANOUT_LB 1
@@ -64,6 +68,44 @@ struct sockaddr_ll {
#define PACKET_FANOUT_FLAG_ROLLOVER 0x1000
#define PACKET_FANOUT_FLAG_DEFRAG 0x8000

+#define MAX_MAP_MEMORY_REGIONS 64
+
+struct tpacket_dev_qpair_map_region_info {
+ unsigned int tp_dev_bar_sz; /* size of BAR */
+ unsigned int tp_dev_sysm_sz; /* size of system memory */
+ /* number of contiguous memory regions on BAR mapped to user space */
+ unsigned int tp_num_map_regions;
+ /* number of contiguous memory regions in system memory mapped to user space */
+ unsigned int tp_num_sysm_map_regions;
+ struct map_page_region {
+ unsigned page_offset; /* offset to start of region */
+ unsigned page_sz; /* size of page */
+ unsigned page_cnt; /* number of pages */
+ } regions[MAX_MAP_MEMORY_REGIONS];
+};
+
+#define PACKET_QPAIRS_START_ANY -1
+
+struct tpacket_dev_qpairs_info {
+ unsigned int tp_qpairs_start_from; /* qpairs index to start from */
+ unsigned int tp_qpairs_num; /* number of qpairs */
+};
+
+struct tpacket_dev_info {
+ __u16 tp_device_id;
+ __u16 tp_vendor_id;
+ __u16 tp_subsystem_device_id;
+ __u16 tp_subsystem_vendor_id;
+ __u32 tp_numa_node;
+ __u32 tp_revision_id;
+ __u32 tp_num_total_qpairs;
+ __u32 tp_num_inuse_qpairs;
+ unsigned int tp_rxdesc_size; /* rx desc size in bytes */
+ __u16 tp_rxdesc_ver;
+ unsigned int tp_txdesc_size; /* tx desc size in bytes */
+ __u16 tp_txdesc_ver;
+};
+
struct tpacket_stats {
unsigned int tp_packets;
unsigned int tp_drops;
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 87d20f4..19b43ee 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2611,6 +2611,14 @@ static int packet_release(struct socket *sock)
sock_prot_inuse_add(net, sk->sk_prot, -1);
preempt_enable();

+ if (po->tp_owns_queue_pairs) {
+ struct net_device *dev;
+
+ dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+ if (dev)
+ dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
+ }
+
spin_lock(&po->bind_lock);
unregister_prot_hook(sk, false);
packet_cached_dev_reset(po);
@@ -3403,6 +3411,70 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
po->xmit = val ? packet_direct_xmit : dev_queue_xmit;
return 0;
}
+
+ case PACKET_RXTX_QPAIRS_SPLIT:
+ {
+ struct tpacket_dev_qpairs_info qpairs;
+ const struct net_device_ops *ops;
+ struct net_device *dev;
+ int err;
+
+ if (optlen != sizeof(qpairs))
+ return -EINVAL;
+ if (copy_from_user(&qpairs, optval, sizeof(qpairs)))
+ return -EFAULT;
+
+ /* Only allow one set of queues to be owned by userspace */
+ if (po->tp_owns_queue_pairs)
+ return -EBUSY;
+
+ /* This call only works after a bind call which calls a dev_hold
+ * operation so we do not need to increment dev ref counter
+ */
+ dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+ if (!dev)
+ return -EINVAL;
+ ops = dev->netdev_ops;
+ if (!ops->ndo_split_queue_pairs)
+ return -EOPNOTSUPP;
+
+ err = ops->ndo_split_queue_pairs(dev,
+ qpairs.tp_qpairs_start_from,
+ qpairs.tp_qpairs_num, sk);
+ if (!err)
+ po->tp_owns_queue_pairs = true;
+
+ return err;
+ }
+
+ case PACKET_RXTX_QPAIRS_RETURN:
+ {
+ struct tpacket_dev_qpairs_info qpairs_info;
+ struct net_device *dev;
+ int err;
+
+ if (optlen != sizeof(qpairs_info))
+ return -EINVAL;
+ if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
+ return -EFAULT;
+
+ if (!po->tp_owns_queue_pairs)
+ return -EINVAL;
+
+ /* This call only works after a bind call which calls a dev_hold
+ * operation so we do not need to increment dev ref counter
+ */
+ dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+ if (!dev)
+ return -EINVAL;
+
+ err = dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
+ if (!err)
+ po->tp_owns_queue_pairs = false;
+
+ return err;
+ }
+
default:
return -ENOPROTOOPT;
}
@@ -3498,6 +3570,99 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
case PACKET_QDISC_BYPASS:
val = packet_use_direct_xmit(po);
break;
+ case PACKET_RXTX_QPAIRS_SPLIT:
+ {
+ struct net_device *dev;
+ struct tpacket_dev_qpairs_info qpairs_info;
+ int err;
+
+ if (len != sizeof(qpairs_info))
+ return -EINVAL;
+ if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
+ return -EFAULT;
+
+ /* This call only works after a successful queue pairs split-off
+ * operation via setsockopt()
+ */
+ if (!po->tp_owns_queue_pairs)
+ return -EINVAL;
+
+ /* This call only works after a bind call which calls a dev_hold
+ * operation so we do not need to increment dev ref counter
+ */
+ dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+ if (!dev)
+ return -EINVAL;
+ if (!dev->netdev_ops->ndo_get_queue_pairs)
+ return -EOPNOTSUPP;
+
+ err = dev->netdev_ops->ndo_get_queue_pairs(dev,
+ &qpairs_info.tp_qpairs_start_from,
+ &qpairs_info.tp_qpairs_num, sk);
+ if (err)
+ return err;
+
+ lv = sizeof(qpairs_info);
+ data = &qpairs_info;
+ break;
+ }
+ case PACKET_DEV_QPAIR_MAP_REGION_INFO:
+ {
+ struct tpacket_dev_qpair_map_region_info info;
+ const struct net_device_ops *ops;
+ struct net_device *dev;
+ int err;
+
+ if (len != sizeof(info))
+ return -EINVAL;
+ if (copy_from_user(&info, optval, sizeof(info)))
+ return -EFAULT;
+
+ /* This call only works after a bind call which calls a dev_hold
+ * operation so we do not need to increment dev ref counter
+ */
+ dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+ if (!dev)
+ return -EINVAL;
+
+ ops = dev->netdev_ops;
+ if (!ops->ndo_get_device_qpair_map_region_info)
+ return -EOPNOTSUPP;
+
+ err = ops->ndo_get_device_qpair_map_region_info(dev, &info);
+ if (err)
+ return err;
+
+ lv = sizeof(struct tpacket_dev_qpair_map_region_info);
+ data = &info;
+ break;
+ }
+ case PACKET_DEV_DESC_INFO:
+ {
+ struct net_device *dev;
+ struct tpacket_dev_info info;
+ int err;
+
+ if (len != sizeof(info))
+ return -EINVAL;
+ if (copy_from_user(&info, optval, sizeof(info)))
+ return -EFAULT;
+
+ /* This call only works after a bind call which calls a dev_hold
+ * operation so we do not need to increment dev ref counter
+ */
+ dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+ if (!dev)
+ return -EINVAL;
+ if (!dev->netdev_ops->ndo_get_device_desc_info)
+ return -EOPNOTSUPP;
+
+ err = dev->netdev_ops->ndo_get_device_desc_info(dev, &info);
+ if (err)
+ return err;
+
+ lv = sizeof(struct tpacket_dev_info);
+ data = &info;
+ break;
+ }
default:
return -ENOPROTOOPT;
}
@@ -3904,6 +4069,21 @@ static int packet_mmap(struct file *file, struct socket *sock,

mutex_lock(&po->pg_vec_lock);

+ if (po->tp_owns_queue_pairs) {
+ const struct net_device_ops *ops;
+ struct net_device *dev;
+
+ dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+ if (!dev) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ ops = dev->netdev_ops;
+ err = ops->ndo_direct_qpair_page_map(vma, dev);
+ if (err)
+ goto out;
+ goto done;
+ }
+
expected_size = 0;
for (rb = &po->rx_ring; rb <= &po->tx_ring; rb++) {
if (rb->pg_vec) {
@@ -3941,6 +4121,7 @@ static int packet_mmap(struct file *file, struct socket *sock,
}
}

+done:
atomic_inc(&po->mapped);
vma->vm_ops = &packet_mmap_ops;
err = 0;
diff --git a/net/packet/internal.h b/net/packet/internal.h
index cdddf6a..55cadbc 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -113,6 +113,7 @@ struct packet_sock {
unsigned int tp_reserve;
unsigned int tp_loss:1;
unsigned int tp_tx_has_off:1;
+ unsigned int tp_owns_queue_pairs:1;
unsigned int tp_tstamp;
struct net_device __rcu *cached_dev;
int (*xmit)(struct sk_buff *skb);
John Fastabend
2014-10-06 00:07:06 UTC
This implements the necessary ndo ops to support the af_packet
interface to directly own and manipulate queues.

Signed-off-by: Danny Zhou <***@intel.com>
Signed-off-by: John Fastabend <***@intel.com>
---
drivers/net/ethernet/intel/ixgbe/ixgbe.h | 3
drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 23 ++
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 232 ++++++++++++++++++++++
drivers/net/ethernet/intel/ixgbe/ixgbe_type.h | 1
4 files changed, 251 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 673d820..2f6eadf 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -678,6 +678,9 @@ struct ixgbe_adapter {

struct ixgbe_q_vector *q_vector[MAX_Q_VECTORS];

+ /* Direct User Space Queues */
+ struct sock *sk_handles[MAX_RX_QUEUES];
+
/* DCB parameters */
struct ieee_pfc *ixgbe_ieee_pfc;
struct ieee_ets *ixgbe_ieee_ets;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index cff383b..01a6e55 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -2581,12 +2581,17 @@ static int ixgbe_add_ethtool_fdir_entry(struct ixgbe_adapter *adapter,
if (!(adapter->flags & IXGBE_FLAG_FDIR_PERFECT_CAPABLE))
return -EOPNOTSUPP;

+ if (fsp->ring_cookie != RX_CLS_FLOW_DISC &&
+ fsp->ring_cookie >= MAX_RX_QUEUES)
+ return -EINVAL;
+
/*
* Don't allow programming if the action is a queue greater than
- * the number of online Rx queues.
+ * the number of online Rx queues unless it is a user space
+ * queue.
*/
if ((fsp->ring_cookie != RX_CLS_FLOW_DISC) &&
- (fsp->ring_cookie >= adapter->num_rx_queues))
+ (fsp->ring_cookie >= adapter->num_rx_queues) &&
+ !adapter->sk_handles[fsp->ring_cookie])
return -EINVAL;

/* Don't allow indexes to exist outside of available space */
@@ -2663,12 +2668,18 @@ static int ixgbe_add_ethtool_fdir_entry(struct ixgbe_adapter *adapter,
/* apply mask and compute/store hash */
ixgbe_atr_compute_perfect_hash_82599(&input->filter, &mask);

+ /* Set input action to reg_idx for driver owned queues otherwise
+ * use the absolute index for user space queues.
+ */
+ if (fsp->ring_cookie < adapter->num_rx_queues &&
+ fsp->ring_cookie != IXGBE_FDIR_DROP_QUEUE)
+ input->action = adapter->rx_ring[input->action]->reg_idx;
+
/* program filters to filter memory */
err = ixgbe_fdir_write_perfect_filter_82599(hw,
- &input->filter, input->sw_idx,
- (input->action == IXGBE_FDIR_DROP_QUEUE) ?
- IXGBE_FDIR_DROP_QUEUE :
- adapter->rx_ring[input->action]->reg_idx);
+ &input->filter,
+ input->sw_idx,
+ input->action);
if (err)
goto err_out_w_lock;

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 06ef5a3..6506550 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -48,7 +48,9 @@
#include <linux/if_macvlan.h>
#include <linux/if_bridge.h>
#include <linux/prefetch.h>
+#include <linux/mm.h>
#include <scsi/fc/fc_fcoe.h>
+#include <linux/if_packet.h>

#include "ixgbe.h"
#include "ixgbe_common.h"
@@ -70,6 +72,8 @@ const char ixgbe_driver_version[] = DRV_VERSION;
static const char ixgbe_copyright[] =
"Copyright (c) 1999-2014 Intel Corporation.";

+static unsigned int *dummy_page_buf;
+
static const struct ixgbe_info *ixgbe_info_tbl[] = {
[board_82598] = &ixgbe_82598_info,
[board_82599] = &ixgbe_82599_info,
@@ -3122,6 +3126,16 @@ static void ixgbe_enable_rx_drop(struct ixgbe_adapter *adapter,
IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(reg_idx), srrctl);
}

+static bool ixgbe_have_user_queues(struct ixgbe_adapter *adapter)
+{
+ int i;
+
+ for (i = 0; i < MAX_RX_QUEUES; i++)
+ if (adapter->sk_handles[i])
+ return true;
+ return false;
+}
+
static void ixgbe_disable_rx_drop(struct ixgbe_adapter *adapter,
struct ixgbe_ring *ring)
{
@@ -3156,7 +3170,8 @@ static void ixgbe_set_rx_drop_en(struct ixgbe_adapter *adapter)
* and performance reasons.
*/
if (adapter->num_vfs || (adapter->num_rx_queues > 1 &&
- !(adapter->hw.fc.current_mode & ixgbe_fc_tx_pause) && !pfc_en)) {
+ !(adapter->hw.fc.current_mode & ixgbe_fc_tx_pause) && !pfc_en) ||
+ ixgbe_have_user_queues(adapter)) {
for (i = 0; i < adapter->num_rx_queues; i++)
ixgbe_enable_rx_drop(adapter, adapter->rx_ring[i]);
} else {
@@ -7812,6 +7827,210 @@ static void ixgbe_fwd_del(struct net_device *pdev, void *priv)
kfree(fwd_adapter);
}

+static int ixgbe_ndo_split_queue_pairs(struct net_device *dev,
+ unsigned int start_from,
+ unsigned int qpairs_num,
+ struct sock *sk)
+{
+ struct ixgbe_adapter *adapter = netdev_priv(dev);
+ unsigned int qpair_index;
+
+ /* allocate whatever qpairs are available */
+ if (start_from == PACKET_QPAIRS_START_ANY) {
+ unsigned int count = 0;
+
+ for (qpair_index = adapter->num_rx_queues;
+ qpair_index < MAX_RX_QUEUES;
+ qpair_index++) {
+ if (!adapter->sk_handles[qpair_index]) {
+ count++;
+ if (count == qpairs_num) {
+ start_from = qpair_index - count + 1;
+ break;
+ }
+ } else {
+ count = 0;
+ }
+ }
+ }
+
+ /* otherwise the caller specified exact queues */
+ if ((start_from > MAX_TX_QUEUES) ||
+ (start_from > MAX_RX_QUEUES) ||
+ (start_from + qpairs_num > MAX_TX_QUEUES) ||
+ (start_from + qpairs_num > MAX_RX_QUEUES))
+ return -EINVAL;
+
+ /* If the qpairs are being used by the driver do not let user space
+ * consume the queues. Also if the queue has already been allocated
+ * to a socket do fail the request.
+ */
+ for (qpair_index = start_from;
+ qpair_index < start_from + qpairs_num;
+ qpair_index++) {
+ if ((qpair_index < adapter->num_tx_queues) ||
+ (qpair_index < adapter->num_rx_queues))
+ return -EINVAL;
+
+ if (adapter->sk_handles[qpair_index] != NULL)
+ return -EBUSY;
+ }
+
+ /* remember the sk handle for each queue pair */
+ for (qpair_index = start_from;
+ qpair_index < start_from + qpairs_num;
+ qpair_index++)
+ adapter->sk_handles[qpair_index] = sk;
+
+ return 0; /* assigned range is reported via ndo_get_queue_pairs */
+}
+
+static int ixgbe_ndo_get_queue_pairs(struct net_device *dev,
+ unsigned int *start_from,
+ unsigned int *qpairs_num,
+ struct sock *sk)
+{
+ struct ixgbe_adapter *adapter = netdev_priv(dev);
+ unsigned int qpair_index;
+
+ *qpairs_num = 0;
+
+ for (qpair_index = adapter->num_tx_queues;
+ qpair_index < MAX_RX_QUEUES;
+ qpair_index++) {
+ if (adapter->sk_handles[qpair_index] == sk) {
+ if (*qpairs_num == 0)
+ *start_from = qpair_index;
+ *qpairs_num = *qpairs_num + 1;
+ }
+ }
+
+ return 0;
+}
+
+static int ixgbe_ndo_return_queue_pairs(struct net_device *dev, struct sock *sk)
+{
+ struct ixgbe_adapter *adapter = netdev_priv(dev);
+ unsigned int qpair_index;
+
+ for (qpair_index = adapter->num_tx_queues;
+ qpair_index < MAX_TX_QUEUES;
+ qpair_index++) {
+ if (adapter->sk_handles[qpair_index] == sk)
+ adapter->sk_handles[qpair_index] = NULL;
+ }
+
+ return 0;
+}
+
+/* RX descriptors start at BAR offset 0x1000 and TX descriptors at
+ * 0x6000; both the TX and RX descriptors use 4K pages.
+ */
+#define RX_DESC_ADDR_OFFSET 0x1000
+#define TX_DESC_ADDR_OFFSET 0x6000
+#define PAGE_SIZE_4K 4096
+
+static int
+ixgbe_ndo_qpair_map_region(struct net_device *dev,
+ struct tpacket_dev_qpair_map_region_info *info)
+{
+ struct ixgbe_adapter *adapter = netdev_priv(dev);
+
+ /* no need to map system memory to user space for ixgbe */
+ info->tp_dev_sysm_sz = 0;
+ info->tp_num_sysm_map_regions = 0;
+
+ info->tp_dev_bar_sz = pci_resource_len(adapter->pdev, 0);
+ info->tp_num_map_regions = 2;
+
+ info->regions[0].page_offset = RX_DESC_ADDR_OFFSET;
+ info->regions[0].page_sz = PAGE_SIZE;
+ info->regions[0].page_cnt = 1;
+ info->regions[1].page_offset = TX_DESC_ADDR_OFFSET;
+ info->regions[1].page_sz = PAGE_SIZE;
+ info->regions[1].page_cnt = 1;
+
+ return 0;
+}
+
+static int ixgbe_ndo_get_device_desc_info(struct net_device *dev,
+ struct tpacket_dev_info *dev_info)
+{
+ struct ixgbe_adapter *adapter = netdev_priv(dev);
+ int max_queues;
+
+ max_queues = max(adapter->num_rx_queues, adapter->num_tx_queues);
+
+ dev_info->tp_device_id = adapter->hw.device_id;
+ dev_info->tp_vendor_id = adapter->hw.vendor_id;
+ dev_info->tp_subsystem_device_id = adapter->hw.subsystem_device_id;
+ dev_info->tp_subsystem_vendor_id = adapter->hw.subsystem_vendor_id;
+ dev_info->tp_revision_id = adapter->hw.revision_id;
+ dev_info->tp_numa_node = dev_to_node(&dev->dev);
+
+ dev_info->tp_num_total_qpairs = min(MAX_RX_QUEUES, MAX_TX_QUEUES);
+ dev_info->tp_num_inuse_qpairs = max_queues;
+
+ dev_info->tp_rxdesc_size = sizeof(union ixgbe_adv_rx_desc);
+ dev_info->tp_rxdesc_ver = 1;
+ dev_info->tp_txdesc_size = sizeof(union ixgbe_adv_tx_desc);
+ dev_info->tp_txdesc_ver = 1;
+
+ return 0;
+}
+
+static int
+ixgbe_ndo_qpair_page_map(struct vm_area_struct *vma, struct net_device *dev)
+{
+ struct ixgbe_adapter *adapter = netdev_priv(dev);
+ phys_addr_t phy_addr = pci_resource_start(adapter->pdev, 0);
+ unsigned long pfn_rx = (phy_addr + RX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
+ unsigned long pfn_tx = (phy_addr + TX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
+ unsigned long dummy_page_phy;
+ pgprot_t pre_vm_page_prot;
+ unsigned long start;
+ unsigned int i;
+ int err;
+
+ if (!dummy_page_buf) {
+ dummy_page_buf = kzalloc(PAGE_SIZE_4K, GFP_KERNEL);
+ if (!dummy_page_buf)
+ return -ENOMEM;
+
+ for (i = 0; i < PAGE_SIZE_4K / sizeof(unsigned int); i++)
+ dummy_page_buf[i] = 0xdeadbeef;
+ }
+
+ dummy_page_phy = virt_to_phys(dummy_page_buf);
+ pre_vm_page_prot = vma->vm_page_prot;
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+ /* assume the vm_start is 4K aligned address */
+ for (start = vma->vm_start;
+ start < vma->vm_end;
+ start += PAGE_SIZE_4K) {
+ if (start == vma->vm_start + RX_DESC_ADDR_OFFSET) {
+ err = remap_pfn_range(vma, start, pfn_rx, PAGE_SIZE_4K,
+ vma->vm_page_prot);
+ if (err)
+ return -EAGAIN;
+ } else if (start == vma->vm_start + TX_DESC_ADDR_OFFSET) {
+ err = remap_pfn_range(vma, start, pfn_tx, PAGE_SIZE_4K,
+ vma->vm_page_prot);
+ if (err)
+ return -EAGAIN;
+ } else {
+ unsigned long addr = dummy_page_phy >> PAGE_SHIFT;
+
+ err = remap_pfn_range(vma, start, addr, PAGE_SIZE_4K,
+ pre_vm_page_prot);
+ if (err)
+ return -EAGAIN;
+ }
+ }
+ return 0;
+}
+
static const struct net_device_ops ixgbe_netdev_ops = {
.ndo_open = ixgbe_open,
.ndo_stop = ixgbe_close,
@@ -7856,6 +8075,12 @@ static const struct net_device_ops ixgbe_netdev_ops = {
.ndo_bridge_getlink = ixgbe_ndo_bridge_getlink,
.ndo_dfwd_add_station = ixgbe_fwd_add,
.ndo_dfwd_del_station = ixgbe_fwd_del,
+ .ndo_split_queue_pairs = ixgbe_ndo_split_queue_pairs,
+ .ndo_get_queue_pairs = ixgbe_ndo_get_queue_pairs,
+ .ndo_return_queue_pairs = ixgbe_ndo_return_queue_pairs,
+ .ndo_get_device_qpair_map_region_info = ixgbe_ndo_qpair_map_region,
+ .ndo_get_device_desc_info = ixgbe_ndo_get_device_desc_info,
+ .ndo_direct_qpair_page_map = ixgbe_ndo_qpair_page_map,
};

/**
@@ -8054,7 +8279,9 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
hw->back = adapter;
adapter->msg_enable = netif_msg_init(debug, DEFAULT_MSG_ENABLE);

- hw->hw_addr = ioremap(pci_resource_start(pdev, 0),
+ hw->pci_hw_addr = pci_resource_start(pdev, 0);
+
+ hw->hw_addr = ioremap(hw->pci_hw_addr,
pci_resource_len(pdev, 0));
adapter->io_addr = hw->hw_addr;
if (!hw->hw_addr) {
@@ -8705,6 +8932,7 @@ module_init(ixgbe_init_module);
**/
static void __exit ixgbe_exit_module(void)
{
+ kfree(dummy_page_buf);
#ifdef CONFIG_IXGBE_DCA
dca_unregister_notify(&dca_notifier);
#endif
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
index dfd55d8..26e9163 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
@@ -3022,6 +3022,7 @@ struct ixgbe_mbx_info {

struct ixgbe_hw {
u8 __iomem *hw_addr;
+ phys_addr_t pci_hw_addr;
void *back;
struct ixgbe_mac_info mac;
struct ixgbe_addr_filter_info addr_ctrl;
John Fastabend
2014-10-06 00:07:36 UTC
This adds a section to the packet interface kernel documentation
describing the set of socket options to get direct queue
assignment working.

Signed-off-by: John Fastabend <***@intel.com>
---
Documentation/networking/packet_mmap.txt | 44 ++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)

diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index a6d7cb9..ad26194 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -1047,6 +1047,50 @@ See include/linux/net_tstamp.h and Documentation/networking/timestamping
for more information on hardware timestamps.

-------------------------------------------------------------------------------
++ PACKET_RXTX_QPAIRS_SPLIT and friends
+-------------------------------------------------------------------------------
+
+The PACKET_RXTX_QPAIRS_SPLIT setting allows direct access to the hardware
+packet rings. If your NIC is capable of supporting hardware packet steering
+and the driver has this feature enabled you can use the hardware to steer
+packets directly to user mapped memory and use user space descriptor rings.
+
+The user space flow should be,
+
+ bind(fd, &sockaddr, sizeof(sockaddr));
+
+ /* Get the device type and info */
+ getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
+ &optlen);
+
+ /* With device info we can look up descriptor format */
+
+ /* Get the layout of ring space offset, page_sz, cnt */
+ getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
+ &info, &optlen);
+
+ /* request some queues from the driver */
+ setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
+ &qpairs_info, sizeof(qpairs_info));
+
+ /* if we let the driver pick our queues, learn which
+ * queues we were given
+ */
+ getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
+ &qpairs_info, sizeof(qpairs_info));
+
+ /* And mmap queue pairs to user space */
+ mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
+ MAP_SHARED, fd, 0);
+
+ /* Now we have some user space queues to read/write to */
+
+After this user space can directly manipulate the driver's descriptor rings.
+The descriptor rings use the native descriptor format of the hardware device.
+The device specifics are returned from the PACKET_DEV_DESC_INFO call which
+allows user space to determine the correct descriptor format to use.
+
+-------------------------------------------------------------------------------
+ Miscellaneous bits
-------------------------------------------------------------------------------
Florian Westphal
2014-10-06 00:29:51 UTC
Post by John Fastabend
There is one critical difference when running with these interfaces
vs running without them. In the normal case the af_packet module
uses a standard descriptor format exported by the af_packet user
space headers. In this model because we are working directly with
driver queues the descriptor format maps to the descriptor format
used by the device. User space applications can learn device
information from the socket option PACKET_DEV_DESC_INFO which
should provide enough details to extrapolate the descriptor formats.
Although this adds some complexity to user space it removes the
requirement to copy descriptor fields around.
I find it very disappointing that we seem to have to expose such
hardware specific details to userspace via hw-independent interface.

How big of a cost are we talking about when you say that it 'removes
the requirement to copy descriptor fields'?

Thanks,
Florian
David Miller
2014-10-06 01:09:38 UTC
From: Florian Westphal <***@strlen.de>
Date: Mon, 6 Oct 2014 02:29:51 +0200
Post by Florian Westphal
Post by John Fastabend
There is one critical difference when running with these interfaces
vs running without them. In the normal case the af_packet module
uses a standard descriptor format exported by the af_packet user
space headers. In this model because we are working directly with
driver queues the descriptor format maps to the descriptor format
used by the device. User space applications can learn device
information from the socket option PACKET_DEV_DESC_INFO which
should provide enough details to extrapolate the descriptor formats.
Although this adds some complexity to user space it removes the
requirement to copy descriptor fields around.
I find it very disappointing that we seem to have to expose such
hardware specific details to userspace via hw-independent interface.
How big of a cost are we talking about when you say that it 'removes
the requirement to copy descriptor fields'?
FWIW, it also avoids the domain switch (which is just a fancy way
to refer to performing the system call), both in and out.
John Fastabend
2014-10-06 01:18:47 UTC
Post by David Miller
Date: Mon, 6 Oct 2014 02:29:51 +0200
Post by Florian Westphal
Post by John Fastabend
There is one critical difference when running with these interfaces
vs running without them. In the normal case the af_packet module
uses a standard descriptor format exported by the af_packet user
space headers. In this model because we are working directly with
driver queues the descriptor format maps to the descriptor format
used by the device. User space applications can learn device
information from the socket option PACKET_DEV_DESC_INFO which
should provide enough details to extrapolate the descriptor formats.
Although this adds some complexity to user space it removes the
requirement to copy descriptor fields around.
I find it very disappointing that we seem to have to expose such
hardware specific details to userspace via hw-independent interface.
How big of a cost are we talking about when you say that it 'removes
the requirement to copy descriptor fields'?
FWIW, it also avoids the domain switch (which is just a fancy way
to refer to performing the system call), both in and out.
Right, my description could have been better and called this out.

Thanks.
--
John Fastabend Intel Corporation
John Fastabend
2014-10-06 01:12:34 UTC
Post by Florian Westphal
Post by John Fastabend
There is one critical difference when running with these interfaces
vs running without them. In the normal case the af_packet module
uses a standard descriptor format exported by the af_packet user
space headers. In this model because we are working directly with
driver queues the descriptor format maps to the descriptor format
used by the device. User space applications can learn device
information from the socket option PACKET_DEV_DESC_INFO which
should provide enough details to extrapolate the descriptor formats.
Although this adds some complexity to user space it removes the
requirement to copy descriptor fields around.
I find it very disappointing that we seem to have to expose such
hardware specific details to userspace via hw-independent interface.
Well it was only for convenience if it doesn't fit as a socket
option we can remove it. We can look up the device using the netdev
name from the bind call. I see your point though so if there is
consensus that this is not needed that is fine.
Post by Florian Westphal
How big of a cost are we talking about when you say that it 'removes
the requirement to copy descriptor fields'?
This was likely a poor description. If you want to let user space
poll on the ring (without using system calls or interrupts) then
I don't see how you can _not_ expose the ring directly complete with
the vendor descriptor formats.
Post by Florian Westphal
Thanks,
Florian
--
John Fastabend Intel Corporation
Daniel Borkmann
2014-10-06 09:49:25 UTC
Hi John,
Post by John Fastabend
Post by Florian Westphal
Post by John Fastabend
There is one critical difference when running with these interfaces
vs running without them. In the normal case the af_packet module
uses a standard descriptor format exported by the af_packet user
space headers. In this model because we are working directly with
driver queues the descriptor format maps to the descriptor format
used by the device. User space applications can learn device
information from the socket option PACKET_DEV_DESC_INFO which
should provide enough details to extrapolate the descriptor formats.
Although this adds some complexity to user space it removes the
requirement to copy descriptor fields around.
I find it very disappointing that we seem to have to expose such
hardware specific details to userspace via hw-independent interface.
Well it was only for convenience if it doesn't fit as a socket
option we can remove it. We can look up the device using the netdev
name from the bind call. I see your point though so if there is
consensus that this is not needed that is fine.
Post by Florian Westphal
How big of a cost are we talking about when you say that it 'removes
the requirement to copy descriptor fields'?
This was likely a poor description. If you want to let user space
poll on the ring (without using system calls or interrupts) then
I don't see how you can _not_ expose the ring directly complete with
the vendor descriptor formats.
But how big is the concrete performance degradation you're seeing if you
use an e.g. `netmap-alike` Linux-own variant as a hw-neutral interface
that does *not* directly expose hw descriptor formats to user space?

With 1 core netmap does 10G line-rate on 64b; I don't know their numbers
on 40G when run on decent hardware though.

It would really be great if we have something vendor neutral exposed as
a stable ABI and could leverage emerging infrastructure we already have
in the kernel such as eBPF and recent qdisc batching for raw sockets
instead of reinventing the wheels. (Don't get me wrong, I would love to
see AF_PACKET improved ...)

Thanks,
Daniel
John Fastabend
2014-10-06 15:01:52 UTC
Post by Daniel Borkmann
Hi John,
Post by John Fastabend
Post by Florian Westphal
Post by John Fastabend
There is one critical difference when running with these interfaces
vs running without them. In the normal case the af_packet module
uses a standard descriptor format exported by the af_packet user
space headers. In this model because we are working directly with
driver queues the descriptor format maps to the descriptor format
used by the device. User space applications can learn device
information from the socket option PACKET_DEV_DESC_INFO which
should provide enough details to extrapolate the descriptor formats.
Although this adds some complexity to user space it removes the
requirement to copy descriptor fields around.
I find it very disappointing that we seem to have to expose such
hardware specific details to userspace via hw-independent interface.
Well it was only for convenience if it doesn't fit as a socket
option we can remove it. We can look up the device using the netdev
name from the bind call. I see your point though so if there is
consensus that this is not needed that is fine.
Post by Florian Westphal
How big of a cost are we talking about when you say that it 'removes
the requirement to copy descriptor fields'?
This was likely a poor description. If you want to let user space
poll on the ring (without using system calls or interrupts) then
I don't see how you can _not_ expose the ring directly complete with
the vendor descriptor formats.
But how big is the concrete performance degradation you're seeing if you
use an e.g. `netmap-alike` Linux-own variant as a hw-neutral interface
that does *not* directly expose hw descriptor formats to user space?
If we don't directly expose the hardware descriptor formats then we
need to somehow kick the driver when we want it to do the copy from
the driver descriptor format to the common descriptor format.

This requires a system call as far as I can tell, which has unwanted
overhead. I can micro-benchmark this if it's helpful. But if we dredge
up Jesper's slides here we are really counting cycles, so even small
numbers count if we want to hit line rate in a user space application
with 40Gbps hardware.
Post by Daniel Borkmann
With 1 core netmap does 10G line-rate on 64b; I don't know their numbers
on 40G when run on decent hardware though.
It would really be great if we have something vendor neutral exposed as
a stable ABI and could leverage emerging infrastructure we already have
in the kernel such as eBPF and recent qdisc batching for raw sockets
instead of reinventing the wheels. (Don't get me wrong, I would love to
see AF_PACKET improved ...)
I don't think the interface is vendor specific. It does require some
knowledge of the hardware descriptor layout though. It is though vendor
neutral from my point of view. I provided the ixgbe patch simply because
I'm most familiar with it and have a NIC here. If someone wants to send me
a Mellanox NIC I can give it a try although I was hoping to recruit Or or
Amir? The only hardware feature required is flow classification to queues
which seems to be common across 10Gbps and 40/100Gbps devices. So most
of the drivers should be able to support this.

If you're worried driver writers will implement the interface but not make
their descriptor formats easily available I considered putting the layout
in a header file in the uapi somewhere. Then we could just reject any
implementation that doesn't include the header file needed to use it
from user space.

With regards to leveraging eBPF and qdisc batching I don't see how this
works with direct DMA and polling. Needed to give the lowest overhead
between kernel and user space. In this case we want to use the hardware
to do the filtering that would normally be done for eBPF and for many
use cases the hardware flow classifiers is sufficient.

We already added a qdisc bypass option; I see this as taking that path
further. I believe there is room for a continuum here. For basic cases
use af_packet v1/v2 mmap rings; still with common descriptors, use
af_packet v3 and set QDISC_BYPASS. For absolute lowest overhead, and for
specific applications that need neither QoS nor eBPF, use this interface.

Thanks.
Post by Daniel Borkmann
Thanks,
Daniel
Jesper Dangaard Brouer
2014-10-06 16:35:36 UTC
Post by John Fastabend
This requires a system call as far as I can tell, which has unwanted
overhead. I can micro-benchmark this if it's helpful. But if we dredge
up Jesper's slides here we are really counting cycles, so even small
numbers count if we want to hit line rate in a user space application
with 40Gbps hardware.
The micro-benchmarked syscall[2] cost is approx 42 ns [1] (when
disabling CONFIG_AUDITSYSCALL; else it's approx 88 ns), which is
significant compared to the 10G wirespeed smallest-packet-size budget
of 67.2 ns.

See:
[1] http://netoptimizer.blogspot.dk/2014/05/the-calculations-10gbits-wirespeed.html
[2] https://github.com/netoptimizer/network-testing/blob/master/src/syscall_overhead.c
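
(For reference, the 67.2 ns budget is the minimum frame plus framing
overhead on the wire: 64 B frame + 8 B preamble + 12 B interframe gap
= 84 B = 672 bits, and 672 bits / 10 Gbit/s = 67.2 ns, i.e. ~14.88 Mpps.)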

[...]
Post by John Fastabend
We already added a qdisc bypass option I see this as taking this path
further. I believe there is room for a continuum here. For basic cases
use af_packet v1,v2 for mmap rings but using common descriptors use
af_packet v3 and set QOS_BYASS. For absolute lowest overhead and
specific applications that don't need QOS, eBPF use this interface.
Well, after the qdisc bulking changes, once bulking kicks in the
qdisc path is faster than the qdisc bypass (measured with trafgen).
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
Hannes Frederic Sowa
2014-10-06 17:03:52 UTC
Hi John,
Post by John Fastabend
Post by Daniel Borkmann
Hi John,
Post by John Fastabend
Post by Florian Westphal
Post by John Fastabend
There is one critical difference when running with these interfaces
vs running without them. In the normal case the af_packet module
uses a standard descriptor format exported by the af_packet user
space headers. In this model because we are working directly with
driver queues the descriptor format maps to the descriptor format
used by the device. User space applications can learn device
information from the socket option PACKET_DEV_DESC_INFO which
should provide enough details to extrapolate the descriptor formats.
Although this adds some complexity to user space it removes the
requirement to copy descriptor fields around.
I find it very disappointing that we seem to have to expose such
hardware specific details to userspace via hw-independent interface.
Well it was only for convenience if it doesn't fit as a socket
option we can remove it. We can look up the device using the netdev
name from the bind call. I see your point though so if there is
consensus that this is not needed that is fine.
Post by Florian Westphal
How big of a cost are we talking about when you say that it 'removes
the requirement to copy descriptor fields'?
This was likely a poor description. If you want to let user space
poll on the ring (without using system calls or interrupts) then
I don't see how you can _not_ expose the ring directly complete with
the vendor descriptor formats.
But how big is the concrete performance degradation you're seeing if you
use an e.g. `netmap-alike` Linux-own variant as a hw-neutral interface
that does *not* directly expose hw descriptor formats to user space?
If we don't directly expose the hardware descriptor formats then we
need to somehow kick the driver when we want it to do the copy from
the driver descriptor format to the common descriptor format.
This requires a system call as far as I can tell, which has unwanted
overhead. I can micro-benchmark this if it's helpful. But if we dredge
up Jesper's slides here we are really counting cycles, so even small
numbers count if we want to hit line rate in a user space application
with 40Gbps hardware.
I agree, it seems pretty hard to achieve non-syscall sending on the same
core, as we somehow must transfer control over to the kernel without
doing a syscall.

The only other idea would be to export machine code up to user space,
which you can mmap(MAP_EXEC) from the socket somehow to make this API
truly NIC agnostic without recompiling. This code then would transform
the generic descriptors to the hardware specific ones. Seems also pretty
hairy to do that correctly, maybe.
Post by John Fastabend
Post by Daniel Borkmann
With 1 core netmap does 10G line-rate on 64b; I don't know their numbers
on 40G when run on decent hardware though.
It would really be great if we have something vendor neutral exposed as
a stable ABI and could leverage emerging infrastructure we already have
in the kernel such as eBPF and recent qdisc batching for raw sockets
instead of reinventing the wheels. (Don't get me wrong, I would love to
see AF_PACKET improved ...)
I don't think the interface is vendor specific. It does require some
knowledge of the hardware descriptor layout though. It is though vendor
neutral from my point of view. I provided the ixgbe patch simply because
I'm most familiar with it and have a NIC here. If someone wants to send me
a Mellanox NIC I can give it a try although I was hoping to recruit Or or
Amir? The only hardware feature required is flow classification to queues
which seems to be common across 10Gbps and 40/100Gbps devices. So most
of the drivers should be able to support this.
Does flow classification work at the same level as registering network
addresses? Do I have to bind e.g. a multicast address via ip maddr and
then set up flow director/ntuple to get the packets on the correct user
space facing queue or is it in case of the ixgbe card enough to just add
those addresses via fdir? Have you thought about letting the
kernel/driver handle that? In case one would like to connect their
virtual machines via this interface to the network maybe we need central
policing and resource constraints for queue management here?

Do other drivers need a separate af-packet managed way to bind addresses
to the queue? Maybe there are other quirks we might need to add to
actually build support for other network interface cards. Would be great
to at least examine one other driver in regard to this.

What other properties of the NIC must be exported? I think we also have
to deal with MTUs currently configured in the NIC, promisc mode and
maybe TSO?
Post by John Fastabend
If you're worried driver writers will implement the interface but not make
their descriptor formats easily available I considered putting the layout
in a header file in the uapi somewhere. Then we could just reject any
implementation that doesn't include the header file needed to use it
from user space.
With regards to leveraging eBPF and qdisc batching I don't see how this
works with direct DMA and polling. Needed to give the lowest overhead
between kernel and user space. In this case we want to use the hardware
to do the filtering that would normally be done for eBPF and for many
use cases the hardware flow classifiers is sufficient.
I agree, those features are hard to connect.
Post by John Fastabend
We already added a qdisc bypass option; I see this as taking that path
further. I believe there is room for a continuum here. For basic cases
use af_packet v1/v2 mmap rings; still with common descriptors, use
af_packet v3 and set QDISC_BYPASS. For absolute lowest overhead, and for
specific applications that need neither QoS nor eBPF, use this interface.
You can simply write C code instead of eBPF code, yes.

I find the six additional ndo ops a bit worrisome as we are adding more
and more subsystem-specific ndo ops to this struct. I would like to see
some unification here, but currently cannot make concrete proposals,
sorry.

Patch 2/3 does not yet expose hw ring descriptors in uapi headers it
seems?

Are there plans to push a user space framework (maybe even into the
kernel), too? Will this be dpdk (alike) in the end?

Bye,
Hannes
John Fastabend
2014-10-06 20:37:01 UTC
Post by Hannes Frederic Sowa
Hi John,
Post by John Fastabend
Post by Daniel Borkmann
Hi John,
Post by John Fastabend
Post by Florian Westphal
Post by John Fastabend
There is one critical difference when running with these interfaces
vs running without them. In the normal case the af_packet module
uses a standard descriptor format exported by the af_packet user
space headers. In this model because we are working directly with
driver queues the descriptor format maps to the descriptor format
used by the device. User space applications can learn device
information from the socket option PACKET_DEV_DESC_INFO which
should provide enough details to extrapolate the descriptor formats.
Although this adds some complexity to user space it removes the
requirement to copy descriptor fields around.
I find it very disappointing that we seem to have to expose such
hardware specific details to userspace via hw-independent interface.
Well it was only for convenience if it doesn't fit as a socket
option we can remove it. We can look up the device using the netdev
name from the bind call. I see your point though so if there is
consensus that this is not needed that is fine.
Post by Florian Westphal
How big of a cost are we talking about when you say that it 'removes
the requirement to copy descriptor fields'?
This was likely a poor description. If you want to let user space
poll on the ring (without using system calls or interrupts) then
I don't see how you can _not_ expose the ring directly complete with
the vendor descriptor formats.
But how big is the concrete performance degradation you're seeing if you
use an e.g. `netmap-alike` Linux-own variant as a hw-neutral interface
that does *not* directly expose hw descriptor formats to user space?
If we don't directly expose the hardware descriptor formats then we
need to somehow kick the driver when we want it to do the copy from
the driver descriptor format to the common descriptor format.
This requires a system call as far as I can tell, which has unwanted
overhead. I can micro-benchmark this if it's helpful. But if we dredge
up Jesper's slides here we are really counting cycles, so even small
numbers count if we want to hit line rate in a user space application
with 40Gbps hardware.
I agree, it seems pretty hard to achieve non-syscall sending on the same
core, as we somehow must transfer control over to the kernel without
doing a syscall.
The only other idea would be to export machine code up to user space,
which you can mmap(MAP_EXEC) from the socket somehow to make this API
truly NIC agnostic without recompiling. This code then would transform
the generic descriptors to the hardware specific ones. Seems also pretty
hairy to do that correctly, maybe.
This seems more complicated than a minimal library in user space to
load descriptor handling code based on device id.
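
For example, such a library could be little more than a dispatch table
keyed on the ids reported by PACKET_DEV_DESC_INFO. A sketch, where the
handler names and the 82599 entry are illustrative only:

  struct desc_ops {
          __u16 vendor_id;
          __u16 device_id;
          int  (*rx_poll)(void *ring, unsigned int idx);
          void (*tx_fill)(void *ring, unsigned int idx,
                          __u64 dma_addr, __u32 len);
  };

  static const struct desc_ops known_devs[] = {
          { 0x8086, 0x10fb, ixgbe_rx_poll, ixgbe_tx_fill }, /* 82599 SFP+ */
  };

  static const struct desc_ops *lookup_ops(const struct tpacket_dev_info *info)
  {
          unsigned int n;

          for (n = 0; n < sizeof(known_devs) / sizeof(known_devs[0]); n++)
                  if (known_devs[n].vendor_id == info->tp_vendor_id &&
                      known_devs[n].device_id == info->tp_device_id)
                          return &known_devs[n];
          return NULL;    /* unknown device, refuse to run */
  }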
Post by Hannes Frederic Sowa
Post by John Fastabend
Post by Daniel Borkmann
With 1 core netmap does 10G line-rate on 64b; I don't know their numbers
on 40G when run on decent hardware though.
It would really be great if we have something vendor neutral exposed as
a stable ABI and could leverage emerging infrastructure we already have
in the kernel such as eBPF and recent qdisc batching for raw sockets
instead of reinventing the wheels. (Don't get me wrong, I would love to
see AF_PACKET improved ...)
I don't think the interface is vendor specific. It does require some
knowledge of the hardware descriptor layout though. It is though vendor
neutral from my point of view. I provided the ixgbe patch simply because
I'm most familiar with it and have a NIC here. If someone wants to send me
a Mellanox NIC I can give it a try although I was hoping to recruit Or or
Amir? The only hardware feature required is flow classification to queues
which seems to be common across 10Gbps and 40/100Gbps devices. So most
of the drivers should be able to support this.
Does flow classification work at the same level as registering network
addresses? Do I have to bind a e.g. multicast address wie ip maddr and
then set up flow director/ntuple to get the packets on the correct user
space facing queue or is it in case of the ixgbe card enough to just add
those addresses via fdir? Have you thought about letting the
kernel/driver handle that? In case one would like to connect their
virtual machines via this interface to the network maybe we need central
policing and resource constraints for queue management here?
Right now it is enough to program the addresses via fdir. This shouldn't
be ixgbe specific; I did take a quick look at the other drivers' use
of fdir and believe it should work the same.

I'm not sure what you mean by having the kernel/driver handle this;
maybe you mean from the socket interface? I thought about it briefly
but opted for what I think is more straightforward: using the fdir APIs.

As far as policing and resource constraints I think that is a user space
task. And yes I am working on some user space management applications but
they are still fairly rough.
Post by Hannes Frederic Sowa
Do other drivers need a separate af-packet managed way to bind addresses
to the queue? Maybe there are other quirks we might need to add to
actually build support for other network interface cards. Would be great
to at least examine one other driver in regard to this.
I _believe_ this interface is sufficient. I think one of the mellanox
interfaces could be implemented fairly easily.
Post by Hannes Frederic Sowa
What other properties of the NIC must be exported? I think we also have
to deal with MTUs currently configured in the NIC, promisc mode and
maybe TSO?
We don't support per queue MTUs in the kernel. So I think this can be
learned using existing interfaces.
Post by Hannes Frederic Sowa
Post by John Fastabend
If you're worried driver writers will implement the interface but not make
their descriptor formats easily available I considered putting the layout
in a header file in the uapi somewhere. Then we could just reject any
implementation that doesn't include the header file needed to use it
from user space.
With regards to leveraging eBPF and qdisc batching I don't see how this
works with direct DMA and polling. Needed to give the lowest overhead
between kernel and user space. In this case we want to use the hardware
to do the filtering that would normally be done for eBPF and for many
use cases the hardware flow classifiers is sufficient.
I agree, those features are hard to connect.
Post by John Fastabend
We already added a qdisc bypass option; I see this as taking that path
further. I believe there is room for a continuum here. For basic cases
use af_packet v1/v2 mmap rings; still with common descriptors, use
af_packet v3 and set QDISC_BYPASS. For absolute lowest overhead, and for
specific applications that need neither QoS nor eBPF, use this interface.
You can simply write C code instead of eBPF code, yes.
I find the six additional ndo ops a bit worrisome as we are adding more
and more subsystem-specific ndo ops to this struct. I would like to see
some unification here, but currently cannot make concrete proposals,
sorry.
I agree it seems like a bit much. One thought was to split the ndo
ops into categories. Switch ops, MACVLAN ops, basic ops and with this
userspace queue ops. This sort of goes along with some of the switch
offload work which is going to add a handful more ops as best I can
tell.
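
One possible shape, purely illustrative: collect the user space queue
callbacks in their own ops structure referenced from the netdev, much
as ethtool_ops is already split out from net_device_ops:

  struct netdev_uqueue_ops {
          int (*split_queue_pairs)(struct net_device *dev,
                                   unsigned int start_from,
                                   unsigned int num, struct sock *sk);
          int (*get_queue_pairs)(struct net_device *dev,
                                 unsigned int *start_from,
                                 unsigned int *num, struct sock *sk);
          int (*return_queue_pairs)(struct net_device *dev,
                                    struct sock *sk);
  };

af_packet would then test a single dev->uqueue_ops pointer instead of
six individual ndo entries.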
Post by Hannes Frederic Sowa
Patch 2/3 does not yet expose hw ring descriptors in uapi headers it
seems?
Nope, I wasn't sure if we wanted to put the ring descriptors in UAPI or
not. If we do I would likely push that as a 4th patch.
Post by Hannes Frederic Sowa
Are there plans to push a user space framework (maybe even into the
kernel), too? Will this be dpdk (alike) in the end?
I have patches to enable this interface on DPDK and it gets the
same performance as the other DPDK interfaces.

I've considered creating a minimal library to do basic tx/rx and
descriptor processing, maybe in ./test or ./scripts, to give a usage
example that is easier to get hold of and review without having to pull
in all the other things DPDK does that may or may not be interesting
depending on what you are doing and on what hardware.
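As a taste of what such a library's rx path could look like, here is a
rough sketch. The descriptor layout below is hypothetical; a real loop
would use the format learned via PACKET_DEV_DESC_INFO for the actual
device:

-- >8 --
#include <stdint.h>

struct hw_rx_desc {			/* hypothetical device format */
	uint64_t addr;			/* packet buffer physical address */
	uint16_t len;			/* bytes written by the device */
	uint16_t flags;
};
#define RX_DESC_DD	0x0001		/* "descriptor done", set by hw */

/* application-provided consumer; no-op stub for the sketch */
static void handle_frame(volatile struct hw_rx_desc *d) { (void)d; }

static void rx_poll(volatile struct hw_rx_desc *ring, unsigned int n)
{
	unsigned int head = 0;

	for (;;) {
		volatile struct hw_rx_desc *d = &ring[head];

		if (!(d->flags & RX_DESC_DD))
			continue;		/* busy-poll until hw is done */
		handle_frame(d);
		d->flags = 0;			/* hand descriptor back to hw */
		head = (head + 1) % n;
	}
}
-- >8 --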
Post by Hannes Frederic Sowa
Bye,
Hannes
Hannes Frederic Sowa
2014-10-06 23:26:11 UTC
Permalink
Hi John,
Post by John Fastabend
Post by Hannes Frederic Sowa
I find the six additional ndo ops a bit worrisome as we are adding more
and more subsystem-specific ndo ops to this struct. I would like to see
some unification here, but currently cannot make concrete proposals,
sorry.
I agree it seems like a bit much. One thought was to split the ndo
ops into categories. Switch ops, MACVLAN ops, basic ops and with this
userspace queue ops. This sort of goes along with some of the switch
offload work which is going to add a handful more ops as best I can
tell.
Thanks for your mail, you answered all of my questions.

Have you looked at <https://code.google.com/p/kernel/wiki/ProjectUnetq>?
Willem (also in Cc) used sysfs files which get mmapped to represent the
tx/rx descriptors. The representation was independent of the device and
IIRC the prototype used a write(fd, "", 1) to signal the kernel it
should proceed with tx. I agree, it would be great to be syscall-free
here.

For the semantics of the descriptors we could also easily generate files
in sysfs. I thought about something like what tracepoints already do for
representing the data in the ring buffer, depending on the event:

-- >8 --
# cat /sys/kernel/debug/tracing/events/net/net_dev_queue/format
name: net_dev_queue
ID: 1006
format:
	field:unsigned short common_type; offset:0; size:2; signed:0;
	field:unsigned char common_flags; offset:2; size:1; signed:0;
	field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
	field:int common_pid; offset:4; size:4; signed:1;

	field:void * skbaddr; offset:8; size:8; signed:0;
	field:unsigned int len; offset:16; size:4; signed:0;
	field:__data_loc char[] name; offset:20; size:4; signed:1;

print fmt: "dev=%s skbaddr=%p len=%u", __get_str(name), REC->skbaddr, REC->len
-- >8 --

Maybe the macros from tracing are reusable (TP_STRUCT__entry); e.g.
endianness would need to be added. Hopefully there is already a user
space parser somewhere in the perf sources. An easier-to-parse binary
representation could be added easily, and maybe even something
vDSO-alike if people care about that.
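A user space consumer would only need a tiny parser for such lines; a
rough sketch (perf carries more complete parsing code for this format):

-- >8 --
#include <stdio.h>
#include <string.h>

struct field_desc { int offset, size, is_signed; };

/* Pull offset/size/signedness out of one "field:" line of a format file. */
static int parse_field_line(const char *line, struct field_desc *f)
{
	const char *p = strstr(line, "offset:");

	if (!p || sscanf(p, "offset:%d; size:%d; signed:%d;",
			 &f->offset, &f->size, &f->is_signed) != 3)
		return -1;
	return 0;
}
-- >8 --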

Maybe this open/mmap per queue also kills some of the ndo_ops?

Bye,
Hannes
Neil Horman
2014-10-07 18:59:40 UTC
Permalink
Post by Hannes Frederic Sowa
Hi John,
Post by John Fastabend
Post by Hannes Frederic Sowa
I find the six additional ndo ops a bit worrisome as we are adding more
and more subsystem-specific ndo ops to this struct. I would like to see
some unification here, but currently cannot make concrete proposals,
sorry.
I agree it seems like a bit much. One thought was to split the ndo
ops into categories. Switch ops, MACVLAN ops, basic ops and with this
userspace queue ops. This sort of goes along with some of the switch
offload work which is going to add a handful more ops as best I can
tell.
Thanks for your mail, you answered all of my questions.
Have you looked at <https://code.google.com/p/kernel/wiki/ProjectUnetq>?
Willem (also in Cc) used sysfs files which get mmapped to represent the
tx/rx descriptors. The representation was independent of the device and
IIRC the prototype used a write(fd, "", 1) to signal the kernel it
should proceed with tx. I agree, it would be great to be syscall-free
here.
For the semantics of the descriptors we could also easily generate files
in sysfs. I thought about something like tracepoints already do for
-- >8 --
# cat /sys/kernel/debug/tracing/events/net/net_dev_queue/format
name: net_dev_queue
ID: 1006
	field:unsigned short common_type; offset:0; size:2; signed:0;
	field:unsigned char common_flags; offset:2; size:1; signed:0;
	field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
	field:int common_pid; offset:4; size:4; signed:1;
	field:void * skbaddr; offset:8; size:8; signed:0;
	field:unsigned int len; offset:16; size:4; signed:0;
	field:__data_loc char[] name; offset:20; size:4; signed:1;
print fmt: "dev=%s skbaddr=%p len=%u", __get_str(name), REC->skbaddr, REC->len
-- >8 --
Maybe the macros from tracing are reusable (TP_STRUCT__entry), e.g.
endianness would need to be added. Hopefully there is already a user
space parser somewhere in the perf sources. An easier to parse binary
representation could be added easily and maybe even something vDSO alike
if people care about that.
Maybe this open/mmap per queue also kills some of the ndo_ops?
Bye,
Hannes
John-
I don't know if it's of use to you here, but I was experimenting a while
ago with af_packet memory mapping, using the protection bits in the page
tables as a doorbell mechanism. I scrapped the work because the
performance bottleneck for af_packet wasn't in the syscall trap time,
but it occurs to me it might be useful for you here: using this
mechanism, if you keep the transmit ring non-empty, you only incur the
cost of a single trap to start the transmit process. Let me know if you
want to see it.

Neil
John Fastabend
2014-10-08 17:20:14 UTC
Permalink
Post by Neil Horman
Post by Hannes Frederic Sowa
Hi John,
Post by John Fastabend
Post by Hannes Frederic Sowa
I find the six additional ndo ops a bit worrisome as we are adding more
and more subsystem-specific ndo ops to this struct. I would like to see
some unification here, but currently cannot make concrete proposals,
sorry.
I agree it seems like a bit much. One thought was to split the ndo
ops into categories. Switch ops, MACVLAN ops, basic ops and with this
userspace queue ops. This sort of goes along with some of the switch
offload work which is going to add a handful more ops as best I can
tell.
Thanks for your mail, you answered all of my questions.
Have you looked at <https://code.google.com/p/kernel/wiki/ProjectUnetq>?
Willem (also in Cc) used sysfs files which get mmapped to represent the
tx/rx descriptors. The representation was independent of the device and
IIRC the prototype used a write(fd, "", 1) to signal the kernel it
should proceed with tx. I agree, it would be great to be syscall-free
here.
For the semantics of the descriptors we could also easily generate files
in sysfs. I thought about something like tracepoints already do for
-- >8 --
# cat /sys/kernel/debug/tracing/events/net/net_dev_queue/format
name: net_dev_queue
ID: 1006
	field:unsigned short common_type; offset:0; size:2; signed:0;
	field:unsigned char common_flags; offset:2; size:1; signed:0;
	field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
	field:int common_pid; offset:4; size:4; signed:1;
	field:void * skbaddr; offset:8; size:8; signed:0;
	field:unsigned int len; offset:16; size:4; signed:0;
	field:__data_loc char[] name; offset:20; size:4; signed:1;
print fmt: "dev=%s skbaddr=%p len=%u", __get_str(name), REC->skbaddr, REC->len
-- >8 --
Maybe the macros from tracing are reusable (TP_STRUCT__entry), e.g.
endianness would need to be added. Hopefully there is already a user
space parser somewhere in the perf sources. An easier to parse binary
representation could be added easily and maybe even something vDSO alike
if people care about that.
Maybe this open/mmap per queue also kills some of the ndo_ops?
Bye,
Hannes
John-
I don't know if it's of use to you here, but I was experimenting a while
ago with af_packet memory mapping, using the protection bits in the page
tables as a doorbell mechanism. I scrapped the work because the
performance bottleneck for af_packet wasn't in the syscall trap time,
but it occurs to me it might be useful for you here: using this
mechanism, if you keep the transmit ring non-empty, you only incur the
cost of a single trap to start the transmit process. Let me know if you
want to see it.
Neil
Hi Neil,

If you could forward it along I'll take a look. It seems like something
along these lines will be needed.

Thanks,
John
--
John Fastabend Intel Corporation
Neil Horman
2014-10-09 13:36:35 UTC
Permalink
This patch adds a variation to the AF_PACKET memory mapped socket
transmit mechanism. Nominally, when using a memory mapped AF_PACKET
socket, frames are written into the memory mapped buffer, and then the
application calls sendmsg with a NULL buffer, which triggers the kernel
to clean the mapped space of all pending buffers.

While this provides clean, synchronous operation, improvements can be
made. To this end, I've introduced a doorbell mode of operation to
memory mapped packet sockets. When a packet socket is placed into
doorbell mode, it write-protects the mappings of any process using the
packet socket, so that on the first write to it a kernel trap is
generated, which returns the mapping to a read-write state and forks a
task to begin cleaning the buffers on the application's behalf. This
thread contains some hysteresis to continue running a short while after
the last buffer has been cleaned, allowing subsequent writes to be sent
without needing to fork another task. This allows for additional
parallelism, in that an application on an SMP system can run in parallel
with a cleaning task, so that the socket buffer can be filled and
emptied in parallel without having to incur multiple system call traps.

I've only done some very rough performance estimates, but early results
are promising. Using this code here:
http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap

I made some modifications to support using doorbell mode and compared
the time it took to send 1500 packets (each of size 1492 bytes) in basic
mmap mode and doorbell mmapped mode, and used tcpdump to capture the
output. Results:

trace packets start time end time delta p/s size
ndb 1500 2.755605 3.000886 0.245281 6115.43 1492b
db 1500 4.716448 4.846382 0.129934 11544.32 1492b

It's very rough of course, but it would seem I get a 40% increase in
throughput when using this method. I'm sure that's an overestimate, so
more testing is required, but initial results look good.
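
For the curious, opting in looks roughly like this (untested sketch
against this patch; the tpacket_req sizes are illustrative):

-- >8 --
/* Enable doorbell mode before mmap() (the setsockopt rejects an
 * already-mapped socket), then just write frames and set
 * TP_STATUS_SEND_REQUEST; the first write faults and schedules the
 * cleaning task, so no sendmsg() kick is needed afterwards.
 */
struct tpacket_req req = {
	.tp_block_size = 4096,	.tp_block_nr = 64,
	.tp_frame_size = 2048,	.tp_frame_nr = 128,
};
int one = 1;
void *ring;

setsockopt(fd, SOL_PACKET, PACKET_MMAP_DOORBELL, &one, sizeof(one));
setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &req, sizeof(req));
ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
	    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
-- >8 --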

Signed-off-by: Neil Horman <***@tuxdriver.com>
---
include/uapi/linux/if_packet.h | 1 +
net/packet/af_packet.c | 215 +++++++++++++++++++++++++++++++++++++++--
net/packet/internal.h | 10 ++
3 files changed, 217 insertions(+), 9 deletions(-)

diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index bac27fa..efce7e1 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -54,6 +54,7 @@ struct sockaddr_ll {
#define PACKET_FANOUT 18
#define PACKET_TX_HAS_OFF 19
#define PACKET_QDISC_BYPASS 20
+#define PACKET_MMAP_DOORBELL 21

#define PACKET_FANOUT_HASH 0
#define PACKET_FANOUT_LB 1
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 6a2bb37..27849c5 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -66,6 +66,8 @@
#include <linux/kmod.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
+#include <linux/rmap.h>
+#include <linux/async.h>
#include <net/net_namespace.h>
#include <net/ip.h>
#include <net/protocol.h>
@@ -234,9 +236,18 @@ struct packet_skb_cb {
(((x)->kactive_blk_num < ((x)->knum_blocks-1)) ? \
((x)->kactive_blk_num+1) : 0)

+ASYNC_DOMAIN_EXCLUSIVE(packet_doorbell_domain);
+
static void __fanout_unlink(struct sock *sk, struct packet_sock *po);
static void __fanout_link(struct sock *sk, struct packet_sock *po);

+
+static void packet_mod_tx_doorbell(struct packet_sock *po,
+ struct vm_area_struct *vma, bool arm);
+
+#define packet_arm_tx_doorbell(p, v) packet_mod_tx_doorbell(p, v, true)
+#define packet_disarm_tx_doorbell(p, v) packet_mod_tx_doorbell(p, v, false)
+
static int packet_direct_xmit(struct sk_buff *skb)
{
struct net_device *dev = skb->dev;
@@ -2215,7 +2226,8 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
int status = TP_STATUS_AVAILABLE;
int hlen, tlen;

- mutex_lock(&po->pg_vec_lock);
+ if (!po->tp_doorbell_mode)
+ mutex_lock(&po->pg_vec_lock);

if (likely(saddr == NULL)) {
dev = packet_cached_dev_get(po);
@@ -2326,7 +2338,8 @@ out_status:
out_put:
dev_put(dev);
out:
- mutex_unlock(&po->pg_vec_lock);
+ if (!po->tp_doorbell_mode)
+ mutex_unlock(&po->pg_vec_lock);
return err;
}

@@ -2548,9 +2561,13 @@ static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
struct sock *sk = sock->sk;
struct packet_sock *po = pkt_sk(sk);

- if (po->tx_ring.pg_vec)
- return tpacket_snd(po, msg);
- else
+ if (po->tx_ring.pg_vec) {
+ if (po->tp_doorbell_mode) {
+ async_synchronize_full_domain(&packet_doorbell_domain);
+ return 0;
+ } else
+ return tpacket_snd(po, msg);
+ } else
return packet_snd(sock, msg, len);
}

@@ -2592,6 +2609,10 @@ static int packet_release(struct socket *sock)

packet_flush_mclist(sk);

+ if (po->tp_doorbell_mode)
+ async_synchronize_full_domain(&packet_doorbell_domain);
+
+
if (po->rx_ring.pg_vec) {
memset(&req_u, 0, sizeof(req_u));
packet_set_ring(sk, &req_u, 1, 0);
@@ -2772,6 +2793,9 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
sock_init_data(sock, sk);

po = pkt_sk(sk);
+ INIT_LIST_HEAD(&po->doorbell_vmas);
+ spin_lock_init(&po->doorbell_lock);
+ atomic_set(&po->doorbell_thread_count, 0);
sk->sk_family = PF_PACKET;
po->num = proto;
po->xmit = dev_queue_xmit;
@@ -3374,6 +3398,21 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
po->xmit = val ? packet_direct_xmit : dev_queue_xmit;
return 0;
}
+ case PACKET_MMAP_DOORBELL:
+ {
+ unsigned int val;
+ if (optlen != sizeof(val))
+ return -EINVAL;
+ if (atomic_read(&po->mapped))
+ return -EBUSY;
+ if (copy_from_user(&val, optval, sizeof(val)))
+ return -EFAULT;
+
+ po->tp_doorbell_mode = !!val;
+ if (!po->tp_doorbell_mode)
+ async_synchronize_full_domain(&packet_doorbell_domain);
+ return 0;
+ }
default:
return -ENOPROTOOPT;
}
@@ -3469,6 +3508,9 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
case PACKET_QDISC_BYPASS:
val = packet_use_direct_xmit(po);
break;
+ case PACKET_MMAP_DOORBELL:
+ val = po->tp_doorbell_mode;
+ break;
default:
return -ENOPROTOOPT;
}
@@ -3610,6 +3652,74 @@ static unsigned int packet_poll(struct file *file, struct socket *sock,
return mask;
}

+void packet_doorbell_send(void *data, async_cookie_t cookie)
+{
+ struct sock *sk = (struct sock *)data;
+ struct packet_sock *po = pkt_sk(sk);
+ struct msghdr msg;
+ int ret;
+ int retry_count;
+ struct doorbell_vma *db_vma;
+ void *more_work;
+
+ WARN_ON(!po);
+
+restart:
+ for (retry_count = 2; retry_count > 0; retry_count--) {
+ do {
+ msg.msg_flags = 0;
+ msg.msg_name = NULL;
+ ret = tpacket_snd(po, &msg);
+ } while (ret > 0);
+ schedule_timeout(1);
+ }
+ atomic_dec(&po->doorbell_thread_count);
+ rcu_read_lock();
+ list_for_each_entry_rcu(db_vma, &po->doorbell_vmas, list)
+ packet_arm_tx_doorbell(po, db_vma->vma);
+ rcu_read_unlock();
+
+ more_work = packet_current_frame(po, &po->tx_ring, TP_STATUS_SEND_REQUEST);
+
+ if (more_work &&
+ atomic_add_unless(&po->doorbell_thread_count, 1, 1)) {
+ /*
+ * We have more to send and we won the race to be the cleaning
+ * thread. go back and try again
+ */
+ rcu_read_lock();
+ list_for_each_entry_rcu(db_vma, &po->doorbell_vmas, list)
+ packet_disarm_tx_doorbell(po, db_vma->vma);
+ rcu_read_unlock();
+ goto restart;
+ }
+
+}
+
+static int packet_mm_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct file *file = vma->vm_file;
+ struct socket *sock = file->private_data;
+ struct sock *sk = sock->sk;
+ struct doorbell_vma *db_vma;
+ struct packet_sock *po = sk ? pkt_sk(sk) : NULL;
+
+ if (po) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(db_vma, &po->doorbell_vmas, list)
+ packet_disarm_tx_doorbell(po, db_vma->vma);
+ rcu_read_unlock();
+ if (atomic_add_unless(&po->doorbell_thread_count, 1, 1)) {
+ if (po->tp_doorbell_mode)
+ async_schedule_domain(packet_doorbell_send, sk,
+ &packet_doorbell_domain);
+ else
+ atomic_dec(&po->doorbell_thread_count);
+ }
+
+ }
+ return VM_FAULT_RETRY;
+}

/* Dirty? Well, I still did not learn better way to account
* for user mmaps.
@@ -3627,17 +3737,29 @@ static void packet_mm_open(struct vm_area_struct *vma)

static void packet_mm_close(struct vm_area_struct *vma)
{
+ struct doorbell_vma *db_vma;
struct file *file = vma->vm_file;
struct socket *sock = file->private_data;
struct sock *sk = sock->sk;
-
- if (sk)
- atomic_dec(&pkt_sk(sk)->mapped);
+ struct packet_sock *po = sk ? pkt_sk(sk) : NULL;
+
+ if (po) {
+ spin_lock(&po->doorbell_lock);
+ list_for_each_entry_rcu(db_vma, &po->doorbell_vmas, list) {
+ if (db_vma->vma == vma) {
+ list_del_rcu(&db_vma->list);
+ kfree_rcu(db_vma, rcu);
+ }
+ }
+ spin_unlock(&po->doorbell_lock);
+ atomic_dec(&po->mapped);
+ }
}

-static const struct vm_operations_struct packet_mmap_ops = {
+const struct vm_operations_struct packet_mmap_ops = {
.open = packet_mm_open,
.close = packet_mm_close,
+ .page_mkwrite = packet_mm_mkwrite,
};

static void free_pg_vec(struct pgv *pg_vec, unsigned int order,
@@ -3855,6 +3977,62 @@ out:
return err;
}

+static void packet_mod_tx_doorbell(struct packet_sock *po,
+ struct vm_area_struct *vma, bool arm)
+{
+ void *kaddr;
+ int pg_num;
+ struct packet_ring_buffer *rb;
+ pte_t entry;
+ struct page *page;
+ int i;
+ pte_t *ptep;
+ unsigned long start;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+
+ rb = &po->tx_ring;
+
+ for (i = 0; i < rb->pg_vec_len; i++) {
+ kaddr = rb->pg_vec[i].buffer;
+ start = vma->vm_start;
+ for (pg_num = 0; pg_num < rb->pg_vec_pages; pg_num++) {
+ ptep = NULL;
+ page = pgv_to_page(kaddr);
+
+
+ pgd = pgd_offset(vma->vm_mm, start);
+ if (!pgd_present(*pgd))
+ goto next;
+
+ pud = pud_offset(pgd, start);
+ if (!pud_present(*pud))
+ goto next;
+
+ pmd = pmd_offset(pud, start);
+ if (!pmd_present(*pmd))
+ goto next;
+
+ ptep = pte_offset_kernel(pmd, start);
+
+ if (arm)
+ entry = pte_wrprotect(*ptep);
+ else
+ entry = pte_mkwrite(*ptep);
+
+ flush_dcache_page(page);
+ set_pte_at(vma->vm_mm, start, ptep, entry);
+
+next:
+ kaddr += PAGE_SIZE;
+ start += PAGE_SIZE;
+ }
+ }
+
+
+}
+
static int packet_mmap(struct file *file, struct socket *sock,
struct vm_area_struct *vma)
{
@@ -3865,10 +4043,17 @@ static int packet_mmap(struct file *file, struct socket *sock,
unsigned long start;
int err = -EINVAL;
int i;
+ struct doorbell_vma *db_vma = NULL;

if (vma->vm_pgoff)
return -EINVAL;

+ if (po->tp_doorbell_mode) {
+ db_vma = kzalloc(sizeof(struct doorbell_vma), GFP_KERNEL);
+ if (!db_vma)
+ return -ENOMEM;
+ }
+
mutex_lock(&po->pg_vec_lock);

expected_size = 0;
@@ -3905,9 +4090,21 @@ static int packet_mmap(struct file *file, struct socket *sock,
start += PAGE_SIZE;
kaddr += PAGE_SIZE;
}
+#ifdef CONFIG_X86
+ set_pages_uc(pgv_to_page(rb->pg_vec[i].buffer), rb->pg_vec_pages);
+#endif
}
}

+ if (po->tp_doorbell_mode) {
+ vma->vm_flags |= VM_SHARED;
+ db_vma->vma = vma;
+ spin_lock(&po->doorbell_lock);
+ list_add_rcu(&db_vma->list, &po->doorbell_vmas);
+ spin_unlock(&po->doorbell_lock);
+ packet_arm_tx_doorbell(po, vma);
+ }
+
atomic_inc(&po->mapped);
vma->vm_ops = &packet_mmap_ops;
err = 0;
diff --git a/net/packet/internal.h b/net/packet/internal.h
index eb9580a..2e1f5f7 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -89,9 +89,17 @@ struct packet_fanout {
struct packet_type prot_hook ____cacheline_aligned_in_smp;
};

+struct doorbell_vma {
+ struct list_head list;
+ struct vm_area_struct *vma;
+ struct rcu_head rcu;
+};
+
struct packet_sock {
/* struct sock has to be the first member of packet_sock */
struct sock sk;
+ struct list_head __rcu doorbell_vmas;
+ spinlock_t doorbell_lock;
struct packet_fanout *fanout;
union tpacket_stats_u stats;
struct packet_ring_buffer rx_ring;
@@ -112,6 +120,8 @@ struct packet_sock {
unsigned int tp_reserve;
unsigned int tp_loss:1;
unsigned int tp_tx_has_off:1;
+ unsigned int tp_doorbell_mode:1;
+ atomic_t doorbell_thread_count;
unsigned int tp_tstamp;
struct net_device __rcu *cached_dev;
int (*xmit)(struct sk_buff *skb);
--
1.9.3
John Fastabend
2014-10-09 15:01:07 UTC
Permalink
Post by Neil Horman
This patch adds a variation to the AF_PACKET memory mapped socket
transmit mechanism. Nominally, when using a memory mapped AF_PACKET
socket, frames are written into the memory mapped buffer, and then the
application calls sendmsg with a NULL buffer, which triggers the kernel
to clean the mapped space of all pending buffers.
While this provides clean, synchronous operation, improvements can be
made. To this end, I've introduced a doorbell mode of operation to
memory mapped packet sockets. When a packet socket is placed into
doorbell mode, it write-protects the mappings of any process using the
packet socket, so that on the first write to it a kernel trap is
generated, which returns the mapping to a read-write state and forks a
task to begin cleaning the buffers on the application's behalf. This
thread contains some hysteresis to continue running a short while after
the last buffer has been cleaned, allowing subsequent writes to be sent
without needing to fork another task. This allows for additional
parallelism, in that an application on an SMP system can run in parallel
with a cleaning task, so that the socket buffer can be filled and
emptied in parallel without having to incur multiple system call traps.
I've only done some very rough performance estimates, but early results
are promising. Using this code here:
http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
I made some modifications to support using doorbell mode and compared
the time it took to send 1500 packets (each of size 1492 bytes) in basic
mmap mode and doorbell mmapped mode, and used tcpdump to capture the
output. Results:
trace packets start time end time delta p/s size
ndb 1500 2.755605 3.000886 0.245281 6115.43 1492b
db 1500 4.716448 4.846382 0.129934 11544.32 1492b
It's very rough of course, but it would seem I get a 40% increase in
throughput when using this method. I'm sure that's an overestimate, so
more testing is required, but initial results look good.
---
Thanks Neil, this looks helpful. I'll see if I can merge something like
this with my previous patch. Not likely to have anything by next week
though ;)

.John
Neil Horman
2014-10-09 16:05:27 UTC
Permalink
Post by John Fastabend
Post by Neil Horman
This patch adds a variation to the AF_PACKET memory mapped socket
transmit mechanism. Nominally, when using a memory mapped AF_PACKET
socket, frames are written into the memory mapped buffer, and then the
application calls sendmsg with a NULL buffer, which triggers the kernel
to clean the mapped space of all pending buffers.
While this provides clean, synchronous operation, improvements can be
made. To this end, I've introduced a doorbell mode of operation to
memory mapped packet sockets. When a packet socket is placed into
doorbell mode, it write-protects the mappings of any process using the
packet socket, so that on the first write to it a kernel trap is
generated, which returns the mapping to a read-write state and forks a
task to begin cleaning the buffers on the application's behalf. This
thread contains some hysteresis to continue running a short while after
the last buffer has been cleaned, allowing subsequent writes to be sent
without needing to fork another task. This allows for additional
parallelism, in that an application on an SMP system can run in parallel
with a cleaning task, so that the socket buffer can be filled and
emptied in parallel without having to incur multiple system call traps.
I've only done some very rough performance estimates, but early results
are promising. Using this code here:
http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
I made some modifications to support using doorbell mode and compared
the time it took to send 1500 packets (each of size 1492 bytes) in basic
mmap mode and doorbell mmapped mode, and used tcpdump to capture the
output. Results:
trace packets start time end time delta p/s size
ndb 1500 2.755605 3.000886 0.245281 6115.43 1492b
db 1500 4.716448 4.846382 0.129934 11544.32 1492b
It's very rough of course, but it would seem I get a 40% increase in
throughput when using this method. I'm sure that's an overestimate, so
more testing is required, but initial results look good.
---
Thanks Neil, This looks helpful I'll see if I can merge something like this with
my previous patch. Not likely to have anything by next week though ;)
No worries, I'm on vacation next week anyway :)
Thanks!
Neil
Post by John Fastabend
.John
Stephen Hemminger
2014-10-06 16:55:53 UTC
Permalink
On Sun, 05 Oct 2014 17:06:31 -0700
Post by John Fastabend
This patch adds a net_device ops to split off a set of driver queues
from the driver and map the queues into user space via mmap. This
allows the queues to be directly manipulated from user space. For
raw packet interface this removes any overhead from the kernel network
stack.
Typically in an af_packet interface a packet_type handler is
registered and used to filter traffic to the socket and do other
things such as fan out traffic to multiple sockets. In this case the
networking stack is being bypassed so this code is not run. So the
hardware must push the correct traffic to the queues obtained from
the ndo callback ndo_split_queue_pairs().
Fortunately there is already a flow classification interface which
is part of the ethtool command set, ETHTOOL_SRXCLSRLINS. It is
currently supported by multiple drivers including sfc, mlx4, niu,
ixgbe, and i40e. Supporting some way to steer traffic to a queue
is the _only_ hardware requirement to support the interface, plus
the driver needs to implement the correct ndo ops. A follow on
patch adds support for ixgbe but we expect at least the subset of
drivers implementing ETHTOOL_SRXCLSRLINS to be implemented later.
The interface is driven over an af_packet socket which we believe
is the most natural interface to use. Because it is already used
for raw packet interfaces which is what we are providing here.
bind(fd, &sockaddr, sizeof(sockaddr));
/* Get the device type and info */
getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
&optlen);
/* With device info we can look up descriptor format */
/* Get the layout of ring space offset, page_sz, cnt */
getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
&info, &optlen);
/* request some queues from the driver */
setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
&qpairs_info, sizeof(qpairs_info));
/* if we let the driver pick us queues learn which queues
* we were given
*/
getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
&qpairs_info, sizeof(qpairs_info));
/* And mmap queue pairs to user space */
mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
/* Now we have some user space queues to read/write to*/
There is one critical difference when running with these interfaces
vs running without them. In the normal case the af_packet module
uses a standard descriptor format exported by the af_packet user
space headers. In this model because we are working directly with
driver queues the descriptor format maps to the descriptor format
used by the device. User space applications can learn device
information from the socket option PACKET_DEV_DESC_INFO which
should provide enough details to extrapulate the descriptor formats.
Although this adds some complexity to user space it removes the
requirement to copy descriptor fields around.
The formats are usually provided by the device vendor documentation
If folks want I can provide a follow up patch to provide the formats
in a .h file in ./include/uapi/linux/ for ease of use. I have access
to formats for ixgbe and mlx drivers other driver owners would need to
provide their formats.
We tested this interface using traffic generators and doing basic
L2 forwarding tests on ixgbe devices. Our tests use a set of patches
to DPDK to enable an interface using this socket interfaace. With
application on a single core.
Additionally we have a set of DPDK patches to enable DPDK with this
clear from above DPDK is just our paticular test environment we
expect other libraries could be built on this interface.
I like the ability to share a device between kernel and user mode networking.
The model used for DPDK for this is really ugly and fragile/broken.
Your proposal assumes that you fully trust the user mode networking application
which is not a generally safe assumption.

A device can DMA from/to any arbitrary physical memory.
And it would be hard to use the IOMMU to protect here, because the
IOMMU doesn't know the difference between the application's queues and
the rest of the queues.

At least with DPDK you can use VFIO, and you are claiming the whole device to
allow protection against random memory being read/written.
John Fastabend
2014-10-06 20:42:28 UTC
Permalink
Post by Stephen Hemminger
On Sun, 05 Oct 2014 17:06:31 -0700
Post by John Fastabend
This patch adds a net_device ops to split off a set of driver queues
from the driver and map the queues into user space via mmap. This
allows the queues to be directly manipulated from user space. For
raw packet interface this removes any overhead from the kernel network
stack.
Typically in an af_packet interface a packet_type handler is
registered and used to filter traffic to the socket and do other
things such as fan out traffic to multiple sockets. In this case the
networking stack is being bypassed so this code is not run. So the
hardware must push the correct traffic to the queues obtained from
the ndo callback ndo_split_queue_pairs().
Fortunately there is already a flow classification interface which
is part of the ethtool command set, ETHTOOL_SRXCLSRLINS. It is
currently supported by multiple drivers including sfc, mlx4, niu,
ixgbe, and i40e. Supporting some way to steer traffic to a queue
is the _only_ hardware requirement to support the interface, plus
the driver needs to implement the correct ndo ops. A follow on
patch adds support for ixgbe but we expect at least the subset of
drivers implementing ETHTOOL_SRXCLSRLINS to be implemented later.
The interface is driven over an af_packet socket which we believe
is the most natural interface to use. Because it is already used
for raw packet interfaces which is what we are providing here.
bind(fd, &sockaddr, sizeof(sockaddr));
/* Get the device type and info */
getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
&optlen);
/* With device info we can look up descriptor format */
/* Get the layout of ring space offset, page_sz, cnt */
getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
&info, &optlen);
/* request some queues from the driver */
setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
&qpairs_info, sizeof(qpairs_info));
/* if we let the driver pick us queues learn which queues
* we were given
*/
getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
&qpairs_info, sizeof(qpairs_info));
/* And mmap queue pairs to user space */
mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
/* Now we have some user space queues to read/write to*/
There is one critical difference when running with these interfaces
vs running without them. In the normal case the af_packet module
uses a standard descriptor format exported by the af_packet user
space headers. In this model because we are working directly with
driver queues the descriptor format maps to the descriptor format
used by the device. User space applications can learn device
information from the socket option PACKET_DEV_DESC_INFO which
should provide enough details to extrapulate the descriptor formats.
Although this adds some complexity to user space it removes the
requirement to copy descriptor fields around.
The formats are usually provided by the device vendor documentation
If folks want I can provide a follow up patch to provide the formats
in a .h file in ./include/uapi/linux/ for ease of use. I have access
to formats for ixgbe and mlx drivers other driver owners would need to
provide their formats.
We tested this interface using traffic generators and doing basic
L2 forwarding tests on ixgbe devices. Our tests use a set of patches
to DPDK to enable an interface using this socket interfaace. With
application on a single core.
Additionally we have a set of DPDK patches to enable DPDK with this
clear from above DPDK is just our paticular test environment we
expect other libraries could be built on this interface.
I like the ability to share a device between kernel and user mode networking.
The model used for DPDK for this is really ugly and fragile/broken.
Your proposal assumes that you fully trust the user mode networking application
which is not a generally safe assumption.
A device can DMA from/to any arbitrary physical memory.
And it would be hard to use the IOMMU to protect here, because the
IOMMU doesn't know the difference between the application's queues and
the rest of the queues.
At least with DPDK you can use VFIO, and you are claiming the whole device to
allow protection against random memory being read/written.
However, not all platforms support VFIO, and when the application only
wants to handle specific traffic types, a queue maps well to this.
--
John Fastabend Intel Corporation
David Miller
2014-10-06 21:42:43 UTC
Permalink
From: John Fastabend <***@gmail.com>
Date: Sun, 05 Oct 2014 17:06:31 -0700
Post by John Fastabend
This patch adds a net_device ops to split off a set of driver queues
from the driver and map the queues into user space via mmap. This
allows the queues to be directly manipulated from user space. For
raw packet interface this removes any overhead from the kernel network
stack.
About the facility in general, I am generally in favor, as I expressed
at the networking track in Chicago.

But you missed the mark wrt. describing the descriptors.

I do not want you to give device IDs.

I want the code to be 100% agnostic to device or vendor IDs.

Really "describe" the descriptor. Not just how large is it (32-bits,
64-bits, etc.), but also: 1) is it little or big endian 2) where is
the length field 3) where is control bit "foo" located, etc.

That's what I want to see in "struct tpacket_dev_info", rather than
device IDs and "versions".
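
To make that concrete, something along these lines (purely illustrative
sketch of the direction described above; all names here are hypothetical,
not from the posted patches):

-- >8 --
#include <linux/types.h>

enum tpacket_desc_field_kind {
	TP_DESC_FIELD_BUF_ADDR,		/* buffer physical address */
	TP_DESC_FIELD_LEN,		/* packet length */
	TP_DESC_FIELD_OWN,		/* ownership/done control bit */
};

struct tpacket_desc_field {
	__u16 kind;			/* enum tpacket_desc_field_kind */
	__u16 offset;			/* byte offset in the descriptor */
	__u16 size;			/* field size in bytes */
	__u8  big_endian;		/* 1 = big endian, 0 = little */
	__u8  bit;			/* bit index for control bits */
};

struct tpacket_dev_info {
	__u16 desc_size;		/* bytes per hw descriptor */
	__u16 num_fields;
	struct tpacket_desc_field fields[];	/* one entry per field */
};
-- >8 --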
John Fastabend
2014-10-07 04:25:12 UTC
Permalink
Post by David Miller
Date: Sun, 05 Oct 2014 17:06:31 -0700
Post by John Fastabend
This patch adds a net_device ops to split off a set of driver queues
from the driver and map the queues into user space via mmap. This
allows the queues to be directly manipulated from user space. For
raw packet interface this removes any overhead from the kernel network
stack.
About the facility in general, I am generally in favor, as I expressed
at the networking track in Chicago.
But you missed the mark wrt. describing the descriptors.
I do not want you to give device IDs.
I want the code to be 100% agnostic to device or vendor IDs.
OK. I'll work on this and have a v2 after net-next opens again.

Thanks!
Post by David Miller
Really "describe" the descriptor. Not just how large is it (32-bits,
64-bits, etc.), but also: 1) is it little or big endian 2) where is
the length field 3) where is control bit "foo" located, etc.
That's what I want to see in "struct tpacket_dev_info", rather than
device IDs and "versions".
Willem de Bruijn
2014-10-07 04:24:14 UTC
Permalink
Post by John Fastabend
Supporting some way to steer traffic to a queue
is the _only_ hardware requirement to support the interface,
I would not impose this constraint. There may be legitimate use
cases for taking over all queues of a device. For instance, when
this is a secondary nic that does not carry any control traffic.
Post by John Fastabend
Typically in an af_packet interface a packet_type handler is
registered and used to filter traffic to the socket and do other
things such as fan out traffic to multiple sockets. In this case the
networking stack is being bypassed so this code is not run. So the
hardware must push the correct traffic to the queues obtained from
the ndo callback ndo_split_queue_pairs().
Why does the interface work at the level of queue_pairs instead of
individual queues?
Post by John Fastabend
/* Get the layout of ring space offset, page_sz, cnt */
getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
&info, &optlen);
/* request some queues from the driver */
setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
&qpairs_info, sizeof(qpairs_info));
/* if we let the driver pick us queues learn which queues
* we were given
*/
getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
&qpairs_info, sizeof(qpairs_info));
If ethtool -U is used to steer traffic to a specific descriptor queue,
then the setsockopt can pass the exact id of that queue and there
is no need for a getsockopt follow-up.
Post by John Fastabend
/* And mmap queue pairs to user space */
mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
How will packet data be mapped and how will userspace translate
from paddr to vaddr? Is the goal to maintain long lived mappings
and instruct drivers to allocate from this restricted range (to
avoid per-packet system calls and vma operations)?

For throughput-oriented workloads, the syscall overhead
involved in kicking the nic (on tx, or for increasing the ring
consumer index on rx) can be amortized. And the operation
can perhaps piggy-back on interrupts or other events
(as long as interrupts are not disabled for full userspace
polling). Latency would be harder to satisfy while maintaining
some kernel policy enforcement. An extreme solution
uses an asynchronously busy polling kernel worker thread
(at high cycle cost, so acceptable for few workloads).

When keeping the kernel in the loop, it is possible to do
some basic sanity checking and transparently translate between
vaddr and paddr, even when exposing the hardware descriptors
directly. Though at this point it may be just as cheap to expose
an idealized virtualized descriptor format and copy fields between
that and device descriptors.

One assumption underlying exposing the hardware descriptors
is that they are quite similar between devices. How true is this
in the context of formats that span multiple descriptors?
Post by John Fastabend
+ * int (*ndo_split_queue_pairs) (struct net_device *dev,
+ * unsigned int qpairs_start_from,
+ * unsigned int qpairs_num,
+ * struct sock *sk)
+ * Called to request a set of queues from the driver to be
+ * handed to the callee for management. After this returns the
+ * driver will not use the queues.
Are these queues also taken out of ethtool management, or is
this equivalent to removing them from the RSS set with
ethtool -X?
David Laight
2014-10-07 09:27:03 UTC
Permalink
From: Willem de Bruijn
...
Post by Willem de Bruijn
When keeping the kernel in the loop, it is possible to do
some basic sanity checking and transparently translate between
vaddr and paddr, even when exposing the hardware descriptors
directly.
The application could change the addresses after they have been
validated, but before they have been read by the device.
Post by Willem de Bruijn
Though at this point it may be just as cheap to expose
an idealized virtualized descriptor format and copy fields between
that and device descriptors.
That is (probably) the only scheme that stops the application
accessing random parts of physical memory.
Post by Willem de Bruijn
One assumption underlying exposing the hardware descriptors
is that they are quire similar between devices. How true is this
in the context of formats that span multiple descriptors?
I suspect you'd need to define complete ring entries for 'initial',
'middle', 'final' and 'complete' fragments, together with the
offsets and endianness (and size?) of the address and data fields.

Also whether there is a special 'last entry' in the ring.

Passing checksum offload flags through adds an extra level of complexity.

Rings like the xhci (actually USB, but could contain ethernet data)
require the 'owner' bit be written odd or even in alternating passes.
Actually mapping support for usbnet (especially xhci - usb3) might show
up some deficiencies in the definition.

You also need to know when transmits have completed.
This might be an 'owner' bit being cleared, but it could be signalled in
some other way.
David Miller
2014-10-07 15:43:41 UTC
Permalink
From: David Laight <***@ACULAB.COM>
Date: Tue, 7 Oct 2014 09:27:03 +0000
Post by David Laight
That is (probably) the only scheme that stops the application
accessing random parts of physical memory.
I don't know where this claim keeps coming from, it's false.

The application has to attach memory to the ring, and then the
ring can only refer to that memory for the duration of the
session.

There is no way that the user can program the address field of the
descriptors to point at arbitrary physical memory locations.

There is protection and control.
David Laight
2014-10-07 15:59:35 UTC
Permalink
From: David
Post by David Miller
Date: Tue, 7 Oct 2014 09:27:03 +0000
Post by David Laight
That is (probably) the only scheme that stops the application
accessing random parts of physical memory.
I don't know where this claim keeps coming from, it's false.
The application has to attach memory to the ring, and then the
ring can only refer to that memory for the duration of the
session.
There is no way that the user can program the address field of the
descriptors to point at arbitrary physical memory locations.
There is protection and control.
I got the impression that the application was directly writing the ring
structure that the ethernet mac hardware uses to describe tx and rx buffers.
(ie they are mapped read-write into userspace).
Unless you have a system where you can limit the physical memory
ranges accessible to the mac hardware, I don't see how you can stop
the application putting rogue values into the ring.

Clearly I'm missing something in my quick read of the change.

David
David Miller
2014-10-07 16:08:44 UTC
Permalink
From: David Laight <***@ACULAB.COM>
Date: Tue, 7 Oct 2014 15:59:35 +0000
Post by David Miller
From: David
Post by David Miller
Date: Tue, 7 Oct 2014 09:27:03 +0000
Post by David Laight
That is (probably) the only scheme that stops the application
accessing random parts of physical memory.
I don't know where this claim keeps coming from, it's false.
The application has to attach memory to the ring, and then the
ring can only refer to that memory for the duration of the
session.
There is no way that the user can program the address field of the
descriptors to point at arbitrary physical memory locations.
There is protection and control.
I got the impression that the application was directly writing the ring
structure that the ethernet mac hardware uses to describe tx and rx buffers.
(ie they are mapped read-write into userspace).
Unless you have a system where you can limit the physical memory
ranges accessible to the mac hardware, I don't see how you can stop
the application putting rogue values into the ring.
Clearly I'm missing something in my quick read of the change.
No, I think I misunderstood, and apparently the Mellanox driver allows
the user to crap into arbitrary physical memory too.

All of this garbage must get fixed and this feature is a non-starter
until there is control over the memory the rings can point to.
Zhou, Danny
2014-10-07 15:21:15 UTC
Permalink
Post by John Fastabend
Supporting some way to steer traffic to a queue
is the _only_ hardware requirement to support the interface,
I would not impose this constraint. There may be legitimate use
cases for taking over all queues of a device. For instance, when
this is a secondary nic that does not carry any control traffic.
For a secondary NIC that carries data plane traffic only, you can use
UIO or VFIO to map the NIC's entire I/O space to user space. Then a user
space poll-mode driver, like those supported and open-sourced in DPDK
(plus the ones supporting Mellanox/Emulex NICs that are not
open-sourced), can drive the NIC as the sole driver in user space.
Post by John Fastabend
Typically in an af_packet interface a packet_type handler is
registered and used to filter traffic to the socket and do other
things such as fan out traffic to multiple sockets. In this case the
networking stack is being bypassed so this code is not run. So the
hardware must push the correct traffic to the queues obtained from
the ndo callback ndo_split_queue_pairs().
Why does the interface work at the level of queue_pairs instead of
individual queues?
The user mode "slave" driver (I call it a slave driver because it is
only responsible for packet I/O on certain queue pairs) needs to take
over at least one rx queue and one tx queue, for ingress and egress
traffic respectively, although the flow director only applies to
ingress traffic.
Post by John Fastabend
/* Get the layout of ring space offset, page_sz, cnt */
getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
&info, &optlen);
/* request some queues from the driver */
setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
&qpairs_info, sizeof(qpairs_info));
/* if we let the driver pick us queues learn which queues
* we were given
*/
getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
&qpairs_info, sizeof(qpairs_info));
If ethtool -U is used to steer traffic to a specific descriptor queue,
then the setsockopt can pass the exact id of that queue and there
is no need for a getsockopt follow-up.
Very good point. It supports passing "-1" as the queue id (followed by
the number of qpairs needed) via setsockopt to af_packet and the NIC
kernel driver, asking the driver to dynamically allocate free and
available qpairs for this socket, so getsockopt() is needed to return
the actually assigned queue pair indexes. Initially we had an
implementation that called getsockopt once and treated qpairs_info as an
IN/OUT parameter, but that is semantically wrong, so we think the above
implementation is the most suitable. But I agree with you: if setsockopt
can pass the exact id of a valid queue pair index, there is no need to
call getsockopt.
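To illustrate the flow (the qpairs_info layout below is only a guess at
what the patch defines, shown here for the shape of the calls):

-- >8 --
struct tpacket_qpairs_info {		/* hypothetical layout */
	int qpairs_start_from;		/* -1: let the driver choose */
	unsigned int qpairs_num;	/* how many pairs we want */
};

struct tpacket_qpairs_info qi = { .qpairs_start_from = -1, .qpairs_num = 2 };
socklen_t len = sizeof(qi);

setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT, &qi, sizeof(qi));
/* learn which queue pairs the driver actually handed us */
getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT, &qi, &len);
-- >8 --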
Post by John Fastabend
/* And mmap queue pairs to user space */
mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
How will packet data be mapped and how will userspace translate
from paddr to vaddr? Is the goal to maintain long lived mappings
and instruct drivers to allocate from this restricted range (to
avoid per-packet system calls and vma operations)?
Once the qpair split-off is done, the user space driver, as a slave
driver, will re-initialize those queues completely in user space, using
paddrs (in the case of DPDK, vaddrs of the huge pages DPDK uses are
translated to paddrs) to fill in the packet descriptors.
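For reference, the vaddr-to-paddr translation DPDK does here is the
standard /proc/self/pagemap walk; a minimal sketch (root required, error
handling trimmed):

-- >8 --
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

static uint64_t virt2phys(const void *vaddr)
{
	long psz = sysconf(_SC_PAGESIZE);
	uint64_t entry = 0;
	off_t off = (off_t)((uintptr_t)vaddr / psz) * 8;
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return 0;
	if (pread(fd, &entry, 8, off) != 8)	/* one 64-bit entry per page */
		entry = 0;
	close(fd);
	if (!(entry & (1ULL << 63)))		/* bit 63: page present */
		return 0;
	/* bits 0-54 hold the page frame number */
	return (entry & ((1ULL << 55) - 1)) * psz + (uintptr_t)vaddr % psz;
}
-- >8 --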
As for the security concern raised in the previous discussion, the
reason we think af_packet is the most suitable (BTW, correct me if I am
wrong) is that only a user application with root permission is allowed
to successfully split off queue pairs and mmap a small window of PCIe
I/O space to user space, so the concern that the "device can DMA from/to
any arbitrary physical memory" is not that big. All user space device
drivers based on the UIO mechanism have the same concern; VFIO adds
protection, but it is based on the IOMMU, which is specific to Intel
silicon.
For throughput-oriented workloads, the syscall overhead
involved in kicking the nic (on tx, or for increasing the ring
consumer index on rx) can be amortized. And the operation
can perhaps piggy-back on interrupts or other events
(as long as interrupts are not disabled for full userspace
polling). Latency would be harder to satisfy while maintaining
some kernel policy enforcement. An extreme solution
uses an asynchronously busy polling kernel worker thread
(at high cycle cost, so acceptable for few workloads).
When keeping the kernel in the loop, it is possible to do
some basic sanity checking and transparently translate between
vaddr and paddr, even when exposing the hardware descriptors
directly. Though at this point it may be just as cheap to expose
an idealized virtualized descriptor format and copy fields between
that and device descriptors.
One assumption underlying exposing the hardware descriptors
is that they are quite similar between devices. How true is this
in the context of formats that span multiple descriptors?
Packet descriptor formats vary between vendors. On Intel NICs, the
1G/10G/40G parts have totally different formats, and even the same Intel
10G/40G NIC supports at least 2 different descriptor formats. IMHO, the
idea behind these patches is to sidestep descriptor differences among
devices: they just map certain I/O space pages to user space, and the
user space "slave" NIC driver can handle them with different descriptor
structs based on vendor/device ID. But I am open to adding support for a
generic packet descriptor format description, per David M's suggestion.
Post by John Fastabend
+ * int (*ndo_split_queue_pairs) (struct net_device *dev,
+ * unsigned int qpairs_start_from,
+ * unsigned int qpairs_num,
+ * struct sock *sk)
+ * Called to request a set of queues from the driver to be
+ * handed to the callee for management. After this returns the
+ * driver will not use the queues.
Are these queues also taken out of ethtool management, or is
this equivalent to removing them from the RSS set with
ethtool -X?
As a master driver, the NIC kernel driver still takes control of the
flow director as an ethtool backend. Generally, not all queues are
initialized and used by the NIC kernel driver, which reports the
actually used rx/tx queue counts to the stack. Before splitting off
certain queues, if you want to use ethtool to direct traffic to those
unused queues, ethtool reports an invalid argument. Once certain
stack-unaware queues are allocated for the user space slave driver,
ethtool allows directing packets to them, as the NIC driver maintains a
data struct tracking which queues are visible to and used by the kernel,
and which are used by user space.
Willem de Bruijn
2014-10-07 15:46:05 UTC
Permalink
Post by Zhou, Danny
Post by Willem de Bruijn
Post by John Fastabend
Typically in an af_packet interface a packet_type handler is
registered and used to filter traffic to the socket and do other
things such as fan out traffic to multiple sockets. In this case the
networking stack is being bypassed so this code is not run. So the
hardware must push the correct traffic to the queues obtained from
the ndo callback ndo_split_queue_pairs().
Why does the interface work at the level of queue_pairs instead of
individual queues?
The user mode "slave" driver (I call it a slave driver because it is
only responsible for packet I/O on certain queue pairs) needs to take
over at least one rx queue and one tx queue, for ingress and egress
traffic respectively, although the flow director only applies to
ingress traffic.
That requirement of co-allocation is absent in existing packet
rings. Many applications only receive or transmit. For
receive-only, it would even be possible to map descriptor
rings read-only, if the kernel remains responsible for posting
buffers -- but I see below that that is not the case, so that's
not very relevant here.

Still, some workloads want asymmetric sets of rx and tx rings.
For instance, instead of using RSS, a process may want to
receive on as few rings as possible, load balance across
workers in software, but still give each worker thread its own
private transmit ring.
Post by Zhou, Danny
Post by Willem de Bruijn
Post by John Fastabend
/* Get the layout of ring space offset, page_sz, cnt */
getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
&info, &optlen);
/* request some queues from the driver */
setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
&qpairs_info, sizeof(qpairs_info));
/* if we let the driver pick us queues learn which queues
* we were given
*/
getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
&qpairs_info, sizeof(qpairs_info));
If ethtool -U is used to steer traffic to a specific descriptor queue,
then the setsockopt can pass the exact id of that queue and there
is no need for a getsockopt follow-up.
Very good point. It supports passing "-1" as the queue id (followed by
the number of qpairs needed) via setsockopt to af_packet and the NIC
kernel driver, asking the driver to dynamically allocate free and
available qpairs for this socket, so getsockopt() is needed to return
the actually assigned queue pair indexes. Initially we had an
implementation that called getsockopt once and treated qpairs_info as an
IN/OUT parameter, but that is semantically wrong, so we think the above
implementation is the most suitable. But I agree with you: if setsockopt
can pass the exact id of a valid queue pair index, there is no need to
call getsockopt.
One step further would be to move the entire configuration behind
the packet socket interface. It's perhaps out of scope of this patch,
but the difference between using `ethtool -U` and passing the same
expression through the packet socket is that in the latter case the
kernel can automatically rollback the configuration change when the
process dies.
Post by Zhou, Danny
Post by Willem de Bruijn
Post by John Fastabend
/* And mmap queue pairs to user space */
mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
How will packet data be mapped and how will userspace translate
from paddr to vaddr? Is the goal to maintain long lived mappings
and instruct drivers to allocate from this restricted range (to
avoid per-packet system calls and vma operations)?
Once the qpair split-off is done, the user space driver, as a slave
driver, will re-initialize those queues completely in user space, using
paddrs (in the case of DPDK, vaddrs of the huge pages DPDK uses are
translated to paddrs) to fill in the packet descriptors.
Ah, userspace is responsible for posting buffers and translation
from vaddr to paddr is straightforward. Yes that makes sense.
John Fastabend
2014-10-07 15:55:23 UTC
Permalink
Post by Willem de Bruijn
Post by Zhou, Danny
Post by Willem de Bruijn
Post by John Fastabend
Typically in an af_packet interface a packet_type handler is
registered and used to filter traffic to the socket and do other
things such as fan out traffic to multiple sockets. In this case the
networking stack is being bypassed so this code is not run. So the
hardware must push the correct traffic to the queues obtained from
the ndo callback ndo_split_queue_pairs().
Why does the interface work at the level of queue_pairs instead of
individual queues?
The user mode "slave" driver (I call it a slave driver because it is
only responsible for packet I/O on certain queue pairs) needs to take
over at least one rx queue and one tx queue, for ingress and egress
traffic respectively, although the flow director only applies to
ingress traffic.
That requirement of co-allocation is absent in existing packet
rings. Many applications only receive or transmit. For
receive-only, it would even be possible to map descriptor
rings read-only, if the kernel remains responsible for posting
buffers -- but I see below that that is not the case, so that's
not very relevant here.
Still, some workloads want asymmetric sets of rx and tx rings.
For instance, instead of using RSS, a process may want to
receive on as few rings as possible, load balance across
workers in software, but still give each worker thread its own
private transmit ring.
We can build this into the interface by having the setsockopt
provide both the number of tx rings and number of rx rings. It
might not be immediately available in any drivers because at
least ixgbe is pretty dependent on tx/rx pairing.

I would have to look through the other drivers to see how
much work it would be to support this on them. If I can't find
a good candidate we might leave it out until we can fix up the
drivers.
Post by Willem de Bruijn
Post by Zhou, Danny
Post by Willem de Bruijn
Post by John Fastabend
/* Get the layout of ring space offset, page_sz, cnt */
getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
&info, &optlen);
/* request some queues from the driver */
setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
&qpairs_info, sizeof(qpairs_info));
/* if we let the driver pick us queues learn which queues
* we were given
*/
getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
&qpairs_info, sizeof(qpairs_info));
If ethtool -U is used to steer traffic to a specific descriptor queue,
then the setsockopt can pass the exact id of that queue and there
is no need for a getsockopt follow-up.
Very good point. The interface supports passing "-1" as the queue id (followed by the number of
qpairs needed) via setsockopt, asking af_packet and the NIC kernel driver to dynamically allocate
free and available qpairs for this socket; in that case getsockopt() is needed to return the queue
pair indexes that were actually assigned.
Initially we had an implementation that called getsockopt once, with af_packet treating qpairs_info
as an IN/OUT parameter, but that is semantically wrong, so we think the implementation above is
the most suitable. But I agree with you: if setsockopt passes a valid queue pair index explicitly,
there is no need to call getsockopt.
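So the explicit-index path collapses to a single call. A sketch; the field names are guesses at the
patch's tpacket_dev_qpairs_info layout, which is not shown here:

/* Queue 6 was already targeted via ethtool -U, so name it
 * directly; no getsockopt() follow-up is needed because the
 * caller already knows which queue it owns.
 */
struct tpacket_dev_qpairs_info req = {
	.start_queue_pair = 6,	/* illustrative field name */
	.num_qpairs       = 1,
};

setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
	   &req, sizeof(req));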
One step further would be to move the entire configuration behind
the packet socket interface. It's perhaps out of scope of this patch,
but the difference between using `ethtool -U` and passing the same
expression through the packet socket is that in the latter case the
kernel can automatically rollback the configuration change when the
process dies.
Hmm, might be interesting. I think this is a follow-on path to
investigate after the initial support.
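For reference, the ethtool -U steering discussed above can also be driven programmatically over
the same SIOCETHTOOL ioctl the tool uses. A sketch that directs TCP traffic with dst port 9000 to
rx queue 6; note that mask conventions and the accepted field combinations vary by driver, and
some drivers want an explicit rule index instead of automatic placement:

#include <string.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <netinet/in.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* Equivalent of: ethtool -U eth0 flow-type tcp4 dst-port 9000 action 6
 * sk is any open socket fd (e.g. AF_INET/SOCK_DGRAM).
 */
static int steer_to_queue(int sk)
{
	struct ethtool_rxnfc nfc;
	struct ifreq ifr;

	memset(&nfc, 0, sizeof(nfc));
	memset(&ifr, 0, sizeof(ifr));

	nfc.cmd = ETHTOOL_SRXCLSRLINS;
	nfc.fs.flow_type = TCP_V4_FLOW;
	nfc.fs.h_u.tcp_ip4_spec.pdst = htons(9000);
	nfc.fs.m_u.tcp_ip4_spec.pdst = 0xffff;	/* compare the full port */
	nfc.fs.ring_cookie = 6;			/* target rx queue */
	nfc.fs.location = RX_CLS_LOC_ANY;	/* let the driver pick a slot */

	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
	ifr.ifr_data = (char *)&nfc;

	return ioctl(sk, SIOCETHTOOL, &ifr);
}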
Post by Willem de Bruijn
Post by Zhou, Danny
Post by Willem de Bruijn
Post by John Fastabend
/* And mmap queue pairs to user space */
mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
How will packet data be mapped and how will userspace translate
from paddr to vaddr? Is the goal to maintain long lived mappings
and instruct drivers to allocate from this restricted range (to
avoid per-packet system calls and vma operations)?
Once the qpair split-off is done, the user space driver, as a slave driver, re-initializes those
queues completely in user space, filling in the packet descriptors with paddrs (in the case of
DPDK, the vaddrs of DPDK's huge pages are translated to paddrs).
Ah, userspace is responsible for posting buffers and translation
from vaddr to paddr is straightforward. Yes that makes sense.
Zhou, Danny
2014-10-07 16:06:41 UTC
Permalink
-----Original Message-----
From: Fastabend, John R
Sent: Tuesday, October 07, 2014 11:55 PM
To: Willem de Bruijn; Zhou, Danny
Ronciak, John; Amir Vadai; Eric Dumazet
Subject: Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
Post by John Fastabend
We can build this into the interface by having the setsockopt
provide both the number of tx rings and number of rx rings. It
might not be immediately available in any drivers because at
least ixgbe is pretty dependent on tx/rx pairing.
I would have to look through the other drivers to see how
much work it would be to support this on them. If I can't find
a good candidate we might leave it out until we can fix up the
drivers.
Fully agreed. Unlike ixgbe, DPDK supports asymmetric sets of rx and tx rings. It should be easy to
enable if user space requests free and available rx/tx queues, but it would cause a lot of trouble
if the requested queues were already occupied by ixgbe, as that breaks the existing rx/tx pairing.
David Miller
2014-10-07 16:05:34 UTC
Permalink
From: "Zhou, Danny" <***@intel.com>
Date: Tue, 7 Oct 2014 15:21:15 +0000
Post by Zhou, Danny
Once the qpair split-off is done, the user space driver, as a slave
driver, re-initializes those queues completely in user space,
filling in the packet descriptors with paddrs (in the case of DPDK,
the vaddrs of DPDK's huge pages are translated to paddrs). As for
the security concern raised in the previous discussion, the reason
we think af_packet is most suitable (correct me if I am wrong) is
that only a user application with root permission is allowed to
successfully split off queue pairs and mmap a small window of PCIe
I/O space to user space, so the concern that "the device can DMA
from/to any arbitrary physical memory" is not that big. All user
space device drivers based on the UIO mechanism share the same
concern; VFIO adds protection, but it depends on an IOMMU being
present.
Wait a second.

If there is no memory protection performed I'm not merging this.

I thought the user has to associate a fixed pool of memory to the
queues, the kernel attaches that memory, and then the user cannot
modify the addresses _AT_ _ALL_.

If the user can modify the addresses in the descriptors and make
the chip crap on random memory, this is a non-starter.

Sorry.
Zhou, Danny
2014-10-10 03:49:23 UTC
Permalink
-----Original Message-----
Sent: Wednesday, October 08, 2014 12:06 AM
To: Zhou, Danny
Subject: Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
If there is no memory protection performed I'm not merging this.
If the user can modify the addresses in the descriptors and make
the chip crap on random memory, this is a non-starter.
Fair enough; we will add memory protection in a future version.
Several options are under investigation.
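One shape that protection could take, offered only as an assumption about where this might go:
pin the buffer pool when the queues are split off, and have the kernel reject any descriptor
address outside it before the device can see it. A sketch with invented names:

#include <linux/types.h>

/* Hypothetical kernel-side check sketching David's model: the
 * buffer pool is pinned at split-off time, and a descriptor
 * address is accepted only if the whole buffer stays inside it.
 */
struct qpair_mem_region {
	dma_addr_t base;	/* pinned at split-off time */
	size_t     len;
};

static bool qpair_addr_valid(const struct qpair_mem_region *r,
			     dma_addr_t addr, size_t buf_len)
{
	return addr >= r->base &&
	       buf_len <= r->len &&
	       addr - r->base <= r->len - buf_len;
}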
Alexei Starovoitov
2014-10-07 16:33:04 UTC
Permalink
Post by Zhou, Danny
As the master driver, the NIC kernel driver still controls the flow
director as the ethtool backend. Generally, not all queues are
initialized and used by the NIC kernel driver, which reports the
number of rx/tx queues it actually uses to the stack. Before certain
queues are split off, ethtool reports an invalid argument if you try
to direct traffic to those unused queues. Once certain stack-unaware
queues are allocated to a user space slave driver, ethtool allows
directing packets to them, since the NIC driver maintains a data
structure recording which queues are visible to and used by the
kernel, and which are used by user space.
this whole thing sounds like it's a way to let DPDK apps share physical
interfaces with the kernel, so that TCP-based control plane traffic
can reach the DPDK process via the kernel while the DPDK user space
data path does the rest of the packet processing?
One still needs PCIe register I/O and all of the user space driver
support to make it work, right?
I guess that's great for DPDK users, but I don't think it's good for linux.
Zhou, Danny
2014-10-07 16:46:11 UTC
Permalink
-----Original Message-----
Sent: Wednesday, October 08, 2014 12:33 AM
To: Zhou, Danny
Development; Ronciak, John; Amir Vadai; Eric Dumazet; David S. Miller
Subject: Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
Post by Alexei Starovoitov
this whole thing sounds like it's a way to let DPDK apps share physical
interfaces with the kernel, so that TCP-based control plane traffic
can reach the DPDK process via the kernel while the DPDK user space
data path does the rest of the packet processing?
One still needs PCIe register I/O and all of the user space driver
support to make it work, right?
I guess that's great for DPDK users, but I don't think it's good for linux.
No, it is a generic mechanism for partitioning NIC resources (e.g. rx/tx queue pairs) between the
NIC kernel driver and user space drivers with different performance/latency/jitter characteristics.
One only needs to write efficient packet rx/tx routines, taking advantage of whatever software
optimizations, that manipulate the small set of rx/tx queue registers in PCIe I/O space. In other
words, there is no need to port the entire NIC driver from kernel to user space.
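The "data structure recording which queues are visible to and used by the kernel" mentioned above
could be as simple as per-netdev ownership bitmaps. A hypothetical sketch; none of these names
come from the patch set:

#include <linux/bitmap.h>
#include <linux/errno.h>

/* Hypothetical per-netdev bookkeeping: which queue pairs the
 * kernel datapath still owns, and which have been split off to
 * a user space slave driver.
 */
struct qpair_ownership {
	unsigned long *kernel_owned;	/* bit set: kernel datapath */
	unsigned long *user_owned;	/* bit set: split off to user space */
	unsigned int   num_qpairs;
};

static int qpair_split_off(struct qpair_ownership *o, unsigned int idx)
{
	if (idx >= o->num_qpairs ||
	    test_bit(idx, o->kernel_owned) ||
	    test_bit(idx, o->user_owned))
		return -EBUSY;	/* out of range or already claimed */
	set_bit(idx, o->user_owned);
	return 0;
}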
David Miller
2014-10-07 17:01:27 UTC
Permalink
From: Alexei Starovoitov <***@gmail.com>
Date: Tue, 7 Oct 2014 09:33:04 -0700
Post by Alexei Starovoitov
I guess that's great for DPDK users, but I don't think it's good for linux.
Any use of a piece of hardware is fine with me, personally, as long
as adequate protections are in place.

If it's just a descriptor ring in software and a doorbell to trigger
a refetch of the head and tail pointers, with appropriate protection
and control of the memory attached to the ring, I don't see how I
could object to such a facility.
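The descriptor-ring-plus-doorbell model described here maps onto roughly this shape in a user
space slave driver. A generic sketch only; the names and layout are assumptions, since the real
descriptor format and doorbell offset are device-specific and come from PACKET_DEV_DESC_INFO and
vendor documentation:

#include <stdint.h>

/* Generic shape of a user space ring over a split-off queue. */
struct uring {
	volatile void     *desc;	/* mmap'ed descriptor ring */
	volatile uint32_t *doorbell;	/* mmap'ed tail register in the BAR */
	uint32_t           head;	/* software consumer index */
	uint32_t           size;	/* number of descriptors */
};

static void ring_kick(struct uring *r, uint32_t new_tail)
{
	/* make descriptor writes visible before the doorbell */
	__sync_synchronize();
	/* the device refetches head/tail after the doorbell write */
	*r->doorbell = new_tail % r->size;
}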