Discussion:
[dpdk-dev] [PATCH 0/8] virtio: virtio ring layout optimization and RX vector processing
Huawei Xie
2015-09-29 14:45:45 UTC
Copied some message from patch 4.
In a DPDK-based switching environment, vhost mostly runs on a dedicated core
while virtio processing in the guest VMs runs on different cores.
Take RX for example: with the generic implementation, for each guest buffer,
a) the virtio driver allocates a descriptor from the free descriptor list
b) it modifies the entry of the avail ring to point to the allocated descriptor
c) after the packet is received, it frees the descriptor

When vhost fetches the avail ring, it needs to fetch the modified L1 cache line from
the virtio core, which is a heavy cost on current CPU implementations.

The idea of this optimization is:
allocate a fixed descriptor for each entry of the avail ring,
so the avail ring stays the same during the run.
This removes the L1 cache transfer from the virtio core to the vhost core for the avail ring.
Besides, no descriptor free and allocation is needed.

Most importantly, this makes vector processing possible to further accelerate
the processing.

This is the layout for the avail ring (take 256 ring entries for example), with
each entry pointing to the descriptor with the same index.
avail
idx
+
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | ... | 254 | 255 | avail ring
+-+--+-+--+-+-+---------+---+--+---+
| | | | | |
| | | | | |
v v v | v v
+-+--+-+--+-+-+---------+---+--+---+
| 0 | 1 | 2 | ... | 254 | 255 | desc ring
+----+----+---+-------------+------+
|
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | | 254 | 255 | used ring
+----+----+---+-------------+------+
|
+
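
(For illustration only: a minimal C sketch of the fixed RX mapping shown above, mirroring what
patch 4 does when starting the RX vring; vq, vq_ring and vq_nentries are the virtqueue fields
used throughout this series, and i is a plain loop counter.)

/* Fix the avail ring so entry i always points to descriptor i.  After this
 * the avail ring never changes at run time; only avail->idx moves. */
for (i = 0; i < vq->vq_nentries; i++) {
	vq->vq_ring.avail->ring[i] = i;                 /* fixed mapping */
	vq->vq_ring.desc[i].flags = VRING_DESC_F_WRITE; /* device writes RX buffers */
}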

This is the ring layout for TX.
As one virtio header is needed for each xmit packet, we have 128 slots available.

++
||
||
+-----+-----+-----+--------------+------+------+------+
| 0 | 1 | ... | 127 || 128 | 129 | ... | 255 | avail ring
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 127 | 128 | ... | 255 || 127 | 128 | ... | 255 | desc ring for virtio_net_hdr
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 0 | 1 | ... | 127 || 0 | 1 | ... | 127 | desc ring for tx dat
+-----+-----+-----+--------------+------+------+------+
||
||
++
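
(Likewise, a sketch of the TX half-ring setup the diagram describes, mirroring the TX hunk in
patch 4: avail entry i points to header descriptor i + 128, which chains to data descriptor i.)

int mid = vq->vq_nentries >> 1;                  /* 128 for a 256-entry ring */
for (i = 0; i < mid; i++) {
	vq->vq_ring.avail->ring[i] = i + mid;    /* avail slot i -> header desc */
	vq->vq_ring.desc[i + mid].addr = vq->virtio_net_hdr_mem +
			mid * vq->hw->vtnet_hdr_size; /* zeroed header; no offloads are used */
	vq->vq_ring.desc[i + mid].len   = vq->hw->vtnet_hdr_size;
	vq->vq_ring.desc[i + mid].flags = VRING_DESC_F_NEXT;
	vq->vq_ring.desc[i + mid].next  = i;     /* chain to the data descriptor */
	vq->vq_ring.desc[i].flags = 0;           /* data desc filled at xmit time */
}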

A performance boost can be observed if the virtio backend isn't the bottleneck, or in the VM2VM
case.
There are also several vhost optimization patches to be submitted later.

Huawei Xie (8):
virtio: add configure for simple virtio rx/tx
virtio: add virtio_rxtx.h header file
virtio: add software rx ring, fake_buf, simple_rxtx into virtqueue
virtio: rx/tx ring layout optimization
virtio: fill RX avail ring with blank mbufs
virtio: virtio vec rx
virtio: simple tx routine
virtio: rxtx_func_get

config/common_linuxapp | 1 +
drivers/net/virtio/Makefile | 2 +-
drivers/net/virtio/virtio_ethdev.c | 29 ++-
drivers/net/virtio/virtio_ethdev.h | 5 +
drivers/net/virtio/virtio_rxtx.c | 70 +++++-
drivers/net/virtio/virtio_rxtx.h | 39 ++++
drivers/net/virtio/virtio_rxtx_simple.c | 403 ++++++++++++++++++++++++++++++++
drivers/net/virtio/virtqueue.h | 7 +
8 files changed, 550 insertions(+), 6 deletions(-)
create mode 100644 drivers/net/virtio/virtio_rxtx.h
create mode 100644 drivers/net/virtio/virtio_rxtx_simple.c
--
1.8.1.4
Huawei Xie
2015-09-29 14:45:46 UTC
Turned off by default. This is a development feature.
The macro will be removed once this optimization is widely accepted.
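
(To try the feature, a user would flip the new build-time option in config/common_linuxapp
before compiling; nothing is needed at run time.)

CONFIG_RTE_LIBRTE_VIRTIO_SIMPLE=y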

Signed-off-by: Huawei Xie <***@intel.com>
---
config/common_linuxapp | 1 +
1 file changed, 1 insertion(+)

diff --git a/config/common_linuxapp b/config/common_linuxapp
index 0de43d5..b70c5d7 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -241,6 +241,7 @@ CONFIG_RTE_LIBRTE_ENIC_DEBUG=n
# Compile burst-oriented VIRTIO PMD driver
#
CONFIG_RTE_LIBRTE_VIRTIO_PMD=y
+CONFIG_RTE_LIBRTE_VIRTIO_SIMPLE=n
CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_INIT=n
CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_RX=n
CONFIG_RTE_LIBRTE_VIRTIO_DEBUG_TX=n
--
1.8.1.4
Huawei Xie
2015-09-29 14:45:47 UTC
Will move all rx/tx related code into this header file in the future.
Add RTE_PMD_VIRTIO_RX_MAX_BURST.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.c | 1 +
drivers/net/virtio/virtio_rxtx.c | 1 +
drivers/net/virtio/virtio_rxtx.h | 34 ++++++++++++++++++++++++++++++++++
3 files changed, 36 insertions(+)
create mode 100644 drivers/net/virtio/virtio_rxtx.h

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 465d3cd..79a3640 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -61,6 +61,7 @@
#include "virtio_pci.h"
#include "virtio_logs.h"
#include "virtqueue.h"
+#include "virtio_rxtx.h"


static int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index c5b53bb..9324f7f 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -54,6 +54,7 @@
#include "virtio_logs.h"
#include "virtio_ethdev.h"
#include "virtqueue.h"
+#include "virtio_rxtx.h"

#ifdef RTE_LIBRTE_VIRTIO_DEBUG_DUMP
#define VIRTIO_DUMP_PACKET(m, len) rte_pktmbuf_dump(stdout, m, len)
diff --git a/drivers/net/virtio/virtio_rxtx.h b/drivers/net/virtio/virtio_rxtx.h
new file mode 100644
index 0000000..a10aa69
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx.h
@@ -0,0 +1,34 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#define RTE_PMD_VIRTIO_RX_MAX_BURST 64
--
1.8.1.4
Huawei Xie
2015-09-29 14:45:48 UTC
Add a software RX ring to the virtqueue.
Add fake_mbuf to the virtqueue for wraparound processing.
Use use_simple_rxtx to indicate whether simple rx/tx is enabled.
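
(A short sketch of how the fake_mbuf is used, matching the hunk below: the sw_ring is
over-allocated by RTE_PMD_VIRTIO_RX_MAX_BURST entries, and the tail entries all point at the
dummy mbuf so the burst RX code can copy mbuf pointers past the ring end without bounds checks.)

memset(&vq->fake_mbuf, 0, sizeof(vq->fake_mbuf));
for (i = 0; i < RTE_PMD_VIRTIO_RX_MAX_BURST; i++)
	vq->sw_ring[vq->vq_nentries + i] = &vq->fake_mbuf;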

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.c | 12 ++++++++++++
drivers/net/virtio/virtio_rxtx.c | 5 +++++
drivers/net/virtio/virtqueue.h | 6 ++++++
3 files changed, 23 insertions(+)

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 79a3640..3b7b841 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -247,6 +247,9 @@ virtio_dev_queue_release(struct virtqueue *vq) {
VIRTIO_WRITE_REG_2(hw, VIRTIO_PCI_QUEUE_SEL, vq->queue_id);
VIRTIO_WRITE_REG_4(hw, VIRTIO_PCI_QUEUE_PFN, 0);

+ if (vq->sw_ring)
+ rte_free(vq->sw_ring);
+
rte_free(vq);
vq = NULL;
}
@@ -292,6 +295,9 @@ int virtio_dev_queue_setup(struct rte_eth_dev *dev,
dev->data->port_id, queue_idx);
vq = rte_zmalloc(vq_name, sizeof(struct virtqueue) +
vq_size * sizeof(struct vq_desc_extra), RTE_CACHE_LINE_SIZE);
+ vq->sw_ring = rte_zmalloc_socket("rxq->sw_ring",
+ (RTE_PMD_VIRTIO_RX_MAX_BURST + vq_size) *
+ sizeof(vq->sw_ring[0]), RTE_CACHE_LINE_SIZE, socket_id);
} else if (queue_type == VTNET_TQ) {
snprintf(vq_name, sizeof(vq_name), "port%d_tvq%d",
dev->data->port_id, queue_idx);
@@ -308,6 +314,12 @@ int virtio_dev_queue_setup(struct rte_eth_dev *dev,
PMD_INIT_LOG(ERR, "%s: Can not allocate virtqueue", __func__);
return (-ENOMEM);
}
+ if (queue_type == VTNET_RQ && vq->sw_ring == NULL) {
+ PMD_INIT_LOG(ERR, "%s: Can not allocate RX soft ring",
+ __func__);
+ rte_free(vq);
+ return -ENOMEM;
+ }

vq->hw = hw;
vq->port_id = dev->data->port_id;
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 9324f7f..dcc4524 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -299,6 +299,11 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
/* Allocate blank mbufs for the each rx descriptor */
nbufs = 0;
error = ENOSPC;
+
+ memset(&vq->fake_mbuf, 0, sizeof(vq->fake_mbuf));
+ for (i = 0; i < RTE_PMD_VIRTIO_RX_MAX_BURST; i++)
+ vq->sw_ring[vq->vq_nentries + i] = &vq->fake_mbuf;
+
while (!virtqueue_full(vq)) {
m = rte_rxmbuf_alloc(vq->mpool);
if (m == NULL)
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 7789411..dd63285 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -190,11 +190,17 @@ struct virtqueue {
uint16_t vq_avail_idx;
phys_addr_t virtio_net_hdr_mem; /**< hdr for each xmit packet */

+ struct rte_mbuf **sw_ring; /**< RX software ring. */
+ /* dummy mbuf, for wraparound when processing RX ring. */
+ struct rte_mbuf fake_mbuf;
+
/* Statistics */
uint64_t packets;
uint64_t bytes;
uint64_t errors;

+ int use_simple_rxtx;
+
struct vq_desc_extra {
void *cookie;
uint16_t ndescs;
--
1.8.1.4
Stephen Hemminger
2015-09-29 16:15:05 UTC
On Tue, 29 Sep 2015 22:45:48 +0800
Post by Huawei Xie
+ if (vq->sw_ring)
+ rte_free(vq->sw_ring);
rte_free() of NULL is a no-op, so the conditional here is unnecessary.
Huawei Xie
2015-09-29 14:45:49 UTC
In a DPDK-based switching environment, vhost mostly runs on a dedicated core
while virtio processing in the guest VMs runs on different cores.
Take RX for example: with the generic implementation, for each guest buffer,
a) the virtio driver allocates a descriptor from the free descriptor list
b) it modifies the entry of the avail ring to point to the allocated descriptor
c) after the packet is received, it frees the descriptor

When vhost fetches the avail ring, it needs to fetch the modified L1 cache line from
the virtio core, which is a heavy cost on current CPU implementations.

The idea of this optimization is:
allocate a fixed descriptor for each entry of the avail ring,
so the avail ring stays the same during the run.
This removes the L1 cache transfer from the virtio core to the vhost core for the avail ring.
Besides, no descriptor free and allocation is needed.
This also makes vector processing possible to further accelerate the processing.

This is the layout for the avail ring (take 256 ring entries for example), with
each entry pointing to the descriptor with the same index.
avail
idx
+
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | ... | 254 | 255 | avail ring
+-+--+-+--+-+-+---------+---+--+---+
| | | | | |
| | | | | |
v v v | v v
+-+--+-+--+-+-+---------+---+--+---+
| 0 | 1 | 2 | ... | 254 | 255 | desc ring
+----+----+---+-------------+------+
|
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | | 254 | 255 | used ring
+----+----+---+-------------+------+
|
+

This is the ring layout for TX.
As we need one virtio header for each xmit packet, we have 128 slots available.

++
||
||
+-----+-----+-----+--------------+------+------+------+
| 0 | 1 | ... | 127 || 128 | 129 | ... | 255 | avail ring
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 127 | 128 | ... | 255 || 127 | 128 | ... | 255 | desc ring for virtio_net_hdr
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 0 | 1 | ... | 127 || 0 | 1 | ... | 127 | desc ring for tx dat
+-----+-----+-----+--------------+------+------+------+
||
||
++

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_rxtx.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index dcc4524..b4d268d 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -300,6 +300,12 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
nbufs = 0;
error = ENOSPC;

+ if (vq->use_simple_rxtx)
+ for (i = 0; i < vq->vq_nentries; i++) {
+ vq->vq_ring.avail->ring[i] = i;
+ vq->vq_ring.desc[i].flags = VRING_DESC_F_WRITE;
+ }
+
memset(&vq->fake_mbuf, 0, sizeof(vq->fake_mbuf));
for (i = 0; i < RTE_PMD_VIRTIO_RX_MAX_BURST; i++)
vq->sw_ring[vq->vq_nentries + i] = &vq->fake_mbuf;
@@ -330,6 +336,24 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
VIRTIO_WRITE_REG_4(vq->hw, VIRTIO_PCI_QUEUE_PFN,
vq->mz->phys_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT);
} else if (queue_type == VTNET_TQ) {
+ if (vq->use_simple_rxtx) {
+ int mid_idx = vq->vq_nentries >> 1;
+ for (i = 0; i < mid_idx; i++) {
+ vq->vq_ring.avail->ring[i] = i + mid_idx;
+ vq->vq_ring.desc[i + mid_idx].next = i;
+ vq->vq_ring.desc[i + mid_idx].addr =
+ vq->virtio_net_hdr_mem +
+ mid_idx * vq->hw->vtnet_hdr_size;
+ vq->vq_ring.desc[i + mid_idx].len =
+ vq->hw->vtnet_hdr_size;
+ vq->vq_ring.desc[i + mid_idx].flags =
+ VRING_DESC_F_NEXT;
+ vq->vq_ring.desc[i].flags = 0;
+ }
+ for (i = mid_idx; i < vq->vq_nentries; i++)
+ vq->vq_ring.avail->ring[i] = i;
+ }
+
VIRTIO_WRITE_REG_2(vq->hw, VIRTIO_PCI_QUEUE_SEL,
vq->vq_queue_index);
VIRTIO_WRITE_REG_4(vq->hw, VIRTIO_PCI_QUEUE_PFN,
--
1.8.1.4
Huawei Xie
2015-09-29 14:45:51 UTC
With the fixed avail ring, we don't need to get the desc idx from the avail ring;
the virtio driver only has to deal with the desc ring.
This patch uses vector instructions to accelerate processing of the desc ring.
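
(For reference, a scalar sketch of what each vectorized step below accomplishes per used-ring
entry; the SSE code handles four pairs of these per loop with loads, shuffles and one length
adjustment. desc_idx, rx_pkts and nb_rx are illustrative local variables, not names from the
patch.)

uint16_t desc_idx = rxvq->vq_used_cons_idx & (rxvq->vq_nentries - 1);
struct vring_used_elem *uep = &rxvq->vq_ring.used->ring[desc_idx];
struct rte_mbuf *m = rxvq->sw_ring[desc_idx];            /* fixed layout: same index */

m->pkt_len  = uep->len - sizeof(struct virtio_net_hdr);  /* strip the virtio header */
m->data_len = (uint16_t)m->pkt_len;
rx_pkts[nb_rx++] = m;                                    /* hand the mbuf to the caller */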

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.h | 2 +
drivers/net/virtio/virtio_rxtx.c | 17 +++
drivers/net/virtio/virtio_rxtx.h | 2 +
drivers/net/virtio/virtio_rxtx_simple.c | 224 ++++++++++++++++++++++++++++++++
drivers/net/virtio/virtqueue.h | 1 +
5 files changed, 246 insertions(+)

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index 9026d42..d7797ab 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -108,6 +108,8 @@ uint16_t virtio_recv_mergeable_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t nb_pkts);

+uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+ uint16_t nb_pkts);

/*
* The VIRTIO_NET_F_GUEST_TSO[46] features permit the host to send us
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index aab6724..b721336 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -430,6 +430,9 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
vq->mpool = mp;

dev->data->rx_queues[queue_idx] = vq;
+
+ virtio_rxq_vec_setup(vq);
+
return 0;
}

@@ -858,6 +861,20 @@ virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
return nb_tx;
}

+uint16_t __attribute__((weak))
+virtio_recv_pkts_vec(
+ void __rte_unused *rx_queue,
+ struct rte_mbuf __rte_unused **rx_pkts,
+ uint16_t __rte_unused nb_pkts)
+{
+ return 0;
+}
+
+int __attribute__((weak))
+virtio_rxq_vec_setup(struct virtqueue __rte_unused *rxq)
+{
+ return -1;
+}

int __attribute__((weak))
virtqueue_enqueue_recv_refill_simple(struct virtqueue __rte_unused *vq,
diff --git a/drivers/net/virtio/virtio_rxtx.h b/drivers/net/virtio/virtio_rxtx.h
index 7d2d8fe..19c871c 100644
--- a/drivers/net/virtio/virtio_rxtx.h
+++ b/drivers/net/virtio/virtio_rxtx.h
@@ -33,5 +33,7 @@

#define RTE_PMD_VIRTIO_RX_MAX_BURST 64

+int virtio_rxq_vec_setup(struct virtqueue __rte_unused *rxq);
+
int virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
struct rte_mbuf *m);
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
index cac5b9f..3d57038 100644
--- a/drivers/net/virtio/virtio_rxtx_simple.c
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -58,6 +58,10 @@
#include "virtqueue.h"
#include "virtio_rxtx.h"

+#define RTE_VIRTIO_VPMD_RX_BURST 32
+#define RTE_VIRTIO_DESC_PER_LOOP 8
+#define RTE_VIRTIO_VPMD_RX_REARM_THRESH RTE_VIRTIO_VPMD_RX_BURST
+
int __attribute__((cold))
virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
struct rte_mbuf *cookie)
@@ -82,3 +86,223 @@ virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,

return 0;
}
+
+static inline void
+virtio_rxq_rearm_vec(struct virtqueue *rxvq)
+{
+ int i;
+ uint16_t desc_idx;
+ struct rte_mbuf **sw_ring;
+ struct vring_desc *start_dp;
+ int ret;
+
+ desc_idx = rxvq->vq_avail_idx & (rxvq->vq_nentries - 1);
+ sw_ring = &rxvq->sw_ring[desc_idx];
+ start_dp = &rxvq->vq_ring.desc[desc_idx];
+
+ ret = rte_mempool_get_bulk(rxvq->mpool, (void **)sw_ring,
+ RTE_VIRTIO_VPMD_RX_REARM_THRESH);
+ if (unlikely(ret)) {
+ rte_eth_devices[rxvq->port_id].data->rx_mbuf_alloc_failed +=
+ RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ return;
+ }
+
+ for (i = 0; i < RTE_VIRTIO_VPMD_RX_REARM_THRESH; i++) {
+ uintptr_t p;
+
+ p = (uintptr_t)&sw_ring[i]->rearm_data;
+ *(uint64_t *)p = rxvq->mbuf_initializer;
+
+ start_dp[i].addr =
+ (uint64_t)((uintptr_t)sw_ring[i]->buf_physaddr +
+ RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
+ start_dp[i].len = sw_ring[i]->buf_len -
+ RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);
+ }
+
+ rxvq->vq_avail_idx += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ rxvq->vq_free_cnt -= RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ vq_update_avail_idx(rxvq);
+}
+
+/*
+ * virtio vPMD receive routine; only accepts nb_pkts >= RTE_VIRTIO_DESC_PER_LOOP.
+ *
+ * This routine is for non-mergeable RX, one desc for each guest buffer.
+ * This routine is based on the RX ring layout optimization. Each entry in the
+ * avail ring points to the desc with the same index in the desc ring and this
+ * will never be changed in the driver.
+ *
+ * - nb_pkts < RTE_VIRTIO_DESC_PER_LOOP, just return no packet
+ */
+uint16_t
+virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+ uint16_t nb_pkts)
+{
+ struct virtqueue *rxvq = rx_queue;
+ uint16_t nb_used;
+ uint16_t desc_idx;
+ struct vring_used_elem *rused;
+ struct rte_mbuf **sw_ring;
+ struct rte_mbuf **sw_ring_end;
+ uint16_t nb_pkts_received;
+ __m128i shuf_msk1, shuf_msk2, len_adjust;
+
+ shuf_msk1 = _mm_set_epi8(
+ 0xFF, 0xFF, 0xFF, 0xFF,
+ 0xFF, 0xFF, /* vlan tci */
+ 5, 4, /* dat len */
+ 0xFF, 0xFF, 5, 4, /* pkt len */
+ 0xFF, 0xFF, 0xFF, 0xFF /* packet type */
+
+ );
+
+ shuf_msk2 = _mm_set_epi8(
+ 0xFF, 0xFF, 0xFF, 0xFF,
+ 0xFF, 0xFF, /* vlan tci */
+ 13, 12, /* dat len */
+ 0xFF, 0xFF, 13, 12, /* pkt len */
+ 0xFF, 0xFF, 0xFF, 0xFF /* packet type */
+ );
+
+ /* Subtract the header length.
+ * In which case do we need the header length in used->len? */
+ len_adjust = _mm_set_epi16(
+ 0, 0,
+ 0,
+ (uint16_t) -sizeof(struct virtio_net_hdr),
+ 0, (uint16_t) -sizeof(struct virtio_net_hdr),
+ 0, 0);
+
+ if (unlikely(nb_pkts < RTE_VIRTIO_DESC_PER_LOOP))
+ return 0;
+
+ nb_used = *(volatile uint16_t *)&rxvq->vq_ring.used->idx -
+ rxvq->vq_used_cons_idx;
+
+ rte_compiler_barrier();
+
+ if (unlikely(nb_used == 0))
+ return 0;
+
+ nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, RTE_VIRTIO_DESC_PER_LOOP);
+ nb_used = RTE_MIN(nb_used, nb_pkts);
+
+ desc_idx = (uint16_t)(rxvq->vq_used_cons_idx & (rxvq->vq_nentries - 1));
+ rused = &rxvq->vq_ring.used->ring[desc_idx];
+ sw_ring = &rxvq->sw_ring[desc_idx];
+ sw_ring_end = &rxvq->sw_ring[rxvq->vq_nentries];
+
+ _mm_prefetch((const void *)rused, _MM_HINT_T0);
+
+ if (rxvq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
+ virtio_rxq_rearm_vec(rxvq);
+ if (unlikely(virtqueue_kick_prepare(rxvq)))
+ virtqueue_notify(rxvq);
+ }
+
+ for (nb_pkts_received = 0;
+ nb_pkts_received < nb_used;) {
+ __m128i desc[RTE_VIRTIO_DESC_PER_LOOP / 2];
+ __m128i mbp[RTE_VIRTIO_DESC_PER_LOOP / 2];
+ __m128i pkt_mb[RTE_VIRTIO_DESC_PER_LOOP];
+
+ mbp[0] = _mm_loadu_si128((__m128i *)(sw_ring + 0));
+ desc[0] = _mm_loadu_si128((__m128i *)(rused + 0));
+ _mm_storeu_si128((__m128i *)&rx_pkts[0], mbp[0]);
+
+ mbp[1] = _mm_loadu_si128((__m128i *)(sw_ring + 2));
+ desc[1] = _mm_loadu_si128((__m128i *)(rused + 2));
+ _mm_storeu_si128((__m128i *)&rx_pkts[2], mbp[1]);
+
+ mbp[2] = _mm_loadu_si128((__m128i *)(sw_ring + 4));
+ desc[2] = _mm_loadu_si128((__m128i *)(rused + 4));
+ _mm_storeu_si128((__m128i *)&rx_pkts[4], mbp[2]);
+
+ mbp[3] = _mm_loadu_si128((__m128i *)(sw_ring + 6));
+ desc[3] = _mm_loadu_si128((__m128i *)(rused + 6));
+ _mm_storeu_si128((__m128i *)&rx_pkts[6], mbp[3]);
+
+ pkt_mb[1] = _mm_shuffle_epi8(desc[0], shuf_msk2);
+ pkt_mb[0] = _mm_shuffle_epi8(desc[0], shuf_msk1);
+ pkt_mb[1] = _mm_add_epi16(pkt_mb[1], len_adjust);
+ pkt_mb[0] = _mm_add_epi16(pkt_mb[0], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[1]->rx_descriptor_fields1,
+ pkt_mb[1]);
+ _mm_storeu_si128((void *)&rx_pkts[0]->rx_descriptor_fields1,
+ pkt_mb[0]);
+
+ pkt_mb[3] = _mm_shuffle_epi8(desc[1], shuf_msk2);
+ pkt_mb[2] = _mm_shuffle_epi8(desc[1], shuf_msk1);
+ pkt_mb[3] = _mm_add_epi16(pkt_mb[3], len_adjust);
+ pkt_mb[2] = _mm_add_epi16(pkt_mb[2], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[3]->rx_descriptor_fields1,
+ pkt_mb[3]);
+ _mm_storeu_si128((void *)&rx_pkts[2]->rx_descriptor_fields1,
+ pkt_mb[2]);
+
+ pkt_mb[5] = _mm_shuffle_epi8(desc[2], shuf_msk2);
+ pkt_mb[4] = _mm_shuffle_epi8(desc[2], shuf_msk1);
+ pkt_mb[5] = _mm_add_epi16(pkt_mb[5], len_adjust);
+ pkt_mb[4] = _mm_add_epi16(pkt_mb[4], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[5]->rx_descriptor_fields1,
+ pkt_mb[5]);
+ _mm_storeu_si128((void *)&rx_pkts[4]->rx_descriptor_fields1,
+ pkt_mb[4]);
+
+ pkt_mb[7] = _mm_shuffle_epi8(desc[3], shuf_msk2);
+ pkt_mb[6] = _mm_shuffle_epi8(desc[3], shuf_msk1);
+ pkt_mb[7] = _mm_add_epi16(pkt_mb[7], len_adjust);
+ pkt_mb[6] = _mm_add_epi16(pkt_mb[6], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[7]->rx_descriptor_fields1,
+ pkt_mb[7]);
+ _mm_storeu_si128((void *)&rx_pkts[6]->rx_descriptor_fields1,
+ pkt_mb[6]);
+
+ if (unlikely(nb_used <= RTE_VIRTIO_DESC_PER_LOOP)) {
+ if (sw_ring + nb_used <= sw_ring_end)
+ nb_pkts_received += nb_used;
+ else
+ nb_pkts_received += sw_ring_end - sw_ring;
+ break;
+ } else {
+ if (unlikely(sw_ring + RTE_VIRTIO_DESC_PER_LOOP >=
+ sw_ring_end)) {
+ nb_pkts_received += sw_ring_end - sw_ring;
+ break;
+ } else {
+ nb_pkts_received += RTE_VIRTIO_DESC_PER_LOOP;
+
+ rx_pkts += RTE_VIRTIO_DESC_PER_LOOP;
+ sw_ring += RTE_VIRTIO_DESC_PER_LOOP;
+ rused += RTE_VIRTIO_DESC_PER_LOOP;
+ nb_used -= RTE_VIRTIO_DESC_PER_LOOP;
+ }
+ }
+ }
+
+ rxvq->vq_used_cons_idx += nb_pkts_received;
+ rxvq->vq_free_cnt += nb_pkts_received;
+ rxvq->packets += nb_pkts_received;
+ return nb_pkts_received;
+}
+
+int __attribute__((cold))
+virtio_rxq_vec_setup(struct virtqueue *rxq)
+{
+ uintptr_t p;
+ struct rte_mbuf mb_def = { .buf_addr = 0 }; /* zeroed mbuf */
+
+ mb_def.nb_segs = 1;
+ mb_def.data_off = RTE_PKTMBUF_HEADROOM;
+ mb_def.port = rxq->port_id;
+ rte_mbuf_refcnt_set(&mb_def, 1);
+
+ /* prevent compiler reordering: rearm_data covers previous fields */
+ rte_compiler_barrier();
+ p = (uintptr_t)&mb_def.rearm_data;
+ rxq->mbuf_initializer = *(uint64_t *)p;
+
+ return 0;
+}
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index dd63285..363fb99 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -188,6 +188,7 @@ struct virtqueue {
*/
uint16_t vq_used_cons_idx;
uint16_t vq_avail_idx;
+ uint64_t mbuf_initializer; /**< value to init mbufs. */
phys_addr_t virtio_net_hdr_mem; /**< hdr for each xmit packet */

struct rte_mbuf **sw_ring; /**< RX software ring. */
--
1.8.1.4
Huawei Xie
2015-09-29 14:45:52 UTC
Bulk free of mbufs when cleaning the used ring.
The shift operation on idx could further be saved if vq_free_cnt meant
free slots rather than free descriptors.
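
(The cleanup routine below groups consecutive transmitted mbufs by mempool so each group can be
returned with a single rte_mempool_put_bulk() call; a simplified sketch of that grouping, using
the same names as the patch:)

struct rte_mbuf *m, *free[VIRTIO_TX_MAX_FREE_BUF_SZ];
int nb_free = 0;

for (i = 0; i < VIRTIO_TX_FREE_NR; i++) {
	m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
	if (nb_free && m->pool != free[0]->pool) {  /* pool changed: flush the batch */
		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
		nb_free = 0;
	}
	free[nb_free++] = m;
}
rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);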

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.h | 3 ++
drivers/net/virtio/virtio_rxtx.c | 9 ++++
drivers/net/virtio/virtio_rxtx_simple.c | 95 +++++++++++++++++++++++++++++++++
3 files changed, 107 insertions(+)

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index d7797ab..ae2d47d 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -111,6 +111,9 @@ uint16_t virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t nb_pkts);

+uint16_t virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
+ uint16_t nb_pkts);
+
/*
* The VIRTIO_NET_F_GUEST_TSO[46] features permit the host to send us
* frames larger than 1514 bytes. We do not yet support software LRO
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index b721336..328bb7d 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -870,6 +870,15 @@ virtio_recv_pkts_vec(
return 0;
}

+uint16_t __attribute__((weak))
+virtio_xmit_pkts_simple(
+ void __rte_unused *tx_queue,
+ struct rte_mbuf __rte_unused **tx_pkts,
+ uint16_t __rte_unused nb_pkts)
+{
+ return 0;
+}
+
int __attribute__((weak))
virtio_rxq_vec_setup(struct virtqueue __rte_unused *rxq)
{
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
index 3d57038..d1eed79 100644
--- a/drivers/net/virtio/virtio_rxtx_simple.c
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -288,6 +288,101 @@ virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
return nb_pkts_received;
}

+#define VIRTIO_TX_FREE_THRESH 32
+#define VIRTIO_TX_MAX_FREE_BUF_SZ 32
+#define VIRTIO_TX_FREE_NR 32
+/* TODO: vq->tx_free_cnt could mean num of free slots so we could avoid shift */
+static inline void __attribute__((always_inline))
+virtio_xmit_cleanup(struct virtqueue *vq)
+{
+ uint16_t i, desc_idx;
+ int nb_free = 0;
+ struct rte_mbuf *m, *free[VIRTIO_TX_MAX_FREE_BUF_SZ];
+
+ desc_idx = (uint16_t)(vq->vq_used_cons_idx &
+ ((vq->vq_nentries >> 1) - 1));
+ free[0] = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ nb_free = 1;
+
+ for (i = 1; i < VIRTIO_TX_FREE_NR; i++) {
+ m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ if (likely(m->pool == free[0]->pool))
+ free[nb_free++] = m;
+ else {
+ rte_mempool_put_bulk(free[0]->pool, (void **)free,
+ nb_free);
+ free[0] = m;
+ nb_free = 1;
+ }
+ }
+
+ rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+ vq->vq_used_cons_idx += VIRTIO_TX_FREE_NR;
+ vq->vq_free_cnt += (VIRTIO_TX_FREE_NR << 1);
+
+ return;
+}
+
+uint16_t
+virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
+ uint16_t nb_pkts)
+{
+ struct virtqueue *txvq = tx_queue;
+ uint16_t nb_used;
+ uint16_t desc_idx;
+ struct vring_desc *start_dp;
+ uint16_t nb_tail, nb_commit;
+ int i;
+ uint16_t desc_idx_max = (txvq->vq_nentries >> 1) - 1;
+
+ nb_used = VIRTQUEUE_NUSED(txvq);
+ rte_compiler_barrier();
+
+ nb_commit = nb_pkts = RTE_MIN((txvq->vq_free_cnt >> 1), nb_pkts);
+ desc_idx = (uint16_t) (txvq->vq_avail_idx & desc_idx_max);
+ start_dp = txvq->vq_ring.desc;
+ nb_tail = (uint16_t) (desc_idx_max + 1 - desc_idx);
+
+ if (nb_used >= VIRTIO_TX_FREE_THRESH)
+ virtio_xmit_cleanup(tx_queue);
+
+ if (nb_commit >= nb_tail) {
+ for (i = 0; i < nb_tail; i++)
+ txvq->vq_descx[desc_idx + i].cookie = tx_pkts[i];
+ for (i = 0; i < nb_tail; i++) {
+ start_dp[desc_idx].addr =
+ RTE_MBUF_DATA_DMA_ADDR(*tx_pkts);
+ start_dp[desc_idx].len = (*tx_pkts)->pkt_len;
+ tx_pkts++;
+ desc_idx++;
+ }
+ nb_commit -= nb_tail;
+ desc_idx = 0;
+ }
+ for (i = 0; i < nb_commit; i++)
+ txvq->vq_descx[desc_idx + i].cookie = tx_pkts[i];
+ for (i = 0; i < nb_commit; i++) {
+ start_dp[desc_idx].addr = RTE_MBUF_DATA_DMA_ADDR(*tx_pkts);
+ start_dp[desc_idx].len = (*tx_pkts)->pkt_len;
+ tx_pkts++;
+ desc_idx++;
+ }
+
+ rte_compiler_barrier();
+
+ txvq->vq_free_cnt -= (uint16_t)(nb_pkts << 1);
+ txvq->vq_avail_idx += nb_pkts;
+ txvq->vq_ring.avail->idx = txvq->vq_avail_idx;
+ txvq->packets += nb_pkts;
+
+ if (likely(nb_pkts)) {
+ if (unlikely(virtqueue_kick_prepare(txvq)))
+ virtqueue_notify(txvq);
+ }
+
+ return nb_pkts;
+}
+
int __attribute__((cold))
virtio_rxq_vec_setup(struct virtqueue *rxq)
{
--
1.8.1.4
Huawei Xie
2015-09-29 14:45:50 UTC
Fill the avail ring with blank mbufs in virtio_dev_vring_start.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/Makefile | 2 +-
drivers/net/virtio/virtio_rxtx.c | 14 +++++-
drivers/net/virtio/virtio_rxtx.h | 3 ++
drivers/net/virtio/virtio_rxtx_simple.c | 84 +++++++++++++++++++++++++++++++++
4 files changed, 100 insertions(+), 3 deletions(-)
create mode 100644 drivers/net/virtio/virtio_rxtx_simple.c

diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 930b60f..40f8e8d 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -50,7 +50,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtqueue.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_pci.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
-
+SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_SIMPLE) += virtio_rxtx_simple.c

# this lib depends upon:
DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_eal lib/librte_ether
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index b4d268d..aab6724 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -318,8 +318,10 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
/******************************************
* Enqueue allocated buffers *
*******************************************/
- error = virtqueue_enqueue_recv_refill(vq, m);
-
+ if (vq->use_simple_rxtx)
+ error = virtqueue_enqueue_recv_refill_simple(vq, m);
+ else
+ error = virtqueue_enqueue_recv_refill(vq, m);
if (error) {
rte_pktmbuf_free(m);
break;
@@ -855,3 +857,11 @@ virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)

return nb_tx;
}
+
+
+int __attribute__((weak))
+virtqueue_enqueue_recv_refill_simple(struct virtqueue __rte_unused *vq,
+ struct rte_mbuf __rte_unused *m)
+{
+ return -1;
+}
diff --git a/drivers/net/virtio/virtio_rxtx.h b/drivers/net/virtio/virtio_rxtx.h
index a10aa69..7d2d8fe 100644
--- a/drivers/net/virtio/virtio_rxtx.h
+++ b/drivers/net/virtio/virtio_rxtx.h
@@ -32,3 +32,6 @@
*/

#define RTE_PMD_VIRTIO_RX_MAX_BURST 64
+
+int virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
+ struct rte_mbuf *m);
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
new file mode 100644
index 0000000..cac5b9f
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -0,0 +1,84 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <tmmintrin.h>
+
+#include <rte_cycles.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_branch_prediction.h>
+#include <rte_mempool.h>
+#include <rte_malloc.h>
+#include <rte_mbuf.h>
+#include <rte_ether.h>
+#include <rte_ethdev.h>
+#include <rte_prefetch.h>
+#include <rte_string_fns.h>
+#include <rte_errno.h>
+#include <rte_byteorder.h>
+
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtqueue.h"
+#include "virtio_rxtx.h"
+
+int __attribute__((cold))
+virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
+ struct rte_mbuf *cookie)
+{
+ struct vq_desc_extra *dxp;
+ struct vring_desc *start_dp;
+ uint16_t desc_idx;
+
+ desc_idx = vq->vq_avail_idx & (vq->vq_nentries - 1);
+ dxp = &vq->vq_descx[desc_idx];
+ dxp->cookie = (void *)cookie;
+ vq->sw_ring[desc_idx] = cookie;
+
+ start_dp = vq->vq_ring.desc;
+ start_dp[desc_idx].addr = (uint64_t)((uintptr_t)cookie->buf_physaddr +
+ RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
+ start_dp[desc_idx].len = cookie->buf_len -
+ RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);
+
+ vq->vq_free_cnt--;
+ vq->vq_avail_idx++;
+
+ return 0;
+}
--
1.8.1.4
Huawei Xie
2015-09-29 14:45:53 UTC
Select the simple rx/tx path when mergeable RX isn't enabled and no
offload flags are specified.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 3b7b841..a08529c 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -369,6 +369,8 @@ int virtio_dev_queue_setup(struct rte_eth_dev *dev,
vq->virtio_net_hdr_mz = NULL;
vq->virtio_net_hdr_mem = 0;

+ vq->use_simple_rxtx = (dev->rx_pkt_burst == virtio_recv_pkts_vec);
+
if (queue_type == VTNET_TQ) {
/*
* For each xmit packet, allocate a virtio_net_hdr
@@ -1156,13 +1158,21 @@ virtio_interrupt_handler(__rte_unused struct rte_intr_handle *handle,
}

static void
-rx_func_get(struct rte_eth_dev *eth_dev)
+rxtx_func_get(struct rte_eth_dev *eth_dev)
{
struct virtio_hw *hw = eth_dev->data->dev_private;
+
if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF))
eth_dev->rx_pkt_burst = &virtio_recv_mergeable_pkts;
else
eth_dev->rx_pkt_burst = &virtio_recv_pkts;
+
+#ifdef RTE_LIBRTE_VIRTIO_SIMPLE
+ if (!vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+ eth_dev->rx_pkt_burst = &virtio_recv_pkts_vec;
+ eth_dev->tx_pkt_burst = &virtio_xmit_pkts_simple;
+ }
+#endif
}

/*
@@ -1184,7 +1194,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
eth_dev->tx_pkt_burst = &virtio_xmit_pkts;

if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
- rx_func_get(eth_dev);
+ rxtx_func_get(eth_dev);
return 0;
}

@@ -1214,7 +1224,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
vtpci_set_status(hw, VIRTIO_CONFIG_STATUS_DRIVER);
virtio_negotiate_features(hw);

- rx_func_get(eth_dev);
+ rxtx_func_get(eth_dev);

/* Setting up rx_header size for the device */
if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF))
--
1.8.1.4
Xie, Huawei
2015-09-29 15:41:41 UTC
Thomas:
Let us review first, then discuss the macro after that.
My preference is to use the configure macro and then fix this before the next
release. We could give some development features or aggressive changes a
time buffer.
Post by Huawei Xie
Copied some message from patch 4.
In a DPDK-based switching environment, vhost mostly runs on a dedicated core
while virtio processing in the guest VMs runs on different cores.
Take RX for example: with the generic implementation, for each guest buffer,
a) the virtio driver allocates a descriptor from the free descriptor list
b) it modifies the entry of the avail ring to point to the allocated descriptor
c) after the packet is received, it frees the descriptor
When vhost fetches the avail ring, it needs to fetch the modified L1 cache line from
the virtio core, which is a heavy cost on current CPU implementations.
allocate a fixed descriptor for each entry of the avail ring,
so the avail ring stays the same during the run.
This removes the L1 cache transfer from the virtio core to the vhost core for the avail ring.
Besides, no descriptor free and allocation is needed.
Most importantly, this makes vector processing possible to further accelerate
the processing.
This is the layout for the avail ring (take 256 ring entries for example), with
each entry pointing to the descriptor with the same index.
avail
idx
+
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | ... | 254 | 255 | avail ring
+-+--+-+--+-+-+---------+---+--+---+
| | | | | |
| | | | | |
v v v | v v
+-+--+-+--+-+-+---------+---+--+---+
| 0 | 1 | 2 | ... | 254 | 255 | desc ring
+----+----+---+-------------+------+
|
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | | 254 | 255 | used ring
+----+----+---+-------------+------+
|
+
This is the ring layout for TX.
As one virtio header is needed for each xmit packet, we have 128 slots available.
++
||
||
+-----+-----+-----+--------------+------+------+------+
| 0 | 1 | ... | 127 || 128 | 129 | ... | 255 | avail ring
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 127 | 128 | ... | 255 || 127 | 128 | ... | 255 | desc ring for virtio_net_hdr
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 0 | 1 | ... | 127 || 0 | 1 | ... | 127 | desc ring for tx dat
+-----+-----+-----+--------------+------+------+------+
||
||
++
A performance boost can be observed if the virtio backend isn't the bottleneck, or in the VM2VM
case.
There are also several vhost optimization patches to be submitted later.
virtio: add configure for simple virtio rx/tx
virtio: add virtio_rxtx.h header file
virtio: add software rx ring, fake_buf, simple_rxtx into virtqueue
virtio: rx/tx ring layout optimization
virtio: fill RX avail ring with blank mbufs
virtio: virtio vec rx
virtio: simple tx routine
virtio: rxtx_func_get
config/common_linuxapp | 1 +
drivers/net/virtio/Makefile | 2 +-
drivers/net/virtio/virtio_ethdev.c | 29 ++-
drivers/net/virtio/virtio_ethdev.h | 5 +
drivers/net/virtio/virtio_rxtx.c | 70 +++++-
drivers/net/virtio/virtio_rxtx.h | 39 ++++
drivers/net/virtio/virtio_rxtx_simple.c | 403 ++++++++++++++++++++++++++++++++
drivers/net/virtio/virtqueue.h | 7 +
8 files changed, 550 insertions(+), 6 deletions(-)
create mode 100644 drivers/net/virtio/virtio_rxtx.h
create mode 100644 drivers/net/virtio/virtio_rxtx_simple.c
Huawei Xie
2015-10-18 06:28:25 UTC
In a DPDK-based switching environment, vhost mostly runs on a dedicated core
while virtio processing in the guest VMs runs on different cores.
Take RX for example: with the generic implementation, for each guest buffer,
a) the virtio driver allocates a descriptor from the free descriptor list
b) it modifies the entry of the avail ring to point to the allocated descriptor
c) after the packet is received, it frees the descriptor

When vhost fetches the avail ring, it needs to fetch the modified L1 cache line from
the virtio core, which is a heavy cost on current CPU implementations.

The idea of this optimization is:
allocate a fixed descriptor for each entry of the avail ring,
so the avail ring stays the same during the run.
This removes the L1 cache transfer from the virtio core to the vhost core for the avail ring.
Besides, no descriptor free and allocation is needed.

Most importantly, this makes vector processing possible to further accelerate
the processing.

This is the layout for the avail ring (take 256 ring entries for example), with
each entry pointing to the descriptor with the same index.
avail
idx
+
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | ... | 254 | 255 | avail ring
+-+--+-+--+-+-+---------+---+--+---+
| | | | | |
| | | | | |
v v v | v v
+-+--+-+--+-+-+---------+---+--+---+
| 0 | 1 | 2 | ... | 254 | 255 | desc ring
+----+----+---+-------------+------+
|
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | | 254 | 255 | used ring
+----+----+---+-------------+------+
|
+

This is the ring layout for TX.
As we need one virtio header for each xmit packet, we have 128 slots available.

++
||
||
+-----+-----+-----+--------------+------+------+------+
| 0 | 1 | ... | 127 || 128 | 129 | ... | 255 | avail ring
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 127 | 128 | ... | 255 || 127 | 128 | ... | 255 | desc ring for virtio_net_hdr
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 0 | 1 | ... | 127 || 0 | 1 | ... | 127 | desc ring for tx dat
+-----+-----+-----+--------------+------+------+------+
||
||
++

A performance boost can be observed only if the virtio backend isn't the bottleneck, or in the VM2VM
case.
There are also several vhost optimization patches to be submitted later.

Changes in v2:
- Remove the configure macro
- Enable simple RX/TX processing when the user specifies simple txq flags
- Reword some comments and commit messages

Huawei Xie (7):
virtio: add virtio_rxtx.h header file
virtio: add software rx ring, fake_buf into virtqueue
virtio: rx/tx ring layout optimization
virtio: fill RX avail ring with blank mbufs
virtio: virtio vec rx
virtio: simple tx routine
virtio: pick simple rx/tx func

drivers/net/virtio/Makefile | 2 +-
drivers/net/virtio/virtio_ethdev.c | 13 ++
drivers/net/virtio/virtio_ethdev.h | 5 +
drivers/net/virtio/virtio_rxtx.c | 53 ++++-
drivers/net/virtio/virtio_rxtx.h | 39 ++++
drivers/net/virtio/virtio_rxtx_simple.c | 403 ++++++++++++++++++++++++++++++++
drivers/net/virtio/virtqueue.h | 5 +
7 files changed, 517 insertions(+), 3 deletions(-)
create mode 100644 drivers/net/virtio/virtio_rxtx.h
create mode 100644 drivers/net/virtio/virtio_rxtx_simple.c
--
1.8.1.4
Huawei Xie
2015-10-18 06:28:57 UTC
In a DPDK-based switching environment, vhost mostly runs on a dedicated core
while virtio processing in the guest VMs runs on different cores.
Take RX for example: with the generic implementation, for each guest buffer,
a) the virtio driver allocates a descriptor from the free descriptor list
b) it modifies the entry of the avail ring to point to the allocated descriptor
c) after the packet is received, it frees the descriptor

When vhost fetches the avail ring, it needs to fetch the modified L1 cache line from
the virtio core, which is a heavy cost on current CPU implementations.

The idea of this optimization is:
allocate a fixed descriptor for each entry of the avail ring,
so the avail ring stays the same during the run.
This removes the L1 cache transfer from the virtio core to the vhost core for the avail ring.
Besides, no descriptor free and allocation is needed.

Most importantly, this makes vector processing possible to further accelerate
the processing.

This is the layout for the avail ring (take 256 ring entries for example), with
each entry pointing to the descriptor with the same index.
avail
idx
+
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | ... | 254 | 255 | avail ring
+-+--+-+--+-+-+---------+---+--+---+
| | | | | |
| | | | | |
v v v | v v
+-+--+-+--+-+-+---------+---+--+---+
| 0 | 1 | 2 | ... | 254 | 255 | desc ring
+----+----+---+-------------+------+
|
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | | 254 | 255 | used ring
+----+----+---+-------------+------+
|
+

This is the ring layout for TX.
As we need one virtio header for each xmit packet, we have 128 slots available.

++
||
||
+-----+-----+-----+--------------+------+------+------+
| 0 | 1 | ... | 127 || 128 | 129 | ... | 255 | avail ring
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 127 | 128 | ... | 255 || 127 | 128 | ... | 255 | desc ring for virtio_net_hdr
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 0 | 1 | ... | 127 || 0 | 1 | ... | 127 | desc ring for tx dat
+-----+-----+-----+--------------+------+------+------+
||
||
++

A performance boost can be observed only if the virtio backend isn't the bottleneck, or in the VM2VM
case.
There are also several vhost optimization patches to be submitted later.

Changes in v2:
- Remove the configure macro
- Enable simple RX/TX processing when the user specifies simple txq flags
- Reword some comments and commit messages

Huawei Xie (7):
virtio: add virtio_rxtx.h header file
virtio: add software rx ring, fake_buf into virtqueue
virtio: rx/tx ring layout optimization
virtio: fill RX avail ring with blank mbufs
virtio: virtio vec rx
virtio: simple tx routine
virtio: pick simple rx/tx func

drivers/net/virtio/Makefile | 2 +-
drivers/net/virtio/virtio_ethdev.c | 13 ++
drivers/net/virtio/virtio_ethdev.h | 5 +
drivers/net/virtio/virtio_rxtx.c | 53 ++++-
drivers/net/virtio/virtio_rxtx.h | 39 ++++
drivers/net/virtio/virtio_rxtx_simple.c | 403 ++++++++++++++++++++++++++++++++
drivers/net/virtio/virtqueue.h | 5 +
7 files changed, 517 insertions(+), 3 deletions(-)
create mode 100644 drivers/net/virtio/virtio_rxtx.h
create mode 100644 drivers/net/virtio/virtio_rxtx_simple.c
--
1.8.1.4
Huawei Xie
2015-10-18 06:29:00 UTC
In a DPDK-based switching environment, vhost mostly runs on a dedicated core
while virtio processing in the guest VMs runs on different cores.
Take RX for example: with the generic implementation, for each guest buffer,
a) the virtio driver allocates a descriptor from the free descriptor list
b) it modifies the entry of the avail ring to point to the allocated descriptor
c) after the packet is received, it frees the descriptor

When vhost fetches the avail ring, it needs to fetch the modified L1 cache line from
the virtio core, which is a heavy cost on current CPU implementations.

The idea of this optimization is:
allocate a fixed descriptor for each entry of the avail ring,
so the avail ring stays the same during the run.
This removes the L1 cache transfer from the virtio core to the vhost core for the avail ring.
Besides, no descriptor free and allocation is needed.
This also makes vector processing possible to further accelerate the processing.

This is the layout for the avail ring (take 256 ring entries for example), with
each entry pointing to the descriptor with the same index.
avail
idx
+
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | ... | 254 | 255 | avail ring
+-+--+-+--+-+-+---------+---+--+---+
| | | | | |
| | | | | |
v v v | v v
+-+--+-+--+-+-+---------+---+--+---+
| 0 | 1 | 2 | ... | 254 | 255 | desc ring
+----+----+---+-------------+------+
|
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | | 254 | 255 | used ring
+----+----+---+-------------+------+
|
+

This is the ring layout for TX.
As we need one virtio header for each xmit packet, we have 128 slots available.

++
||
||
+-----+-----+-----+--------------+------+------+------+
| 0 | 1 | ... | 127 || 128 | 129 | ... | 255 | avail ring
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 127 | 128 | ... | 255 || 127 | 128 | ... | 255 | desc ring for virtio_net_hdr
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 0 | 1 | ... | 127 || 0 | 1 | ... | 127 | desc ring for tx dat
+-----+-----+-----+--------------+------+------+------+
||
||
++

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_rxtx.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 5c00e9d..7c82a6a 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -302,6 +302,12 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
nbufs = 0;
error = ENOSPC;

+ if (use_simple_rxtx)
+ for (i = 0; i < vq->vq_nentries; i++) {
+ vq->vq_ring.avail->ring[i] = i;
+ vq->vq_ring.desc[i].flags = VRING_DESC_F_WRITE;
+ }
+
memset(&vq->fake_mbuf, 0, sizeof(vq->fake_mbuf));
for (i = 0; i < RTE_PMD_VIRTIO_RX_MAX_BURST; i++)
vq->sw_ring[vq->vq_nentries + i] = &vq->fake_mbuf;
@@ -332,6 +338,24 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
VIRTIO_WRITE_REG_4(vq->hw, VIRTIO_PCI_QUEUE_PFN,
vq->mz->phys_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT);
} else if (queue_type == VTNET_TQ) {
+ if (use_simple_rxtx) {
+ int mid_idx = vq->vq_nentries >> 1;
+ for (i = 0; i < mid_idx; i++) {
+ vq->vq_ring.avail->ring[i] = i + mid_idx;
+ vq->vq_ring.desc[i + mid_idx].next = i;
+ vq->vq_ring.desc[i + mid_idx].addr =
+ vq->virtio_net_hdr_mem +
+ mid_idx * vq->hw->vtnet_hdr_size;
+ vq->vq_ring.desc[i + mid_idx].len =
+ vq->hw->vtnet_hdr_size;
+ vq->vq_ring.desc[i + mid_idx].flags =
+ VRING_DESC_F_NEXT;
+ vq->vq_ring.desc[i].flags = 0;
+ }
+ for (i = mid_idx; i < vq->vq_nentries; i++)
+ vq->vq_ring.avail->ring[i] = i;
+ }
+
VIRTIO_WRITE_REG_2(vq->hw, VIRTIO_PCI_QUEUE_SEL,
vq->vq_queue_index);
VIRTIO_WRITE_REG_4(vq->hw, VIRTIO_PCI_QUEUE_PFN,
--
1.8.1.4
Huawei Xie
2015-10-18 06:29:01 UTC
Fill the avail ring with blank mbufs in virtio_dev_vring_start.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/Makefile | 2 +-
drivers/net/virtio/virtio_rxtx.c | 6 ++-
drivers/net/virtio/virtio_rxtx.h | 3 ++
drivers/net/virtio/virtio_rxtx_simple.c | 84 +++++++++++++++++++++++++++++++++
4 files changed, 92 insertions(+), 3 deletions(-)
create mode 100644 drivers/net/virtio/virtio_rxtx_simple.c

diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 930b60f..43835ba 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -50,7 +50,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtqueue.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_pci.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
-
+SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c

# this lib depends upon:
DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_eal lib/librte_ether
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 7c82a6a..5162ce6 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -320,8 +320,10 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
/******************************************
* Enqueue allocated buffers *
*******************************************/
- error = virtqueue_enqueue_recv_refill(vq, m);
-
+ if (use_simple_rxtx)
+ error = virtqueue_enqueue_recv_refill_simple(vq, m);
+ else
+ error = virtqueue_enqueue_recv_refill(vq, m);
if (error) {
rte_pktmbuf_free(m);
break;
diff --git a/drivers/net/virtio/virtio_rxtx.h b/drivers/net/virtio/virtio_rxtx.h
index a10aa69..7d2d8fe 100644
--- a/drivers/net/virtio/virtio_rxtx.h
+++ b/drivers/net/virtio/virtio_rxtx.h
@@ -32,3 +32,6 @@
*/

#define RTE_PMD_VIRTIO_RX_MAX_BURST 64
+
+int virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
+ struct rte_mbuf *m);
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
new file mode 100644
index 0000000..cac5b9f
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -0,0 +1,84 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <tmmintrin.h>
+
+#include <rte_cycles.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_branch_prediction.h>
+#include <rte_mempool.h>
+#include <rte_malloc.h>
+#include <rte_mbuf.h>
+#include <rte_ether.h>
+#include <rte_ethdev.h>
+#include <rte_prefetch.h>
+#include <rte_string_fns.h>
+#include <rte_errno.h>
+#include <rte_byteorder.h>
+
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtqueue.h"
+#include "virtio_rxtx.h"
+
+int __attribute__((cold))
+virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
+ struct rte_mbuf *cookie)
+{
+ struct vq_desc_extra *dxp;
+ struct vring_desc *start_dp;
+ uint16_t desc_idx;
+
+ desc_idx = vq->vq_avail_idx & (vq->vq_nentries - 1);
+ dxp = &vq->vq_descx[desc_idx];
+ dxp->cookie = (void *)cookie;
+ vq->sw_ring[desc_idx] = cookie;
+
+ start_dp = vq->vq_ring.desc;
+ start_dp[desc_idx].addr = (uint64_t)((uintptr_t)cookie->buf_physaddr +
+ RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
+ start_dp[desc_idx].len = cookie->buf_len -
+ RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);
+
+ vq->vq_free_cnt--;
+ vq->vq_avail_idx++;
+
+ return 0;
+}
--
1.8.1.4
Huawei Xie
2015-10-18 06:28:58 UTC
Will move all rx/tx related code into this header file in the future.
Add RTE_PMD_VIRTIO_RX_MAX_BURST.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.c | 1 +
drivers/net/virtio/virtio_rxtx.c | 1 +
drivers/net/virtio/virtio_rxtx.h | 34 ++++++++++++++++++++++++++++++++++
3 files changed, 36 insertions(+)
create mode 100644 drivers/net/virtio/virtio_rxtx.h

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 465d3cd..79a3640 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -61,6 +61,7 @@
#include "virtio_pci.h"
#include "virtio_logs.h"
#include "virtqueue.h"
+#include "virtio_rxtx.h"


static int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index c5b53bb..9324f7f 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -54,6 +54,7 @@
#include "virtio_logs.h"
#include "virtio_ethdev.h"
#include "virtqueue.h"
+#include "virtio_rxtx.h"

#ifdef RTE_LIBRTE_VIRTIO_DEBUG_DUMP
#define VIRTIO_DUMP_PACKET(m, len) rte_pktmbuf_dump(stdout, m, len)
diff --git a/drivers/net/virtio/virtio_rxtx.h b/drivers/net/virtio/virtio_rxtx.h
new file mode 100644
index 0000000..a10aa69
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx.h
@@ -0,0 +1,34 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#define RTE_PMD_VIRTIO_RX_MAX_BURST 64
--
1.8.1.4
Huawei Xie
2015-10-18 06:28:59 UTC
Permalink
Add software RX ring in virtqueue.
Add fake_mbuf in virtqueue for wraparound processing.
Use the global use_simple_rxtx flag to indicate whether the simple rx/tx path is enabled.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.c | 12 ++++++++++++
drivers/net/virtio/virtio_rxtx.c | 7 +++++++
drivers/net/virtio/virtqueue.h | 4 ++++
3 files changed, 23 insertions(+)

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 79a3640..3b7b841 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -247,6 +247,9 @@ virtio_dev_queue_release(struct virtqueue *vq) {
VIRTIO_WRITE_REG_2(hw, VIRTIO_PCI_QUEUE_SEL, vq->queue_id);
VIRTIO_WRITE_REG_4(hw, VIRTIO_PCI_QUEUE_PFN, 0);

+ if (vq->sw_ring)
+ rte_free(vq->sw_ring);
+
rte_free(vq);
vq = NULL;
}
@@ -292,6 +295,9 @@ int virtio_dev_queue_setup(struct rte_eth_dev *dev,
dev->data->port_id, queue_idx);
vq = rte_zmalloc(vq_name, sizeof(struct virtqueue) +
vq_size * sizeof(struct vq_desc_extra), RTE_CACHE_LINE_SIZE);
+ vq->sw_ring = rte_zmalloc_socket("rxq->sw_ring",
+ (RTE_PMD_VIRTIO_RX_MAX_BURST + vq_size) *
+ sizeof(vq->sw_ring[0]), RTE_CACHE_LINE_SIZE, socket_id);
} else if (queue_type == VTNET_TQ) {
snprintf(vq_name, sizeof(vq_name), "port%d_tvq%d",
dev->data->port_id, queue_idx);
@@ -308,6 +314,12 @@ int virtio_dev_queue_setup(struct rte_eth_dev *dev,
PMD_INIT_LOG(ERR, "%s: Can not allocate virtqueue", __func__);
return (-ENOMEM);
}
+ if (queue_type == VTNET_RQ && vq->sw_ring == NULL) {
+ PMD_INIT_LOG(ERR, "%s: Can not allocate RX soft ring",
+ __func__);
+ rte_free(vq);
+ return -ENOMEM;
+ }

vq->hw = hw;
vq->port_id = dev->data->port_id;
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 9324f7f..5c00e9d 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -62,6 +62,8 @@
#define VIRTIO_DUMP_PACKET(m, len) do { } while (0)
#endif

+static int use_simple_rxtx;
+
static void
vq_ring_free_chain(struct virtqueue *vq, uint16_t desc_idx)
{
@@ -299,6 +301,11 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
/* Allocate blank mbufs for the each rx descriptor */
nbufs = 0;
error = ENOSPC;
+
+ memset(&vq->fake_mbuf, 0, sizeof(vq->fake_mbuf));
+ for (i = 0; i < RTE_PMD_VIRTIO_RX_MAX_BURST; i++)
+ vq->sw_ring[vq->vq_nentries + i] = &vq->fake_mbuf;
+
while (!virtqueue_full(vq)) {
m = rte_rxmbuf_alloc(vq->mpool);
if (m == NULL)
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 7789411..6a1ec48 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -190,6 +190,10 @@ struct virtqueue {
uint16_t vq_avail_idx;
phys_addr_t virtio_net_hdr_mem; /**< hdr for each xmit packet */

+ struct rte_mbuf **sw_ring; /**< RX software ring. */
+ /* dummy mbuf, for wraparound when processing RX ring. */
+ struct rte_mbuf fake_mbuf;
+
/* Statistics */
uint64_t packets;
uint64_t bytes;
--
1.8.1.4
Stephen Hemminger
2015-10-19 04:20:12 UTC
Permalink
On Sun, 18 Oct 2015 14:28:59 +0800
Post by Huawei Xie
+ if (vq->sw_ring)
+ rte_free(vq->sw_ring);
+
There is no need to test for NULL before calling rte_free().
Better to just rely on the fact that rte_free(NULL) is documented
to be a no-op.
Xie, Huawei
2015-10-19 05:06:28 UTC
Permalink
On 10/19/2015 12:20 PM, Stephen Hemminger wrote:

On Sun, 18 Oct 2015 14:28:59 +0800
Huawei Xie <***@intel.com><mailto:***@intel.com> wrote:



+ if (vq->sw_ring)
+ rte_free(vq->sw_ring);
+



Do not need to test for NULL before calling rte_free.
Better to just rely on the fact that rte_free(NULL) is documented
to be ok (no operation).



OK. Btw, in a previous commit, in the same function,
void virtio_dev_queue_release(vq)
[...]

rte_free(vq);

vq = NULL;
I think there is no need to set vq to NULL. I will submit a patch to fix it if you agree.
Xie, Huawei
2015-10-20 15:32:32 UTC
Permalink
Post by Stephen Hemminger
On Sun, 18 Oct 2015 14:28:59 +0800
+ if (vq->sw_ring)
+ rte_free(vq->sw_ring);
+
Do not need to test for NULL before calling rte_free.
Better to just rely on the fact that rte_free(NULL) is documented
to be ok (no operation).
ok, btw, in previous commit, just in the same function,
void virtio_dev_queue_release(vq)
[...]
rte_free(vq);
vq = NULL;
I think there is no need to set NULL to vq. Will submit a patch to fix it if you agree.
In the v3 patch, I also removed the "vq = NULL".
Huawei Xie
2015-10-18 06:29:02 UTC
Permalink
With the fixed avail ring, we don't need to get the desc idx from the avail ring;
the virtio driver only has to deal with the desc ring.
This patch uses vector instructions to accelerate processing of the desc ring.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.h | 2 +
drivers/net/virtio/virtio_rxtx.c | 3 +
drivers/net/virtio/virtio_rxtx.h | 2 +
drivers/net/virtio/virtio_rxtx_simple.c | 224 ++++++++++++++++++++++++++++++++
drivers/net/virtio/virtqueue.h | 1 +
5 files changed, 232 insertions(+)

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index 9026d42..d7797ab 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -108,6 +108,8 @@ uint16_t virtio_recv_mergeable_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t nb_pkts);

+uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+ uint16_t nb_pkts);

/*
* The VIRTIO_NET_F_GUEST_TSO[46] features permit the host to send us
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 5162ce6..947fc46 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -432,6 +432,9 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
vq->mpool = mp;

dev->data->rx_queues[queue_idx] = vq;
+
+ virtio_rxq_vec_setup(vq);
+
return 0;
}

diff --git a/drivers/net/virtio/virtio_rxtx.h b/drivers/net/virtio/virtio_rxtx.h
index 7d2d8fe..831e492 100644
--- a/drivers/net/virtio/virtio_rxtx.h
+++ b/drivers/net/virtio/virtio_rxtx.h
@@ -33,5 +33,7 @@

#define RTE_PMD_VIRTIO_RX_MAX_BURST 64

+int virtio_rxq_vec_setup(struct virtqueue *rxq);
+
int virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
struct rte_mbuf *m);
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
index cac5b9f..ef17562 100644
--- a/drivers/net/virtio/virtio_rxtx_simple.c
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -58,6 +58,10 @@
#include "virtqueue.h"
#include "virtio_rxtx.h"

+#define RTE_VIRTIO_VPMD_RX_BURST 32
+#define RTE_VIRTIO_DESC_PER_LOOP 8
+#define RTE_VIRTIO_VPMD_RX_REARM_THRESH RTE_VIRTIO_VPMD_RX_BURST
+
int __attribute__((cold))
virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
struct rte_mbuf *cookie)
@@ -82,3 +86,223 @@ virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,

return 0;
}
+
+static inline void
+virtio_rxq_rearm_vec(struct virtqueue *rxvq)
+{
+ int i;
+ uint16_t desc_idx;
+ struct rte_mbuf **sw_ring;
+ struct vring_desc *start_dp;
+ int ret;
+
+ desc_idx = rxvq->vq_avail_idx & (rxvq->vq_nentries - 1);
+ sw_ring = &rxvq->sw_ring[desc_idx];
+ start_dp = &rxvq->vq_ring.desc[desc_idx];
+
+ ret = rte_mempool_get_bulk(rxvq->mpool, (void **)sw_ring,
+ RTE_VIRTIO_VPMD_RX_REARM_THRESH);
+ if (unlikely(ret)) {
+ rte_eth_devices[rxvq->port_id].data->rx_mbuf_alloc_failed +=
+ RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ return;
+ }
+
+ for (i = 0; i < RTE_VIRTIO_VPMD_RX_REARM_THRESH; i++) {
+ uintptr_t p;
+
+ p = (uintptr_t)&sw_ring[i]->rearm_data;
+ *(uint64_t *)p = rxvq->mbuf_initializer;
+
+ start_dp[i].addr =
+ (uint64_t)((uintptr_t)sw_ring[i]->buf_physaddr +
+ RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
+ start_dp[i].len = sw_ring[i]->buf_len -
+ RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);
+ }
+
+ rxvq->vq_avail_idx += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ rxvq->vq_free_cnt -= RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ vq_update_avail_idx(rxvq);
+}
+
+/* virtio vPMD receive routine, only accepts nb_pkts >= RTE_VIRTIO_DESC_PER_LOOP
+ *
+ * This routine is for non-mergeable RX, one desc for each guest buffer.
+ * This routine is based on the RX ring layout optimization. Each entry in the
+ * avail ring points to the desc with the same index in the desc ring and this
+ * will never be changed in the driver.
+ *
+ * - nb_pkts < RTE_VIRTIO_DESC_PER_LOOP, just return no packet
+ */
+uint16_t
+virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+ uint16_t nb_pkts)
+{
+ struct virtqueue *rxvq = rx_queue;
+ uint16_t nb_used;
+ uint16_t desc_idx;
+ struct vring_used_elem *rused;
+ struct rte_mbuf **sw_ring;
+ struct rte_mbuf **sw_ring_end;
+ uint16_t nb_pkts_received;
+ __m128i shuf_msk1, shuf_msk2, len_adjust;
+
+ shuf_msk1 = _mm_set_epi8(
+ 0xFF, 0xFF, 0xFF, 0xFF,
+ 0xFF, 0xFF, /* vlan tci */
+ 5, 4, /* dat len */
+ 0xFF, 0xFF, 5, 4, /* pkt len */
+ 0xFF, 0xFF, 0xFF, 0xFF /* packet type */
+
+ );
+
+ shuf_msk2 = _mm_set_epi8(
+ 0xFF, 0xFF, 0xFF, 0xFF,
+ 0xFF, 0xFF, /* vlan tci */
+ 13, 12, /* dat len */
+ 0xFF, 0xFF, 13, 12, /* pkt len */
+ 0xFF, 0xFF, 0xFF, 0xFF /* packet type */
+ );
+
+ /* Subtract the header length.
+ * In which case do we need the header length in used->len ?
+ */
+ len_adjust = _mm_set_epi16(
+ 0, 0,
+ 0,
+ (uint16_t) -sizeof(struct virtio_net_hdr),
+ 0, (uint16_t) -sizeof(struct virtio_net_hdr),
+ 0, 0);
+
+ if (unlikely(nb_pkts < RTE_VIRTIO_DESC_PER_LOOP))
+ return 0;
+
+ nb_used = *(volatile uint16_t *)&rxvq->vq_ring.used->idx -
+ rxvq->vq_used_cons_idx;
+
+ rte_compiler_barrier();
+
+ if (unlikely(nb_used == 0))
+ return 0;
+
+ nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, RTE_VIRTIO_DESC_PER_LOOP);
+ nb_used = RTE_MIN(nb_used, nb_pkts);
+
+ desc_idx = (uint16_t)(rxvq->vq_used_cons_idx & (rxvq->vq_nentries - 1));
+ rused = &rxvq->vq_ring.used->ring[desc_idx];
+ sw_ring = &rxvq->sw_ring[desc_idx];
+ sw_ring_end = &rxvq->sw_ring[rxvq->vq_nentries];
+
+ _mm_prefetch((const void *)rused, _MM_HINT_T0);
+
+ if (rxvq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
+ virtio_rxq_rearm_vec(rxvq);
+ if (unlikely(virtqueue_kick_prepare(rxvq)))
+ virtqueue_notify(rxvq);
+ }
+
+ for (nb_pkts_received = 0;
+ nb_pkts_received < nb_used;) {
+ __m128i desc[RTE_VIRTIO_DESC_PER_LOOP / 2];
+ __m128i mbp[RTE_VIRTIO_DESC_PER_LOOP / 2];
+ __m128i pkt_mb[RTE_VIRTIO_DESC_PER_LOOP];
+
+ mbp[0] = _mm_loadu_si128((__m128i *)(sw_ring + 0));
+ desc[0] = _mm_loadu_si128((__m128i *)(rused + 0));
+ _mm_storeu_si128((__m128i *)&rx_pkts[0], mbp[0]);
+
+ mbp[1] = _mm_loadu_si128((__m128i *)(sw_ring + 2));
+ desc[1] = _mm_loadu_si128((__m128i *)(rused + 2));
+ _mm_storeu_si128((__m128i *)&rx_pkts[2], mbp[1]);
+
+ mbp[2] = _mm_loadu_si128((__m128i *)(sw_ring + 4));
+ desc[2] = _mm_loadu_si128((__m128i *)(rused + 4));
+ _mm_storeu_si128((__m128i *)&rx_pkts[4], mbp[2]);
+
+ mbp[3] = _mm_loadu_si128((__m128i *)(sw_ring + 6));
+ desc[3] = _mm_loadu_si128((__m128i *)(rused + 6));
+ _mm_storeu_si128((__m128i *)&rx_pkts[6], mbp[3]);
+
+ pkt_mb[1] = _mm_shuffle_epi8(desc[0], shuf_msk2);
+ pkt_mb[0] = _mm_shuffle_epi8(desc[0], shuf_msk1);
+ pkt_mb[1] = _mm_add_epi16(pkt_mb[1], len_adjust);
+ pkt_mb[0] = _mm_add_epi16(pkt_mb[0], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[1]->rx_descriptor_fields1,
+ pkt_mb[1]);
+ _mm_storeu_si128((void *)&rx_pkts[0]->rx_descriptor_fields1,
+ pkt_mb[0]);
+
+ pkt_mb[3] = _mm_shuffle_epi8(desc[1], shuf_msk2);
+ pkt_mb[2] = _mm_shuffle_epi8(desc[1], shuf_msk1);
+ pkt_mb[3] = _mm_add_epi16(pkt_mb[3], len_adjust);
+ pkt_mb[2] = _mm_add_epi16(pkt_mb[2], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[3]->rx_descriptor_fields1,
+ pkt_mb[3]);
+ _mm_storeu_si128((void *)&rx_pkts[2]->rx_descriptor_fields1,
+ pkt_mb[2]);
+
+ pkt_mb[5] = _mm_shuffle_epi8(desc[2], shuf_msk2);
+ pkt_mb[4] = _mm_shuffle_epi8(desc[2], shuf_msk1);
+ pkt_mb[5] = _mm_add_epi16(pkt_mb[5], len_adjust);
+ pkt_mb[4] = _mm_add_epi16(pkt_mb[4], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[5]->rx_descriptor_fields1,
+ pkt_mb[5]);
+ _mm_storeu_si128((void *)&rx_pkts[4]->rx_descriptor_fields1,
+ pkt_mb[4]);
+
+ pkt_mb[7] = _mm_shuffle_epi8(desc[3], shuf_msk2);
+ pkt_mb[6] = _mm_shuffle_epi8(desc[3], shuf_msk1);
+ pkt_mb[7] = _mm_add_epi16(pkt_mb[7], len_adjust);
+ pkt_mb[6] = _mm_add_epi16(pkt_mb[6], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[7]->rx_descriptor_fields1,
+ pkt_mb[7]);
+ _mm_storeu_si128((void *)&rx_pkts[6]->rx_descriptor_fields1,
+ pkt_mb[6]);
+
+ if (unlikely(nb_used <= RTE_VIRTIO_DESC_PER_LOOP)) {
+ if (sw_ring + nb_used <= sw_ring_end)
+ nb_pkts_received += nb_used;
+ else
+ nb_pkts_received += sw_ring_end - sw_ring;
+ break;
+ } else {
+ if (unlikely(sw_ring + RTE_VIRTIO_DESC_PER_LOOP >=
+ sw_ring_end)) {
+ nb_pkts_received += sw_ring_end - sw_ring;
+ break;
+ } else {
+ nb_pkts_received += RTE_VIRTIO_DESC_PER_LOOP;
+
+ rx_pkts += RTE_VIRTIO_DESC_PER_LOOP;
+ sw_ring += RTE_VIRTIO_DESC_PER_LOOP;
+ rused += RTE_VIRTIO_DESC_PER_LOOP;
+ nb_used -= RTE_VIRTIO_DESC_PER_LOOP;
+ }
+ }
+ }
+
+ rxvq->vq_used_cons_idx += nb_pkts_received;
+ rxvq->vq_free_cnt += nb_pkts_received;
+ rxvq->packets += nb_pkts_received;
+ return nb_pkts_received;
+}
+
+int __attribute__((cold))
+virtio_rxq_vec_setup(struct virtqueue *rxq)
+{
+ uintptr_t p;
+ struct rte_mbuf mb_def = { .buf_addr = 0 }; /* zeroed mbuf */
+
+ mb_def.nb_segs = 1;
+ mb_def.data_off = RTE_PKTMBUF_HEADROOM;
+ mb_def.port = rxq->port_id;
+ rte_mbuf_refcnt_set(&mb_def, 1);
+
+ /* prevent compiler reordering: rearm_data covers previous fields */
+ rte_compiler_barrier();
+ p = (uintptr_t)&mb_def.rearm_data;
+ rxq->mbuf_initializer = *(uint64_t *)p;
+
+ return 0;
+}
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 6a1ec48..98a77d5 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -188,6 +188,7 @@ struct virtqueue {
*/
uint16_t vq_used_cons_idx;
uint16_t vq_avail_idx;
+ uint64_t mbuf_initializer; /**< value to init mbufs. */
phys_addr_t virtio_net_hdr_mem; /**< hdr for each xmit packet */

struct rte_mbuf **sw_ring; /**< RX software ring. */
--
1.8.1.4
Huawei Xie
2015-10-18 06:29:03 UTC
Permalink
Bulk free of mbufs when cleaning the used ring.
The shift operation on idx could be further saved if vq_free_cnt meant
free slots rather than free descriptors.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.h | 3 ++
drivers/net/virtio/virtio_rxtx_simple.c | 95 +++++++++++++++++++++++++++++++++
2 files changed, 98 insertions(+)

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index d7797ab..ae2d47d 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -111,6 +111,9 @@ uint16_t virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t nb_pkts);

+uint16_t virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
+ uint16_t nb_pkts);
+
/*
* The VIRTIO_NET_F_GUEST_TSO[46] features permit the host to send us
* frames larger than 1514 bytes. We do not yet support software LRO
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
index ef17562..3339a24 100644
--- a/drivers/net/virtio/virtio_rxtx_simple.c
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -288,6 +288,101 @@ virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
return nb_pkts_received;
}

+#define VIRTIO_TX_FREE_THRESH 32
+#define VIRTIO_TX_MAX_FREE_BUF_SZ 32
+#define VIRTIO_TX_FREE_NR 32
+/* TODO: vq->tx_free_cnt could mean num of free slots so we could avoid shift */
+static inline void __attribute__((always_inline))
+virtio_xmit_cleanup(struct virtqueue *vq)
+{
+ uint16_t i, desc_idx;
+ int nb_free = 0;
+ struct rte_mbuf *m, *free[VIRTIO_TX_MAX_FREE_BUF_SZ];
+
+ desc_idx = (uint16_t)(vq->vq_used_cons_idx &
+ ((vq->vq_nentries >> 1) - 1));
+ free[0] = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ nb_free = 1;
+
+ for (i = 1; i < VIRTIO_TX_FREE_NR; i++) {
+ m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ if (likely(m->pool == free[0]->pool))
+ free[nb_free++] = m;
+ else {
+ rte_mempool_put_bulk(free[0]->pool, (void **)free,
+ nb_free);
+ free[0] = m;
+ nb_free = 1;
+ }
+ }
+
+ rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+ vq->vq_used_cons_idx += VIRTIO_TX_FREE_NR;
+ vq->vq_free_cnt += (VIRTIO_TX_FREE_NR << 1);
+
+ return;
+}
+
+uint16_t
+virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
+ uint16_t nb_pkts)
+{
+ struct virtqueue *txvq = tx_queue;
+ uint16_t nb_used;
+ uint16_t desc_idx;
+ struct vring_desc *start_dp;
+ uint16_t nb_tail, nb_commit;
+ int i;
+ uint16_t desc_idx_max = (txvq->vq_nentries >> 1) - 1;
+
+ nb_used = VIRTQUEUE_NUSED(txvq);
+ rte_compiler_barrier();
+
+ nb_commit = nb_pkts = RTE_MIN((txvq->vq_free_cnt >> 1), nb_pkts);
+ desc_idx = (uint16_t) (txvq->vq_avail_idx & desc_idx_max);
+ start_dp = txvq->vq_ring.desc;
+ nb_tail = (uint16_t) (desc_idx_max + 1 - desc_idx);
+
+ if (nb_used >= VIRTIO_TX_FREE_THRESH)
+ virtio_xmit_cleanup(tx_queue);
+
+ if (nb_commit >= nb_tail) {
+ for (i = 0; i < nb_tail; i++)
+ txvq->vq_descx[desc_idx + i].cookie = tx_pkts[i];
+ for (i = 0; i < nb_tail; i++) {
+ start_dp[desc_idx].addr =
+ RTE_MBUF_DATA_DMA_ADDR(*tx_pkts);
+ start_dp[desc_idx].len = (*tx_pkts)->pkt_len;
+ tx_pkts++;
+ desc_idx++;
+ }
+ nb_commit -= nb_tail;
+ desc_idx = 0;
+ }
+ for (i = 0; i < nb_commit; i++)
+ txvq->vq_descx[desc_idx + i].cookie = tx_pkts[i];
+ for (i = 0; i < nb_commit; i++) {
+ start_dp[desc_idx].addr = RTE_MBUF_DATA_DMA_ADDR(*tx_pkts);
+ start_dp[desc_idx].len = (*tx_pkts)->pkt_len;
+ tx_pkts++;
+ desc_idx++;
+ }
+
+ rte_compiler_barrier();
+
+ txvq->vq_free_cnt -= (uint16_t)(nb_pkts << 1);
+ txvq->vq_avail_idx += nb_pkts;
+ txvq->vq_ring.avail->idx = txvq->vq_avail_idx;
+ txvq->packets += nb_pkts;
+
+ if (likely(nb_pkts)) {
+ if (unlikely(virtqueue_kick_prepare(txvq)))
+ virtqueue_notify(txvq);
+ }
+
+ return nb_pkts;
+}
+
int __attribute__((cold))
virtio_rxq_vec_setup(struct virtqueue *rxq)
{
--
1.8.1.4
Stephen Hemminger
2015-10-19 04:16:57 UTC
Permalink
On Sun, 18 Oct 2015 14:29:03 +0800
Post by Huawei Xie
bulk free of mbufs when clean used ring.
shift operation of idx could be further saved if vq_free_cnt means
free slots rather than free descriptors.
Did you measure this? I finished my transmit optimizations and got a 25% performance improvement
without any of these restrictions.
Xie, Huawei
2015-10-19 05:22:12 UTC
Permalink
Post by Stephen Hemminger
On Sun, 18 Oct 2015 14:29:03 +0800
Post by Huawei Xie
bulk free of mbufs when clean used ring.
shift operation of idx could be further saved if vq_free_cnt means
free slots rather than free descriptors.
Did you measure this. I finished my transmit optimizations and gets 25% performance improvement
without any of these restrictions.
Which restriction do you mean? For the ring layout optimization, this
is the core idea. For the single-segment mbuf, this is what all other
PMDs assume for the fastest rx/tx path.
With all vhost and virtio optimizations, it could achieve approximately
a 3~4x performance improvement (depending on the workload).
Do you mean the indirect feature support, or additional optimizations not
yet submitted? I will review your patch this week.
Stephen Hemminger
2015-10-19 04:18:00 UTC
Permalink
On Sun, 18 Oct 2015 14:29:03 +0800
Post by Huawei Xie
+
+ for (i = 1; i < VIRTIO_TX_FREE_NR; i++) {
+ m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ if (likely(m->pool == free[0]->pool))
+ free[nb_free++] = m;
+ else {
+ rte_mempool_put_bulk(free[0]->pool, (void **)free,
+ nb_free);
+ free[0] = m;
+ nb_free = 1;
+ }
+ }
This assumes all transmits are from the same pool, which is not necessarily true.
Xie, Huawei
2015-10-19 05:15:30 UTC
Permalink
Post by Stephen Hemminger
On Sun, 18 Oct 2015 14:29:03 +0800
Post by Huawei Xie
+
+ for (i = 1; i < VIRTIO_TX_FREE_NR; i++) {
+ m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ if (likely(m->pool == free[0]->pool))
+ free[nb_free++] = m;
+ else {
+ rte_mempool_put_bulk(free[0]->pool, (void **)free,
+ nb_free);
+ free[0] = m;
+ nb_free = 1;
+ }
+ }
This assumes all transmits are from the same pool, which is not necessarily true.
I don't get you. It accumulates mbufs from the same pool and frees all
of them when it meets one from a different pool.

if (likely(m->pool == free[0]->pool))
Stephen Hemminger
2015-10-19 04:19:25 UTC
Permalink
+static inline void __attribute__((always_inline))
+virtio_xmit_cleanup(struct virtqueue *vq)
+{

Please don't use always_inline; overriding the compiler isn't going
to help.

+ uint16_t i, desc_idx;
+ int nb_free = 0;
+ struct rte_mbuf *m, *free[VIRTIO_TX_MAX_FREE_BUF_SZ];
+
+ desc_idx = (uint16_t)(vq->vq_used_cons_idx &
+ ((vq->vq_nentries >> 1) - 1));
+ free[0] = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ nb_free = 1;
+
+ for (i = 1; i < VIRTIO_TX_FREE_NR; i++) {
+ m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ if (likely(m->pool == free[0]->pool))
+ free[nb_free++] = m;
+ else {
+ rte_mempool_put_bulk(free[0]->pool, (void **)free,
+ nb_free);
+ free[0] = m;
+ nb_free = 1;
+ }
+ }
+
+ rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+ vq->vq_used_cons_idx += VIRTIO_TX_FREE_NR;
+ vq->vq_free_cnt += (VIRTIO_TX_FREE_NR << 1);
+
+ return;
+}

Don't add return; at the end of void functions. It only clutters
things for no reason.
Xie, Huawei
2015-10-19 05:12:19 UTC
Permalink
Post by Huawei Xie
+static inline void __attribute__((always_inline))
+virtio_xmit_cleanup(struct virtqueue *vq)
+{
Please don't use always inline, frustrating the compiler isn't going
to help.
always_inline is scattered elsewhere in the DPDK code.
What is the negative effect? Should we remove all of them?
Post by Huawei Xie
+ uint16_t i, desc_idx;
+ int nb_free = 0;
+ struct rte_mbuf *m, *free[VIRTIO_TX_MAX_FREE_BUF_SZ];
+
+ desc_idx = (uint16_t)(vq->vq_used_cons_idx &
+ ((vq->vq_nentries >> 1) - 1));
+ free[0] = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ nb_free = 1;
+
+ for (i = 1; i < VIRTIO_TX_FREE_NR; i++) {
+ m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ if (likely(m->pool == free[0]->pool))
+ free[nb_free++] = m;
+ else {
+ rte_mempool_put_bulk(free[0]->pool, (void **)free,
+ nb_free);
+ free[0] = m;
+ nb_free = 1;
+ }
+ }
+
+ rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+ vq->vq_used_cons_idx += VIRTIO_TX_FREE_NR;
+ vq->vq_free_cnt += (VIRTIO_TX_FREE_NR << 1);
+
+ return;
+}
Don't add return; at end of void functions. It only clutters
things for no reason.
Agree.
Huawei Xie
2015-10-18 06:29:04 UTC
Permalink
The simple rx/tx functions are enabled when the user specifies single-segment, no-offload TX queue flags.
Mergeable RX buffers should be disabled to use the simple rx/tx path.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_rxtx.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 947fc46..71f8cd4 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -62,6 +62,10 @@
#define VIRTIO_DUMP_PACKET(m, len) do { } while (0)
#endif

+
+#define VIRTIO_SIMPLE_FLAGS ((uint32_t)ETH_TXQ_FLAGS_NOMULTSEGS | \
+ ETH_TXQ_FLAGS_NOOFFLOADS)
+
static int use_simple_rxtx;

static void
@@ -471,6 +475,14 @@ virtio_dev_tx_queue_setup(struct rte_eth_dev *dev,
return -EINVAL;
}

+ /* Use simple rx/tx func if single segment and no offloads */
+ if ((tx_conf->txq_flags & VIRTIO_SIMPLE_FLAGS) == VIRTIO_SIMPLE_FLAGS) {
+ PMD_INIT_LOG(INFO, "Using simple rx/tx path");
+ dev->tx_pkt_burst = virtio_xmit_pkts_simple;
+ dev->rx_pkt_burst = virtio_recv_pkts_vec;
+ use_simple_rxtx = 1;
+ }
+
ret = virtio_dev_queue_setup(dev, VTNET_TQ, queue_idx, vtpci_queue_idx,
nb_desc, socket_id, &vq);
if (ret < 0) {
--
1.8.1.4
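
For illustration, a minimal application-side sketch of how the simple path above gets selected: configure the TX queue with the no-multi-segment and no-offload txq_flags, and keep mergeable RX buffers disabled. The port id, queue id and ring size below are arbitrary example values, not taken from the patch.

#include <string.h>
#include <rte_ethdev.h>

/* Illustrative only: request the conditions the PMD checks for the
 * simple rx/tx path (single segment, no TX offloads). */
static int
setup_simple_txq(uint8_t port_id, uint16_t queue_id, unsigned int socket_id)
{
	struct rte_eth_txconf txconf;

	memset(&txconf, 0, sizeof(txconf));
	txconf.txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS | ETH_TXQ_FLAGS_NOOFFLOADS;

	/* 256 descriptors is an arbitrary example ring size */
	return rte_eth_tx_queue_setup(port_id, queue_id, 256, socket_id, &txconf);
}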
Huawei Xie
2015-10-20 15:30:00 UTC
Permalink
Changes in v2:
- Remove the configure macro
- Enable simple RX/TX processing when the user specifies simple txq flags
- Reword some comments and commit messages

Changes in v3:
- Remove unnecessary NULL test for rte_free
- Remove unnecessary assign of local var after free
- Remove return at the end of void function
- Remove always_inline attribute for virtio_xmit_cleanup
- Reword some commit messages
- Add TODO in the commit message of simple tx patch

In a DPDK-based switching environment, vhost mostly runs on a dedicated core
while virtio processing in guest VMs runs on other cores.
Take RX for example: with the generic implementation, for each guest buffer,
a) the virtio driver allocates a descriptor from the free descriptor list
b) it modifies the entry of the avail ring to point to the allocated descriptor
c) after the packet is received, it frees the descriptor

When vhost fetches the avail ring, it needs to fetch the modified L1 cache line from
the virtio core, which is a heavy cost on current CPU implementations.

The idea of this optimization is:
allocate a fixed descriptor for each entry of the avail ring, so the avail ring will
always be the same during the run.
This removes the L1 cache transfer of the avail ring from the virtio core to the vhost core.
(Note we couldn't avoid the cache transfer for the descriptors.)
Besides, descriptor allocation and free operations are eliminated.
This also makes vector processing possible, to further accelerate the processing.

This is the layout for the avail ring (take 256 ring entries for example), with
each entry pointing to the descriptor with the same index.
avail
idx
+
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | ... | 254 | 255 | avail ring
+-+--+-+--+-+-+---------+---+--+---+
| | | | | |
| | | | | |
v v v | v v
+-+--+-+--+-+-+---------+---+--+---+
| 0 | 1 | 2 | ... | 254 | 255 | desc ring
+----+----+---+-------------+------+
|
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | | 254 | 255 | used ring
+----+----+---+-------------+------+
|
+

This is the ring layout for TX.
As we need one virtio header for each xmit packet, we have 128 slots available.

++
||
||
+-----+-----+-----+--------------+------+------+------+
| 0 | 1 | ... | 127 || 128 | 129 | ... | 255 | avail ring
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 127 | 128 | ... | 255 || 127 | 128 | ... | 255 | desc ring for virtio_net_hdr
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 0 | 1 | ... | 127 || 0 | 1 | ... | 127 | desc ring for tx dat
+-----+-----+-----+--------------+------+------+------+
||
||
++


A performance boost could be observed only if the virtio backend isn't the bottleneck, or in the VM2VM
case.
There are also several vhost optimization patches to be submitted later.
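
As a condensed sketch, the fixed ring setup implied by the two layouts above would look roughly like the following. These are simplified stand-in structures, not the real vring types, and per-slot header addresses are an assumption of this sketch; the driver's actual setup is in the ring layout patch later in this series.

#include <stdint.h>

#define VQ_ENTRIES          256   /* example ring size */
#define VRING_DESC_F_NEXT   1
#define VRING_DESC_F_WRITE  2

struct sketch_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

static uint16_t avail_ring[VQ_ENTRIES];
static struct sketch_desc desc_ring[VQ_ENTRIES];

/* RX: avail entry i always points to desc i, so the avail ring never changes. */
static void
rx_fixed_ring_init(void)
{
	for (int i = 0; i < VQ_ENTRIES; i++) {
		avail_ring[i] = i;
		desc_ring[i].flags = VRING_DESC_F_WRITE;  /* device writes RX buffers */
	}
}

/* TX: the upper half of the desc ring holds virtio_net_hdr descriptors, each
 * chained to a data descriptor in the lower half, leaving 128 slots for data. */
static void
tx_fixed_ring_init(uint64_t hdr_mem, uint16_t hdr_size)
{
	int mid = VQ_ENTRIES / 2;

	for (int i = 0; i < mid; i++) {
		avail_ring[i] = i + mid;                   /* avail entry -> header desc */
		desc_ring[i + mid].addr = hdr_mem + i * hdr_size;
		desc_ring[i + mid].len = hdr_size;
		desc_ring[i + mid].next = i;               /* header desc -> data desc */
		desc_ring[i + mid].flags = VRING_DESC_F_NEXT;
		desc_ring[i].flags = 0;
	}
	for (int i = mid; i < VQ_ENTRIES; i++)
		avail_ring[i] = i;                         /* unused upper half stays fixed */
}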

Huawei Xie (7):
virtio: add virtio_rxtx.h header file
virtio: add software rx ring, fake_buf into virtqueue
virtio: rx/tx ring layout optimization
virtio: fill RX avail ring with blank mbufs
virtio: virtio vec rx
virtio: simple tx routine
virtio: choose simple rx/tx func

drivers/net/virtio/Makefile | 2 +-
drivers/net/virtio/virtio_ethdev.c | 12 +-
drivers/net/virtio/virtio_ethdev.h | 5 +
drivers/net/virtio/virtio_rxtx.c | 53 ++++-
drivers/net/virtio/virtio_rxtx.h | 39 ++++
drivers/net/virtio/virtio_rxtx_simple.c | 401 ++++++++++++++++++++++++++++++++
drivers/net/virtio/virtqueue.h | 5 +
7 files changed, 513 insertions(+), 4 deletions(-)
create mode 100644 drivers/net/virtio/virtio_rxtx.h
create mode 100644 drivers/net/virtio/virtio_rxtx_simple.c
--
1.8.1.4
Huawei Xie
2015-10-20 15:30:01 UTC
Permalink
Will move all rx/tx related declarations into this header file in the future.
Add RTE_PMD_VIRTIO_RX_MAX_BURST.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.c | 1 +
drivers/net/virtio/virtio_rxtx.c | 1 +
drivers/net/virtio/virtio_rxtx.h | 34 ++++++++++++++++++++++++++++++++++
3 files changed, 36 insertions(+)
create mode 100644 drivers/net/virtio/virtio_rxtx.h

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 465d3cd..79a3640 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -61,6 +61,7 @@
#include "virtio_pci.h"
#include "virtio_logs.h"
#include "virtqueue.h"
+#include "virtio_rxtx.h"


static int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index c5b53bb..9324f7f 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -54,6 +54,7 @@
#include "virtio_logs.h"
#include "virtio_ethdev.h"
#include "virtqueue.h"
+#include "virtio_rxtx.h"

#ifdef RTE_LIBRTE_VIRTIO_DEBUG_DUMP
#define VIRTIO_DUMP_PACKET(m, len) rte_pktmbuf_dump(stdout, m, len)
diff --git a/drivers/net/virtio/virtio_rxtx.h b/drivers/net/virtio/virtio_rxtx.h
new file mode 100644
index 0000000..a10aa69
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx.h
@@ -0,0 +1,34 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#define RTE_PMD_VIRTIO_RX_MAX_BURST 64
--
1.8.1.4
Huawei Xie
2015-10-20 15:30:02 UTC
Permalink
Changes in v3:
- Remove unnecessary NULL test for rte_free
- Remove unnecessary assign of local var vq after free

Add software RX ring in virtqueue.
Add fake_mbuf in virtqueue for wraparound processing.
Use the global use_simple_rxtx flag to indicate whether the simple rx/tx path is enabled.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.c | 11 ++++++++++-
drivers/net/virtio/virtio_rxtx.c | 7 +++++++
drivers/net/virtio/virtqueue.h | 4 ++++
3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 79a3640..82676d3 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -247,8 +247,8 @@ virtio_dev_queue_release(struct virtqueue *vq) {
VIRTIO_WRITE_REG_2(hw, VIRTIO_PCI_QUEUE_SEL, vq->queue_id);
VIRTIO_WRITE_REG_4(hw, VIRTIO_PCI_QUEUE_PFN, 0);

+ rte_free(vq->sw_ring);
rte_free(vq);
- vq = NULL;
}
}

@@ -292,6 +292,9 @@ int virtio_dev_queue_setup(struct rte_eth_dev *dev,
dev->data->port_id, queue_idx);
vq = rte_zmalloc(vq_name, sizeof(struct virtqueue) +
vq_size * sizeof(struct vq_desc_extra), RTE_CACHE_LINE_SIZE);
+ vq->sw_ring = rte_zmalloc_socket("rxq->sw_ring",
+ (RTE_PMD_VIRTIO_RX_MAX_BURST + vq_size) *
+ sizeof(vq->sw_ring[0]), RTE_CACHE_LINE_SIZE, socket_id);
} else if (queue_type == VTNET_TQ) {
snprintf(vq_name, sizeof(vq_name), "port%d_tvq%d",
dev->data->port_id, queue_idx);
@@ -308,6 +311,12 @@ int virtio_dev_queue_setup(struct rte_eth_dev *dev,
PMD_INIT_LOG(ERR, "%s: Can not allocate virtqueue", __func__);
return (-ENOMEM);
}
+ if (queue_type == VTNET_RQ && vq->sw_ring == NULL) {
+ PMD_INIT_LOG(ERR, "%s: Can not allocate RX soft ring",
+ __func__);
+ rte_free(vq);
+ return -ENOMEM;
+ }

vq->hw = hw;
vq->port_id = dev->data->port_id;
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 9324f7f..5c00e9d 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -62,6 +62,8 @@
#define VIRTIO_DUMP_PACKET(m, len) do { } while (0)
#endif

+static int use_simple_rxtx;
+
static void
vq_ring_free_chain(struct virtqueue *vq, uint16_t desc_idx)
{
@@ -299,6 +301,11 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
/* Allocate blank mbufs for the each rx descriptor */
nbufs = 0;
error = ENOSPC;
+
+ memset(&vq->fake_mbuf, 0, sizeof(vq->fake_mbuf));
+ for (i = 0; i < RTE_PMD_VIRTIO_RX_MAX_BURST; i++)
+ vq->sw_ring[vq->vq_nentries + i] = &vq->fake_mbuf;
+
while (!virtqueue_full(vq)) {
m = rte_rxmbuf_alloc(vq->mpool);
if (m == NULL)
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 7789411..6a1ec48 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -190,6 +190,10 @@ struct virtqueue {
uint16_t vq_avail_idx;
phys_addr_t virtio_net_hdr_mem; /**< hdr for each xmit packet */

+ struct rte_mbuf **sw_ring; /**< RX software ring. */
+ /* dummy mbuf, for wraparound when processing RX ring. */
+ struct rte_mbuf fake_mbuf;
+
/* Statistics */
uint64_t packets;
uint64_t bytes;
--
1.8.1.4
Huawei Xie
2015-10-20 15:30:04 UTC
Permalink
Fill the avail ring with blank mbufs in virtio_dev_vring_start.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/Makefile | 2 +-
drivers/net/virtio/virtio_rxtx.c | 6 ++-
drivers/net/virtio/virtio_rxtx.h | 3 ++
drivers/net/virtio/virtio_rxtx_simple.c | 84 +++++++++++++++++++++++++++++++++
4 files changed, 92 insertions(+), 3 deletions(-)
create mode 100644 drivers/net/virtio/virtio_rxtx_simple.c

diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 930b60f..43835ba 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -50,7 +50,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtqueue.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_pci.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
-
+SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c

# this lib depends upon:
DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_eal lib/librte_ether
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 7c82a6a..5162ce6 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -320,8 +320,10 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
/******************************************
* Enqueue allocated buffers *
*******************************************/
- error = virtqueue_enqueue_recv_refill(vq, m);
-
+ if (use_simple_rxtx)
+ error = virtqueue_enqueue_recv_refill_simple(vq, m);
+ else
+ error = virtqueue_enqueue_recv_refill(vq, m);
if (error) {
rte_pktmbuf_free(m);
break;
diff --git a/drivers/net/virtio/virtio_rxtx.h b/drivers/net/virtio/virtio_rxtx.h
index a10aa69..7d2d8fe 100644
--- a/drivers/net/virtio/virtio_rxtx.h
+++ b/drivers/net/virtio/virtio_rxtx.h
@@ -32,3 +32,6 @@
*/

#define RTE_PMD_VIRTIO_RX_MAX_BURST 64
+
+int virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
+ struct rte_mbuf *m);
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
new file mode 100644
index 0000000..cac5b9f
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -0,0 +1,84 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <tmmintrin.h>
+
+#include <rte_cycles.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_branch_prediction.h>
+#include <rte_mempool.h>
+#include <rte_malloc.h>
+#include <rte_mbuf.h>
+#include <rte_ether.h>
+#include <rte_ethdev.h>
+#include <rte_prefetch.h>
+#include <rte_string_fns.h>
+#include <rte_errno.h>
+#include <rte_byteorder.h>
+
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtqueue.h"
+#include "virtio_rxtx.h"
+
+int __attribute__((cold))
+virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
+ struct rte_mbuf *cookie)
+{
+ struct vq_desc_extra *dxp;
+ struct vring_desc *start_dp;
+ uint16_t desc_idx;
+
+ desc_idx = vq->vq_avail_idx & (vq->vq_nentries - 1);
+ dxp = &vq->vq_descx[desc_idx];
+ dxp->cookie = (void *)cookie;
+ vq->sw_ring[desc_idx] = cookie;
+
+ start_dp = vq->vq_ring.desc;
+ start_dp[desc_idx].addr = (uint64_t)((uintptr_t)cookie->buf_physaddr +
+ RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
+ start_dp[desc_idx].len = cookie->buf_len -
+ RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);
+
+ vq->vq_free_cnt--;
+ vq->vq_avail_idx++;
+
+ return 0;
+}
--
1.8.1.4
Huawei Xie
2015-10-20 15:30:06 UTC
Permalink
Changes in v3:
- Remove return at the end of void function
- Remove always_inline attribute for virtio_xmit_cleanup

Bulk free of mbufs when cleaning the used ring.
The shift operation on idx could be saved if vq_free_cnt meant
free slots rather than free descriptors.

TODO: rearrange the vq data structure and pack the stats variables together so that we could use
one vector instruction to update all of them.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.h | 3 ++
drivers/net/virtio/virtio_rxtx_simple.c | 93 +++++++++++++++++++++++++++++++++
2 files changed, 96 insertions(+)

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index d7797ab..ae2d47d 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -111,6 +111,9 @@ uint16_t virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t nb_pkts);

+uint16_t virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
+ uint16_t nb_pkts);
+
/*
* The VIRTIO_NET_F_GUEST_TSO[46] features permit the host to send us
* frames larger than 1514 bytes. We do not yet support software LRO
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
index ef17562..a53d462 100644
--- a/drivers/net/virtio/virtio_rxtx_simple.c
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -288,6 +288,99 @@ virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
return nb_pkts_received;
}

+#define VIRTIO_TX_FREE_THRESH 32
+#define VIRTIO_TX_MAX_FREE_BUF_SZ 32
+#define VIRTIO_TX_FREE_NR 32
+/* TODO: vq->tx_free_cnt could mean num of free slots so we could avoid shift */
+static inline void
+virtio_xmit_cleanup(struct virtqueue *vq)
+{
+ uint16_t i, desc_idx;
+ int nb_free = 0;
+ struct rte_mbuf *m, *free[VIRTIO_TX_MAX_FREE_BUF_SZ];
+
+ desc_idx = (uint16_t)(vq->vq_used_cons_idx &
+ ((vq->vq_nentries >> 1) - 1));
+ free[0] = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ nb_free = 1;
+
+ for (i = 1; i < VIRTIO_TX_FREE_NR; i++) {
+ m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ if (likely(m->pool == free[0]->pool))
+ free[nb_free++] = m;
+ else {
+ rte_mempool_put_bulk(free[0]->pool, (void **)free,
+ nb_free);
+ free[0] = m;
+ nb_free = 1;
+ }
+ }
+
+ rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+ vq->vq_used_cons_idx += VIRTIO_TX_FREE_NR;
+ vq->vq_free_cnt += (VIRTIO_TX_FREE_NR << 1);
+}
+
+uint16_t
+virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
+ uint16_t nb_pkts)
+{
+ struct virtqueue *txvq = tx_queue;
+ uint16_t nb_used;
+ uint16_t desc_idx;
+ struct vring_desc *start_dp;
+ uint16_t nb_tail, nb_commit;
+ int i;
+ uint16_t desc_idx_max = (txvq->vq_nentries >> 1) - 1;
+
+ nb_used = VIRTQUEUE_NUSED(txvq);
+ rte_compiler_barrier();
+
+ nb_commit = nb_pkts = RTE_MIN((txvq->vq_free_cnt >> 1), nb_pkts);
+ desc_idx = (uint16_t) (txvq->vq_avail_idx & desc_idx_max);
+ start_dp = txvq->vq_ring.desc;
+ nb_tail = (uint16_t) (desc_idx_max + 1 - desc_idx);
+
+ if (nb_used >= VIRTIO_TX_FREE_THRESH)
+ virtio_xmit_cleanup(tx_queue);
+
+ if (nb_commit >= nb_tail) {
+ for (i = 0; i < nb_tail; i++)
+ txvq->vq_descx[desc_idx + i].cookie = tx_pkts[i];
+ for (i = 0; i < nb_tail; i++) {
+ start_dp[desc_idx].addr =
+ RTE_MBUF_DATA_DMA_ADDR(*tx_pkts);
+ start_dp[desc_idx].len = (*tx_pkts)->pkt_len;
+ tx_pkts++;
+ desc_idx++;
+ }
+ nb_commit -= nb_tail;
+ desc_idx = 0;
+ }
+ for (i = 0; i < nb_commit; i++)
+ txvq->vq_descx[desc_idx + i].cookie = tx_pkts[i];
+ for (i = 0; i < nb_commit; i++) {
+ start_dp[desc_idx].addr = RTE_MBUF_DATA_DMA_ADDR(*tx_pkts);
+ start_dp[desc_idx].len = (*tx_pkts)->pkt_len;
+ tx_pkts++;
+ desc_idx++;
+ }
+
+ rte_compiler_barrier();
+
+ txvq->vq_free_cnt -= (uint16_t)(nb_pkts << 1);
+ txvq->vq_avail_idx += nb_pkts;
+ txvq->vq_ring.avail->idx = txvq->vq_avail_idx;
+ txvq->packets += nb_pkts;
+
+ if (likely(nb_pkts)) {
+ if (unlikely(virtqueue_kick_prepare(txvq)))
+ virtqueue_notify(txvq);
+ }
+
+ return nb_pkts;
+}
+
int __attribute__((cold))
virtio_rxq_vec_setup(struct virtqueue *rxq)
{
--
1.8.1.4
Stephen Hemminger
2015-10-20 18:58:29 UTC
Permalink
On Tue, 20 Oct 2015 23:30:06 +0800
Post by Huawei Xie
+ desc_idx = (uint16_t)(vq->vq_used_cons_idx &
+ ((vq->vq_nentries >> 1) - 1));
+ free[0] = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ nb_free = 1;
+
+ for (i = 1; i < VIRTIO_TX_FREE_NR; i++) {
+ m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ if (likely(m->pool == free[0]->pool))
+ free[nb_free++] = m;
+ else {
+ rte_mempool_put_bulk(free[0]->pool, (void **)free,
+ nb_free);
+ free[0] = m;
+ nb_free = 1;
+ }
+ }
+
+ rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
Might be better to introduce a function in rte_mbuf.h which
does this, so other drivers can use the same code?

rte_pktmbuf_free_bulk(pkts[], n)
Xie, Huawei
2015-10-22 05:43:54 UTC
Permalink
Post by Stephen Hemminger
On Tue, 20 Oct 2015 23:30:06 +0800
Post by Huawei Xie
+ desc_idx = (uint16_t)(vq->vq_used_cons_idx &
+ ((vq->vq_nentries >> 1) - 1));
+ free[0] = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ nb_free = 1;
+
+ for (i = 1; i < VIRTIO_TX_FREE_NR; i++) {
+ m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ if (likely(m->pool == free[0]->pool))
+ free[nb_free++] = m;
+ else {
+ rte_mempool_put_bulk(free[0]->pool, (void **)free,
+ nb_free);
+ free[0] = m;
+ nb_free = 1;
+ }
+ }
+
+ rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
Might be better to introduce a function in rte_mbuf.h which
does this so other drivers can use same code?
rte_pktmbuf_free_bulk(pkts[], n)
Agree. It would be good to have a generic rte_pktmbuf_free(/alloc)_bulk.
Several other drivers and future vhost patches use the same logic.
I prefer to implement this later, as it is an API change.
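
For reference, a hypothetical sketch of what such a helper could look like; rte_pktmbuf_free_bulk() does not exist in DPDK at this point, and the name simply follows the suggestion above. It assumes single-segment mbufs with a refcnt of 1, as in the simple TX path, and batches rte_mempool_put_bulk() per run of mbufs from the same pool.

#include <rte_mbuf.h>
#include <rte_mempool.h>

#define FREE_BULK_SZ 64   /* arbitrary batch size for this sketch */

static inline void
rte_pktmbuf_free_bulk(struct rte_mbuf **pkts, unsigned int n)
{
	void *batch[FREE_BULK_SZ];
	struct rte_mempool *pool = NULL;
	unsigned int i, nb = 0;

	for (i = 0; i < n; i++) {
		struct rte_mbuf *m = pkts[i];

		if (m == NULL)
			continue;
		/* flush the batch when the pool changes or the batch is full */
		if (m->pool != pool || nb == FREE_BULK_SZ) {
			if (nb)
				rte_mempool_put_bulk(pool, batch, nb);
			pool = m->pool;
			nb = 0;
		}
		batch[nb++] = m;
	}
	if (nb)
		rte_mempool_put_bulk(pool, batch, nb);
}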
Tan, Jianfeng
2015-10-22 02:27:25 UTC
Permalink
-----Original Message-----
Sent: Tuesday, October 20, 2015 11:30 PM
Subject: [dpdk-dev] [PATCH v3 6/7] virtio: simple tx routine
- Remove return at the end of void function
- Remove always_inline attribute for virtio_xmit_cleanup
bulk free of mbufs when clean used ring.
shift operation of idx could be saved if vq_free_cnt means free slots rather
than free descriptors.
TODO: rearrange vq data structure, pack the stats var together so that we
could use one vec instruction to update all of them.
---
drivers/net/virtio/virtio_ethdev.h | 3 ++
drivers/net/virtio/virtio_rxtx_simple.c | 93
+++++++++++++++++++++++++++++++++
2 files changed, 96 insertions(+)
diff --git a/drivers/net/virtio/virtio_ethdev.h
b/drivers/net/virtio/virtio_ethdev.h
index d7797ab..ae2d47d 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -111,6 +111,9 @@ uint16_t virtio_xmit_pkts(void *tx_queue, struct
rte_mbuf **tx_pkts, uint16_t virtio_recv_pkts_vec(void *rx_queue, struct
rte_mbuf **rx_pkts,
uint16_t nb_pkts);
+uint16_t virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
+ uint16_t nb_pkts);
+
/*
* The VIRTIO_NET_F_GUEST_TSO[46] features permit the host to send us
* frames larger than 1514 bytes. We do not yet support software LRO diff --
git a/drivers/net/virtio/virtio_rxtx_simple.c
b/drivers/net/virtio/virtio_rxtx_simple.c
index ef17562..a53d462 100644
--- a/drivers/net/virtio/virtio_rxtx_simple.c
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -288,6 +288,99 @@ virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
return nb_pkts_received;
}
+#define VIRTIO_TX_FREE_THRESH 32
+#define VIRTIO_TX_MAX_FREE_BUF_SZ 32
+#define VIRTIO_TX_FREE_NR 32
+/* TODO: vq->tx_free_cnt could mean num of free slots so we could avoid
+shift */ static inline void virtio_xmit_cleanup(struct virtqueue *vq) {
+ uint16_t i, desc_idx;
+ int nb_free = 0;
+ struct rte_mbuf *m, *free[VIRTIO_TX_MAX_FREE_BUF_SZ];
+
+ desc_idx = (uint16_t)(vq->vq_used_cons_idx &
+ ((vq->vq_nentries >> 1) - 1));
+ free[0] = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ nb_free = 1;
+
+ for (i = 1; i < VIRTIO_TX_FREE_NR; i++) {
+ m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ if (likely(m->pool == free[0]->pool))
+ free[nb_free++] = m;
+ else {
+ rte_mempool_put_bulk(free[0]->pool, (void **)free,
+ nb_free);
+ free[0] = m;
+ nb_free = 1;
+ }
+ }
+
+ rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+ vq->vq_used_cons_idx += VIRTIO_TX_FREE_NR;
+ vq->vq_free_cnt += (VIRTIO_TX_FREE_NR << 1); }
+
+uint16_t
+virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
+ uint16_t nb_pkts)
+{
+ struct virtqueue *txvq = tx_queue;
+ uint16_t nb_used;
+ uint16_t desc_idx;
+ struct vring_desc *start_dp;
+ uint16_t nb_tail, nb_commit;
+ int i;
+ uint16_t desc_idx_max = (txvq->vq_nentries >> 1) - 1;
+
+ nb_used = VIRTQUEUE_NUSED(txvq);
+ rte_compiler_barrier();
+
+ nb_commit = nb_pkts = RTE_MIN((txvq->vq_free_cnt >> 1),
nb_pkts);
Here, if nb_commit is zero, how about returning 0 immediately?
+ desc_idx = (uint16_t) (txvq->vq_avail_idx & desc_idx_max);
+ start_dp = txvq->vq_ring.desc;
+ nb_tail = (uint16_t) (desc_idx_max + 1 - desc_idx);
+
+ if (nb_used >= VIRTIO_TX_FREE_THRESH)
+ virtio_xmit_cleanup(tx_queue);
Should this cleanup be done before vq_free_cnt is referenced? It may return some freed descriptors to vq_free_cnt.
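A sketch of the reordering being suggested, which also covers the nb_commit == 0 point above (illustrative only, based on the code quoted in this mail):

	nb_used = VIRTQUEUE_NUSED(txvq);
	rte_compiler_barrier();

	/* reclaim used descriptors first so vq_free_cnt is up to date */
	if (nb_used >= VIRTIO_TX_FREE_THRESH)
		virtio_xmit_cleanup(txvq);

	nb_commit = nb_pkts = RTE_MIN(txvq->vq_free_cnt >> 1, nb_pkts);
	if (unlikely(nb_commit == 0))
		return 0;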
+
+ if (nb_commit >= nb_tail) {
+ for (i = 0; i < nb_tail; i++)
+ txvq->vq_descx[desc_idx + i].cookie = tx_pkts[i];
+ for (i = 0; i < nb_tail; i++) {
+ start_dp[desc_idx].addr =
+ RTE_MBUF_DATA_DMA_ADDR(*tx_pkts);
+ start_dp[desc_idx].len = (*tx_pkts)->pkt_len;
+ tx_pkts++;
+ desc_idx++;
+ }
+ nb_commit -= nb_tail;
+ desc_idx = 0;
+ }
+ for (i = 0; i < nb_commit; i++)
+ txvq->vq_descx[desc_idx + i].cookie = tx_pkts[i];
+ for (i = 0; i < nb_commit; i++) {
+ start_dp[desc_idx].addr =
RTE_MBUF_DATA_DMA_ADDR(*tx_pkts);
+ start_dp[desc_idx].len = (*tx_pkts)->pkt_len;
+ tx_pkts++;
+ desc_idx++;
+ }
+
+ rte_compiler_barrier();
+
+ txvq->vq_free_cnt -= (uint16_t)(nb_pkts << 1);
+ txvq->vq_avail_idx += nb_pkts;
+ txvq->vq_ring.avail->idx = txvq->vq_avail_idx;
+ txvq->packets += nb_pkts;
+
+ if (likely(nb_pkts)) {
+ if (unlikely(virtqueue_kick_prepare(txvq)))
+ virtqueue_notify(txvq);
+ }
+
+ return nb_pkts;
+}
+
int __attribute__((cold))
virtio_rxq_vec_setup(struct virtqueue *rxq) {
--
1.8.1.4
Huawei Xie
2015-10-20 15:30:03 UTC
Permalink
In a DPDK-based switching environment, vhost mostly runs on a dedicated core
while virtio processing in guest VMs runs on other cores.
Take RX for example: with the generic implementation, for each guest buffer,
a) the virtio driver allocates a descriptor from the free descriptor list
b) it modifies the entry of the avail ring to point to the allocated descriptor
c) after the packet is received, it frees the descriptor

When vhost fetches the avail ring, it needs to fetch the modified L1 cache line from
the virtio core, which is a heavy cost on current CPU implementations.

The idea of this optimization is:
allocate a fixed descriptor for each entry of the avail ring, so the avail ring will
always be the same during the run.
This removes the L1 cache transfer of the avail ring from the virtio core to the vhost core.
(Note we couldn't avoid the cache transfer for the descriptors.)
Besides, descriptor allocation and free operations are eliminated.
This also makes vector processing possible, to further accelerate the processing.

This is the layout for the avail ring (take 256 ring entries for example), with
each entry pointing to the descriptor with the same index.
avail
idx
+
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | ... | 254 | 255 | avail ring
+-+--+-+--+-+-+---------+---+--+---+
| | | | | |
| | | | | |
v v v | v v
+-+--+-+--+-+-+---------+---+--+---+
| 0 | 1 | 2 | ... | 254 | 255 | desc ring
+----+----+---+-------------+------+
|
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | | 254 | 255 | used ring
+----+----+---+-------------+------+
|
+

This is the ring layout for TX.
As we need one virtio header for each xmit packet, we have 128 slots available.

++
||
||
+-----+-----+-----+--------------+------+------+------+
| 0 | 1 | ... | 127 || 128 | 129 | ... | 255 | avail ring
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 127 | 128 | ... | 255 || 127 | 128 | ... | 255 | desc ring for virtio_net_hdr
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 0 | 1 | ... | 127 || 0 | 1 | ... | 127 | desc ring for tx dat
+-----+-----+-----+--------------+------+------+------+
||
||
++

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_rxtx.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 5c00e9d..7c82a6a 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -302,6 +302,12 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
nbufs = 0;
error = ENOSPC;

+ if (use_simple_rxtx)
+ for (i = 0; i < vq->vq_nentries; i++) {
+ vq->vq_ring.avail->ring[i] = i;
+ vq->vq_ring.desc[i].flags = VRING_DESC_F_WRITE;
+ }
+
memset(&vq->fake_mbuf, 0, sizeof(vq->fake_mbuf));
for (i = 0; i < RTE_PMD_VIRTIO_RX_MAX_BURST; i++)
vq->sw_ring[vq->vq_nentries + i] = &vq->fake_mbuf;
@@ -332,6 +338,24 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
VIRTIO_WRITE_REG_4(vq->hw, VIRTIO_PCI_QUEUE_PFN,
vq->mz->phys_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT);
} else if (queue_type == VTNET_TQ) {
+ if (use_simple_rxtx) {
+ int mid_idx = vq->vq_nentries >> 1;
+ for (i = 0; i < mid_idx; i++) {
+ vq->vq_ring.avail->ring[i] = i + mid_idx;
+ vq->vq_ring.desc[i + mid_idx].next = i;
+ vq->vq_ring.desc[i + mid_idx].addr =
+ vq->virtio_net_hdr_mem +
+ mid_idx * vq->hw->vtnet_hdr_size;
+ vq->vq_ring.desc[i + mid_idx].len =
+ vq->hw->vtnet_hdr_size;
+ vq->vq_ring.desc[i + mid_idx].flags =
+ VRING_DESC_F_NEXT;
+ vq->vq_ring.desc[i].flags = 0;
+ }
+ for (i = mid_idx; i < vq->vq_nentries; i++)
+ vq->vq_ring.avail->ring[i] = i;
+ }
+
VIRTIO_WRITE_REG_2(vq->hw, VIRTIO_PCI_QUEUE_SEL,
vq->vq_queue_index);
VIRTIO_WRITE_REG_4(vq->hw, VIRTIO_PCI_QUEUE_PFN,
--
1.8.1.4
Huawei Xie
2015-10-20 15:30:05 UTC
Permalink
With the fixed avail ring, we don't need to get the desc idx from the avail ring;
the virtio driver only has to deal with the desc ring.
This patch uses vector instructions to accelerate processing of the desc ring.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.h | 2 +
drivers/net/virtio/virtio_rxtx.c | 3 +
drivers/net/virtio/virtio_rxtx.h | 2 +
drivers/net/virtio/virtio_rxtx_simple.c | 224 ++++++++++++++++++++++++++++++++
drivers/net/virtio/virtqueue.h | 1 +
5 files changed, 232 insertions(+)

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index 9026d42..d7797ab 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -108,6 +108,8 @@ uint16_t virtio_recv_mergeable_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t nb_pkts);

+uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+ uint16_t nb_pkts);

/*
* The VIRTIO_NET_F_GUEST_TSO[46] features permit the host to send us
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 5162ce6..947fc46 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -432,6 +432,9 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
vq->mpool = mp;

dev->data->rx_queues[queue_idx] = vq;
+
+ virtio_rxq_vec_setup(vq);
+
return 0;
}

diff --git a/drivers/net/virtio/virtio_rxtx.h b/drivers/net/virtio/virtio_rxtx.h
index 7d2d8fe..831e492 100644
--- a/drivers/net/virtio/virtio_rxtx.h
+++ b/drivers/net/virtio/virtio_rxtx.h
@@ -33,5 +33,7 @@

#define RTE_PMD_VIRTIO_RX_MAX_BURST 64

+int virtio_rxq_vec_setup(struct virtqueue *rxq);
+
int virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
struct rte_mbuf *m);
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
index cac5b9f..ef17562 100644
--- a/drivers/net/virtio/virtio_rxtx_simple.c
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -58,6 +58,10 @@
#include "virtqueue.h"
#include "virtio_rxtx.h"

+#define RTE_VIRTIO_VPMD_RX_BURST 32
+#define RTE_VIRTIO_DESC_PER_LOOP 8
+#define RTE_VIRTIO_VPMD_RX_REARM_THRESH RTE_VIRTIO_VPMD_RX_BURST
+
int __attribute__((cold))
virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
struct rte_mbuf *cookie)
@@ -82,3 +86,223 @@ virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,

return 0;
}
+
+static inline void
+virtio_rxq_rearm_vec(struct virtqueue *rxvq)
+{
+ int i;
+ uint16_t desc_idx;
+ struct rte_mbuf **sw_ring;
+ struct vring_desc *start_dp;
+ int ret;
+
+ desc_idx = rxvq->vq_avail_idx & (rxvq->vq_nentries - 1);
+ sw_ring = &rxvq->sw_ring[desc_idx];
+ start_dp = &rxvq->vq_ring.desc[desc_idx];
+
+ ret = rte_mempool_get_bulk(rxvq->mpool, (void **)sw_ring,
+ RTE_VIRTIO_VPMD_RX_REARM_THRESH);
+ if (unlikely(ret)) {
+ rte_eth_devices[rxvq->port_id].data->rx_mbuf_alloc_failed +=
+ RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ return;
+ }
+
+ for (i = 0; i < RTE_VIRTIO_VPMD_RX_REARM_THRESH; i++) {
+ uintptr_t p;
+
+ p = (uintptr_t)&sw_ring[i]->rearm_data;
+ *(uint64_t *)p = rxvq->mbuf_initializer;
+
+ start_dp[i].addr =
+ (uint64_t)((uintptr_t)sw_ring[i]->buf_physaddr +
+ RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
+ start_dp[i].len = sw_ring[i]->buf_len -
+ RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);
+ }
+
+ rxvq->vq_avail_idx += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ rxvq->vq_free_cnt -= RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ vq_update_avail_idx(rxvq);
+}
+
+/* virtio vPMD receive routine, only accept(nb_pkts >= RTE_VIRTIO_DESC_PER_LOOP)
+ *
+ * This routine is for non-mergable RX, one desc for each guest buffer.
+ * This routine is based on the RX ring layout optimization. Each entry in the
+ * avail ring points to the desc with the same index in the desc ring and this
+ * will never be changed in the driver.
+ *
+ * - nb_pkts < RTE_VIRTIO_DESC_PER_LOOP, just return no packet
+ */
+uint16_t
+virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+ uint16_t nb_pkts)
+{
+ struct virtqueue *rxvq = rx_queue;
+ uint16_t nb_used;
+ uint16_t desc_idx;
+ struct vring_used_elem *rused;
+ struct rte_mbuf **sw_ring;
+ struct rte_mbuf **sw_ring_end;
+ uint16_t nb_pkts_received;
+ __m128i shuf_msk1, shuf_msk2, len_adjust;
+
+ shuf_msk1 = _mm_set_epi8(
+ 0xFF, 0xFF, 0xFF, 0xFF,
+ 0xFF, 0xFF, /* vlan tci */
+ 5, 4, /* dat len */
+ 0xFF, 0xFF, 5, 4, /* pkt len */
+ 0xFF, 0xFF, 0xFF, 0xFF /* packet type */
+
+ );
+
+ shuf_msk2 = _mm_set_epi8(
+ 0xFF, 0xFF, 0xFF, 0xFF,
+ 0xFF, 0xFF, /* vlan tci */
+ 13, 12, /* dat len */
+ 0xFF, 0xFF, 13, 12, /* pkt len */
+ 0xFF, 0xFF, 0xFF, 0xFF /* packet type */
+ );
+
+ /* Subtract the header length.
+ * In which case do we need the header length in used->len ?
+ */
+ len_adjust = _mm_set_epi16(
+ 0, 0,
+ 0,
+ (uint16_t) -sizeof(struct virtio_net_hdr),
+ 0, (uint16_t) -sizeof(struct virtio_net_hdr),
+ 0, 0);
+
+ if (unlikely(nb_pkts < RTE_VIRTIO_DESC_PER_LOOP))
+ return 0;
+
+ nb_used = *(volatile uint16_t *)&rxvq->vq_ring.used->idx -
+ rxvq->vq_used_cons_idx;
+
+ rte_compiler_barrier();
+
+ if (unlikely(nb_used == 0))
+ return 0;
+
+ nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, RTE_VIRTIO_DESC_PER_LOOP);
+ nb_used = RTE_MIN(nb_used, nb_pkts);
+
+ desc_idx = (uint16_t)(rxvq->vq_used_cons_idx & (rxvq->vq_nentries - 1));
+ rused = &rxvq->vq_ring.used->ring[desc_idx];
+ sw_ring = &rxvq->sw_ring[desc_idx];
+ sw_ring_end = &rxvq->sw_ring[rxvq->vq_nentries];
+
+ _mm_prefetch((const void *)rused, _MM_HINT_T0);
+
+ if (rxvq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
+ virtio_rxq_rearm_vec(rxvq);
+ if (unlikely(virtqueue_kick_prepare(rxvq)))
+ virtqueue_notify(rxvq);
+ }
+
+ for (nb_pkts_received = 0;
+ nb_pkts_received < nb_used;) {
+ __m128i desc[RTE_VIRTIO_DESC_PER_LOOP / 2];
+ __m128i mbp[RTE_VIRTIO_DESC_PER_LOOP / 2];
+ __m128i pkt_mb[RTE_VIRTIO_DESC_PER_LOOP];
+
+ mbp[0] = _mm_loadu_si128((__m128i *)(sw_ring + 0));
+ desc[0] = _mm_loadu_si128((__m128i *)(rused + 0));
+ _mm_storeu_si128((__m128i *)&rx_pkts[0], mbp[0]);
+
+ mbp[1] = _mm_loadu_si128((__m128i *)(sw_ring + 2));
+ desc[1] = _mm_loadu_si128((__m128i *)(rused + 2));
+ _mm_storeu_si128((__m128i *)&rx_pkts[2], mbp[1]);
+
+ mbp[2] = _mm_loadu_si128((__m128i *)(sw_ring + 4));
+ desc[2] = _mm_loadu_si128((__m128i *)(rused + 4));
+ _mm_storeu_si128((__m128i *)&rx_pkts[4], mbp[2]);
+
+ mbp[3] = _mm_loadu_si128((__m128i *)(sw_ring + 6));
+ desc[3] = _mm_loadu_si128((__m128i *)(rused + 6));
+ _mm_storeu_si128((__m128i *)&rx_pkts[6], mbp[3]);
+
+ pkt_mb[1] = _mm_shuffle_epi8(desc[0], shuf_msk2);
+ pkt_mb[0] = _mm_shuffle_epi8(desc[0], shuf_msk1);
+ pkt_mb[1] = _mm_add_epi16(pkt_mb[1], len_adjust);
+ pkt_mb[0] = _mm_add_epi16(pkt_mb[0], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[1]->rx_descriptor_fields1,
+ pkt_mb[1]);
+ _mm_storeu_si128((void *)&rx_pkts[0]->rx_descriptor_fields1,
+ pkt_mb[0]);
+
+ pkt_mb[3] = _mm_shuffle_epi8(desc[1], shuf_msk2);
+ pkt_mb[2] = _mm_shuffle_epi8(desc[1], shuf_msk1);
+ pkt_mb[3] = _mm_add_epi16(pkt_mb[3], len_adjust);
+ pkt_mb[2] = _mm_add_epi16(pkt_mb[2], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[3]->rx_descriptor_fields1,
+ pkt_mb[3]);
+ _mm_storeu_si128((void *)&rx_pkts[2]->rx_descriptor_fields1,
+ pkt_mb[2]);
+
+ pkt_mb[5] = _mm_shuffle_epi8(desc[2], shuf_msk2);
+ pkt_mb[4] = _mm_shuffle_epi8(desc[2], shuf_msk1);
+ pkt_mb[5] = _mm_add_epi16(pkt_mb[5], len_adjust);
+ pkt_mb[4] = _mm_add_epi16(pkt_mb[4], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[5]->rx_descriptor_fields1,
+ pkt_mb[5]);
+ _mm_storeu_si128((void *)&rx_pkts[4]->rx_descriptor_fields1,
+ pkt_mb[4]);
+
+ pkt_mb[7] = _mm_shuffle_epi8(desc[3], shuf_msk2);
+ pkt_mb[6] = _mm_shuffle_epi8(desc[3], shuf_msk1);
+ pkt_mb[7] = _mm_add_epi16(pkt_mb[7], len_adjust);
+ pkt_mb[6] = _mm_add_epi16(pkt_mb[6], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[7]->rx_descriptor_fields1,
+ pkt_mb[7]);
+ _mm_storeu_si128((void *)&rx_pkts[6]->rx_descriptor_fields1,
+ pkt_mb[6]);
+
+ if (unlikely(nb_used <= RTE_VIRTIO_DESC_PER_LOOP)) {
+ if (sw_ring + nb_used <= sw_ring_end)
+ nb_pkts_received += nb_used;
+ else
+ nb_pkts_received += sw_ring_end - sw_ring;
+ break;
+ } else {
+ if (unlikely(sw_ring + RTE_VIRTIO_DESC_PER_LOOP >=
+ sw_ring_end)) {
+ nb_pkts_received += sw_ring_end - sw_ring;
+ break;
+ } else {
+ nb_pkts_received += RTE_VIRTIO_DESC_PER_LOOP;
+
+ rx_pkts += RTE_VIRTIO_DESC_PER_LOOP;
+ sw_ring += RTE_VIRTIO_DESC_PER_LOOP;
+ rused += RTE_VIRTIO_DESC_PER_LOOP;
+ nb_used -= RTE_VIRTIO_DESC_PER_LOOP;
+ }
+ }
+ }
+
+ rxvq->vq_used_cons_idx += nb_pkts_received;
+ rxvq->vq_free_cnt += nb_pkts_received;
+ rxvq->packets += nb_pkts_received;
+ return nb_pkts_received;
+}
+
+int __attribute__((cold))
+virtio_rxq_vec_setup(struct virtqueue *rxq)
+{
+ uintptr_t p;
+ struct rte_mbuf mb_def = { .buf_addr = 0 }; /* zeroed mbuf */
+
+ mb_def.nb_segs = 1;
+ mb_def.data_off = RTE_PKTMBUF_HEADROOM;
+ mb_def.port = rxq->port_id;
+ rte_mbuf_refcnt_set(&mb_def, 1);
+
+ /* prevent compiler reordering: rearm_data covers previous fields */
+ rte_compiler_barrier();
+ p = (uintptr_t)&mb_def.rearm_data;
+ rxq->mbuf_initializer = *(uint64_t *)p;
+
+ return 0;
+}
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 6a1ec48..98a77d5 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -188,6 +188,7 @@ struct virtqueue {
*/
uint16_t vq_used_cons_idx;
uint16_t vq_avail_idx;
+ uint64_t mbuf_initializer; /**< value to init mbufs. */
phys_addr_t virtio_net_hdr_mem; /**< hdr for each xmit packet */

struct rte_mbuf **sw_ring; /**< RX software ring. */
--
1.8.1.4
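A short annotated sketch (not part of the patch; it only restates what the code above does, assuming the standard vring_used_elem layout) of how the two shuffle masks map used-ring lengths into the mbuf may help:

/*
 * Each 16-byte load from the used ring covers two entries:
 *
 *     struct vring_used_elem { uint32_t id; uint32_t len; };   8 bytes each
 *
 *     byte offset:   0..3    4..7    8..11   12..15
 *                    id0     len0    id1     len1
 *
 * shuf_msk1 places bytes 5,4 (the low 16 bits of len0) into the pkt_len and
 * data_len slots of rx_descriptor_fields1; shuf_msk2 does the same with bytes
 * 13,12 (len1).  len_adjust then subtracts sizeof(struct virtio_net_hdr) from
 * both length fields, since the length reported in used->len includes the
 * virtio header that precedes the packet data.
 */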
Wang, Zhihong
2015-10-22 04:04:28 UTC
Permalink
-----Original Message-----
Sent: Tuesday, October 20, 2015 11:30 PM
Subject: [dpdk-dev] [PATCH v3 5/7] virtio: virtio vec rx
With fixed avail ring, we don't need to get desc idx from avail ring.
virtio driver only has to deal with desc ring.
This patch uses vector instruction to accelerate processing desc ring.
---
drivers/net/virtio/virtio_ethdev.h | 2 +
drivers/net/virtio/virtio_rxtx.c | 3 +
drivers/net/virtio/virtio_rxtx.h | 2 +
drivers/net/virtio/virtio_rxtx_simple.c | 224
++++++++++++++++++++++++++++++++
drivers/net/virtio/virtqueue.h | 1 +
5 files changed, 232 insertions(+)
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index 9026d42..d7797ab 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -108,6 +108,8 @@ uint16_t virtio_recv_mergeable_pkts(void *rx_queue,
struct rte_mbuf **rx_pkts,
uint16_t virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t nb_pkts);
+uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+ uint16_t nb_pkts);
/*
* The VIRTIO_NET_F_GUEST_TSO[46] features permit the host to send us
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 5162ce6..947fc46 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -432,6 +432,9 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
vq->mpool = mp;
dev->data->rx_queues[queue_idx] = vq;
+
+ virtio_rxq_vec_setup(vq);
+
return 0;
}
diff --git a/drivers/net/virtio/virtio_rxtx.h b/drivers/net/virtio/virtio_rxtx.h
index 7d2d8fe..831e492 100644
--- a/drivers/net/virtio/virtio_rxtx.h
+++ b/drivers/net/virtio/virtio_rxtx.h
@@ -33,5 +33,7 @@
#define RTE_PMD_VIRTIO_RX_MAX_BURST 64
+int virtio_rxq_vec_setup(struct virtqueue *rxq);
+
int virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
struct rte_mbuf *m);
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c
b/drivers/net/virtio/virtio_rxtx_simple.c
index cac5b9f..ef17562 100644
--- a/drivers/net/virtio/virtio_rxtx_simple.c
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -58,6 +58,10 @@
#include "virtqueue.h"
#include "virtio_rxtx.h"
+#define RTE_VIRTIO_VPMD_RX_BURST 32
+#define RTE_VIRTIO_DESC_PER_LOOP 8
+#define RTE_VIRTIO_VPMD_RX_REARM_THRESH
RTE_VIRTIO_VPMD_RX_BURST
+
int __attribute__((cold))
virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
struct rte_mbuf *cookie)
@@ -82,3 +86,223 @@ virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
return 0;
}
+
+static inline void
+virtio_rxq_rearm_vec(struct virtqueue *rxvq)
+{
+ int i;
+ uint16_t desc_idx;
+ struct rte_mbuf **sw_ring;
+ struct vring_desc *start_dp;
+ int ret;
+
+ desc_idx = rxvq->vq_avail_idx & (rxvq->vq_nentries - 1);
+ sw_ring = &rxvq->sw_ring[desc_idx];
+ start_dp = &rxvq->vq_ring.desc[desc_idx];
+
+ ret = rte_mempool_get_bulk(rxvq->mpool, (void **)sw_ring,
+ RTE_VIRTIO_VPMD_RX_REARM_THRESH);
+ if (unlikely(ret)) {
+ rte_eth_devices[rxvq->port_id].data->rx_mbuf_alloc_failed +=
+ RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ return;
+ }
+
+ for (i = 0; i < RTE_VIRTIO_VPMD_RX_REARM_THRESH; i++) {
+ uintptr_t p;
+
+ p = (uintptr_t)&sw_ring[i]->rearm_data;
+ *(uint64_t *)p = rxvq->mbuf_initializer;
+
+ start_dp[i].addr =
+ (uint64_t)((uintptr_t)sw_ring[i]->buf_physaddr +
+ RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
+ start_dp[i].len = sw_ring[i]->buf_len -
+ RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);
+ }
+
+ rxvq->vq_avail_idx += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ rxvq->vq_free_cnt -= RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ vq_update_avail_idx(rxvq);
+}
+
+/* virtio vPMD receive routine, only accept(nb_pkts >=
RTE_VIRTIO_DESC_PER_LOOP)
+ *
+ * This routine is for non-mergable RX, one desc for each guest buffer.
+ * This routine is based on the RX ring layout optimization. Each entry in the
+ * avail ring points to the desc with the same index in the desc ring and this
+ * will never be changed in the driver.
+ *
+ * - nb_pkts < RTE_VIRTIO_DESC_PER_LOOP, just return no packet
+ */
+uint16_t
+virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+ uint16_t nb_pkts)
+{
+ struct virtqueue *rxvq = rx_queue;
+ uint16_t nb_used;
+ uint16_t desc_idx;
+ struct vring_used_elem *rused;
+ struct rte_mbuf **sw_ring;
+ struct rte_mbuf **sw_ring_end;
+ uint16_t nb_pkts_received;
+ __m128i shuf_msk1, shuf_msk2, len_adjust;
+
+ shuf_msk1 = _mm_set_epi8(
+ 0xFF, 0xFF, 0xFF, 0xFF,
+ 0xFF, 0xFF, /* vlan tci */
+ 5, 4, /* dat len */
+ 0xFF, 0xFF, 5, 4, /* pkt len */
+ 0xFF, 0xFF, 0xFF, 0xFF /* packet type */
+
+ );
+
+ shuf_msk2 = _mm_set_epi8(
+ 0xFF, 0xFF, 0xFF, 0xFF,
+ 0xFF, 0xFF, /* vlan tci */
+ 13, 12, /* dat len */
+ 0xFF, 0xFF, 13, 12, /* pkt len */
+ 0xFF, 0xFF, 0xFF, 0xFF /* packet type */
+ );
+
+ /* Substract the header length.
+ * In which case do we need the header length in used->len ?
+ */
+ len_adjust = _mm_set_epi16(
+ 0, 0,
+ 0,
+ (uint16_t) -sizeof(struct virtio_net_hdr),
+ 0, (uint16_t) -sizeof(struct virtio_net_hdr),
+ 0, 0);
+
+ if (unlikely(nb_pkts < RTE_VIRTIO_DESC_PER_LOOP))
+ return 0;
+
+ nb_used = *(volatile uint16_t *)&rxvq->vq_ring.used->idx -
+ rxvq->vq_used_cons_idx;
+
+ rte_compiler_barrier();
+
+ if (unlikely(nb_used == 0))
+ return 0;
+
+ nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, RTE_VIRTIO_DESC_PER_LOOP);
+ nb_used = RTE_MIN(nb_used, nb_pkts);
+
+ desc_idx = (uint16_t)(rxvq->vq_used_cons_idx & (rxvq->vq_nentries - 1));
+ rused = &rxvq->vq_ring.used->ring[desc_idx];
+ sw_ring = &rxvq->sw_ring[desc_idx];
+ sw_ring_end = &rxvq->sw_ring[rxvq->vq_nentries];
+
+ _mm_prefetch((const void *)rused, _MM_HINT_T0);
Wonder if the prefetch will actually help here.
Will prefetching rx_pkts[i] be more helpful?
+
+ if (rxvq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
+ virtio_rxq_rearm_vec(rxvq);
+ if (unlikely(virtqueue_kick_prepare(rxvq)))
+ virtqueue_notify(rxvq);
+ }
+
+ for (nb_pkts_received = 0;
+ nb_pkts_received < nb_used;) {
+ __m128i desc[RTE_VIRTIO_DESC_PER_LOOP / 2];
+ __m128i mbp[RTE_VIRTIO_DESC_PER_LOOP / 2];
+ __m128i pkt_mb[RTE_VIRTIO_DESC_PER_LOOP];
+
+ mbp[0] = _mm_loadu_si128((__m128i *)(sw_ring + 0));
+ desc[0] = _mm_loadu_si128((__m128i *)(rused + 0));
+ _mm_storeu_si128((__m128i *)&rx_pkts[0], mbp[0]);
+
+ mbp[1] = _mm_loadu_si128((__m128i *)(sw_ring + 2));
+ desc[1] = _mm_loadu_si128((__m128i *)(rused + 2));
+ _mm_storeu_si128((__m128i *)&rx_pkts[2], mbp[1]);
+
+ mbp[2] = _mm_loadu_si128((__m128i *)(sw_ring + 4));
+ desc[2] = _mm_loadu_si128((__m128i *)(rused + 4));
+ _mm_storeu_si128((__m128i *)&rx_pkts[4], mbp[2]);
+
+ mbp[3] = _mm_loadu_si128((__m128i *)(sw_ring + 6));
+ desc[3] = _mm_loadu_si128((__m128i *)(rused + 6));
+ _mm_storeu_si128((__m128i *)&rx_pkts[6], mbp[3]);
+
+ pkt_mb[1] = _mm_shuffle_epi8(desc[0], shuf_msk2);
+ pkt_mb[0] = _mm_shuffle_epi8(desc[0], shuf_msk1);
+ pkt_mb[1] = _mm_add_epi16(pkt_mb[1], len_adjust);
+ pkt_mb[0] = _mm_add_epi16(pkt_mb[0], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[1]->rx_descriptor_fields1,
+ pkt_mb[1]);
+ _mm_storeu_si128((void *)&rx_pkts[0]->rx_descriptor_fields1,
+ pkt_mb[0]);
+
+ pkt_mb[3] = _mm_shuffle_epi8(desc[1], shuf_msk2);
+ pkt_mb[2] = _mm_shuffle_epi8(desc[1], shuf_msk1);
+ pkt_mb[3] = _mm_add_epi16(pkt_mb[3], len_adjust);
+ pkt_mb[2] = _mm_add_epi16(pkt_mb[2], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[3]->rx_descriptor_fields1,
+ pkt_mb[3]);
+ _mm_storeu_si128((void *)&rx_pkts[2]->rx_descriptor_fields1,
+ pkt_mb[2]);
+
+ pkt_mb[5] = _mm_shuffle_epi8(desc[2], shuf_msk2);
+ pkt_mb[4] = _mm_shuffle_epi8(desc[2], shuf_msk1);
+ pkt_mb[5] = _mm_add_epi16(pkt_mb[5], len_adjust);
+ pkt_mb[4] = _mm_add_epi16(pkt_mb[4], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[5]->rx_descriptor_fields1,
+ pkt_mb[5]);
+ _mm_storeu_si128((void *)&rx_pkts[4]->rx_descriptor_fields1,
+ pkt_mb[4]);
+
+ pkt_mb[7] = _mm_shuffle_epi8(desc[3], shuf_msk2);
+ pkt_mb[6] = _mm_shuffle_epi8(desc[3], shuf_msk1);
+ pkt_mb[7] = _mm_add_epi16(pkt_mb[7], len_adjust);
+ pkt_mb[6] = _mm_add_epi16(pkt_mb[6], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[7]->rx_descriptor_fields1,
+ pkt_mb[7]);
+ _mm_storeu_si128((void *)&rx_pkts[6]->rx_descriptor_fields1,
+ pkt_mb[6]);
+
+ if (unlikely(nb_used <= RTE_VIRTIO_DESC_PER_LOOP)) {
+ if (sw_ring + nb_used <= sw_ring_end)
+ nb_pkts_received += nb_used;
+ else
+ nb_pkts_received += sw_ring_end - sw_ring;
+ break;
+ } else {
+ if (unlikely(sw_ring + RTE_VIRTIO_DESC_PER_LOOP >=
+ sw_ring_end)) {
+ nb_pkts_received += sw_ring_end - sw_ring;
+ break;
+ } else {
+ nb_pkts_received += RTE_VIRTIO_DESC_PER_LOOP;
+
+ rx_pkts += RTE_VIRTIO_DESC_PER_LOOP;
+ sw_ring += RTE_VIRTIO_DESC_PER_LOOP;
+ rused += RTE_VIRTIO_DESC_PER_LOOP;
+ nb_used -= RTE_VIRTIO_DESC_PER_LOOP;
+ }
+ }
+ }
+
+ rxvq->vq_used_cons_idx += nb_pkts_received;
+ rxvq->vq_free_cnt += nb_pkts_received;
+ rxvq->packets += nb_pkts_received;
+ return nb_pkts_received;
+}
+
+int __attribute__((cold))
+virtio_rxq_vec_setup(struct virtqueue *rxq)
+{
+ uintptr_t p;
+ struct rte_mbuf mb_def = { .buf_addr = 0 }; /* zeroed mbuf */
+
+ mb_def.nb_segs = 1;
+ mb_def.data_off = RTE_PKTMBUF_HEADROOM;
+ mb_def.port = rxq->port_id;
+ rte_mbuf_refcnt_set(&mb_def, 1);
+
+ /* prevent compiler reordering: rearm_data covers previous fields */
+ rte_compiler_barrier();
+ p = (uintptr_t)&mb_def.rearm_data;
+ rxq->mbuf_initializer = *(uint64_t *)p;
+
+ return 0;
+}
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 6a1ec48..98a77d5 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -188,6 +188,7 @@ struct virtqueue {
*/
uint16_t vq_used_cons_idx;
uint16_t vq_avail_idx;
+ uint64_t mbuf_initializer; /**< value to init mbufs. */
phys_addr_t virtio_net_hdr_mem; /**< hdr for each xmit packet */
struct rte_mbuf **sw_ring; /**< RX software ring. */
--
1.8.1.4
Xie, Huawei
2015-10-22 05:48:47 UTC
Permalink
Post by Wang, Zhihong
Wonder if the prefetch will actually help here.
Will prefetching rx_pkts[i] be more helpful?
What is your concern with prefetching the virtio ring?
rx_pkts is a local array; why would we need to prefetch it?
Huawei Xie
2015-10-20 15:30:07 UTC
Permalink
The simple rx/tx functions are enabled when the user specifies single-segment and no-offload support.
Mergeable RX buffers must be disabled to use the simple rx/tx path.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_rxtx.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 947fc46..71f8cd4 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -62,6 +62,10 @@
#define VIRTIO_DUMP_PACKET(m, len) do { } while (0)
#endif

+
+#define VIRTIO_SIMPLE_FLAGS ((uint32_t)ETH_TXQ_FLAGS_NOMULTSEGS | \
+ ETH_TXQ_FLAGS_NOOFFLOADS)
+
static int use_simple_rxtx;

static void
@@ -471,6 +475,14 @@ virtio_dev_tx_queue_setup(struct rte_eth_dev *dev,
return -EINVAL;
}

+ /* Use simple rx/tx func if single segment and no offloads */
+ if ((tx_conf->txq_flags & VIRTIO_SIMPLE_FLAGS) == VIRTIO_SIMPLE_FLAGS) {
+ PMD_INIT_LOG(INFO, "Using simple rx/tx path");
+ dev->tx_pkt_burst = virtio_xmit_pkts_simple;
+ dev->rx_pkt_burst = virtio_recv_pkts_vec;
+ use_simple_rxtx = 1;
+ }
+
ret = virtio_dev_queue_setup(dev, VTNET_TQ, queue_idx, vtpci_queue_idx,
nb_desc, socket_id, &vq);
if (ret < 0) {
--
1.8.1.4
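A minimal usage sketch (not part of the patch; the port id, queue id and descriptor count are placeholders) showing how an application on this era of the ethdev API would request the simple path:

#include <string.h>
#include <rte_ethdev.h>

/* Configure one TX queue so that the virtio PMD selects the simple path:
 * single segment, no offloads (mergeable RX buffers must also not be
 * negotiated, see the discussion below). */
static int
setup_simple_txq(uint8_t port_id)
{
	struct rte_eth_txconf txconf;

	memset(&txconf, 0, sizeof(txconf));
	txconf.txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS | ETH_TXQ_FLAGS_NOOFFLOADS;

	return rte_eth_tx_queue_setup(port_id, 0, 256,
			rte_eth_dev_socket_id(port_id), &txconf);
}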
Tan, Jianfeng
2015-10-22 02:50:36 UTC
Permalink
-----Original Message-----
Sent: Tuesday, October 20, 2015 11:30 PM
Subject: [dpdk-dev] [PATCH v3 7/7] virtio: pick simple rx/tx func
simple rx/tx func is enabled when user specifies single segment and no offload support.
merge-able should be disabled to use simple rxtx.
---
drivers/net/virtio/virtio_rxtx.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 947fc46..71f8cd4 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -62,6 +62,10 @@
#define VIRTIO_DUMP_PACKET(m, len) do { } while (0) #endif
+
+#define VIRTIO_SIMPLE_FLAGS ((uint32_t)ETH_TXQ_FLAGS_NOMULTSEGS | \
+ ETH_TXQ_FLAGS_NOOFFLOADS)
+
static int use_simple_rxtx;
static void
@@ -471,6 +475,14 @@ virtio_dev_tx_queue_setup(struct rte_eth_dev *dev,
return -EINVAL;
}
+ /* Use simple rx/tx func if single segment and no offloads */
+ if ((tx_conf->txq_flags & VIRTIO_SIMPLE_FLAGS) ==
VIRTIO_SIMPLE_FLAGS) {
+ PMD_INIT_LOG(INFO, "Using simple rx/tx path");
+ dev->tx_pkt_burst = virtio_xmit_pkts_simple;
+ dev->rx_pkt_burst = virtio_recv_pkts_vec;
Whether recv side mergeable is supported is controlled by virtio_negotiate_feature().
So "dev->rx_pkt_burst = virtio_recv_pkts_vec" should be restricted by
hw->guest_features & VIRTIO_NET_F_MRG_RXBUF, right?
+ use_simple_rxtx = 1;
+ }
+
ret = virtio_dev_queue_setup(dev, VTNET_TQ, queue_idx,
vtpci_queue_idx,
nb_desc, socket_id, &vq);
if (ret < 0) {
--
1.8.1.4
Xie, Huawei
2015-10-22 11:40:29 UTC
Permalink
Post by Tan, Jianfeng
-----Original Message-----
Sent: Tuesday, October 20, 2015 11:30 PM
Subject: [dpdk-dev] [PATCH v3 7/7] virtio: pick simple rx/tx func
simple rx/tx func is enabled when user specifies single segment and no offload support.
merge-able should be disabled to use simple rxtx.
---
drivers/net/virtio/virtio_rxtx.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 947fc46..71f8cd4 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -62,6 +62,10 @@
#define VIRTIO_DUMP_PACKET(m, len) do { } while (0) #endif
+
+#define VIRTIO_SIMPLE_FLAGS ((uint32_t)ETH_TXQ_FLAGS_NOMULTSEGS | \
+ ETH_TXQ_FLAGS_NOOFFLOADS)
+
static int use_simple_rxtx;
static void
@@ -471,6 +475,14 @@ virtio_dev_tx_queue_setup(struct rte_eth_dev *dev,
return -EINVAL;
}
+ /* Use simple rx/tx func if single segment and no offloads */
+ if ((tx_conf->txq_flags & VIRTIO_SIMPLE_FLAGS) ==
VIRTIO_SIMPLE_FLAGS) {
+ PMD_INIT_LOG(INFO, "Using simple rx/tx path");
+ dev->tx_pkt_burst = virtio_xmit_pkts_simple;
+ dev->rx_pkt_burst = virtio_recv_pkts_vec;
Whether recv side mergeable is supported is controlled by virtio_negotiate_feature().
So "dev->rx_pkt_burst = virtio_recv_pkts_vec" should be restricted by
hw->guest_features & VIRTIO_NET_F_MRG_RXBUF, right?
I will add this check in the next version. However, the assignment will still be put here, as we
want to leave ourselves a chance to dynamically choose between the normal and simple rx functions.
Post by Tan, Jianfeng
+ use_simple_rxtx = 1;
+ }
+
ret = virtio_dev_queue_setup(dev, VTNET_TQ, queue_idx,
vtpci_queue_idx,
nb_desc, socket_id, &vq);
if (ret < 0) {
--
1.8.1.4
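A sketch of the check discussed above (assuming the vtpci_with_feature() helper from the virtio PMD of this era; this is illustrative, not the final submitted code):

	/* only pick the vector RX path when mergeable RX buffers were not negotiated */
	if ((tx_conf->txq_flags & VIRTIO_SIMPLE_FLAGS) == VIRTIO_SIMPLE_FLAGS &&
	    !vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
		PMD_INIT_LOG(INFO, "Using simple rx/tx path");
		dev->tx_pkt_burst = virtio_xmit_pkts_simple;
		dev->rx_pkt_burst = virtio_recv_pkts_vec;
		use_simple_rxtx = 1;
	}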
Huawei Xie
2015-10-22 12:09:44 UTC
Permalink
Changes in v2:
- Remove the configure macro
- Enable simple RX/TX processing when the user specifies simple txq flags
- Reword some comments and commit messages

Changes in v3:
- Remove unnecessary NULL test for rte_free
- Remove unnecessary assign of local var after free
- Remove return at the end of void function
- Remove always_inline attribute for virtio_xmit_cleanup
- Reword some commit messages
- Add TODO in the commit message of simple tx patch

Changes in v4:
- Fix the error in virtio tx ring layout ascii chart in the commit message
- Move virtio_xmit_cleanup ahead to free descriptors earlier
- Test the mergeable feature when selecting the simple rx/tx functions

In a DPDK-based switching environment, vhost mostly runs on a dedicated core
while virtio processing in guest VMs runs on other cores.
Take RX for example: with the generic implementation, for each guest buffer,
a) the virtio driver allocates a descriptor from the free descriptor list
b) it modifies an avail ring entry to point to the allocated descriptor
c) after the packet is received, it frees the descriptor

When vhost fetches the avail ring, it needs to fetch the modified L1 cache line from
the virtio core, which is a heavy cost on current CPU implementations.

The idea of this optimization is:
allocate a fixed descriptor for each entry of the avail ring, so the avail ring will
always stay the same during the run.
This removes the modified-L1-cache-line transfer from the virtio core to the vhost core for the avail ring.
(Note we couldn't avoid the cache transfer for the descriptors themselves.)
Besides, descriptor allocation and free operations are eliminated.
This also makes vector processing possible to further accelerate the processing.

This is the layout for the avail ring(take 256 ring entries for example), with
each entry pointing to the descriptor with the same index.
avail
idx
+
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | ... | 254 | 255 | avail ring
+-+--+-+--+-+-+---------+---+--+---+
| | | | | |
| | | | | |
v v v | v v
+-+--+-+--+-+-+---------+---+--+---+
| 0 | 1 | 2 | ... | 254 | 255 | desc ring
+----+----+---+-------------+------+
|
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | | 254 | 255 | used ring
+----+----+---+-------------+------+
|
+

This is the ring layout for TX.
As we need one virtio header for each xmit packet, we have 128 slots available.

++
||
||
+-----+-----+-----+--------------+------+------+------+
| 0 | 1 | ... | 127 || 128 | 129 | ... | 255 | avail ring
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 127 | 128 | ... | 255 || 127 | 128 | ... | 255 | desc ring for virtio_net_hdr
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 0 | 1 | ... | 127 || 0 | 1 | ... | 127 | desc ring for tx dat
+-----+-----+-----+--------------+------+------+------+
||
||
++


A performance boost can be observed only if the virtio backend isn't the bottleneck, or in the VM2VM
case.
There are also several vhost optimization patches to be submitted later.

Huawei Xie (7):
virtio: add virtio_rxtx.h header file
virtio: add software rx ring, fake_buf into virtqueue
virtio: rx/tx ring layout optimization
virtio: fill RX avail ring with blank mbufs
virtio: virtio vec rx
virtio: simple tx routine
virtio: choose simple rx/tx func

drivers/net/virtio/Makefile | 2 +-
drivers/net/virtio/virtio_ethdev.c | 12 +-
drivers/net/virtio/virtio_ethdev.h | 5 +
drivers/net/virtio/virtio_rxtx.c | 56 ++++-
drivers/net/virtio/virtio_rxtx.h | 39 ++++
drivers/net/virtio/virtio_rxtx_simple.c | 401 ++++++++++++++++++++++++++++++++
drivers/net/virtio/virtqueue.h | 5 +
7 files changed, 516 insertions(+), 4 deletions(-)
create mode 100644 drivers/net/virtio/virtio_rxtx.h
create mode 100644 drivers/net/virtio/virtio_rxtx_simple.c
--
1.8.1.4
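To make the difference concrete, here is a small illustrative sketch contrasting the two RX refill strategies (the helper name and local variables are illustrative, not the driver's actual symbols):

/* generic path: a descriptor is taken from the free list and its index is
 * written into the avail ring, dirtying that cache line on every refill */
idx = alloc_desc_from_free_list(vq);                  /* hypothetical helper */
vq->vq_ring.avail->ring[avail_idx & mask] = idx;      /* mask = vq_nentries - 1 */

/* fixed layout: avail->ring[i] == i forever, so after init only the
 * descriptor itself is touched; the avail ring stays read-mostly and never
 * bounces between the virtio and vhost cores */
vq->vq_ring.desc[avail_idx & mask].addr = mbuf_dma_addr;
vq->vq_ring.desc[avail_idx & mask].len  = buf_len;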
Huawei Xie
2015-10-22 12:09:45 UTC
Permalink
All rx/tx related declarations would be moved into this header file in the future.
Add RTE_PMD_VIRTIO_RX_MAX_BURST.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.c | 1 +
drivers/net/virtio/virtio_rxtx.c | 1 +
drivers/net/virtio/virtio_rxtx.h | 34 ++++++++++++++++++++++++++++++++++
3 files changed, 36 insertions(+)
create mode 100644 drivers/net/virtio/virtio_rxtx.h

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 465d3cd..79a3640 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -61,6 +61,7 @@
#include "virtio_pci.h"
#include "virtio_logs.h"
#include "virtqueue.h"
+#include "virtio_rxtx.h"


static int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index c5b53bb..9324f7f 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -54,6 +54,7 @@
#include "virtio_logs.h"
#include "virtio_ethdev.h"
#include "virtqueue.h"
+#include "virtio_rxtx.h"

#ifdef RTE_LIBRTE_VIRTIO_DEBUG_DUMP
#define VIRTIO_DUMP_PACKET(m, len) rte_pktmbuf_dump(stdout, m, len)
diff --git a/drivers/net/virtio/virtio_rxtx.h b/drivers/net/virtio/virtio_rxtx.h
new file mode 100644
index 0000000..a10aa69
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx.h
@@ -0,0 +1,34 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#define RTE_PMD_VIRTIO_RX_MAX_BURST 64
--
1.8.1.4
Huawei Xie
2015-10-22 12:09:47 UTC
Permalink
Changes in V4:
- fix the error in tx ring layout chart in this commit message.

In a DPDK-based switching environment, vhost mostly runs on a dedicated core
while virtio processing in guest VMs runs on other cores.
Take RX for example: with the generic implementation, for each guest buffer,
a) the virtio driver allocates a descriptor from the free descriptor list
b) it modifies an avail ring entry to point to the allocated descriptor
c) after the packet is received, it frees the descriptor

When vhost fetches the avail ring, it needs to fetch the modified L1 cache line from
the virtio core, which is a heavy cost on current CPU implementations.

The idea of this optimization is:
allocate a fixed descriptor for each entry of the avail ring, so the avail ring will
always stay the same during the run.
This removes the modified-L1-cache-line transfer from the virtio core to the vhost core for the avail ring.
(Note we couldn't avoid the cache transfer for the descriptors themselves.)
Besides, descriptor allocation and free operations are eliminated.
This also makes vector processing possible to further accelerate the processing.

This is the layout for the avail ring(take 256 ring entries for example), with
each entry pointing to the descriptor with the same index.
avail
idx
+
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | ... | 254 | 255 | avail ring
+-+--+-+--+-+-+---------+---+--+---+
| | | | | |
| | | | | |
v v v | v v
+-+--+-+--+-+-+---------+---+--+---+
| 0 | 1 | 2 | ... | 254 | 255 | desc ring
+----+----+---+-------------+------+
|
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | | 254 | 255 | used ring
+----+----+---+-------------+------+
|
+

This is the ring layout for TX.
As we need one virtio header for each xmit packet, we have 128 slots available.

++
||
||
+-----+-----+-----+--------------+------+------+------+
| 0 | 1 | ... | 127 || 128 | 129 | ... | 255 | avail ring
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 128 | 129 | ... | 255 || 128 | 129 | ... | 255 | desc ring for virtio_net_hdr
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 0 | 1 | ... | 127 || 0 | 1 | ... | 127 | desc ring for tx dat
+-----+-----+-----+--------------+------+------+------+
||
||
++

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_rxtx.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 5c00e9d..7c82a6a 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -302,6 +302,12 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
nbufs = 0;
error = ENOSPC;

+ if (use_simple_rxtx)
+ for (i = 0; i < vq->vq_nentries; i++) {
+ vq->vq_ring.avail->ring[i] = i;
+ vq->vq_ring.desc[i].flags = VRING_DESC_F_WRITE;
+ }
+
memset(&vq->fake_mbuf, 0, sizeof(vq->fake_mbuf));
for (i = 0; i < RTE_PMD_VIRTIO_RX_MAX_BURST; i++)
vq->sw_ring[vq->vq_nentries + i] = &vq->fake_mbuf;
@@ -332,6 +338,24 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
VIRTIO_WRITE_REG_4(vq->hw, VIRTIO_PCI_QUEUE_PFN,
vq->mz->phys_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT);
} else if (queue_type == VTNET_TQ) {
+ if (use_simple_rxtx) {
+ int mid_idx = vq->vq_nentries >> 1;
+ for (i = 0; i < mid_idx; i++) {
+ vq->vq_ring.avail->ring[i] = i + mid_idx;
+ vq->vq_ring.desc[i + mid_idx].next = i;
+ vq->vq_ring.desc[i + mid_idx].addr =
+ vq->virtio_net_hdr_mem +
+ mid_idx * vq->hw->vtnet_hdr_size;
+ vq->vq_ring.desc[i + mid_idx].len =
+ vq->hw->vtnet_hdr_size;
+ vq->vq_ring.desc[i + mid_idx].flags =
+ VRING_DESC_F_NEXT;
+ vq->vq_ring.desc[i].flags = 0;
+ }
+ for (i = mid_idx; i < vq->vq_nentries; i++)
+ vq->vq_ring.avail->ring[i] = i;
+ }
+
VIRTIO_WRITE_REG_2(vq->hw, VIRTIO_PCI_QUEUE_SEL,
vq->vq_queue_index);
VIRTIO_WRITE_REG_4(vq->hw, VIRTIO_PCI_QUEUE_PFN,
--
1.8.1.4
Huawei Xie
2015-10-22 12:09:46 UTC
Permalink
Changes in v3:
- Remove unnecessary NULL test for rte_free
- Remove unnecessary assign of local var vq after free

Add software RX ring in virtqueue.
Add fake_mbuf in virtqueue for wraparound processing.
Use the global use_simple_rxtx flag to indicate whether the simple rx/tx path is enabled.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.c | 11 ++++++++++-
drivers/net/virtio/virtio_rxtx.c | 7 +++++++
drivers/net/virtio/virtqueue.h | 4 ++++
3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 79a3640..82676d3 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -247,8 +247,8 @@ virtio_dev_queue_release(struct virtqueue *vq) {
VIRTIO_WRITE_REG_2(hw, VIRTIO_PCI_QUEUE_SEL, vq->queue_id);
VIRTIO_WRITE_REG_4(hw, VIRTIO_PCI_QUEUE_PFN, 0);

+ rte_free(vq->sw_ring);
rte_free(vq);
- vq = NULL;
}
}

@@ -292,6 +292,9 @@ int virtio_dev_queue_setup(struct rte_eth_dev *dev,
dev->data->port_id, queue_idx);
vq = rte_zmalloc(vq_name, sizeof(struct virtqueue) +
vq_size * sizeof(struct vq_desc_extra), RTE_CACHE_LINE_SIZE);
+ vq->sw_ring = rte_zmalloc_socket("rxq->sw_ring",
+ (RTE_PMD_VIRTIO_RX_MAX_BURST + vq_size) *
+ sizeof(vq->sw_ring[0]), RTE_CACHE_LINE_SIZE, socket_id);
} else if (queue_type == VTNET_TQ) {
snprintf(vq_name, sizeof(vq_name), "port%d_tvq%d",
dev->data->port_id, queue_idx);
@@ -308,6 +311,12 @@ int virtio_dev_queue_setup(struct rte_eth_dev *dev,
PMD_INIT_LOG(ERR, "%s: Can not allocate virtqueue", __func__);
return (-ENOMEM);
}
+ if (queue_type == VTNET_RQ && vq->sw_ring == NULL) {
+ PMD_INIT_LOG(ERR, "%s: Can not allocate RX soft ring",
+ __func__);
+ rte_free(vq);
+ return -ENOMEM;
+ }

vq->hw = hw;
vq->port_id = dev->data->port_id;
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 9324f7f..5c00e9d 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -62,6 +62,8 @@
#define VIRTIO_DUMP_PACKET(m, len) do { } while (0)
#endif

+static int use_simple_rxtx;
+
static void
vq_ring_free_chain(struct virtqueue *vq, uint16_t desc_idx)
{
@@ -299,6 +301,11 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
/* Allocate blank mbufs for the each rx descriptor */
nbufs = 0;
error = ENOSPC;
+
+ memset(&vq->fake_mbuf, 0, sizeof(vq->fake_mbuf));
+ for (i = 0; i < RTE_PMD_VIRTIO_RX_MAX_BURST; i++)
+ vq->sw_ring[vq->vq_nentries + i] = &vq->fake_mbuf;
+
while (!virtqueue_full(vq)) {
m = rte_rxmbuf_alloc(vq->mpool);
if (m == NULL)
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 7789411..6a1ec48 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -190,6 +190,10 @@ struct virtqueue {
uint16_t vq_avail_idx;
phys_addr_t virtio_net_hdr_mem; /**< hdr for each xmit packet */

+ struct rte_mbuf **sw_ring; /**< RX software ring. */
+ /* dummy mbuf, for wraparound when processing RX ring. */
+ struct rte_mbuf fake_mbuf;
+
/* Statistics */
uint64_t packets;
uint64_t bytes;
--
1.8.1.4
Huawei Xie
2015-10-22 12:09:48 UTC
Permalink
Fill the avail ring with blank mbufs in virtio_dev_vring_start.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/Makefile | 2 +-
drivers/net/virtio/virtio_rxtx.c | 6 ++-
drivers/net/virtio/virtio_rxtx.h | 3 ++
drivers/net/virtio/virtio_rxtx_simple.c | 84 +++++++++++++++++++++++++++++++++
4 files changed, 92 insertions(+), 3 deletions(-)
create mode 100644 drivers/net/virtio/virtio_rxtx_simple.c

diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 930b60f..43835ba 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -50,7 +50,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtqueue.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_pci.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
-
+SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c

# this lib depends upon:
DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_eal lib/librte_ether
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 7c82a6a..5162ce6 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -320,8 +320,10 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
/******************************************
* Enqueue allocated buffers *
*******************************************/
- error = virtqueue_enqueue_recv_refill(vq, m);
-
+ if (use_simple_rxtx)
+ error = virtqueue_enqueue_recv_refill_simple(vq, m);
+ else
+ error = virtqueue_enqueue_recv_refill(vq, m);
if (error) {
rte_pktmbuf_free(m);
break;
diff --git a/drivers/net/virtio/virtio_rxtx.h b/drivers/net/virtio/virtio_rxtx.h
index a10aa69..7d2d8fe 100644
--- a/drivers/net/virtio/virtio_rxtx.h
+++ b/drivers/net/virtio/virtio_rxtx.h
@@ -32,3 +32,6 @@
*/

#define RTE_PMD_VIRTIO_RX_MAX_BURST 64
+
+int virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
+ struct rte_mbuf *m);
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
new file mode 100644
index 0000000..cac5b9f
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -0,0 +1,84 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <tmmintrin.h>
+
+#include <rte_cycles.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_branch_prediction.h>
+#include <rte_mempool.h>
+#include <rte_malloc.h>
+#include <rte_mbuf.h>
+#include <rte_ether.h>
+#include <rte_ethdev.h>
+#include <rte_prefetch.h>
+#include <rte_string_fns.h>
+#include <rte_errno.h>
+#include <rte_byteorder.h>
+
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtqueue.h"
+#include "virtio_rxtx.h"
+
+int __attribute__((cold))
+virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
+ struct rte_mbuf *cookie)
+{
+ struct vq_desc_extra *dxp;
+ struct vring_desc *start_dp;
+ uint16_t desc_idx;
+
+ desc_idx = vq->vq_avail_idx & (vq->vq_nentries - 1);
+ dxp = &vq->vq_descx[desc_idx];
+ dxp->cookie = (void *)cookie;
+ vq->sw_ring[desc_idx] = cookie;
+
+ start_dp = vq->vq_ring.desc;
+ start_dp[desc_idx].addr = (uint64_t)((uintptr_t)cookie->buf_physaddr +
+ RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
+ start_dp[desc_idx].len = cookie->buf_len -
+ RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);
+
+ vq->vq_free_cnt--;
+ vq->vq_avail_idx++;
+
+ return 0;
+}
--
1.8.1.4
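A short sketch (not from the patch, only restating the address math in virtqueue_enqueue_recv_refill_simple() above): the descriptor is pointed just before the packet area, so the host-written virtio_net_hdr lands in the tail of the mbuf headroom:

/*
 *  buf_physaddr
 *  |<--------- RTE_PKTMBUF_HEADROOM --------->|<------ packet data ------>|
 *                        |<- virtio_net_hdr ->|
 *                        ^
 *                        desc.addr = buf_physaddr + RTE_PKTMBUF_HEADROOM
 *                                    - sizeof(struct virtio_net_hdr)
 *
 * After the host fills the buffer, the payload starts exactly at
 * data_off == RTE_PKTMBUF_HEADROOM, which is what mbuf_initializer assumes.
 */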
Tan, Jianfeng
2015-10-23 05:56:56 UTC
Permalink
-----Original Message-----
Sent: Thursday, October 22, 2015 8:10 PM
Subject: [dpdk-dev] [PATCH v4 4/7] virtio: fill RX avail ring with blank mbufs
+int __attribute__((cold))
+virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
+ struct rte_mbuf *cookie)
+{
+ struct vq_desc_extra *dxp;
+ struct vring_desc *start_dp;
+ uint16_t desc_idx;
+
+ desc_idx = vq->vq_avail_idx & (vq->vq_nentries - 1);
+ dxp = &vq->vq_descx[desc_idx];
+ dxp->cookie = (void *)cookie;
+ vq->sw_ring[desc_idx] = cookie;
+
+ start_dp = vq->vq_ring.desc;
+ start_dp[desc_idx].addr = (uint64_t)((uintptr_t)cookie-
Post by Huawei Xie
buf_physaddr +
+ RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
Please use RTE_MBUF_DATA_DMA_ADDR instead of "buf_physaddr + RTE_PKTMBUF_HEADROOM".
+ start_dp[desc_idx].len = cookie->buf_len -
+ RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);
+
+ vq->vq_free_cnt--;
+ vq->vq_avail_idx++;
+
+ return 0;
+}
--
1.8.1.4
Xie, Huawei
2015-10-25 15:40:43 UTC
Permalink
Post by Tan, Jianfeng
-----Original Message-----
Sent: Thursday, October 22, 2015 8:10 PM
Subject: [dpdk-dev] [PATCH v4 4/7] virtio: fill RX avail ring with blank mbufs
+int __attribute__((cold))
+virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
+ struct rte_mbuf *cookie)
+{
+ struct vq_desc_extra *dxp;
+ struct vring_desc *start_dp;
+ uint16_t desc_idx;
+
+ desc_idx = vq->vq_avail_idx & (vq->vq_nentries - 1);
+ dxp = &vq->vq_descx[desc_idx];
+ dxp->cookie = (void *)cookie;
+ vq->sw_ring[desc_idx] = cookie;
+
+ start_dp = vq->vq_ring.desc;
+ start_dp[desc_idx].addr = (uint64_t)((uintptr_t)cookie-
Post by Huawei Xie
buf_physaddr +
+ RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
Please use RTE_MBUF_DATA_DMA_ADDR instead of "buf_physaddr + RTE_PKTMBUF_HEADROOM".
RTE_MBUF_DATA_DMA_ADDR is used for tx. For rx, we should use
RTE_MBUF_DATA_DMA_DEFAULT.
We could use a separate patch to fix all this in virito code.
I remember there is a patch to move
RTE_MBUF_DATA_DMA_ADDR/RTE_MBUF_DATA_DMA_DEFAULT definition into the
common header file.
Post by Tan, Jianfeng
+ start_dp[desc_idx].len = cookie->buf_len -
+ RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);
+
+ vq->vq_free_cnt--;
+ vq->vq_avail_idx++;
+
+ return 0;
+}
--
1.8.1.4
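For reference, the two macros being discussed are commonly defined in individual PMD headers of this era roughly as below (an assumption based on e.g. ixgbe_rxtx.h, where the default-offset variant is spelled RTE_MBUF_DATA_DMA_ADDR_DEFAULT; not part of this patch set):

#define RTE_MBUF_DATA_DMA_ADDR(mb) \
	((uint64_t)((mb)->buf_physaddr + (mb)->data_off))

#define RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb) \
	((uint64_t)((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM))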
Huawei Xie
2015-10-22 12:09:49 UTC
Permalink
With the fixed avail ring, we don't need to get the desc idx from the avail ring;
the virtio driver only has to deal with the desc ring.
This patch uses vector instructions to accelerate processing of the desc ring.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.h | 2 +
drivers/net/virtio/virtio_rxtx.c | 3 +
drivers/net/virtio/virtio_rxtx.h | 2 +
drivers/net/virtio/virtio_rxtx_simple.c | 224 ++++++++++++++++++++++++++++++++
drivers/net/virtio/virtqueue.h | 1 +
5 files changed, 232 insertions(+)

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index 9026d42..d7797ab 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -108,6 +108,8 @@ uint16_t virtio_recv_mergeable_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t nb_pkts);

+uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+ uint16_t nb_pkts);

/*
* The VIRTIO_NET_F_GUEST_TSO[46] features permit the host to send us
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 5162ce6..947fc46 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -432,6 +432,9 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
vq->mpool = mp;

dev->data->rx_queues[queue_idx] = vq;
+
+ virtio_rxq_vec_setup(vq);
+
return 0;
}

diff --git a/drivers/net/virtio/virtio_rxtx.h b/drivers/net/virtio/virtio_rxtx.h
index 7d2d8fe..831e492 100644
--- a/drivers/net/virtio/virtio_rxtx.h
+++ b/drivers/net/virtio/virtio_rxtx.h
@@ -33,5 +33,7 @@

#define RTE_PMD_VIRTIO_RX_MAX_BURST 64

+int virtio_rxq_vec_setup(struct virtqueue *rxq);
+
int virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
struct rte_mbuf *m);
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
index cac5b9f..ef17562 100644
--- a/drivers/net/virtio/virtio_rxtx_simple.c
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -58,6 +58,10 @@
#include "virtqueue.h"
#include "virtio_rxtx.h"

+#define RTE_VIRTIO_VPMD_RX_BURST 32
+#define RTE_VIRTIO_DESC_PER_LOOP 8
+#define RTE_VIRTIO_VPMD_RX_REARM_THRESH RTE_VIRTIO_VPMD_RX_BURST
+
int __attribute__((cold))
virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
struct rte_mbuf *cookie)
@@ -82,3 +86,223 @@ virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,

return 0;
}
+
+static inline void
+virtio_rxq_rearm_vec(struct virtqueue *rxvq)
+{
+ int i;
+ uint16_t desc_idx;
+ struct rte_mbuf **sw_ring;
+ struct vring_desc *start_dp;
+ int ret;
+
+ desc_idx = rxvq->vq_avail_idx & (rxvq->vq_nentries - 1);
+ sw_ring = &rxvq->sw_ring[desc_idx];
+ start_dp = &rxvq->vq_ring.desc[desc_idx];
+
+ ret = rte_mempool_get_bulk(rxvq->mpool, (void **)sw_ring,
+ RTE_VIRTIO_VPMD_RX_REARM_THRESH);
+ if (unlikely(ret)) {
+ rte_eth_devices[rxvq->port_id].data->rx_mbuf_alloc_failed +=
+ RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ return;
+ }
+
+ for (i = 0; i < RTE_VIRTIO_VPMD_RX_REARM_THRESH; i++) {
+ uintptr_t p;
+
+ p = (uintptr_t)&sw_ring[i]->rearm_data;
+ *(uint64_t *)p = rxvq->mbuf_initializer;
+
+ start_dp[i].addr =
+ (uint64_t)((uintptr_t)sw_ring[i]->buf_physaddr +
+ RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
+ start_dp[i].len = sw_ring[i]->buf_len -
+ RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);
+ }
+
+ rxvq->vq_avail_idx += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ rxvq->vq_free_cnt -= RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ vq_update_avail_idx(rxvq);
+}
+
+/* virtio vPMD receive routine, only accept(nb_pkts >= RTE_VIRTIO_DESC_PER_LOOP)
+ *
+ * This routine is for non-mergable RX, one desc for each guest buffer.
+ * This routine is based on the RX ring layout optimization. Each entry in the
+ * avail ring points to the desc with the same index in the desc ring and this
+ * will never be changed in the driver.
+ *
+ * - nb_pkts < RTE_VIRTIO_DESC_PER_LOOP, just return no packet
+ */
+uint16_t
+virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+ uint16_t nb_pkts)
+{
+ struct virtqueue *rxvq = rx_queue;
+ uint16_t nb_used;
+ uint16_t desc_idx;
+ struct vring_used_elem *rused;
+ struct rte_mbuf **sw_ring;
+ struct rte_mbuf **sw_ring_end;
+ uint16_t nb_pkts_received;
+ __m128i shuf_msk1, shuf_msk2, len_adjust;
+
+ shuf_msk1 = _mm_set_epi8(
+ 0xFF, 0xFF, 0xFF, 0xFF,
+ 0xFF, 0xFF, /* vlan tci */
+ 5, 4, /* dat len */
+ 0xFF, 0xFF, 5, 4, /* pkt len */
+ 0xFF, 0xFF, 0xFF, 0xFF /* packet type */
+
+ );
+
+ shuf_msk2 = _mm_set_epi8(
+ 0xFF, 0xFF, 0xFF, 0xFF,
+ 0xFF, 0xFF, /* vlan tci */
+ 13, 12, /* dat len */
+ 0xFF, 0xFF, 13, 12, /* pkt len */
+ 0xFF, 0xFF, 0xFF, 0xFF /* packet type */
+ );
+
+ /* Subtract the header length.
+ * In which case do we need the header length in used->len ?
+ */
+ len_adjust = _mm_set_epi16(
+ 0, 0,
+ 0,
+ (uint16_t) -sizeof(struct virtio_net_hdr),
+ 0, (uint16_t) -sizeof(struct virtio_net_hdr),
+ 0, 0);
+
+ if (unlikely(nb_pkts < RTE_VIRTIO_DESC_PER_LOOP))
+ return 0;
+
+ nb_used = *(volatile uint16_t *)&rxvq->vq_ring.used->idx -
+ rxvq->vq_used_cons_idx;
+
+ rte_compiler_barrier();
+
+ if (unlikely(nb_used == 0))
+ return 0;
+
+ nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, RTE_VIRTIO_DESC_PER_LOOP);
+ nb_used = RTE_MIN(nb_used, nb_pkts);
+
+ desc_idx = (uint16_t)(rxvq->vq_used_cons_idx & (rxvq->vq_nentries - 1));
+ rused = &rxvq->vq_ring.used->ring[desc_idx];
+ sw_ring = &rxvq->sw_ring[desc_idx];
+ sw_ring_end = &rxvq->sw_ring[rxvq->vq_nentries];
+
+ _mm_prefetch((const void *)rused, _MM_HINT_T0);
+
+ if (rxvq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
+ virtio_rxq_rearm_vec(rxvq);
+ if (unlikely(virtqueue_kick_prepare(rxvq)))
+ virtqueue_notify(rxvq);
+ }
+
+ for (nb_pkts_received = 0;
+ nb_pkts_received < nb_used;) {
+ __m128i desc[RTE_VIRTIO_DESC_PER_LOOP / 2];
+ __m128i mbp[RTE_VIRTIO_DESC_PER_LOOP / 2];
+ __m128i pkt_mb[RTE_VIRTIO_DESC_PER_LOOP];
+
+ mbp[0] = _mm_loadu_si128((__m128i *)(sw_ring + 0));
+ desc[0] = _mm_loadu_si128((__m128i *)(rused + 0));
+ _mm_storeu_si128((__m128i *)&rx_pkts[0], mbp[0]);
+
+ mbp[1] = _mm_loadu_si128((__m128i *)(sw_ring + 2));
+ desc[1] = _mm_loadu_si128((__m128i *)(rused + 2));
+ _mm_storeu_si128((__m128i *)&rx_pkts[2], mbp[1]);
+
+ mbp[2] = _mm_loadu_si128((__m128i *)(sw_ring + 4));
+ desc[2] = _mm_loadu_si128((__m128i *)(rused + 4));
+ _mm_storeu_si128((__m128i *)&rx_pkts[4], mbp[2]);
+
+ mbp[3] = _mm_loadu_si128((__m128i *)(sw_ring + 6));
+ desc[3] = _mm_loadu_si128((__m128i *)(rused + 6));
+ _mm_storeu_si128((__m128i *)&rx_pkts[6], mbp[3]);
+
+ pkt_mb[1] = _mm_shuffle_epi8(desc[0], shuf_msk2);
+ pkt_mb[0] = _mm_shuffle_epi8(desc[0], shuf_msk1);
+ pkt_mb[1] = _mm_add_epi16(pkt_mb[1], len_adjust);
+ pkt_mb[0] = _mm_add_epi16(pkt_mb[0], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[1]->rx_descriptor_fields1,
+ pkt_mb[1]);
+ _mm_storeu_si128((void *)&rx_pkts[0]->rx_descriptor_fields1,
+ pkt_mb[0]);
+
+ pkt_mb[3] = _mm_shuffle_epi8(desc[1], shuf_msk2);
+ pkt_mb[2] = _mm_shuffle_epi8(desc[1], shuf_msk1);
+ pkt_mb[3] = _mm_add_epi16(pkt_mb[3], len_adjust);
+ pkt_mb[2] = _mm_add_epi16(pkt_mb[2], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[3]->rx_descriptor_fields1,
+ pkt_mb[3]);
+ _mm_storeu_si128((void *)&rx_pkts[2]->rx_descriptor_fields1,
+ pkt_mb[2]);
+
+ pkt_mb[5] = _mm_shuffle_epi8(desc[2], shuf_msk2);
+ pkt_mb[4] = _mm_shuffle_epi8(desc[2], shuf_msk1);
+ pkt_mb[5] = _mm_add_epi16(pkt_mb[5], len_adjust);
+ pkt_mb[4] = _mm_add_epi16(pkt_mb[4], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[5]->rx_descriptor_fields1,
+ pkt_mb[5]);
+ _mm_storeu_si128((void *)&rx_pkts[4]->rx_descriptor_fields1,
+ pkt_mb[4]);
+
+ pkt_mb[7] = _mm_shuffle_epi8(desc[3], shuf_msk2);
+ pkt_mb[6] = _mm_shuffle_epi8(desc[3], shuf_msk1);
+ pkt_mb[7] = _mm_add_epi16(pkt_mb[7], len_adjust);
+ pkt_mb[6] = _mm_add_epi16(pkt_mb[6], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[7]->rx_descriptor_fields1,
+ pkt_mb[7]);
+ _mm_storeu_si128((void *)&rx_pkts[6]->rx_descriptor_fields1,
+ pkt_mb[6]);
+
+ if (unlikely(nb_used <= RTE_VIRTIO_DESC_PER_LOOP)) {
+ if (sw_ring + nb_used <= sw_ring_end)
+ nb_pkts_received += nb_used;
+ else
+ nb_pkts_received += sw_ring_end - sw_ring;
+ break;
+ } else {
+ if (unlikely(sw_ring + RTE_VIRTIO_DESC_PER_LOOP >=
+ sw_ring_end)) {
+ nb_pkts_received += sw_ring_end - sw_ring;
+ break;
+ } else {
+ nb_pkts_received += RTE_VIRTIO_DESC_PER_LOOP;
+
+ rx_pkts += RTE_VIRTIO_DESC_PER_LOOP;
+ sw_ring += RTE_VIRTIO_DESC_PER_LOOP;
+ rused += RTE_VIRTIO_DESC_PER_LOOP;
+ nb_used -= RTE_VIRTIO_DESC_PER_LOOP;
+ }
+ }
+ }
+
+ rxvq->vq_used_cons_idx += nb_pkts_received;
+ rxvq->vq_free_cnt += nb_pkts_received;
+ rxvq->packets += nb_pkts_received;
+ return nb_pkts_received;
+}
+
+int __attribute__((cold))
+virtio_rxq_vec_setup(struct virtqueue *rxq)
+{
+ uintptr_t p;
+ struct rte_mbuf mb_def = { .buf_addr = 0 }; /* zeroed mbuf */
+
+ mb_def.nb_segs = 1;
+ mb_def.data_off = RTE_PKTMBUF_HEADROOM;
+ mb_def.port = rxq->port_id;
+ rte_mbuf_refcnt_set(&mb_def, 1);
+
+ /* prevent compiler reordering: rearm_data covers previous fields */
+ rte_compiler_barrier();
+ p = (uintptr_t)&mb_def.rearm_data;
+ rxq->mbuf_initializer = *(uint64_t *)p;
+
+ return 0;
+}
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 6a1ec48..98a77d5 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -188,6 +188,7 @@ struct virtqueue {
*/
uint16_t vq_used_cons_idx;
uint16_t vq_avail_idx;
+ uint64_t mbuf_initializer; /**< value to init mbufs. */
phys_addr_t virtio_net_hdr_mem; /**< hdr for each xmit packet */

struct rte_mbuf **sw_ring; /**< RX software ring. */
--
1.8.1.4
Huawei Xie
2015-10-22 12:09:50 UTC
Permalink
Changes in v4:
- move virtio_xmit_cleanup ahead to free descriptors earlier

Changes in v3:
- Remove return at the end of void function
- Remove always_inline attribute for virtio_xmit_cleanup
Bulk free mbufs when cleaning the used ring.
The shift operation on idx could be saved if vq_free_cnt meant
free slots rather than free descriptors.

TODO: rearrange the vq data structure and pack the stats variables together so that we
could use one vector instruction to update all of them.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.h | 3 ++
drivers/net/virtio/virtio_rxtx_simple.c | 93 +++++++++++++++++++++++++++++++++
2 files changed, 96 insertions(+)

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index d7797ab..ae2d47d 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -111,6 +111,9 @@ uint16_t virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t nb_pkts);

+uint16_t virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
+ uint16_t nb_pkts);
+
/*
* The VIRTIO_NET_F_GUEST_TSO[46] features permit the host to send us
* frames larger than 1514 bytes. We do not yet support software LRO
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
index ef17562..79b4f7f 100644
--- a/drivers/net/virtio/virtio_rxtx_simple.c
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -288,6 +288,99 @@ virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
return nb_pkts_received;
}

+#define VIRTIO_TX_FREE_THRESH 32
+#define VIRTIO_TX_MAX_FREE_BUF_SZ 32
+#define VIRTIO_TX_FREE_NR 32
+/* TODO: vq->tx_free_cnt could mean num of free slots so we could avoid shift */
+static inline void
+virtio_xmit_cleanup(struct virtqueue *vq)
+{
+ uint16_t i, desc_idx;
+ int nb_free = 0;
+ struct rte_mbuf *m, *free[VIRTIO_TX_MAX_FREE_BUF_SZ];
+
+ desc_idx = (uint16_t)(vq->vq_used_cons_idx &
+ ((vq->vq_nentries >> 1) - 1));
+ free[0] = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ nb_free = 1;
+
+ for (i = 1; i < VIRTIO_TX_FREE_NR; i++) {
+ m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ if (likely(m->pool == free[0]->pool))
+ free[nb_free++] = m;
+ else {
+ rte_mempool_put_bulk(free[0]->pool, (void **)free,
+ nb_free);
+ free[0] = m;
+ nb_free = 1;
+ }
+ }
+
+ rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+ vq->vq_used_cons_idx += VIRTIO_TX_FREE_NR;
+ vq->vq_free_cnt += (VIRTIO_TX_FREE_NR << 1);
+}
+
+uint16_t
+virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
+ uint16_t nb_pkts)
+{
+ struct virtqueue *txvq = tx_queue;
+ uint16_t nb_used;
+ uint16_t desc_idx;
+ struct vring_desc *start_dp;
+ uint16_t nb_tail, nb_commit;
+ int i;
+ uint16_t desc_idx_max = (txvq->vq_nentries >> 1) - 1;
+
+ nb_used = VIRTQUEUE_NUSED(txvq);
+ rte_compiler_barrier();
+
+ if (nb_used >= VIRTIO_TX_FREE_THRESH)
+ virtio_xmit_cleanup(tx_queue);
+
+ nb_commit = nb_pkts = RTE_MIN((txvq->vq_free_cnt >> 1), nb_pkts);
+ desc_idx = (uint16_t) (txvq->vq_avail_idx & desc_idx_max);
+ start_dp = txvq->vq_ring.desc;
+ nb_tail = (uint16_t) (desc_idx_max + 1 - desc_idx);
+
+ if (nb_commit >= nb_tail) {
+ for (i = 0; i < nb_tail; i++)
+ txvq->vq_descx[desc_idx + i].cookie = tx_pkts[i];
+ for (i = 0; i < nb_tail; i++) {
+ start_dp[desc_idx].addr =
+ RTE_MBUF_DATA_DMA_ADDR(*tx_pkts);
+ start_dp[desc_idx].len = (*tx_pkts)->pkt_len;
+ tx_pkts++;
+ desc_idx++;
+ }
+ nb_commit -= nb_tail;
+ desc_idx = 0;
+ }
+ for (i = 0; i < nb_commit; i++)
+ txvq->vq_descx[desc_idx + i].cookie = tx_pkts[i];
+ for (i = 0; i < nb_commit; i++) {
+ start_dp[desc_idx].addr = RTE_MBUF_DATA_DMA_ADDR(*tx_pkts);
+ start_dp[desc_idx].len = (*tx_pkts)->pkt_len;
+ tx_pkts++;
+ desc_idx++;
+ }
+
+ rte_compiler_barrier();
+
+ txvq->vq_free_cnt -= (uint16_t)(nb_pkts << 1);
+ txvq->vq_avail_idx += nb_pkts;
+ txvq->vq_ring.avail->idx = txvq->vq_avail_idx;
+ txvq->packets += nb_pkts;
+
+ if (likely(nb_pkts)) {
+ if (unlikely(virtqueue_kick_prepare(txvq)))
+ virtqueue_notify(txvq);
+ }
+
+ return nb_pkts;
+}
+
int __attribute__((cold))
virtio_rxq_vec_setup(struct virtqueue *rxq)
{
--
1.8.1.4
Stephen Hemminger
2015-10-22 16:57:30 UTC
Permalink
On Thu, 22 Oct 2015 20:09:50 +0800
Post by Huawei Xie
- move virtio_xmit_cleanup ahead to free descriptors earlier
- Remove return at the end of void function
- Remove always_inline attribute for virtio_xmit_cleanup
bulk free of mbufs when clean used ring.
shift operation of idx could be saved if vq_free_cnt means
free slots rather than free descriptors.
TODO: rearrange vq data structure, pack the stats var together so that we
could use one vec instruction to update all of them.
---
drivers/net/virtio/virtio_ethdev.h | 3 ++
drivers/net/virtio/virtio_rxtx_simple.c | 93 +++++++++++++++++++++++++++++++++
2 files changed, 96 insertions(+)
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index d7797ab..ae2d47d 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -111,6 +111,9 @@ uint16_t virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t nb_pkts);
+uint16_t virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
+ uint16_t nb_pkts);
+
/*
* The VIRTIO_NET_F_GUEST_TSO[46] features permit the host to send us
* frames larger than 1514 bytes. We do not yet support software LRO
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
index ef17562..79b4f7f 100644
--- a/drivers/net/virtio/virtio_rxtx_simple.c
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -288,6 +288,99 @@ virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
return nb_pkts_received;
}
+#define VIRTIO_TX_FREE_THRESH 32
+#define VIRTIO_TX_MAX_FREE_BUF_SZ 32
+#define VIRTIO_TX_FREE_NR 32
+/* TODO: vq->tx_free_cnt could mean num of free slots so we could avoid shift */
+static inline void
+virtio_xmit_cleanup(struct virtqueue *vq)
+{
+ uint16_t i, desc_idx;
+ int nb_free = 0;
+ struct rte_mbuf *m, *free[VIRTIO_TX_MAX_FREE_BUF_SZ];
+
+ desc_idx = (uint16_t)(vq->vq_used_cons_idx &
+ ((vq->vq_nentries >> 1) - 1));
+ free[0] = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ nb_free = 1;
+
+ for (i = 1; i < VIRTIO_TX_FREE_NR; i++) {
+ m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ if (likely(m->pool == free[0]->pool))
+ free[nb_free++] = m;
+ else {
+ rte_mempool_put_bulk(free[0]->pool, (void **)free,
+ nb_free);
+ free[0] = m;
+ nb_free = 1;
+ }
+ }
+
+ rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+ vq->vq_used_cons_idx += VIRTIO_TX_FREE_NR;
+ vq->vq_free_cnt += (VIRTIO_TX_FREE_NR << 1);
+}
I think you need to handle refcount, here is a similar patch
for ixgbe.

Subject: ixgbe: speed up transmit

Coalesce transmit buffers and put them back into the pool
in one burst.

Signed-off-by: Stephen Hemminger <***@networkplumber.org>

--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -120,12 +120,16 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
* Check for descriptors with their DD bit set and free mbufs.
* Return the total number of buffers freed.
*/
+#define TX_FREE_BULK 32
+
static inline int __attribute__((always_inline))
ixgbe_tx_free_bufs(struct ixgbe_tx_queue *txq)
{
struct ixgbe_tx_entry *txep;
uint32_t status;
- int i;
+ int i, n = 0;
+ struct rte_mempool *txpool = NULL;
+ struct rte_mbuf *free_list[TX_FREE_BULK];

/* check DD bit on threshold descriptor */
status = txq->tx_ring[txq->tx_next_dd].wb.status;
@@ -138,20 +142,26 @@ ixgbe_tx_free_bufs(struct ixgbe_tx_queue
*/
txep = &(txq->sw_ring[txq->tx_next_dd - (txq->tx_rs_thresh - 1)]);

- /* free buffers one at a time */
- if ((txq->txq_flags & (uint32_t)ETH_TXQ_FLAGS_NOREFCOUNT) != 0) {
- for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
- txep->mbuf->next = NULL;
- rte_mempool_put(txep->mbuf->pool, txep->mbuf);
- txep->mbuf = NULL;
- }
- } else {
- for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
- rte_pktmbuf_free_seg(txep->mbuf);
- txep->mbuf = NULL;
+ for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
+ struct rte_mbuf *m;
+
+ /* free buffers one at a time */
+ m = __rte_pktmbuf_prefree_seg(txep->mbuf);
+ txep->mbuf = NULL;
+
+ if (n >= TX_FREE_BULK ||
+ (n > 0 && m->pool != txpool)) {
+ rte_mempool_put_bulk(txpool, (void **)free_list, n);
+ n = 0;
}
+
+ txpool = m->pool;
+ free_list[n++] = m;
}

+ if (n > 0)
+ rte_mempool_put_bulk(txpool, (void **)free_list, n);
+
/* buffers were freed, update counters */
txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
Xie, Huawei
2015-10-23 02:17:08 UTC
Permalink
Post by Stephen Hemminger
On Thu, 22 Oct 2015 20:09:50 +0800
Post by Huawei Xie
- move virtio_xmit_cleanup ahead to free descriptors earlier
- Remove return at the end of void function
- Remove always_inline attribute for virtio_xmit_cleanup
bulk free of mbufs when clean used ring.
shift operation of idx could be saved if vq_free_cnt means
free slots rather than free descriptors.
TODO: rearrange vq data structure, pack the stats var together so that we
could use one vec instruction to update all of them.
---
drivers/net/virtio/virtio_ethdev.h | 3 ++
drivers/net/virtio/virtio_rxtx_simple.c | 93 +++++++++++++++++++++++++++++++++
2 files changed, 96 insertions(+)
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index d7797ab..ae2d47d 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -111,6 +111,9 @@ uint16_t virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t nb_pkts);
+uint16_t virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
+ uint16_t nb_pkts);
+
/*
* The VIRTIO_NET_F_GUEST_TSO[46] features permit the host to send us
* frames larger than 1514 bytes. We do not yet support software LRO
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
index ef17562..79b4f7f 100644
--- a/drivers/net/virtio/virtio_rxtx_simple.c
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -288,6 +288,99 @@ virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
return nb_pkts_received;
}
+#define VIRTIO_TX_FREE_THRESH 32
+#define VIRTIO_TX_MAX_FREE_BUF_SZ 32
+#define VIRTIO_TX_FREE_NR 32
+/* TODO: vq->tx_free_cnt could mean num of free slots so we could avoid shift */
+static inline void
+virtio_xmit_cleanup(struct virtqueue *vq)
+{
+ uint16_t i, desc_idx;
+ int nb_free = 0;
+ struct rte_mbuf *m, *free[VIRTIO_TX_MAX_FREE_BUF_SZ];
+
+ desc_idx = (uint16_t)(vq->vq_used_cons_idx &
+ ((vq->vq_nentries >> 1) - 1));
+ free[0] = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ nb_free = 1;
+
+ for (i = 1; i < VIRTIO_TX_FREE_NR; i++) {
+ m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ if (likely(m->pool == free[0]->pool))
+ free[nb_free++] = m;
+ else {
+ rte_mempool_put_bulk(free[0]->pool, (void **)free,
+ nb_free);
+ free[0] = m;
+ nb_free = 1;
+ }
+ }
+
+ rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+ vq->vq_used_cons_idx += VIRTIO_TX_FREE_NR;
+ vq->vq_free_cnt += (VIRTIO_TX_FREE_NR << 1);
+}
I think you need to handle refcount, here is a similar patch
for ixgbe.
ok, like this:

m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
if (likely(m != NULL)) {
...
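Spelled out, a refcount-aware cleanup along those lines could look roughly like this
(a sketch only, combining the __rte_pktmbuf_prefree_seg check with the pool-grouped
bulk put; the v5 revision of this patch later in the thread does essentially this):

	static inline void
	virtio_xmit_cleanup(struct virtqueue *vq)
	{
		uint16_t i, desc_idx;
		int nb_free = 0;
		struct rte_mbuf *m, *free[VIRTIO_TX_MAX_FREE_BUF_SZ];

		desc_idx = (uint16_t)(vq->vq_used_cons_idx &
			((vq->vq_nentries >> 1) - 1));

		for (i = 0; i < VIRTIO_TX_FREE_NR; i++) {
			m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
			m = __rte_pktmbuf_prefree_seg(m); /* NULL if refcnt not yet zero */
			if (m == NULL)
				continue;
			/* flush the batch whenever the mempool changes */
			if (nb_free > 0 && m->pool != free[0]->pool) {
				rte_mempool_put_bulk(free[0]->pool,
					(void **)free, nb_free);
				nb_free = 0;
			}
			free[nb_free++] = m;
		}
		if (nb_free > 0)
			rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);

		vq->vq_used_cons_idx += VIRTIO_TX_FREE_NR;
		vq->vq_free_cnt += (VIRTIO_TX_FREE_NR << 1);
	}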
Post by Stephen Hemminger
Subject: ixgbe: speed up transmit
Coalesce transmit buffers and put them back into the pool
in one burst.
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -120,12 +120,16 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
* Check for descriptors with their DD bit set and free mbufs.
* Return the total number of buffers freed.
*/
+#define TX_FREE_BULK 32
+
static inline int __attribute__((always_inline))
ixgbe_tx_free_bufs(struct ixgbe_tx_queue *txq)
{
struct ixgbe_tx_entry *txep;
uint32_t status;
- int i;
+ int i, n = 0;
+ struct rte_mempool *txpool = NULL;
+ struct rte_mbuf *free_list[TX_FREE_BULK];
/* check DD bit on threshold descriptor */
status = txq->tx_ring[txq->tx_next_dd].wb.status;
@@ -138,20 +142,26 @@ ixgbe_tx_free_bufs(struct ixgbe_tx_queue
*/
txep = &(txq->sw_ring[txq->tx_next_dd - (txq->tx_rs_thresh - 1)]);
- /* free buffers one at a time */
- if ((txq->txq_flags & (uint32_t)ETH_TXQ_FLAGS_NOREFCOUNT) != 0) {
- for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
- txep->mbuf->next = NULL;
- rte_mempool_put(txep->mbuf->pool, txep->mbuf);
- txep->mbuf = NULL;
- }
- } else {
- for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
- rte_pktmbuf_free_seg(txep->mbuf);
- txep->mbuf = NULL;
+ for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
+ struct rte_mbuf *m;
+
+ /* free buffers one at a time */
+ m = __rte_pktmbuf_prefree_seg(txep->mbuf);
+ txep->mbuf = NULL;
+
+ if (n >= TX_FREE_BULK ||
check whether m is NULL here.
Post by Stephen Hemminger
+ (n > 0 && m->pool != txpool)) {
+ rte_mempool_put_bulk(txpool, (void **)free_list, n);
+ n = 0;
}
+
+ txpool = m->pool;
+ free_list[n++] = m;
}
+ if (n > 0)
+ rte_mempool_put_bulk(txpool, (void **)free_list, n);
+
/* buffers were freed, update counters */
txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
Xie, Huawei
2015-10-23 02:20:52 UTC
Permalink
Post by Huawei Xie
Post by Stephen Hemminger
I think you need to handle refcount, here is a similar patch
for ixgbe.
m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
Missed a line
m = __rte_pktmbuf_prefree_seg(m)
Post by Huawei Xie
if (likely(m != NULL)) {
...
Huawei Xie
2015-10-22 12:09:51 UTC
Permalink
Changes in v4:
Check the merge-able feature when selecting the simple rx/tx functions.

The simple rx/tx functions are chosen when merge-able rx is disabled and the user specifies
single-segment and no-offload support.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_rxtx.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 947fc46..0f1daf2 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -53,6 +53,7 @@

#include "virtio_logs.h"
#include "virtio_ethdev.h"
+#include "virtio_pci.h"
#include "virtqueue.h"
#include "virtio_rxtx.h"

@@ -62,6 +63,10 @@
#define VIRTIO_DUMP_PACKET(m, len) do { } while (0)
#endif

+
+#define VIRTIO_SIMPLE_FLAGS ((uint32_t)ETH_TXQ_FLAGS_NOMULTSEGS | \
+ ETH_TXQ_FLAGS_NOOFFLOADS)
+
static int use_simple_rxtx;

static void
@@ -459,6 +464,7 @@ virtio_dev_tx_queue_setup(struct rte_eth_dev *dev,
const struct rte_eth_txconf *tx_conf)
{
uint8_t vtpci_queue_idx = 2 * queue_idx + VTNET_SQ_TQ_QUEUE_IDX;
+ struct virtio_hw *hw = dev->data->dev_private;
struct virtqueue *vq;
uint16_t tx_free_thresh;
int ret;
@@ -471,6 +477,15 @@ virtio_dev_tx_queue_setup(struct rte_eth_dev *dev,
return -EINVAL;
}

+ /* Use simple rx/tx func if single segment and no offloads */
+ if ((tx_conf->txq_flags & VIRTIO_SIMPLE_FLAGS) == VIRTIO_SIMPLE_FLAGS &&
+ !vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+ PMD_INIT_LOG(INFO, "Using simple rx/tx path");
+ dev->tx_pkt_burst = virtio_xmit_pkts_simple;
+ dev->rx_pkt_burst = virtio_recv_pkts_vec;
+ use_simple_rxtx = 1;
+ }
+
ret = virtio_dev_queue_setup(dev, VTNET_TQ, queue_idx, vtpci_queue_idx,
nb_desc, socket_id, &vq);
if (ret < 0) {
--
1.8.1.4
Stephen Hemminger
2015-10-22 16:58:56 UTC
Permalink
On Thu, 22 Oct 2015 20:09:51 +0800
Post by Huawei Xie
+ /* Use simple rx/tx func if single segment and no offloads */
+ if ((tx_conf->txq_flags & VIRTIO_SIMPLE_FLAGS) == VIRTIO_SIMPLE_FLAGS &&
+ !vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF))
Since with QEMU/KVM the code will negotiate to use MRG_RXBUF, this
code path will not get used in common case anyway.
Xie, Huawei
2015-10-23 01:38:11 UTC
Permalink
Post by Stephen Hemminger
On Thu, 22 Oct 2015 20:09:51 +0800
Post by Huawei Xie
+ /* Use simple rx/tx func if single segment and no offloads */
+ if ((tx_conf->txq_flags & VIRTIO_SIMPLE_FLAGS) == VIRTIO_SIMPLE_FLAGS &&
+ !vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF))
Since with QEMU/KVM the code will negotiate to use MRG_RXBUF, this
code path will not get used in common case anyway.
Yes, the common configuration has merge-able RX enabled.
For the non-mergeable case we need to add mrg_rxbuf=off (on the qemu
command line), or disable the merge-able feature (in the switch
application) through the vhost API.
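For the switch-application route, the knob being referred to is the vhost feature API;
a minimal illustration, assuming the vhost library API of this DPDK generation
(see rte_virtio_net.h):

	/* in the vhost switch application, before starting the vhost driver:
	 * drop mergeable RX buffers so the guest virtio PMD can pick the simple path
	 */
	rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_MRG_RXBUF);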
Huawei Xie
2015-10-25 15:34:57 UTC
Permalink
Changes in v5:
- Call __rte_pktmbuf_prefree_seg to check refcnt when freeing mbufs

Changes in v4:
- Fix the error in virtio tx ring layout ascii chart in the commit message
- Move virtio_xmit_cleanup ahead to free descriptors earlier
- Test the merge-able feature when selecting the simple rx/tx functions

Changes in v3:
- Remove unnecessary NULL test for rte_free
- Remove unnecessary assign of local var after free
- Remove return at the end of void function
- Remove always_inline attribute for virtio_xmit_cleanup
- Reword some commit messages
- Add TODO in the commit message of simple tx patch

Changes in v2:
- Remove the configure macro
- Enable simple R/TX processing when user specifies simple txq flags
- Reword some comments and commit messages

In a DPDK based switching environment, vhost mostly runs on a dedicated core
while virtio processing in guest VMs runs on other cores.
Take RX for example: with the generic implementation, for each guest buffer,
a) the virtio driver allocates a descriptor from the free descriptor list
b) it modifies the entry of the avail ring to point to the allocated descriptor
c) after the packet is received, it frees the descriptor

When vhost fetches the avail ring, it needs to fetch the modified L1 cache line from
the virtio core, which is a heavy cost on current CPU implementations.

The idea of this optimization is:
allocate a fixed descriptor for each entry of the avail ring, so the avail ring will
always be the same during the run.
This removes the L1 cache transfer from the virtio core to the vhost core for the avail ring.
(Note we couldn't avoid the cache transfer for the descriptors themselves.)
Besides, descriptor allocation and free operations are eliminated.
This also makes vector processing possible, to further accelerate the processing.

This is the layout for the avail ring (take 256 ring entries for example), with
each entry pointing to the descriptor with the same index.
avail
idx
+
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | ... | 254 | 255 | avail ring
+-+--+-+--+-+-+---------+---+--+---+
| | | | | |
| | | | | |
v v v | v v
+-+--+-+--+-+-+---------+---+--+---+
| 0 | 1 | 2 | ... | 254 | 255 | desc ring
+----+----+---+-------------+------+
|
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | | 254 | 255 | used ring
+----+----+---+-------------+------+
|
+
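In code, the fixed RX mapping shown above boils down to a one-time loop at vring start
(a condensed form of what the ring layout patch in this series does):

	/* RX: avail entry i permanently points to descriptor i */
	for (i = 0; i < vq->vq_nentries; i++) {
		vq->vq_ring.avail->ring[i] = i;
		vq->vq_ring.desc[i].flags = VRING_DESC_F_WRITE;
	}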

This is the ring layout for TX.
As we need one virtio header for each xmit packet, we have 128 slots available.

++
||
||
+-----+-----+-----+--------------+------+------+------+
| 0 | 1 | ... | 127 || 128 | 129 | ... | 255 | avail ring
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 128 | 129 | ... | 255 || 128 | 129 | ... | 255 | desc ring for virtio_net_hdr
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 0 | 1 | ... | 127 || 0 | 1 | ... | 127 | desc ring for tx dat
+-----+-----+-----+--------------+------+------+------+
||
||
++
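The TX half-and-half layout above is likewise set up once, by chaining each header
descriptor in the upper half to the data descriptor with the same offset in the lower
half (again condensed from the ring layout patch; the virtio_net_hdr addr/len setup is
omitted here):

	/* TX: avail slot i -> hdr desc i+128, which chains to data desc i */
	int mid_idx = vq->vq_nentries >> 1;	/* 128 for a 256-entry ring */
	for (i = 0; i < mid_idx; i++) {
		vq->vq_ring.avail->ring[i] = i + mid_idx;
		vq->vq_ring.desc[i + mid_idx].next = i;
		vq->vq_ring.desc[i + mid_idx].flags = VRING_DESC_F_NEXT;
		vq->vq_ring.desc[i].flags = 0;
	}
	for (i = mid_idx; i < vq->vq_nentries; i++)
		vq->vq_ring.avail->ring[i] = i;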


A performance boost could be observed only if the virtio backend isn't the bottleneck, or in the
VM2VM case.
There are also several vhost optimization patches to be submitted later.


Huawei Xie (7):
virtio: add virtio_rxtx.h header file
virtio: add software rx ring, fake_buf into virtqueue
virtio: rx/tx ring layout optimization
virtio: fill RX avail ring with blank mbufs
virtio: virtio vec rx
virtio: simple tx routine
virtio: pick simple rx/tx func

drivers/net/virtio/Makefile | 2 +-
drivers/net/virtio/virtio_ethdev.c | 12 +-
drivers/net/virtio/virtio_ethdev.h | 5 +
drivers/net/virtio/virtio_rxtx.c | 56 ++++-
drivers/net/virtio/virtio_rxtx.h | 39 +++
drivers/net/virtio/virtio_rxtx_simple.c | 414 ++++++++++++++++++++++++++++++++
drivers/net/virtio/virtqueue.h | 5 +
7 files changed, 529 insertions(+), 4 deletions(-)
create mode 100644 drivers/net/virtio/virtio_rxtx.h
create mode 100644 drivers/net/virtio/virtio_rxtx_simple.c
--
1.8.1.4
Huawei Xie
2015-10-25 15:34:58 UTC
Permalink
We would move all rx/tx related declarations into this header file in the future.
Add RTE_PMD_VIRTIO_RX_MAX_BURST.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.c | 1 +
drivers/net/virtio/virtio_rxtx.c | 1 +
drivers/net/virtio/virtio_rxtx.h | 34 ++++++++++++++++++++++++++++++++++
3 files changed, 36 insertions(+)
create mode 100644 drivers/net/virtio/virtio_rxtx.h

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 465d3cd..79a3640 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -61,6 +61,7 @@
#include "virtio_pci.h"
#include "virtio_logs.h"
#include "virtqueue.h"
+#include "virtio_rxtx.h"


static int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index c5b53bb..9324f7f 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -54,6 +54,7 @@
#include "virtio_logs.h"
#include "virtio_ethdev.h"
#include "virtqueue.h"
+#include "virtio_rxtx.h"

#ifdef RTE_LIBRTE_VIRTIO_DEBUG_DUMP
#define VIRTIO_DUMP_PACKET(m, len) rte_pktmbuf_dump(stdout, m, len)
diff --git a/drivers/net/virtio/virtio_rxtx.h b/drivers/net/virtio/virtio_rxtx.h
new file mode 100644
index 0000000..a10aa69
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx.h
@@ -0,0 +1,34 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#define RTE_PMD_VIRTIO_RX_MAX_BURST 64
--
1.8.1.4
Huawei Xie
2015-10-25 15:34:59 UTC
Permalink
Changes in v3:
- Remove unnecessary NULL test for rte_free
- Remove unnecessary assign of local var vq after free

Add software RX ring in virtqueue.
Add fake_mbuf in virtqueue for wraparound processing.
Use the global simple_rxtx flag to indicate whether simple rx/tx is enabled.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.c | 11 ++++++++++-
drivers/net/virtio/virtio_rxtx.c | 7 +++++++
drivers/net/virtio/virtqueue.h | 4 ++++
3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 79a3640..82676d3 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -247,8 +247,8 @@ virtio_dev_queue_release(struct virtqueue *vq) {
VIRTIO_WRITE_REG_2(hw, VIRTIO_PCI_QUEUE_SEL, vq->queue_id);
VIRTIO_WRITE_REG_4(hw, VIRTIO_PCI_QUEUE_PFN, 0);

+ rte_free(vq->sw_ring);
rte_free(vq);
- vq = NULL;
}
}

@@ -292,6 +292,9 @@ int virtio_dev_queue_setup(struct rte_eth_dev *dev,
dev->data->port_id, queue_idx);
vq = rte_zmalloc(vq_name, sizeof(struct virtqueue) +
vq_size * sizeof(struct vq_desc_extra), RTE_CACHE_LINE_SIZE);
+ vq->sw_ring = rte_zmalloc_socket("rxq->sw_ring",
+ (RTE_PMD_VIRTIO_RX_MAX_BURST + vq_size) *
+ sizeof(vq->sw_ring[0]), RTE_CACHE_LINE_SIZE, socket_id);
} else if (queue_type == VTNET_TQ) {
snprintf(vq_name, sizeof(vq_name), "port%d_tvq%d",
dev->data->port_id, queue_idx);
@@ -308,6 +311,12 @@ int virtio_dev_queue_setup(struct rte_eth_dev *dev,
PMD_INIT_LOG(ERR, "%s: Can not allocate virtqueue", __func__);
return (-ENOMEM);
}
+ if (queue_type == VTNET_RQ && vq->sw_ring == NULL) {
+ PMD_INIT_LOG(ERR, "%s: Can not allocate RX soft ring",
+ __func__);
+ rte_free(vq);
+ return -ENOMEM;
+ }

vq->hw = hw;
vq->port_id = dev->data->port_id;
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 9324f7f..5c00e9d 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -62,6 +62,8 @@
#define VIRTIO_DUMP_PACKET(m, len) do { } while (0)
#endif

+static int use_simple_rxtx;
+
static void
vq_ring_free_chain(struct virtqueue *vq, uint16_t desc_idx)
{
@@ -299,6 +301,11 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
/* Allocate blank mbufs for the each rx descriptor */
nbufs = 0;
error = ENOSPC;
+
+ memset(&vq->fake_mbuf, 0, sizeof(vq->fake_mbuf));
+ for (i = 0; i < RTE_PMD_VIRTIO_RX_MAX_BURST; i++)
+ vq->sw_ring[vq->vq_nentries + i] = &vq->fake_mbuf;
+
while (!virtqueue_full(vq)) {
m = rte_rxmbuf_alloc(vq->mpool);
if (m == NULL)
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 7789411..6a1ec48 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -190,6 +190,10 @@ struct virtqueue {
uint16_t vq_avail_idx;
phys_addr_t virtio_net_hdr_mem; /**< hdr for each xmit packet */

+ struct rte_mbuf **sw_ring; /**< RX software ring. */
+ /* dummy mbuf, for wraparound when processing RX ring. */
+ struct rte_mbuf fake_mbuf;
+
/* Statistics */
uint64_t packets;
uint64_t bytes;
--
1.8.1.4
Huawei Xie
2015-10-25 15:35:00 UTC
Permalink
Changes in V4:
- fix the error in tx ring layout chart in this commit message.

In a DPDK based switching environment, vhost mostly runs on a dedicated core
while virtio processing in guest VMs runs on different cores.
Take RX for example: with the generic implementation, for each guest buffer,
a) the virtio driver allocates a descriptor from the free descriptor list
b) it modifies the entry of the avail ring to point to the allocated descriptor
c) after the packet is received, it frees the descriptor

When vhost fetches the avail ring, it needs to fetch the modified L1 cache line from
the virtio core, which is a heavy cost on current CPU implementations.

The idea of this optimization is:
allocate a fixed descriptor for each entry of the avail ring, so the avail ring will
always be the same during the run.
This removes the L1 cache transfer from the virtio core to the vhost core for the avail ring.
(Note we couldn't avoid the cache transfer for the descriptors themselves.)
Besides, descriptor allocation and free operations are eliminated.
This also makes vector processing possible, to further accelerate the processing.

This is the layout for the avail ring (take 256 ring entries for example), with
each entry pointing to the descriptor with the same index.
avail
idx
+
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | ... | 254 | 255 | avail ring
+-+--+-+--+-+-+---------+---+--+---+
| | | | | |
| | | | | |
v v v | v v
+-+--+-+--+-+-+---------+---+--+---+
| 0 | 1 | 2 | ... | 254 | 255 | desc ring
+----+----+---+-------------+------+
|
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | | 254 | 255 | used ring
+----+----+---+-------------+------+
|
+

This is the ring layout for TX.
As we need one virtio header for each xmit packet, we have 128 slots available.

++
||
||
+-----+-----+-----+--------------+------+------+------+
| 0 | 1 | ... | 127 || 128 | 129 | ... | 255 | avail ring
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 128 | 129 | ... | 255 || 128 | 129 | ... | 255 | desc ring for virtio_net_hdr
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 0 | 1 | ... | 127 || 0 | 1 | ... | 127 | desc ring for tx dat
+-----+-----+-----+--------------+------+------+------+
||
||
++

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_rxtx.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 5c00e9d..7c82a6a 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -302,6 +302,12 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
nbufs = 0;
error = ENOSPC;

+ if (use_simple_rxtx)
+ for (i = 0; i < vq->vq_nentries; i++) {
+ vq->vq_ring.avail->ring[i] = i;
+ vq->vq_ring.desc[i].flags = VRING_DESC_F_WRITE;
+ }
+
memset(&vq->fake_mbuf, 0, sizeof(vq->fake_mbuf));
for (i = 0; i < RTE_PMD_VIRTIO_RX_MAX_BURST; i++)
vq->sw_ring[vq->vq_nentries + i] = &vq->fake_mbuf;
@@ -332,6 +338,24 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
VIRTIO_WRITE_REG_4(vq->hw, VIRTIO_PCI_QUEUE_PFN,
vq->mz->phys_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT);
} else if (queue_type == VTNET_TQ) {
+ if (use_simple_rxtx) {
+ int mid_idx = vq->vq_nentries >> 1;
+ for (i = 0; i < mid_idx; i++) {
+ vq->vq_ring.avail->ring[i] = i + mid_idx;
+ vq->vq_ring.desc[i + mid_idx].next = i;
+ vq->vq_ring.desc[i + mid_idx].addr =
+ vq->virtio_net_hdr_mem +
+ mid_idx * vq->hw->vtnet_hdr_size;
+ vq->vq_ring.desc[i + mid_idx].len =
+ vq->hw->vtnet_hdr_size;
+ vq->vq_ring.desc[i + mid_idx].flags =
+ VRING_DESC_F_NEXT;
+ vq->vq_ring.desc[i].flags = 0;
+ }
+ for (i = mid_idx; i < vq->vq_nentries; i++)
+ vq->vq_ring.avail->ring[i] = i;
+ }
+
VIRTIO_WRITE_REG_2(vq->hw, VIRTIO_PCI_QUEUE_SEL,
vq->vq_queue_index);
VIRTIO_WRITE_REG_4(vq->hw, VIRTIO_PCI_QUEUE_PFN,
--
1.8.1.4
Huawei Xie
2015-10-25 15:35:01 UTC
Permalink
Fill the avail ring with blank mbufs in virtio_dev_vring_start.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/Makefile | 2 +-
drivers/net/virtio/virtio_rxtx.c | 6 ++-
drivers/net/virtio/virtio_rxtx.h | 3 ++
drivers/net/virtio/virtio_rxtx_simple.c | 84 +++++++++++++++++++++++++++++++++
4 files changed, 92 insertions(+), 3 deletions(-)
create mode 100644 drivers/net/virtio/virtio_rxtx_simple.c

diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 930b60f..43835ba 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -50,7 +50,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtqueue.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_pci.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
-
+SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c

# this lib depends upon:
DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_eal lib/librte_ether
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 7c82a6a..5162ce6 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -320,8 +320,10 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
/******************************************
* Enqueue allocated buffers *
*******************************************/
- error = virtqueue_enqueue_recv_refill(vq, m);
-
+ if (use_simple_rxtx)
+ error = virtqueue_enqueue_recv_refill_simple(vq, m);
+ else
+ error = virtqueue_enqueue_recv_refill(vq, m);
if (error) {
rte_pktmbuf_free(m);
break;
diff --git a/drivers/net/virtio/virtio_rxtx.h b/drivers/net/virtio/virtio_rxtx.h
index a10aa69..7d2d8fe 100644
--- a/drivers/net/virtio/virtio_rxtx.h
+++ b/drivers/net/virtio/virtio_rxtx.h
@@ -32,3 +32,6 @@
*/

#define RTE_PMD_VIRTIO_RX_MAX_BURST 64
+
+int virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
+ struct rte_mbuf *m);
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
new file mode 100644
index 0000000..cac5b9f
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -0,0 +1,84 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <tmmintrin.h>
+
+#include <rte_cycles.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_branch_prediction.h>
+#include <rte_mempool.h>
+#include <rte_malloc.h>
+#include <rte_mbuf.h>
+#include <rte_ether.h>
+#include <rte_ethdev.h>
+#include <rte_prefetch.h>
+#include <rte_string_fns.h>
+#include <rte_errno.h>
+#include <rte_byteorder.h>
+
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtqueue.h"
+#include "virtio_rxtx.h"
+
+int __attribute__((cold))
+virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
+ struct rte_mbuf *cookie)
+{
+ struct vq_desc_extra *dxp;
+ struct vring_desc *start_dp;
+ uint16_t desc_idx;
+
+ desc_idx = vq->vq_avail_idx & (vq->vq_nentries - 1);
+ dxp = &vq->vq_descx[desc_idx];
+ dxp->cookie = (void *)cookie;
+ vq->sw_ring[desc_idx] = cookie;
+
+ start_dp = vq->vq_ring.desc;
+ start_dp[desc_idx].addr = (uint64_t)((uintptr_t)cookie->buf_physaddr +
+ RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
+ start_dp[desc_idx].len = cookie->buf_len -
+ RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);
+
+ vq->vq_free_cnt--;
+ vq->vq_avail_idx++;
+
+ return 0;
+}
--
1.8.1.4
Huawei Xie
2015-10-25 15:35:02 UTC
Permalink
With the fixed avail ring, we don't need to get the desc idx from the avail ring;
the virtio driver only has to deal with the desc ring.
This patch uses vector instructions to accelerate desc ring processing.
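The core step of the vector path, reduced to a single pair of used ring entries
(the full routine below unrolls this to eight descriptors per loop; the variable
names are the ones used in that routine):

	/* load two mbuf pointers and the matching two used-ring (id, len) pairs */
	mbp[0]  = _mm_loadu_si128((__m128i *)(sw_ring + 0));
	desc[0] = _mm_loadu_si128((__m128i *)(rused + 0));
	_mm_storeu_si128((__m128i *)&rx_pkts[0], mbp[0]);

	/* shuffle each used->len into pkt_len/data_len and strip the vnet header size */
	pkt_mb[0] = _mm_add_epi16(_mm_shuffle_epi8(desc[0], shuf_msk1), len_adjust);
	pkt_mb[1] = _mm_add_epi16(_mm_shuffle_epi8(desc[0], shuf_msk2), len_adjust);
	_mm_storeu_si128((void *)&rx_pkts[0]->rx_descriptor_fields1, pkt_mb[0]);
	_mm_storeu_si128((void *)&rx_pkts[1]->rx_descriptor_fields1, pkt_mb[1]);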

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.h | 2 +
drivers/net/virtio/virtio_rxtx.c | 3 +
drivers/net/virtio/virtio_rxtx.h | 2 +
drivers/net/virtio/virtio_rxtx_simple.c | 224 ++++++++++++++++++++++++++++++++
drivers/net/virtio/virtqueue.h | 1 +
5 files changed, 232 insertions(+)

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index 9026d42..d7797ab 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -108,6 +108,8 @@ uint16_t virtio_recv_mergeable_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t nb_pkts);

+uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+ uint16_t nb_pkts);

/*
* The VIRTIO_NET_F_GUEST_TSO[46] features permit the host to send us
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 5162ce6..947fc46 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -432,6 +432,9 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
vq->mpool = mp;

dev->data->rx_queues[queue_idx] = vq;
+
+ virtio_rxq_vec_setup(vq);
+
return 0;
}

diff --git a/drivers/net/virtio/virtio_rxtx.h b/drivers/net/virtio/virtio_rxtx.h
index 7d2d8fe..831e492 100644
--- a/drivers/net/virtio/virtio_rxtx.h
+++ b/drivers/net/virtio/virtio_rxtx.h
@@ -33,5 +33,7 @@

#define RTE_PMD_VIRTIO_RX_MAX_BURST 64

+int virtio_rxq_vec_setup(struct virtqueue *rxq);
+
int virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
struct rte_mbuf *m);
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
index cac5b9f..ef17562 100644
--- a/drivers/net/virtio/virtio_rxtx_simple.c
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -58,6 +58,10 @@
#include "virtqueue.h"
#include "virtio_rxtx.h"

+#define RTE_VIRTIO_VPMD_RX_BURST 32
+#define RTE_VIRTIO_DESC_PER_LOOP 8
+#define RTE_VIRTIO_VPMD_RX_REARM_THRESH RTE_VIRTIO_VPMD_RX_BURST
+
int __attribute__((cold))
virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
struct rte_mbuf *cookie)
@@ -82,3 +86,223 @@ virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,

return 0;
}
+
+static inline void
+virtio_rxq_rearm_vec(struct virtqueue *rxvq)
+{
+ int i;
+ uint16_t desc_idx;
+ struct rte_mbuf **sw_ring;
+ struct vring_desc *start_dp;
+ int ret;
+
+ desc_idx = rxvq->vq_avail_idx & (rxvq->vq_nentries - 1);
+ sw_ring = &rxvq->sw_ring[desc_idx];
+ start_dp = &rxvq->vq_ring.desc[desc_idx];
+
+ ret = rte_mempool_get_bulk(rxvq->mpool, (void **)sw_ring,
+ RTE_VIRTIO_VPMD_RX_REARM_THRESH);
+ if (unlikely(ret)) {
+ rte_eth_devices[rxvq->port_id].data->rx_mbuf_alloc_failed +=
+ RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ return;
+ }
+
+ for (i = 0; i < RTE_VIRTIO_VPMD_RX_REARM_THRESH; i++) {
+ uintptr_t p;
+
+ p = (uintptr_t)&sw_ring[i]->rearm_data;
+ *(uint64_t *)p = rxvq->mbuf_initializer;
+
+ start_dp[i].addr =
+ (uint64_t)((uintptr_t)sw_ring[i]->buf_physaddr +
+ RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
+ start_dp[i].len = sw_ring[i]->buf_len -
+ RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);
+ }
+
+ rxvq->vq_avail_idx += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ rxvq->vq_free_cnt -= RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ vq_update_avail_idx(rxvq);
+}
+
+/* virtio vPMD receive routine, only accept(nb_pkts >= RTE_VIRTIO_DESC_PER_LOOP)
+ *
+ * This routine is for non-mergable RX, one desc for each guest buffer.
+ * This routine is based on the RX ring layout optimization. Each entry in the
+ * avail ring points to the desc with the same index in the desc ring and this
+ * will never be changed in the driver.
+ *
+ * - nb_pkts < RTE_VIRTIO_DESC_PER_LOOP, just return no packet
+ */
+uint16_t
+virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+ uint16_t nb_pkts)
+{
+ struct virtqueue *rxvq = rx_queue;
+ uint16_t nb_used;
+ uint16_t desc_idx;
+ struct vring_used_elem *rused;
+ struct rte_mbuf **sw_ring;
+ struct rte_mbuf **sw_ring_end;
+ uint16_t nb_pkts_received;
+ __m128i shuf_msk1, shuf_msk2, len_adjust;
+
+ shuf_msk1 = _mm_set_epi8(
+ 0xFF, 0xFF, 0xFF, 0xFF,
+ 0xFF, 0xFF, /* vlan tci */
+ 5, 4, /* dat len */
+ 0xFF, 0xFF, 5, 4, /* pkt len */
+ 0xFF, 0xFF, 0xFF, 0xFF /* packet type */
+
+ );
+
+ shuf_msk2 = _mm_set_epi8(
+ 0xFF, 0xFF, 0xFF, 0xFF,
+ 0xFF, 0xFF, /* vlan tci */
+ 13, 12, /* dat len */
+ 0xFF, 0xFF, 13, 12, /* pkt len */
+ 0xFF, 0xFF, 0xFF, 0xFF /* packet type */
+ );
+
+ /* Substract the header length.
+ * In which case do we need the header length in used->len ?
+ */
+ len_adjust = _mm_set_epi16(
+ 0, 0,
+ 0,
+ (uint16_t) -sizeof(struct virtio_net_hdr),
+ 0, (uint16_t) -sizeof(struct virtio_net_hdr),
+ 0, 0);
+
+ if (unlikely(nb_pkts < RTE_VIRTIO_DESC_PER_LOOP))
+ return 0;
+
+ nb_used = *(volatile uint16_t *)&rxvq->vq_ring.used->idx -
+ rxvq->vq_used_cons_idx;
+
+ rte_compiler_barrier();
+
+ if (unlikely(nb_used == 0))
+ return 0;
+
+ nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, RTE_VIRTIO_DESC_PER_LOOP);
+ nb_used = RTE_MIN(nb_used, nb_pkts);
+
+ desc_idx = (uint16_t)(rxvq->vq_used_cons_idx & (rxvq->vq_nentries - 1));
+ rused = &rxvq->vq_ring.used->ring[desc_idx];
+ sw_ring = &rxvq->sw_ring[desc_idx];
+ sw_ring_end = &rxvq->sw_ring[rxvq->vq_nentries];
+
+ _mm_prefetch((const void *)rused, _MM_HINT_T0);
+
+ if (rxvq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
+ virtio_rxq_rearm_vec(rxvq);
+ if (unlikely(virtqueue_kick_prepare(rxvq)))
+ virtqueue_notify(rxvq);
+ }
+
+ for (nb_pkts_received = 0;
+ nb_pkts_received < nb_used;) {
+ __m128i desc[RTE_VIRTIO_DESC_PER_LOOP / 2];
+ __m128i mbp[RTE_VIRTIO_DESC_PER_LOOP / 2];
+ __m128i pkt_mb[RTE_VIRTIO_DESC_PER_LOOP];
+
+ mbp[0] = _mm_loadu_si128((__m128i *)(sw_ring + 0));
+ desc[0] = _mm_loadu_si128((__m128i *)(rused + 0));
+ _mm_storeu_si128((__m128i *)&rx_pkts[0], mbp[0]);
+
+ mbp[1] = _mm_loadu_si128((__m128i *)(sw_ring + 2));
+ desc[1] = _mm_loadu_si128((__m128i *)(rused + 2));
+ _mm_storeu_si128((__m128i *)&rx_pkts[2], mbp[1]);
+
+ mbp[2] = _mm_loadu_si128((__m128i *)(sw_ring + 4));
+ desc[2] = _mm_loadu_si128((__m128i *)(rused + 4));
+ _mm_storeu_si128((__m128i *)&rx_pkts[4], mbp[2]);
+
+ mbp[3] = _mm_loadu_si128((__m128i *)(sw_ring + 6));
+ desc[3] = _mm_loadu_si128((__m128i *)(rused + 6));
+ _mm_storeu_si128((__m128i *)&rx_pkts[6], mbp[3]);
+
+ pkt_mb[1] = _mm_shuffle_epi8(desc[0], shuf_msk2);
+ pkt_mb[0] = _mm_shuffle_epi8(desc[0], shuf_msk1);
+ pkt_mb[1] = _mm_add_epi16(pkt_mb[1], len_adjust);
+ pkt_mb[0] = _mm_add_epi16(pkt_mb[0], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[1]->rx_descriptor_fields1,
+ pkt_mb[1]);
+ _mm_storeu_si128((void *)&rx_pkts[0]->rx_descriptor_fields1,
+ pkt_mb[0]);
+
+ pkt_mb[3] = _mm_shuffle_epi8(desc[1], shuf_msk2);
+ pkt_mb[2] = _mm_shuffle_epi8(desc[1], shuf_msk1);
+ pkt_mb[3] = _mm_add_epi16(pkt_mb[3], len_adjust);
+ pkt_mb[2] = _mm_add_epi16(pkt_mb[2], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[3]->rx_descriptor_fields1,
+ pkt_mb[3]);
+ _mm_storeu_si128((void *)&rx_pkts[2]->rx_descriptor_fields1,
+ pkt_mb[2]);
+
+ pkt_mb[5] = _mm_shuffle_epi8(desc[2], shuf_msk2);
+ pkt_mb[4] = _mm_shuffle_epi8(desc[2], shuf_msk1);
+ pkt_mb[5] = _mm_add_epi16(pkt_mb[5], len_adjust);
+ pkt_mb[4] = _mm_add_epi16(pkt_mb[4], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[5]->rx_descriptor_fields1,
+ pkt_mb[5]);
+ _mm_storeu_si128((void *)&rx_pkts[4]->rx_descriptor_fields1,
+ pkt_mb[4]);
+
+ pkt_mb[7] = _mm_shuffle_epi8(desc[3], shuf_msk2);
+ pkt_mb[6] = _mm_shuffle_epi8(desc[3], shuf_msk1);
+ pkt_mb[7] = _mm_add_epi16(pkt_mb[7], len_adjust);
+ pkt_mb[6] = _mm_add_epi16(pkt_mb[6], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[7]->rx_descriptor_fields1,
+ pkt_mb[7]);
+ _mm_storeu_si128((void *)&rx_pkts[6]->rx_descriptor_fields1,
+ pkt_mb[6]);
+
+ if (unlikely(nb_used <= RTE_VIRTIO_DESC_PER_LOOP)) {
+ if (sw_ring + nb_used <= sw_ring_end)
+ nb_pkts_received += nb_used;
+ else
+ nb_pkts_received += sw_ring_end - sw_ring;
+ break;
+ } else {
+ if (unlikely(sw_ring + RTE_VIRTIO_DESC_PER_LOOP >=
+ sw_ring_end)) {
+ nb_pkts_received += sw_ring_end - sw_ring;
+ break;
+ } else {
+ nb_pkts_received += RTE_VIRTIO_DESC_PER_LOOP;
+
+ rx_pkts += RTE_VIRTIO_DESC_PER_LOOP;
+ sw_ring += RTE_VIRTIO_DESC_PER_LOOP;
+ rused += RTE_VIRTIO_DESC_PER_LOOP;
+ nb_used -= RTE_VIRTIO_DESC_PER_LOOP;
+ }
+ }
+ }
+
+ rxvq->vq_used_cons_idx += nb_pkts_received;
+ rxvq->vq_free_cnt += nb_pkts_received;
+ rxvq->packets += nb_pkts_received;
+ return nb_pkts_received;
+}
+
+int __attribute__((cold))
+virtio_rxq_vec_setup(struct virtqueue *rxq)
+{
+ uintptr_t p;
+ struct rte_mbuf mb_def = { .buf_addr = 0 }; /* zeroed mbuf */
+
+ mb_def.nb_segs = 1;
+ mb_def.data_off = RTE_PKTMBUF_HEADROOM;
+ mb_def.port = rxq->port_id;
+ rte_mbuf_refcnt_set(&mb_def, 1);
+
+ /* prevent compiler reordering: rearm_data covers previous fields */
+ rte_compiler_barrier();
+ p = (uintptr_t)&mb_def.rearm_data;
+ rxq->mbuf_initializer = *(uint64_t *)p;
+
+ return 0;
+}
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 6a1ec48..98a77d5 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -188,6 +188,7 @@ struct virtqueue {
*/
uint16_t vq_used_cons_idx;
uint16_t vq_avail_idx;
+ uint64_t mbuf_initializer; /**< value to init mbufs. */
phys_addr_t virtio_net_hdr_mem; /**< hdr for each xmit packet */

struct rte_mbuf **sw_ring; /**< RX software ring. */
--
1.8.1.4
Wang, Zhihong
2015-10-26 08:34:50 UTC
Permalink
-----Original Message-----
Sent: Sunday, October 25, 2015 11:35 PM
Subject: [dpdk-dev] [PATCH v5 5/7] virtio: virtio vec rx
With fixed avail ring, we don't need to get desc idx from avail ring.
virtio driver only has to deal with desc ring.
This patch uses vector instruction to accelerate processing desc ring.
Acked-by: Wang, Zhihong <***@intel.com>
Huawei Xie
2015-10-25 15:35:03 UTC
Permalink
Changes in v5:
- Call __rte_pktmbuf_prefree_seg to check refcnt when freeing mbufs

Changes in v4:
- move virtio_xmit_cleanup ahead to free descriptors earlier

Changes in v3:
- Remove return at the end of void function
- Remove always_inline attribute for virtio_xmit_cleanup

Bulk free of mbufs when cleaning the used ring.
The shift operation on idx could be saved if vq_free_cnt meant
free slots rather than free descriptors.

TODO: rearrange the vq data structure and pack the stats vars together so that
one vector instruction could update all of them.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.h | 3 +
drivers/net/virtio/virtio_rxtx_simple.c | 106 ++++++++++++++++++++++++++++++++
2 files changed, 109 insertions(+)

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index d7797ab..ae2d47d 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -111,6 +111,9 @@ uint16_t virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t nb_pkts);

+uint16_t virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
+ uint16_t nb_pkts);
+
/*
* The VIRTIO_NET_F_GUEST_TSO[46] features permit the host to send us
* frames larger than 1514 bytes. We do not yet support software LRO
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
index ef17562..624e789 100644
--- a/drivers/net/virtio/virtio_rxtx_simple.c
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -288,6 +288,112 @@ virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
return nb_pkts_received;
}

+#define VIRTIO_TX_FREE_THRESH 32
+#define VIRTIO_TX_MAX_FREE_BUF_SZ 32
+#define VIRTIO_TX_FREE_NR 32
+/* TODO: vq->tx_free_cnt could mean num of free slots so we could avoid shift */
+static inline void
+virtio_xmit_cleanup(struct virtqueue *vq)
+{
+ uint16_t i, desc_idx;
+ int nb_free = 0;
+ struct rte_mbuf *m, *free[VIRTIO_TX_MAX_FREE_BUF_SZ];
+
+ desc_idx = (uint16_t)(vq->vq_used_cons_idx &
+ ((vq->vq_nentries >> 1) - 1));
+ m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ m = __rte_pktmbuf_prefree_seg(m);
+ if (likely(m != NULL)) {
+ free[0] = m;
+ nb_free = 1;
+ for (i = 1; i < VIRTIO_TX_FREE_NR; i++) {
+ m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ m = __rte_pktmbuf_prefree_seg(m);
+ if (likely(m != NULL)) {
+ if (likely(m->pool == free[0]->pool))
+ free[nb_free++] = m;
+ else {
+ rte_mempool_put_bulk(free[0]->pool,
+ (void **)free, nb_free);
+ free[0] = m;
+ nb_free = 1;
+ }
+ }
+ }
+ rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+ } else {
+ for (i = 1; i < VIRTIO_TX_FREE_NR; i++) {
+ m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ m = __rte_pktmbuf_prefree_seg(m);
+ if (m != NULL)
+ rte_mempool_put(m->pool, m);
+ }
+ }
+
+ vq->vq_used_cons_idx += VIRTIO_TX_FREE_NR;
+ vq->vq_free_cnt += (VIRTIO_TX_FREE_NR << 1);
+}
+
+uint16_t
+virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
+ uint16_t nb_pkts)
+{
+ struct virtqueue *txvq = tx_queue;
+ uint16_t nb_used;
+ uint16_t desc_idx;
+ struct vring_desc *start_dp;
+ uint16_t nb_tail, nb_commit;
+ int i;
+ uint16_t desc_idx_max = (txvq->vq_nentries >> 1) - 1;
+
+ nb_used = VIRTQUEUE_NUSED(txvq);
+ rte_compiler_barrier();
+
+ if (nb_used >= VIRTIO_TX_FREE_THRESH)
+ virtio_xmit_cleanup(tx_queue);
+
+ nb_commit = nb_pkts = RTE_MIN((txvq->vq_free_cnt >> 1), nb_pkts);
+ desc_idx = (uint16_t) (txvq->vq_avail_idx & desc_idx_max);
+ start_dp = txvq->vq_ring.desc;
+ nb_tail = (uint16_t) (desc_idx_max + 1 - desc_idx);
+
+ if (nb_commit >= nb_tail) {
+ for (i = 0; i < nb_tail; i++)
+ txvq->vq_descx[desc_idx + i].cookie = tx_pkts[i];
+ for (i = 0; i < nb_tail; i++) {
+ start_dp[desc_idx].addr =
+ RTE_MBUF_DATA_DMA_ADDR(*tx_pkts);
+ start_dp[desc_idx].len = (*tx_pkts)->pkt_len;
+ tx_pkts++;
+ desc_idx++;
+ }
+ nb_commit -= nb_tail;
+ desc_idx = 0;
+ }
+ for (i = 0; i < nb_commit; i++)
+ txvq->vq_descx[desc_idx + i].cookie = tx_pkts[i];
+ for (i = 0; i < nb_commit; i++) {
+ start_dp[desc_idx].addr = RTE_MBUF_DATA_DMA_ADDR(*tx_pkts);
+ start_dp[desc_idx].len = (*tx_pkts)->pkt_len;
+ tx_pkts++;
+ desc_idx++;
+ }
+
+ rte_compiler_barrier();
+
+ txvq->vq_free_cnt -= (uint16_t)(nb_pkts << 1);
+ txvq->vq_avail_idx += nb_pkts;
+ txvq->vq_ring.avail->idx = txvq->vq_avail_idx;
+ txvq->packets += nb_pkts;
+
+ if (likely(nb_pkts)) {
+ if (unlikely(virtqueue_kick_prepare(txvq)))
+ virtqueue_notify(txvq);
+ }
+
+ return nb_pkts;
+}
+
int __attribute__((cold))
virtio_rxq_vec_setup(struct virtqueue *rxq)
{
--
1.8.1.4
Huawei Xie
2015-10-25 15:35:04 UTC
Permalink
Changes in v4:
Check the merge-able feature when selecting the simple rx/tx functions.

The simple rx/tx functions are chosen when merge-able rx is disabled and the user specifies
single-segment and no-offload support.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_rxtx.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 947fc46..0f1daf2 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -53,6 +53,7 @@

#include "virtio_logs.h"
#include "virtio_ethdev.h"
+#include "virtio_pci.h"
#include "virtqueue.h"
#include "virtio_rxtx.h"

@@ -62,6 +63,10 @@
#define VIRTIO_DUMP_PACKET(m, len) do { } while (0)
#endif

+
+#define VIRTIO_SIMPLE_FLAGS ((uint32_t)ETH_TXQ_FLAGS_NOMULTSEGS | \
+ ETH_TXQ_FLAGS_NOOFFLOADS)
+
static int use_simple_rxtx;

static void
@@ -459,6 +464,7 @@ virtio_dev_tx_queue_setup(struct rte_eth_dev *dev,
const struct rte_eth_txconf *tx_conf)
{
uint8_t vtpci_queue_idx = 2 * queue_idx + VTNET_SQ_TQ_QUEUE_IDX;
+ struct virtio_hw *hw = dev->data->dev_private;
struct virtqueue *vq;
uint16_t tx_free_thresh;
int ret;
@@ -471,6 +477,15 @@ virtio_dev_tx_queue_setup(struct rte_eth_dev *dev,
return -EINVAL;
}

+ /* Use simple rx/tx func if single segment and no offloads */
+ if ((tx_conf->txq_flags & VIRTIO_SIMPLE_FLAGS) == VIRTIO_SIMPLE_FLAGS &&
+ !vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+ PMD_INIT_LOG(INFO, "Using simple rx/tx path");
+ dev->tx_pkt_burst = virtio_xmit_pkts_simple;
+ dev->rx_pkt_burst = virtio_recv_pkts_vec;
+ use_simple_rxtx = 1;
+ }
+
ret = virtio_dev_queue_setup(dev, VTNET_TQ, queue_idx, vtpci_queue_idx,
nb_desc, socket_id, &vq);
if (ret < 0) {
--
1.8.1.4
Tan, Jianfeng
2015-10-27 01:44:09 UTC
Permalink
-----Original Message-----
Sent: Sunday, October 25, 2015 11:35 PM
Subject: [dpdk-dev] [PATCH v5 0/7] virtio ring layout optimization and simple
rx/tx processing
Acked-by: Jianfeng Tan <***@intel.com>

Thanks,
Jianfeng
Yuanhan Liu
2015-10-27 02:15:01 UTC
Permalink
-----Original Message-----
Sent: Sunday, October 25, 2015 11:35 PM
Subject: [dpdk-dev] [PATCH v5 0/7] virtio ring layout optimization and simple
rx/tx processing
- Call __rte_pktmbuf_prefree_seg to check refcnt when free mbufs
- Fix the error in virtio tx ring layout ascii chart in the commit message
- Move virtio_xmit_cleanup ahead to free descriptors earlier
- Test merge-able feature when select simple rx/tx functions
[...]
Jianfeng,

I often see a reply like this, just dropping an ACK at the end of
a long email, and nothing more, which takes me (as well as others) some
time to scroll many times to the bottom till I see it.

TBH, it's always a bit frustrating that, after the scroll effort,
I just see such a reply that could have been put on the top of
the email so that I can see it with a glimpse only.

So, top reply would be good for this case, or you could reply like
what I did, removing other context to make your reply fit in one
screen.

--yliu
Bruce Richardson
2015-10-27 10:17:23 UTC
Permalink
Post by Yuanhan Liu
-----Original Message-----
Sent: Sunday, October 25, 2015 11:35 PM
Subject: [dpdk-dev] [PATCH v5 0/7] virtio ring layout optimization and simple
rx/tx processing
- Call __rte_pktmbuf_prefree_seg to check refcnt when free mbufs
- Fix the error in virtio tx ring layout ascii chart in the commit message
- Move virtio_xmit_cleanup ahead to free descriptors earlier
- Test merge-able feature when select simple rx/tx functions
[...]
Jianfeng,
I often see a reply like this, just dropping an ACK at the end of
a long email, and nothing more, which takes me (as well as others) some
time to scroll many times to the bottom till I see it.
TBH, it's always a bit frustrating that, after the scroll effort,
I just see such a reply that could have been put on the top of
the email so that I can see it with a glimpse only.
So, top reply would be good for this case, or you could reply like
what I did, removing other context to make your reply fit in one
screen.
--yliu
+1

When ack'ing patches, please place the ack on the line underneath the signoff
and delete the rest of the email below, as it's unneeded.

/Bruce
Huawei Xie
2015-10-29 14:53:23 UTC
Permalink
Changes in v3:
- Remove unnecessary NULL test for rte_free
- Remove unnecessary assign of local var vq after free

Add software RX ring in virtqueue.
Add fake_mbuf in virtqueue for wraparound processing.
Use global simple_rxtx to indicate whether simple rxtx is enabled

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.c | 11 ++++++++++-
drivers/net/virtio/virtio_rxtx.c | 7 +++++++
drivers/net/virtio/virtqueue.h | 4 ++++
3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 79a3640..82676d3 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -247,8 +247,8 @@ virtio_dev_queue_release(struct virtqueue *vq) {
VIRTIO_WRITE_REG_2(hw, VIRTIO_PCI_QUEUE_SEL, vq->queue_id);
VIRTIO_WRITE_REG_4(hw, VIRTIO_PCI_QUEUE_PFN, 0);

+ rte_free(vq->sw_ring);
rte_free(vq);
- vq = NULL;
}
}

@@ -292,6 +292,9 @@ int virtio_dev_queue_setup(struct rte_eth_dev *dev,
dev->data->port_id, queue_idx);
vq = rte_zmalloc(vq_name, sizeof(struct virtqueue) +
vq_size * sizeof(struct vq_desc_extra), RTE_CACHE_LINE_SIZE);
+ vq->sw_ring = rte_zmalloc_socket("rxq->sw_ring",
+ (RTE_PMD_VIRTIO_RX_MAX_BURST + vq_size) *
+ sizeof(vq->sw_ring[0]), RTE_CACHE_LINE_SIZE, socket_id);
} else if (queue_type == VTNET_TQ) {
snprintf(vq_name, sizeof(vq_name), "port%d_tvq%d",
dev->data->port_id, queue_idx);
@@ -308,6 +311,12 @@ int virtio_dev_queue_setup(struct rte_eth_dev *dev,
PMD_INIT_LOG(ERR, "%s: Can not allocate virtqueue", __func__);
return (-ENOMEM);
}
+ if (queue_type == VTNET_RQ && vq->sw_ring == NULL) {
+ PMD_INIT_LOG(ERR, "%s: Can not allocate RX soft ring",
+ __func__);
+ rte_free(vq);
+ return -ENOMEM;
+ }

vq->hw = hw;
vq->port_id = dev->data->port_id;
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 9324f7f..5c00e9d 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -62,6 +62,8 @@
#define VIRTIO_DUMP_PACKET(m, len) do { } while (0)
#endif

+static int use_simple_rxtx;
+
static void
vq_ring_free_chain(struct virtqueue *vq, uint16_t desc_idx)
{
@@ -299,6 +301,11 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
/* Allocate blank mbufs for the each rx descriptor */
nbufs = 0;
error = ENOSPC;
+
+ memset(&vq->fake_mbuf, 0, sizeof(vq->fake_mbuf));
+ for (i = 0; i < RTE_PMD_VIRTIO_RX_MAX_BURST; i++)
+ vq->sw_ring[vq->vq_nentries + i] = &vq->fake_mbuf;
+
while (!virtqueue_full(vq)) {
m = rte_rxmbuf_alloc(vq->mpool);
if (m == NULL)
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 7789411..6a1ec48 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -190,6 +190,10 @@ struct virtqueue {
uint16_t vq_avail_idx;
phys_addr_t virtio_net_hdr_mem; /**< hdr for each xmit packet */

+ struct rte_mbuf **sw_ring; /**< RX software ring. */
+ /* dummy mbuf, for wraparound when processing RX ring. */
+ struct rte_mbuf fake_mbuf;
+
/* Statistics */
uint64_t packets;
uint64_t bytes;
--
1.8.1.4
Thomas Monjalon
2015-10-30 18:13:28 UTC
Permalink
Post by Huawei Xie
+static int use_simple_rxtx;
error: ‘use_simple_rxtx’ defined but not used

I'll try to move it in the next patch.
Huawei Xie
2015-10-29 14:53:27 UTC
Permalink
Changes in v5:
- call __rte_pktmbuf_prefree_seg to check refcnt when free mbufs

Changes in v4:
- move virtio_xmit_cleanup ahead to free descriptors earlier

Changes in v3:
- Remove return at the end of void function
- Remove always_inline attribute for virtio_xmit_cleanup

Bulk free of mbufs when cleaning the used ring.
The shift operation on idx could be saved if vq_free_cnt meant
free slots rather than free descriptors.

TODO: rearrange vq data structure, pack the stats var together so that we
could use one vec instruction to update all of them.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.h | 3 +
drivers/net/virtio/virtio_rxtx_simple.c | 106 ++++++++++++++++++++++++++++++++
2 files changed, 109 insertions(+)

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index d7797ab..ae2d47d 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -111,6 +111,9 @@ uint16_t virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t nb_pkts);

+uint16_t virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
+ uint16_t nb_pkts);
+
/*
* The VIRTIO_NET_F_GUEST_TSO[46] features permit the host to send us
* frames larger than 1514 bytes. We do not yet support software LRO
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
index ef17562..624e789 100644
--- a/drivers/net/virtio/virtio_rxtx_simple.c
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -288,6 +288,112 @@ virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
return nb_pkts_received;
}

+#define VIRTIO_TX_FREE_THRESH 32
+#define VIRTIO_TX_MAX_FREE_BUF_SZ 32
+#define VIRTIO_TX_FREE_NR 32
+/* TODO: vq->tx_free_cnt could mean num of free slots so we could avoid shift */
+static inline void
+virtio_xmit_cleanup(struct virtqueue *vq)
+{
+ uint16_t i, desc_idx;
+ int nb_free = 0;
+ struct rte_mbuf *m, *free[VIRTIO_TX_MAX_FREE_BUF_SZ];
+
+ desc_idx = (uint16_t)(vq->vq_used_cons_idx &
+ ((vq->vq_nentries >> 1) - 1));
+ m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ m = __rte_pktmbuf_prefree_seg(m);
+ if (likely(m != NULL)) {
+ free[0] = m;
+ nb_free = 1;
+ for (i = 1; i < VIRTIO_TX_FREE_NR; i++) {
+ m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ m = __rte_pktmbuf_prefree_seg(m);
+ if (likely(m != NULL)) {
+ if (likely(m->pool == free[0]->pool))
+ free[nb_free++] = m;
+ else {
+ rte_mempool_put_bulk(free[0]->pool,
+ (void **)free, nb_free);
+ free[0] = m;
+ nb_free = 1;
+ }
+ }
+ }
+ rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+ } else {
+ for (i = 1; i < VIRTIO_TX_FREE_NR; i++) {
+ m = (struct rte_mbuf *)vq->vq_descx[desc_idx++].cookie;
+ m = __rte_pktmbuf_prefree_seg(m);
+ if (m != NULL)
+ rte_mempool_put(m->pool, m);
+ }
+ }
+
+ vq->vq_used_cons_idx += VIRTIO_TX_FREE_NR;
+ vq->vq_free_cnt += (VIRTIO_TX_FREE_NR << 1);
+}
+
+uint16_t
+virtio_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
+ uint16_t nb_pkts)
+{
+ struct virtqueue *txvq = tx_queue;
+ uint16_t nb_used;
+ uint16_t desc_idx;
+ struct vring_desc *start_dp;
+ uint16_t nb_tail, nb_commit;
+ int i;
+ uint16_t desc_idx_max = (txvq->vq_nentries >> 1) - 1;
+
+ nb_used = VIRTQUEUE_NUSED(txvq);
+ rte_compiler_barrier();
+
+ if (nb_used >= VIRTIO_TX_FREE_THRESH)
+ virtio_xmit_cleanup(tx_queue);
+
+ nb_commit = nb_pkts = RTE_MIN((txvq->vq_free_cnt >> 1), nb_pkts);
+ desc_idx = (uint16_t) (txvq->vq_avail_idx & desc_idx_max);
+ start_dp = txvq->vq_ring.desc;
+ nb_tail = (uint16_t) (desc_idx_max + 1 - desc_idx);
+
+ if (nb_commit >= nb_tail) {
+ for (i = 0; i < nb_tail; i++)
+ txvq->vq_descx[desc_idx + i].cookie = tx_pkts[i];
+ for (i = 0; i < nb_tail; i++) {
+ start_dp[desc_idx].addr =
+ RTE_MBUF_DATA_DMA_ADDR(*tx_pkts);
+ start_dp[desc_idx].len = (*tx_pkts)->pkt_len;
+ tx_pkts++;
+ desc_idx++;
+ }
+ nb_commit -= nb_tail;
+ desc_idx = 0;
+ }
+ for (i = 0; i < nb_commit; i++)
+ txvq->vq_descx[desc_idx + i].cookie = tx_pkts[i];
+ for (i = 0; i < nb_commit; i++) {
+ start_dp[desc_idx].addr = RTE_MBUF_DATA_DMA_ADDR(*tx_pkts);
+ start_dp[desc_idx].len = (*tx_pkts)->pkt_len;
+ tx_pkts++;
+ desc_idx++;
+ }
+
+ rte_compiler_barrier();
+
+ txvq->vq_free_cnt -= (uint16_t)(nb_pkts << 1);
+ txvq->vq_avail_idx += nb_pkts;
+ txvq->vq_ring.avail->idx = txvq->vq_avail_idx;
+ txvq->packets += nb_pkts;
+
+ if (likely(nb_pkts)) {
+ if (unlikely(virtqueue_kick_prepare(txvq)))
+ virtqueue_notify(txvq);
+ }
+
+ return nb_pkts;
+}
+
int __attribute__((cold))
virtio_rxq_vec_setup(struct virtqueue *rxq)
{
--
1.8.1.4
Huawei Xie
2015-10-29 14:53:28 UTC
Permalink
Changes in v4:
Check the merge-able feature when selecting the simple rx/tx functions.

The simple rx/tx functions are chosen when merge-able RX is disabled and the user specifies
single segment and no offload support (an application-side usage sketch follows the patch below).

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_rxtx.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 947fc46..0f1daf2 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -53,6 +53,7 @@

#include "virtio_logs.h"
#include "virtio_ethdev.h"
+#include "virtio_pci.h"
#include "virtqueue.h"
#include "virtio_rxtx.h"

@@ -62,6 +63,10 @@
#define VIRTIO_DUMP_PACKET(m, len) do { } while (0)
#endif

+
+#define VIRTIO_SIMPLE_FLAGS ((uint32_t)ETH_TXQ_FLAGS_NOMULTSEGS | \
+ ETH_TXQ_FLAGS_NOOFFLOADS)
+
static int use_simple_rxtx;

static void
@@ -459,6 +464,7 @@ virtio_dev_tx_queue_setup(struct rte_eth_dev *dev,
const struct rte_eth_txconf *tx_conf)
{
uint8_t vtpci_queue_idx = 2 * queue_idx + VTNET_SQ_TQ_QUEUE_IDX;
+ struct virtio_hw *hw = dev->data->dev_private;
struct virtqueue *vq;
uint16_t tx_free_thresh;
int ret;
@@ -471,6 +477,15 @@ virtio_dev_tx_queue_setup(struct rte_eth_dev *dev,
return -EINVAL;
}

+ /* Use simple rx/tx func if single segment and no offloads */
+ if ((tx_conf->txq_flags & VIRTIO_SIMPLE_FLAGS) == VIRTIO_SIMPLE_FLAGS &&
+ !vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
+ PMD_INIT_LOG(INFO, "Using simple rx/tx path");
+ dev->tx_pkt_burst = virtio_xmit_pkts_simple;
+ dev->rx_pkt_burst = virtio_recv_pkts_vec;
+ use_simple_rxtx = 1;
+ }
+
ret = virtio_dev_queue_setup(dev, VTNET_TQ, queue_idx, vtpci_queue_idx,
nb_desc, socket_id, &vq);
if (ret < 0) {
--
1.8.1.4
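For illustration, a rough application-side sketch of configuring a TX queue so
that this simple path gets selected. The txq flag values are the ones the patch
above checks; the function name, queue number and descriptor count are
arbitrary assumptions, not part of the patch.

    #include <rte_ethdev.h>

    /* Ask for exactly the conditions the PMD tests: single-segment TX and no
     * TX offloads. If VIRTIO_NET_F_MRG_RXBUF is also not negotiated, the
     * virtio PMD installs virtio_xmit_pkts_simple / virtio_recv_pkts_vec.
     */
    static int
    setup_simple_txq(uint8_t port_id, uint16_t queue_id, unsigned int socket_id)
    {
            struct rte_eth_dev_info dev_info;
            struct rte_eth_txconf txconf;

            rte_eth_dev_info_get(port_id, &dev_info);
            txconf = dev_info.default_txconf;
            txconf.txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |
                               ETH_TXQ_FLAGS_NOOFFLOADS;

            /* 256 descriptors is only an example value */
            return rte_eth_tx_queue_setup(port_id, queue_id, 256,
                                          socket_id, &txconf);
    }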
Huawei Xie
2015-10-29 14:53:26 UTC
Permalink
With the fixed avail ring, we don't need to get the desc idx from the avail ring;
the virtio driver only has to deal with the desc ring.
This patch uses vector instructions to accelerate processing of the desc ring.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.h | 2 +
drivers/net/virtio/virtio_rxtx.c | 3 +
drivers/net/virtio/virtio_rxtx.h | 2 +
drivers/net/virtio/virtio_rxtx_simple.c | 224 ++++++++++++++++++++++++++++++++
drivers/net/virtio/virtqueue.h | 1 +
5 files changed, 232 insertions(+)

diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index 9026d42..d7797ab 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -108,6 +108,8 @@ uint16_t virtio_recv_mergeable_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t nb_pkts);

+uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+ uint16_t nb_pkts);

/*
* The VIRTIO_NET_F_GUEST_TSO[46] features permit the host to send us
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 5162ce6..947fc46 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -432,6 +432,9 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
vq->mpool = mp;

dev->data->rx_queues[queue_idx] = vq;
+
+ virtio_rxq_vec_setup(vq);
+
return 0;
}

diff --git a/drivers/net/virtio/virtio_rxtx.h b/drivers/net/virtio/virtio_rxtx.h
index 7d2d8fe..831e492 100644
--- a/drivers/net/virtio/virtio_rxtx.h
+++ b/drivers/net/virtio/virtio_rxtx.h
@@ -33,5 +33,7 @@

#define RTE_PMD_VIRTIO_RX_MAX_BURST 64

+int virtio_rxq_vec_setup(struct virtqueue *rxq);
+
int virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
struct rte_mbuf *m);
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
index cac5b9f..ef17562 100644
--- a/drivers/net/virtio/virtio_rxtx_simple.c
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -58,6 +58,10 @@
#include "virtqueue.h"
#include "virtio_rxtx.h"

+#define RTE_VIRTIO_VPMD_RX_BURST 32
+#define RTE_VIRTIO_DESC_PER_LOOP 8
+#define RTE_VIRTIO_VPMD_RX_REARM_THRESH RTE_VIRTIO_VPMD_RX_BURST
+
int __attribute__((cold))
virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
struct rte_mbuf *cookie)
@@ -82,3 +86,223 @@ virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,

return 0;
}
+
+static inline void
+virtio_rxq_rearm_vec(struct virtqueue *rxvq)
+{
+ int i;
+ uint16_t desc_idx;
+ struct rte_mbuf **sw_ring;
+ struct vring_desc *start_dp;
+ int ret;
+
+ desc_idx = rxvq->vq_avail_idx & (rxvq->vq_nentries - 1);
+ sw_ring = &rxvq->sw_ring[desc_idx];
+ start_dp = &rxvq->vq_ring.desc[desc_idx];
+
+ ret = rte_mempool_get_bulk(rxvq->mpool, (void **)sw_ring,
+ RTE_VIRTIO_VPMD_RX_REARM_THRESH);
+ if (unlikely(ret)) {
+ rte_eth_devices[rxvq->port_id].data->rx_mbuf_alloc_failed +=
+ RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ return;
+ }
+
+ for (i = 0; i < RTE_VIRTIO_VPMD_RX_REARM_THRESH; i++) {
+ uintptr_t p;
+
+ p = (uintptr_t)&sw_ring[i]->rearm_data;
+ *(uint64_t *)p = rxvq->mbuf_initializer;
+
+ start_dp[i].addr =
+ (uint64_t)((uintptr_t)sw_ring[i]->buf_physaddr +
+ RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
+ start_dp[i].len = sw_ring[i]->buf_len -
+ RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);
+ }
+
+ rxvq->vq_avail_idx += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ rxvq->vq_free_cnt -= RTE_VIRTIO_VPMD_RX_REARM_THRESH;
+ vq_update_avail_idx(rxvq);
+}
+
+/* virtio vPMD receive routine, only accept(nb_pkts >= RTE_VIRTIO_DESC_PER_LOOP)
+ *
+ * This routine is for non-mergable RX, one desc for each guest buffer.
+ * This routine is based on the RX ring layout optimization. Each entry in the
+ * avail ring points to the desc with the same index in the desc ring and this
+ * will never be changed in the driver.
+ *
+ * - nb_pkts < RTE_VIRTIO_DESC_PER_LOOP, just return no packet
+ */
+uint16_t
+virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+ uint16_t nb_pkts)
+{
+ struct virtqueue *rxvq = rx_queue;
+ uint16_t nb_used;
+ uint16_t desc_idx;
+ struct vring_used_elem *rused;
+ struct rte_mbuf **sw_ring;
+ struct rte_mbuf **sw_ring_end;
+ uint16_t nb_pkts_received;
+ __m128i shuf_msk1, shuf_msk2, len_adjust;
+
+ shuf_msk1 = _mm_set_epi8(
+ 0xFF, 0xFF, 0xFF, 0xFF,
+ 0xFF, 0xFF, /* vlan tci */
+ 5, 4, /* dat len */
+ 0xFF, 0xFF, 5, 4, /* pkt len */
+ 0xFF, 0xFF, 0xFF, 0xFF /* packet type */
+
+ );
+
+ shuf_msk2 = _mm_set_epi8(
+ 0xFF, 0xFF, 0xFF, 0xFF,
+ 0xFF, 0xFF, /* vlan tci */
+ 13, 12, /* dat len */
+ 0xFF, 0xFF, 13, 12, /* pkt len */
+ 0xFF, 0xFF, 0xFF, 0xFF /* packet type */
+ );
+
+ /* Subtract the header length.
+ * In which case do we need the header length in used->len ?
+ */
+ len_adjust = _mm_set_epi16(
+ 0, 0,
+ 0,
+ (uint16_t) -sizeof(struct virtio_net_hdr),
+ 0, (uint16_t) -sizeof(struct virtio_net_hdr),
+ 0, 0);
+
+ if (unlikely(nb_pkts < RTE_VIRTIO_DESC_PER_LOOP))
+ return 0;
+
+ nb_used = *(volatile uint16_t *)&rxvq->vq_ring.used->idx -
+ rxvq->vq_used_cons_idx;
+
+ rte_compiler_barrier();
+
+ if (unlikely(nb_used == 0))
+ return 0;
+
+ nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, RTE_VIRTIO_DESC_PER_LOOP);
+ nb_used = RTE_MIN(nb_used, nb_pkts);
+
+ desc_idx = (uint16_t)(rxvq->vq_used_cons_idx & (rxvq->vq_nentries - 1));
+ rused = &rxvq->vq_ring.used->ring[desc_idx];
+ sw_ring = &rxvq->sw_ring[desc_idx];
+ sw_ring_end = &rxvq->sw_ring[rxvq->vq_nentries];
+
+ _mm_prefetch((const void *)rused, _MM_HINT_T0);
+
+ if (rxvq->vq_free_cnt >= RTE_VIRTIO_VPMD_RX_REARM_THRESH) {
+ virtio_rxq_rearm_vec(rxvq);
+ if (unlikely(virtqueue_kick_prepare(rxvq)))
+ virtqueue_notify(rxvq);
+ }
+
+ for (nb_pkts_received = 0;
+ nb_pkts_received < nb_used;) {
+ __m128i desc[RTE_VIRTIO_DESC_PER_LOOP / 2];
+ __m128i mbp[RTE_VIRTIO_DESC_PER_LOOP / 2];
+ __m128i pkt_mb[RTE_VIRTIO_DESC_PER_LOOP];
+
+ mbp[0] = _mm_loadu_si128((__m128i *)(sw_ring + 0));
+ desc[0] = _mm_loadu_si128((__m128i *)(rused + 0));
+ _mm_storeu_si128((__m128i *)&rx_pkts[0], mbp[0]);
+
+ mbp[1] = _mm_loadu_si128((__m128i *)(sw_ring + 2));
+ desc[1] = _mm_loadu_si128((__m128i *)(rused + 2));
+ _mm_storeu_si128((__m128i *)&rx_pkts[2], mbp[1]);
+
+ mbp[2] = _mm_loadu_si128((__m128i *)(sw_ring + 4));
+ desc[2] = _mm_loadu_si128((__m128i *)(rused + 4));
+ _mm_storeu_si128((__m128i *)&rx_pkts[4], mbp[2]);
+
+ mbp[3] = _mm_loadu_si128((__m128i *)(sw_ring + 6));
+ desc[3] = _mm_loadu_si128((__m128i *)(rused + 6));
+ _mm_storeu_si128((__m128i *)&rx_pkts[6], mbp[3]);
+
+ pkt_mb[1] = _mm_shuffle_epi8(desc[0], shuf_msk2);
+ pkt_mb[0] = _mm_shuffle_epi8(desc[0], shuf_msk1);
+ pkt_mb[1] = _mm_add_epi16(pkt_mb[1], len_adjust);
+ pkt_mb[0] = _mm_add_epi16(pkt_mb[0], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[1]->rx_descriptor_fields1,
+ pkt_mb[1]);
+ _mm_storeu_si128((void *)&rx_pkts[0]->rx_descriptor_fields1,
+ pkt_mb[0]);
+
+ pkt_mb[3] = _mm_shuffle_epi8(desc[1], shuf_msk2);
+ pkt_mb[2] = _mm_shuffle_epi8(desc[1], shuf_msk1);
+ pkt_mb[3] = _mm_add_epi16(pkt_mb[3], len_adjust);
+ pkt_mb[2] = _mm_add_epi16(pkt_mb[2], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[3]->rx_descriptor_fields1,
+ pkt_mb[3]);
+ _mm_storeu_si128((void *)&rx_pkts[2]->rx_descriptor_fields1,
+ pkt_mb[2]);
+
+ pkt_mb[5] = _mm_shuffle_epi8(desc[2], shuf_msk2);
+ pkt_mb[4] = _mm_shuffle_epi8(desc[2], shuf_msk1);
+ pkt_mb[5] = _mm_add_epi16(pkt_mb[5], len_adjust);
+ pkt_mb[4] = _mm_add_epi16(pkt_mb[4], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[5]->rx_descriptor_fields1,
+ pkt_mb[5]);
+ _mm_storeu_si128((void *)&rx_pkts[4]->rx_descriptor_fields1,
+ pkt_mb[4]);
+
+ pkt_mb[7] = _mm_shuffle_epi8(desc[3], shuf_msk2);
+ pkt_mb[6] = _mm_shuffle_epi8(desc[3], shuf_msk1);
+ pkt_mb[7] = _mm_add_epi16(pkt_mb[7], len_adjust);
+ pkt_mb[6] = _mm_add_epi16(pkt_mb[6], len_adjust);
+ _mm_storeu_si128((void *)&rx_pkts[7]->rx_descriptor_fields1,
+ pkt_mb[7]);
+ _mm_storeu_si128((void *)&rx_pkts[6]->rx_descriptor_fields1,
+ pkt_mb[6]);
+
+ if (unlikely(nb_used <= RTE_VIRTIO_DESC_PER_LOOP)) {
+ if (sw_ring + nb_used <= sw_ring_end)
+ nb_pkts_received += nb_used;
+ else
+ nb_pkts_received += sw_ring_end - sw_ring;
+ break;
+ } else {
+ if (unlikely(sw_ring + RTE_VIRTIO_DESC_PER_LOOP >=
+ sw_ring_end)) {
+ nb_pkts_received += sw_ring_end - sw_ring;
+ break;
+ } else {
+ nb_pkts_received += RTE_VIRTIO_DESC_PER_LOOP;
+
+ rx_pkts += RTE_VIRTIO_DESC_PER_LOOP;
+ sw_ring += RTE_VIRTIO_DESC_PER_LOOP;
+ rused += RTE_VIRTIO_DESC_PER_LOOP;
+ nb_used -= RTE_VIRTIO_DESC_PER_LOOP;
+ }
+ }
+ }
+
+ rxvq->vq_used_cons_idx += nb_pkts_received;
+ rxvq->vq_free_cnt += nb_pkts_received;
+ rxvq->packets += nb_pkts_received;
+ return nb_pkts_received;
+}
+
+int __attribute__((cold))
+virtio_rxq_vec_setup(struct virtqueue *rxq)
+{
+ uintptr_t p;
+ struct rte_mbuf mb_def = { .buf_addr = 0 }; /* zeroed mbuf */
+
+ mb_def.nb_segs = 1;
+ mb_def.data_off = RTE_PKTMBUF_HEADROOM;
+ mb_def.port = rxq->port_id;
+ rte_mbuf_refcnt_set(&mb_def, 1);
+
+ /* prevent compiler reordering: rearm_data covers previous fields */
+ rte_compiler_barrier();
+ p = (uintptr_t)&mb_def.rearm_data;
+ rxq->mbuf_initializer = *(uint64_t *)p;
+
+ return 0;
+}
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index 6a1ec48..98a77d5 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -188,6 +188,7 @@ struct virtqueue {
*/
uint16_t vq_used_cons_idx;
uint16_t vq_avail_idx;
+ uint64_t mbuf_initializer; /**< value to init mbufs. */
phys_addr_t virtio_net_hdr_mem; /**< hdr for each xmit packet */

struct rte_mbuf **sw_ring; /**< RX software ring. */
--
1.8.1.4
Thomas Monjalon
2015-10-30 18:19:13 UTC
Permalink
Sorry, there is a clang error.
Post by Huawei Xie
+ _mm_prefetch((const void *)rused, _MM_HINT_T0);
virtio_rxtx_simple.c:197:2: error: cast from 'const void *' to
'void *' drops const qualifier
Xie, Huawei
2015-11-02 02:18:37 UTC
Permalink
Post by Thomas Monjalon
Sorry, there is a clang error.
Post by Huawei Xie
+ _mm_prefetch((const void *)rused, _MM_HINT_T0);
virtio_rxtx_simple.c:197:2: error: cast from 'const void *' to
'void *' drops const qualifier
This is weird. This conversion is actually from 'void *' to 'const void
*', not the opposite, so there should be no error.
I checked a clang build; it doesn't report an error.
clang version 3.3 (tags/RELEASE_33/rc2)
Thomas Monjalon
2015-11-02 07:28:51 UTC
Permalink
Post by Xie, Huawei
Post by Thomas Monjalon
Sorry, there is a clang error.
Post by Huawei Xie
+ _mm_prefetch((const void *)rused, _MM_HINT_T0);
virtio_rxtx_simple.c:197:2: error: cast from 'const void *' to
'void *' drops const qualifier
This is weird. This conversion is actually from 'void *' to 'const void
*', not the opposite, so there should be no error.
I checked clang build, it doesn't report error.
clang version 3.3 (tags/RELEASE_33/rc2)
I'm using clang 3.6.2.
Anybody else to check please?
Xie, Huawei
2015-11-02 08:49:12 UTC
Permalink
Post by Thomas Monjalon
Post by Xie, Huawei
Post by Thomas Monjalon
Sorry, there is a clang error.
Post by Huawei Xie
+ _mm_prefetch((const void *)rused, _MM_HINT_T0);
virtio_rxtx_simple.c:197:2: error: cast from 'const void *' to
'void *' drops const qualifier
This is weird. This conversion is actually from 'void *' to 'const void
*', not the opposite, so there should be no error.
I checked clang build, it doesn't report error.
clang version 3.3 (tags/RELEASE_33/rc2)
I'm using clang 3.6.2.
Anybody else to check please?
Thomas:

I checked clang-3.5 on Fedora 22 and clang-3.6 on Ubuntu 15.04.
Clang-3.6 reports warnings, but the definition of this macro doesn't change.

The (const void *) conversion is used in the code because, when
__OPTIMIZE__ is defined, GCC declares the first parameter as "const void *".

Could you add the following pragma (used in other vec PMDs as well) before
virtqueue_enqueue_recv_refill_simple, or do I need to submit a new patchset?

+#ifndef __INTEL_COMPILER
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#endif
Thomas Monjalon
2015-11-02 09:03:07 UTC
Permalink
Post by Xie, Huawei
Post by Thomas Monjalon
Post by Xie, Huawei
Post by Thomas Monjalon
Sorry, there is a clang error.
Post by Huawei Xie
+ _mm_prefetch((const void *)rused, _MM_HINT_T0);
virtio_rxtx_simple.c:197:2: error: cast from 'const void *' to
'void *' drops const qualifier
This is weird. This conversion is actually from 'void *' to 'const void
*', not the opposite, so there should be no error.
I checked clang build, it doesn't report error.
clang version 3.3 (tags/RELEASE_33/rc2)
I'm using clang 3.6.2.
Anybody else to check please?
I checked clang-3.5 on Fedora 22 and clang-3.6 on Ubuntu 15.04.
Clang-3.6 reports warnings, but the definition of this macro doesn't change.
Why (const void*) conversion is used in the code is because when
__OPTIMIZE__ is defined, GCC defines first parameter to be "const void *".
Could you add the following macro(used in other vec pmds as well) before
virtqueue_enqueue_recv_refill_simple or do i need to submit a new patchset?
+#ifndef __INTEL_COMPILER
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#endif
OK I'll try it.
Huawei Xie
2015-10-29 14:53:25 UTC
Permalink
fill avail ring with blank mbufs in virtio_dev_vring_start

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/Makefile | 2 +-
drivers/net/virtio/virtio_rxtx.c | 6 ++-
drivers/net/virtio/virtio_rxtx.h | 3 ++
drivers/net/virtio/virtio_rxtx_simple.c | 84 +++++++++++++++++++++++++++++++++
4 files changed, 92 insertions(+), 3 deletions(-)
create mode 100644 drivers/net/virtio/virtio_rxtx_simple.c

diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 930b60f..43835ba 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -50,7 +50,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtqueue.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_pci.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx.c
SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_ethdev.c
-
+SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_rxtx_simple.c

# this lib depends upon:
DEPDIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += lib/librte_eal lib/librte_ether
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 7c82a6a..5162ce6 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -320,8 +320,10 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
/******************************************
* Enqueue allocated buffers *
*******************************************/
- error = virtqueue_enqueue_recv_refill(vq, m);
-
+ if (use_simple_rxtx)
+ error = virtqueue_enqueue_recv_refill_simple(vq, m);
+ else
+ error = virtqueue_enqueue_recv_refill(vq, m);
if (error) {
rte_pktmbuf_free(m);
break;
diff --git a/drivers/net/virtio/virtio_rxtx.h b/drivers/net/virtio/virtio_rxtx.h
index a10aa69..7d2d8fe 100644
--- a/drivers/net/virtio/virtio_rxtx.h
+++ b/drivers/net/virtio/virtio_rxtx.h
@@ -32,3 +32,6 @@
*/

#define RTE_PMD_VIRTIO_RX_MAX_BURST 64
+
+int virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
+ struct rte_mbuf *m);
diff --git a/drivers/net/virtio/virtio_rxtx_simple.c b/drivers/net/virtio/virtio_rxtx_simple.c
new file mode 100644
index 0000000..cac5b9f
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx_simple.c
@@ -0,0 +1,84 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <tmmintrin.h>
+
+#include <rte_cycles.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_branch_prediction.h>
+#include <rte_mempool.h>
+#include <rte_malloc.h>
+#include <rte_mbuf.h>
+#include <rte_ether.h>
+#include <rte_ethdev.h>
+#include <rte_prefetch.h>
+#include <rte_string_fns.h>
+#include <rte_errno.h>
+#include <rte_byteorder.h>
+
+#include "virtio_logs.h"
+#include "virtio_ethdev.h"
+#include "virtqueue.h"
+#include "virtio_rxtx.h"
+
+int __attribute__((cold))
+virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
+ struct rte_mbuf *cookie)
+{
+ struct vq_desc_extra *dxp;
+ struct vring_desc *start_dp;
+ uint16_t desc_idx;
+
+ desc_idx = vq->vq_avail_idx & (vq->vq_nentries - 1);
+ dxp = &vq->vq_descx[desc_idx];
+ dxp->cookie = (void *)cookie;
+ vq->sw_ring[desc_idx] = cookie;
+
+ start_dp = vq->vq_ring.desc;
+ start_dp[desc_idx].addr = (uint64_t)((uintptr_t)cookie->buf_physaddr +
+ RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
+ start_dp[desc_idx].len = cookie->buf_len -
+ RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);
+
+ vq->vq_free_cnt--;
+ vq->vq_avail_idx++;
+
+ return 0;
+}
--
1.8.1.4
Huawei Xie
2015-10-29 14:53:22 UTC
Permalink
Would move all rx/tx related declarations into this header file in the future.
Add RTE_PMD_VIRTIO_RX_MAX_BURST.

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_ethdev.c | 1 +
drivers/net/virtio/virtio_rxtx.c | 1 +
drivers/net/virtio/virtio_rxtx.h | 34 ++++++++++++++++++++++++++++++++++
3 files changed, 36 insertions(+)
create mode 100644 drivers/net/virtio/virtio_rxtx.h

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 465d3cd..79a3640 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -61,6 +61,7 @@
#include "virtio_pci.h"
#include "virtio_logs.h"
#include "virtqueue.h"
+#include "virtio_rxtx.h"


static int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index c5b53bb..9324f7f 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -54,6 +54,7 @@
#include "virtio_logs.h"
#include "virtio_ethdev.h"
#include "virtqueue.h"
+#include "virtio_rxtx.h"

#ifdef RTE_LIBRTE_VIRTIO_DEBUG_DUMP
#define VIRTIO_DUMP_PACKET(m, len) rte_pktmbuf_dump(stdout, m, len)
diff --git a/drivers/net/virtio/virtio_rxtx.h b/drivers/net/virtio/virtio_rxtx.h
new file mode 100644
index 0000000..a10aa69
--- /dev/null
+++ b/drivers/net/virtio/virtio_rxtx.h
@@ -0,0 +1,34 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#define RTE_PMD_VIRTIO_RX_MAX_BURST 64
--
1.8.1.4
Huawei Xie
2015-10-29 14:53:24 UTC
Permalink
Changes in V4:
- fix the error in tx ring layout chart in this commit message.

In a DPDK based switching environment, vhost mostly runs on a dedicated core
while virtio processing in guest VMs runs on different cores.
Take RX for example: with the generic implementation, for each guest buffer,
a) the virtio driver allocates a descriptor from the free descriptor list
b) modifies the entry of the avail ring to point to the allocated descriptor
c) after the packet is received, frees the descriptor

When vhost fetches the avail ring, it needs to fetch the modified cache line from
the virtio core, which is a heavy cost on current CPUs.

The idea of this optimization is:
allocate a fixed descriptor for each entry of the avail ring, so the avail ring
will always stay the same during the run.
This removes the L1 cache transfer from the virtio core to the vhost core for the avail ring.
(Note we couldn't avoid the cache transfer for the descriptors.)
Besides, descriptor allocation and free operations are eliminated.
This also makes vector processing possible to further accelerate the processing.

This is the layout for the avail ring(take 256 ring entries for example), with
each entry pointing to the descriptor with the same index.
avail
idx
+
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | ... | 254 | 255 | avail ring
+-+--+-+--+-+-+---------+---+--+---+
| | | | | |
| | | | | |
v v v | v v
+-+--+-+--+-+-+---------+---+--+---+
| 0 | 1 | 2 | ... | 254 | 255 | desc ring
+----+----+---+-------------+------+
|
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | | 254 | 255 | used ring
+----+----+---+-------------+------+
|
+

This is the ring layout for TX.
As we need one virtio header for each xmit packet, we have 128 slots available.

++
||
||
+-----+-----+-----+--------------+------+------+------+
| 0 | 1 | ... | 127 || 128 | 129 | ... | 255 | avail ring
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 128 | 129 | ... | 255 || 128 | 129 | ... | 255 | desc ring for virtio_net_hdr
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 0 | 1 | ... | 127 || 0 | 1 | ... | 127 | desc ring for tx dat
+-----+-----+-----+--------------+------+------+------+
||
||
++

Signed-off-by: Huawei Xie <***@intel.com>
---
drivers/net/virtio/virtio_rxtx.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 5c00e9d..7c82a6a 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -302,6 +302,12 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
nbufs = 0;
error = ENOSPC;

+ if (use_simple_rxtx)
+ for (i = 0; i < vq->vq_nentries; i++) {
+ vq->vq_ring.avail->ring[i] = i;
+ vq->vq_ring.desc[i].flags = VRING_DESC_F_WRITE;
+ }
+
memset(&vq->fake_mbuf, 0, sizeof(vq->fake_mbuf));
for (i = 0; i < RTE_PMD_VIRTIO_RX_MAX_BURST; i++)
vq->sw_ring[vq->vq_nentries + i] = &vq->fake_mbuf;
@@ -332,6 +338,24 @@ virtio_dev_vring_start(struct virtqueue *vq, int queue_type)
VIRTIO_WRITE_REG_4(vq->hw, VIRTIO_PCI_QUEUE_PFN,
vq->mz->phys_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT);
} else if (queue_type == VTNET_TQ) {
+ if (use_simple_rxtx) {
+ int mid_idx = vq->vq_nentries >> 1;
+ for (i = 0; i < mid_idx; i++) {
+ vq->vq_ring.avail->ring[i] = i + mid_idx;
+ vq->vq_ring.desc[i + mid_idx].next = i;
+ vq->vq_ring.desc[i + mid_idx].addr =
+ vq->virtio_net_hdr_mem +
+ mid_idx * vq->hw->vtnet_hdr_size;
+ vq->vq_ring.desc[i + mid_idx].len =
+ vq->hw->vtnet_hdr_size;
+ vq->vq_ring.desc[i + mid_idx].flags =
+ VRING_DESC_F_NEXT;
+ vq->vq_ring.desc[i].flags = 0;
+ }
+ for (i = mid_idx; i < vq->vq_nentries; i++)
+ vq->vq_ring.avail->ring[i] = i;
+ }
+
VIRTIO_WRITE_REG_2(vq->hw, VIRTIO_PCI_QUEUE_SEL,
vq->vq_queue_index);
VIRTIO_WRITE_REG_4(vq->hw, VIRTIO_PCI_QUEUE_PFN,
--
1.8.1.4
Huawei Xie
2015-10-29 14:53:29 UTC
Permalink
Update release notes about virtio ring layout optimization,
vector rx and simple tx support

Signed-off-by: Huawei Xie <***@intel.com>
---
doc/guides/rel_notes/release_2_2.rst | 3 +++
1 file changed, 3 insertions(+)

diff --git a/doc/guides/rel_notes/release_2_2.rst b/doc/guides/rel_notes/release_2_2.rst
index 682f468..5d0dd6f 100644
--- a/doc/guides/rel_notes/release_2_2.rst
+++ b/doc/guides/rel_notes/release_2_2.rst
@@ -4,6 +4,9 @@ DPDK Release 2.2
New Features
------------

+* **Enhanced support for virtio driver.**
+
+ * Virtio ring layout optimization (fixed avail ring), vector RX and simple TX support

Resolved Issues
---------------
--
1.8.1.4
Tan, Jianfeng
2015-10-30 02:05:53 UTC
Permalink
-----Original Message-----
Sent: Thursday, October 29, 2015 10:53 PM
Subject: [dpdk-dev] [PATCH v6 0/8] virtio ring layout optimization and simple
rx/tx processing
- Update release notes
- Fix the error in virtio tx ring layout ascii chart in the cover-letter
......
virtio: add virtio_rxtx.h header file
virtio: add software rx ring, fake_buf into virtqueue
virtio: rx/tx ring layout optimization
virtio: fill RX avail ring with blank mbufs
virtio: virtio vec rx
virtio: simple tx routine
virtio: pick simple rx/tx func
doc: update release notes 2.2 about virtio performance optimization
doc/guides/rel_notes/release_2_2.rst | 3 +
drivers/net/virtio/Makefile | 2 +-
drivers/net/virtio/virtio_ethdev.c | 12 +-
drivers/net/virtio/virtio_ethdev.h | 5 +
drivers/net/virtio/virtio_rxtx.c | 56 ++++-
drivers/net/virtio/virtio_rxtx.h | 39 +++
drivers/net/virtio/virtio_rxtx_simple.c | 414
++++++++++++++++++++++++++++++++
drivers/net/virtio/virtqueue.h | 5 +
8 files changed, 532 insertions(+), 4 deletions(-) create mode 100644
drivers/net/virtio/virtio_rxtx.h create mode 100644
drivers/net/virtio/virtio_rxtx_simple.c
--
1.8.1.4
Acked-by: Jianfeng Tan<***@intel.com>
Thomas Monjalon
2015-11-02 22:09:49 UTC
Permalink
Applied with the modifications discussed in this thread, thanks.
Thomas Monjalon
2015-11-02 22:10:49 UTC
Permalink
Post by Thomas Monjalon
Applied with the modifications discussed in this thread, thanks.
Please Huawei,
Could you share some numbers for these optimizations?
Xie, Huawei
2015-11-03 10:30:01 UTC
Permalink
Post by Thomas Monjalon
Post by Thomas Monjalon
Applied with the modifications discussed in this thread, thanks.
Please Huawei,
Could you share some numbers for these optimizations?
Thomas, thanks for the effort.
My optimization is based on highly optimized dpdkvhost(currently vhost
is the bottleneck) of old version dpdk. I would re-prepare a new
optimized dpdkvhost, and measure the performance again.
Xu, Qian Q
2015-11-27 06:03:43 UTC
Permalink
Sharing some virtio-pmd optimization performance data:
1. Use a simplified vhost-sample that only does the dequeue and free, so virtio only does TX, and measure the virtio TX performance improvement. In the VM, use one virtio port to run txonly so the virtio TX path keeps working. The txonly file was also modified to remove the memory copy part before checking the virtio TX rate. The optimized virtio-pmd delivers ~2x the performance of the non-optimized virtio-pmd.
2. Same as item 1, but with the default txonly file (i.e. with memory copy); the optimized virtio-pmd shows a ~37% performance improvement over the non-optimized virtio-pmd.
3. In the OVS test scenario (one physical NIC + one virtio port in the VM), run testpmd in the VM and let virtio do loopback (both RX and TX); performance improves by 60% over the non-optimized virtio-pmd.



Thanks
Qian

-----Original Message-----
From: dev [mailto:dev-***@dpdk.org] On Behalf Of Huawei Xie
Sent: Thursday, October 29, 2015 10:53 PM
To: ***@dpdk.org
Subject: [dpdk-dev] [PATCH v6 0/8] virtio ring layout optimization and simple rx/tx processing

Changes in v6:
- Update release notes
- Fix the error in virtio tx ring layout ascii chart in the cover-letter

Changes in v5:
- Call __rte_pktmbuf_prefree_seg to check refcnt when free mbufs

Changes in v4:
- Fix the error in virtio tx ring layout ascii chart in the commit message
- Move virtio_xmit_cleanup ahead to free descriptors earlier
- Test merge-able feature when select simple rx/tx functions

Changes in v3:
- Remove unnecessary NULL test for rte_free
- Remove unnecessary assign of local var after free
- Remove return at the end of void function
- Remove always_inline attribute for virtio_xmit_cleanup
- Reword some commit messages
- Add TODO in the commit message of simple tx patch

Changes in v2:
- Remove the configure macro
- Enable simple R/TX processing when user specifies simple txq flags
- Reword some comments and commit messages

In a DPDK based switching environment, vhost mostly runs on a dedicated core while virtio processing in guest VMs runs on other cores.
Take RX for example: with the generic implementation, for each guest buffer,
a) the virtio driver allocates a descriptor from the free descriptor list
b) modifies the entry of the avail ring to point to the allocated descriptor
c) after the packet is received, frees the descriptor

When vhost fetches the avail ring, it needs to fetch the modified cache line from the virtio core, which is a heavy cost on current CPUs.

The idea of this optimization is:
allocate a fixed descriptor for each entry of the avail ring, so the avail ring will always stay the same during the run.
This removes the L1 cache transfer from the virtio core to the vhost core for the avail ring.
(Note we couldn't avoid the cache transfer for the descriptors.)
Besides, descriptor allocation and free operations are eliminated.
This also makes vector processing possible to further accelerate the processing.

This is the layout for the avail ring(take 256 ring entries for example), with each entry pointing to the descriptor with the same index.
avail
idx
+
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | ... | 254 | 255 | avail ring
+-+--+-+--+-+-+---------+---+--+---+
| | | | | |
| | | | | |
v v v | v v
+-+--+-+--+-+-+---------+---+--+---+
| 0 | 1 | 2 | ... | 254 | 255 | desc ring
+----+----+---+-------------+------+
|
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | | 254 | 255 | used ring
+----+----+---+-------------+------+
|
+

This is the ring layout for TX.
As we need one virtio header for each xmit packet, we have 128 slots available.

++
||
||
+-----+-----+-----+--------------+------+------+------+
| 0 | 1 | ... | 127 || 128 | 129 | ... | 255 | avail ring
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 128 | 129 | ... | 255 || 128 | 129 | ... | 255 | desc ring for virtio_net_hdr
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 0 | 1 | ... | 127 || 0 | 1 | ... | 127 | desc ring for tx dat
+-----+-----+-----+--------------+------+------+------+
||
||
++


Performance boost could be observed only if the virtio backend isn't the bottleneck or in VM2VM case.
There are also several vhost optimization patches to be submitted later.


Huawei Xie (8):
virtio: add virtio_rxtx.h header file
virtio: add software rx ring, fake_buf into virtqueue
virtio: rx/tx ring layout optimization
virtio: fill RX avail ring with blank mbufs
virtio: virtio vec rx
virtio: simple tx routine
virtio: pick simple rx/tx func
doc: update release notes 2.2 about virtio performance optimization

doc/guides/rel_notes/release_2_2.rst | 3 +
drivers/net/virtio/Makefile | 2 +-
drivers/net/virtio/virtio_ethdev.c | 12 +-
drivers/net/virtio/virtio_ethdev.h | 5 +
drivers/net/virtio/virtio_rxtx.c | 56 ++++-
drivers/net/virtio/virtio_rxtx.h | 39 +++
drivers/net/virtio/virtio_rxtx_simple.c | 414 ++++++++++++++++++++++++++++++++
drivers/net/virtio/virtqueue.h | 5 +
8 files changed, 532 insertions(+), 4 deletions(-) create mode 100644 drivers/net/virtio/virtio_rxtx.h create mode 100644 drivers/net/virtio/virtio_rxtx_simple.c

--
1.8.1.4
Xie, Huawei
2015-12-17 05:22:54 UTC
Permalink
Post by Xu, Qian Q
1. Use simplified vhost-sample, only doing the dequeuer and free, so virtio only tx, then test the virtio tx performance improvement. Then in the VM, using one virtio to do the txonly, and let the virtio tx working. Also modified the txonly file to remove the memory copy part, then check the virtio TX rate. The performance of optimized virtio-pmd will have ~2x performance than the non-optimized virtio-pmd.
2. Similarly as item1, but use the default txonly file, so with memory copy, then the performance of optimized virtio-pmd will have ~37% performance improvement than the non-optimized virtio-pmd.
3. In the OVS test scenario, one physical NIC + one virtio in the VM, then let the virtio do the loopback(having rx and tx), running testpmd in the VM, then the performance will have 60% performance improvement than the non-optimized virtio-pmd.
Thomas:
You asked about the performance data earlier.
Another thing: how about adding a simple vhost performance example,
like the vring bench used to test virtio performance, so that
each time we have some performance related patches, we could use this
benchmark to report the performance difference?
Post by Xu, Qian Q
Thanks
Qian
-----Original Message-----
Sent: Thursday, October 29, 2015 10:53 PM
Subject: [dpdk-dev] [PATCH v6 0/8] virtio ring layout optimization and simple rx/tx processing
- Update release notes
- Fix the error in virtio tx ring layout ascii chart in the cover-letter
- Call __rte_pktmbuf_prefree_seg to check refcnt when free mbufs
- Fix the error in virtio tx ring layout ascii chart in the commit message
- Move virtio_xmit_cleanup ahead to free descriptors earlier
- Test merge-able feature when select simple rx/tx functions
- Remove unnecessary NULL test for rte_free
- Remove unnecessary assign of local var after free
- Remove return at the end of void function
- Remove always_inline attribute for virtio_xmit_cleanup
- Reword some commit messages
- Add TODO in the commit message of simple tx patch
- Remove the configure macro
- Enable simple R/TX processing when user specifies simple txq flags
- Reword some comments and commit messages
In a DPDK based switching environment, vhost mostly runs on a dedicated core while virtio processing in guest VMs runs on other cores.
Take RX for example: with the generic implementation, for each guest buffer,
a) the virtio driver allocates a descriptor from the free descriptor list
b) modifies the entry of the avail ring to point to the allocated descriptor
c) after the packet is received, frees the descriptor
When vhost fetches the avail ring, it needs to fetch the modified cache line from the virtio core, which is a heavy cost on current CPUs.
The idea of this optimization is:
allocate a fixed descriptor for each entry of the avail ring, so the avail ring will always stay the same during the run.
This removes the L1 cache transfer from the virtio core to the vhost core for the avail ring.
(Note we couldn't avoid the cache transfer for the descriptors.)
Besides, descriptor allocation and free operations are eliminated.
This also makes vector processing possible to further accelerate the processing.
This is the layout for the avail ring(take 256 ring entries for example), with each entry pointing to the descriptor with the same index.
avail
idx
+
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | ... | 254 | 255 | avail ring
+-+--+-+--+-+-+---------+---+--+---+
| | | | | |
| | | | | |
v v v | v v
+-+--+-+--+-+-+---------+---+--+---+
| 0 | 1 | 2 | ... | 254 | 255 | desc ring
+----+----+---+-------------+------+
|
|
+----+----+---+-------------+------+
| 0 | 1 | 2 | | 254 | 255 | used ring
+----+----+---+-------------+------+
|
+
This is the ring layout for TX.
As we need one virtio header for each xmit packet, we have 128 slots available.
++
||
||
+-----+-----+-----+--------------+------+------+------+
| 0 | 1 | ... | 127 || 128 | 129 | ... | 255 | avail ring
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 128 | 129 | ... | 255 || 128 | 129 | ... | 255 | desc ring for virtio_net_hdr
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| | | || | | |
v v v || v v v
+--+--+--+--+-----+---+------+---+--+---+------+--+---+
| 0 | 1 | ... | 127 || 0 | 1 | ... | 127 | desc ring for tx dat
+-----+-----+-----+--------------+------+------+------+
||
||
++
Performance boost could be observed only if the virtio backend isn't the bottleneck or in VM2VM case.
There are also several vhost optimization patches to be submitted later.
virtio: add virtio_rxtx.h header file
virtio: add software rx ring, fake_buf into virtqueue
virtio: rx/tx ring layout optimization
virtio: fill RX avail ring with blank mbufs
virtio: virtio vec rx
virtio: simple tx routine
virtio: pick simple rx/tx func
doc: update release notes 2.2 about virtio performance optimization
doc/guides/rel_notes/release_2_2.rst | 3 +
drivers/net/virtio/Makefile | 2 +-
drivers/net/virtio/virtio_ethdev.c | 12 +-
drivers/net/virtio/virtio_ethdev.h | 5 +
drivers/net/virtio/virtio_rxtx.c | 56 ++++-
drivers/net/virtio/virtio_rxtx.h | 39 +++
drivers/net/virtio/virtio_rxtx_simple.c | 414 ++++++++++++++++++++++++++++++++
drivers/net/virtio/virtqueue.h | 5 +
8 files changed, 532 insertions(+), 4 deletions(-) create mode 100644 drivers/net/virtio/virtio_rxtx.h create mode 100644 drivers/net/virtio/virtio_rxtx_simple.c
--
1.8.1.4
Thomas Monjalon
2015-12-17 09:08:59 UTC
Permalink
Post by Xie, Huawei
You ever asked about the performance data.
Another thing is how about adding a simple vhost performance example,
like the vring bench which is used to test virtio performance, so that
each time we have some performance related patches, we could use this
benchmark to report the performance difference?
The examples are part of the doc and should be simple enough to be used
in a tutorial.
The tool to test the DPDK drivers is testpmd. There are also the simpler
unit tests in app/test. If something more complex is needed (with qemu
options automated), maybe dts is a better choice.
